In [1]:
from sklearn.metrics import jaccard_similarity_score
import numpy as np

In [2]:
a = np.array([0,1,1,1,0,0,1,1,0,0])
b = np.array([1,1,0,1,0,0,1,0,0,1])

print jaccard_similarity_score(a,b)

0.6
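
### A note on the 0.6 above: for 1-D binary inputs like these, `jaccard_similarity_score` reports the fraction of positions where the two vectors agree (6 of 10 here), and the function was deprecated and later removed from newer scikit-learn releases in favour of `jaccard_score`. As a minimal sketch, the set-based Jaccard index (venues visited by both users over venues visited by either) can be computed with plain NumPy. The names `both`, `either`, `matching` and `jaccard` are illustrative only.

```python
import numpy as np

a = np.array([0,1,1,1,0,0,1,1,0,0])
b = np.array([1,1,0,1,0,0,1,0,0,1])

# Venues visited by both users, and venues visited by at least one of them
both = np.logical_and(a, b).sum()
either = np.logical_or(a, b).sum()

# Fraction of positions where the two vectors agree (the 0.6 printed above)
matching = np.mean(a == b)

# Set-based Jaccard index: |intersection| / |union|
jaccard = float(both) / either

print(matching)
print(jaccard)
```

### Here the two users share 3 venues out of 7 visited by at least one of them, so the set-based index is 3/7, which is lower than the matching fraction because venues that neither user visited no longer count as agreement.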


### The Jaccard similarity score is a relatively simple method: it only looks at whether a user has checked in at a venue, not how often.
### Now suppose the number of check-ins reflects how much a user prefers each venue. Then we need to take check-in frequencies into account, so the score of each venue in a user's vector becomes:
### #(check-ins at that venue) / #(total check-ins by that user)
### For example, user a checked in 8 times at 5 different venues and user b checked in 10 times at 8 different venues. Suppose there are 10 venues in total. The users' check-in vectors will then look something like the following:

In [3]:
a = np.array([0, 
              1./8, 
              1./8, 
              3./8, 
              0, 
              0, 
              1./8, 
              2./8, 
              0, 
              0])
b = np.array([1./10, 
              2./10, 
              1./10, 
              1./10, 
              0, 
              1./10, 
              1./10,  
              2./10, 
              0,
              1./10])
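
### As a minimal sketch of the normalisation described above, the same vectors can be derived from raw check-in counts. The counts below are hypothetical values chosen to match the example (8 check-ins over 5 venues for a, 10 check-ins over 8 venues for b), and `counts_a`, `counts_b`, `freq_a` and `freq_b` are illustrative names.

```python
import numpy as np

# Hypothetical raw check-in counts per venue (venues 0~9)
counts_a = np.array([0, 1, 1, 3, 0, 0, 1, 2, 0, 0])
counts_b = np.array([1, 2, 1, 1, 0, 1, 1, 2, 0, 1])

# Score per venue = check-ins at that venue / total check-ins by that user
freq_a = counts_a / float(counts_a.sum())
freq_b = counts_b / float(counts_b.sum())

print(freq_a)
print(freq_b)
```

### Each resulting vector sums to 1, so the scores of different users are comparable regardless of how many times each user checked in overall.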

### Following the "Similarity of asymmetric binary attributes" section of https://en.wikipedia.org/wiki/Jaccard_index, we'll disregard the venues where neither a nor b has checked in. Then we'll add up the scores of both users at all the venues where both a and b have checked in, and divide the total by 2. Each user's scores sum to 1, so dividing by 2 keeps the result between 0 and 1.

### For example, say the 10 venues are numbered 0~9. Since a and b both checked in at venues 1, 2, 3, 6, and 7, the similarity score between a and b is the sum of both users' scores at those venues, divided by 2:
### (1/8+2/10 + 1/8+1/10 + 3/8+1/10 + 1/8+1/10 + 2/8+2/10)/2
### Let's see which numpy functions can help us with that.

In [4]:
# Indices of the venues where both users have checked in (computed once, reused below)
common = np.intersect1d(np.nonzero(a), np.nonzero(b))

print 'Non-zero index of a',np.nonzero(a)
print 'Non-zero index of b',np.nonzero(b)
print 'Non-zero index of both',common
print 'Sum for a at non-zero index of both',np.sum(a[common])
print 'Sum for b at non-zero index of both',np.sum(b[common])
print 'Similarity score between a and b',(np.sum(a[common])+np.sum(b[common]))/2

Non-zero index of a (array([1, 2, 3, 6, 7]),)
Non-zero index of b (array([0, 1, 2, 3, 5, 6, 7, 9]),)
Non-zero index of both [1 2 3 6 7]
Sum for a at non-zero index of both 1.0
Sum for b at non-zero index of both 0.7
Similarity score between a and b 0.85
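
### The same score can also be sketched with a boolean mask instead of `np.intersect1d`. This is an alternative under the same definitions, not a change to the method, and `mask` and `score` are illustrative names.

```python
import numpy as np

a = np.array([0, 1./8, 1./8, 3./8, 0, 0, 1./8, 2./8, 0, 0])
b = np.array([1./10, 2./10, 1./10, 1./10, 0, 1./10, 1./10, 2./10, 0, 1./10])

# True at the venues where both users have a non-zero score
mask = (a != 0) & (b != 0)

# Sum both users' scores at the shared venues and halve the total
score = (a[mask].sum() + b[mask].sum()) / 2

print(np.flatnonzero(mask))
print(score)
```

### The mask avoids building two index arrays and intersecting them, which may read more directly when the vectors are already aligned by venue.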


### Now let's put things in a function

In [5]:
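def get_sim_index_based_on_freq(user1, user2):
    """Return the frequency-based similarity of two check-in vectors.

    Sums both users' scores over the venues they have in common
    and divides the total by 2.
    """
    if user1.ndim != 1 or user2.ndim != 1:
        raise ValueError('Input arrays must be 1-dimensional')
    if user1.shape != user2.shape:
        raise ValueError('Input arrays must have the same length')
    # Indices of the venues where both users have a non-zero score
    index_to_sum = np.intersect1d(np.nonzero(user1), np.nonzero(user2))
    return float(np.sum(user1[index_to_sum]) + np.sum(user2[index_to_sum])) / 2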

### Check to see if it works

In [6]:
print 'The similarity index based on check-in frequency between a and b is',get_sim_index_based_on_freq(a, b)

The similarity index based on check-in frequency between a and b is 0.85
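
### As a quick sanity check on the scoring rule, the sketch below restates it in a compact form and tries two edge cases: a user compared with itself, and two users with no venue in common. `sim` is a hypothetical stand-in for the function above, and user `c` is made up for illustration.

```python
import numpy as np

def sim(u, v):
    # Compact restatement of the scoring rule: halve the summed scores
    # of both users over the venues they share
    idx = np.intersect1d(np.flatnonzero(u), np.flatnonzero(v))
    return float(np.sum(u[idx]) + np.sum(v[idx])) / 2

a = np.array([0, 1./8, 1./8, 3./8, 0, 0, 1./8, 2./8, 0, 0])
# Hypothetical user who only checked in at venues 0 and 4 (none shared with a)
c = np.array([1./2, 0, 0, 0, 1./2, 0, 0, 0, 0, 0])

print(sim(a, a))               # a user compared with itself
print(sim(a, c))               # no shared venues
print(sim(a, c) == sim(c, a))  # argument order does not matter
```

### Under this rule the score reaches its maximum when the two users visit exactly the same venues and drops to zero when they share none.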
