In [1]:
import pandas as pd
import numpy as np

---
### About this chapter: 
The purpose of this notebook is to implement user-user similarity using the Pearson and Jaccard measures.

---
### Step 1 - Import data:

In [2]:
user_ratings_df = pd.read_csv('data/user_ratings.csv')
user_ratings_df.head(5)

Unnamed: 0,id,user_id,movie_id,rating,rating_timestamp,type
0,517025,40679,3890160,10.0,2017-09-12 22:20:49-04,explicit
1,517026,40679,4034228,8.0,2017-02-17 01:00:48-05,explicit
2,517027,40679,4540710,8.0,2017-03-29 09:37:45-04,explicit
3,517028,40679,4550098,8.0,2017-02-17 02:50:43-05,explicit
4,517029,40679,4633694,7.0,2019-02-25 14:01:52-05,explicit


---
### Define Pearson Similarity function:
See pages 162 and 174 of the text for a different implementation.
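
For reference, the quantity that the function in the next cell computes can be written as follows. The notation is introduced here and does not come from the text: $I_{ab}$ is the set of movies rated by both users, $r_{a,i}$ is user $a$'s rating of movie $i$, and $\bar{r}_a$ is user $a$'s mean rating, which the code takes over all of that user's ratings rather than only the common ones.

$$
\text{sim}(a, b) = \frac{\sum_{i \in I_{ab}} (r_{a,i} - \bar{r}_a)(r_{b,i} - \bar{r}_b)}{\sqrt{\sum_{i \in I_{ab}} (r_{a,i} - \bar{r}_a)^2}\,\sqrt{\sum_{i \in I_{ab}} (r_{b,i} - \bar{r}_b)^2}}
$$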

In [4]:
# Pearson similarity between two users, based on their mean-centered ratings:
def pearson_sim(user_ratings, user_a, user_b):
    # center each rating on its user's mean rating (taken over all of that user's ratings);
    # assign() returns a new dataframe, so the caller's dataframe is left unmodified:
    mean_rating = user_ratings.groupby('user_id')['rating'].transform('mean')
    centered_df = user_ratings.assign(standardized_rating=user_ratings['rating'] - mean_rating)
    
    # get each individual user's ratings:
    user_a_ratings_df = centered_df[centered_df['user_id'] == user_a]
    user_b_ratings_df = centered_df[centered_df['user_id'] == user_b]
    
    # inner join on movie_id to keep only the movies both users have rated:
    joined_df = pd.merge(left=user_a_ratings_df,
                        right=user_b_ratings_df,
                        how='inner',
                        on='movie_id',
                        suffixes=('_a', '_b'))
    
    # calculate the numerator and denominator according to the formula:
    centered_a = joined_df['standardized_rating_a']
    centered_b = joined_df['standardized_rating_b']
    numerator = (centered_a * centered_b).sum()
    denominator = np.sqrt((centered_a**2).sum()) * np.sqrt((centered_b**2).sum())
    
    # no common movies, or no deviation from the mean on them: similarity is undefined
    if denominator == 0:
        return np.nan
    
    return np.round(numerator / denominator, 2)

In [6]:
pearson_sim(user_ratings_df, 400002, 63767)

0.88

Note: the score above matches the output of the moviegeeks website.
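
The calculation can also be checked by hand on a small example. The sketch below uses made-up ratings and a hypothetical helper, `toy_pearson`, that follows the same steps as `pearson_sim` without rounding. It also computes the correlation over the common movies only with `np.corrcoef`, to show that centering on each user's overall mean can give a different number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; the user and movie ids are made up for illustration.
toy_df = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 2],
    'movie_id': [10, 20, 30, 10, 20, 30, 40],
    'rating':   [8.0, 6.0, 10.0, 6.0, 4.0, 8.0, 10.0],
})

def toy_pearson(ratings, user_a, user_b):
    # center every rating on its user's mean over ALL of that user's ratings
    centered = ratings['rating'] - ratings.groupby('user_id')['rating'].transform('mean')
    ratings = ratings.assign(centered=centered)
    a = ratings[ratings['user_id'] == user_a]
    b = ratings[ratings['user_id'] == user_b]
    # keep only the movies both users have rated
    common = a.merge(b, on='movie_id', suffixes=('_a', '_b'))
    num = (common['centered_a'] * common['centered_b']).sum()
    den = np.sqrt((common['centered_a'] ** 2).sum()) * np.sqrt((common['centered_b'] ** 2).sum())
    return num / den

# user 1 mean = 8 -> centered (0, -2, 2); user 2 mean = 7 -> centered (-1, -3, 1) on the common movies
sim_all_means = toy_pearson(toy_df, 1, 2)

# Pearson correlation over the common movies only (means taken over those movies)
wide = toy_df.pivot(index='movie_id', columns='user_id', values='rating').dropna()
sim_common_means = np.corrcoef(wide[1], wide[2])[0, 1]

print(round(sim_all_means, 2), round(sim_common_means, 2))
```

Here the first value is 8 / sqrt(88), about 0.85, while the correlation over the common movies alone is 1.0, because user 2's rating of movie 40 shifts that user's overall mean.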

---
### Define Jaccard Similarity function:

In [23]:
def jaccard_sim(user_ratings, user_a, user_b):
    # get sets of movies for each user:
    user_a_rated_items = set(user_ratings[user_ratings['user_id']==user_a]['movie_id'])
    user_b_rated_items = set(user_ratings[user_ratings['user_id']==user_b]['movie_id'])
    
    # get intersection of both sets (items both users have rated):
    items_intersection = user_a_rated_items & user_b_rated_items
    
    # get union of both sets (items either or both users have rated):
    items_union = user_a_rated_items | user_b_rated_items
    
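    # calculate similarity (size of the intersection over size of the union);
    # the result name differs from the function name to avoid shadowing it:
    similarity = np.round(len(items_intersection) / len(items_union), 2)
    
    return similarity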

In [26]:
jaccard_sim(user_ratings_df, 400002, 1157)

0.28

Note: the score above matches the output of the moviegeeks website.
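
As a quick sanity check of the set arithmetic, here is a minimal sketch on plain Python sets. The helper name `jaccard_toy` and the movie ids are made up, and returning 0.0 when neither user has rated anything is an assumption of this sketch, not something the function above does.

```python
def jaccard_toy(items_a, items_b):
    # union: items either user has rated
    union = items_a | items_b
    # assumption: two users with no rated items are treated as having zero similarity
    if not union:
        return 0.0
    # intersection size over union size
    return len(items_a & items_b) / len(union)

user_a_items = {10, 20, 30}
user_b_items = {20, 30, 40, 50}

# 2 shared movies out of 5 distinct movies overall
toy_score = jaccard_toy(user_a_items, user_b_items)
print(toy_score)
```

With two shared movies out of five distinct ones, the score is 2 / 5 = 0.4.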

---
**Note:** because Pearson similarity takes the rating values into account, the score it returns is in effect a measure of "taste" or "rating behavior" similarity. The advantage is that, once we have found a similar user, we can take the items that user has rated/watched and the other hasn't (a set difference) and recommend them. Moreover, we can use the similar user's ratings to estimate how the other user will rate the recommended movies.
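
The set-difference idea can be sketched as follows. The data is made up, and `recommend_from_neighbor` is a hypothetical helper. The estimate it uses (the target user's mean plus the neighbor's deviation from their own mean) is just one simple choice assumed for illustration, not the method used by the text or the moviegeeks site.

```python
import pandas as pd

# Hypothetical toy ratings; the user and movie ids are made up for illustration.
toy_df = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 2],
    'movie_id': [10, 20, 30, 10, 20, 30, 40],
    'rating':   [7.0, 5.0, 9.0, 6.0, 4.0, 8.0, 8.0],
})

def recommend_from_neighbor(ratings, target_user, neighbor):
    # ratings of each user, indexed by movie_id
    target = ratings[ratings['user_id'] == target_user].set_index('movie_id')['rating']
    other = ratings[ratings['user_id'] == neighbor].set_index('movie_id')['rating']
    # set difference: movies the neighbor has rated that the target user has not
    candidates = sorted(set(other.index) - set(target.index))
    # naive estimate (an assumption of this sketch): the target's mean rating
    # plus the neighbor's deviation from their own mean on each candidate movie
    estimates = target.mean() + (other.loc[candidates] - other.mean())
    return estimates.sort_values(ascending=False)

recommendations = recommend_from_neighbor(toy_df, target_user=1, neighbor=2)
print(recommendations)
```

User 2 has rated movie 40 and user 1 has not, so it is the only candidate. User 2 rated it 1.5 above their own mean of 6.5, which gives an estimate of 7 + 1.5 = 8.5 for user 1.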

The Jaccard similarity, on the other hand, is a unary measure: it only takes into account which movies the users being compared have watched/bought, regardless of how they rated them. Because of this, the Jaccard measure is in effect a measure of "consumption" similarity between users.