## User-Based Collaborative Recommender 

In [1]:
import pandas as pd
import numpy as np
from scipy import sparse
import pickle
from sklearn.metrics.pairwise import cosine_similarity

### Data

In [2]:
df_ratings = pd.read_csv('../data/ratings_title.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating,title,genres,year
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
2,7,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
3,15,1,2.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
4,17,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995


In [3]:
df_movies = pd.read_csv('../data/clean_content.csv')
df_movies.head()

Unnamed: 0,movie_id,title,genres,year,tmdb_id,imdb_id,tmdb_rating,tmdb_votes,imdb_rating,imdb_votes,body,sentiment_score,weighted_rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,862,tt0114709,7.7,5415,8.3,956821,led woody andys toy live happily room andys bi...,0.8625,1.60957
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,8844,tt0113497,6.9,2413,7.0,334566,sibling judy peter discover enchanted board ga...,0.3612,0.616703
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,15602,tt0113228,6.5,92,6.6,26930,family wedding reignites ancient feud nextdoor...,0.9081,0.111535
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,31357,tt0114885,6.1,34,5.9,10784,cheated mistreated stepped woman holding breat...,0.9725,-0.327571
4,5,Father of the Bride Part II (1995),Comedy,1995,11862,tt0113041,5.7,173,6.0,37433,george bank recovered daughter wedding receive...,0.6486,-0.844647


In [4]:
df_ratings.isnull().sum()

userId     0
movieId    0
rating     0
title      0
genres     0
year       0
dtype: int64

In [5]:
df_ratings.shape

(100836, 6)

In [6]:
df_ratings.dtypes

userId       int64
movieId      int64
rating     float64
title       object
genres      object
year         int64
dtype: object

In [7]:
df_ratings.nunique()

userId      610
movieId    9724
rating       10
title      9719
genres      951
year        107
dtype: int64

### User-Item Matrix

Create a user-item interaction matrix.

In [8]:
user_item = df_ratings.pivot_table(values = 'rating', index = 'userId', columns= 'title')  
user_item

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,,,,,,,,,,,...,,,,,,,,,,
607,,,,,,,,,,,...,,,,,,,,,,
608,,,,,,,,,,,...,,,,,,4.5,3.5,,,
609,,,,,,,,,,,...,,,,,,,,,,


The matrix above contains many NaNs because no user has rated every movie. A NaN means there is no recorded interaction between that user and that movie.

The following code normalizes the user-item matrix. Not all users rate the same way: some rate movies more harshly and others more leniently. By subtracting each user's average rating from their actual ratings, we can compare movie ratings on the same scale across all users.

Normalize user-item matrix

In [9]:
norm_user_item = user_item.subtract(user_item.mean(axis=1), axis = 'rows')
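
To make the effect of mean-centering concrete, here is a minimal sketch on a toy matrix. The data and the names `toy` and `toy_centered` are hypothetical and not part of the notebook; the call mirrors the `subtract` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two users, three movies, NaN = not rated.
toy = pd.DataFrame(
    {"Movie A": [5.0, 2.0], "Movie B": [3.0, np.nan], "Movie C": [4.0, 1.0]},
    index=pd.Index(["lenient", "harsh"], name="userId"),
)

# Row means skip NaNs by default, so each user is centered on the
# average of the movies they actually rated.
toy_centered = toy.subtract(toy.mean(axis=1), axis="index")
print(toy_centered)
```

After centering, a positive value means "above this user's own average" and a negative value means "below it", whatever the user's overall level of generosity. Unrated movies stay NaN.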

## User-User Similarity Matrix

Compute user to user similarity using `cosine_similarity`.

In [10]:
user_similarity = cosine_similarity(sparse.csr_matrix(norm_user_item.fillna(0)))

In [11]:
#Convert the similarity matrix into a dataframe
df_user_similarity = pd.DataFrame(user_similarity, index=user_item.index, columns=user_item.index)
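
As an illustration of what the similarity step computes, the sketch below reproduces cosine similarity with plain NumPy on a small, hypothetical mean-centered matrix (`centered` and `toy_similarity` are made-up names). Note that filling NaNs with 0 after centering treats an unrated movie as if the user had given it their own average rating.

```python
import numpy as np
import pandas as pd

# Hypothetical mean-centered ratings: three users, three movies.
centered = pd.DataFrame(
    [[1.0, -1.0, 0.0],
     [2.0, -2.0, np.nan],
     [-1.0, 1.0, 0.0]],
    index=["u1", "u2", "u3"],
    columns=["A", "B", "C"],
)

# Unrated movies become 0, i.e. "rated at the user's average".
filled = centered.fillna(0).to_numpy()

# Scale each user vector to unit length; guard against all-zero rows.
norms = np.linalg.norm(filled, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = filled / norms

# Cosine similarity is the dot product of unit vectors.
toy_similarity = pd.DataFrame(unit @ unit.T, index=centered.index, columns=centered.index)
print(toy_similarity.round(2))
```

In this toy case `u1` and `u2` agree in direction (similarity close to 1) even though `u2` rates more extremely, while `u1` and `u3` have opposite tastes (similarity close to -1).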

As mentioned before, a user's predicted rating in collaborative filtering is a combination of the ratings given by similar users. A positive similarity threshold is set, and users whose similarity scores fall below this threshold do not contribute to the prediction. A larger dataset with more user-item interaction data could afford a stricter (higher) threshold. Since we are working with a small dataset of 610 users, the threshold is kept low so that enough similar users remain.

In [12]:
#Find the top similar users and their similarity score to a target user
user = 569

user_similarity_threshold = 0.1

#Get similar users and their similarity scores, excluding the target user by label
similar_users = df_user_similarity[user]
similar_users = similar_users[similar_users > user_similarity_threshold].drop(user, errors='ignore').sort_values(ascending=False)
similar_users

userId
81     0.281145
134    0.262071
243    0.256571
37     0.155364
588    0.150000
349    0.148739
237    0.141773
485    0.135926
25     0.135781
486    0.129558
5      0.128364
458    0.126163
94     0.125344
444    0.120537
26     0.113159
41     0.112905
602    0.111719
386    0.110974
130    0.101131
Name: 569, dtype: float64
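
The choice of threshold directly controls how many neighbors are available. The following sketch, using a hypothetical helper `neighbors_above` and a made-up similarity matrix `toy_sim`, shows one way to see how the neighbor count shrinks as the threshold rises.

```python
import pandas as pd

def neighbors_above(sim_df, target, threshold):
    """Users whose similarity to `target` exceeds `threshold`, most similar first."""
    scores = sim_df[target].drop(index=target)  # exclude the target user
    return scores[scores > threshold].sort_values(ascending=False)

# Hypothetical symmetric similarity matrix for four users.
toy_sim = pd.DataFrame(
    [[1.00, 0.30, 0.05, -0.20],
     [0.30, 1.00, 0.10, 0.00],
     [0.05, 0.10, 1.00, 0.40],
     [-0.20, 0.00, 0.40, 1.00]],
    index=[1, 2, 3, 4],
    columns=[1, 2, 3, 4],
)

# Number of neighbors of user 1 at several candidate thresholds.
counts = {t: len(neighbors_above(toy_sim, 1, t)) for t in (0.0, 0.1, 0.3)}
print(counts)
```

Running the same kind of sweep on `df_user_similarity` could help judge whether 0.1 leaves enough neighbors for a given user; that check is a suggestion, not something the notebook performs.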

## Item Recommendation 

Now that we have the users most similar to the target user, we can make movie recommendations for the target user. The following code extracts the target user's row from the user-item interaction matrix and keeps only the columns without NaNs, that is, the movies the target user has rated.

In [13]:
#Movies watched by target user & the user rating
target_user_movies = norm_user_item[norm_user_item.index == user].dropna(axis = 1, how = 'all')
target_user_movies

userId,Ace Ventura: Pet Detective (1994),Aladdin (1992),Batman (1989),Batman Forever (1995),Beauty and the Beast (1991),Clear and Present Danger (1994),Cliffhanger (1993),Dances with Wolves (1990),Die Hard: With a Vengeance (1995),Dumb & Dumber (Dumb and Dumber) (1994),Forrest Gump (1994),GoldenEye (1995),Jurassic Park (1993),"Net, The (1995)",Pulp Fiction (1994),Speed (1994),Star Trek: Generations (1994),Stargate (1994),True Lies (1994),"Usual Suspects, The (1995)"
569,0.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.0,1.0,-1.0,-1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,-1.0,-1.0


The zeros in the matrix above signify that the user has rated those movies with the user's average rating. Movies with negative interaction terms were rated low by the user and movies with positive interaction terms were rated high by the user.

The following code extracts the rows of the similar users from the user-item interaction matrix, and the columns of movies rated by at least one similar user.

In [16]:
#Movies watched by similar users
similar_users_movies = norm_user_item[norm_user_item.index.isin(similar_users.index)].dropna(axis=1, how = 'all')

The following code extracts movies watched by similar users but not by the target user. 

In [17]:
#Drop the movies the target user has already rated (columns not present are ignored)
similar_users_movies = similar_users_movies.drop(columns=target_user_movies.columns, errors='ignore')

### Weighted Average

Now that we have similar users and movies watched by the similar users and not by the target user, we will be able to make recommendations based on similar users and their interactions (ratings). This is achieved by taking a weighted average of ratings given by similar users, weights being their similarity scores with the target user. 
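
Written out, the weighted average described above takes the following form. The notation is an assumption introduced here for illustration: $u$ is the target user, $N_u(m)$ is the set of similar users who rated movie $m$, and $\tilde{r}_{v,m}$ is user $v$'s mean-centered rating of $m$.

```latex
\hat{s}_{u,m} = \frac{\sum_{v \in N_u(m)} \operatorname{sim}(u, v)\,\tilde{r}_{v,m}}
                     {\sum_{v \in N_u(m)} \operatorname{sim}(u, v)},
\qquad \tilde{r}_{v,m} = r_{v,m} - \bar{r}_v
```

Because the ratings are mean-centered, the resulting score is not on the original rating scale: it expresses how far above or below their own averages the similar users rated the movie.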

In [18]:
movie_score = {}
#Loop through the movies seen by similar users
for movie in similar_users_movies.columns:
    
    #Extract the ratings given by each user to the movie
    movie_rating = similar_users_movies[movie]
    
    #Variable to calculate numerator of the weighted average
    numerator = 0
    
    #Variable to calculate the denominator of the weighted average
    denominator = 0
    
    #Loop through the similar users for that movie
    #(a separate loop variable keeps the target `user` from being overwritten)
    for similar_user in similar_users.index:
        
        #Only use similar users who have rated the movie (skip NaNs)
        if pd.notnull(movie_rating[similar_user]):
            
            #Weighted score is the product of user similarity score and movie rating by the similar user
            weighted_score = similar_users[similar_user] * movie_rating[similar_user]
            numerator += weighted_score
            denominator += similar_users[similar_user]
    
    #Weighted average score of a movie
    movie_score[movie] = numerator / denominator

#Save the movie and the similarity score in a dataframe
movie_score = pd.DataFrame(movie_score.items(), columns=['title', 'similarity_score'])
user_rec = pd.merge(df_movies[['title','year']], movie_score[['title', 'similarity_score']], how='inner')
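
The nested loop above can also be expressed with vectorized pandas operations. The sketch below is one possible equivalent, not the notebook's own code; the function name `weighted_scores` and the extra `n_raters` column are hypothetical additions. The rater count is included because a movie rated by a single similar user receives exactly that user's centered rating, so several such movies can end up with identical scores.

```python
import numpy as np
import pandas as pd

def weighted_scores(ratings, weights):
    """Similarity-weighted average of mean-centered ratings per movie.

    ratings: users x movies DataFrame, NaN where a user has not rated.
    weights: Series of similarity scores indexed by user.
    """
    w = weights.reindex(ratings.index)
    rated = ratings.notna()
    # sum() skips NaNs, so only actual ratings contribute to the numerator.
    numerator = ratings.mul(w, axis=0).sum(axis=0)
    # Only the weights of users who rated the movie enter the denominator.
    denominator = rated.astype(float).mul(w, axis=0).sum(axis=0)
    return pd.DataFrame({
        "score": numerator / denominator,
        "n_raters": rated.sum(axis=0),
    })

# Hypothetical toy input: two similar users, two movies.
toy_ratings = pd.DataFrame(
    {"X": [1.0, -1.0], "Y": [np.nan, 2.0]}, index=[10, 20]
)
toy_weights = pd.Series({10: 0.3, 20: 0.1})

toy_scores = weighted_scores(toy_ratings, toy_weights)
print(toy_scores)
```

Here movie `Y` has a single rater, so its score equals that rater's centered rating regardless of the weight. Filtering on a minimum `n_raters` before ranking would be one way to damp such cases, at the cost of fewer candidates on a dataset this small.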

Movies watched by userId `569`.

In [19]:
df_ratings[df_ratings['userId'] == 569].head(20)

Unnamed: 0,userId,movieId,rating,title,genres,year
760,569,50,3.0,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,1995
1480,569,231,3.0,Dumb & Dumber (Dumb and Dumber) (1994),Adventure|Comedy,1994
2102,569,296,5.0,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,1994
2250,569,316,4.0,Stargate (1994),Action|Adventure|Sci-Fi,1994
2415,569,349,4.0,Clear and Present Danger (1994),Action|Crime|Drama|Thriller,1994
2727,569,356,3.0,Forrest Gump (1994),Comedy|Drama|Romance|War,1994
3405,569,480,4.0,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,1993
4105,569,590,4.0,Dances with Wolves (1990),Adventure|Drama|Western,1990
4287,569,592,3.0,Batman (1989),Action|Crime|Thriller,1989
20105,569,588,4.0,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,1992


Top 10 movies recommended for userId `569`.

In [20]:
user_rec.sort_values(by=['similarity_score', 'year'], ascending=False).reset_index(drop=True).drop(columns=['year']).head(10)

Unnamed: 0,title,similarity_score
0,Wild Tales (2014),1.746544
1,Prisoners (2013),1.746544
2,Horrible Bosses (2011),1.746544
3,No Country for Old Men (2007),1.746544
4,Along Came Polly (2004),1.746544
5,50 First Dates (2004),1.746544
6,Kill Bill: Vol. 2 (2004),1.746544
7,Anger Management (2003),1.746544
8,Duplex (2003),1.746544
9,Kill Bill: Vol. 1 (2003),1.746544


The recommendations from the collaborative filtering algorithm are a mixture of comedy, thriller and crime movies. It is also worth mentioning that the movies recommended by the content-based filtering algorithm for the same user were different. The first three content-based recommendations were Batman movies, but here the first three recommendations are comedy, thriller and crime. This is the main difference between content-based and collaborative filtering: here, the relevance of a movie is determined by how users interacted with it rather than by its attributes. User 569 rated one of the Batman movies 5 and the other 3. Since we normalize user ratings to make an apples-to-apples comparison and the user's mean rating is 4, the two Batman movies land on opposite sides of the like-dislike spectrum. Movies in the comedy, thriller and crime genres are rated average or above average by the user, which explains the first three recommendations. A quick look at the user's viewing history, however, shows more action movies than thrillers or comedies, so action movies should arguably have been the top recommendations. This suggests that the small amount of user-item interaction data hurts the filtering algorithm.

## Inferences

This recommendation engine is built purely on user ratings and does not get into the details of why the items are related. The model works on the assumption that if people (users) share interests (similar ratings) in certain things (movies), they are likely to share interests in other things as well. Instead of calculating the distance between items, as the content-based recommender does, this model calculates the distance between users and uses it to come up with recommendations for a user. This recommender system benefits from large datasets, and its performance keeps improving as users are added. Another advantage of collaborative filtering is that its recommendations are diverse and are not restricted to the user's own history. Additionally, the model's recommendations are not bound by item attributes or subject-matter expertise.

One limitation of this model is scalability. The model benefits from more user data, but it also becomes more expensive to compute. The model also suffers from data sparsity, as it depends solely on interaction data. Additionally, it suffers from the 'cold start' problem, which occurs when a new item is introduced and there is not yet enough interaction data connecting users to it. Content-based filtering uses historical data and user profile data to solve this issue, but collaborative filtering is ineffective in such cases until enough data has been collected.
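
The sparsity mentioned above can be quantified as the fraction of user-item cells that actually hold a rating. The sketch below uses a hypothetical helper `density` on made-up data; applied to `user_item`, it would give the density of the matrix built earlier. From the counts reported above (100,836 ratings, 610 users, 9,724 movies), that figure should be roughly 1.7%.

```python
import numpy as np
import pandas as pd

def density(user_item_matrix):
    """Fraction of user-item cells that contain a rating (non-NaN)."""
    return float(user_item_matrix.notna().to_numpy().mean())

# Hypothetical toy matrix: 2 users x 3 movies with 2 ratings in total.
toy_matrix = pd.DataFrame(
    [[4.0, np.nan, np.nan],
     [np.nan, np.nan, 3.5]],
    index=[1, 2],
    columns=["A", "B", "C"],
)

toy_density = density(toy_matrix)
print(f"density: {toy_density:.3f}, sparsity: {1 - toy_density:.3f}")
```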

A third way of filtering information combines these two approaches and overcomes their individual weaknesses.