# Movie Recommendation System with Collaborative Filtering


![netflix-background-9](https://user-images.githubusercontent.com/33485020/108069438-5ee79d80-7089-11eb-8264-08fdda7e0d11.jpg)


A `movie recommendation system` is a system whose objective is to predict and compile a list of movies that a user is likely to watch. Recommendation systems have gained much popularity in recent years and have been developed and implemented for various commercial use cases.

For example,
* Netflix uses recommendation systems to recommend movies or television programs for individual users
* Amazon uses recommendation systems to predict and display a list of products that the customer is likely to buy
* Spotify uses music recommendation systems to provide new songs for its listeners  

Recommendation systems have strong potential in a variety of other areas as well, but they play a particularly important role in e-commerce and media businesses, where they can directly affect revenue and user engagement.

There are broadly 3 types of recommendation systems:
1. `Popularity Based:` This is a basic system in which highly rated movies/shows are recommended to all users in a given demographic region. E.g., Netflix Top Trending shows every user the top 10 movies trending in that user's country.

2. `Content Based:` The general idea is that if a user liked an item with certain properties, then they are more likely to like similar items. E.g., movies are recommended based on their cast, story, genre, plot, director, and many other fields.

3. `Collaborative Filtering:` This is a more advanced system in which the algorithm tries to find similar users or items and then recommends items based on this similarity. E.g., if one person likes movies A, B, and C and another person likes movies A, B, and D, it is likely that the first person will like movie D and the second person will like movie C, since their tastes overlap.
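
The collaborative filtering idea in item 3 can be sketched with plain Python sets. This is a toy illustration only: the `likes` dictionary and the `recommend_from_overlap` helper are hypothetical names invented here, not part of the notebook.

```python
# Toy data: which movies each person likes (hypothetical example).
likes = {
    "person_1": {"A", "B", "C"},
    "person_2": {"A", "B", "D"},
}


def recommend_from_overlap(target, likes):
    """Suggest movies liked by overlapping users that `target` has not liked yet."""
    suggestions = {}
    for other, movies in likes.items():
        if other == target:
            continue
        overlap = len(likes[target] & movies)  # number of shared likes
        if overlap == 0:
            continue  # no shared taste, so ignore this user
        for movie in movies - likes[target]:
            # Weight each candidate by how much the two users overlap.
            suggestions[movie] = suggestions.get(movie, 0) + overlap
    # Highest-weighted candidates first; ties broken alphabetically.
    return sorted(suggestions, key=lambda m: (-suggestions[m], m))


print(recommend_from_overlap("person_1", likes))
print(recommend_from_overlap("person_2", likes))
```

With this toy data, the first person is offered movie D and the second person movie C, matching the example in the list.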


In this notebook, we'll focus on the more advanced collaborative filtering recommender. Let's get started!

### Data Preparation 

In [1]:
#Importing the required packages

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
#Reading the data sets

df1 = pd.read_csv(r'https://raw.githubusercontent.com/Uttkarsh14/Movie-Recommendation-Engine/main/movies.csv')
df2 = pd.read_csv(r'https://raw.githubusercontent.com/Uttkarsh14/Movie-Recommendation-Engine/main/ratings.csv')

In [3]:
#Merging the ratings with the movie details on movieId

df = df2.merge(df1, on='movieId', how='left')
df

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,Logan (2017),Action|Sci-Fi


In [4]:
#Removing columns which will not be used 

df = df.drop(columns=['timestamp', 'genres'])

In [5]:
df.head()

,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,1,3,4.0,Grumpier Old Men (1995)
2,1,6,4.0,Heat (1995)
3,1,47,5.0,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,"Usual Suspects, The (1995)"




We are now going to transform the DataFrame into a movie-to-user matrix, where each row represents one movie and each column corresponds to one user. A cell holds the user's rating for that movie, or is empty (NaN) if the user has not rated it.

In [6]:
user_movie_matrix = pd.pivot_table(df, values = 'rating', index='movieId', columns = 'userId')
user_movie_matrix

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,,,,4.0,,4.5,,,,...,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,,,,,,4.0,,4.0,,,...,,4.0,,5.0,3.5,,,2.0,,
3,4.0,,,,,5.0,,,,,...,,,,,,,,2.0,,
4,,,,,,3.0,,,,,...,,,,,,,,,,
5,,,,,,5.0,,,,,...,,,,3.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,,,,,,,,,,,...,,,,,,,,,,
193583,,,,,,,,,,,...,,,,,,,,,,
193585,,,,,,,,,,,...,,,,,,,,,,
193587,,,,,,,,,,,...,,,,,,,,,,


In [7]:
#Replacing missing ratings with 0

user_movie_matrix = user_movie_matrix.fillna(0)
user_movie_matrix.head()

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


### Calculating similarities

Similarity-based methods score how alike two objects are; the pairs with the highest values are the closest neighbors.
There are several similarity metrics:
* Pearson’s correlation
* Spearman’s correlation
* Kendall’s Tau
* Cosine similarity
* Jaccard similarity

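For our case, we'll use `Pearson’s correlation`. Our matrix is sparse, which means that most movies are not rated by a given user (they have a rating of 0 after the fill above). Pearson’s correlation centers each user's ratings around 0 by subtracting that user's mean rating, which makes it equivalent to cosine similarity computed on mean-centered ratings.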

![1_4JSKpD-YjekoSMxHdTdCOg](https://user-images.githubusercontent.com/33485020/108067673-25ae2e00-7087-11eb-9c79-57972f8a424b.png)
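
To make the link between Pearson's correlation and cosine similarity concrete, the sketch below computes both on two made-up rating vectors using only the standard library. The vectors and function names are illustrative assumptions, not notebook code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(u):
    """Subtract the vector's mean from every entry."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]


def pearson(u, v):
    """Pearson's correlation: cosine similarity of the mean-centered vectors."""
    return cosine(center(u), center(v))


# Hypothetical ratings of the same five movies by two users.
# The second user rates every movie exactly one point lower.
user_a = [5.0, 4.0, 3.0, 4.0, 5.0]
user_b = [4.0, 3.0, 2.0, 3.0, 4.0]

print(round(cosine(user_a, user_b), 4))
print(round(pearson(user_a, user_b), 4))
```

Because the second user simply rates one point lower across the board, centering removes the offset and Pearson's correlation comes out as 1 (up to floating-point rounding), while plain cosine similarity stays slightly below 1.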


In [8]:
#user-based collaborative filtering

user_user_matrix = user_movie_matrix.corr(method='pearson')
user_user_matrix

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.000000,0.019400,0.053056,0.176920,0.120866,0.104418,0.143793,0.128547,0.055268,-0.000298,...,0.066256,0.149942,0.186978,0.056530,0.134412,0.121981,0.254200,0.262241,0.085434,0.098719
2,0.019400,1.000000,-0.002594,-0.003804,0.013183,0.016257,0.021567,0.023750,-0.003448,0.061880,...,0.198549,0.010888,-0.004030,-0.005345,-0.007919,0.011299,0.005813,0.032730,0.024373,0.089329
3,0.053056,-0.002594,1.000000,-0.004556,0.001887,-0.004577,-0.005634,0.001703,-0.003111,-0.005501,...,0.000150,-0.000585,0.011211,-0.004822,0.003678,-0.003246,0.012885,0.008096,-0.002963,0.015962
4,0.176920,-0.003804,-0.004556,1.000000,0.121018,0.065719,0.100602,0.054235,0.002417,0.015615,...,0.072848,0.114287,0.281866,0.039699,0.065493,0.164831,0.115118,0.116861,0.023930,0.062523
5,0.120866,0.013183,0.001887,0.121018,1.000000,0.294138,0.101725,0.426576,-0.004185,0.023471,...,0.061912,0.414931,0.095394,0.254117,0.141077,0.090158,0.145764,0.122607,0.258289,0.040372
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.121981,0.011299,-0.003246,0.164831,0.090158,0.047506,0.172499,0.081913,0.057989,0.054877,...,0.153892,0.084208,0.224637,0.035251,0.106752,1.000000,0.115999,0.188354,0.052385,0.093851
607,0.254200,0.005813,0.012885,0.115118,0.145764,0.142169,0.173293,0.178133,0.003257,-0.004809,...,0.080034,0.187588,0.173025,0.126267,0.101138,0.115999,1.000000,0.258245,0.142533,0.098518
608,0.262241,0.032730,0.008096,0.116861,0.122607,0.137954,0.305439,0.175912,0.086229,0.048373,...,0.136316,0.174069,0.164479,0.133734,0.144896,0.188354,0.258245,1.000000,0.109563,0.248944
609,0.085434,0.024373,-0.002963,0.023930,0.258289,0.207124,0.084494,0.421627,-0.003937,0.014983,...,0.029664,0.331053,0.046000,0.232115,0.089810,0.052385,0.142533,0.109563,1.000000,0.033713


As we can see from the above matrix, larger positive values mean two users are more similar, while values near zero or below mean they are not similar.
E.g., the similarity of user 1 and user 2 is 0.019400, but the similarity of user 1 and user 3 is 0.053056, which means user 1 is more similar to user 3 than to user 2.
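
One caveat worth keeping in mind: because unrated movies were filled with 0 before calling `corr`, every unrated movie counts as a rating of 0 for that user. A possible alternative, sketched below on a made-up matrix, is to leave the gaps as `NaN` so that pandas computes each correlation only over movies both users rated; the `min_periods` argument then sets how many co-rated movies are required. The toy ratings and the threshold of 3 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical movie-by-user rating matrix; NaN means "not rated".
toy = pd.DataFrame(
    {
        "u1": [5.0, 4.0, 1.0, np.nan, np.nan],
        "u2": [4.0, 5.0, 2.0, np.nan, 3.0],
        "u3": [np.nan, np.nan, 1.0, 5.0, 4.0],
    },
    index=["m1", "m2", "m3", "m4", "m5"],
)

# Approach used in the notebook: treat missing ratings as 0.
zero_filled = toy.fillna(0).corr(method="pearson")

# Alternative: correlate only over co-rated movies and require
# at least 3 of them; pairs with fewer are reported as NaN.
co_rated = toy.corr(method="pearson", min_periods=3)

print(zero_filled.round(3))
print(co_rated.round(3))
```

In this toy matrix only `u1` and `u2` share three rated movies, so theirs is the only off-diagonal correlation the second approach reports. Whether that trade-off is worthwhile depends on how sparse the real data is.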

### Recommending movies for sample userId: 2

We will now try to come up with a movie list for userId = 2.

In [9]:
#Extracting the 10 highest similarities for user 2, sorted in descending order (the first entry is user 2 itself)

user_user_matrix.loc[2].sort_values(ascending=False).head(10)

userId
2      1.000000
366    0.297982
417    0.277366
378    0.273342
550    0.252051
189    0.240668
528    0.238262
461    0.237457
495    0.235147
435    0.231771
Name: 2, dtype: float64

In [10]:
#Converting the above data into a DF and removing user 2 itself (a user is always similar to itself, with a similarity of 1)

df_2 = pd.DataFrame(user_user_matrix.loc[2].sort_values(ascending=False).head(10))
df_2 = df_2.reset_index()
df_2.columns = ['userId', 'similarity']

In [11]:
df_2 = df_2[df_2['userId'] != 2]
df_2

,userId,similarity
1,366,0.297982
2,417,0.277366
3,378,0.273342
4,550,0.252051
5,189,0.240668
6,528,0.238262
7,461,0.237457
8,495,0.235147
9,435,0.231771
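
Note that taking `head(10)` before removing user 2 leaves nine neighbors rather than ten. If exactly `k` neighbors are wanted, one option is to drop the target user first. The helper below is a hypothetical sketch; the name `top_k_similar` and the toy matrix are not from the notebook.

```python
import pandas as pd


def top_k_similar(sim_matrix, user_id, k=10):
    """Return the k users most similar to `user_id`, excluding the user itself."""
    neighbors = (
        sim_matrix.loc[user_id]
        .drop(labels=user_id)  # a user is always perfectly similar to itself
        .sort_values(ascending=False)
        .head(k)
    )
    result = neighbors.reset_index()
    result.columns = ["userId", "similarity"]
    return result


# Hypothetical 4x4 similarity matrix for a quick check.
users = [1, 2, 3, 4]
toy_sim = pd.DataFrame(
    [
        [1.0, 0.2, 0.5, 0.1],
        [0.2, 1.0, 0.3, 0.7],
        [0.5, 0.3, 1.0, 0.4],
        [0.1, 0.7, 0.4, 1.0],
    ],
    index=users,
    columns=users,
)

print(top_k_similar(toy_sim, 2, k=2))
```

Applied to the notebook's data, the call would presumably be `top_k_similar(user_user_matrix, 2)`.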


In [12]:
#Creating a new DF which has all the similar users and the movies they rated

final_df = df_2.merge(df, on='userId', how='left')
final_df

,userId,similarity,movieId,rating,title
0,366,0.297982,110,4.0,Braveheart (1995)
1,366,0.297982,589,4.0,Terminator 2: Judgment Day (1991)
2,366,0.297982,1036,4.0,Die Hard (1988)
3,366,0.297982,1089,5.0,Reservoir Dogs (1992)
4,366,0.297982,2028,4.0,Saving Private Ryan (1998)
...,...,...,...,...,...
596,435,0.231771,68954,4.0,Up (2009)
597,435,0.231771,74458,4.5,Shutter Island (2010)
598,435,0.231771,79132,4.5,Inception (2010)
599,435,0.231771,80463,4.5,"Social Network, The (2010)"


Here we calculate a new metric (score) by multiplying the user similarity by the movie rating. The basic idea is that the more similar a user is to our target user, the more likely their highly rated movies are good suggestions. So a movie rated highly by the most similar user will rank higher in our recommendation.

In [13]:
final_df['score'] = final_df['similarity']*final_df['rating']
final_df

,userId,similarity,movieId,rating,title,score
0,366,0.297982,110,4.0,Braveheart (1995),1.191930
1,366,0.297982,589,4.0,Terminator 2: Judgment Day (1991),1.191930
2,366,0.297982,1036,4.0,Die Hard (1988),1.191930
3,366,0.297982,1089,5.0,Reservoir Dogs (1992),1.489912
4,366,0.297982,2028,4.0,Saving Private Ryan (1998),1.191930
...,...,...,...,...,...,...
596,435,0.231771,68954,4.0,Up (2009),0.927083
597,435,0.231771,74458,4.5,Shutter Island (2010),1.042968
598,435,0.231771,79132,4.5,Inception (2010),1.042968
599,435,0.231771,80463,4.5,"Social Network, The (2010)",1.042968
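
Because each row of `final_df` is one (user, movie) pair, a movie rated by several similar users appears several times with different scores. One common variant, shown here as a sketch rather than as the notebook's method, combines those rows into a similarity-weighted average rating per movie. The toy frame below mimics the columns of `final_df`; its values are invented, and the division assumes the summed similarities are positive.

```python
import pandas as pd

# Hypothetical rows in the same shape as final_df.
toy_final = pd.DataFrame(
    {
        "userId": [366, 366, 417, 417],
        "similarity": [0.30, 0.30, 0.20, 0.20],
        "title": ["Movie X", "Movie Y", "Movie X", "Movie Z"],
        "rating": [5.0, 4.0, 3.0, 5.0],
    }
)
toy_final["score"] = toy_final["similarity"] * toy_final["rating"]

# Per movie: sum of similarity*rating divided by sum of similarities.
grouped = toy_final.groupby("title").agg(
    score_sum=("score", "sum"),
    similarity_sum=("similarity", "sum"),
    n_raters=("userId", "count"),
)
grouped["weighted_rating"] = grouped["score_sum"] / grouped["similarity_sum"]

print(grouped.sort_values("weighted_rating", ascending=False))
```

The `n_raters` column is kept so that movies backed by a single neighbor can be filtered out or down-weighted if desired.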


In [14]:
#Creating a df for all the movies which are already watched by our target user2

watched_df = df[df['userId'] == 2]
watched_df

,userId,movieId,rating,title
232,2,318,3.0,"Shawshank Redemption, The (1994)"
233,2,333,4.0,Tommy Boy (1995)
234,2,1704,4.5,Good Will Hunting (1997)
235,2,3578,4.0,Gladiator (2000)
236,2,6874,4.0,Kill Bill: Vol. 1 (2003)
237,2,8798,3.5,Collateral (2004)
238,2,46970,4.0,Talladega Nights: The Ballad of Ricky Bobby (2...
239,2,48516,4.0,"Departed, The (2006)"
240,2,58559,4.5,"Dark Knight, The (2008)"
241,2,60756,5.0,Step Brothers (2008)


Here we remove the already watched movies from our recommendation, since there is no point in suggesting a movie the user has already seen.

In [15]:
cond = final_df['movieId'].isin(watched_df['movieId'])
final_df.drop(final_df[cond].index, inplace = True) 

In [16]:
#Sorting by score and keeping the titles of the top 10 rows

recommended_df = final_df.sort_values(by='score', ascending=False)[['title']].head(10)
recommended_df = recommended_df.reset_index(drop=True)

### Here is the list of top 10 recommended movies for user 2!

In [17]:
recommended_df

,title
0,Reservoir Dogs (1992)
1,"Truman Show, The (1998)"
2,"Matrix, The (1999)"
3,Trainspotting (1996)
4,"Godfather, The (1972)"
5,The Butterfly Effect (2004)
6,"Clockwork Orange, A (1971)"
7,"Godfather: Part II, The (1974)"
8,"Shining, The (1980)"
9,"Lord of the Rings: The Return of the King, The..."
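
The steps above can be gathered into one function so that recommendations for any user are a single call. This is a sketch under assumptions: the function name `recommend_movies` and its parameters are hypothetical, it expects a ratings frame with the same `userId`, `movieId`, `rating`, and `title` columns as `df`, and, unlike the cells above, it keeps only the best-scoring row per title. The toy ratings at the end are invented for a quick check.

```python
import pandas as pd


def recommend_movies(ratings, user_id, n_neighbors=9, n_movies=10):
    """User-based collaborative filtering, following the notebook's steps."""
    # Movie-by-user matrix with missing ratings filled with 0.
    matrix = pd.pivot_table(
        ratings, values="rating", index="movieId", columns="userId"
    ).fillna(0)
    # Pearson correlation between users (the columns).
    similarity = matrix.corr(method="pearson")
    # Most similar users, excluding the target user.
    neighbors = (
        similarity[user_id]
        .drop(labels=user_id)
        .sort_values(ascending=False)
        .head(n_neighbors)
        .rename("similarity")
        .rename_axis("userId")
        .reset_index()
    )
    # Attach the neighbors' ratings and score them.
    candidates = neighbors.merge(ratings, on="userId", how="left")
    candidates["score"] = candidates["similarity"] * candidates["rating"]
    # Drop movies the target user has already rated.
    watched = ratings.loc[ratings["userId"] == user_id, "movieId"]
    candidates = candidates[~candidates["movieId"].isin(watched)]
    # Keep the best-scoring row per title, highest scores first.
    return (
        candidates.sort_values("score", ascending=False)
        .drop_duplicates("title")
        .head(n_movies)["title"]
        .reset_index(drop=True)
    )


# Hypothetical ratings: user 2 agrees with user 1, user 3 does not.
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 2, 3, 3, 3],
        "movieId": [1, 2, 1, 2, 3, 1, 2, 4],
        "rating": [5.0, 4.0, 5.0, 4.0, 5.0, 1.0, 2.0, 5.0],
        "title": ["A", "B", "A", "B", "C", "A", "B", "D"],
    }
)

print(recommend_movies(toy_ratings, user_id=1, n_neighbors=1))
```

On the notebook's data the equivalent call would presumably be `recommend_movies(df, user_id=2)`. The default of nine neighbors mirrors the nine users left in `df_2` above.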
