## Hands-on implementation of collaborative filtering


In this notebook, we implement collaborative filtering on movie ratings: the ratings that users have given to movies are analyzed, and a set of similar movies is recommended to a user who queries one or more movies along with their ratings. Although the input is user ratings, the similarity is computed between movies, an approach usually called item-based collaborative filtering.

The dataset consists of two files, movies.csv and ratings.csv. movies.csv contains the movieId, title, and genres of each movie; ratings.csv contains the userId, movieId, rating, and timestamp of each rating. Of these features, we will use the movie titles and the ratings, with userId identifying who gave each rating.

The following steps build the recommendation system using collaborative filtering.

Let's start by importing the necessary libraries.

In [1]:
# Pandas for Data handling
import pandas as pd

# Numpy for numerical operations
import numpy as np

####  Load and read the data

The data is stored in two separate CSV files, so we load each one separately and display its first five rows.

In [2]:
# movies title dataset
movies_title = pd.read_csv('movies.csv')
movies_title.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
# movies ratings dataset
movies_ratings = pd.read_csv('ratings.csv')
movies_ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
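
A quick sanity check after loading can catch surprises such as missing values or unexpected rating ranges. The sketch below is only illustrative: `sample_ratings` is a toy frame built from the five rows displayed above, and `summarize_ratings` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy frame mirroring the five ratings.csv rows displayed above
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 1, 1],
    "movieId": [1, 3, 6, 47, 50],
    "rating": [4.0, 4.0, 4.0, 5.0, 5.0],
    "timestamp": [964982703, 964981247, 964982224, 964983815, 964982931],
})


def summarize_ratings(ratings):
    """Return basic counts, the rating range, and the number of missing cells."""
    return {
        "n_ratings": len(ratings),
        "n_users": ratings["userId"].nunique(),
        "n_movies": ratings["movieId"].nunique(),
        "min_rating": ratings["rating"].min(),
        "max_rating": ratings["rating"].max(),
        "missing_values": int(ratings.isna().sum().sum()),
    }


summary = summarize_ratings(sample_ratings)
print(summary)
```

The same helper could be called on `movies_ratings` once the real file is loaded.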


After loading both files, we merge them on their shared movieId column.

In [4]:
# Merge titles and ratings on the shared movieId column
title_ratings = pd.merge(movies_title, movies_ratings, on='movieId')

# Visualize top 5 rows
title_ratings.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
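
To make the join behaviour concrete, here is a minimal sketch on toy data. The frames `movies_demo` and `ratings_demo` are assumptions made up for illustration. An inner merge produces one row per rating and drops movies that nobody rated.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (illustrative values only)
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 5],
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 4.0],
})

# Inner join on the shared key: one output row per rating, with the title attached
merged_demo = pd.merge(movies_demo, ratings_demo, on="movieId", how="inner")
print(merged_demo)

# validate="one_to_many" raises an error if a movieId appears twice in movies_demo
checked_demo = pd.merge(movies_demo, ratings_demo, on="movieId", validate="one_to_many")
```

Here "Jumanji (1995)" has no ratings, so it does not appear in `merged_demo`.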


Let’s check the shape of the complete merged dataset. 

In [5]:
# Check the shapes
title_ratings.shape

(100836, 6)

#### Prepare the data for recommendation

For the recommendation we only need the user, the movie title, and the rating, so the genres and timestamp columns are dropped.

In [6]:
# Drop the columns that are not needed for the recommendation
title_ratings.drop(columns=['genres', 'timestamp'], inplace=True)

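Now we create a pivot table with one row per user and one column per movie title, holding the rating each user gave to each movie. Movies rated by fewer than 10 users are removed, and the remaining missing ratings are filled with 0.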

In [7]:
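# Pivot table: rows are users, columns are movie titles, values are ratings
UserRatings = title_ratings.pivot_table(index=['userId'], columns=['title'], values='rating')
print("Before: ", UserRatings.shape)
# Keep only movies with at least 10 ratings, then treat missing ratings as 0
UserRatings = UserRatings.dropna(thresh=10, axis=1).fillna(0)
print("After: ", UserRatings.shape)
UserRatings.head()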

Before:  (610, 9719)
After:  (610, 2269)


title,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,,,,,,,,,,,,,,,,,,,,,
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
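
The pivot, threshold, and fill steps can be traced on a tiny example. This is a sketch under assumed data: the titles "A" to "D" and the threshold of 2 are made up so that the effect is visible on three users.

```python
import pandas as pd

# Toy long-format ratings: movie "D" is rated by a single user
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 3, 3],
    "title": ["A", "B", "D", "A", "C", "A", "B", "C"],
    "rating": [5.0, 3.0, 4.0, 4.0, 2.0, 1.0, 4.0, 5.0],
})

# Wide user-by-movie matrix; unrated combinations become NaN
wide = toy.pivot_table(index="userId", columns="title", values="rating")

# Keep columns (movies) that have at least 2 non-missing ratings
kept = wide.dropna(thresh=2, axis=1)

# Replace the remaining missing ratings with 0
filled = kept.fillna(0)
print(filled)
```

Filling with 0 treats "not rated" the same as a very low rating, which is a simplification worth keeping in mind when reading the correlations later.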


#### Building correlation matrix

Now we compute the pairwise Pearson correlation between movies: each movie's column of ratings is compared with every other movie's column, giving a movie-by-movie similarity matrix.
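
To illustrate what the Pearson correlation measures, the sketch below computes it by hand on a toy matrix and compares it with `DataFrame.corr`. The matrix `ratings_matrix` and the helper `pearson` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix (already zero-filled), four users and three movies
ratings_matrix = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0, 0.0],
        "B": [4.0, 5.0, 0.0, 1.0],
        "C": [0.0, 1.0, 5.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)


def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


corr = ratings_matrix.corr(method="pearson")
manual_ab = pearson(ratings_matrix["A"], ratings_matrix["B"])
print(manual_ab, corr.loc["A", "B"])
```

Movies "A" and "B" are liked by the same users and correlate positively, while "A" and "C" are liked by opposite groups and correlate negatively.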

In [8]:
# Pearson correlations
relation_metrix = UserRatings.corr(method='pearson')
relation_metrix.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
title,,,,,,,,,,,,,,,,,,,,,
"'burbs, The (1989)",1.0,0.063117,-0.023768,0.143482,0.011998,0.087931,0.224052,0.034223,0.009277,0.008331,...,0.017477,0.03247,0.134701,0.153158,0.101301,0.049897,0.003233,0.187953,0.062174,0.353194
(500) Days of Summer (2009),0.063117,1.0,0.142471,0.273989,0.19396,0.148903,0.142141,0.159756,0.135486,0.200135,...,0.374515,0.178655,0.068407,0.414585,0.355723,0.252226,0.216007,0.053614,0.241092,0.125905
10 Cloverfield Lane (2016),-0.023768,0.142471,1.0,-0.005799,0.112396,0.006139,-0.016835,0.031704,-0.024275,0.272943,...,0.242663,0.099059,-0.023477,0.272347,0.241751,0.195054,0.319371,0.177846,0.096638,0.002733
10 Things I Hate About You (1999),0.143482,0.273989,-0.005799,1.0,0.24467,0.223481,0.211473,0.011784,0.091964,0.043383,...,0.243118,0.104858,0.13246,0.091853,0.158637,0.281934,0.050031,0.121029,0.130813,0.110612
"10,000 BC (2008)",0.011998,0.19396,0.112396,0.24467,1.0,0.234459,0.119132,0.059187,-0.025882,0.089328,...,0.260261,0.087592,0.094913,0.184521,0.242299,0.240231,0.094773,0.088045,0.203002,0.083518


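To retrieve movies similar to one the user has rated, we define a function that takes a movie title and the user's rating of it. The function scales that movie's correlations with every other movie by (rating - 2.5), roughly the midpoint of the 5-point scale, so a high rating pushes correlated movies up the list and a low rating pushes them down. It returns the scores sorted in descending order.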

In [9]:
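def get_similar(movie_name, rating):
    # Weight the movie's correlations by how far the rating is from 2.5
    similar_ratings = relation_metrix[movie_name] * (rating - 2.5)
    # Highest weighted scores first
    similar_ratings = similar_ratings.sort_values(ascending=False)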
    return similar_ratings
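
The effect of the (rating - 2.5) weighting is easiest to see on a small, made-up correlation matrix. In this sketch, `toy_corr` and `weighted_similarity` are hypothetical stand-ins for `relation_metrix` and `get_similar`; the correlation values are invented.

```python
import pandas as pd

titles = ["Spy A", "Spy B", "Romance C"]
# Invented correlations: the two spy movies are similar, the romance is not
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=titles,
    columns=titles,
)


def weighted_similarity(corr_matrix, movie_name, rating, midpoint=2.5):
    """Scale one movie's correlations by the rating's distance from the midpoint."""
    scores = corr_matrix[movie_name] * (rating - midpoint)
    return scores.sort_values(ascending=False)


liked = weighted_similarity(toy_corr, "Spy A", 5)     # weight +2.5
disliked = weighted_similarity(toy_corr, "Spy A", 1)  # weight -1.5
print(liked)
print(disliked)
```

With a rating of 5 the correlated "Spy B" ranks near the top, while a rating of 1 flips the signs so that the negatively correlated "Romance C" ranks first.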

#### Generating similar movies  

In this step, we call the function above for a list of action movies and their ratings, which gives one row of similarity scores per queried movie.

In [10]:
# Score every movie against each queried (title, rating) pair
movies = [("Skyfall (2012)", 5), ("Mission: Impossible III (2006)", 4)]
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of Series
similar_movies = pd.DataFrame([get_similar(movie, rating) for movie, rating in movies])



We have now generated similarity scores for Skyfall and Mission: Impossible III. Below, we sum the scores across both queries and print the top 10 movies.

In [11]:
# Top 10 movies that are similar to queries
similar_movies.sum().sort_values(ascending=False).head(10)

title
Skyfall (2012)                                 2.972980
Mission: Impossible III (2006)                 2.288300
Quantum of Solace (2008)                       2.089675
Casino Royale (2006)                           2.073860
Mission: Impossible - Ghost Protocol (2011)    1.972638
Prometheus (2012)                              1.933262
X-Men: First Class (2011)                      1.882690
Star Trek Into Darkness (2013)                 1.838459
Zombieland (2009)                              1.822665
Taken (2008)                                   1.816208
dtype: float64

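As the output shows, the recommendations are relevant to the queries: beyond the two queried titles themselves, the list is led by other James Bond and Mission: Impossible films, followed mostly by action and science-fiction movies.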
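
The two queried titles rank first because every movie correlates perfectly with itself. Skyfall's score of 2.972980, for example, is 1.0 × (5 − 2.5) plus its correlation with Mission: Impossible III times (4 − 2.5), which implies a correlation of about 0.315 between the two. A possible refinement, sketched below on invented data, is to drop the titles the user has already rated before taking the top results. The function `recommend` and the matrix `demo_corr` are assumptions, not part of the original notebook.

```python
import pandas as pd

names = ["Q1", "Q2", "X", "Y"]
# Invented symmetric correlation matrix for two queried titles and two candidates
demo_corr = pd.DataFrame(
    [[1.0, 0.3, 0.6, 0.1],
     [0.3, 1.0, 0.5, 0.4],
     [0.6, 0.5, 1.0, 0.2],
     [0.1, 0.4, 0.2, 1.0]],
    index=names,
    columns=names,
)


def recommend(corr_matrix, rated_movies, top_n=10, midpoint=2.5):
    """Sum the weighted correlations over all rated movies, excluding the rated titles."""
    rows = [corr_matrix[title] * (rating - midpoint) for title, rating in rated_movies]
    totals = pd.DataFrame(rows).sum()
    seen = [title for title, _ in rated_movies]
    return totals.drop(labels=seen).sort_values(ascending=False).head(top_n)


recs = recommend(demo_corr, [("Q1", 5), ("Q2", 4)])
print(recs)
```

Only unseen titles remain in `recs`, ordered by their combined score.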

This is how a recommendation system based on collaborative filtering can be implemented in Python on a real-world dataset.