# Movie Recommendation using Collaborative Filtering


Collaborative filtering is a popular technique in recommender systems, including those used by platforms such as Netflix and Amazon.

It uses the preferences and behavior of a large group of users to make recommendations for an individual user.

Let us obtain the ratings dataset first.

In [49]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [50]:
Ratings_Data = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv")
Ratings_Data

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


Now let's analyze the rating column of this dataset.

In [51]:
Ratings_Data.rating.describe()

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

Ratings range from 0.5 to 5, in steps of 0.5.
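
`describe()` shows the range but not the step size. One way to confirm the 0.5 granularity is to check that every rating is a whole multiple of 0.5. The sketch below uses a small hypothetical frame (`toy_ratings`) in place of `Ratings_Data` so that it runs offline; the same check should carry over to the real column.

```python
import pandas as pd

# Hypothetical stand-in for Ratings_Data, so the example runs without a download
toy_ratings = pd.DataFrame({"rating": [0.5, 3.0, 3.5, 4.0, 5.0, 4.0]})

# Distinct rating values, in ascending order
distinct_ratings = sorted(toy_ratings.rating.unique())

# A rating sits on the 0.5 grid if doubling it gives a whole number
on_half_step_grid = ((toy_ratings.rating * 2) % 1 == 0).all()

print(distinct_ratings)
print(on_half_step_grid)
```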

Now let's create a user-by-movie matrix of ratings, so that we can model each user's behavioral patterns with respect to movies.

In [52]:
matrix = Ratings_Data.pivot(values='rating', index='userId', columns='movieId')

matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,


This is a sparse matrix: most cells are NaN because each user has rated only a small fraction of the movies.
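
To put a number on the sparsity, one can compare the count of non-missing cells with the total number of cells. On the full dataset, 100836 ratings spread over 9724 movies and 610 users (assuming userIds 1 through 610 are all present) would fill only about 1.7% of the cells. The following is a minimal sketch on a hypothetical 3-user, 4-movie frame (`toy_ratings` and `toy_matrix` are illustrative names); on the real data the same expression would be applied to `matrix` before the NaNs are filled.

```python
import pandas as pd

# Hypothetical ratings: 3 users, 4 movies, 5 ratings in total
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating":  [4.0, 5.0, 3.5, 2.0, 4.5],
})

toy_matrix = toy_ratings.pivot(values="rating", index="userId", columns="movieId")

# Fraction of user-movie cells that actually hold a rating
filled_cells = toy_matrix.notna().sum().sum()
total_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
density = filled_cells / total_cells
sparsity = 1 - density

print(f"density={density:.3f}, sparsity={sparsity:.3f}")
```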

The NaNs can be replaced with 0. Since the lowest real rating is 0.5, a 0 unambiguously marks a movie the user has not rated.

In [53]:
matrix = matrix.fillna(0)

We base the recommendations on the movies the user rated highest.

In [83]:
# Let's assume userId is 500

userId = 500
matrix[matrix.index == userId]

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
500,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [84]:
# Sort the movies by this user's ratings. A 0 means the movie isn't rated, so such movies are candidates for recommendation

movie_ratings_by_user = matrix[matrix.index == userId].iloc[0].sort_values(ascending=False)
movie_ratings_by_user

movieId
1282      5.0
1175      5.0
3114      5.0
2997      5.0
2355      5.0
         ... 
4470      0.0
4471      0.0
4473      0.0
4474      0.0
193609    0.0
Name: 500, Length: 9724, dtype: float64

In [85]:
highest_rated_movies = list(movie_ratings_by_user[movie_ratings_by_user == movie_ratings_by_user.iloc[0]].index)
highest_rated_movies

[1282, 1175, 3114, 2997, 2355, 176, 4306, 3355, 1747, 2542, 1784, 2700, 2858]

We now have the list of movies that the user rated highest.

Before moving on to collaborative filtering, let us convert the list of movie IDs into titles.

In [58]:
movies = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv")

In [59]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [63]:
def get_movie_names_from_id(movies_list,movies):
    return list(movies[movies.movieId.isin(movies_list)].title)
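
Note that `isin` filters the rows of `movies` in the order they appear in that table, so the returned titles follow the order of `movies.csv` rather than the order of `movies_list`. If the ranking order matters, one possible variant looks the titles up by ID instead. This is only a sketch: `get_movie_names_in_order` and `toy_movies` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the movies table
toy_movies = pd.DataFrame({
    "movieId": [32, 141, 1356],
    "title": [
        "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)",
        "Birdcage, The (1996)",
        "Star Trek: First Contact (1996)",
    ],
})

def get_movie_names_in_order(movies_list, movies):
    # Index titles by movieId, then pull them out in the order requested
    titles_by_id = movies.set_index("movieId")["title"]
    return [titles_by_id.loc[m] for m in movies_list if m in titles_by_id.index]

ordered = get_movie_names_in_order([141, 32, 1356], toy_movies)
print(ordered)
```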

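The next two cells appear to have been run with a different `userId`: the cell numbers show the notebook was executed out of order, and the titles match the results for user 150 further below. That is why the list differs from the one shown above for user 500.

In [69]: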
highest_rated_movies

[141, 32, 1356]

In [70]:
get_movie_names_from_id(highest_rated_movies,movies)

['Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 'Birdcage, The (1996)',
 'Star Trek: First Contact (1996)']

### Item-based Collaborative Filtering

We will now perform item-based collaborative filtering. Starting from the user's highest-rated movies, we look for movies the user has not rated (rating of 0) that are most similar to them, using cosine similarity between movie rating vectors as the similarity measure.
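
For reference, the cosine similarity of two rating vectors a and b is a·b / (‖a‖ ‖b‖). For non-negative ratings it lies between 0 (no user rated both movies) and 1 (the rating vectors point in the same direction). The sketch below computes it by hand with NumPy for a hypothetical 4-user, 3-movie matrix (`toy`), treating each movie column as a vector. The notebook itself uses `sklearn.metrics.pairwise.cosine_similarity`, which should give the same matrix when applied to `toy.T`.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings; 0 means "not rated", as in the notebook
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 5.0],
     [1.0, 0.0, 4.0]],
    index=[1, 2, 3, 4],        # userId
    columns=[101, 102, 103],   # movieId
)

items = toy.to_numpy().T                  # one row per movie
norms = np.linalg.norm(items, axis=1)     # length of each movie vector
item_sim = (items @ items.T) / np.outer(norms, norms)

item_sim_df = pd.DataFrame(item_sim, index=toy.columns, columns=toy.columns)
print(item_sim_df.round(3))
```

Movies 101 and 102 come out as the most similar pair because the same users rated both highly, while 102 and 103 share no raters and score 0.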

In [71]:
def recommend_movie_item_cf(Ratings_Data,userId,k,movies):

    # User x movie rating matrix; a 0 marks a movie the user has not rated
    matrix = Ratings_Data.pivot(values='rating', index='userId', columns='movieId')
    matrix = matrix.fillna(0)

    # All movies that share this user's top rating
    movie_ratings_by_user = matrix.loc[userId].sort_values(ascending=False)
    highest_rated_movies = list(movie_ratings_by_user[movie_ratings_by_user == movie_ratings_by_user.iloc[0]].index)

    # Positions of those movies among the matrix columns (cos_sim uses the same order)
    highest_rated_movies_position = [matrix.columns.get_loc(x) for x in highest_rated_movies]

    # Item-item cosine similarity: each movie is a vector of ratings across users
    cos_sim = cosine_similarity(matrix.transpose())

    # For each top-rated movie, collect its similarity to every other movie
    similar_movies = []
    for highest_rated_movie_position in highest_rated_movies_position:
        Similar_to_movieID = pd.DataFrame({'MovieID': list(matrix.columns),
                                           'Cosine_similarity_with_MovieID': list(cos_sim[highest_rated_movie_position])})
        # Drop the movie's own row (similarity 1 with itself)
        similar_movies.append(Similar_to_movieID.drop(highest_rated_movie_position))

    Recommended_movies_df = pd.concat(similar_movies).sort_values(by='Cosine_similarity_with_MovieID',ascending=False)

    # Attach the user's own ratings and keep only movies not yet rated
    user_ratings_df = movie_ratings_by_user.rename('User_Rating').reset_index()
    Recommended_movies_df = Recommended_movies_df.merge(user_ratings_df,right_on='movieId',left_on='MovieID').drop('movieId',axis=1)
    Recommended_movies_if_not_watched_df = Recommended_movies_df[Recommended_movies_df.User_Rating==0]
    Top_k_recommended_movie_ids = list(Recommended_movies_if_not_watched_df.MovieID.iloc[0:k])

    # Translate IDs to titles (titles come back in the order of the movies table)
    highest_rated_movie_names = get_movie_names_from_id(highest_rated_movies,movies)
    recommended_movie_names = get_movie_names_from_id(Top_k_recommended_movie_ids,movies)

    return highest_rated_movie_names, recommended_movie_names
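
One thing to be aware of: because the candidate lists for several liked movies are concatenated, the same `MovieID` can appear more than once before the top `k` rows are taken, which can leave fewer than `k` distinct titles. A possible refinement, shown here on hypothetical data, keeps only each movie's best similarity score before ranking. The frame `candidates` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical concatenated candidates: movie 7 is similar to two liked movies
candidates = pd.DataFrame({
    "MovieID": [7, 9, 7, 12],
    "Cosine_similarity_with_MovieID": [0.80, 0.75, 0.60, 0.40],
})

# Keep the highest score per movie, then rank from most to least similar
best_per_movie = (
    candidates
    .groupby("MovieID", as_index=False)["Cosine_similarity_with_MovieID"]
    .max()
    .sort_values("Cosine_similarity_with_MovieID", ascending=False)
)

top_k = list(best_per_movie.MovieID.iloc[0:2])
print(top_k)
```

Taking the maximum is only one choice; summing or averaging the scores across liked movies would instead favor candidates that resemble several of them.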


In [99]:
No_of_movies_to_recommend = 20
userId = 150

highest_rated_movie_names_by_user, Recommended_movie_names_list_for_user = recommend_movie_item_cf(Ratings_Data=Ratings_Data,userId=userId,k=No_of_movies_to_recommend,movies=movies)

In [100]:
highest_rated_movie_names_by_user

['Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 'Birdcage, The (1996)',
 'Star Trek: First Contact (1996)']

In [101]:
Recommended_movie_names_list_for_user

['Toy Story (1995)',
 'Seven (a.k.a. Se7en) (1995)',
 'Usual Suspects, The (1995)',
 'Braveheart (1995)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Pulp Fiction (1994)',
 'Shawshank Redemption, The (1994)',
 'Star Trek: Generations (1994)',
 'Fugitive, The (1993)',
 'Jurassic Park (1993)',
 'Terminator 2: Judgment Day (1991)',
 'Batman (1989)',
 'Silence of the Lambs, The (1991)',
 'Trainspotting (1996)',
 'Terminator, The (1984)',
 'Star Trek VI: The Undiscovered Country (1991)',
 'Star Trek II: The Wrath of Khan (1982)',
 'Star Trek III: The Search for Spock (1984)',
 'Star Trek IV: The Voyage Home (1986)',
 'Mars Attacks! (1996)']

In [103]:
print("Since you like :\n\n"+'\n'.join(str(item) for item in highest_rated_movie_names_by_user))
print("\nYou might also like :\n\n"+'\n'.join(str(item) for item in Recommended_movie_names_list_for_user))

Since you like :

Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
Birdcage, The (1996)
Star Trek: First Contact (1996)

You might also like :

Toy Story (1995)
Seven (a.k.a. Se7en) (1995)
Usual Suspects, The (1995)
Braveheart (1995)
Star Wars: Episode IV - A New Hope (1977)
Pulp Fiction (1994)
Shawshank Redemption, The (1994)
Star Trek: Generations (1994)
Fugitive, The (1993)
Jurassic Park (1993)
Terminator 2: Judgment Day (1991)
Batman (1989)
Silence of the Lambs, The (1991)
Trainspotting (1996)
Terminator, The (1984)
Star Trek VI: The Undiscovered Country (1991)
Star Trek II: The Wrath of Khan (1982)
Star Trek III: The Search for Spock (1984)
Star Trek IV: The Voyage Home (1986)
Mars Attacks! (1996)


### We have now built a working recommender system based on user behavior towards movies. 

An advantage of collaborative filtering is that it does not require explicit features for users or movies, and it can pick up implicit patterns that such features would miss. For example, Barbie and Oppenheimer belong to different genres, but users' behavior might include watching both because of "Barbenheimer".