# Movie Recommendation using Collaborative Filtering


Collaborative Filtering is a popular technique used in recommender systems, including those employed by platforms like Netflix and Amazon. 

It uses the preferences and behavior of a large group of users to make recommendations to each individual user.

Let us load the ratings dataset first.

In [21]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [22]:
Ratings_Data = pd.read_csv("ratings_small.csv")
Ratings_Data

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


Now let's analyze the rating column of this dataset.

In [23]:
Ratings_Data.rating.describe()

count    100004.000000
mean          3.543608
std           1.058064
min           0.500000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

Ratings range from 0.5 to 5 in steps of 0.5.

Now let's build a user vs. movie matrix of the ratings users gave to movies, so that we can model user behavioral patterns with respect to movies.
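The summary statistics alone do not show the step size, so it can be checked directly. The following is a minimal sketch on a toy frame; `toy_ratings` and its values are illustrative assumptions, not rows from `ratings_small.csv`. The same two checks could be run on `Ratings_Data`.

```python
import pandas as pd

# Toy stand-in for Ratings_Data (illustrative values only)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating": [2.5, 3.0, 4.0, 0.5, 5.0],
})

# Every rating should be a whole multiple of 0.5 ...
on_half_star_grid = bool(((toy_ratings["rating"] * 2) % 1 == 0).all())
# ... and lie between 0.5 and 5.0 inclusive
in_range = bool(toy_ratings["rating"].between(0.5, 5.0).all())

# How often each distinct rating value occurs, ordered by rating
rating_counts = toy_ratings["rating"].value_counts().sort_index()

print(on_half_star_grid, in_range)
print(rating_counts)
```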

In [24]:
matrix = Ratings_Data.pivot(values='rating', index='userId', columns='movieId')

matrix

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,4.0,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,4.0,,,,,,,,,,...,,,,,,,,,,


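This matrix is sparse: each user has rated only a small fraction of the movies, so most cells are NaN.

We replace the NaNs with 0. Since the lowest real rating is 0.5, a 0 unambiguously means "not rated".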

In [25]:
matrix = matrix.fillna(0)
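To make the pivot and the fill step concrete, here is a small sketch on toy data. The frame `toy_ratings` and the names `toy_matrix`, `sparsity`, and `filled` are assumptions made for illustration. Sparsity has to be measured before the fill, while unrated cells are still NaN.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, movie) pair (illustrative values)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Wide user x movie matrix; unrated pairs become NaN
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Fraction of user-movie cells that carry no rating
sparsity = toy_matrix.isna().to_numpy().mean()

# After the fill, 0 stands for "not rated"
filled = toy_matrix.fillna(0)

print(round(sparsity, 3))
print(filled)
```

On this toy input, 5 of the 9 cells are empty. On the real matrix the empty share is far larger, which is why the displayed corners of the matrix are almost entirely NaN.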

We base recommendations on the movies the user rated highest.

In [26]:
# Let's assume userId is 500

userId = 500
matrix[matrix.index == userId]

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,,,,,,,,,,,,,,,,,,,,,
500,2.0,1.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# Sort movies by this user's ratings. 0 means the movie isn't rated, so such movies are candidates for recommendation

movie_ratings_by_user = matrix[matrix.index == userId].iloc[0].sort_values(ascending=False)
movie_ratings_by_user

movieId
7669      5.0
38061     5.0
356       5.0
2324      5.0
2139      4.5
         ... 
3848      0.0
3847      0.0
3846      0.0
3845      0.0
163949    0.0
Name: 500, Length: 9066, dtype: float64

In [28]:
highest_rated_movies = list(movie_ratings_by_user[movie_ratings_by_user == movie_ratings_by_user.iloc[0]].index)
highest_rated_movies

[7669, 38061, 356, 2324]

We now have the list of movies that the user rated highest.

Before moving on to collaborative filtering, let us convert the list of movie IDs into movie names.

In [29]:
links_df = pd.read_csv("links_small.csv", dtype={'imdbId': str})

In [30]:
movies_metadata_df = pd.read_csv("movies_metadata.csv")


In [31]:
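def get_movie_names_from_id(Recommended_movies_list, links_df, movies_metadata_df):
    # Map movieIds to IMDb IDs; imdbId was read as str, so leading zeros are preserved
    IMDB_Recommended_movies_list = list(links_df[links_df.movieId.isin(Recommended_movies_list)].imdbId)
    # movies_metadata_df stores IMDb IDs with a "tt" prefix
    IMDB_Recommended_movies_list = ["tt" + str(x) for x in IMDB_Recommended_movies_list]
    # Note: titles come back in metadata row order, not in the order of the input list
    return list(movies_metadata_df[movies_metadata_df.imdb_id.isin(IMDB_Recommended_movies_list)].title)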

In [32]:
highest_rated_movies

[7669, 38061, 356, 2324]

In [33]:
get_movie_names_from_id(highest_rated_movies,links_df,movies_metadata_df)

['Forrest Gump', 'Life Is Beautiful', 'Kiss Kiss Bang Bang']

Note: the 4 movie IDs yield only 3 movie names. This is a data issue: one of the IDs is missing in movies_metadata_df.
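One way to find out which ID fails to map is a left merge with an indicator column. The sketch below runs on hypothetical miniature versions of the two tables (`toy_links`, `toy_metadata`); the helper `find_unmapped_movie_ids` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of links_df and movies_metadata_df
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": ["0000001", "0000002", "0000003"],
})
toy_metadata = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000003"],
    "title": ["Movie A", "Movie C"],
})

def find_unmapped_movie_ids(movie_ids, links, metadata):
    """Return the movieIds for which no title can be found."""
    wanted = links[links["movieId"].isin(movie_ids)].copy()
    wanted["imdb_id"] = "tt" + wanted["imdbId"].astype(str)
    # indicator=True adds a _merge column: "both" or "left_only" for a left merge
    merged = wanted.merge(metadata, on="imdb_id", how="left", indicator=True)
    no_metadata = merged.loc[merged["_merge"] == "left_only", "movieId"].tolist()
    # IDs that are absent from the links table cannot be mapped either
    known_ids = set(links["movieId"].tolist())
    not_in_links = [m for m in movie_ids if m not in known_ids]
    return no_metadata + not_in_links

missing = find_unmapped_movie_ids([1, 2, 3, 4], toy_links, toy_metadata)
print(missing)
```

Here movie 2 has a link but no metadata row, and movie 4 has no link at all, so both are reported.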

### Item-based Collaborative Filtering

We will now perform item-based collaborative filtering: starting from the user's highest-rated movies, we look for movies the user has not watched (rating 0) that are most similar to them, using cosine similarity between the movies' rating vectors as the similarity measure.
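For two movies with rating vectors a and b (one entry per user), cosine similarity is dot(a, b) / (|a| * |b|). The sketch below computes the full item-item matrix with NumPy on a toy user x movie matrix; `toy_matrix` and its values are assumptions for illustration. scikit-learn's `cosine_similarity` applied to the transposed matrix performs the same row normalization and dot product.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix; 0 means "not rated" (illustrative values)
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 3.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Transpose so that each row is one movie's vector of user ratings
X = toy_matrix.T.to_numpy()

# Scale every movie vector to unit length; this assumes no movie has an
# all-zero vector, which holds when every movie has at least one rating
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot products of unit vectors are the pairwise cosine similarities
item_sim = pd.DataFrame(
    X_unit @ X_unit.T,
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)

print(item_sim.round(3))
```

Movies 10 and 20 are rated by the same users with similar scores, so their similarity is close to 1. Movie 30 shares no raters with them, so its similarity to both is 0.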

In [34]:
def recommend_movie_item_cf(Ratings_Data,userId,k,links_df,movies_metadata_df):

    matrix = Ratings_Data.pivot(values='rating', index='userId', columns='movieId')
    matrix = matrix.fillna(0)
    
    #highest_rated_movie = matrix[matrix.index == userId].iloc[0].sort_values(ascending=False).index[0]
    movie_ratings_by_user = matrix[matrix.index == userId].iloc[0].sort_values(ascending=False)
    highest_rated_movies = list(movie_ratings_by_user[movie_ratings_by_user == movie_ratings_by_user.iloc[0]].index)
        
    highest_rated_movies_position = [matrix.columns.get_loc(x) for x in highest_rated_movies]
    
    # Item-item cosine similarity: transpose so that each row is a movie
    cos_sim = cosine_similarity(matrix.transpose())

    Recommended_movies_df = pd.DataFrame()

    for highest_rated_movie_position in highest_rated_movies_position:

        Similar_to_movieID = pd.DataFrame(zip(list(matrix.columns),list(cos_sim[highest_rated_movie_position])))
        Similar_to_movieID = Similar_to_movieID.rename({0:'MovieID',1:'Cosine_similarity_with_MovieID'},axis=1)
        #print(Similar_to_movieID.sort_values('Cosine_similarity_with_MovieID',ascending=False).drop(highest_rated_movie_position))
        Recommended_movies_df = pd.concat([Recommended_movies_df,Similar_to_movieID.sort_values('Cosine_similarity_with_MovieID',ascending=False).drop(highest_rated_movie_position)])
        
    Recommended_movies_df = Recommended_movies_df.sort_values(by='Cosine_similarity_with_MovieID',ascending=False)

    Recommended_movies_df = Recommended_movies_df.merge(pd.DataFrame(movie_ratings_by_user.reset_index()),right_on='movieId',left_on='MovieID')
    Recommended_movies_df = Recommended_movies_df.rename({userId:'User_Rating'},axis=1).drop('movieId',axis=1)
    Recommended_movies_if_not_watched_df = Recommended_movies_df[Recommended_movies_df.User_Rating==0]
    # Caveat: each movie appears once per highest-rated movie, so these k rows may repeat a MovieID
    Top_k_Recommended_movies_if_not_watched_df = list(Recommended_movies_if_not_watched_df.MovieID.iloc[0:k])

    highest_rated_movie_names = get_movie_names_from_id(highest_rated_movies,links_df,movies_metadata_df)
    # The function works for any userId, so the local name does not mention a specific user
    Recommended_movie_names_list = get_movie_names_from_id(Top_k_Recommended_movies_if_not_watched_df,links_df,movies_metadata_df)


    return highest_rated_movie_names, Recommended_movie_names_list
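Because the function concatenates one similarity table per highest-rated movie, the same candidate can occur several times in the combined ranking. A possible alternative, sketched here on toy data, keeps only each unseen movie's best similarity to any highest-rated movie before taking the top k. The names `top_k_unseen`, `item_sim`, and `user_ratings` are hypothetical, and this is not the notebook's own implementation.

```python
import pandas as pd

# Hypothetical item-item similarity table for four movies
item_sim = pd.DataFrame(
    [[1.0, 0.9, 0.8, 0.1],
     [0.9, 1.0, 0.7, 0.2],
     [0.8, 0.7, 1.0, 0.3],
     [0.1, 0.2, 0.3, 1.0]],
    index=[10, 20, 30, 40],
    columns=[10, 20, 30, 40],
)

# One user's ratings; 0 means "not rated"
user_ratings = pd.Series({10: 5.0, 20: 5.0, 30: 0.0, 40: 0.0})

def top_k_unseen(item_sim, user_ratings, k):
    """Rank unseen movies by their best similarity to any top-rated movie."""
    seeds = user_ratings[user_ratings == user_ratings.max()].index
    unseen = user_ratings[user_ratings == 0].index
    # Rows = top-rated movies, columns = unseen movies; the column-wise max
    # leaves exactly one score per unseen movie, so no ID can repeat
    best = item_sim.loc[seeds, unseen].max(axis=0)
    return best.sort_values(ascending=False).head(k).index.tolist()

recs = top_k_unseen(item_sim, user_ratings, k=2)
print(recs)
```

Taking the maximum per movie gives the same ordering as sorting all (movie, similarity) pairs and keeping each movie's first occurrence, while guaranteeing up to k distinct IDs.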


In [35]:
No_of_movies_to_recommend = 10

highest_rated_movie_names_by_user_500, Recommended_movie_names_list_for_User_500 = recommend_movie_item_cf(Ratings_Data=Ratings_Data,userId=500,k=No_of_movies_to_recommend,links_df=links_df,movies_metadata_df=movies_metadata_df)

In [36]:
highest_rated_movie_names_by_user_500

['Forrest Gump', 'Life Is Beautiful', 'Kiss Kiss Bang Bang']

In [37]:
Recommended_movie_names_list_for_User_500

['Pulp Fiction',
 'Speed',
 'True Lies',
 'The Fugitive',
 "Schindler's List",
 'Terminator 2: Judgment Day',
 'Dances with Wolves',
 'Back to the Future',
 'Camelot',
 'I Am a Fugitive from a Chain Gang']

In [38]:
print("Since user "+str(userId)+" has most highly rated :\n\n"+'\n'.join(str(item) for item in highest_rated_movie_names_by_user_500))
print("\nRecommended New unwatched Movies :\n\n"+'\n'.join(str(item) for item in Recommended_movie_names_list_for_User_500))

Since user 500 has most highly rated :

Forrest Gump
Life Is Beautiful
Kiss Kiss Bang Bang

Recommended New unwatched Movies :

Pulp Fiction
Speed
True Lies
The Fugitive
Schindler's List
Terminator 2: Judgment Day
Dances with Wolves
Back to the Future
Camelot
I Am a Fugitive from a Chain Gang


### We have now built a working recommender system.

However, the genres of the highest-rated and the recommended movies do not always align. For example, all of user 500's highest-rated movies are comedies, yet decidedly non-comic movies such as "Schindler's List" and "Terminator 2: Judgment Day" are recommended.

This is because movies are compared for similarity only in terms of user ratings; features such as genre are not considered anywhere.

Thus, we will also try out content-based filtering, which considers the metadata features of a movie.