## Movie Recommender using Collaborative Filtering
In this project I shall be building a movie recommender that uses collaborative filtering, meaning that movies are recommended on the basis of user ratings. To receive recommendations, a user must first rate a number of movies they have seen. The system then uses the ratings from all users to find movies that tend to be rated similarly to the ones this user liked, and recommends the most similar movies that the user has not yet seen.

I shall be using the 100k MovieLens dataset which contains roughly 100,000 user ratings for around 10,000 movies.

In [1]:
import numpy as np
import pandas as pd

In [2]:
ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
movie_ratings = pd.merge(movies, ratings)
movie_ratings = movie_ratings.drop(["timestamp", "genres"], axis=1)
movie_ratings.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
1,1,Toy Story (1995),5,4.0
2,1,Toy Story (1995),7,4.5
3,1,Toy Story (1995),15,2.5
4,1,Toy Story (1995),17,4.5


With the two datasets merged, I now have the columns I need to create a matrix of user ratings against movie titles.
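As a rough illustration of what the merge and drop steps do, here is a minimal sketch on made-up data. The frames `toy_movies` and `toy_ratings` and their rows are assumptions for illustration only, not the real CSV contents.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (made-up rows).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy", "Drama", "Horror"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [0, 0, 0],
})

# The key is named explicitly here; without `on`, pd.merge joins on the
# column names the two frames share, which is also just movieId.
# The default inner join drops Movie C, which has no ratings.
toy_merged = pd.merge(toy_movies, toy_ratings, on="movieId")
toy_merged = toy_merged.drop(["timestamp", "genres"], axis=1)
print(toy_merged)
```

Each row of the result is one rating with the movie title attached, which is the shape the pivot step needs.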

In [5]:
user_ratings = movie_ratings.pivot_table(index=["userId"], columns=["title"], values="rating")
print(user_ratings.shape)
user_ratings.head()

(610, 9719)


userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


This matrix shows the ratings given by 610 users for 9719 movies. Where a user has not rated a movie, the rating value is NaN. We need to clean this up.
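To make the shape of this matrix concrete, here is a small sketch of the same `pivot_table` call on invented ratings. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Long-format ratings: one row per (user, movie) pair (made-up values).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [4.0, 2.0, 5.0, 3.0],
})

# Rows become users, columns become titles; pairs with no rating are NaN.
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")
print(toy_matrix)
```

User 2 never rated Movie B and user 3 never rated Movie A, so those two cells come out as NaN, just as most cells do in the full matrix.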

In [6]:
user_ratings = user_ratings.dropna(thresh=20, axis=1)
user_ratings = user_ratings.fillna(0, axis=1)
user_ratings.shape

(610, 1297)

All NaN ratings have now been replaced with ratings of 0, and I have also dropped any movie that fewer than 20 users have rated. This leaves 1297 movies.
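The effect of `thresh` is easier to see on a tiny matrix. The sketch below uses invented ratings and a threshold of 3 rather than 20, purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up user-by-movie matrix with gaps where a user gave no rating.
toy_matrix = pd.DataFrame(
    {
        "Movie A": [4.0, 5.0, np.nan, 3.0],
        "Movie B": [np.nan, 2.0, np.nan, np.nan],
        "Movie C": [1.0, np.nan, 4.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# thresh=3 keeps only columns with at least 3 non-NaN values,
# so Movie B (a single rating) is dropped.
kept = toy_matrix.dropna(thresh=3, axis=1)

# The remaining gaps are filled with 0.
filled = kept.fillna(0)
print(filled)
```

One consequence worth keeping in mind is that, after filling, an unrated movie is treated as a rating of 0 when correlations are computed.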

In [7]:
pearson_matrix = user_ratings.corr()
pearson_matrix.head()

title,(500) Days of Summer (2009),10 Things I Hate About You (1999),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),13 Going on 30 (2004),"13th Warrior, The (1999)",1408 (2007),2001: A Space Odyssey (1968),2012 (2009),...,Young Frankenstein (1974),Young Guns (1988),Zack and Miri Make a Porno (2008),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
(500) Days of Summer (2009),1.0,0.273989,0.148903,0.142141,0.159756,0.297152,0.072835,0.226574,0.113616,0.274272,...,0.066077,0.073476,0.374515,0.414585,0.355723,0.252226,0.216007,0.053614,0.241092,0.125905
10 Things I Hate About You (1999),0.273989,1.0,0.223481,0.211473,0.011784,0.321071,0.215828,0.06947,0.085974,0.064187,...,0.144038,0.152333,0.243118,0.091853,0.158637,0.281934,0.050031,0.121029,0.130813,0.110612
101 Dalmatians (1996),0.148903,0.223481,1.0,0.285112,0.119843,0.188467,0.004213,0.159777,0.110844,0.090231,...,0.177214,0.033582,0.114968,0.067134,0.113224,0.184324,0.054024,0.047804,0.156932,0.078734
101 Dalmatians (One Hundred and One Dalmatians) (1961),0.142141,0.211473,0.285112,1.0,0.134037,0.218406,0.135894,0.227193,0.10223,0.112334,...,0.180318,0.143006,0.120302,0.08365,0.171654,0.27426,0.077594,0.085606,0.24882,0.171118
12 Angry Men (1957),0.159756,0.011784,0.119843,0.134037,1.0,-0.027672,0.08476,0.189497,0.195909,0.236037,...,0.135876,0.139655,0.104518,0.241435,0.144652,0.122107,0.056742,-0.001708,0.074306,0.102744


This correlation matrix (using the Pearson method) shows the similarity between every pair of movies based on the user ratings.
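For reference, the sketch below computes the Pearson coefficient by hand for one pair of columns and compares it with `DataFrame.corr()`. The ratings are made up, and the helper `pearson` is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Made-up ratings from four users for three movies.
toy = pd.DataFrame({
    "Movie A": [5.0, 4.0, 0.0, 1.0],
    "Movie B": [4.0, 5.0, 1.0, 0.0],
    "Movie C": [0.0, 1.0, 5.0, 4.0],
})

def pearson(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

toy_corr = toy.corr()  # Pearson is the default method
manual = pearson(toy["Movie A"], toy["Movie B"])

print(manual)
print(toy_corr)
```

Movies A and B are liked by the same users, so their coefficient is close to 1, while Movie C is rated in the opposite pattern to Movie A and gets a coefficient of -1.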

In [8]:
def recommend_movies(user_ratings, num_recommendations=10):
    # user_ratings is a list of (movie_title, rating) tuples; inside this
    # function the name shadows the global ratings matrix.
    similar_movies = []

    for movie_title, rating in user_ratings:
        # Centre the rating on 2.5: higher ratings boost similar movies,
        # lower ratings count against them.
        similar_movies.append(pearson_matrix[movie_title] * (rating - 2.5))

    # DataFrame.append was removed in pandas 2.0, so the weighted rows are
    # collected in a list and combined in one step.
    recommendations = pd.DataFrame(similar_movies).sum().sort_values(ascending=False)
    movies_already_watched = [title for title, _ in user_ratings]
    recommendations = recommendations.drop(movies_already_watched)

    return recommendations[:num_recommendations]

Now I have my method for recommending movies to a user, based on the ratings they have given so far. The method goes through each of the user's movie ratings and weights that movie's correlations with every other movie by the rating minus 2.5, so ratings above 2.5 boost similar movies and ratings below 2.5 count against them. It then sums these weighted scores into a single ranked list of recommended movies. The number of recommendations can be supplied as an argument, but is 10 by default. The recommendations do not include any movies the user has already rated.
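The arithmetic can be followed by hand on a small example. The sketch below applies the same weighting and summing to an invented 4x4 correlation matrix; `toy_corr` and `toy_user` are assumptions for illustration, not values from the real data.

```python
import pandas as pd

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Made-up symmetric correlation matrix between four movies.
toy_corr = pd.DataFrame(
    [
        [1.0, 0.8, 0.1, 0.3],
        [0.8, 1.0, 0.2, 0.4],
        [0.1, 0.2, 1.0, 0.6],
        [0.3, 0.4, 0.6, 1.0],
    ],
    index=titles,
    columns=titles,
)

# The user loved Movie A (weight 5 - 2.5 = 2.5) and disliked Movie C (weight 1 - 2.5 = -1.5).
toy_user = [("Movie A", 5), ("Movie C", 1)]

# One weighted row of correlations per rated movie.
rows = [toy_corr[title] * (rating - 2.5) for title, rating in toy_user]

# Sum the rows per movie, remove what the user has already rated, and rank.
scores = pd.DataFrame(rows).sum()
scores = scores.drop([title for title, _ in toy_user]).sort_values(ascending=False)
print(scores)
```

Movie B scores 0.8 × 2.5 + 0.2 × (−1.5) = 1.7, while Movie D scores 0.3 × 2.5 + 0.6 × (−1.5) = −0.15, so Movie B is ranked first because it resembles the liked movie and not the disliked one.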

Let's try this method out with a few examples. 

In [9]:
list(user_ratings.columns)

['(500) Days of Summer (2009)',
 '10 Things I Hate About You (1999)',
 '101 Dalmatians (1996)',
 '101 Dalmatians (One Hundred and One Dalmatians) (1961)',
 '12 Angry Men (1957)',
 '13 Going on 30 (2004)',
 '13th Warrior, The (1999)',
 '1408 (2007)',
 '2001: A Space Odyssey (1968)',
 '2012 (2009)',
 '21 Grams (2003)',
 '21 Jump Street (2012)',
 '25th Hour (2002)',
 '27 Dresses (2008)',
 '28 Days (2000)',
 '28 Days Later (2002)',
 '28 Weeks Later (2007)',
 '300 (2007)',
 '3:10 to Yuma (2007)',
 '40-Year-Old Virgin, The (2005)',
 '50 First Dates (2004)',
 '6th Day, The (2000)',
 '8 Mile (2002)',
 'A.I. Artificial Intelligence (2001)',
 'About Schmidt (2002)',
 'About a Boy (2002)',
 'Abyss, The (1989)',
 'Ace Ventura: Pet Detective (1994)',
 'Ace Ventura: When Nature Calls (1995)',
 'Adaptation (2002)',
 'Addams Family Values (1993)',
 'Addams Family, The (1991)',
 'Adjustment Bureau, The (2011)',
 'Adventures in Babysitting (1987)',
 'Adventures of Buckaroo Banzai Across the 8th Dimensio

This is just so I can see the list of movies in the system that I can try giving ratings for. 
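With over a thousand titles, searching the column index can be quicker than scanning the full list. The following is a small sketch on a made-up index; `find_titles` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Stand-in for user_ratings.columns (a handful of titles only).
toy_columns = pd.Index([
    "Harry Potter and the Goblet of Fire (2005)",
    "Alien (1979)",
    "Aliens (1986)",
    "Finding Nemo (2003)",
])

def find_titles(columns, text):
    """Return the titles containing `text`, ignoring case (hypothetical helper)."""
    mask = columns.str.contains(text, case=False, regex=False)
    return list(columns[mask])

print(find_titles(toy_columns, "alien"))
```

The exact title string matters, because `recommend_movies` looks each title up as a column of the correlation matrix.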

In [10]:
user_1_ratings = [("Alien (1979)", 5), ("Finding Nemo (2003)", 3), ("Harry Potter and the Prisoner of Azkaban (2004)", 5), 
          ("Hot Fuzz (2007)", 1), ("Meet the Fockers (2004)", 2)]

recommend_movies(user_1_ratings)

Harry Potter and the Chamber of Secrets (2002)                                                    1.844621
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)    1.799909
Harry Potter and the Goblet of Fire (2005)                                                        1.704027
Aliens (1986)                                                                                     1.687595
Incredibles, The (2004)                                                                           1.528027
Harry Potter and the Order of the Phoenix (2007)                                                  1.524120
Shrek (2001)                                                                                      1.426586
Shrek 2 (2004)                                                                                    1.415013
Star Wars: Episode V - The Empire Strikes Back (1980)                                             1.413035
Terminator, The (1984)               

In [11]:
user_2_ratings = [("Incredibles, The (2004)", 5), ("Bridget Jones's Diary (2001)", 1), ("Emperor's New Groove, The (2000)", 4), 
          ("Inside Out (2015)", 4), ("Man of Steel (2013)", 2)]

recommend_movies(user_2_ratings)

Ratatouille (2007)                                               2.013212
Finding Nemo (2003)                                              1.896571
Toy Story 3 (2010)                                               1.845058
Monsters, Inc. (2001)                                            1.841448
Pirates of the Caribbean: The Curse of the Black Pearl (2003)    1.809898
Zootopia (2016)                                                  1.775912
Batman Begins (2005)                                             1.714044
Cars (2006)                                                      1.669771
Spider-Man 2 (2004)                                              1.654738
V for Vendetta (2006)                                            1.631591
dtype: float64

In [12]:
user_3_ratings = [("Mummy, The (1999)", 2), ("Ratatouille (2007)", 4), ("Scream (1996)", 5), 
          ("Robin Hood (1973)", 1), ("Scary Movie (2000)", 4)]

recommend_movies(user_3_ratings)

Me, Myself & Irene (2000)              1.430969
Scary Movie 2 (2001)                   1.334665
American Pie 2 (2001)                  1.317684
Others, The (2001)                     1.300623
Liar Liar (1997)                       1.251182
Bruce Almighty (2003)                  1.235053
Scream 2 (1997)                        1.222922
There's Something About Mary (1998)    1.199792
Whole Nine Yards, The (2000)           1.187437
Road Trip (2000)                       1.181927
dtype: float64

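Looking at the three examples above, the recommendations look sensible given each user's ratings. The first user, who rated Alien and a Harry Potter film highly, is offered Aliens and the other Harry Potter films. The second, who liked The Incredibles and Inside Out, is mostly offered other animated films. The third, who liked Scream and Scary Movie, is offered their sequels alongside similar comedies.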