# Film Recommendation Program


Note: This is taken from a DataQuest video tutorial. The video can be found here: https://www.youtube.com/watch?v=eyEabQRBMQA

The data file for this project is found here: https://files.grouplens.org/datasets/movielens/ml-25m.zip
            

## 0. Introduction

I like movies. I also have a lot of friends who like movies. Unfortunately, because tastes vary so much from person to person, it can be very hard to recommend movies to someone else. Just because you happen to like a film a lot doesn't mean they'll like it too. So instead of taking the time to get to know someone, figure out what their tastes are, and carefully think about what other movies they might like, let's just create a robot that can recommend a list of movies for us.

The purpose of this project is to end up with a widget that takes in the name (or partial name) of a film, then outputs a list of films that are similar to the input. Similarity will be determined by a few criteria, the most important of which is ratings from a large group of film watchers. We assume that someone who really likes our input film will also really like films similar to it, so we can create a score based on the ratings of the watchers who liked our input.

(TODO: Add criteria based upon other factors: genre/year/director/tags/etc)

## 1. Importing the Data

Once we've downloaded the data from the source above, we can take a look at the list of movies that we will be working with:

In [20]:
import pandas as pd

movies = pd.read_csv("movies.csv")

In [21]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


So we have more than 62 thousand film titles to sort through. To make searching this list easier, we'll first need to clean up the data, so let's remove any special characters from the titles using regular expressions:

In [22]:
import re

def clean_title(title):
    return re.sub("[^a-zA-Z0-9 ]", "", title)

In [23]:
movies["clean_title"] = movies["title"].apply(clean_title)

In [24]:
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001
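
One side effect worth knowing about: the character class `[^a-zA-Z0-9 ]` drops every character that isn't an ASCII letter, a digit, or a space, so accented letters disappear along with the punctuation. Here's a quick sketch of that behavior. The second and third sample strings are illustrative titles I made up for the test, not rows checked against the dataset:

```python
import re

def clean_title(title):
    # Keep only ASCII letters, digits, and spaces
    return re.sub("[^a-zA-Z0-9 ]", "", title)

samples = [
    "Bug's Life, A (1998)",      # apostrophe, comma, parentheses
    "Am\u00e9lie (2001)",        # illustrative: contains an accented e
    "WALL\u00b7E (2008)",        # illustrative: contains a middle dot
]
cleaned = [clean_title(s) for s in samples]

for raw, clean in zip(samples, cleaned):
    print(f"{raw!r} -> {clean!r}")
```

Since the accented letter is removed rather than replaced, a query typed without the accent would not line up with the cleaned title. Normalizing accents before stripping would be one possible refinement, though it isn't something this project does.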


## 2. Searching the Data

Alright, that's much better. Now we can create a search widget so that users won't have to know the exact format in our database to find the film they're looking for. 

To do this we'll use the TF-IDF vectorizer that's included in scikit-learn to build a matrix recording which words appear in which titles. We then create the search function to rank the titles by their cosine similarity to our input and return the top 5 results by score. Finally, we create a Python widget to collect the input and display the output.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2)) # Using an ngram_range to check for pairs of words along with individual ones

tfidf = vectorizer.fit_transform(movies["clean_title"])

In [26]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # argsort gives a true ranking; argpartition alone would not order the top 5
    indices = np.argsort(similarity)[-5:][::-1]  # top 5 matches, best first
    results = movies.iloc[indices]
    return results
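
To get a feel for what the vectorizer and the cosine similarity are doing, here's a minimal sketch on four titles instead of the full list. The short `titles` list is a hand-picked stand-in for `movies["clean_title"]`, so the printed scores are only illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Toy stand-in for movies["clean_title"] (assumed sample, not the full dataset)
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995", "Grumpier Old Men 1995"]

# Same settings as above: single words plus pairs of adjacent words
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(titles)

# Turn the query into a vector in the same space, then compare it to every title
query_vec = vectorizer.transform(["Toy Story"])
similarity = cosine_similarity(query_vec, tfidf).flatten()

# Rank every title from most to least similar
ranked = np.argsort(similarity)[::-1]
for i in ranked:
    print(f"{similarity[i]:.3f}  {titles[i]}")
```

The two Toy Story titles share words with the query and get a positive score, while the other two share nothing and score exactly zero. One detail to be aware of: the vectorizer's default token pattern ignores one-character tokens, so the "2" in "Toy Story 2" doesn't contribute to the match.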

In [27]:
import ipywidgets as widgets
from IPython.display import display

movie_input = widgets.Text(
    value="Toy Story",
    description="Movie Title:",
    disabled=False
)
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))
            
movie_input.observe(on_type, names='value')

display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()
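
A note on picking the top results: `np.argpartition` only guarantees that the k largest values land in the last k positions, not that those positions are themselves in order. Because the recommendation step later treats the first search result as the best match, the order matters. Here's a minimal sketch on made-up similarity scores showing one way to turn the partitioned indices into a real ranking:

```python
import numpy as np

# Made-up similarity scores, one per title
similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.20, 0.80, 0.05])
k = 3

# The k largest scores end up in the last k slots, in no guaranteed order
top_unordered = np.argpartition(similarity, -k)[-k:]

# Sort just those k indices by their score, best first
top_ranked = top_unordered[np.argsort(similarity[top_unordered])[::-1]]
print(top_ranked)
```

Partitioning first and sorting only the k survivors is mostly useful for very large arrays. For a list of this size, a plain `np.argsort` over everything works just as well.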

## 3. Adding the Ratings

Okay, so now that we can search through our data and find the movie someone has in mind, the next step is to figure out who likes which movies. The source data includes a ratings CSV file, which lists the ratings made by users in the database. It has some useful information:

In [28]:
ratings = pd.read_csv("ratings.csv")

In [29]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

## 4. Finding Similar Users

Now let's find the list of users who rated a given movie highly. In this case we'll use those who rated it higher than 4 out of 5 (ratings come in half-star steps, so that means 4.5 or 5), which by most standards means they liked it. We'll collect those users in an array:

In [30]:
movie_id = 1 # This is the id for the first movie in our data, which is Toy Story in this case

In [31]:
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

In [32]:
similar_users

array([    36,     75,     86, ..., 162527, 162530, 162533], dtype=int64)
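
To see exactly which rows the filter keeps, here's a small sketch on a hypothetical seven-row ratings table (the user ids and ratings are invented for illustration). Note that a rating of exactly 4.0 does not pass the `> 4` test:

```python
import pandas as pd

# Hypothetical toy ratings, same column names as ratings.csv
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [1, 2, 1, 3, 1, 2, 1],
    "rating":  [5.0, 4.5, 4.0, 5.0, 4.5, 3.0, 2.0],
})
movie_id = 1

# True only for rows about our movie with a rating strictly above 4
liked = (ratings["movieId"] == movie_id) & (ratings["rating"] > 4)
similar_users = ratings[liked]["userId"].unique()
print(similar_users)
```

Users 1 and 3 make the cut. User 2 rated the movie 4.0 and user 4 rated it 2.0, so both are left out.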

Now that we have the list of users who rated our input highly, we can find what other films those people rated highly. We will pull out all the ratings in the system that meet these criteria:

In [33]:
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

In [34]:
similar_user_recs

5101            1
5105           34
5111          110
5114          150
5127          260
            ...  
24998854    60069
24998861    67997
24998876    78499
24998884    81591
24998888    88129
Name: movieId, Length: 1358326, dtype: int64

So we have the list of all the ratings that meet our criteria, but there are over 1.3 million of them, so we need to pare that down a bit. For each film (by movieId), let's compute the share of users who liked the input film that also liked that film, and keep only the films with a large enough share. This way we can see which films are most widely liked by the users who like the input.

In [35]:
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

similar_user_recs = similar_user_recs[similar_user_recs > .1] # Keep only films liked by more than 10% of similar users, to keep the list a little shorter

In [36]:
similar_user_recs

1        1.000000
318      0.445607
260      0.403770
356      0.370215
296      0.367295
           ...   
953      0.103053
551      0.101195
1222     0.100876
745      0.100345
48780    0.100186
Name: movieId, Length: 113, dtype: float64
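
The same counting step can be followed by hand on a tiny example. This sketch uses a hypothetical ratings table and a cutoff of 0.5 instead of 0.1, purely because the toy data only has two similar users:

```python
import pandas as pd

# Hypothetical toy ratings
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 3, 3, 3, 2, 2],
    "movieId": [1, 2, 3, 1, 2, 4, 1, 3],
    "rating":  [5.0, 4.5, 5.0, 4.5, 5.0, 3.0, 4.0, 5.0],
})
similar_users = [1, 3]  # assumed: the users who rated movie 1 above 4

# Every movie that a similar user rated above 4
liked_by_similar = ratings[ratings["userId"].isin(similar_users) & (ratings["rating"] > 4)]["movieId"]

# Count likes per movie, then divide by the number of similar users
share_all = liked_by_similar.value_counts() / len(similar_users)

# Toy cutoff of 0.5 (the project itself uses 0.1)
share = share_all[share_all > 0.5]
print(share)
```

Movies 1 and 2 were liked by both similar users (share 1.0), movie 3 by only one of them (share 0.5), so only movies 1 and 2 survive the cutoff. Movie 1 is the input itself, which is why it always comes out at exactly 1.0.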

## 5. Accounting for Taste

Now we come to the part where taste in movies becomes an issue. Users who liked our input film might like it simply because it's a good movie, not necessarily because their tastes are similar to ours. This could mean that if we look for recommendations after watching Toy Story, we might get Casablanca as an output just because they're both great movies. Unfortunately Casablanca is nothing like Toy Story, so we need a way to pick out the films that are specific to users who share the taste behind our input.

To do this we'll take the list of films in our previous list and see how they're rated based upon the entire database of users. This way we can ensure the percentage of people in our similar_user set who liked a recommended movie is larger than the percentage across the entire database. This will downplay some movies that are generally thought of as good even if you don't appreciate the genre (this is like enjoying Toy Story even if you don't usually like animated films).

In [37]:
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

In [38]:
all_users_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [41]:
all_users_recs

318      0.342220
296      0.284674
2571     0.244033
356      0.235266
593      0.225909
           ...   
551      0.040918
50872    0.039111
745      0.037031
78499    0.035131
2355     0.025091
Name: movieId, Length: 113, dtype: float64

Now that we have the percentages across our entire user database, we can compare the two groups:

In [42]:
rec_percentages = pd.concat([similar_user_recs, all_users_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [43]:
rec_percentages

Unnamed: 0,similar,all
1,1.000000,0.124728
32,0.160711,0.100293
34,0.130555,0.052229
47,0.225909,0.144469
50,0.275604,0.200513
...,...,...
59315,0.104593,0.054269
60069,0.170640,0.076307
68954,0.159172,0.064944
78499,0.152960,0.035131


So it's easy enough to see that there are some differences in the percentage of people "similar" to us and the entire group who liked a recommended film. Let's create a score that will let us rank our recommended films for easy viewing:

In [44]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
rec_percentages = rec_percentages.sort_values("score", ascending=False)

In [45]:
rec_percentages

Unnamed: 0,similar,all,score
1,1.000000,0.124728,8.017414
3114,0.280648,0.053706,5.225654
2355,0.110539,0.025091,4.405452
78499,0.152960,0.035131,4.354038
4886,0.235147,0.070811,3.320783
...,...,...,...
2858,0.216724,0.167634,1.292845
296,0.367295,0.284674,1.290232
79132,0.166817,0.131384,1.269693
4973,0.142501,0.112405,1.267747
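
The score is just a ratio, so it's easy to check by hand. A score of 1 would mean a film is liked equally often by similar users and by everyone, and anything above 1 means it's over-represented among similar users. This sketch recomputes three scores from the rounded values shown in the table above, so the results agree with the table only to a few decimal places:

```python
# (similar, all) shares copied from the rounded table output above, keyed by movieId
rows = {
    1:    (1.000000, 0.124728),
    3114: (0.280648, 0.053706),
    296:  (0.367295, 0.284674),
}

# score = share among similar users / share among all users
scores = {movie_id: similar / overall for movie_id, (similar, overall) in rows.items()}

# Print from highest to lowest score
for movie_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(movie_id, round(score, 3))
```

Movie 296 is liked by more than a third of similar users, far more than movie 3114, yet it ranks much lower because it is almost as popular with everyone else.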


And let's merge this dataframe with our movies dataframe (loaded from movies.csv) so that we can look at titles instead of guessing at id numbers:

In [46]:
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
0,1.0,0.124728,8.017414,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,0.280648,0.053706,5.225654,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
2264,0.110539,0.025091,4.405452,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A 1998
14813,0.15296,0.035131,4.354038,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
4780,0.235147,0.070811,3.320783,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,Monsters Inc 2001
580,0.216618,0.067513,3.208539,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
6258,0.228139,0.072268,3.156862,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,Finding Nemo 2003
587,0.1794,0.059977,2.99115,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,Beauty and the Beast 1991
8246,0.203504,0.068453,2.972889,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
359,0.253411,0.085764,2.954762,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,Lion King The 1994
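
The merge works because the index of the scores dataframe holds movieId values, while in the movies dataframe movieId is an ordinary column. Here's a minimal sketch of that index-to-column join, using two hypothetical miniature frames:

```python
import pandas as pd

# Hypothetical miniature scores frame: the index holds movieId values
rec_percentages = pd.DataFrame(
    {"score": [8.0, 5.2]},
    index=[1, 3114],
)

# Hypothetical miniature movies frame: movieId is a regular column
movies = pd.DataFrame({
    "movieId": [1, 2, 3114],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Toy Story 2 (1999)"],
})

# Match the left frame's index against the right frame's movieId column
merged = rec_percentages.merge(movies, left_index=True, right_on="movieId")
print(merged[["score", "title"]])
```

By default this is an inner join, so movies without a score (Jumanji here) simply drop out, and the rows keep the order of the scores frame.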


## 6. Putting it All Together

We're basically done now! All that's left is to put all the code into a compact space and then create a widget that lets you search up a film name. 

In [47]:
def find_similar_movies(movie_id):
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

In [48]:
movie_name_input = widgets.Text(
    value="Toy Story",
    description="Movie Title:",
    disabled=False
)

recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))
            
movie_name_input.observe(on_type, names="value")

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()
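
If you'd like to try the whole pipeline without downloading the dataset, here's a sketch of a variant that takes its data as arguments instead of reading the global `ratings` and `movies`. The name `find_similar_movies_from`, its parameters, the empty-result guard, and the toy data are all my own assumptions for illustration and aren't part of the original tutorial:

```python
import pandas as pd

def find_similar_movies_from(ratings, movies, movie_id, like_above=4, min_share=0.10, top_n=10):
    """Hypothetical variant of find_similar_movies that takes its data as arguments."""
    # A rating counts as a "like" when it is strictly above like_above
    liked = ratings[ratings["rating"] > like_above]

    # Users who liked the input movie
    similar_users = liked.loc[liked["movieId"] == movie_id, "userId"].unique()
    if len(similar_users) == 0:
        # Nobody liked the input, so there is nothing to base a score on
        return pd.DataFrame(columns=["score", "title", "genres"])

    # Share of similar users who liked each movie
    similar = liked.loc[liked["userId"].isin(similar_users), "movieId"].value_counts() / len(similar_users)
    similar = similar[similar > min_share]

    # Share of all users (who liked at least one candidate) who liked each movie
    candidates = liked[liked["movieId"].isin(similar.index)]
    overall = candidates["movieId"].value_counts() / candidates["userId"].nunique()

    recs = pd.concat([similar, overall], axis=1)
    recs.columns = ["similar", "all"]
    recs["score"] = recs["similar"] / recs["all"]
    recs = recs.sort_values("score", ascending=False).head(top_n)
    return recs.merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

# Toy data: users 1 and 2 like both Toy Story films, users 3 and 4 prefer Casablanca
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Toy Story 2 (1999)", "Casablanca (1942)"],
    "genres": ["Animation", "Animation", "Drama|Romance"],
})
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [1, 2, 3, 1, 2, 3, 2, 3, 1],
    "rating":  [5.0, 5.0, 5.0, 4.5, 4.5, 5.0, 4.5, 4.5, 3.0],
})

result = find_similar_movies_from(ratings, movies, movie_id=1)
print(result)
```

On this toy data Casablanca is liked by three of the four users overall but by only one of the two Toy Story fans, so it ends up with a score below 1 and lands at the bottom of the list. Passing a movie id that nobody liked returns an empty table rather than dividing by zero.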

## 7. Conclusion and Further Steps

So it looks like this works pretty well! We can create a list of films that look pretty similar to our input, which should give a curious film-watcher a solid starting point for more movies. 

There are a few more things we can add to this to create a more accurate list of recommendations, such as adding in the genre of the film to the score criteria or taking into account the release date. 