# Movie Recommender

Adapted from Ibtesam Ahmed's "Getting Started with a Movie Recommendation System" notebook on Kaggle (https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system).

In this notebook, I will build a baseline content-based movie recommender using the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata). The goal of this recommender is to recommend movies based on a Letterboxd user's top-rated movies using the [Letterboxd API](https://api-docs.letterboxd.com/).

**So let's go!**

> *  **Content-Based Filtering**: These systems suggest items that are similar to a particular item. They use movie metadata, such as genre, director, plot keywords, actors, production company, etc., to make recommendations. The general idea is that if a person liked a particular item, they will also like items that are similar to it.

Let's load the data now.

## Imports

In [1]:
import pandas as pd 
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## EDA

Datasets taken from [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata).

In [2]:
df1 = pd.read_csv('data/tmdb_5000_credits.csv')
df2 = pd.read_csv('data/tmdb_5000_movies.csv')

In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


In [4]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

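`homepage` and `tagline` have many empty entries (only 1712 of 4803 rows have a `homepage`). `overview`, `release_date`, and `runtime` are each missing a few values as well.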

In [5]:
df1.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [6]:
df2.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


The first "credits" dataset contains the following features:

| feature | description |
| ------- | ----------- |
| title | Title of the movie. |
| movie_id | A unique identifier for each movie. |
| cast | Names of the lead and supporting actors. |
| crew | Names of the director, editor, composer, writer, etc. |

The second "movies" dataset has the following features:

| feature | description |
| ------- | ----------- |
| budget | Budget with which the movie was made. |
| genres | Genres of the movie (action, comedy, thriller, etc.). |
| homepage | Link to the homepage of the movie. |
| id | Unique movie identifier; matches `movie_id` in the first dataset. |
| keywords | Keywords or tags related to the movie. |
| original_language | Language in which the movie was made. |
| original_title | Title of the movie before translation or adaptation. |
| overview | Brief description of the movie. |
| popularity | A numeric quantity specifying the movie's popularity. |
| production_companies | Production houses of the movie. |
| production_countries | Countries in which it was produced. |
| release_date | Date on which it was released. |
| revenue | Worldwide revenue generated by the movie. |
| runtime | Running time of the movie in minutes. |
| spoken_languages | Languages spoken in the movie. |
| status | Release status, e.g. "Released" or "Rumored". |
| tagline | Movie's tagline. |
| title | Title of the movie. |
| vote_average | Average rating the movie received. |
| vote_count | The count of votes received. |

Let's join the two datasets on the `id` column.


### Merge datasets

In [7]:
df1 = df1.drop(columns=['title'])
df1.columns = ['id','cast','crew']

In [8]:
df = df1.merge(df2, on='id')
df.head()

Unnamed: 0,id,cast,crew,budget,genres,homepage,keywords,original_language,original_title,overview,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
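
The join above relies on every `id` appearing exactly once in each file. A minimal sketch of how that could be checked, using two small made-up frames (`credits` and `movies` are hypothetical stand-ins for `df1` and `df2`, not the real data): `validate` makes pandas raise an error if a key is duplicated, and `indicator` labels rows that exist on only one side.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2
credits = pd.DataFrame({"id": [1, 2, 3], "cast": ["a", "b", "c"]})
movies = pd.DataFrame({"id": [1, 2, 4], "title": ["X", "Y", "Z"]})

# The default inner join keeps only ids present in both frames;
# validate raises pandas.errors.MergeError if an id is repeated on either side
inner = credits.merge(movies, on="id", validate="one_to_one")
print(len(inner))  # 2

# An outer join with indicator=True adds a "_merge" column showing where each row came from
audit = credits.merge(movies, on="id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(sorted(unmatched["id"]))  # [3, 4]
```

On the real data, the merged frame and both inputs all have 4803 rows, so an audit like this would be expected to find no unmatched rows.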


# **Content Based Filtering**
In this recommender system, the content of the movie (overview, cast, crew, keywords, tagline, etc.) is used to measure its similarity to other movies. The most similar movies are then recommended.

![](https://miro.medium.com/v2/resize:fit:400/format:webp/1*BME1JjIlBEAI9BV5pOO5Mg.png)

## **Credits, Genres, Keywords, Production Company Based Recommender**
The quality of a content-based recommender depends on the metadata it is given. In this section we build a recommender based on the following metadata: the first four listed actors, the director, and up to three each of the related genres, plot keywords, and production companies.

From the cast, crew, and keywords features, we need to extract the leading actors, the director, and the keywords associated with each movie. Right now, the data is stored as "stringified" lists, so we first need to parse it into a safe and usable structure.

In [9]:
# Parse the stringified features into their corresponding python objects
features = ['cast', 'crew', 'keywords', 'genres', 'production_companies']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)
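
As a small illustration of what `literal_eval` does to a single cell, here is a sketch on one string shaped like a raw `genres` value (the string is typed in by hand for the example, not read from the CSV):

```python
from ast import literal_eval

# A hand-written value in the same shape as a raw `genres` cell
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)

print(type(raw).__name__)     # str
print(type(parsed).__name__)  # list
print(parsed[0]["name"])      # Action
```

`literal_eval` only accepts Python literals (strings, numbers, lists, dicts, and so on), which is why it is a safer choice than `eval` for text loaded from a file.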

Next, we'll write functions that will help us to extract the required information from each feature.

In [10]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(crew):
    for member in crew:
        if member['job'] == 'Director':
            return member['name']
    return np.nan

In [11]:
# Return the names of the first n elements (default 3), or of the entire list if it has fewer than n.
def get_top_n(x, n=3):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # Slicing returns the whole list when it has fewer than n elements
        return names[:n]

    # Return an empty list in case of missing/malformed data
    return []
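
To make the behavior of the two helpers concrete, the sketch below re-declares them in compact form and runs them on made-up records (the names `Editor A`, `Director B`, and `Actor 1` to `Actor 5` are placeholders, not values from the dataset):

```python
import numpy as np

def get_director(crew):
    # Same logic as the cell above: first crew member whose job is "Director"
    for member in crew:
        if member['job'] == 'Director':
            return member['name']
    return np.nan

def get_top_n(x, n=3):
    # Same logic as the cell above: names of the first n entries
    if isinstance(x, list):
        return [i['name'] for i in x][:n]
    return []

# Hypothetical records shaped like parsed `crew` and `cast` values
crew = [
    {'job': 'Editor', 'name': 'Editor A'},
    {'job': 'Director', 'name': 'Director B'},
]
cast = [{'name': f'Actor {k}'} for k in range(1, 6)]

print(get_director(crew))    # Director B
print(get_director([]))      # nan
print(get_top_n(cast, n=2))  # ['Actor 1', 'Actor 2']
print(get_top_n(cast[:1]))   # ['Actor 1']
print(get_top_n(np.nan))     # []
```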

In [12]:
# Define the director, cast, keywords, genres and production_companies features in a usable form.
df['director'] = df['crew'].apply(get_director)
df['cast'] = df['cast'].apply(get_top_n, n=4)  # keep the first four cast members

features = ['keywords', 'genres', 'production_companies']
for feature in features:
    df[feature] = df[feature].apply(get_top_n)

In [13]:
# Print the new features of the first 5 films
df[['title', 'cast', 'director', 'keywords', 'genres', 'production_companies']].head()

Unnamed: 0,title,cast,director,keywords,genres,production_companies
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]","[Ingenious Film Partners, Twentieth Century Fo..."
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]","[Walt Disney Pictures, Jerry Bruckheimer Films..."
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]","[Columbia Pictures, Danjaq, B24]"
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman, A...",Christopher Nolan,"[dc comics, crime fighter, terrorist]","[Action, Crime, Drama]","[Legendary Pictures, Warner Bros., DC Entertai..."
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton,...",Andrew Stanton,"[based on novel, mars, medallion]","[Action, Adventure, Science Fiction]",[Walt Disney Pictures]


The next step is to convert the names and keywords to lowercase and remove the spaces inside them. This way each full name becomes a single token, so our vectorizer doesn't treat the "Tom" in "Tom Hanks" and the "Tom" in "Tom Cruise" as the same word.

In [14]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
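
The effect of removing spaces can be seen directly in the vocabulary that `CountVectorizer` learns. This is a small sketch on two hand-written names, using the vectorizer's default settings rather than the exact configuration used later in the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["Tom Hanks", "Tom Cruise"]    # names with spaces
joined_docs = ["tomhanks", "tomcruise"]   # names after lowercasing and removing spaces

# vocabulary_ maps each learned token to a column index; sorting a dict yields its keys
raw_vocab = sorted(CountVectorizer().fit(raw_docs).vocabulary_)
joined_vocab = sorted(CountVectorizer().fit(joined_docs).vocabulary_)

print(raw_vocab)     # ['cruise', 'hanks', 'tom']
print(joined_vocab)  # ['tomcruise', 'tomhanks']
```

With spaces, both actors share the token `tom` and would look partially similar. Without spaces, they have no token in common.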

In [15]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres', 'production_companies']

cbf_df = df.copy()
cbf_df = cbf_df[['id', 'title', 'director', 'cast', 'genres', 'keywords', 'production_companies']]

for feature in features:
    cbf_df[feature] = df[feature].apply(clean_data)

In [16]:
cbf_df

Unnamed: 0,id,title,director,cast,genres,keywords,production_companies
0,19995,Avatar,jamescameron,"[samworthington, zoesaldana, sigourneyweaver, ...","[action, adventure, fantasy]","[cultureclash, future, spacewar]","[ingeniousfilmpartners, twentiethcenturyfoxfil..."
1,285,Pirates of the Caribbean: At World's End,goreverbinski,"[johnnydepp, orlandobloom, keiraknightley, ste...","[adventure, fantasy, action]","[ocean, drugabuse, exoticisland]","[waltdisneypictures, jerrybruckheimerfilms, se..."
2,206647,Spectre,sammendes,"[danielcraig, christophwaltz, léaseydoux, ralp...","[action, adventure, crime]","[spy, basedonnovel, secretagent]","[columbiapictures, danjaq, b24]"
3,49026,The Dark Knight Rises,christophernolan,"[christianbale, michaelcaine, garyoldman, anne...","[action, crime, drama]","[dccomics, crimefighter, terrorist]","[legendarypictures, warnerbros., dcentertainment]"
4,49529,John Carter,andrewstanton,"[taylorkitsch, lynncollins, samanthamorton, wi...","[action, adventure, sciencefiction]","[basedonnovel, mars, medallion]",[waltdisneypictures]
...,...,...,...,...,...,...,...
4798,9367,El Mariachi,robertrodriguez,"[carlosgallardo, jaimedehoyos, petermarquardt,...","[action, crime, thriller]","[unitedstates–mexicobarrier, legs, arms]",[columbiapictures]
4799,72766,Newlyweds,edwardburns,"[edwardburns, kerrybishé, marshadietlein, cait...","[comedy, romance]",[],[]
4800,231617,"Signed, Sealed, Delivered",scottsmith,"[ericmabius, kristinbooth, crystallowe, geoffg...","[comedy, drama, romance]","[date, loveatfirstsight, narration]","[frontstreetpictures, museentertainmententerpr..."
4801,126186,Shanghai Calling,danielhsia,"[danielhenney, elizacoupe, billpaxton, alanruck]",[],[],[]


We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer (namely the director, actors, genres, keywords and production companies).

In [17]:
def create_soup(x):
    return f"{x['director']} {' '.join(x['cast'])} {' '.join(x['genres'])} {' '.join(x['keywords'])} {' '.join(x['production_companies'])}"
cbf_df['soup'] = cbf_df.apply(create_soup, axis=1)

In [18]:
cbf_df['soup']

0       jamescameron samworthington zoesaldana sigourn...
1       goreverbinski johnnydepp orlandobloom keirakni...
2       sammendes danielcraig christophwaltz léaseydou...
3       christophernolan christianbale michaelcaine ga...
4       andrewstanton taylorkitsch lynncollins samanth...
                              ...                        
4798    robertrodriguez carlosgallardo jaimedehoyos pe...
4799    edwardburns edwardburns kerrybishé marshadietl...
4800    scottsmith ericmabius kristinbooth crystallowe...
4801    danielhsia danielhenney elizacoupe billpaxton ...
4802    brianherzlinger drewbarrymore brianherzlinger ...
Name: soup, Length: 4803, dtype: object
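
To show how one row turns into a soup string, here is a sketch that applies the same f-string to a made-up cleaned row (`janedoe`, `actorone`, and `actortwo` are placeholders). A plain dict is enough for the illustration because `create_soup` only looks values up by key.

```python
def create_soup(x):
    # Same format as the cell above
    return f"{x['director']} {' '.join(x['cast'])} {' '.join(x['genres'])} {' '.join(x['keywords'])} {' '.join(x['production_companies'])}"

# Hypothetical cleaned row with some metadata missing
row = {
    'director': 'janedoe',
    'cast': ['actorone', 'actortwo'],
    'genres': ['comedy'],
    'keywords': [],              # missing metadata is an empty list
    'production_companies': [],
}

soup = create_soup(row)
print(repr(soup))    # 'janedoe actorone actortwo comedy  '
print(soup.split())  # ['janedoe', 'actorone', 'actortwo', 'comedy']
```

Empty lists only leave extra spaces behind, which do not produce tokens, so rows with missing keywords or production companies simply contribute fewer terms.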

In [19]:
# Create the count matrix from the soup column (CountVectorizer was imported above)
count_vec = CountVectorizer(stop_words='english')
count_matrix = count_vec.fit_transform(cbf_df['soup'])
count_matrix

<4803x16882 sparse matrix of type '<class 'numpy.int64'>'
	with 58834 stored elements in Compressed Sparse Row format>

In [25]:
# Compute the Cosine Similarity matrix based on the count_matrix
cosine_sim = cosine_similarity(count_matrix)
cosine_sim[:5]

array([[1.        , 0.21428571, 0.14285714, ..., 0.        , 0.        ,
        0.        ],
       [0.21428571, 1.        , 0.14285714, ..., 0.        , 0.        ,
        0.        ],
       [0.14285714, 0.14285714, 1.        , ..., 0.        , 0.        ,
        0.        ],
       [0.07142857, 0.07142857, 0.14285714, ..., 0.07412493, 0.        ,
        0.        ],
       [0.15430335, 0.23145502, 0.23145502, ..., 0.        , 0.        ,
        0.        ]])
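
For intuition about these numbers, the sketch below computes cosine similarity on three made-up soups and checks one entry by hand as the dot product of two count vectors divided by the product of their lengths. The soups are hypothetical and the vectorizer uses default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy soups (hypothetical, not rows from the dataset), five distinct tokens each
soups = [
    "directora actorone actortwo action adventure",
    "directorb actorthree actorfour action adventure",
    "directorc actorfive actorsix comedy romance",
]

counts = CountVectorizer().fit_transform(soups)
sim = cosine_similarity(counts)

# Manual check for the first pair: dot product over the product of the norms
dense = counts.toarray()
a, b = dense[0], dense[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(float(sim[0, 1]), 3))  # 0.4  (2 shared tokens / 5 tokens per soup)
print(round(float(manual), 3))     # 0.4
print(float(sim[0, 2]))            # 0.0  (no shared tokens)
```

By the same arithmetic, the 0.21428571 between the first two movies in the output above is what two 14-token soups sharing exactly three tokens would give (3/14). That reading is an inference from the printed values, not something the notebook computes explicitly.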

In [21]:
# Reset the index of our DataFrame and build a reverse mapping from title to row index
cbf_df = cbf_df.reset_index()
indices = pd.Series(cbf_df.index, index=cbf_df['title'])
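
One caveat with a title-based mapping: if two rows share a title, looking that title up returns a Series of positions instead of a single integer. The sketch below shows this on made-up titles and one possible workaround (keeping the first row per title). Whether and how often this happens in the TMDB data is not checked here.

```python
import pandas as pd

# Hypothetical titles, one of them repeated
titles = pd.Series(["Alpha", "Beta", "Alpha"])
indices = pd.Series(titles.index, index=titles)

print(type(indices["Beta"]).__name__)   # int64 (a single position)
print(type(indices["Alpha"]).__name__)  # Series (two positions)

# One possible workaround: keep only the first row for each title
unique_indices = indices[~indices.index.duplicated(keep="first")]
print(int(unique_indices["Alpha"]))     # 0
```

Looking movies up by `id` instead of title would avoid the ambiguity altogether, at the cost of a less convenient interface.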

We can now define a **get_recommendations()** function that takes a movie title and the **cosine_sim** matrix and returns the ten most similar movies.

In [22]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies (skip the first entry, which is normally the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return cbf_df['title'].iloc[movie_indices]

In [23]:
get_recommendations('Star Wars', cosine_sim)

1990                              The Empire Strikes Back
1490                                   Return of the Jedi
229          Star Wars: Episode III - Revenge of the Sith
230          Star Wars: Episode II - Attack of the Clones
233             Star Wars: Episode I - The Phantom Menace
53      Indiana Jones and the Kingdom of the Crystal S...
1006                   Indiana Jones and the Last Crusade
1697                 Indiana Jones and the Temple of Doom
2085                              Raiders of the Lost Ark
4401                                  The Helix... Loaded
Name: title, dtype: object

In [24]:
get_recommendations('The Avengers', cosine_sim)

7                  Avengers: Age of Ultron
26              Captain America: Civil War
79                              Iron Man 2
169     Captain America: The First Avenger
85     Captain America: The Winter Soldier
174                    The Incredible Hulk
31                              Iron Man 3
68                                Iron Man
126                   Thor: The Dark World
129                                   Thor
Name: title, dtype: object
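
Since the stated goal is to recommend from a user's top-rated movies rather than a single title, one simple extension is to average the similarity rows of several favorites. The following is only a sketch under that assumption: `recommend_from_favorites` is a hypothetical helper, the titles and similarity matrix are toy values, and fetching the favorites from Letterboxd is not shown.

```python
import numpy as np

def recommend_from_favorites(favorites, titles, sim, top_n=3):
    """Rank titles by their mean similarity to a list of favorite titles."""
    lookup = {title: i for i, title in enumerate(titles)}
    # Silently ignore favorites that are not in the catalogue
    fav_idx = [lookup[t] for t in favorites if t in lookup]
    if not fav_idx:
        return []
    # Average the similarity rows of the favorites (this creates a new array)
    scores = sim[fav_idx].mean(axis=0)
    # Exclude the favorites themselves from the ranking
    scores[fav_idx] = -np.inf
    ranked = [i for i in np.argsort(-scores) if np.isfinite(scores[i])]
    return [titles[i] for i in ranked[:top_n]]

# Toy catalogue and similarity matrix (hypothetical values)
titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.6],
    [0.0, 0.2, 0.6, 1.0],
])

print(recommend_from_favorites(["A", "B"], titles, sim, top_n=2))  # ['C', 'D']
```

Averaging treats every favorite equally. Weighting each row by the user's rating would be a natural variation, but it is not something this notebook implements.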