# Movie Recommender using Content Filtering
In this project I build a movie recommender that uses content filtering: movies are recommended based on how similar their content is (cast, crew, description, keywords, etc.). I use the TMDB movie dataset, which contains metadata for roughly 5000 movies.

In [1]:
import numpy as np
import pandas as pd
from ast import literal_eval

In [2]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

In [3]:
movies = movies[["id", "keywords", "title", "overview"]]
movies.head()

Unnamed: 0,id,keywords,title,overview
0,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...
3,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca..."


In [4]:
# title is the only column the two frames share, so it is the join key.
movies = pd.merge(movies, credits, on="title")
movies.drop(["movie_id", "id"], inplace=True, axis=1)
print(movies.shape)
movies.head()

(4809, 5)


Unnamed: 0,keywords,title,overview,cast,crew
0,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,A cryptic message from Bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"John Carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Merging the two .csv files that the TMDB dataset comes with gives a suitable dataset to work with. The values for cast, crew and keywords are stored as stringified lists of dictionaries, so I'll have to parse them into Python objects in order to work with them.
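
As a minimal sketch of that parsing step, the cell below applies literal_eval to one hand-written value shaped like the keywords column. The string and its ids are made up for illustration and are not taken from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value in the same shape as the keywords column (ids are made up).
raw = '[{"id": 1, "name": "culture clash"}, {"id": 2, "name": "future"}]'

parsed = literal_eval(raw)                    # str -> list of dicts
names = [entry["name"] for entry in parsed]   # keep only the readable names

print(type(parsed).__name__, names)
```

Once parsed, each entry is an ordinary dictionary, so fields such as name can be read by key.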

In [5]:
dupes = movies.title.duplicated()
movies.title[dupes]

973            The Host
974            The Host
975            The Host
1363             Batman
1364             Batman
1365             Batman
3653    Out of the Blue
3654    Out of the Blue
3655    Out of the Blue
Name: title, dtype: object

In [6]:
movies = movies.drop(index=[973, 974, 975, 1363, 1364, 1365, 3653, 3654, 3655])
movies = movies.reset_index()
movies.shape

(4800, 6)

To avoid confusion later, I'm removing duplicate movie titles. duplicated() flags every repeat of a title after its first occurrence; here it found 9 such rows, which I drop so that each title appears only once.
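
The drop above relies on hard-coded row labels. A possible alternative, sketched here on a small made-up frame rather than the real data, is to let pandas find the repeats with drop_duplicates. Passing drop=True to reset_index also avoids the extra index column.

```python
import pandas as pd

# Toy frame (hypothetical values) with one repeated title.
toy = pd.DataFrame({
    "title": ["The Host", "The Host", "Batman", "Avatar"],
    "overview": ["a", "b", "c", "d"],
})

# Keep the first row for each title, then renumber the rows from 0.
deduped = toy.drop_duplicates(subset="title", keep="first").reset_index(drop=True)

print(deduped)
```

This keeps the same rows as dropping everything that duplicated() flags, without having to copy index labels by hand.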

In [7]:
movies["overview"] = movies.overview.apply(lambda x: str(x).lower())
movies.head()

Unnamed: 0,index,keywords,title,overview,cast,crew
0,0,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",Avatar,"in the 22nd century, a paraplegic marine is di...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,1,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,2,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",Spectre,a cryptic message from bond’s past sends him o...,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,3,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",The Dark Knight Rises,following the death of district attorney harve...,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,4,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",John Carter,"john carter is a war-weary, former military ca...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In the end, I want all words in the dataset to be lowercase. For now I've just converted the overview column to lowercase since it's just a string that can easily be changed.
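
One detail worth checking is how missing values behave. The sketch below uses a made-up two-row Series (whether the real overview column contains missing values is an assumption) to compare the str(x).lower() approach with filling missing entries before lowercasing.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): one real overview and one missing value.
toy = pd.Series(["A Cryptic Message", np.nan])

via_str = toy.apply(lambda x: str(x).lower())   # missing value becomes the text "nan"
via_fillna = toy.fillna("").str.lower()         # missing value becomes an empty string

print(via_str.tolist())
print(via_fillna.tolist())
```

With the first approach, a missing overview contributes the word "nan" to the text; with the second it contributes nothing.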

I also want to extract the most useful information out of the keywords, cast and crew columns. I shall apply a function to each of these columns to get the specific data I need.

In [8]:
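def extract_top_actors(cast):
    # Parse the stringified list of cast dictionaries.
    cast_list = literal_eval(cast)
    # The leading space separates the overview from the first actor
    # when the columns are concatenated into word_soup later.
    top_actors = " "
    n = min(3, len(cast_list))
    for i in range(n):
        # Strip inner spaces so each full name becomes a single token.
        actor = cast_list[i]["name"].replace(" ", "").lower()
        top_actors += actor + " "
    return top_actors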

This function retrieves up to the first 3 actors listed in the cast, since the actors listed first usually play the biggest roles in the movie.

In [9]:
def extract_top_keywords(keywords):
    keywords_list = literal_eval(keywords)
    top_keywords = ""
    n = 3 if len(keywords_list) >= 3 else len(keywords_list)
    for i in range(n):
        keyword = keywords_list[i]["name"].replace(" ", "").lower()
        top_keywords += keyword + " "
    return top_keywords

Similarly, this function retrieves up to the first 3 keywords found in the keywords column.
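
The two extraction functions share the same logic, so they could be folded into one helper. The sketch below is one way to do that; top_names is a hypothetical name, and the input string is hand-written. Removing the inner spaces matters because the vectoriser splits text on whitespace: "samworthington" stays a single token instead of matching every other "sam".

```python
from ast import literal_eval

def top_names(raw, n=3):
    """Return up to n 'name' values from a stringified list of dicts,
    lowercased, with inner spaces removed, joined by single spaces."""
    entries = literal_eval(raw)
    return " ".join(e["name"].replace(" ", "").lower() for e in entries[:n])

# Hand-written example value in the shape of the cast column.
raw = '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}]'
print(top_names(raw))
```

Unlike the functions above, this helper adds no leading or trailing space, so any separators would need to be added explicitly when the columns are concatenated.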

In [10]:
def extract_director(crew):
    crew_list = literal_eval(crew)
    # Return the first crew member credited as director, if any.
    for member in crew_list:
        if member["job"] == "Director":
            return member["name"].replace(" ", "").lower()
    return ""

Finally, this last function retrieves the director of each movie. The director has a large influence on a movie's style and quality, so if a user likes a movie by a certain director, they may also like other movies by that director. If no director is listed, the function returns an empty string.

In [11]:
movies["cast"] = movies["cast"].apply(extract_top_actors)
movies["keywords"] = movies["keywords"].apply(extract_top_keywords)
movies["crew"] = movies["crew"].apply(extract_director)
movies.head()

Unnamed: 0,index,keywords,title,overview,cast,crew
0,0,cultureclash future spacewar,Avatar,"in the 22nd century, a paraplegic marine is di...",samworthington zoesaldana sigourneyweaver,jamescameron
1,1,ocean drugabuse exoticisland,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha...",johnnydepp orlandobloom keiraknightley,goreverbinski
2,2,spy basedonnovel secretagent,Spectre,a cryptic message from bond’s past sends him o...,danielcraig christophwaltz léaseydoux,sammendes
3,3,dccomics crimefighter terrorist,The Dark Knight Rises,following the death of district attorney harve...,christianbale michaelcaine garyoldman,christophernolan
4,4,basedonnovel mars medallion,John Carter,"john carter is a war-weary, former military ca...",taylorkitsch lynncollins samanthamorton,andrewstanton


In [12]:
movies["word_soup"] = movies["overview"] + movies["cast"] + movies["crew"] + " " + movies["keywords"]
movies["word_soup"][10]

'superman returns to discover his 5-year absence has allowed lex luthor to walk free, and that those he was closest too felt abandoned and have moved on. luthor plots his ultimate revenge that could see millions killed and change the face of the planet forever, as well as ridding himself of the man of steel. brandonrouth kevinspacey katebosworth bryansinger savingtheworld dccomics invulnerability '

After applying the three functions above to the keywords, cast and crew columns, I concatenate them with the overview into one big string per movie, the word soup. I will then use the CountVectorizer class from scikit-learn to transform the text into a bag-of-words model.
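
To make the bag-of-words idea concrete, here is a small sketch on two made-up word soups. Each row of the resulting matrix is a document, each column is a word from the vocabulary, and each cell counts how often that word occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny, hypothetical word soups.
docs = [
    "spy secretagent danielcraig",
    "spy thriller danielcraig danielcraig",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)   # sparse matrix: one row per document

# Vocabulary words ordered by their column position in the matrix.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(vocab)
print(counts.toarray())
```

Word order is discarded; only the counts remain, which is why the second document records danielcraig as 2.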

In [13]:
movies = movies[["title", "word_soup"]]
movies.head()

Unnamed: 0,title,word_soup
0,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,Spectre,a cryptic message from bond’s past sends him o...
3,The Dark Knight Rises,following the death of district attorney harve...
4,John Carter,"john carter is a war-weary, former military ca..."


In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = CountVectorizer(stop_words="english")
cv_matrix = cv.fit_transform(movies["word_soup"])
similarity_matrix = cosine_similarity(cv_matrix, cv_matrix)

After transforming the word_soup column with CountVectorizer, I compute a cosine similarity matrix, which holds the cosine similarity between every pair of movies. This will be used later to choose which movies to recommend.
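
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. A quick worked example with two made-up count vectors:

```python
import numpy as np

# Two hypothetical word-count vectors over the same four-word vocabulary.
a = np.array([1, 1, 1, 0])
b = np.array([2, 0, 1, 1])

# dot(a, b) = 3, |a| = sqrt(3), |b| = sqrt(6), so cosine = 3 / sqrt(18).
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(float(cosine), 3))
```

For count vectors, which have no negative entries, the value lies between 0 (no words in common) and 1 (same direction).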

In [15]:
title_to_index = pd.Series(movies.index, index=movies['title'])
title_to_index[:10]

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
Spider-Man 3                                5
Tangled                                     6
Avengers: Age of Ultron                     7
Harry Potter and the Half-Blood Prince      8
Batman v Superman: Dawn of Justice          9
dtype: int64

The recommender function takes a movie title as an argument and returns a list of recommended movies based on it. However, to look up a movie's similarity vector, I need the movie's index rather than its title. This Series maps each movie title to its index.
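
The lookup pattern can be tried in isolation on a toy frame (the three titles below are just sample values):

```python
import pandas as pd

# Toy frame standing in for the movies DataFrame.
toy = pd.DataFrame({"title": ["Avatar", "Spectre", "Tangled"]})

# Values are row positions; the Series index holds the titles.
title_to_index = pd.Series(toy.index, index=toy["title"])

print(title_to_index["Spectre"])            # position of a known title
print("Skyfall" in title_to_index.index)    # membership test for an unknown title
```

This mapping only returns a single number when titles are unique; with repeated titles the same lookup would return several rows, which is one reason to remove duplicates first.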

In [16]:
def recommend_movies(movie_title, num_recommendations=10):
    if movie_title not in title_to_index.index:
        print("Movie title not found. Please try another.")
        return None
    index = title_to_index[movie_title]
    # Pair every movie's position with its similarity to the chosen movie.
    similarity_scores = list(enumerate(similarity_matrix[index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is expected to be the chosen movie itself.
    similarity_scores = similarity_scores[1:num_recommendations + 1]
    movie_title_indices = [i[0] for i in similarity_scores]
    return movies.loc[movie_title_indices, "title"]

Finally, I have my recommender function. It takes a movie title as an argument and finds that movie's index. Using the index, it looks up the movie's similarity vector to all other movies, sorts it in descending order, skips the first entry (the movie itself), and returns the top of the sorted list. It recommends the 10 most similar movies by default, but the user can also specify how many recommendations they want.
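
The same ranking step can be sketched with NumPy's argsort on a small made-up similarity matrix. The names top_similar, titles and sim are hypothetical, and this is a variation on the function above rather than a replacement for it.

```python
import numpy as np

titles = ["A", "B", "C", "D"]

# Hypothetical symmetric similarity matrix with 1.0 on the diagonal.
sim = np.array([
    [1.0, 0.2, 0.7, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def top_similar(index, k):
    # Positions sorted by descending similarity to the query row.
    order = np.argsort(-sim[index], kind="stable")
    # Remove the query itself by position rather than assuming it comes first.
    order = order[order != index]
    return [titles[i] for i in order[:k]]

print(top_similar(0, 2))
```

Filtering out the query by its index, instead of slicing off the first result, stays correct even if another movie ties with it at the maximum similarity.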

Here are some examples:

In [17]:
recommend_movies("Avengers: Age of Ultron", 5)

16                  The Avengers
79                    Iron Man 2
26    Captain America: Civil War
68                      Iron Man
41                 Green Lantern
Name: title, dtype: object

In [18]:
recommend_movies("Spectre", 20)

1343                          Never Say Never Again
11                                Quantum of Solace
29                                          Skyfall
1717                                     Safe Haven
4069                          From Russia with Love
3161                                    Thunderball
3335                           Diamonds Are Forever
3284                                       Restless
4336                                         Dr. No
2748                                Harriet the Spy
3350                    The Man with the Golden Gun
277                                   Casino Royale
698                             The Prince of Egypt
1131                                Licence to Kill
2326            Diary of a Wimpy Kid: Rodrick Rules
1743                                      Octopussy
147                                 Die Another Day
1200                           The Living Daylights
418     Cats & Dogs 2 : The Revenge of Kitty Galore
1769        

In [19]:
recommend_movies("Tangled", 10)

3130                    Of Horses and Men
3989                   The Black Stallion
4677                          Malevolence
2309                       Out of Inferno
950                        The Negotiator
39                           TRON: Legacy
4656                           Horse Camp
1511    Dreamer: Inspired By a True Story
498                               Hidalgo
403                      Last Action Hero
Name: title, dtype: object