## Content-based recommenders
#### Credits, Genres, and Keywords Based Recommender
The quality of our recommender should improve if we use better metadata that captures more of the finer details. That is precisely what we are going to do in this notebook. We will build a recommender system based on the following metadata: the top 4 actors, the director, the related genres, and the movie plot keywords.

In [1]:
# importing packages
import numpy as np
import pandas as pd

In [2]:
# reading input files
#https://www.kaggle.com/tmdb/tmdb-movie-metadata
df_credits = pd.read_csv("tmdb_5000_credits.csv")
df_movies = pd.read_csv("tmdb_5000_movies.csv")
df_credits.shape, df_movies.shape

((4803, 4), (4803, 20))

In [3]:
df_movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
df_credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [5]:
df_credits.rename(columns = {"movie_id": "id"}, inplace = True)
df_movies_merge = df_movies.merge(df_credits[['id', 'cast', 'crew']], on = 'id')
df_movies_merge.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [6]:
df_movies_merge.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'cast', 'crew'],
      dtype='object')

In [7]:
df_movies_merge.drop(columns = ['homepage', 'status', 'production_countries', 'title'], inplace = True)
df_movies_merge.head(2)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


From our new features, cast, crew, and keywords, we need to extract the four top-billed actors, the director, and the keywords associated with each movie.

But first things first: our data is stored as "stringified" lists, that is, Python-style literals saved as plain strings. We need to convert them into real Python objects before we can work with them.

In [8]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['genres', 'cast', 'crew', 'keywords']
for feature in features:
    df_movies_merge[feature] = df_movies_merge[feature].apply(literal_eval)
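
As a small illustration of what `literal_eval` does here, the sketch below parses one hand-written string shaped like a cell of the `genres` column. The sample value is made up for the example rather than read from the dataset.

```python
from ast import literal_eval

# Hypothetical sample shaped like one cell of the stringified `genres` column
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval only evaluates Python literals (lists, dicts, strings, numbers, ...)
parsed = literal_eval(sample)
genre_names = [entry["name"] for entry in parsed]

print(type(sample).__name__, "->", type(parsed).__name__)
print(genre_names)
```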

In [9]:
type(df_movies_merge['crew'])

pandas.core.series.Series

In [10]:
df_movies_merge['genres'].head(2)

0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
1    [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
Name: genres, dtype: object

In [11]:
for i in df_movies_merge['genres'].head(1):
    print(i)

[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 878, 'name': 'Science Fiction'}]


Next, we write functions that will help us to extract the required information from each feature.

NumPy is already imported, which gives us access to its `NaN` constant. The `get_director()` function returns the director's name from the `crew` feature, or `NaN` if no director is listed.

Next, the `get_list()` function returns the first 4 elements of a list, or the entire list if it has fewer than 4. Here the list refers to the `cast`, `keywords`, and `genres` features.

In [12]:
def get_director(crew):
    for i in crew:
        if i["job"] == "Director":
            return i["name"]
    return np.nan

def get_list(col):
    if isinstance(col, list):
        names = [i['name'] for i in col]
        if len(names) > 4:
            names = names[:4]
        return names
    return []

df_movies_merge['director'] = df_movies_merge['crew'].apply(get_director)
df_movies_merge['director'][1]

features = ['cast', 'keywords', 'genres']
for feature in features:
    df_movies_merge[feature] = df_movies_merge[feature].apply(get_list)
    print(df_movies_merge[feature][1])

['Johnny Depp', 'Orlando Bloom', 'Keira Knightley', 'Stellan Skarsgård']
['ocean', 'drug abuse', 'exotic island', 'east india trading company']
['Adventure', 'Fantasy', 'Action']


In [13]:
# Print the new features of the first 3 films
df_movies_merge[['original_title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,original_title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weave...",James Cameron,"[culture clash, future, space war, space colony]","[Action, Adventure, Fantasy, Science Fiction]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Gore Verbinski,"[ocean, drug abuse, exotic island, east india ...","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Sam Mendes,"[spy, based on novel, secret agent, sequel]","[Action, Adventure, Crime]"


The next step would be to convert the names and keyword instances into lowercase and strip all the spaces between them.

Removing the spaces between words is an important preprocessing step. It is done so that our vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same. After this processing step, the aforementioned actors will be represented as "johnnydepp" and "johnnygalecki" and will be distinct to our vectorizer.

Keywords behave the same way: "bread jam" and "traffic jam" would both contribute the token "jam", so two unrelated keywords would look partly alike to the vectorizer. Hence, it is better to remove the spaces within each name or keyword.

The function below does exactly that:

In [14]:
def clean_data(col):
    if isinstance(col, list):
        return [str.lower(i.replace(" ", "")) for i in col]
    else:
        if isinstance(col, str):
            return str.lower(col.replace(" ", ""))
        else:
            return ''

In [15]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df_movies_merge[feature] = df_movies_merge[feature].apply(clean_data)
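
To make the effect of removing spaces concrete, here is a minimal sketch. It uses plain whitespace splitting as a simplified stand-in for the vectorizer's tokenizer (an assumption; `CountVectorizer` applies its own tokenization rules), and the helper `tokens` is hypothetical.

```python
def tokens(names):
    """Split each name on whitespace and collect the distinct lowercase tokens."""
    return {tok for name in names for tok in name.lower().split()}

depp, galecki = "Johnny Depp", "Johnny Galecki"

# With spaces kept, both actors contribute the same token "johnny"
shared_raw = tokens([depp]) & tokens([galecki])

# With spaces removed, each actor becomes a single distinct token
shared_joined = tokens([depp.replace(" ", "")]) & tokens([galecki.replace(" ", "")])

print(shared_raw)     # {'johnny'}
print(shared_joined)  # set()
```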

We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer (namely the actors, director, keywords, and genres).

The `create_soup` function simply joins all the required columns with spaces. This is the final preprocessing step, and the output of this function will be fed into the vectorizer.

In [16]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new soup feature
df_movies_merge['soup'] = df_movies_merge.apply(create_soup, axis=1)
df_movies_merge[['soup']].head(2)

Unnamed: 0,soup
0,cultureclash future spacewar spacecolony samwo...
1,ocean drugabuse exoticisland eastindiatradingc...


The next steps are the same as what we did with our `plot description based recommender`. One key difference is that we use `CountVectorizer()` instead of `TF-IDF`. This is because we do not want to down-weight an actor or director just because he or she has acted in or directed relatively many movies. It doesn't make much intuitive sense to down-weight them in this context.

The major difference between `CountVectorizer()` and `TF-IDF` is the inverse document frequency (IDF) component, which is present in the latter and not in the former.
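
The effect of the IDF term can be sketched with a few lines of arithmetic. The example assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which we take to be scikit-learn's default for its TF-IDF vectorizer; the document frequencies are made up for illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_movies = 4803          # number of movies in our DataFrame
prolific_director = 25   # hypothetical: this name appears in 25 soups
one_time_director = 1    # hypothetical: this name appears in 1 soup

idf_prolific = smoothed_idf(n_movies, prolific_director)
idf_one_time = smoothed_idf(n_movies, one_time_director)

# TF-IDF would scale the prolific director's count by the smaller factor,
# whereas a plain count gives every occurrence the same weight of 1.
print(round(idf_prolific, 2), round(idf_one_time, 2))
```

Under this assumption, the more films a director has made, the less his or her name contributes to the similarity score, which is the behaviour we want to avoid here.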

In [17]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words = 'english')
countMatrix = count.fit_transform(df_movies_merge['soup'])
countMatrix.shape

(4803, 14242)

In [20]:
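# toarray() and get_feature_names_out() replace the older .A and get_feature_names(),
# which are deprecated or removed in recent SciPy / scikit-learn releases
df_countMatrix = pd.DataFrame(countMatrix.toarray(), columns = count.get_feature_names_out())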
df_countMatrix.head()

Unnamed: 0,17thcentury,18thcentury,1910s,1950s,1960s,1970s,1990s,1995,19thcentury,3d,...,ólafurdarriólafsson,óscarjaenada,ølgaard,đỗthịhảiyến,špelacolja,юлияснигирь,پیمانمعادی,卧底肥妈,绝地奶霸,超级妈妈
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


From the shape of the count matrix above, we can see that the vocabulary built from our metadata soup contains 14,242 distinct terms.

Next, we will use `cosine_similarity` to measure how similar the count vectors of any two movies are.

We can then call the `get_recommendations()` function, passing in the new `cosineSimilarity` matrix as our second argument.
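
For reference, the cosine similarity of two count vectors is their dot product divided by the product of their Euclidean norms. The sketch below computes it by hand for toy vectors; the function name `cosine` and the vectors are illustrative only, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy count vectors over a 4-term vocabulary
movie_a = [1, 1, 0, 1]
movie_b = [1, 1, 1, 0]   # shares two terms with movie_a
movie_c = [0, 0, 1, 0]   # shares no terms with movie_a

print(cosine(movie_a, movie_b))
print(cosine(movie_a, movie_c))
```

Movies that share more soup terms score closer to 1, and movies with no terms in common score 0.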

In [18]:
# Compute the Cosine Similarity matrix based on the countMatrix
from sklearn.metrics.pairwise import cosine_similarity

cosineSimilarity = cosine_similarity(countMatrix, countMatrix)

# Reset index of our main DataFrame and construct reverse mapping
df_movies_merge = df_movies_merge.reset_index()
indices = pd.Series(df_movies_merge.index, index = df_movies_merge['original_title'])
indices[:10]

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
Spider-Man 3                                5
Tangled                                     6
Avengers: Age of Ultron                     7
Harry Potter and the Half-Blood Prince      8
Batman v Superman: Dawn of Justice          9
dtype: int64

In [19]:
def get_recommendations(movie, cosineSim):
    # get index of the movie
    idx = indices[movie]
    
    # fetch scores of similar movies
    simScores = sorted(list(enumerate(cosineSim[idx])), key = lambda x: x[1], reverse=True)
    
    # fetch top 10 similar movies based on scores
    simScores = simScores[1:11]
    
    # fetch movie titles
    movieIndices = [i[0] for i in simScores]
    return df_movies_merge['original_title'].iloc[movieIndices]

get_recommendations('Avatar', cosineSimilarity)

466                        The Time Machine
47                  Star Trek Into Darkness
94                  Guardians of the Galaxy
206                     Clash of the Titans
4401                    The Helix... Loaded
10                         Superman Returns
14                             Man of Steel
46               X-Men: Days of Future Past
61                        Jupiter Ascending
85      Captain America: The Winter Soldier
Name: original_title, dtype: object

In [20]:
get_recommendations('The Dark Knight', cosineSimilarity)

3          The Dark Knight Rises
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3332                 Harry Brown
4099                 Harsh Times
2398                      Hitman
3359                 In Too Deep
1503                      Takers
1986                      Faster
Name: original_title, dtype: object

In [21]:
get_recommendations('Pirates of the Caribbean: At World\'s End', cosineSimilarity)

12             Pirates of the Caribbean: Dead Man's Chest
199     Pirates of the Caribbean: The Curse of the Bla...
13                                        The Lone Ranger
17            Pirates of the Caribbean: On Stranger Tides
32                                    Alice in Wonderland
262     The Lord of the Rings: The Fellowship of the Ring
71                  The Mummy: Tomb of the Dragon Emperor
1932                                               Sheena
786                                          西游记之孙悟空三打白骨精
75                                             Waterworld
Name: original_title, dtype: object

Great! Our recommender now captures more information thanks to the richer metadata, and its recommendations look more relevant as a result. There are, of course, numerous ways of experimenting with this system to improve recommendations.

Some suggestions:

- Other crew members: other crew member names, such as screenwriters and producers, could also be included.
- Increasing the weight of the director: to give more weight to the director, he or she could be mentioned multiple times in the soup to increase the similarity scores of movies with the same director.
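
The director-weighting idea could look something like the following sketch. The function `create_weighted_soup` and its `director_weight` parameter are hypothetical and not part of the notebook above, and the sample row is a plain dict standing in for a DataFrame row.

```python
def create_weighted_soup(row, director_weight=3):
    """Build the metadata soup, repeating the director token `director_weight` times."""
    parts = list(row["keywords"]) + list(row["cast"])
    if row["director"]:  # skip the empty string used for a missing director
        parts += [row["director"]] * director_weight
    parts += list(row["genres"])
    return " ".join(parts)

# Hypothetical row, already lowercased and stripped of spaces
row = {
    "keywords": ["cultureclash", "future"],
    "cast": ["samworthington", "zoesaldana"],
    "director": "jamescameron",
    "genres": ["action", "adventure"],
}

soup = create_weighted_soup(row)
print(soup)
```

Because `CountVectorizer` counts repeated tokens, two movies by the same director would then share a larger count on that term and so a higher cosine similarity. The weight of 3 is an arbitrary starting point that would need tuning.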