## Content-based recommenders
Content-based recommenders suggest items that are similar to a given item. They use item metadata (for movies: genre, director, description, actors, and so on) to make these recommendations. The underlying idea is that a person who likes a particular item will also like items that are similar to it, so the system draws on the metadata of items the user has engaged with in the past. YouTube is a good example: based on our viewing history, it suggests new videos that we might want to watch.

In this notebook, we will learn how to build a system that recommends movies that are similar to a particular movie. To achieve this, we will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

In [1]:
# importing packages
import numpy as np
import pandas as pd

In [2]:
# reading input files
#https://www.kaggle.com/tmdb/tmdb-movie-metadata
df_credits = pd.read_csv("tmdb_5000_credits.csv")
df_movies = pd.read_csv("tmdb_5000_movies.csv")
df_credits.shape, df_movies.shape

((4803, 4), (4803, 20))

In [3]:
df_movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
df_credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [5]:
# initial cleaning of data
df_credits.rename(columns = {"movie_id": "id"}, inplace = True)
df_movies_merge = df_movies.merge(df_credits[['id', 'cast', 'crew']], on = 'id')
df_movies_merge.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [6]:
df_movies_merge.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'cast', 'crew'],
      dtype='object')

In [7]:
df_movies_merge.drop(columns = ['homepage', 'status', 'production_countries', 'title'], inplace = True)
df_movies_merge.head(1)

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


The plot description is available to us as the overview feature in our metadata dataset. Let's inspect the plots of a few movies:

In [8]:
df_movies_merge['overview'].head()

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

The problem at hand is a Natural Language Processing problem. The similarity between two overviews cannot be computed from their raw text, so we first need to extract numerical features from it. To do this, we need to compute the word vectors of each overview, or document, as it will be called from now on.

As the name suggests, word vectors are vectorized representations of the words in a document. Dense word embeddings carry semantic meaning (for example, man and king tend to have vector representations close to each other), whereas the TF-IDF vectors used here are sparse and count-based: two documents score as similar only when they share words.

We will compute `Term Frequency-Inverse Document Frequency (TF-IDF)` vectors for each document. This will give us a matrix where each row represents a movie and each column represents a word in the overview vocabulary (the words that remain after stop-word removal and the minimum document frequency filter).

In its essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.
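To make the down-weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy documents. The helper name `tfidf_vectors` and the toy documents are made up for illustration, and the smoothed IDF formula is an assumption meant to mirror scikit-learn's default settings, not a copy of its implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for tokenised documents."""
    n_docs = len(docs)
    vocab = sorted({word for doc in docs for word in doc})
    # Document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF (assumed to correspond to smooth_idf=True in scikit-learn)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        # Scale each row to unit length; leave all-zero rows untouched
        rows.append([v / norm for v in row] if norm else row)
    return vocab, rows

docs = [
    "hero saves city".split(),
    "hero saves world".split(),
    "detective saves city".split(),
]
vocab, rows = tfidf_vectors(docs)
first = dict(zip(vocab, rows[0]))
print(first)
```

In the first document, "saves" appears in every document and therefore receives a lower weight than "hero", which appears in only two of the three.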

Fortunately, scikit-learn gives us a built-in `TfidfVectorizer` class that produces the TF-IDF matrix in a couple of lines.

- Import `TfidfVectorizer` from scikit-learn;
- Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic;
- Replace not-a-number values with a blank string;
- Finally, construct the TF-IDF matrix on the data.

In [9]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object.
# min_df -> Ignore terms that have a document frequency strictly lower than the given threshold
# stop_words -> Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(min_df = 3,  max_features = None, analyzer='word', stop_words='english')

# Replace NaN with an empty string
df_movies_merge['overview'] = df_movies_merge['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidfMatrix = tfidf.fit_transform(df_movies_merge['overview'])

# Output the shape of tfidfMatrix
tfidfMatrix.shape

(4803, 7559)

In [10]:
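# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
tfidf.get_feature_names()[200:250]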

['acts',
 'actual',
 'actually',
 'ad',
 'adam',
 'adams',
 'adapt',
 'adaptation',
 'adapted',
 'adapts',
 'add',
 'addict',
 'addicted',
 'addiction',
 'addition',
 'addled',
 'adds',
 'adjust',
 'administration',
 'admiral',
 'admired',
 'admit',
 'admits',
 'adolescent',
 'adolf',
 'adopt',
 'adopted',
 'adoption',
 'adopts',
 'adorable',
 'adrenaline',
 'adrian',
 'adrift',
 'adult',
 'adulthood',
 'adults',
 'advanced',
 'advantage',
 'adventure',
 'adventures',
 'adventurous',
 'advertisement',
 'advertising',
 'advice',
 'adviser',
 'afar',
 'affable',
 'affair',
 'affairs',
 'affect']

In [11]:
tfidfMatrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

From the shape of the matrix, we observe that the vocabulary contains 7559 distinct words across the 4803 movies in our dataset.

With this matrix in hand, we can now compute a similarity score. There are several similarity metrics that we can use for this, such as the Manhattan, Euclidean, Pearson, and cosine similarity scores. There is no single right answer as to which score is best. Different scores work well in different scenarios, and it is often a good idea to experiment with different metrics and observe the results.

We will be using the `cosine similarity` to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate (especially when used in conjunction with TF-IDF scores, which will be explained later). Mathematically, it is defined as follows:

<img src="cosine_similarity.jpg" alt="Cosine similarity: cos(x, y) = (x · y) / (||x|| ||y||)" title="Cosine similarity" />

Since `TfidfVectorizer` scales every document vector to unit (L2) length by default, the dot product between two vectors directly gives their cosine similarity score. Therefore, we will use sklearn's `linear_kernel()` instead of `cosine_similarity()` since it is faster.
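The following sketch checks this relationship on two toy vectors using only the standard library. The helper names `dot`, `cosine`, and `l2_normalise` are illustrative and are not part of scikit-learn.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def l2_normalise(x):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]

cos_xy = cosine(x, y)
dot_unit = dot(l2_normalise(x), l2_normalise(y))
print(cos_xy, dot_unit)

# Scaling a vector does not change the cosine similarity
print(cosine(x, [2.0, 4.0, 0.0]))
```

Once the vectors have unit length, the division by the norms is redundant, which is why a plain dot product is enough.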

This returns a matrix of shape 4803 x 4803 that holds the cosine similarity score of each movie `overview` with every other movie `overview`. Hence, each movie corresponds to a 1 x 4803 row vector in which each column is the similarity score with one movie.

The function `linear_kernel` computes the linear kernel, that is, a special case of polynomial_kernel with degree=1 and coef0=0 (homogeneous). If x and y are column vectors, their linear kernel is the dot product, where x^T denotes the transpose of x:
```k(x, y) = x^T y```
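Applied to every pair of rows at once, the linear kernel is a single matrix product. The NumPy sketch below uses a made-up 3 x 4 matrix in place of the TF-IDF matrix; it is an illustration of the idea, not scikit-learn's implementation.

```python
import numpy as np

# Three toy "documents" as rows, four toy "terms" as columns
X = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
])

# Scale every row to unit length, as TF-IDF rows are by default
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Pairwise dot products of all rows: one similarity score per pair of documents
sim = X_unit @ X_unit.T
print(sim.shape)
print(sim)
```

The result is square and symmetric, its diagonal holds each document's similarity with itself, and documents that share no terms (rows 0 and 2 here) score zero.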

In [12]:
# TfidfVectorizer L2-normalizes each row by default, so the dot product between
# two rows directly gives their cosine similarity score. Therefore, we use sklearn's
# linear_kernel() instead of cosine_similarity() since it is faster.

# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosineSim = linear_kernel(tfidfMatrix, tfidfMatrix)
cosineSim.shape

(4803, 4803)

In [13]:
cosineSim[0]

array([1., 0., 0., ..., 0., 0., 0.])

We're going to define a function that takes a movie title as input and outputs a list of the 10 most similar movies. For this, we first need a reverse mapping from movie titles to DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our metadata DataFrame, given its title.

In [14]:
# Reverse mapping of indices and movie titles
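indices = pd.Series(df_movies_merge.index, index = df_movies_merge['original_title'])
# Keep the first row for any repeated title; Series.drop_duplicates() would compare
# the values (row positions, which are already unique) rather than the titles
indices = indices[~indices.index.duplicated(keep='first')]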
indices[:5]

original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64
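Titles are not guaranteed to be unique, and a lookup on a repeated title returns several rows instead of a single integer. The sketch below, which uses made-up titles, shows that `drop_duplicates()` on such a Series compares the values (the row positions) and that filtering on the index is one way to keep a single entry per title.

```python
import pandas as pd

# Toy titles with one repeat; these are not taken from the dataset
titles = pd.Series(["Batman", "Avatar", "Batman"])
toy_indices = pd.Series(titles.index, index=titles)

# The values 0, 1, 2 are all distinct, so nothing is dropped here
same_length = len(toy_indices.drop_duplicates())

# A repeated title yields more than one row position
batman_rows = list(toy_indices["Batman"])

# Deduplicate on the index (the titles), keeping the first occurrence
deduped = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(same_length, batman_rows, int(deduped["Batman"]))
```

Keeping the first occurrence is an arbitrary choice; a mapping keyed on the movie `id` would avoid the ambiguity entirely.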

In [15]:
indices['Newlyweds']

4799

In [16]:
cosineSim[4799]

array([0., 0., 0., ..., 0., 0., 0.])

In [17]:
list(enumerate(cosineSim[indices['Newlyweds']]))

[(0, 0.0),
 (1, 0.0),
 (2, 0.0),
 (3, 0.0),
 (4, 0.0),
 (5, 0.0),
 (6, 0.0),
 (7, 0.0),
 (8, 0.0),
 (9, 0.0),
 (10, 0.0),
 (11, 0.0),
 (12, 0.0),
 (13, 0.0),
 (14, 0.0),
 (15, 0.0),
 (16, 0.0),
 (17, 0.0),
 (18, 0.0),
 (19, 0.0),
 (20, 0.0),
 (21, 0.0),
 (22, 0.0),
 (23, 0.0),
 (24, 0.0),
 (25, 0.0),
 (26, 0.0),
 (27, 0.0),
 (28, 0.0),
 (29, 0.0),
 (30, 0.0),
 (31, 0.0),
 (32, 0.0),
 (33, 0.0),
 (34, 0.0),
 (35, 0.0),
 (36, 0.0),
 (37, 0.0),
 (38, 0.0),
 (39, 0.0),
 (40, 0.0),
 (41, 0.0),
 (42, 0.0),
 (43, 0.0),
 (44, 0.0),
 (45, 0.0),
 (46, 0.0),
 (47, 0.0),
 (48, 0.0),
 (49, 0.0),
 (50, 0.0),
 (51, 0.0),
 (52, 0.0),
 (53, 0.0),
 (54, 0.0),
 (55, 0.0),
 (56, 0.0),
 (57, 0.0),
 (58, 0.0),
 (59, 0.0),
 (60, 0.0),
 (61, 0.0),
 (62, 0.0),
 (63, 0.0),
 (64, 0.0),
 (65, 0.0),
 (66, 0.0),
 (67, 0.0),
 (68, 0.0),
 (69, 0.0),
 (70, 0.0),
 (71, 0.0),
 (72, 0.0),
 (73, 0.0),
 (74, 0.0),
 (75, 0.0),
 (76, 0.0),
 (77, 0.0),
 (78, 0.0),
 (79, 0.0),
 (80, 0.0),
 (81, 0.0),
 (82, 0.0),
 (83, 0.0),
 (

In [18]:
sorted(list(enumerate(cosineSim[indices['Newlyweds']])), key = lambda x: x[1], reverse=True)

[(4799, 1.0000000000000002),
 (3969, 0.19270320765371432),
 (616, 0.18864326731057912),
 (2689, 0.18593166409230735),
 (1576, 0.1821590906609501),
 (2290, 0.16089644124676217),
 (504, 0.15492622854876056),
 (869, 0.14702943161183046),
 (866, 0.13397197042261558),
 (4576, 0.13231654284625802),
 (242, 0.1256311620019384),
 (2962, 0.1238150826042811),
 (2869, 0.1219574606591664),
 (3155, 0.12168090394595296),
 (1223, 0.11968300452604068),
 (3479, 0.11965064690119123),
 (4641, 0.1166490673300186),
 (1071, 0.11588259884990819),
 (3393, 0.11536024499975302),
 (2688, 0.11149522734533239),
 (3559, 0.11016122112090831),
 (4616, 0.10548258923873528),
 (1970, 0.10004975629403871),
 (4591, 0.09960926194758037),
 (1856, 0.09344414172308568),
 (1110, 0.09093472696648205),
 (237, 0.09081489043733419),
 (3638, 0.08991182691426543),
 (1385, 0.08953845099824984),
 (4584, 0.08921097567983116),
 (3583, 0.08723160961933182),
 (3253, 0.0831831863949244),
 (1949, 0.08182941033277094),
 (1364, 0.0802919954405

We are now in good shape to define our recommendation function. These are the steps we'll follow:

- Get the index of the movie given its title.
- Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.
- Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
- Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
- Return the titles corresponding to the indices of the top elements.

In [19]:
def get_recommendations(movie, cosineSim):
    # get index of the movie
    idx = indices[movie]
    
    # fetch scores of similar movies (idx is already the row position in cosineSim)
    simScores = sorted(list(enumerate(cosineSim[idx])), key = lambda x: x[1], reverse=True)
    
    # fetch top 10 similar movies based on scores
    simScores = simScores[1:11]
    
    # fetch movie titles
    movieIndices = [i[0] for i in simScores]
    return df_movies_merge['original_title'].iloc[movieIndices]

get_recommendations('Avatar', cosineSim)

3604                       Apollo 18
2130                    The American
1341                Obitaemyy Ostrov
634                       The Matrix
529                 Tears of the Sun
311     The Adventures of Pluto Nash
151                          Beowulf
2628             Blood and Chocolate
847                         Semi-Pro
570                           Ransom
Name: original_title, dtype: object
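Skipping the first sorted entry assumes that the query movie always ranks first. That can fail when another movie ties with it, or when the overview is empty and every score, including the self-similarity, is zero. A hedged variant that excludes the query by index instead is sketched below; `top_k_similar` is a hypothetical helper and the scores are toy values.

```python
def top_k_similar(scores, query_idx, k=10):
    """Return the k (index, score) pairs with the highest score, skipping query_idx."""
    candidates = [(i, s) for i, s in enumerate(scores) if i != query_idx]
    # Stable sort: ties keep their original order
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Toy similarity row: index 2 is the query, and index 0 ties with it at 1.0
scores = [1.0, 0.0, 1.0, 0.4, 0.7]

top = top_k_similar(scores, query_idx=2, k=3)
print(top)

# For comparison, sorting everything and slicing off the first entry
naive = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[1:4]
print(naive)
```

With these toy scores, the slicing approach returns the query movie itself as a recommendation, while the index-based filter does not.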

In [20]:
get_recommendations('The Dark Knight', cosineSim)

3                         The Dark Knight Rises
428                              Batman Returns
299                              Batman Forever
3854    Batman: The Dark Knight Returns, Part 2
1359                                     Batman
119                               Batman Begins
1181                                        JFK
9            Batman v Superman: Dawn of Justice
2507                                  Slow Burn
879                         Law Abiding Citizen
Name: original_title, dtype: object

In [21]:
get_recommendations('Pirates of the Caribbean: At World\'s End', cosineSim)

2542                        What's Love Got to Do with It
3095                                  My Blueberry Nights
2102                                      The Descendants
1280                                            Disturbia
2652                                              Bathory
4720                                The Birth of a Nation
792                                      Just Like Heaven
109     The Chronicles of Narnia: The Voyage of the Da...
1709                                           キャプテンハーロック
3632                                 90 Minutes in Heaven
Name: original_title, dtype: object

We see that, while our system has done a decent job of finding movies with similar plot descriptions, the quality of the recommendations is not that great. "The Dark Knight" returns mostly Batman movies, whereas people who liked that movie are arguably more likely to enjoy other Christopher Nolan movies. This is something that our present system cannot capture, because it only looks at plot descriptions.

In the next notebook, we will explore a better recommender system using the cast, crew, and keywords for each movie.