In [1]:
import os
os.chdir("../")
%pwd

'c:\\Users\\abhis\\Desktop\\MLProjects\\Movie Recommender'

In [2]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Content-Based Recommender
To personalize the recommendations, we build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked.

We build two Content-Based Recommenders based on contents including:

 - Movie Overview
 - Movie Cast, Crew, Keywords and Genre

### Movie Overview Based Recommender
First we compute pairwise similarity scores for all movies based on their **overview** column. Then we recommend movies based on those similarity scores.

In [3]:
# Read the data

# ratings_df = pd.read_csv('artifacts/data_preparation/final_data/ratings.csv')
movies_df = pd.read_csv('artifacts/data_preparation/final_data/movies.csv')

In [4]:
movies_df.head()

Unnamed: 0,movieId,title,imdbId,tmdbId,genres,overview,popularity,poster_path,vote_average,vote_count,director,keywords
0,1,Toy Story (1995),114709,862,"['Animation', 'Adventure', 'Family', 'Comedy']","Led by Woody, Andy's toys live happily in his ...",101.402,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,8.0,16771,John Lasseter,"['martial arts', 'jealousy', 'friendship', 'bu..."
1,2,Jumanji (1995),113497,8844,"['Adventure', 'Fantasy', 'Family']",When siblings Judy and Peter discover an encha...,16.794,/v2XHtmVqpERPy0HA1y9wltoeEgW.jpg,7.238,9636,Joe Johnston,"['giant insect', 'board game', 'jungle', 'disa..."
2,3,Grumpier Old Men (1995),113228,15602,"['Romance', 'Comedy']",A family wedding reignites the ancient feud be...,9.856,/1FSXpj5e8l4KH6nVFO5SPUeraOt.jpg,6.47,328,Howard Deutch,"['fishing', 'halloween', 'sequel', 'old man', ..."
3,4,Waiting to Exhale (1995),114885,31357,"['Comedy', 'Drama', 'Romance']","Cheated on, mistreated and stepped on, the wom...",11.498,/kJokIbVDkd6Ywp7IONv8xgfiES7.jpg,6.272,134,Forest Whitaker,"['based on novel or book', 'interracial relati..."
4,5,Father of the Bride Part II (1995),113041,11862,"['Comedy', 'Family']",Just when George Banks has recovered from his ...,14.211,/rj4LBtwQ0uGrpBnCELr716Qo3mw.jpg,6.25,642,Charles Shyer,"['parent child relationship', 'baby', 'midlife..."


In [5]:
movies_df['overview'].isnull().sum()

362

In [6]:
movies_df['overview'] = movies_df['overview'].fillna(' ')

Now we compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview.

Term Frequency (TF) is the relative frequency of a word in a document and is given as (term instances/total instances). Inverse Document Frequency (IDF) measures how rare a term is across documents and is given as log(number of documents/documents with term). The overall importance of each word to the documents in which it appears is equal to TF * IDF.

This gives us a matrix where each column represents a word in the overall overview vocabulary and each row represents a movie. The IDF factor reduces the importance of words that occur frequently across plot overviews and, therefore, their significance in computing the final similarity score.

Scikit-learn has a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.
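As a small check of the weighting described above, the sketch below recomputes one TF-IDF row by hand and compares it with TfidfVectorizer. The three-document corpus is made up for illustration. Note that with its default settings scikit-learn uses raw term counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then L2-normalises each row, so the numbers differ slightly from the textbook formula.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration, not taken from movies_df)
docs = ["toy cowboy friend", "toy robot", "cowboy town sheriff"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = sorted(vec.vocabulary_)  # columns are ordered alphabetically

# Document frequency and smoothed IDF for every term
n_docs = len(docs)
df = {t: sum(t in d.split() for d in docs) for t in terms}
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}

# Every term occurs once in document 0, so tf * idf is just idf there
raw = np.array([idf[t] if t in docs[0].split() else 0.0 for t in terms])
manual_row = raw / np.linalg.norm(raw)  # L2-normalise the row

print(dict(zip(terms, manual_row.round(3))))
print(np.allclose(manual_row, X[0]))
```

The rarer term "friend" ends up with a larger weight than "toy" or "cowboy", which each appear in two documents.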

**Constructing TF-IDF Matrix**

In [None]:
tfidfv=TfidfVectorizer(analyzer='word', stop_words='english')
tfidfv_matrix=tfidfv.fit_transform(movies_df['overview']).astype('float32')
# print(tfidfv_matrix.todense())
# tfidfv_matrix.todense().shape

In [9]:
df = pd.DataFrame.sparse.from_spmatrix(
    tfidfv_matrix, columns=tfidfv.get_feature_names_out() )

In [10]:
df.head()

Unnamed: 0,00,000,000s,000th,001,006,007,009,0093,01,...,आव,गल,ஒற,றன,అన,నమయ,ണന,ധയ,ﬁrst,ﬂying
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
df.shape,movies_df.shape

((55631, 75214), (55631, 12))

So there are over 75k distinct terms describing 55,631 movies.

### Computing Similarity Score
We can compute the similarity score with different measures, such as Euclidean distance, Pearson correlation or cosine similarity (any of which can also drive a k-nearest-neighbours search). We choose cosine similarity to calculate a numeric quantity that denotes the similarity between two movies because it is independent of vector magnitude and is relatively easy and fast to calculate. Mathematically, it is defined as follows:

$$
\text{cosine}(x,y)=\frac{x \cdot y}{\|x\| \, \|y\|}
$$
Since **TfidfVectorizer** L2-normalises each row by default, the denominator equals 1 and the dot product directly gives us the cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it skips the normalisation step and is faster.
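The equivalence can be checked on a toy corpus. This is a minimal sketch with made-up overviews; it only relies on TfidfVectorizer's default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up overviews for illustration
docs = [
    "a toy cowboy and a space ranger",
    "a board game comes alive",
    "two toy makers feud",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each row already has unit length, so the cosine denominator is 1
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalises first, then dot products
same = np.allclose(dot, cos)
print(same)
```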

In [8]:
cosine_sim = linear_kernel(tfidfv_matrix, tfidfv_matrix)
cosine_sim.shape 

(55631, 55631)

We now have a pairwise cosine similarity matrix for all the movies in our dataset.
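A dense 55631 x 55631 matrix holds about 3.1 billion values, which is roughly 12 GB if stored as float32. Where that is too much memory, one alternative is to compute a single row of similarities on demand. The sketch below is only an illustration on a toy corpus; `top_similar_rows` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_similar_rows(matrix, idx, k=10):
    """Return positions of the k rows most similar to row idx (hypothetical helper)."""
    # Similarity of one row against all rows: a 1 x N result instead of N x N
    scores = linear_kernel(matrix[idx], matrix).ravel()
    scores[idx] = -np.inf                  # exclude the query row itself
    k = min(k, len(scores) - 1)
    # argpartition finds the k best candidates without sorting everything
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]   # order the k candidates, best first

# Made-up overviews for illustration
docs = ["toy cowboy space ranger", "toy cowboy sheriff", "board game jungle", "jungle game"]
m = TfidfVectorizer().fit_transform(docs)
result = top_similar_rows(m, 0, k=2)
print(result)
```

The trade-off is that every query repeats one sparse matrix-vector product, in exchange for never materialising the full pairwise matrix.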

**Defining Recommendation Function**

The next step is to define a recommendation function that takes in a movie title as an input and outputs a list of the 10 most similar movies. To do this:

- We need a reverse mapping of movie titles and dataframe indices. In other words, we build a series to identify the index of a movie in our dataframe, given its title.

- The function should get the index of the movie given its title.

- Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.

- Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

- Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

- Return the titles corresponding to the indices of the top elements.

In [30]:
indices=pd.Series(data=list(movies_df.index), index= movies_df['title'] )

In [31]:
indices.head()

title
Toy Story (1995)                      0
Jumanji (1995)                        1
Grumpier Old Men (1995)               2
Waiting to Exhale (1995)              3
Father of the Bride Part II (1995)    4
dtype: int64
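One caveat with a title-to-index series: if two rows share the same title, the lookup returns a Series of positions rather than a single integer. Whether movies_df contains duplicate titles is not shown here, so the sketch below uses made-up titles, and `first_position` is a hypothetical helper that simply takes the first match.

```python
import pandas as pd

# Made-up titles with one deliberate duplicate
titles = pd.Series(["Hero (2002)", "Avengers, The (2012)", "Hero (2002)"])
toy_indices = pd.Series(titles.index, index=titles)

unique_hit = toy_indices["Avengers, The (2012)"]  # a single integer
duplicate_hit = toy_indices["Hero (2002)"]        # a Series with two positions

def first_position(lookup, title):
    """Return one integer position even when the title is duplicated (hypothetical helper)."""
    hit = lookup[title]
    return int(hit.iloc[0]) if isinstance(hit, pd.Series) else int(hit)

print(first_position(toy_indices, "Hero (2002)"))
```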

In [9]:
# Function that takes in a movie title as input and outputs the most similar movies
def content_recommendations(title, cosine_sim):

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores, highest first
    sim_scores.sort(key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    ind = [i for i, _ in sim_scores]

    # Return the titles of the 10 most similar movies
    return pd.Series(data=movies_df['title'].iloc[ind].values, index=ind)


In [15]:
content_recommendations('Father of the Bride Part II (1995)',cosine_sim)

6776                         Father of the Bride (1991)
6548                                       Kuffs (1992)
6284                             North to Alaska (1960)
52720                      Worried About the Boy (2010)
53138                              Father's Lion (1952)
19666                                    Babbitt (1934)
27438    Don't Raise the Bridge, Lower the River (1968)
36112                          You're Killing Me (2015)
47999                                   George ! (1972)
13562    Magic of Méliès, The (magie Méliès, La) (1997)
dtype: object

In [21]:
import re
pattern = r'.*Avengers.*' # r'.*Dark Knight.*'
matches = movies_df['title'].str.match(pattern, flags=re.IGNORECASE)

In [22]:
movies_df[matches]

Unnamed: 0,movieId,title,imdbId,tmdbId,genres,overview,popularity,poster_path,vote_average,vote_count,director,keywords
2029,2153,"Avengers, The (1998)",118661,9320,"['Thriller', 'Science Fiction', 'Action', 'Adv...","British Ministry agent John Steed, under direc...",18.274,/1p5thyQ4pCy876HpdvFARqJ62N9.jpg,4.362,620,Jeremiah S. Chechik,"['london, england', 'clone', 'spy', 'martial a..."
10787,44020,Ultimate Avengers (2006),491703,14609,"['Action', 'Animation', 'Family', 'Adventure',...",When a nuclear missile was fired at Washington...,12.303,/fKQqZEDmvKMCXEQztvMJHGou9dO.jpg,6.761,301,Curt Geda,"['mask', 'alien life-form', 'superhero', 'base..."
17743,89745,"Avengers, The (2012)",848228,24428,"['Science Fiction', 'Action', 'Adventure']",When an unexpected enemy emerges and threatens...,108.524,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,7.708,28709,Joss Whedon,"['new york city', 'shield', 'superhero', 'base..."
22891,110132,Avengers Confidential: Black Widow & Punisher ...,3482378,257346,"['Animation', 'Science Fiction', 'Action']",When the Punisher takes out a black-market wea...,28.347,/hRBXP91ATK5j1u0ibvrQLbxQr8c.jpg,6.3,210,Kenichi Shimizu,"['superhero', 'based on comic']"
24364,115727,Crippled Avengers (Can que) (Return of the 5 D...,77292,40081,"['Action', 'Drama']",A group of martial artists seek revenge after ...,5.113,/eKdvNCKtiuUFvwCXtpwISw8jqZ5.jpg,6.6,58,Chang Cheh,['martial arts']
26784,122892,Avengers: Age of Ultron (2015),2395427,99861,"['Action', 'Adventure', 'Science Fiction']",When Tony Stark tries to jumpstart a dormant p...,86.448,/4ssDuvEDkSArWEdyBl2X5EHvYKU.jpg,7.3,21383,Joss Whedon,"['artificial intelligence', 'sequel', 'superhe..."
26793,122912,Avengers: Infinity War - Part I (2018),4154756,299536,"['Adventure', 'Action', 'Science Fiction']",As the Avengers and their allies have continue...,180.25,/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,8.259,27044,Anthony Russo,"['magic', 'sacrifice', 'superhero', 'based on ..."
31831,135979,Next Avengers: Heroes of Tomorrow (2008),1259998,14613,"['Animation', 'Family', 'Action', 'Adventure',...",The children of the Avengers hone their powers...,16.497,/fpG1NDbcLV2a7c8X7LC4FPISBT7.jpg,6.877,227,Jay Oliva,"['cartoon', 'based on comic']"
31925,136257,Avengers Grimm (2015),4296026,323660,"['Action', 'Fantasy']",When Rumpelstiltskin destroys the Magic Mirror...,13.261,/1SbBKCbnULACOqWKN7eLfTu1gVm.jpg,4.0,108,Jeremy M. Inman,"['fairy tale', 'brothers grimm', 'rumpelstilts..."
35776,145676,3 Avengers (1964),58651,296491,"['Action', 'Adventure', 'Comedy']",Ursus and his sword-wielding companions run he...,2.023,/p3SS46UrWx3SngctI3Gbks895MD.jpg,5.0,1,Gianfranco Parolini,['peplum']


In [18]:
content_recommendations('The Dark Knight (2011)',cosine_sim) 	

34038                                     Turbo Kid (2015)
9902                                        Macbeth (1948)
6921                              Hero (Ying xiong) (2002)
49398                          Empire of the Sharks (2017)
21929    Ninja, A Band of Assassins (Shinobi No Mono) (...
17045             Storm Warriors, The (Fung wan II) (2009)
36577                              West Of Shanghai (1937)
10715                Bitter Tea of General Yen, The (1933)
9213         Dark Prince: The True Story of Dracula (2000)
45244                  The Taking of Tiger Mountain (2014)
dtype: object

In [20]:
movies_df[movies_df['title'] == 'Hero (Ying xiong) (2002)']

Unnamed: 0,movieId,title,imdbId,tmdbId,genres,overview,popularity,poster_path,vote_average,vote_count,director,keywords
6921,7090,Hero (Ying xiong) (2002),299977,79,"['Drama', 'Adventure', 'Action', 'History']",One man defeated three assassins who sought to...,20.646,/dsSTITP8sq2pO7ZWo72NNYejYLW.jpg,7.506,1976,Zhang Yimou,"['countryside', 'loss of loved one', 'martial ..."


In [23]:
content_recommendations('Avengers, The (2012)',cosine_sim)  

24859    Delta Force One: The Lost Patrol (2002)
28797       Requiem per un agente segreto (1966)
11159                       Crime Busters (1977)
26784             Avengers: Age of Ultron (2015)
33642              Drums Across the River (1954)
13467                  Echelon Conspiracy (2009)
45761                              Brink! (1998)
34122                                   Nice Guy
25497        Kingsman: The Secret Service (2015)
7284                           Enemy Mine (1985)
dtype: object

While our system finds movies whose overviews share vocabulary, the quality of the recommendations is not that great. For 'Avengers, The (2012)', for example, only one of the ten results, 'Avengers: Age of Ultron (2015)', belongs to the same franchise. People who liked a movie are also likely to enjoy other movies by the same director or in the same genres, and this is something that cannot be captured by the present system.

### Movie Cast, Crew, Keywords, Genres Based Recommender

In order to improve the quality of the content-based recommender, we use better metadata. So we build a recommender based on the following metadata:

- the director
- the top 3 genres
- the top 3 plot keywords

From the director, genres and keywords columns, we need to extract the director, genres and keywords associated with each movie.

**Preprocessing the Contents**

**Applying literal_eval Function on Stringified Lists**

Right now, the data in the 'genres' and 'keywords' columns is present in the form of "stringified" lists (the 'director' column is already a plain string). So we need to convert it into a safe and usable structure. literal_eval safely evaluates a string containing a Python literal, such as a list, and returns the corresponding object.
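A short illustration of what literal_eval does with a stringified list, using the genres value shown for Toy Story above. The second call is an illustrative example showing that, unlike `eval`, it refuses anything that is not a plain literal.

```python
from ast import literal_eval

# A stringified list, as stored in the CSV
raw = "['Animation', 'Adventure', 'Family', 'Comedy']"
parsed = literal_eval(raw)
print(type(raw).__name__, "->", type(parsed).__name__)

# Function calls are not literals, so literal_eval raises instead of executing them
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```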

In [6]:
type(movies_df['director'].iloc[0])

str

In [7]:
features = ['genres','keywords']
for feature in features:
    movies_df[feature] = movies_df[feature].apply(literal_eval)

In [8]:
type(movies_df['genres'].iloc[0])

list

In [9]:
# Get the first 3 elements of the list, or the entire list if it has fewer than 3, in the genres and keywords columns.

def get_top_elements(lst):
    return lst[:3]

In [10]:
movies_df['genres'] = movies_df['genres'].apply(get_top_elements)

In [11]:
movies_df['keywords'] = movies_df['keywords'].apply(get_top_elements)

In [12]:
movies_df[['title', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,director,keywords,genres
0,Toy Story (1995),John Lasseter,"[martial arts, jealousy, friendship]","[Animation, Adventure, Family]"
1,Jumanji (1995),Joe Johnston,"[giant insect, board game, jungle]","[Adventure, Fantasy, Family]"
2,Grumpier Old Men (1995),Howard Deutch,"[fishing, halloween, sequel]","[Romance, Comedy]"


The next step is to convert the names and keywords to lowercase and replace the spaces inside them with underscores. This is done so that our vectorizer treats each multi-word name as a single token and doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same.
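The effect of joining words can be seen directly in the vectorizer's vocabulary. This is a minimal sketch with CountVectorizer's default tokenizer, which treats the underscore as a word character.

```python
from sklearn.feature_extraction.text import CountVectorizer

spaced = ["johnny depp", "johnny galecki"]
joined = ["johnny_depp", "johnny_galecki"]

# With spaces, "johnny" becomes a shared token between the two names
vocab_spaced = sorted(CountVectorizer().fit(spaced).vocabulary_)

# With underscores, each full name is one distinct token
vocab_joined = sorted(CountVectorizer().fit(joined).vocabulary_)

print(vocab_spaced)
print(vocab_joined)
```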

In [13]:
def clean_director(x):
    # Lowercase the name and join its words with underscores
    return x.lower().replace(' ', '_')

def clean_top3(x):
    # Apply the same cleaning to every element of the list
    return [a.lower().replace(' ', '_') for a in x]

In [14]:
movies_df['director']=movies_df['director'].apply(lambda x: clean_director(x))

In [15]:
movies_df['genres']=movies_df['genres'].apply(lambda x:clean_top3(x))

movies_df['keywords']=movies_df['keywords'].apply(lambda x:clean_top3(x))


In [16]:
movies_df[['title', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,director,keywords,genres
0,Toy Story (1995),john_lasseter,"[martial_arts, jealousy, friendship]","[animation, adventure, family]"
1,Jumanji (1995),joe_johnston,"[giant_insect, board_game, jungle]","[adventure, fantasy, family]"
2,Grumpier Old Men (1995),howard_deutch,"[fishing, halloween, sequel]","[romance, comedy]"


Now we create the 'soup' column, which contains all the metadata that we want to feed to our vectorizer (namely keywords, director and genres).

In [17]:
def create_soup(x):
    return ' '.join(x['keywords'])  + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [18]:
movies_df['soup'] = movies_df.apply(create_soup, axis=1)

In [19]:
type(movies_df['genres'].loc[0])

list

In [20]:
movies_df['soup'].loc[0]

'martial_arts jealousy friendship john_lasseter animation adventure family'

**Constructing Count Matrix**

The next steps are the same as what we did with our Movie Overview Based Recommender. One important difference is that we use CountVectorizer() instead of TF-IDF. This is because we do not want to down-weight the presence of a director who has directed relatively more movies.

In [24]:
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(movies_df['soup']).astype('float16')
# cv_matrix = csr_matrix(cv_matrix, dtype=np.float16)

In [25]:
cv_matrix

<55631x32642 sparse matrix of type '<class 'numpy.float16'>'
	with 283052 stored elements in Compressed Sparse Row format>

In [26]:
cosine_sim1 = None

In [27]:
cosine_sim1 =  cosine_similarity(cv_matrix, dense_output=False)

In [28]:
cosine_sim1

<55631x55631 sparse matrix of type '<class 'numpy.float64'>'
	with 1162388295 stored elements in Compressed Sparse Row format>

In [32]:
# Get the index of the movie that matches the title
idx = indices['Avengers, The (2012)']

In [33]:
# Get the row vector of cosine similarity scores
similarity_scores = cosine_sim1[idx, :]

# Convert the row vector to a dense array
sim_scores_dense = similarity_scores.toarray()[0]

# Enumerate the similarity scores with their indices
sim_scores = list(enumerate(sim_scores_dense))

In [34]:
sim_scores

[(0, 0.14285714285714282),
 (1, 0.14285714285714282),
 (2, 0.0),
 (3, 0.0),
 (4, 0.0),
 (5, 0.14285714285714282),
 (6, 0.0),
 (7, 0.28571428571428564),
 (8, 0.1690308509457033),
 (9, 0.28571428571428564),
 (10, 0.0),
 (11, 0.0),
 (12, 0.13363062095621217),
 (13, 0.0),
 (14, 0.3086066999241838),
 (15, 0.0),
 (16, 0.0),
 (17, 0.0),
 (18, 0.14285714285714282),
 (19, 0.28571428571428564),
 (20, 0.0),
 (21, 0.0),
 (22, 0.28571428571428564),
 (23, 0.14285714285714282),
 (24, 0.0),
 (25, 0.0),
 (26, 0.0),
 (27, 0.0),
 (28, 0.26726124191242434),
 (29, 0.0),
 (30, 0.0),
 (31, 0.13363062095621217),
 (32, 0.1889822365046136),
 (33, 0.0),
 (34, 0.0),
 (35, 0.0),
 (36, 0.0),
 (37, 0.0),
 (38, 0.0),
 (39, 0.0),
 (40, 0.1543033499620919),
 (41, 0.0),
 (42, 0.13363062095621217),
 (43, 0.0),
 (44, 0.0),
 (45, 0.0),
 (46, 0.14285714285714282),
 (47, 0.0),
 (48, 0.14285714285714282),
 (49, 0.1690308509457033),
 (50, 0.0),
 (51, 0.0),
 (52, 0.0),
 (53, 0.0),
 (54, 0.0),
 (55, 0.0),
 (56, 0.0),
 (57, 0.0),
 ...]

In [35]:
# Function that takes in a movie title as input and outputs the most similar movies
def content_recommendations(title, cosine_sim):

    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the row vector of cosine similarity scores
    similarity_scores = cosine_sim[idx, :]

    # Convert the sparse row vector to a dense 1-D array
    sim_scores_dense = similarity_scores.toarray()[0]

    # Enumerate the similarity scores with their indices
    sim_scores = list(enumerate(sim_scores_dense))

    # Sort the movies based on the similarity scores, highest first
    sim_scores.sort(key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    ind = [i for i, _ in sim_scores]

    # Return the titles of the 10 most similar movies
    return pd.Series(data=movies_df['title'].iloc[ind].values, index=ind)


In [36]:
content_recommendations('Avengers, The (2012)',cosine_sim1)  

12656                          Incredible Hulk, The (2008)
20811                               Captain America (1979)
26784                       Avengers: Age of Ultron (2015)
43655                                     Max Steel (2016)
23555                                        Ra.One (2011)
1971                                 Rocketeer, The (1991)
3297                   Teenage Mutant Ninja Turtles (1990)
3298     Teenage Mutant Ninja Turtles II: The Secret of...
3649                                          X-Men (2000)
5016                              Time Machine, The (2002)
dtype: object

We see that our recommender has captured more information thanks to the richer metadata and has given us better recommendations. It is more likely that Marvel or DC Comics fans will like movies from the same production house. Therefore, we could add production_company to the features above. We can also increase the weight of the director by adding the feature multiple times in the soup.
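Repeating the director in the soup could look like the sketch below. `create_weighted_soup` and its `director_weight` parameter are hypothetical names introduced here for illustration, and the example row is made up in the same shape as the cleaned columns above.

```python
def create_weighted_soup(row, director_weight=2):
    """Build the soup with the director token repeated director_weight times (hypothetical helper)."""
    director_part = ' '.join([row['director']] * director_weight)
    return ' '.join(row['keywords']) + ' ' + director_part + ' ' + ' '.join(row['genres'])

# Made-up row in the same shape as a cleaned movies_df row
row = {
    'keywords': ['superhero', 'based_on_comic'],
    'director': 'joss_whedon',
    'genres': ['action'],
}
soup = create_weighted_soup(row)
print(soup)
```

Because CountVectorizer counts occurrences, a director repeated twice contributes twice as much to the dot product as a single keyword or genre. With a dataframe this would be applied row-wise, for example `movies_df.apply(create_weighted_soup, axis=1)`.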

In the previous notebook, you were introduced to a way to make recommendations using collaborative filtering. However, using this technique there are a large number of users who were left without any recommendations at all. Other users were left with fewer than the ten recommendations that were set up by our function to retrieve...

In order to help these users out, let's try another technique: **content-based** recommendations. Let's start off where we were in the previous notebook.