## 1. Importing Python Libraries

We shall start by importing the essential Python libraries.

In [1]:
### IMPORTING LIBRARIES
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## 2. Importing the Datasets

Next, we shall import the datasets that we pulled from TMDB.

In [2]:
### IMPORTING DATASETS
tmdb_movies = pd.read_csv('tmdb_movies_data.csv')
tmdb_genres = pd.read_csv('tmdb_genres.csv')
tmdb_directors = pd.read_csv('tmdb_directors.csv')
tmdb_actors = pd.read_csv('tmdb_actors.csv')

## 3. Cleaning and Merging the Datasets 

### a) Cleaning the Top Rated Movies Dataframe

First, let us take a look at _tmdb_movies_.

In [3]:
tmdb_movies

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/5hNcsnMkwU2LknLoru73c76el3z.jpg,"[35, 18, 10749]",19404,hi,दिलवाले दुल्हनिया ले जायेंगे,"Raj is a rich, carefree, happy-go-lucky second...",24.222,/2CAL2433ZeIihfX1Hb2139CX0pW.jpg,1995-10-20,Dilwale Dulhania Le Jayenge,False,8.7,3253
1,False,/iNh3BivHyg5sQRPP1KOkzguEX0H.jpg,"[18, 80]",278,en,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,67.359,/q6y0Go1tsGEsmtFryDOJo3dEmqu.jpg,1994-09-23,The Shawshank Redemption,False,8.7,20172
2,False,/rSPw7tgCH9c6NqICZef4kZjFOQ5.jpg,"[18, 80]",238,en,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",62.603,/eEslKSwcqmiNS6va24Pbxf2UKmJ.jpg,1972-03-14,The Godfather,False,8.7,15112
3,False,/jtAI6OJIWLWiRItNSZoWjrsUtmi.jpg,[10749],724089,en,Gabriel's Inferno Part II,Professor Gabriel Emerson finally learns the t...,10.796,/x5o8cLZfEXMoZczTYWLrUo1P7UJ.jpg,2020-07-31,Gabriel's Inferno Part II,False,8.7,1334
4,False,/fQq1FWp1rC89xDrRMuyFJdFUdMd.jpg,"[10749, 35]",761053,en,Gabriel's Inferno Part III,The final part of the film adaption of the ero...,34.804,/qtX2Fg9MTmrbgN1UUvGoCsImTM8.jpg,2020-11-19,Gabriel's Inferno Part III,False,8.6,901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9355,False,/lcLyZzhB1ctfdH0hGBsTFrbflqP.jpg,"[28, 14, 27]",12142,en,Alone in the Dark,Edward Carnby is a private investigator specia...,14.365,/o6Wf8lj8P9enQQbj4pC8jVDJHxI.jpg,2005-01-28,Alone in the Dark,False,3.2,429
9356,False,/hqJfW8G8FL28rckFHuCoKPecpG9.jpg,"[28, 12, 878, 10752]",5491,en,Battlefield Earth,"In the year 3000, man is no match for the Psyc...",8.795,/neMUscYddxr4cP8wnRHRMLcWS0A.jpg,2000-05-12,Battlefield Earth,False,3.2,621
9357,False,/aNUEHLNsNMprLZt6fjf5nqDq6er.jpg,"[27, 28, 53]",11059,en,House of the Dead,"Set on an island off the coast, a techno rave ...",10.019,/lI6UBnxwHztggSq8PhLibdOe2Nd.jpg,2003-04-11,House of the Dead,False,3.2,280
9358,False,/oHrrgAPEKpz0S1ofQntiZNrmGrM.jpg,"[28, 12, 14, 878, 53]",14164,en,Dragonball Evolution,The young warrior Son Goku sets out on a quest...,50.279,/sunS9xhPnFNP5wlOWrvbpBteAB.jpg,2009-03-12,Dragonball Evolution,False,2.9,1567


The dataframe contains 14 columns. Some of them, such as _'backdrop_path'_ and _'poster_path'_, carry no information that is useful for recommendations, so we drop them to get _tmdb_movies_updated_.

In [4]:
### DROPPING UNNECESSARY COLUMNS
tmdb_movies_updated = tmdb_movies.drop(['adult', 'backdrop_path', 'genre_ids', 'original_title', 'poster_path', 'video'], axis = 1)
tmdb_movies_updated

Unnamed: 0,id,original_language,overview,popularity,release_date,title,vote_average,vote_count
0,19404,hi,"Raj is a rich, carefree, happy-go-lucky second...",24.222,1995-10-20,Dilwale Dulhania Le Jayenge,8.7,3253
1,278,en,Framed in the 1940s for the double murder of h...,67.359,1994-09-23,The Shawshank Redemption,8.7,20172
2,238,en,"Spanning the years 1945 to 1955, a chronicle o...",62.603,1972-03-14,The Godfather,8.7,15112
3,724089,en,Professor Gabriel Emerson finally learns the t...,10.796,2020-07-31,Gabriel's Inferno Part II,8.7,1334
4,761053,en,The final part of the film adaption of the ero...,34.804,2020-11-19,Gabriel's Inferno Part III,8.6,901
...,...,...,...,...,...,...,...,...
9355,12142,en,Edward Carnby is a private investigator specia...,14.365,2005-01-28,Alone in the Dark,3.2,429
9356,5491,en,"In the year 3000, man is no match for the Psyc...",8.795,2000-05-12,Battlefield Earth,3.2,621
9357,11059,en,"Set on an island off the coast, a techno rave ...",10.019,2003-04-11,House of the Dead,3.2,280
9358,14164,en,The young warrior Son Goku sets out on a quest...,50.279,2009-03-12,Dragonball Evolution,2.9,1567


Next, let us take the title of each movie, convert it to a string and strip any punctuation.

In [5]:
### CLEANING THE MOVIE TITLES
new_title = []
for name in range(0, len(tmdb_movies_updated)):
    title = tmdb_movies_updated['title'][name]
    title = str(title) # converting to string format
    title = re.sub(r'[^\w\s]+', '', title) # removing punctuation
    new_title.append(title)
tmdb_movies_updated['title'] = new_title
tmdb_movies_updated

Unnamed: 0,id,original_language,overview,popularity,release_date,title,vote_average,vote_count
0,19404,hi,"Raj is a rich, carefree, happy-go-lucky second...",24.222,1995-10-20,Dilwale Dulhania Le Jayenge,8.7,3253
1,278,en,Framed in the 1940s for the double murder of h...,67.359,1994-09-23,The Shawshank Redemption,8.7,20172
2,238,en,"Spanning the years 1945 to 1955, a chronicle o...",62.603,1972-03-14,The Godfather,8.7,15112
3,724089,en,Professor Gabriel Emerson finally learns the t...,10.796,2020-07-31,Gabriels Inferno Part II,8.7,1334
4,761053,en,The final part of the film adaption of the ero...,34.804,2020-11-19,Gabriels Inferno Part III,8.6,901
...,...,...,...,...,...,...,...,...
9355,12142,en,Edward Carnby is a private investigator specia...,14.365,2005-01-28,Alone in the Dark,3.2,429
9356,5491,en,"In the year 3000, man is no match for the Psyc...",8.795,2000-05-12,Battlefield Earth,3.2,621
9357,11059,en,"Set on an island off the coast, a techno rave ...",10.019,2003-04-11,House of the Dead,3.2,280
9358,14164,en,The young warrior Son Goku sets out on a quest...,50.279,2009-03-12,Dragonball Evolution,2.9,1567
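
The loop above can also be written as a single vectorized expression. The following is a minimal sketch on a toy dataframe (the name `movies` is a stand-in for _tmdb_movies_updated_, and the three titles are taken from this notebook); it is meant as an equivalent alternative, not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for tmdb_movies_updated (hypothetical name, titles from this notebook)
movies = pd.DataFrame(
    {'title': ["Gabriel's Inferno Part II", 'Spider-Man', 'The Godfather']}
)

# astype(str) mirrors str(title) in the loop.
# regex=True is passed explicitly because recent pandas versions treat the
# pattern as a literal string by default.
movies['title'] = (
    movies['title'].astype(str).str.replace(r'[^\w\s]+', '', regex=True)
)

print(movies['title'].tolist())
```

Note that the hyphen in _Spider-Man_ is removed as well, which is why user input has to be cleaned with the same pattern before it is looked up.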


### b) Merging with the Genres Dataframe

Let's take a look at the _tmdb_genres_ dataframe.

In [6]:
tmdb_genres

Unnamed: 0,genre_1,genre_2
0,Comedy,Drama
1,Drama,Crime
2,Drama,Crime
3,Romance,
4,Romance,Comedy
...,...,...
9355,Action,Fantasy
9356,Action,Adventure
9357,Horror,Action
9358,Action,Adventure


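We now combine _tmdb_movies_updated_ and _tmdb_genres_ by simply attaching the columns of _tmdb_genres_ to _tmdb_movies_updated_. Since _tmdb_genres_ has no id column, this relies on both dataframes listing the movies in the same row order.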

In [7]:
### COMBINING THE TOP RATED MOVIES AND MOVIE GENRE IDS DATASETS
tmdb_movies_updated['genre_1'] = tmdb_genres['genre_1']
tmdb_movies_updated['genre_2'] = tmdb_genres['genre_2']
tmdb_movies_updated

Unnamed: 0,id,original_language,overview,popularity,release_date,title,vote_average,vote_count,genre_1,genre_2
0,19404,hi,"Raj is a rich, carefree, happy-go-lucky second...",24.222,1995-10-20,Dilwale Dulhania Le Jayenge,8.7,3253,Comedy,Drama
1,278,en,Framed in the 1940s for the double murder of h...,67.359,1994-09-23,The Shawshank Redemption,8.7,20172,Drama,Crime
2,238,en,"Spanning the years 1945 to 1955, a chronicle o...",62.603,1972-03-14,The Godfather,8.7,15112,Drama,Crime
3,724089,en,Professor Gabriel Emerson finally learns the t...,10.796,2020-07-31,Gabriels Inferno Part II,8.7,1334,Romance,
4,761053,en,The final part of the film adaption of the ero...,34.804,2020-11-19,Gabriels Inferno Part III,8.6,901,Romance,Comedy
...,...,...,...,...,...,...,...,...,...,...
9355,12142,en,Edward Carnby is a private investigator specia...,14.365,2005-01-28,Alone in the Dark,3.2,429,Action,Fantasy
9356,5491,en,"In the year 3000, man is no match for the Psyc...",8.795,2000-05-12,Battlefield Earth,3.2,621,Action,Adventure
9357,11059,en,"Set on an island off the coast, a techno rave ...",10.019,2003-04-11,House of the Dead,3.2,280,Horror,Action
9358,14164,en,The young warrior Son Goku sets out on a quest...,50.279,2009-03-12,Dragonball Evolution,2.9,1567,Action,Adventure
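
Attaching columns this way matches rows by index label, so it is only correct when both dataframes have the same number of rows in the same order. A minimal sketch of a sanity check one could run first, using toy frames (`movies`, `genres` and `combined` are hypothetical names; the values come from the tables above):

```python
import pandas as pd

# Toy stand-ins for tmdb_movies_updated and tmdb_genres
movies = pd.DataFrame({
    'id': [19404, 278, 238],
    'title': ['Dilwale Dulhania Le Jayenge', 'The Shawshank Redemption', 'The Godfather'],
})
genres = pd.DataFrame({
    'genre_1': ['Comedy', 'Drama', 'Drama'],
    'genre_2': ['Drama', 'Crime', 'Crime'],
})

# Fail early if the two frames cannot be lined up row by row
assert len(movies) == len(genres), 'row counts differ'
assert movies.index.equals(genres.index), 'index labels differ'

# Equivalent to assigning the two genre columns one at a time
combined = pd.concat([movies, genres[['genre_1', 'genre_2']]], axis=1)
print(combined)
```

These checks cannot detect rows that are present in both frames but in a different order; only a shared key column, as used for the directors and actors below, can guarantee that.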


### c) Merging with the Director and Cast Dataframes

Next, we have the dataframe _tmdb_directors_.

In [8]:
tmdb_directors

Unnamed: 0,id,director
0,19404,Aditya Chopra
1,278,Frank Darabont
2,238,Francis Ford Coppola
3,724089,Tosca Musk
4,761053,Tosca Musk
...,...,...
9355,12142,Uwe Boll
9356,5491,Roger Christian
9357,11059,Uwe Boll
9358,14164,James Wong


And _tmdb_actors_.

In [9]:
tmdb_actors

Unnamed: 0,id,actor_1,actor_2,actor_3,actor_4,actor_5
0,19404,Shah Rukh Khan,Kajol,Amrish Puri,Anupam Kher,Satish Shah
1,278,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,Clancy Brown
2,238,Marlon Brando,Al Pacino,James Caan,Robert Duvall,Richard S. Castellano
3,724089,Melanie Zanetti,Giulio Berruti,James Andrew Fraser,Margaux Brooke,Agnes Albright
4,761053,Melanie Zanetti,Giulio Berruti,Rhett Wellington,James Andrew Fraser,Margaux Brooke
...,...,...,...,...,...,...
9355,12142,Christian Slater,Tara Reid,Stephen Dorff,Will Sanderson,Ona Grauer
9356,5491,John Travolta,Barry Pepper,Forest Whitaker,Kim Coates,Sabine Karsenti
9357,11059,Jonathan Cherry,Tyron Leitso,Clint Howard,Ona Grauer,Michael Eklund
9358,14164,Justin Chatwin,Chow Yun-Fat,Joon Park,Jamie Chung,Emmy Rossum


Notice that the three dataframes _tmdb_movies_updated_, _tmdb_directors_ and _tmdb_actors_ all share an _'id'_ column containing the unique ids of the top rated movies. Thus, we can combine them with an inner join on this column.

First, let us merge _tmdb_movies_updated_ and _tmdb_directors_.

In [10]:
### MERGING THE UPDATED MOVIES AND THE TMDB DIRECTOR DATASETS
tmdb_movies_updated = tmdb_movies_updated.merge(tmdb_directors, left_on = 'id', right_on = 'id')
tmdb_movies_updated

Unnamed: 0,id,original_language,overview,popularity,release_date,title,vote_average,vote_count,genre_1,genre_2,director
0,19404,hi,"Raj is a rich, carefree, happy-go-lucky second...",24.222,1995-10-20,Dilwale Dulhania Le Jayenge,8.7,3253,Comedy,Drama,Aditya Chopra
1,278,en,Framed in the 1940s for the double murder of h...,67.359,1994-09-23,The Shawshank Redemption,8.7,20172,Drama,Crime,Frank Darabont
2,238,en,"Spanning the years 1945 to 1955, a chronicle o...",62.603,1972-03-14,The Godfather,8.7,15112,Drama,Crime,Francis Ford Coppola
3,724089,en,Professor Gabriel Emerson finally learns the t...,10.796,2020-07-31,Gabriels Inferno Part II,8.7,1334,Romance,,Tosca Musk
4,761053,en,The final part of the film adaption of the ero...,34.804,2020-11-19,Gabriels Inferno Part III,8.6,901,Romance,Comedy,Tosca Musk
...,...,...,...,...,...,...,...,...,...,...,...
9355,12142,en,Edward Carnby is a private investigator specia...,14.365,2005-01-28,Alone in the Dark,3.2,429,Action,Fantasy,Uwe Boll
9356,5491,en,"In the year 3000, man is no match for the Psyc...",8.795,2000-05-12,Battlefield Earth,3.2,621,Action,Adventure,Roger Christian
9357,11059,en,"Set on an island off the coast, a techno rave ...",10.019,2003-04-11,House of the Dead,3.2,280,Horror,Action,Uwe Boll
9358,14164,en,The young warrior Son Goku sets out on a quest...,50.279,2009-03-12,Dragonball Evolution,2.9,1567,Action,Adventure,James Wong


Then, we merge the resulting _tmdb_movies_updated_ with _tmdb_actors_ to get _df_.

In [11]:
### MERGING THE UPDATED MOVIES AND THE TMDB ACTORS DATASETS
df = tmdb_movies_updated.merge(tmdb_actors, left_on = 'id', right_on = 'id')
df

Unnamed: 0,id,original_language,overview,popularity,release_date,title,vote_average,vote_count,genre_1,genre_2,director,actor_1,actor_2,actor_3,actor_4,actor_5
0,19404,hi,"Raj is a rich, carefree, happy-go-lucky second...",24.222,1995-10-20,Dilwale Dulhania Le Jayenge,8.7,3253,Comedy,Drama,Aditya Chopra,Shah Rukh Khan,Kajol,Amrish Puri,Anupam Kher,Satish Shah
1,278,en,Framed in the 1940s for the double murder of h...,67.359,1994-09-23,The Shawshank Redemption,8.7,20172,Drama,Crime,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,Clancy Brown
2,238,en,"Spanning the years 1945 to 1955, a chronicle o...",62.603,1972-03-14,The Godfather,8.7,15112,Drama,Crime,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Robert Duvall,Richard S. Castellano
3,724089,en,Professor Gabriel Emerson finally learns the t...,10.796,2020-07-31,Gabriels Inferno Part II,8.7,1334,Romance,,Tosca Musk,Melanie Zanetti,Giulio Berruti,James Andrew Fraser,Margaux Brooke,Agnes Albright
4,761053,en,The final part of the film adaption of the ero...,34.804,2020-11-19,Gabriels Inferno Part III,8.6,901,Romance,Comedy,Tosca Musk,Melanie Zanetti,Giulio Berruti,Rhett Wellington,James Andrew Fraser,Margaux Brooke
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9355,12142,en,Edward Carnby is a private investigator specia...,14.365,2005-01-28,Alone in the Dark,3.2,429,Action,Fantasy,Uwe Boll,Christian Slater,Tara Reid,Stephen Dorff,Will Sanderson,Ona Grauer
9356,5491,en,"In the year 3000, man is no match for the Psyc...",8.795,2000-05-12,Battlefield Earth,3.2,621,Action,Adventure,Roger Christian,John Travolta,Barry Pepper,Forest Whitaker,Kim Coates,Sabine Karsenti
9357,11059,en,"Set on an island off the coast, a techno rave ...",10.019,2003-04-11,House of the Dead,3.2,280,Horror,Action,Uwe Boll,Jonathan Cherry,Tyron Leitso,Clint Howard,Ona Grauer,Michael Eklund
9358,14164,en,The young warrior Son Goku sets out on a quest...,50.279,2009-03-12,Dragonball Evolution,2.9,1567,Action,Adventure,James Wong,Justin Chatwin,Chow Yun-Fat,Joon Park,Jamie Chung,Emmy Rossum
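
Both merges above use the default inner join of `merge`, which silently drops any movie whose id is missing from either side. The sketch below, on toy frames with hypothetical names, spells out the join type and adds the `validate` argument so that pandas raises an error if an id is duplicated on either side.

```python
import pandas as pd

# Toy frames: id 19404 has no director row, id 724089 has no movie row
movies = pd.DataFrame({
    'id': [19404, 278, 238],
    'title': ['Dilwale Dulhania Le Jayenge', 'The Shawshank Redemption', 'The Godfather'],
})
directors = pd.DataFrame({
    'id': [278, 238, 724089],
    'director': ['Frank Darabont', 'Francis Ford Coppola', 'Tosca Musk'],
})

# how='inner' keeps only ids found in both frames;
# validate='one_to_one' raises MergeError if an id repeats in either frame
merged = movies.merge(directors, on='id', how='inner', validate='one_to_one')

print(merged)
print(f'{len(movies) - len(merged)} movie(s) dropped by the join')
```

Comparing the row counts before and after the merge, as in the last line, is a quick way to see whether any movies were lost.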


## 4. Creating a Combined Feature

To relate any two movies, we need a measure that quantifies how alike they are. A similarity score does this by treating each movie as a vector whose dimensions represent its features and then comparing the two vectors. Vectors that lie close together indicate highly similar movies, whereas vectors that lie far apart indicate dissimilar ones.
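
As a concrete instance of such a score, the sketch below computes cosine similarity, the measure used later in this notebook, by hand with NumPy. The helper `cosine` and the three word-count vectors are hypothetical and exist only for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word counts over a shared four-word vocabulary
movie_a = [1, 1, 0, 1]
movie_b = [1, 1, 1, 0]  # shares two words with movie_a
movie_c = [0, 0, 1, 0]  # shares no words with movie_a

print(cosine(movie_a, movie_b))
print(cosine(movie_a, movie_c))
```

With non-negative counts the score ranges from 0 (no words in common) to 1 (identical direction), so a higher value means a closer match.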

Let us first create a feature that combines several others. Here, we concatenate the features _'title'_, _'original_language'_, _'genre_1'_, _'genre_2'_, _'director'_, _'actor_1'_, _'actor_2'_, _'actor_3'_, _'actor_4'_ and _'actor_5'_ to create _'combined_features'_. So if we feed a movie to our model, the recommended movies are likely to be in the same language, belong to a similar genre and have either a similar cast or the same director.

In [12]:
### COMBINING FEATURES FOR SIMILARITY SCORES
def combine_features(data):
    return data['title'] + ' ' + data['original_language'] + ' ' + data['genre_1'] + ' ' + data['genre_2']  + ' '+ data['director'] + ' ' + data['actor_1'] + ' ' + data['actor_2'] + ' ' + data['actor_3'] + ' ' + data['actor_4'] + ' ' + data['actor_5']
df['combined_features'] = df.apply(combine_features, axis = 1)
df

Unnamed: 0,id,original_language,overview,popularity,release_date,title,vote_average,vote_count,genre_1,genre_2,director,actor_1,actor_2,actor_3,actor_4,actor_5,combined_features
0,19404,hi,"Raj is a rich, carefree, happy-go-lucky second...",24.222,1995-10-20,Dilwale Dulhania Le Jayenge,8.7,3253,Comedy,Drama,Aditya Chopra,Shah Rukh Khan,Kajol,Amrish Puri,Anupam Kher,Satish Shah,Dilwale Dulhania Le Jayenge hi Comedy Drama Ad...
1,278,en,Framed in the 1940s for the double murder of h...,67.359,1994-09-23,The Shawshank Redemption,8.7,20172,Drama,Crime,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,Clancy Brown,The Shawshank Redemption en Drama Crime Frank ...
2,238,en,"Spanning the years 1945 to 1955, a chronicle o...",62.603,1972-03-14,The Godfather,8.7,15112,Drama,Crime,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Robert Duvall,Richard S. Castellano,The Godfather en Drama Crime Francis Ford Copp...
3,724089,en,Professor Gabriel Emerson finally learns the t...,10.796,2020-07-31,Gabriels Inferno Part II,8.7,1334,Romance,,Tosca Musk,Melanie Zanetti,Giulio Berruti,James Andrew Fraser,Margaux Brooke,Agnes Albright,Gabriels Inferno Part II en Romance None Tosca...
4,761053,en,The final part of the film adaption of the ero...,34.804,2020-11-19,Gabriels Inferno Part III,8.6,901,Romance,Comedy,Tosca Musk,Melanie Zanetti,Giulio Berruti,Rhett Wellington,James Andrew Fraser,Margaux Brooke,Gabriels Inferno Part III en Romance Comedy To...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9355,12142,en,Edward Carnby is a private investigator specia...,14.365,2005-01-28,Alone in the Dark,3.2,429,Action,Fantasy,Uwe Boll,Christian Slater,Tara Reid,Stephen Dorff,Will Sanderson,Ona Grauer,Alone in the Dark en Action Fantasy Uwe Boll C...
9356,5491,en,"In the year 3000, man is no match for the Psyc...",8.795,2000-05-12,Battlefield Earth,3.2,621,Action,Adventure,Roger Christian,John Travolta,Barry Pepper,Forest Whitaker,Kim Coates,Sabine Karsenti,Battlefield Earth en Action Adventure Roger Ch...
9357,11059,en,"Set on an island off the coast, a techno rave ...",10.019,2003-04-11,House of the Dead,3.2,280,Horror,Action,Uwe Boll,Jonathan Cherry,Tyron Leitso,Clint Howard,Ona Grauer,Michael Eklund,House of the Dead en Horror Action Uwe Boll Jo...
9358,14164,en,The young warrior Son Goku sets out on a quest...,50.279,2009-03-12,Dragonball Evolution,2.9,1567,Action,Adventure,James Wong,Justin Chatwin,Chow Yun-Fat,Joon Park,Jamie Chung,Emmy Rossum,Dragonball Evolution en Action Adventure James...
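
In the output above, the movie without a second genre ends up with the word `None` in its _'combined_features'_ string, which a word-count model would treat as a feature shared by every such movie. A minimal sketch of one way to avoid this is to blank out missing values and skip them when joining; the toy frame and the shortened `feature_cols` list are assumptions made for brevity.

```python
import pandas as pd

# Toy frame with one missing genre_2, mirroring row 3 above
toy = pd.DataFrame({
    'title': ['Gabriels Inferno Part II', 'The Godfather'],
    'genre_1': ['Romance', 'Drama'],
    'genre_2': [None, 'Crime'],
    'director': ['Tosca Musk', 'Francis Ford Coppola'],
})
feature_cols = ['title', 'genre_1', 'genre_2', 'director']

# Replace missing values with '' and join only the non-empty pieces
toy['combined_features'] = (
    toy[feature_cols]
    .fillna('')
    .apply(lambda row: ' '.join(value for value in row if value), axis=1)
)

print(toy['combined_features'].tolist())
```

Whether this changes the recommendations noticeably depends on how many movies lack a second genre, which this notebook does not report.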


## 5. Calculating Similarity Scores

Next, we shall create a Document-Term Matrix (DTM) from the _'combined_features'_ column. Each row represents a movie and each column represents a word that appears in the _'combined_features'_ value of at least one movie. Thus, the _(i,j)_-th entry of the DTM is the number of times the _j_-th word appears in the _'combined_features'_ value of the _i_-th movie. To build it, we pass the _'combined_features'_ column through _CountVectorizer_.

We then feed this DTM into the _cosine_similarity_ function to get the similarity scores.

In [13]:
### CREATING A DTM AND CALCULATING SIMILARITY SCORES
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(df['combined_features']) #creating dtm
similarity = cosine_similarity(count)
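
To make the DTM concrete, here is a small sketch that runs the same two steps on two shortened feature strings (the strings are abbreviated from the table above; `docs`, `dtm`, `vocab` and `sim` are illustrative names).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two shortened 'combined_features' strings
docs = [
    'The Godfather en Drama Crime',
    'The Shawshank Redemption en Drama Crime',
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Column order of the DTM: vocabulary words sorted by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(dtm.toarray())

# Pairwise scores: sim[i, j] compares document i with document j
sim = cosine_similarity(dtm)
print(sim)
```

With its default settings, _CountVectorizer_ lowercases the text and keeps only tokens of two or more word characters, so a multi-word name such as _Frank Darabont_ contributes two separate columns and a middle initial is dropped. The resulting similarity matrix is square and symmetric, with 1.0 on the diagonal.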

## 6. Creating Functions for Easy Flow of Data

Before going forward, let us take a moment to create two functions which will help us down the line:
1. _get_id_from_title_, which will take a movie title and return its unique id.
2. _get_title_from_id_, which will take a movie id and return its title.

In [14]:
### CREATION OF FUNCTIONS
def get_id_from_title(title):
    return df[df['title'] == title]['id'].values[0]

def get_title_from_id(movie_id):
    return tmdb_movies[tmdb_movies['id'] == movie_id]['title'].values[0]
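
Both helpers index `.values[0]`, which raises a bare `IndexError` when the title or id is not in the dataset. A minimal sketch of a variant with a clearer error message follows; the function name `get_id_from_title_safe`, its `data` parameter and the toy frame are hypothetical.

```python
import pandas as pd

# Toy stand-in for df, using cleaned titles and ids from this notebook
toy_df = pd.DataFrame({'id': [557, 558], 'title': ['SpiderMan', 'SpiderMan 2']})

def get_id_from_title_safe(title, data=toy_df):
    """Return the id of the first movie with this (cleaned) title."""
    matches = data.loc[data['title'] == title, 'id']
    if matches.empty:
        raise KeyError(f'No movie titled {title!r} in the dataset')
    return int(matches.iloc[0])

print(get_id_from_title_safe('SpiderMan'))
```

Like the original, this returns only the first match, so two different movies that share a title cannot be told apart by title alone.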

## 7. Getting Movie Recommendations I

Now, we shall feed in a movie name so that we can get our recommendations. We choose _Spider-Man_ as the title we want recommendations for. We first remove any punctuation from the title and then get its unique id from the TMDB dataset.

In [15]:
### CLEANING THE USER DEFINED MOVIE TITLE
movie_user = 'Spider-Man'
movie_user = re.sub(r'[^\w\s]+', '', movie_user) # removing punctuation
movie_index = get_id_from_title(movie_user) #getting the unique id for the user defined movie
movie_index

557

Next, we look up the row index of this movie in _df_ and use it to extract the corresponding row of the similarity matrix, which holds the scores between this movie and every movie in the dataset, itself included. We reshape this row into a column and create a pandas dataframe, _df_similarity_. We then attach the unique movie ids from _df_ to _df_similarity_.

In [16]:
### GETTING SIMILARITY SCORES FOR THE USER DEFINED MOVIE
df_similarity = pd.DataFrame(similarity[df[df['id'] == movie_index].index].reshape(-1, 1), columns = ['cosine_similarity'])
df_similarity['id'] = df['id']
df_similarity

Unnamed: 0,cosine_similarity,id
0,0.000000,19404
1,0.058926,278
2,0.117851,238
3,0.111803,724089
4,0.111803,761053
...,...,...
9355,0.172062,12142
9356,0.121268,5491
9357,0.114708,11059
9358,0.176777,14164


Now, we sort this dataframe in decreasing order of similarity score.

In [17]:
### SORTING THE DATAFRAME IN DESCENDING ORDER OF SIMILARITY SCORES
df_similarity = df_similarity.sort_values(by = 'cosine_similarity', ascending = False, ignore_index = True)
df_similarity.head(10)

Unnamed: 0,cosine_similarity,id
0,1.0,557
1,0.8125,558
2,0.727607,559
3,0.375,19901
4,0.33541,68728
5,0.3125,297802
6,0.294628,252178
7,0.294174,371638
8,0.279508,10839
9,0.279508,346672


Lastly, we take the ids of the 20 movies with the highest similarity scores, skipping the first row (the movie itself, with a score of 1.0), and use them to look up the titles. These are the movies recommended by the model.

In [18]:
### PRINTING THE TOP 20 RECOMMENDED MOVIES
for movie in range(1, 21):
    movie_id = df_similarity['id'][movie]
    print(get_title_from_id(movie_id))

Spider-Man 2
Spider-Man 3
Daybreakers
Oz the Great and Powerful
Aquaman
'71
The Disaster Artist
Cross of Iron
Underworld: Blood Wars
Avatar
Jumanji
Homefront
Flyboys
Capricorn One
Spectre
Brothers
Wimbledon
Skyfall
Skinwalkers
K-9


The recommendations are good, but we would like to improve the model so that Spider-Man movies outside the Raimi trilogy also appear in our recommendations.

## 8. Getting Movie Recommendations II

First, we make a function that automates the whole process described above: it takes a movie title, looks up the similarity scores, and prints 20 recommendations.

In [19]:
def get_recommendations(movie_user):
    movie_user = re.sub(r'[^\w\s]+', '', movie_user) # removing punctuation
    movie_index = get_id_from_title(movie_user) #getting the unique id for the user defined movie
    df_similarity = pd.DataFrame(similarity[df[df['id'] == movie_index].index].reshape(-1, 1), columns = ['cosine_similarity'])
    df_similarity['id'] = df['id']
    df_similarity = df_similarity.sort_values(by = 'cosine_similarity', ascending = False, ignore_index = True)
    for movie in range(1, 21):
        movie_id = df_similarity['id'][movie]
        print(get_title_from_id(movie_id))
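
The function above prints a fixed 20 titles and assumes the queried movie sorts into position 0. The sketch below is one possible variant that returns a list, takes the number of results as a parameter, and excludes the queried movie by its row position. The name `recommend`, the toy frame and the toy matrix are all hypothetical; the scores 0.8125, 0.727607 and 0.0 come from the outputs above, and the remaining off-diagonal score (0.7) is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df and similarity
toy_df = pd.DataFrame({
    'id': [557, 558, 559, 19404],
    'title': ['SpiderMan', 'SpiderMan 2', 'SpiderMan 3', 'Dilwale Dulhania Le Jayenge'],
})
toy_similarity = np.array([
    [1.0,      0.8125, 0.727607, 0.0],
    [0.8125,   1.0,    0.7,      0.0],
    [0.727607, 0.7,    1.0,      0.0],
    [0.0,      0.0,    0.0,      1.0],
])

def recommend(title, n=2, data=toy_df, sim=toy_similarity):
    """Return the n titles most similar to `title`, excluding the movie itself."""
    # Row position of the query; assumes data has a default 0..N-1 index
    row = data.index[data['title'] == title][0]
    # Row positions sorted by decreasing similarity to the query
    order = np.argsort(-sim[row], kind='stable')
    # Drop the query by position instead of assuming it is ranked first
    order = order[order != row][:n]
    return data['title'].iloc[order].tolist()

print(recommend('SpiderMan'))
```

Returning a list rather than printing makes the result easier to test and to reuse, for example to compare the recommendations of two model versions.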

Let us now try to get recommendations for _The Dark Knight_. 

In [20]:
get_recommendations('The Dark Knight')

The Dark Knight Rises
Batman Begins
The Prestige
True Romance
The Rum Diary
The New World
I'm Not There
The Next Karate Kid
Murder in the First
The Machinist
The Man Who Would Be King
Batman: The Dark Knight Returns, Part 2
Batman: The Dark Knight Returns, Part 1
The Woman in the Window
The Seeker: The Dark Is Rising
Zulu
The Trial of the Chicago 7
The Getaway
The Courier
The Patriot


While the recommendations do contain the other two movies from Nolan's trilogy, we would like to see more Batman movies there.