## **CONTENT BASED - MOVIE RECOMMENDATION SYSTEM**

### **Overview**


In recent years, the movie industry has experienced tremendous growth with the creation of various platforms like Netflix, Hulu, and Amazon Prime. These platforms are used to watch movies, and the most significant advantage is the ability to watch whatever one wants to watch at any time. However, with the availability of thousands of movies on these platforms, it can be difficult to decide on what to watch. This problem is what a movie recommender system aims to solve.

In this project, we will build a content-based movie recommender system using the TMDB 5000 dataset available on Kaggle. The TMDB 5000 dataset contains information about movies, including the title, movie overview, genres, keywords, cast and crew.

We will suggest movies based on similarity between movies.

#### **Step 1: Data Collection / Extraction**
The first step in building a content-based movie recommender system is to collect the data. We can download the TMDB 5000 dataset from Kaggle. The dataset contains two files: a credits file and a movies file. The credits file contains information about the cast and crew, while the movies file contains information about the movies.

#### **Step 2: Data Preprocessing**
After downloading the dataset, the next step is to preprocess the data. We will merge the credits and movies files into one file, which we will use to build our recommender system. We will also remove duplicates and missing values and convert the text data into lowercase.

#### **Step 3: Feature Extraction**
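The next step is to extract features from the text data. We will use the TF-IDF vectorizer (and, for comparison, a simple bag-of-words count vectorizer) to extract features from the combined text of each movie: plot overview, genres, keywords, cast and director. TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects how important a word is to a document in a collection or corpus: a word scores high when it is frequent in one document but rare across the rest.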
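
To make the idea concrete, here is a minimal sketch of a textbook TF-IDF computation on a made-up three-document corpus. It assumes the plain formula tf × ln(N / df); scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus: three tiny, made-up "movie descriptions" (not taken from the dataset).
docs = [
    "space alien battle",
    "space travel romance",
    "bank robbery crime",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Return {word: tf * idf} for one document, using idf = ln(N / df)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = tf_idf(tokenized[0])
# "space" occurs in two documents, so it gets a lower weight than "alien",
# which occurs only in the first document.
print(weights)
```

Words that appear in every document would get a weight of zero under this formula, which is the sense in which TF-IDF down-weights uninformative words.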

#### **Step 4: Similarity Calculation**
After feature extraction, we will calculate the similarity between movies based on the features extracted. We will use the cosine similarity metric to calculate the similarity between movies. The cosine similarity metric measures the cosine of the angle between two vectors in a multidimensional space. In general it ranges from -1 to 1; because TF-IDF and count vectors contain no negative values, the scores here fall between 0 and 1, with 1 being the most similar.

#### **Step 5: Building the Recommender System**
The final step is to build the recommender system. We will use the cosine similarity scores to recommend movies that are similar to the input movie. When a user inputs a movie, the system will retrieve the top 5 similar movies based on the cosine similarity scores.

### **1. Data Extraction**

The TMDB 5000 dataset is available on Kaggle and can be downloaded from this link: https://www.kaggle.com/tmdb/tmdb-movie-metadata. The dataset contains two files: a credits file and a movies file. The credits file contains information about the cast and crew, while the movies file contains information about the movies.

In [3]:
#IMPORT LIBRARIES 
import pandas as pd
import numpy as np

In [4]:
#IMPORTING DATA - Load the movies and credits dataset
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [5]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [6]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### **2. Data Preprocessing**

After downloading the dataset, the next step is to preprocess the data. We will merge the credits and movies files into one file, which we will use to build our recommender system.

In [7]:
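#merge the two dataframes using the movie 'title'
#note: this assumes titles are unique; movies that share a title would be matched with each other's credits
movies_data = movies.merge(credits, on = 'title')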

In [8]:
movies_data.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


**Feature Selection**

We want to create a Movie recommender system based on content.

We will perform feature selection to carefully select the features which will help us learn something about the movie and make good recommendations.

We are performing feature selection based on that intuition and keeping our end goal in mind:

* budget - we don't want to recommend movies based on budget (movies with similar budgets are not necessarily similar in any other way)
* genres ✔ - we will keep genres, as they help group movies and give better recommendations
* homepage - the movie's homepage is not relevant for identifying similarity
* movie_id ✔ - useful if we later want to look up other information about the movie, so we'll keep it
* id - same as movie_id (already included)
* keywords ✔ - in our dataset these are the tags associated with the movie, which can be used to identify similar movies
* original_language - most of the movies in our dataset are in English, so this will not provide much insight
* original_title - it can contain the movie name in languages other than English, which adds little insight and may confuse the model
* overview ✔ - the summary of the movie, which gives us a lot of information about it
* popularity, production_companies, production_countries, release_date, revenue - these features can be useful in some cases, but we will not include them, to keep our model simple and focused on movie content and the people involved in it
* runtime, spoken_languages, status - do not give much information about movie similarity
* tagline - this feature can be a little vague and not representative of the movie
* title ✔ - the title of the movie in English
* vote_average, vote_count - as we are recommending based on movie content, we will not include these metrics
* cast ✔ - we will use the information about the people involved in the movie to recommend similar movies based on cast
* crew ✔ - crew includes people such as the director of the movie, which can be used to make recommendations

In [9]:
#SELECT THE FEATURES MENTIONED AS SELECTED ABOVE

movies_data = movies_data[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [10]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


**Dealing with Null and Duplicate Values**

In the preprocessing step, we will also check for any null or duplicate values and then remove duplicates and missing values.

**Check for NULL values**

In [11]:
movies_data.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

We have 3 records with a null overview. Since the overview is an important column and each movie's summary is unique, we will not impute or replace this data. Instead we remove these records from our dataset, as we don't have enough information about them.

In [12]:
movies_data.dropna(inplace = True)

In [13]:
movies_data.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

We have successfully removed the records with null values.

**Check for Duplicate Data**

In [14]:
#this gives us the number of identical (duplicate) records in our working dataframe
movies_data.duplicated().sum()

0

We can see that there are no duplicate records in our data.

#### **Transform / Convert data**

We want to create a new data frame with 3 columns: movie_id, title, and a column holding all the information we have about the movie, which we will use to find similar movies.

To do that we first need to perform some essential steps:

Extract and convert the available data into an easier format (a list of words) so that we can later combine the individual pieces to represent all the information together.

In [15]:
movies_data.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Right now we have a string that encodes a list of dictionaries, and we want to convert it into a list of plain names.

In [16]:
#create a function to extract the needed information from the list of dictionaries, which is stored as a string
import ast

def extract_list(record_feature):
    result_list = []
    
    #the data in such features is stored as a string, so we first parse it into a list;
    #ast.literal_eval only evaluates Python literals, which makes it safer than eval()
    for val in ast.literal_eval(record_feature):
        result_list.append(val['name'])
    
    return result_list

In [17]:
#Create a list of genres names of a movie
movies_data['genres'] = movies_data['genres'].apply(extract_list)

In [18]:
movies_data['keywords'] = movies_data['keywords'].apply(extract_list)

In [19]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


We now have 'genres' and 'keywords' stored as lists of words, and we have to apply a similar process to the cast feature.

For this feature we will extract the names of the first 3 cast members.
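
The extraction described above can be sketched on the sample 'genres' string shown earlier. This sketch assumes the column values are valid JSON (they appear to be, judging by the sample) and uses `json.loads` instead of `eval()`; the function name `extract_names` and its `limit` parameter are hypothetical, introduced only for illustration.

```python
import json

# Sample value copied from the 'genres' column shown above.
genres_raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'


def extract_names(raw, limit=None):
    """Parse a JSON-encoded list of dicts and return the 'name' values.

    `limit` (hypothetical parameter) keeps only the first `limit` entries;
    None keeps all of them.
    """
    entries = json.loads(raw)
    return [entry["name"] for entry in entries[:limit]]


print(extract_names(genres_raw))           # all four genre names
print(extract_names(genres_raw, limit=3))  # only the first three
```

The same slicing idea applies to the cast column, where only the first 3 entries are kept.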

In [20]:
#create a function to extract the names of the first 3 members from the list of dictionaries, which is stored as a string
import ast

def extract_list_of_3(record_feature):
    #the data in such features is stored as a string, so we first parse it into a list
    #(ast.literal_eval is a safer replacement for eval()) and keep only the first 3 entries
    return [val['name'] for val in ast.literal_eval(record_feature)[:3]]

In [21]:
#extract the names of top 3 cast members from each movie's cast
movies_data['cast'] = movies_data['cast'].apply(extract_list_of_3)

In [22]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


We have successfully cleaned the data in 3 columns: 'genres', 'keywords' and 'cast'.

Now we need to extract the name of the director from the 'crew' column, that is, the crew member whose job is 'Director'.

In [23]:
import ast

def extract_director(record_feature):
    result_list = []
    
    #the data in such features is stored as a string, so we first parse it into a list
    #(ast.literal_eval is a safer replacement for eval())
    for val in ast.literal_eval(record_feature):
        #keep only the first crew member whose job is 'Director'
        if val['job'] == 'Director':
            result_list.append(val['name'])
            break
    
    return result_list
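
As a small illustration of the edge cases, the sketch below runs the same idea on a made-up crew string. It assumes the crew entries are JSON objects with 'job' and 'name' keys; the helper name `first_director` and the placeholder names are hypothetical. A movie without a director entry yields an empty list, which still concatenates cleanly with the other list columns later.

```python
import json

# Made-up crew string in the same shape as the 'crew' column (placeholder names).
crew_raw = (
    '[{"job": "Editor", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}, '
    '{"job": "Director", "name": "Person C"}]'
)


def first_director(raw):
    """Return a one-element list with the first director's name, or [] if none is listed."""
    crew = json.loads(raw)
    director = next(
        (member["name"] for member in crew if member.get("job") == "Director"),
        None,
    )
    return [director] if director is not None else []


print(first_director(crew_raw))  # ['Person B']
print(first_director("[]"))      # []
```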

In [24]:
movies_data['crew'] = movies_data['crew'].apply(extract_director)

In [25]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


Now 4 of our columns that were previously stored as encoded strings are stored as lists of strings.

Similarly, we will convert the overview text of each movie into a list of words.

In [26]:
movies_data['overview'] = movies_data['overview'].apply(lambda x : x.split())

In [27]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


We have now converted all of our features into lists of words. We can combine them into one large collection of words per movie and then convert the combined list back into a string for further use.

First we have to remove the spaces inside multi-word values such as the name 'Johnny Depp'. Otherwise our model will treat 'Johnny' and 'Depp' as two separate tokens, while we want to represent the name as a single entity.
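
The effect can be seen in a small sketch with two illustrative cast lists (not rows from the dataset) that share first names but no actual person. The helper `tokens` is hypothetical and only mimics whitespace tokenization.

```python
# Two illustrative cast lists that share first names but no actual person.
cast_a = ["Johnny Depp", "Orlando Bloom"]
cast_b = ["Johnny Knoxville", "Orlando Jones"]


def tokens(names, merge_spaces):
    """Tokenize names the way a whitespace-based vectorizer would."""
    if merge_spaces:
        names = [name.replace(" ", "") for name in names]
    return set(" ".join(names).split())


# Without merging, the first names alone create a spurious overlap.
print(sorted(tokens(cast_a, False) & tokens(cast_b, False)))  # ['Johnny', 'Orlando']
# With merging, each person is one token and the overlap disappears.
print(sorted(tokens(cast_a, True) & tokens(cast_b, True)))    # []
```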

In [28]:
#to remove spaces ' ' we replace them with '' (the empty string)
#using lambda functions and list comprehensions

movies_data['genres'] = movies_data['genres'].apply(lambda x : [ i.replace(" ", "") for i in x])
movies_data['keywords'] = movies_data['keywords'].apply(lambda x : [ i.replace(" ", "") for i in x])
movies_data['cast'] = movies_data['cast'].apply(lambda x : [ i.replace(" ", "") for i in x])
movies_data['crew'] = movies_data['crew'].apply(lambda x : [ i.replace(" ", "") for i in x])

In [29]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


In [30]:
movies_data['all_info'] = movies_data['overview'] + movies_data['genres'] + movies_data['keywords'] + movies_data['cast'] + movies_data['crew'] 

In [31]:
movies_data.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,all_info
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


**Create New Data Frame with essential features**

We have cleaned our data into a form that a model can use, and now we will create a new data frame with the 3 required features:
* movie_id
* title
* all_info

In [32]:
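#take an explicit copy so that later column assignments do not trigger pandas' SettingWithCopyWarning
data = movies_data[['movie_id', 'title', 'all_info']].copy()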

In [33]:
data.head()

Unnamed: 0,movie_id,title,all_info
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


Convert the all_info column into a space-separated string.

In [None]:
data['all_info'] = data['all_info'].apply(lambda x : " ".join(x))

In [35]:
data.head()

Unnamed: 0,movie_id,title,all_info
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [36]:
#we can see the final string containing all information about the movie
data['all_info'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

Lowercasing the text ensures that the same word written with different capitalisation (for example 'Action' and 'action') is treated as one token, so we will convert all text data to lowercase.

In [None]:
data['all_info'] = data['all_info'].apply(lambda x : x.lower())

In [38]:
data.head()

Unnamed: 0,movie_id,title,all_info
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,following the death of district attorney harve...
4,49529,John Carter,"john carter is a war-weary, former military ca..."


### **3. Feature Extraction**

The next step is to extract features from the text data. We will use two vectorizers, TF-IDF and a simple bag of words (count vectorizer), to extract features from the overall movie information (all_info).

We need to perform feature extraction so that we can calculate the similarity between two movies.
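
As a rough picture of what a bag-of-words representation looks like, the sketch below builds count vectors by hand for two made-up descriptions. It is a simplified stand-in for `CountVectorizer` (no stop-word removal, no `max_features` or `min_df` filtering), and the helper name `count_vector` is hypothetical.

```python
from collections import Counter

# Two made-up descriptions (not rows from the dataset).
docs = ["space alien space battle", "alien romance"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [count_vector(doc) for doc in docs]
print(vocabulary)  # ['alien', 'battle', 'romance', 'space']
print(vectors)     # [[1, 1, 0, 2], [1, 0, 1, 0]]
```

Each movie becomes one row of numbers over a shared vocabulary, which is what makes a numeric similarity measure between movies possible.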

#### **Stemming**

Many words in the movie information share the same base form (for example, 'action' and 'actions' are the singular and plural of 'action'), so we will perform stemming with the 'nltk' library to reduce words to their base form and represent them consistently.

In [None]:
!pip install nltk

In [69]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [70]:
def stem(text):
    res = []
    
    #extract stem words from string
    for i in text.split():
        res.append(ps.stem(i))
    
    #convert it back to string
    return " ".join(res)

In [None]:
data['all_info'] = data['all_info'].apply(stem)

#### **Representing Movie information in form of vectors**

**Using TF-IDF**

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer instance
tfidf = TfidfVectorizer(stop_words = 'english', max_features = 5000, min_df = 3)

# Fit and transform the information text about movie - create a tf-idf representation of each movie info
tfidf_matrix = tfidf.fit_transform(data['all_info']).toarray()

In [74]:
tfidf_matrix.shape

(4806, 5000)

In [75]:
#We can take a look at the features that were extracted by vectorizers - (feature names)
tfidf_features = tfidf.get_feature_names_out() 

**Using Count Vectorizer (simple bag-of-words)**

In [76]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance / object
count_vect = CountVectorizer(max_features = 5000, stop_words = 'english', min_df = 3)

# Fit and transform the information text about movie - create a bag of words representation of each movie info
count_vect_matrix = count_vect.fit_transform(data['all_info']).toarray()

In [77]:
count_vect_matrix.shape

(4806, 5000)

In [78]:
#We can take a look at the features that were extracted by vectorizers - (feature names)
count_vect_features = count_vect.get_feature_names_out()

In [80]:
count_vect_features[:100] #the first 100 extracted features

array(['000', '007', '10', '100', '11', '12', '13', '14', '15', '16',
       '17', '17th', '18', '18th', '18thcenturi', '19', '1910', '1920',
       '1930', '1940', '1944', '1950', '1950s', '1960', '1960s', '1970',
       '1970s', '1971', '1974', '1976', '1980', '1985', '1990', '1999',
       '19th', '19thcenturi', '20', '200', '2003', '2009', '20th', '21st',
       '23', '24', '25', '30', '300', '3d', '40', '50', '500', '60', '70',
       '80', 'aaron', 'aaroneckhart', 'abandon', 'abduct',
       'abigailbreslin', 'abil', 'abl', 'aboard', 'abov', 'abus',
       'academ', 'academi', 'accept', 'access', 'accid', 'accident',
       'acclaim', 'accompani', 'accomplish', 'account', 'accus', 'ace',
       'achiev', 'acquaint', 'act', 'action', 'actionhero', 'activ',
       'activist', 'activities', 'actor', 'actress', 'actual', 'ad',
       'adam', 'adamsandl', 'adamshankman', 'adapt', 'add', 'addict',
       'adjust', 'admir', 'admit', 'adolesc', 'adopt', 'ador'],
      dtype=object)

### **4. Similarity Calculation**

After feature extraction, we will calculate the similarity between movies based on the features extracted. We will use the **cosine similarity metric** to calculate the similarity between movies. The cosine similarity metric measures the cosine of the angle between two vectors in a multidimensional space. In general it ranges from -1 to 1; because TF-IDF and count vectors contain no negative values, the scores here fall between 0 and 1, with 1 being the most similar.
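
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on made-up count vectors; the function name `cosine` and the zero-vector convention are assumptions of this sketch, not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Toy count vectors over a four-word vocabulary (made-up numbers).
movie_a = [1, 1, 0, 2]
movie_b = [1, 0, 1, 0]
movie_c = [0, 0, 3, 0]

print(round(cosine(movie_a, movie_a), 3))  # 1.0   (identical vectors)
print(round(cosine(movie_a, movie_b), 3))  # 0.289 (one shared word)
print(round(cosine(movie_a, movie_c), 3))  # 0.0   (no shared words)
```

Because the measure depends on the angle and not on vector length, a long description and a short one can still score as highly similar when they use the same words in similar proportions.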

In [81]:
from sklearn.metrics.pairwise import cosine_similarity

Compute the cosine similarity matrix

We are calculating the pairwise similarities between all samples in count_vect_matrix

In [82]:
cosine_sim_count_vect = cosine_similarity(count_vect_matrix)

In [83]:
cosine_sim_count_vect.shape

(4806, 4806)

We are calculating the pairwise similarities between all samples in tfidf_matrix

In [85]:
cosine_sim_tfidf = cosine_similarity(tfidf_matrix)

In [86]:
cosine_sim_tfidf.shape

(4806, 4806)

### **5. Building the Recommender System**

The final step is to build the recommender system. We will use the cosine similarity scores to recommend movies that are similar to the input movie. When a user inputs a movie, the system will retrieve the top 5 similar movies based on the cosine similarity scores.

Define a function to get recommendations based on cosine similarity scores

In [91]:
#Using this syntax we can fetch the index of the movie in our dataframe
data[data['title'] == 'Avatar'].index[0]

0

In [92]:
def get_recommendations(movie_title, cosine_sim = cosine_sim_count_vect):
    
    #Get the row position (not the index label) of the movie that matches the title given as argument
    #rows were dropped earlier, so index labels no longer line up with the rows of cosine_sim
    matches = np.flatnonzero((data['title'].str.lower() == movie_title.lower()).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Movie '{movie_title}' was not found in the dataset")
    idx = matches[0]
    
    #Get similarity score for all the movies with respect to this movie
    #we use enumerate to maintain the position information about the movies 
    similarity_scores = list(enumerate(cosine_sim[idx]))
    
    #Sort the movies according to the similarity scores which is stored at index 1 for each movie because of enumerate
    sim_scores_sorted = sorted(similarity_scores, reverse = True, key = lambda x : x[1]) 
    
    #Select the top 5 similar Movies (position 0 is normally the input movie itself, so we skip it)
    top_sim_scores = sim_scores_sorted[1 : 6]
    
    #Get the positions of the selected top similar movies 
    top_movie_indices = [i[0] for i in top_sim_scores]
    
    #Return the names of top 5 similar Movies
    return data['title'].iloc[top_movie_indices]

To use the recommender system, we can call the get_recommendations() function and pass in the title of a movie that we like. The function will return the 5 movies most similar to the input movie, based on all the information we have: movie plot (overview), genres, keywords, cast and crew.
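
The ranking logic can be traced on a toy example. The sketch below uses a made-up 4×4 similarity matrix and placeholder titles; the function name `top_similar` is hypothetical. It excludes the query movie by position instead of skipping the first sorted entry, which also behaves well if another movie happens to tie with a score of 1.0.

```python
# Toy similarity matrix for four made-up titles; row i holds the scores of movie i.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]


def top_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    scored = [(i, score) for i, score in enumerate(similarity[idx]) if i != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scored[:k]]


print(top_similar("Movie A"))  # ['Movie B', 'Movie D']
print(top_similar("Movie C"))  # ['Movie D', 'Movie B']
```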

In [96]:
get_recommendations('The Dark Knight', cosine_sim_tfidf)

3                         The Dark Knight Rises
428                              Batman Returns
119                               Batman Begins
299                              Batman Forever
3861    Batman: The Dark Knight Returns, Part 2
Name: title, dtype: object

In [98]:
get_recommendations('The Dark Knight', cosine_sim_count_vect)

3       The Dark Knight Rises
119             Batman Begins
428            Batman Returns
299            Batman Forever
1363                   Batman
Name: title, dtype: object

As you can see, the recommender system recommends movies that are similar to "The Dark Knight", such as "The Dark Knight Rises", "Batman Begins", and "Batman Forever".

### **Export the Needed similarity matrix and movie data for later deployment and use**

We can export all the data needed for making future predictions, so that we don't have to recompute it every time.

In [102]:
#Export the movies data as a CSV file
data.to_csv("movies_data.csv")

In [106]:
#Export the cosine similarity matrices computed for the movies
pd.DataFrame(cosine_sim_count_vect).to_csv('cosine_sim_count_vect.csv')
pd.DataFrame(cosine_sim_tfidf).to_csv('cosine_sim_tfidf.csv')
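
As an alternative to CSV, which can become large for a dense similarity matrix, the titles and the matrix could be stored together in a single binary file. The sketch below shows a save and load round trip with Python's `pickle` on toy stand-in data; the file name and the dictionary keys are hypothetical, and pickle files should only be loaded when they come from a trusted source.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real objects (made-up values, not the notebook's data).
toy_titles = ["Movie A", "Movie B"]
toy_similarity = [[1.0, 0.8], [0.8, 1.0]]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "recommender_toy.pkl")

# Save both objects together so they cannot get out of sync.
with open(path, "wb") as f:
    pickle.dump({"titles": toy_titles, "similarity": toy_similarity}, f)

# Later, for example in a deployment script, load them back.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["titles"] == toy_titles and loaded["similarity"] == toy_similarity)  # True
```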

Let's try some more examples

In [117]:
#List of first 10 movies in our dataset
movies.title.head(10)

0                                      Avatar
1    Pirates of the Caribbean: At World's End
2                                     Spectre
3                       The Dark Knight Rises
4                                 John Carter
5                                Spider-Man 3
6                                     Tangled
7                     Avengers: Age of Ultron
8      Harry Potter and the Half-Blood Prince
9          Batman v Superman: Dawn of Justice
Name: title, dtype: object

In [108]:
get_recommendations('Avatar')

1216    Aliens vs Predator: Requiem
2409                         Aliens
3731                  Falcon Rising
507                Independence Day
539                      Titan A.E.
Name: title, dtype: object

In [109]:
get_recommendations('Superman returns')

870                          Superman II
1299                        Superman III
2439    Superman IV: The Quest for Peace
813                             Superman
232                        The Wolverine
Name: title, dtype: object

In [111]:
get_recommendations('John Carter')

1322                     Riddick
3094                      Krrish
3378    The Other Side of Heaven
610       The Legend of Hercules
1257                  Get Carter
Name: title, dtype: object

In [113]:
get_recommendations('spider-man 3')

30                  Spider-Man 2
159                   Spider-Man
38      The Amazing Spider-Man 2
20        The Amazing Spider-Man
1540               Arachnophobia
Name: title, dtype: object

In [114]:
get_recommendations('Harry Potter and the Half-Blood Prince')

113    Harry Potter and the Order of the Phoenix
114          Harry Potter and the Goblet of Fire
276      Harry Potter and the Chamber of Secrets
197     Harry Potter and the Philosopher's Stone
191     Harry Potter and the Prisoner of Azkaban
Name: title, dtype: object

In [115]:
get_recommendations('Avengers: Age of Ultron')

31       Iron Man 3
79       Iron Man 2
68         Iron Man
129            Thor
16     The Avengers
Name: title, dtype: object

In [116]:
get_recommendations('Tangled')

2315               Out of Inferno
269     The Princess and the Frog
255             Home on the Range
1682               Animals United
42                    Toy Story 3
Name: title, dtype: object

WOW Those are some amazing recommendations🥰