## Content-Based Movie Recommendation Engine on the TMDB Dataset

<b>About:</b>
    
Recommendation systems are now central to most large online platforms. They help users make quicker, more personalized decisions and help businesses improve conversions. A business may offer millions, if not billions, of products, and a user browsing unaided is unlikely to find what they want. Recommendation systems help organisations bridge this gap.
There are three main types of recommendation systems:
1. Content-based: recommendations are generated from the similarity of the content consumed (e.g., Spotify, Netflix).
2. Collaborative: recommendations are generated from the similarity between users (e.g., Facebook, Instagram).
3. Hybrid: combines both of the above approaches (e.g., most e-commerce websites are now adopting this approach).

<b>Steps Involved:</b>

1. Data Fetching
2. Pre-Processing
3. Model Building
4. A Working Website 
5. Deploying on Heroku

<b>Dataset: </b> https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?resource=download

In [31]:
#importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

In [32]:
#Loading the dataset
movies = pd.read_csv('tmdb_5000_movies.csv') #first dataframe
credits = pd.read_csv('tmdb_5000_credits.csv')  #second dataframe

In [33]:
#Birds-Eye view of the first dataframe
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [34]:
#Dimensionality of the first dataframe
movies.shape

(4803, 20)

In [35]:
#Data Types Involved in the first dataframe
movies.dtypes

budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average            float64
vote_count                int64
dtype: object

In [36]:
#Now second dataframe
credits.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [37]:
#Data Types involved in the second dataframe
credits.dtypes

movie_id     int64
title       object
cast        object
crew        object
dtype: object

In [39]:
#Merging the 2 dataframes on 'title' (titles are not unique, so the merge yields a few extra rows)
movies = movies.merge(credits,on='title')

In [40]:
#New dataframe
movies.shape

(4809, 23)
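
The row count grows from 4803 to 4809 because a handful of titles occur more than once, and a join on a non-unique key pairs every matching row from both sides. The sketch below uses two small invented frames (`movies_toy` and `credits_toy` are hypothetical names, not part of the dataset) to show the effect, and how joining on the ID column would avoid it. Whether to switch keys for the real data is a judgment call; the notebook keeps the title join.

```python
import pandas as pd

# Toy stand-ins for the two TMDB frames; the rows are made up for illustration.
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],  # two different films share a title
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on a non-unique key yields every pairing of the duplicated title.
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the unique ID keeps exactly one row per movie.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"), left_on="id", right_on="movie_id"
)

print(len(by_title), len(by_id))
```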

In [41]:
movies.sample()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
4715,0,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",,13187,"[{""id"": 65, ""name"": ""holiday""}, {""id"": 207317,...",en,A Charlie Brown Christmas,When Charlie Brown complains about the overwhe...,8.701183,"[{""name"": ""Warner Bros. Home Video"", ""id"": 5173}]",...,25.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,"That's what Christmas is all about, Charlie Br...",A Charlie Brown Christmas,7.5,153,13187,"[{""cast_id"": 2, ""character"": ""Freida (voice)"",...","[{""credit_id"": ""52fe454b9251416c75051a75"", ""de..."


<b>Columns to be removed because they may not contribute to the content-based tagging:</b> 
budget, homepage, id, original_language, original_title, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, vote_average, vote_count

In [42]:
#Pertinent Columns 
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [43]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   int64 
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: int64(1), object(6)
memory usage: 300.6+ KB


In [44]:
#Checking for missing data
movies.isna().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [45]:
#Drop the nulls
movies.dropna(inplace=True)

In [46]:
#Checking...
movies.isna().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64
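
Dropping rows leaves gaps in the index labels, so a row's label and its position in the frame can differ afterwards. This matters whenever a row's place in the frame is used to index a separate NumPy array, as the similarity matrix is later. A minimal sketch on an invented three-row frame (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["x", None, "z"]})
toy = toy.dropna()  # removes row "B", but "C" keeps its original label

label = toy[toy["title"] == "C"].index[0]  # index label, unchanged by dropna
position = toy.index.get_loc(label)        # row position in the filtered frame

print(label, position)

# reset_index(drop=True) makes labels and positions agree again.
toy_reset = toy.reset_index(drop=True)
print(list(toy_reset.index))
```

Either resetting the index after `dropna` or looking rows up by position keeps the frame and any NumPy array built from it aligned.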

In [66]:
#Checking for duplicates
movies.duplicated().sum()

In [47]:
#Function that converts a stringified list of dictionaries to a list of names
import ast #literal_eval safely parses the string representation of a list into a real list (abstract syntax tree module)
def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L 
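
To make the conversion concrete, here is a small sketch of what `ast.literal_eval` does to one cell. The `sample` string is hand-written in the same shape as the `genres` column, not copied from the dataset.

```python
import ast

# A shortened, hand-written sample in the same shape as the 'genres' column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(sample)            # string -> list of dicts
names = [entry["name"] for entry in parsed]  # keep only the 'name' values

print(type(parsed).__name__, names)
```

Unlike `eval`, `literal_eval` only accepts Python literals, which makes it a safer choice for parsing data read from a file.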

In [48]:
#Function that fetches the name of the director from 'crew'.
def director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 
movies['crew'] = movies['crew'].apply(director)

In [49]:
#Applying the above function on 'genres','keywords' and 'cast' respectively.
movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)
movies['cast'] = movies['cast'].apply(convert)
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...",[Gore Verbinski]


In [50]:
#Keeping only the first three cast members
movies['cast'] = movies['cast'].apply(lambda x:x[0:3])

In [51]:
#Removing spaces in 'crew','cast','genres','keywords' respectively for accurate tagging.
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
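
The reason for removing spaces is that a word-level tokenizer would otherwise split a full name into separate tokens, so two different people who share a first name would appear to have something in common. A minimal sketch using plain whitespace splitting as a stand-in for the tokenizer:

```python
# Two different people who share a first name.
names = ["Sam Worthington", "Sam Mendes"]

# With spaces kept, whitespace tokenization produces a shared token "sam".
tokens_with_spaces = [set(n.lower().split()) for n in names]
shared_before = tokens_with_spaces[0] & tokens_with_spaces[1]

# With spaces removed, each full name becomes a single distinct token.
tokens_joined = [{n.replace(" ", "").lower()} for n in names]
shared_after = tokens_joined[0] & tokens_joined[1]

print(shared_before)
print(shared_after)
```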

In [52]:
#Splitting the overview into individual words
movies['overview'] = movies['overview'].apply(lambda x:x.split()) 

In [53]:
movies.sample()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
2328,21301,Barbershop 2: Back in Business,"[The, continuing, adventures, of, the, barbers...","[Comedy, Drama]",[blaxploitation],"[IceCube, CedrictheEntertainer, SeanPatrickTho...",[KevinRodneySullivan]


In [54]:
#A unified column for tagging 
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [55]:
#Creating a new dataframe having only 3 columns: 'movie_id','title', and 'tags'.
final = movies.drop(columns=['overview','genres','keywords','cast','crew'])
final['tags'] = final['tags'].apply(lambda x: " ".join(x))
final.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


<b>Vectorizing</b>
All the tags are converted to vectors, and the movies whose vectors are closest to a given movie's vector are recommended. The bag-of-words technique is used here; alternatively, one can use TF-IDF or word2vec.

In [59]:
# Vectorization
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [60]:
vector = cv.fit_transform(final['tags']).toarray() #to convert scipy sparse matrix to a numpy array
vector.shape

(4806, 5000)

In [61]:
#How close are the vectors? (a higher cosine similarity means the vectors are closer)
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)
similarity

array([[1.        , 0.08964215, 0.06071767, ..., 0.02519763, 0.0277885 ,
        0.        ],
       [0.08964215, 1.        , 0.06350006, ..., 0.02635231, 0.        ,
        0.        ],
       [0.06071767, 0.06350006, 1.        , ..., 0.02677398, 0.        ,
        0.        ],
       ...,
       [0.02519763, 0.02635231, 0.02677398, ..., 1.        , 0.07352146,
        0.04774099],
       [0.0277885 , 0.        , 0.        , ..., 0.07352146, 1.        ,
        0.05264981],
       [0.        , 0.        , 0.        , ..., 0.04774099, 0.05264981,
        1.        ]])
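
Each entry of this matrix is the cosine of the angle between two count vectors, that is, their dot product divided by the product of their lengths. The diagonal is 1 because every movie is identical to itself. A small NumPy sketch on invented vectors (the helper `cosine` is written here for illustration and is not the scikit-learn implementation):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented count vectors over a four-word vocabulary.
u = np.array([1, 1, 0, 0])
v = np.array([1, 0, 1, 0])
w = np.array([0, 0, 0, 2])

print(cosine(u, u))  # identical vectors: approximately 1
print(cosine(u, v))  # one shared word out of two each: approximately 0.5
print(cosine(u, w))  # no shared words: 0
```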

In [63]:
def recommend(movie):
    index = np.where(final['title'].values == movie)[0][0] #row position of the movie (similarity is indexed by position, not by index label)
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1]) #(position, similarity) pairs sorted in descending order of similarity
    for i in distances[1:6]: #skip the first entry, which is the movie itself
        print(final.iloc[i[0]].title) #printing similar movies
recommend('Spectre')

Quantum of Solace
Never Say Never Again
Skyfall
Thunderball
From Russia with Love
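
For use behind a website, a variant that returns a list and copes with an unknown title may be more convenient than printing. The sketch below is one possible shape, assuming a list of titles aligned row-by-row with the similarity matrix; `top_similar`, `toy_titles`, and `toy_sim` are hypothetical names and the matrix values are invented.

```python
import numpy as np

def top_similar(title, titles, sim, k=5):
    """Return up to k titles most similar to `title`, or [] if it is unknown.

    `titles` is a list aligned row-by-row with the square matrix `sim`.
    """
    if title not in titles:
        return []
    pos = titles.index(title)
    order = np.argsort(-sim[pos], kind="stable")  # positions by descending similarity
    return [titles[i] for i in order if i != pos][:k]

# Invented 3x3 similarity matrix for illustration.
toy_titles = ["A", "B", "C"]
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])

print(top_similar("A", toy_titles, toy_sim, k=2))
print(top_similar("Missing", toy_titles, toy_sim))
```

Excluding the query by position, instead of dropping the first sorted entry, avoids returning the query itself when another movie ties with it at similarity 1.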


In [64]:
with open('movie_list.pkl','wb') as f:
    pickle.dump(final,f)
with open('similarity.pkl','wb') as f:
    pickle.dump(similarity,f)
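
The saved files can be read back with `pickle.load`, which is presumably what the website step would do at start-up. A self-contained round-trip sketch, using a small stand-in dictionary and a temporary file with a hypothetical name in place of the notebook's DataFrame and matrix:

```python
import os
import pickle
import tempfile

# Small stand-in object; the notebook pickles a DataFrame and a NumPy array instead.
payload = {"titles": ["Avatar", "Spectre"], "scores": [[1.0, 0.1], [0.1, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.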