### Movie Recommendation System 
This movie recommendation system is a machine learning project built on a dataset obtained from Kaggle. It uses two files:
- tmdb_5000_credits.csv
- tmdb_5000_movies.csv

The end result is a content-based recommendation system.

#### Importing Python libraries


In [1]:
import numpy as np
import pandas as pd
import ast

### Data Preprocessing
#### Importing datasets


In [2]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

We will merge the two datasets on the title column.


In [3]:
movies = movies.merge(credits, on='title')     # on= names the column used as the merge key
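
Merging on the title assumes that titles are unique. A minimal sketch with made-up frames (the rows and the `_toy` names are hypothetical, not taken from the TMDB files) suggests what can happen when two films share a title: every matching pair produces a row, so the merged frame may contain more rows than either input.

```python
import pandas as pd

# Hypothetical toy data: two different films share the title "Batman".
movies_toy = pd.DataFrame({
    "title": ["Batman", "Batman", "Avatar"],
    "budget": [35, 1, 237],
})
credits_toy = pd.DataFrame({
    "title": ["Batman", "Batman", "Avatar"],
    "movie_id": [268, 2661, 19995],
})

# An inner merge on title pairs every "Batman" row on the left with
# every "Batman" row on the right (2 x 2 = 4 rows), plus one row for "Avatar".
merged_toy = movies_toy.merge(credits_toy, on="title")
print(len(merged_toy))  # 5
```

If both files share a unique identifier column, merging on that column instead would avoid the duplication; whether that is possible here depends on the columns present in the two CSV files.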

We will now only keep the columns that will benefit us in creating the recommendation system.
These columns are: movie_id, title, overview, genres, keywords, cast, and crew

In [4]:
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

The final dataset needs only three columns: movie_id, title, and tags (which will be created by combining the other columns).

In [5]:
# Dropping the rows which have null values
movies.dropna(inplace=True)
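
One side effect of dropping rows is worth keeping in mind: the surviving rows keep their original index labels, so labels and row positions can stop matching. The sketch below uses a hypothetical three-row frame to illustrate this; it is not the notebook's data.

```python
import pandas as pd

# Hypothetical frame with one missing overview.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["text", None, "more text"],
})
df = df.dropna()

print(list(df.index))  # [0, 2]  (label 1 is gone)

# Label-based and position-based lookups now disagree for "C".
label = df[df["title"] == "C"].index[0]      # 2 (index label)
position = list(df["title"]).index("C")      # 1 (row position)
```

Calling `reset_index(drop=True)` after `dropna()` is one way to bring labels and positions back in line, which matters whenever a label is later used to index a NumPy array.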

In [6]:
# Helper function that extracts the genre names from the genres column

def convert(obj):               # obj is a list of dictionaries stored as a string
    L = []
    for i in ast.literal_eval(obj):               # Going over each dictionary
        L.append(i['name'])                       # Extracting the name of the genre
    return L
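
To make the parsing step concrete, here is a small sketch of what `ast.literal_eval` does to a single cell. The `raw` string is a hypothetical value shaped like the entries in the genres column.

```python
import ast

# Hypothetical raw cell value, shaped like the entries in the genres column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)          # string -> list of dictionaries
names = [d["name"] for d in parsed]     # keep only the "name" of each entry
print(names)  # ['Action', 'Adventure']
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.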

In [7]:
movies['genres'] = movies['genres'].apply(convert)

In [8]:
# Similarly applying it on the keywords
movies['keywords'] = movies['keywords'].apply(convert)

In [9]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [10]:
# Creating a function that extracts the names of the first three cast members

def convert3(obj):               # obj is a list of dictionaries stored as a string
    L = []
    counter = 0
    for i in ast.literal_eval(obj):               # Going over each dictionary
        if counter != 3:                          # Since we require only the first three cast members
            L.append(i['name'])                   # Extracting the name of the cast member
            counter += 1
        else:
            break
    return L

In [11]:
movies['cast'] = movies['cast'].apply(convert3)

In [12]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [13]:
# We need a function that extracts only the director's name from the crew column,
# i.e. from the dictionary whose job is 'Director'

def fetch_director(obj):               # obj is a list of dictionaries stored as a string
    L = []
    for i in ast.literal_eval(obj):               # Going over each dictionary
        if i['job'] == 'Director':
            L.append(i['name'])                   # Extracting the name of the director
            break
    return L

In [14]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [15]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [16]:
# Converting the overview from a string to a list of words so that it can be concatenated with the other list columns
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [17]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [18]:
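# Removing the spaces inside multi-word entries so that each entry becomes a single token
# (e.g. 'Sam Worthington' -> 'SamWorthington'); otherwise two different people who share a first name would look alike.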
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])
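
The effect of removing spaces can be seen in a small sketch. The two cast lists and the `tokens` helper are hypothetical and exist only for illustration.

```python
# Hypothetical cast names for two unrelated films.
film_a = ["Sam Worthington"]
film_b = ["Sam Mendes"]

def tokens(names, strip_spaces):
    # Join the names into one lowercase string and split it into tokens.
    if strip_spaces:
        names = [n.replace(" ", "") for n in names]
    return set(" ".join(names).lower().split())

# With spaces kept, the two films appear to share the token "sam".
shared_with_spaces = tokens(film_a, False) & tokens(film_b, False)
# With spaces removed, each full name is a single distinct token.
shared_without_spaces = tokens(film_a, True) & tokens(film_b, True)

print(shared_with_spaces)     # {'sam'}
print(shared_without_spaces)  # set()
```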

In [19]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]


In [20]:
# Creating a tag column which will be a concatenation of overview, genres, keywords, cast, and crew
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [21]:
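# Creating a new dataframe that contains only the required columns.
# .copy() makes new_df independent of movies, so later assignments to its columns do not raise SettingWithCopyWarning

new_df = movies[['movie_id', 'title', 'tags']].copy()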
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [22]:
# Converting the tags column into a string

new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))
new_df.head(1)


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."


In [23]:
# Converting the tags column to lower case so that word matching is case-insensitive

new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())
new_df.head(1)


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


We will now apply stemming. Stemming reduces the different forms of a word to a common root, so that "speaking", "speak", and "speaks" are all replaced by the single token "speak".

In [27]:
#Importing nltk library
import nltk

In [28]:
from nltk.stem.porter import PorterStemmer
# Importing PorterStemmer class and creating an object from it
ps = PorterStemmer()

In [29]:
# Creating a helper function
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))               # y array will now contain text that has been stemmed
    return " ".join(y)

In [30]:
new_df['tags'] = new_df['tags'].apply(stem)


In [31]:
new_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."
1,285,Pirates of the Caribbean: At World's End,"captain barbossa, long believ to be dead, ha c..."
2,206647,Spectre,a cryptic messag from bond’ past send him on a...
3,49026,The Dark Knight Rises,follow the death of district attorney harvey d...
4,49529,John Carter,"john carter is a war-weary, former militari ca..."


### Vectorization

#### Text vectorization after data preprocessing
We will use the bag-of-words technique. All the words in the tags column are pooled, and a fixed number of the most frequent words is selected as the vocabulary. Each row (here, each film) is then represented as a vector with one dimension per vocabulary word, where each entry is the number of times that word occurs in the film's tags.

Stop words are excluded from the vocabulary. Stop words are very common words such as "are", "and", "to", and "for" that contribute little to the meaning of a sentence. We will use the scikit-learn library for this.
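
The idea can be sketched by hand on a few made-up tag strings. This is only an illustration of bag of words, not how `CountVectorizer` is implemented; the tags, the stop list, and the vocabulary size of 2 are all assumptions chosen to keep the output small.

```python
from collections import Counter

# Hypothetical, already-preprocessed tag strings for three films.
tags = [
    "space war action",
    "space colony action",
    "space love story",
]
stop_words = {"the", "a", "and", "to", "for"}   # tiny illustrative stop list
max_features = 2                                 # vocabulary size to keep

# Count every non-stop word across all films and keep the most frequent ones.
totals = Counter(w for t in tags for w in t.split() if w not in stop_words)
vocab = sorted(w for w, _ in totals.most_common(max_features))

# Each film becomes a vector of word counts over the shared vocabulary.
bow = [[t.split().count(w) for w in vocab] for t in tags]

print(vocab)  # ['action', 'space']
print(bow)    # [[1, 1], [1, 1], [0, 1]]
```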

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

#Creating a CountVectorizer object that keeps the 5000 most frequent words and drops English stop words
cv = CountVectorizer(max_features=5000, stop_words='english')   

In [33]:
#Performing vectorization
vectors = cv.fit_transform(new_df['tags']).toarray()

In [37]:
vectors.shape


(4806, 5000)

In [39]:
#Listing the 5000 words in the vocabulary
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      shape=(5000,), dtype=object)

Next we measure how close the movies (vectors) are to one another. The greater the distance between two vectors, the less similar the movies are. Instead of Euclidean distance, we will use cosine similarity.

Cosine similarity is based on the angle between two vectors: the smaller the angle, the higher the similarity.
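
As a rough illustration, the cosine of the angle between two vectors can be computed by hand. The `cosine` helper and the three short vectors below are hypothetical; the notebook itself relies on scikit-learn for this step.

```python
import math

def cosine(u, v):
    # cos(angle) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1, 1, 0]
b = [2, 2, 0]   # same direction as a, twice the length
c = [0, 0, 3]   # shares no non-zero dimension with a

print(cosine(a, b))  # approximately 1.0
print(cosine(a, c))  # 0.0
```

Because the measure depends only on direction, a film with long tags and a film with short tags can still score as highly similar, whereas their Euclidean distance would grow with the difference in length.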

In [40]:
# Computing cosine similarity with scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

In [41]:
# Applying cosine_similarity
similarity = cosine_similarity(vectors)

In [42]:
similarity.shape

(4806, 4806)

In [43]:
similarity

array([[1.        , 0.08346223, 0.0860309 , ..., 0.04499213, 0.        ,
        0.        ],
       [0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
        0.02615329],
       [0.0860309 , 0.06063391, 1.        , ..., 0.02451452, 0.        ,
        0.        ],
       ...,
       [0.04499213, 0.02378257, 0.02451452, ..., 1.        , 0.03962144,
        0.04229549],
       [0.        , 0.        , 0.        , ..., 0.03962144, 1.        ,
        0.08714204],
       [0.        , 0.02615329, 0.        , ..., 0.04229549, 0.08714204,
        1.        ]], shape=(4806, 4806))

#### Creating the Main Function

In [44]:
def recommend(movie):
    # Find the row position (not the index label) of the searched movie.
    # Labels and positions may differ because dropna() can leave gaps in the index,
    # while the similarity matrix is addressed by position.
    movie_index = np.flatnonzero(new_df['title'].to_numpy() == movie)[0]

    # Similarity scores between the chosen film and every film in the dataset
    distances = similarity[movie_index]

    # enumerate() pairs each score with its row position so that the position survives sorting.
    # The key sorts by score rather than position; [1:10] skips the film itself and keeps the next nine.
    movies_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:10]

    # Print the titles of the recommended movies
    for i in movies_list:
        print(new_df.iloc[i[0]].title)
    

In [49]:
recommend('Batman Begins')

The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf
Rockaway
Batman v Superman: Dawn of Justice
Synecdoche, New York
Defendor
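
The ranking step inside `recommend` can be tried on its own with a toy list. The `scores` values below are made up for illustration and are not taken from the similarity matrix.

```python
# Hypothetical similarity scores of film 0 against films 0..4.
scores = [1.0, 0.2, 0.7, 0.0, 0.5]

# enumerate() attaches each position to its score before sorting by score.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Position 0 is the film itself (score 1.0), so skip it and keep the next two.
top = [pos for pos, _ in ranked[1:3]]
print(top)  # [2, 4]
```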
