# Movie Recommender System using ML Algorithms

### In this project  -  We develop a recommendation system using Cosine Similarity as the base ML algorithm. The data is acquired from Kaggle and is based of the IMDB Movie Lens Database. Provided below is the code which is used to acquire data, pre-process it and then apply Cosine Similarity to it find the most similar movies  based the User's input. The system returns 10 of the most matching movies

In [1]:
import pandas as pd
import numpy as np

#Importing the relevant datasets from the mounted Google Drive (change the code below if the data is hosted elsewhere)
metadata = pd.read_csv("data/movies_metadata.csv")
ratings = pd.read_csv("data/ratings.csv")
credits = pd.read_csv("data/credits.csv")
keywords = pd.read_csv("data/keywords.csv")

  exec(code_obj, self.user_global_ns, self.user_ns)


### The combined memory taken up by the datasets is around 900MB. Whose data needs to processed before algorithms can be applied on it.

In [2]:
print("metadata shape:",metadata.shape)
print("ratings shape:",ratings.shape)
print("credits shape:",credits.shape)
print("keywords shape:",keywords.shape)

metadata shape: (45466, 24)
ratings shape: (26024289, 4)
credits shape: (45476, 3)
keywords shape: (46419, 2)


In [3]:
#Ratings dataset preprocessing
newratings = ratings[["movieId","rating"]]
#newratings.sort_values("movieId",ascending=False,inplace = True)
newnewratings=newratings.groupby('movieId',as_index=False)['rating'].mean()

print("New ratings dataset shape :",newnewratings.shape)
newnewratings=newnewratings.sort_values('rating',ascending=False)
print("New ratings dataset top :")
print(newnewratings.head(3))


New ratings dataset shape : (45115, 2)
New ratings dataset top :
       movieId  rating
34799   147966     5.0
37018   154341     5.0
40487   164620     5.0


### Here we starting pre processing our data,
#### There are a number of number Nan and string values in our popularity column, and as we are taking  the top 20000 most popular movies into considertion for recommendation system, We need to remove these 'dirty' rows.


### We have millions of rows of data in ratings file which belong to multiple movies. So we find the average rating of each of these movies.

In [4]:

#dropping Nan values as they very unpopular movies
metadata["popularity"].apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna()

#Cleaning the data from invalid values in the dataset
metadata=metadata[metadata["popularity"].apply(lambda x: isinstance(x, float))]


metadata = metadata.sort_values(by='popularity',ascending=False)
metadata = metadata.iloc[:25000,:]


keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')
newnewratings['id'] = newnewratings['movieId'].astype('int')
newnewratings['rating'] = newnewratings['rating'].astype('str')



metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')
metadata = metadata.merge(newnewratings,on ='id')

print(newnewratings.shape)
print(metadata.columns)
metadata = metadata.sort_values('rating',ascending=False)


(45115, 3)
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords', 'movieId',
       'rating'],
      dtype='object')


### Printing information on some of datasets we will be using to better understand how information is stored.

In [5]:
metadata.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords', 'movieId',
       'rating'],
      dtype='object')

In [6]:

metadata[['title', 'cast', 'crew', 'keywords', 'genres','rating']].head(3)

Unnamed: 0,title,cast,crew,keywords,genres,rating
859,Any Day Now,"[{'cast_id': 2, 'character': 'Rudy', 'credit_i...","[{'credit_id': '52fe4a4ec3a36847f81c6a77', 'de...","[{'id': 387, 'name': 'california'}, {'id': 824...","[{'id': 18, 'name': 'Drama'}]",5.0
2303,Phil Spector,"[{'cast_id': 2, 'character': 'Linda Kenney Bad...","[{'credit_id': '52fe4d2dc3a36847f8252d75', 'de...",[],"[{'id': 10770, 'name': 'TV Movie'}, {'id': 18,...",5.0
3832,Yellow Rock,"[{'cast_id': 3, 'character': 'Tom Hanner', 'cr...","[{'credit_id': '53f11470c3a3685ae2002fa1', 'de...","[{'id': 156948, 'name': 'missing child'}]","[{'id': 37, 'name': 'Western'}]",5.0


In [7]:
#raises an exception if the input isn't a valid Python datatype, so the code won't be executed if it's not.
#Parse the stringified features into their corresponding python objects

from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

In [8]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [9]:
#Getting a list of the actors, keywords and genres
def get_list(x):
    if isinstance(x, list): 
        names = [i['name'] for i in x] 
        
        if len(names) > 3:
            names = names[:3]
        return names

    return []

In [10]:
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

metadata[['title', 'cast', 'director', 'keywords', 'genres','rating']].head()

Unnamed: 0,title,cast,director,keywords,genres,rating
859,Any Day Now,"[Alan Cumming, Garret Dillahunt, Isaac Leyva]",Travis Fine,"[california, drag queen, homophobia]",[Drama],5.0
2303,Phil Spector,"[Helen Mirren, Al Pacino, Jeffrey Tambor]",David Mamet,[],"[TV Movie, Drama]",5.0
3832,Yellow Rock,"[Michael Biehn, James Russo, Lenore Andriel]",Nick Vallelonga,[missing child],[Western],5.0
3736,Burning Secret,"[David Eberts, Faye Dunaway, Klaus Maria Brand...",Andrew Birkin,"[austria, american diplomat]",[Drama],5.0
3321,Brannigan,"[John Wayne, Richard Attenborough, Judy Geeson]",Douglas Hickox,"[london england, scotland yard, cop]","[Action, Crime, Drama]",5.0


#### Further Pre-Processing for datasets

In [11]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x] #cleaning up spaces in the data
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [12]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

metadata[['title', 'cast', 'director', 'keywords', 'genres','rating']].head()

Unnamed: 0,title,cast,director,keywords,genres,rating
859,Any Day Now,"[alancumming, garretdillahunt, isaacleyva]",travisfine,"[california, dragqueen, homophobia]",[drama],5.0
2303,Phil Spector,"[helenmirren, alpacino, jeffreytambor]",davidmamet,[],"[tvmovie, drama]",5.0
3832,Yellow Rock,"[michaelbiehn, jamesrusso, lenoreandriel]",nickvallelonga,[missingchild],[western],5.0
3736,Burning Secret,"[davideberts, fayedunaway, klausmariabrandauer]",andrewbirkin,"[austria, americandiplomat]",[drama],5.0
3321,Brannigan,"[johnwayne, richardattenborough, judygeeson]",douglashickox,"[londonengland, scotlandyard, cop]","[action, crime, drama]",5.0


#### The dataframe 'metadata' holds all pre-processed values in itself. Thiis dataframe will be used to the cosine simialrity algorithm

In [13]:

def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
                                                                                                      
metadata['soup'] = metadata.apply(create_soup, axis=1)
#metadata.head()
metadata[['title', 'soup', 'cast', 'director', 'keywords', 'genres','rating']].head()

Unnamed: 0,title,soup,cast,director,keywords,genres,rating
859,Any Day Now,california dragqueen homophobia alancumming ga...,"[alancumming, garretdillahunt, isaacleyva]",travisfine,"[california, dragqueen, homophobia]",[drama],5.0
2303,Phil Spector,helenmirren alpacino jeffreytambor davidmamet...,"[helenmirren, alpacino, jeffreytambor]",davidmamet,[],"[tvmovie, drama]",5.0
3832,Yellow Rock,missingchild michaelbiehn jamesrusso lenoreand...,"[michaelbiehn, jamesrusso, lenoreandriel]",nickvallelonga,[missingchild],[western],5.0
3736,Burning Secret,austria americandiplomat davideberts fayedunaw...,"[davideberts, fayedunaway, klausmariabrandauer]",andrewbirkin,"[austria, americandiplomat]",[drama],5.0
3321,Brannigan,londonengland scotlandyard cop johnwayne richa...,"[johnwayne, richardattenborough, judygeeson]",douglashickox,"[londonengland, scotlandyard, cop]","[action, crime, drama]",5.0


In [14]:
#Getting the user's input for genre, actors and directors of their liking.
def get_genres():
  genres = input("What Movie Genre are you interested in like Action , Comedy , Drama(if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  genres = " ".join(["".join(n.split()) for n in genres.lower().split(',')])
  return genres

def get_actors():
  actors = input("Who are some actors within the genre that you like  (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  actors = " ".join(["".join(n.split()) for n in actors.lower().split(',')])
  return actors

def get_directors():
  directors = input("Who are some directors within the genre that you like  (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  directors = " ".join(["".join(n.split()) for n in directors.lower().split(',')])
  return directors

def get_keywords():
  keywords = input("What are some of the keywords that describe the movie you want to watch, like elements of the plot, whether or not it is about friendship,boardgame etc? (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  keywords = " ".join(["".join(n.split()) for n in keywords.lower().split(',')])
  return keywords

def get_searchTerms():
  searchTerms = [] 
  genres = get_genres()
  if genres != 'skip':
    searchTerms.append(genres)

  actors = get_actors()
  if actors != 'skip':
    searchTerms.append(actors)

  directors = get_directors()
  if directors != 'skip':
    searchTerms.append(directors)

  keywords = get_keywords()
  if keywords != 'skip':
    searchTerms.append(keywords)
  
  return searchTerms

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def make_recommendation(metadata=metadata):
  new_row = metadata.iloc[-1,:].copy() #creating a copy of the last row of the 
  #dataset, which we will use to input the user's input
  
  #grabbing the new wordsoup from the user
  searchTerms = get_searchTerms()  
  new_row.iloc[-1] = " ".join(searchTerms) #adding the input to our new row
  
  #adding the new row to the dataset
  metadata = metadata.append(new_row)
  
  #Vectorizing the entire matrix 
  count = CountVectorizer(stop_words='english')
  count_matrix = count.fit_transform(metadata['soup'])

  #Obtaining cosine similarity 
  cosine_sim2 = cosine_similarity(count_matrix, count_matrix) 
  
  #sorting cosine similarities by highest to lowest
  sim_scores = list(enumerate(cosine_sim2[-1,:]))
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  #matching the similarities to the movie titles and ids
  ranked_titles = []
  for i in range(1, 11):
    indx = sim_scores[i][0]
    ranked_titles.append([metadata['title'].iloc[indx], metadata['imdb_id'].iloc[indx]])
  
  return ranked_titles

In [16]:
make_recommendation()

[['Bill Burr: Why Do I Do This?', 'tt1254947'],
 ['Stewart Lee: If You Prefer a Milder Comedian, Please Ask for One',
  'tt1756754'],
 ['Patton Oswalt: My Weakness Is Strong', 'tt1503646'],
 ['Six Days Seven Nights', 'tt0120828'],
 ['Working Girl', 'tt0096463'],
 ['Case départ', 'tt1821362'],
 ['Jeff Dunham: Minding the Monsters', 'tt2461736'],
 ['Fallen Art', 'tt0440846'],
 ['Bill Burr: Let It Go', 'tt1717578'],
 ['Kevin Hart: Seriously Funny', 'tt1714196']]