## Aim: build a hybrid movie recommender engine that combines content-based and collaborative filtering

* First, we build a simple recommender that uses a weighted rating to produce an overall top-rated chart and genre-wise charts.

* Then we use metadata, i.e. features such as cast, director and genres, to generate recommendations.

* Finally, we apply collaborative filtering with the Surprise library and combine it with content-based filtering to create a hybrid movie recommender engine.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# We will first create a Simple Recommender System
#Load data and examine
filepath='/kaggle/input/the-movies-dataset/movies_metadata.csv'
df=pd.read_csv(filepath)
df.head()


In [None]:
# since df.genres contains genre info as
#"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
#It is necessary to extract genres for a movie into a list
#Note use of fillna, literal_eval and isinstance
df['genres']=df['genres'].fillna('[]').apply(literal_eval).apply(lambda x : [d['name'] for d in x] if isinstance(x,list) else [])
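
The parsing step above can be tried on a single value. This is a minimal sketch: `parse_genres` is a hypothetical helper name, not part of the notebook, and the sample string is shortened from the example in the comment.

```python
from ast import literal_eval

def parse_genres(raw):
    """Return the genre names stored in a stringified list of dicts."""
    # Missing values arrive as None or NaN; NaN is the only value not equal to itself
    if raw is None or raw != raw:
        return []
    parsed = literal_eval(raw)  # safely turns the string into Python objects
    if not isinstance(parsed, list):
        return []
    return [d['name'] for d in parsed]

sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))
print(parse_genres(float('nan')))
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a CSV file.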


In [None]:
# Weighted Rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C, where
# v is the number of votes for the movie
# m is the minimum number of votes required to be listed in the chart
# R is the average rating of the movie
# C is the mean vote across the whole dataset
# Here we create two series, vote_counts and vote_averages, to calculate C and m.
# We take m as the 95th percentile of vote counts to build the highest-rated movies chart.
vote_counts=df[df['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages=df[df['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
m = vote_counts.quantile(0.95)
print(C,m)
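
To see how the weighted rating behaves, here is a small worked example. The numbers for `m`, `C`, `v` and `R` are hypothetical and chosen only for illustration. They are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own mean rating R with the global mean C.

    The more votes v a movie has relative to the threshold m,
    the closer the result is to R; with few votes it stays near C.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m, C = 400, 5.0
few_votes = weighted_rating(v=100, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=10000, R=8.0, m=m, C=C)
print(round(few_votes, 2), round(many_votes, 2))
```

A movie rated 9.0 by only 100 voters ends up below a movie rated 8.0 by 10,000 voters, which is the shrinkage toward `C` that the formula is designed to provide.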

In [None]:
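#Extract year from 'release_date'; missing or unparseable dates become NaN
#Note: NaN never compares equal to anything, so missing values are tested with pd.notnull
df['year']=pd.to_datetime(df['release_date'],errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)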
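
The behaviour of `errors='coerce'` and of NaN comparisons can be checked on a toy series. This is a sketch with made-up dates; `strftime` is used here as one possible alternative to splitting the string, not as the notebook's method.

```python
import pandas as pd

# Made-up values: one valid date, one malformed string, one missing entry
dates = pd.Series(['1995-10-30', 'not a date', None])

parsed = pd.to_datetime(dates, errors='coerce')  # invalid entries become NaT
years = parsed.dt.strftime('%Y')                 # NaT stays missing instead of becoming a string
print(years.tolist())

# A comparison with NaN is never a reliable missing-value test
nan = float('nan')
print(nan != nan)
```

The last line prints `True`, which is why an equality or inequality check against `np.nan` cannot tell missing values apart from real ones.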


In [None]:
#Create a movies chart that keeps only movies with at least m votes (434 for this dataset)
#Convert 'vote_count' and 'vote_average' to int before calculating the weighted rating
chart=df[(df['vote_count']>=m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
chart['vote_count']=chart['vote_count'].astype('int')
chart['vote_average']=chart['vote_average'].astype('int')


In [None]:
#calc. weighted rating and store top 250 movies
def weighted_rating(x):
    v=x['vote_count']
    R=x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
chart['w_rating']=chart.apply(weighted_rating,axis=1)
chart=chart.sort_values(by='w_rating',ascending=False).head(250)

In [None]:
chart.head(15)


In [None]:
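#Creating a genre-wise chart.
#Each genre list is expanded into one row per genre: apply(pd.Series) spreads the list across columns,
#stack() turns those columns into rows, and reset_index(level=1, drop=True) drops the inner position
#level so that s can be joined back onto df by the original row index.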
s=df.apply(lambda x:pd.Series(x['genres']),axis=1).stack().reset_index(level=1,drop=True)
s.name = 'genre'
gen_df = df.drop('genres', axis=1).join(s)
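
The same one-row-per-genre reshaping can be sketched on a toy frame with `DataFrame.explode`. The frame and its titles are hypothetical. One difference from the join above: this sketch drops movies without any genre, while the left join keeps them with a NaN genre.

```python
import pandas as pd

# Hypothetical toy data: each movie carries a list of genres
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Comedy', 'Romance'], ['Drama'], []],
})

# explode() emits one row per list element; an empty list yields NaN
long_form = (
    toy.explode('genres')
       .dropna(subset=['genres'])
       .rename(columns={'genres': 'genre'})
)
print(long_form)
```

Movie `A` now appears twice, once per genre, so filtering on `genre == 'Romance'` selects it exactly as `build_genre_chart` expects.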

In [None]:
#function to build genre-wise chart
def build_genre_chart(genre,percentile=0.80):
    ugenre_df=gen_df[gen_df['genre']==genre]
    vote_counts=ugenre_df[ugenre_df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages=ugenre_df[ugenre_df['vote_average'].notnull()]['vote_average'].astype('int')
    C=vote_averages.mean()
    m=vote_counts.quantile(percentile)
    
    chart=ugenre_df[(ugenre_df['vote_count']>=m) & (ugenre_df['vote_count'].notnull()) & (ugenre_df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    chart['vote_count']=chart['vote_count'].astype('int')
    chart['vote_average']=chart['vote_average'].astype('int')
    chart['wr']=chart.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    chart = chart.sort_values('wr', ascending=False).head(250)
    
    return chart

In [None]:
build_genre_chart('Romance').head(15)

## Part II: Content-based filtering

The recommender above has severe limitations, as its recommendations are limited to the top-rated movies.

- Features such as genre, actors and directors are not taken into consideration when generating recommendations.

- To improve on this, content-based filtering computes the similarity between movies from such features and suggests the movies most similar to one that the user liked.
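
The idea can be sketched without any library: represent each movie as a bag of words and rank the others by cosine similarity to the movie the user liked. The titles and descriptions below are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented descriptions, for illustration only
docs = {
    'Heist One': 'crew plans bank heist',
    'Heist Two': 'crew plans casino heist',
    'Garden Show': 'family plants a garden',
}
vectors = {title: Counter(text.split()) for title, text in docs.items()}

liked = 'Heist One'
ranked = sorted(
    (t for t in vectors if t != liked),
    key=lambda t: cosine(vectors[liked], vectors[t]),
    reverse=True,
)
print(ranked)
```

The two heist movies share three of their four words, so `Heist Two` ranks first. The cells that follow apply the same principle with TF-IDF weights instead of raw counts.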

In [None]:
#Content Based Filtering. Will be built on a smaller dataset because computing power is limited.
#links_small is a smaller data set of around 10000 movies
links_small=pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv')
links_small=links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
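#Drop the rows with these index labels before casting 'id' to int (presumably their 'id' values are malformed)
df = df.drop([19730, 29503, 35587])
df['id']=df['id'].astype('int')
#.copy() gives sdf its own data, so later column assignments do not act on a view of df
sdf=df[df['id'].isin(links_small)].copy()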
sdf.shape

In [None]:
#Create a feature called 'description' and apply TfidfVectorizer on it to get feature vectors. These feature vectors 
#will be used to calculate cosine_similarity
sdf['tagline'] = sdf['tagline'].fillna('')
#Fill missing overviews first and join with a space, so the tagline is neither lost nor fused onto the last word
sdf['description'] = sdf['overview'].fillna('') + ' ' + sdf['tagline']


In [None]:
#min_df=1 keeps every term that occurs in at least one document
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(sdf['description'])

In [None]:
cos_sim=linear_kernel(tfidf_matrix, tfidf_matrix)
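
`linear_kernel` computes plain dot products. That equals cosine similarity only when every row has unit length, which holds for `TfidfVectorizer` output under its default `norm='l2'` setting. The sketch below checks this relationship on a random matrix with NumPy only; the matrix is arbitrary and stands in for a feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))  # arbitrary stand-in for a document-term matrix

# Cosine similarity computed from its definition
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

# Dot products of L2-normalised rows give the same matrix
X_unit = X / norms[:, None]
dot_of_unit_rows = X_unit @ X_unit.T

print(np.allclose(dot_of_unit_rows, cosine))  # normalised rows: identical
print(np.allclose(X @ X.T, cosine))           # raw rows: not identical
```

For matrices whose rows are not normalised, such as raw counts, the dot product also grows with row length, so `cosine_similarity` is the safer call there.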

In [None]:
#Create a Series 'indices' that maps each title to its row position in sdf.
#Resetting the index of sdf first is essential, because the rows of cos_sim follow that positional order.
sdf=sdf.reset_index()
titles = sdf['title']
indices = pd.Series(sdf.index, index=sdf['title'])
indices.dtype

In [None]:
indices.head()

In [None]:
def get_recommendations(title):
    idx=indices[title]
    sim_scores=list(enumerate(cos_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores=sim_scores[1:31]
    movie_indices=[i[0] for i in sim_scores]
    return titles.iloc[movie_indices]


In [None]:
get_recommendations('The Godfather').head(10)

In [None]:
#Metadata Based Recommender. Adding credits(crew,cast) and keywords for each movie
credits = pd.read_csv('/kaggle/input/the-movies-dataset/credits.csv')
keywords = pd.read_csv('/kaggle/input/the-movies-dataset/keywords.csv')

In [None]:
#Merge credits and keywords into df on 'id', then rebuild sdf from links_small
credits['id']=credits['id'].astype('int')
keywords['id']=keywords['id'].astype('int')
df['id']=df['id'].astype('int')
df=df.merge(credits,on='id')
df=df.merge(keywords,on='id')
#.copy() gives sdf its own data, so the column assignments below do not act on a view of df
sdf=df[df['id'].isin(links_small)].copy()

In [None]:
sdf.head()

In [None]:
#Apply literal_eval to return list and create two more features
sdf['cast']=sdf['cast'].apply(literal_eval)
sdf['crew']=sdf['crew'].apply(literal_eval)
sdf['keywords']=sdf['keywords'].apply(literal_eval)
sdf['crewsize']=sdf['crew'].apply(lambda x:len(x))
sdf['castsize']=sdf['cast'].apply(lambda x:len(x))

In [None]:
def get_director(x):
    for i in x:
        if i['job']=='Director':
            return i['name']
    return np.nan

In [None]:
sdf['director']=sdf['crew'].apply(get_director)


In [None]:
#Take the first three actors from each list, as they play the major characters in the movie
sdf['cast']=sdf['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x,list) else [])
sdf['cast']=sdf['cast'].apply(lambda x: x[:3])

In [None]:
sdf.head()

In [None]:
#extract keywords
sdf['keywords'] = sdf['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])


In [None]:
sdf.head()

In [None]:
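#Lower-case the names and strip spaces so that each person becomes a single token
sdf['cast']=sdf['cast'].apply( lambda x: [str.lower(i.replace(" ","")) for i in x] )
sdf['director']=sdf['director'].astype('str').apply( lambda x: str.lower(x.replace(" ", "")) )
#Repeat the director three times to give it more weight than any single cast member
sdf['director'] = sdf['director'].apply(lambda x: [x,x, x])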

In [None]:
s = sdf.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s.head()
s = s.value_counts()
s[:5]

In [None]:
#Keep only keywords that occur more than once across the dataset
s=s[s>1]
#Stemming is the process of reducing a word to its word stem
stemmer=SnowballStemmer('english')
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [None]:
stemmer.stem('dogs')

In [None]:
sdf['keywords']=sdf['keywords'].apply(filter_keywords)
sdf['keywords']=sdf['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
sdf['keywords'] = sdf['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
sdf['soup'] = sdf['keywords'] + sdf['cast'] + sdf['director'] + sdf['genres']
sdf['soup'] = sdf['soup'].apply(lambda x: ' '.join(x))
sdf.soup[0]
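
How the "soup" comes together for one movie can be sketched in plain Python. `clean` and `make_soup` are hypothetical helper names, and the people and keywords are placeholders rather than dataset values.

```python
def clean(name):
    """Collapse a name into one lower-case token, e.g. 'Jane Doe' -> 'janedoe'."""
    return name.replace(' ', '').lower()

def make_soup(keywords, cast, director, genres, director_weight=3):
    """Join all metadata tokens into one string for a count vectorizer."""
    tokens = list(keywords)
    tokens += [clean(n) for n in cast[:3]]            # at most three lead actors
    tokens += [clean(director)] * director_weight     # repetition acts as a weight
    tokens += list(genres)
    return ' '.join(tokens)

# Placeholder metadata, for illustration only
soup = make_soup(
    keywords=['heist'],
    cast=['Jane Doe', 'John Roe'],
    director='Alex Poe',
    genres=['Crime'],
)
print(soup)
```

Removing the spaces keeps two people who share a first name from matching on that name alone, and repeating the director raises its count in the resulting vector.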

In [None]:
#min_df=1 keeps every term that occurs in at least one document
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
count_matrix = count.fit_transform(sdf['soup'])



In [None]:
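#Raw counts are not length-normalised, so use cosine_similarity rather than a plain dot product
cos_sim=cosine_similarity(count_matrix,count_matrix)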

In [None]:
sdf = sdf.reset_index()
titles = sdf['title']
indices = pd.Series(sdf.index, index=sdf['title'])

In [None]:
get_recommendations('The Dark Knight').head(10)

In [None]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cos_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = sdf.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    #C and m are computed over the 25 candidates only
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    #Use the local m and C here; calling weighted_rating() would silently use the global values
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [None]:
improved_recommendations('The Dark Knight')


## Part III: Collaborative filtering
- Content-based recommenders do not take the individual user's taste into account: every user who asks about the same movie gets the same recommendations.
- Collaborative filtering addresses this.
- It is based on the idea that the ratings of users similar to me can be used to predict how much I will like a particular product or service.

- We use the Surprise library, whose Singular Value Decomposition (SVD) algorithm learns latent factors by minimising the prediction error on the known ratings; RMSE and MAE are used to evaluate it.
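
The latent-factor idea can be sketched with a few lines of NumPy. This is a simplified illustration, not Surprise's implementation: it omits the user and item bias terms, and the ratings, learning rate `lr` and regularisation `reg` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical (user, item, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(42)
P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
mu = np.mean([r for _, _, r in ratings])      # global mean rating

def rmse():
    """Root mean squared error of the current factors on the known ratings."""
    errs = [(r - (mu + P[u] @ Q[i])) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errs)))

rmse_before = rmse()
lr, reg = 0.05, 0.02
for _ in range(200):                          # epochs of stochastic gradient descent
    for u, i, r in ratings:
        err = r - (mu + P[u] @ Q[i])
        pu = P[u].copy()                      # keep the old value for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])
rmse_after = rmse()
print(rmse_before, rmse_after)
```

Each user and each item gets a short vector, and the predicted rating is the global mean plus their dot product. Training nudges the vectors to shrink the error on known ratings, which is why the second number printed is smaller than the first.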

In [3]:
#Collaborative filtering
reader = Reader()
ratings=pd.read_csv('/kaggle/input/the-movies-dataset/ratings_small.csv')
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'])


In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset)
ratings[ratings['userId'] == 1]


In [None]:
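#Predict the rating of user 1 for movie 2105; r_ui=3 is the known true rating, passed only for reference
svd.predict(1, 2105, r_ui=3)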

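## Hybrid Recommender

* This recommender uses both collaborative and content-based filtering techniques to provide personalised recommendations: it takes the movies most similar to a given title and ranks them by the rating that the SVD model predicts for the given user.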
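
The two-stage logic can be sketched independently of the dataset. `hybrid_rank`, `stub_predict` and the tables below are hypothetical stand-ins: the neighbour list plays the role of the content-based step and the stub plays the role of a fitted rating model.

```python
def hybrid_rank(user_id, candidates, predict_rating, top_n=3):
    """Re-rank content-similar candidates by a user-specific predicted rating."""
    scored = [(title, predict_rating(user_id, title)) for title in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical content-based neighbours of 'Movie A'
content_neighbours = {'Movie A': ['Movie B', 'Movie C', 'Movie D']}

# Hypothetical predicted ratings standing in for a trained model
predicted = {
    (1, 'Movie B'): 2.5, (1, 'Movie C'): 4.5, (1, 'Movie D'): 3.0,
    (2, 'Movie B'): 4.8, (2, 'Movie C'): 1.5, (2, 'Movie D'): 3.2,
}

def stub_predict(user_id, title):
    return predicted[(user_id, title)]

for user in (1, 2):
    print(user, hybrid_rank(user, content_neighbours['Movie A'], stub_predict))
```

Both users start from the same candidate list, but the order differs per user, which is what distinguishes the hybrid approach from the purely content-based recommender.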

In [None]:
def convert_int(x):
    #Return x as an int, or NaN when it is missing or not numeric
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan
    

In [None]:
id_map = pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map

In [None]:
id_map = id_map.merge(sdf[['title', 'id']], on='id').set_index('title')
id_map

In [None]:
indices_map = id_map.set_index('id')

In [None]:
indices_map

In [None]:
def hybrid(userId, title):
    idx = indices[title]
    
    #Content-based step: take the 25 movies most similar to the given title
    sim_scores = list(enumerate(cos_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = sdf.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    #Collaborative step: map each 'id' to its movieId via indices_map and estimate this user's rating
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [None]:
#Suggest movies similar to 'Avatar' for user 500, ranked by the rating predicted for that user
hybrid(500,'Avatar')