### This is the second part of my project on "Movie Recommender Systems". In the first notebook (HRS.ipynb), I attempted to narrate the story of movies through extensive exploratory data analysis and data wrangling on movie metadata collected from TMDB.

### I also built two extremely minimalist predictive models to predict movie revenue and movie success and visualise which features influence the output (revenue and success respectively).

### In this notebook (HRS_2), I will implement a few recommendation algorithms (content-based, collaborative filtering and popularity-based) and try to build an ensemble of these models to arrive at our final recommendation system.

### With us, we have two MovieLens datasets.

### The Full Dataset: Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.

### The Small Dataset: Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

# Simple Recommender System

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

##### The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

##### The implementation of this model is extremely trivial. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [2]:
md = pd.read_csv('Desktop/ML DATA/movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x:[i['name'] for i in x] if isinstance(x,list) else [])

##### I use the TMDB Ratings to come up with our Top Movies Chart. I will use IMDB's weighted rating formula to construct my chart. Mathematically, it is represented as follows:

##### Weighted Rating (WR) =  ((v/(v+m)).R) + ((m/(v+m)).C)
##### where,

##### v is the number of votes for the movie
##### m is the minimum votes required to be listed in the chart
##### R is the average rating of the movie
##### C is the mean vote across the whole report (here, the mean rating over all movies in the dataset)
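
##### As a quick worked example of the formula above, the sketch below plugs in numbers that appear in this notebook's own outputs (m = 434, C ≈ 5.2449, and Inception's 14075 votes with a rating of 8). The second movie, with only 50 votes, is hypothetical and is there only to show how a small v pulls the score towards C.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: a vote-count-weighted blend of R and C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434.0               # vote threshold reported in this notebook
C = 5.244896612406511   # mean vote reported in this notebook

# Many votes: the score stays close to the movie's own rating R.
popular = weighted_rating(v=14075, R=8, m=m, C=C)

# Few votes (hypothetical movie): the score is pulled towards the mean C.
obscure = weighted_rating(v=50, R=8, m=m, C=C)

print(round(popular, 6), round(obscure, 6))
```

##### Both movies have the same raw rating of 8, but the one with few votes ends up much closer to the dataset mean. This shrinkage is the whole point of the formula.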

##### The next step is to determine an appropriate value for m, the minimum votes required to be listed in the chart. We will use 95th percentile as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.
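
##### To make the percentile cutoff concrete, here is a minimal pure-Python sketch on hypothetical vote counts 1 to 100. The helper quantile_linear is an illustrative name; it interpolates linearly between neighbouring ranks, which, to my understanding, is also the default behaviour of the pandas quantile method.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)              # fractional rank of the quantile
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

vote_counts_toy = list(range(1, 101))         # hypothetical vote counts 1..100
m_toy = quantile_linear(vote_counts_toy, 0.95)

# Only movies with at least m_toy votes make it onto the chart.
qualified_toy = [v for v in vote_counts_toy if v >= m_toy]
print(m_toy, len(qualified_toy))
```

##### With 100 evenly spread vote counts, only the 5 largest clear the 95th percentile, so raising or lowering the percentile directly trades chart size against reliability of the ratings.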

##### I will build our overall Top 250 Chart and will define a function to build charts for a particular genre.

In [4]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

C = vote_averages.mean()
C

5.244896612406511

In [5]:
m = vote_counts.quantile(0.95)
m

434.0

In [6]:
# x != np.nan is always True, so missing dates are detected with pd.notnull instead
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x.year) if pd.notnull(x) else np.nan)

In [7]:
qualified = md[(md['vote_count']>=m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title','year','vote_count','vote_average','popularity','genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

##### Therefore, to be considered for the chart, a movie must have at least 434 votes on TMDB. We also see that the average rating for a movie on TMDB is 5.244 on a scale of 10 (computed after each rating was truncated to an integer by astype('int')). 2274 movies qualify for our chart.

In [8]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:
qualified['wr'] = qualified.apply(weighted_rating,axis=1)

In [10]:
qualified = qualified.sort_values('wr',ascending=False).head(250)

##### Top Movies

In [11]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.4235,"[Adventure, Fantasy, Action]",7.851924


##### Let us now construct our function that builds charts for particular genres. For this, we will relax our default conditions to the 85th percentile instead of 95.

In [12]:
s = md.apply(lambda x:pd.Series(x['genres']),axis=1).stack().reset_index(level=1,drop=True)
s.name = 'genre'
gen_md = md.drop('genres',axis=1).join(s)

In [13]:
def build_chart(genre,percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count']>=m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title','year','vote_count','vote_average','popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count'])*C),axis=1)
    qualified = qualified.sort_values('wr',ascending=False).head(250)
    
    return qualified

##### Let us see our method in action by displaying the Top 15 Romance Movies (Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).

In [14]:
build_chart('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,8.565285
351,Forrest Gump,1994,8147,8,48.3072,7.971357
876,Vertigo,1958,1162,8,18.2082,7.811667
40251,Your Name.,2016,1030,8,34.461252,7.789489
883,Some Like It Hot,1959,835,8,11.8451,7.745154
1132,Cinema Paradiso,1988,834,8,14.177,7.744878
19901,Paperman,2012,734,8,7.19863,7.713951
37863,Sing Street,2016,669,8,10.672862,7.689483
882,The Apartment,1960,498,8,11.9943,7.599317
38718,The Handmaiden,2016,453,8,16.727405,7.566166


# Content Based Recommenders

##### The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendation to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

##### For instance, consider a person who loves Dilwale Dulhania Le Jayenge, My Name is Khan and Kabhi Khushi Kabhi Gham. One inference we can obtain is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

##### To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as Content Based Filtering.

##### I will build two Content Based Recommenders based on:

##### 1. Movie Overviews and Taglines
##### 2. Movie Cast, Crew, Keywords and Genre

In [15]:
links_small = pd.read_csv('Desktop/ML DATA/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [16]:
md = md.drop([19730,29503,35587])

In [17]:
md['id'] = md['id'].astype('int')

In [18]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9099, 25)

##### Movie Description Based Recommender

In [19]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [20]:
# min_df=1 keeps every term that occurs in at least one document (same vocabulary as min_df=0)
tf = TfidfVectorizer(analyzer='word',ngram_range=(1,2),min_df=1,stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [21]:
tfidf_matrix.shape

(9099, 268124)

##### Cosine similarity

##### I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

##### cosine(x,y)= (x.y⊺)/(||x||.||y||) 
##### Since we have used the TF-IDF Vectorizer, which L2-normalises every document vector by default, each vector has unit length and the denominator above equals 1. Calculating the dot product therefore gives us the cosine similarity score directly, so we will use sklearn's linear_kernel instead of cosine_similarity since it is faster.
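
##### The equivalence between the dot product and cosine similarity for unit-length vectors can be checked by hand. The sketch below uses two made-up vectors and plain Python; the helper names are illustrative and not part of sklearn.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from the full definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

x = [1.0, 2.0, 0.0, 3.0]   # made-up feature vectors
y = [0.0, 1.0, 4.0, 1.0]

cos_raw = cosine(x, y)
dot_normalized = dot(l2_normalize(x), l2_normalize(y))
print(cos_raw, dot_normalized)
```

##### The two numbers agree up to floating point error, which is why a plain dot product is enough once the rows are already normalised.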

In [22]:
cosine_sim = linear_kernel(tfidf_matrix,tfidf_matrix)

In [23]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

##### We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.
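
##### As a small illustration of the ranking step, the sketch below picks the most similar items from one row of a similarity matrix. The row values are hypothetical, and top_n_similar is an illustrative helper rather than the function used in this notebook.

```python
def top_n_similar(sim_row, query_idx, n=30):
    """Return indices of the n most similar items, excluding the query itself."""
    ranked = sorted(range(len(sim_row)), key=lambda i: sim_row[i], reverse=True)
    return [i for i in ranked if i != query_idx][:n]

sim_row_toy = [1.0, 0.2, 0.0, 0.7, 0.4]   # hypothetical similarities to item 0
top2 = top_n_similar(sim_row_toy, query_idx=0, n=2)
print(top2)
```

##### Excluding the query by index, rather than by dropping the first sorted entry, stays correct even when another item ties with the query at a similarity of 1.0 (for example, a duplicate description).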

In [24]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index = smd['title'])

In [25]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x:x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

#####  Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [26]:
get_recommendations('The Godfather').to_frame().head(10)

Unnamed: 0,title
973,The Godfather: Part II
8387,The Family
3509,Made
4196,Johnny Dangerously
29,Shanghai Triad
5667,Fury
2412,American Movie
1582,The Godfather: Part III
4221,8 Women
2159,Summer of Sam


In [27]:
get_recommendations('The Dark Knight').to_frame().head(10)

Unnamed: 0,title
7931,The Dark Knight Rises
132,Batman Forever
1113,Batman Returns
8227,"Batman: The Dark Knight Returns, Part 2"
7565,Batman: Under the Red Hood
524,Batman
7901,Batman: Year One
2579,Batman: Mask of the Phantasm
2696,JFK
8165,"Batman: The Dark Knight Returns, Part 1"


##### We see that for The Dark Knight, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked The Dark Knight probably likes it more because of Nolan and would hate Batman Forever and every other substandard movie in the Batman Franchise.

##### Therefore, I am going to use much more suggestive metadata than Overview and Tagline. In the next subsection, I will build a more sophisticated recommender that takes genre, keywords, cast and crew into consideration.

##### Metadata based content recommender

##### To build the standard metadata based content recommender, we will need to merge our current dataset with the credits and the keywords datasets. Let us prepare this data as our first step.

In [28]:
credits = pd.read_csv('Desktop/ML DATA/credits.csv')
keywords = pd.read_csv('Desktop/ML DATA/keywords.csv')

In [29]:
credits['id'] = credits['id'].astype('int')
keywords['id'] = keywords['id'].astype('int')
md['id'] = md['id'].astype('int')

In [30]:
md.shape

(45463, 25)

In [31]:
md = md.merge(credits,on='id')
md = md.merge(keywords,on='id')

In [32]:
md.shape

(46628, 28)

In [33]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9219, 28)

##### We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the following intuitions:

##### Crew: From the crew, we will only pick the director as our feature since the others don't contribute that much to the feel of the movie.
##### Cast: Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list.

In [34]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [35]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [36]:
smd['director'] = smd['crew'].apply(get_director)

In [37]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x,list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x)>=3 else x)

In [38]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x,list) else [])

##### These are the steps I follow in the preparation of my cast and director data:

##### 1. Strip Spaces and Convert to Lowercase from all our features. This way, our engine will not confuse between Johnny Depp and Johnny Galecki.
##### 2. Mention Director 3 times to give it more weight relative to the entire cast.
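
##### The two steps above can be sketched in plain Python as follows. The names clean_name and build_soup are illustrative helpers, and the cast, keyword and genre values are made-up inputs rather than rows read from the dataset.

```python
def clean_name(name):
    """Remove spaces and lowercase, so 'Johnny Depp' becomes one token."""
    return name.replace(" ", "").lower()

def build_soup(keywords, cast, director, genres, director_weight=3):
    """Join all metadata tokens into one string for a bag-of-words vectorizer."""
    tokens = [clean_name(k) for k in keywords]
    tokens += [clean_name(c) for c in cast[:3]]          # top 3 billed actors only
    tokens += [clean_name(director)] * director_weight   # repeat to add weight
    tokens += list(genres)
    return " ".join(tokens)

soup = build_soup(
    keywords=["dc comics", "vigilante"],
    cast=["Christian Bale", "Heath Ledger", "Michael Caine", "Gary Oldman"],
    director="Christopher Nolan",
    genres=["Drama", "Action"],
)
print(soup)
```

##### Because the names are fused into single tokens, two people who merely share a first name no longer contribute a common token, while the repeated director token counts three times in the bag of words.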

In [39]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ","")) for i in x])

In [40]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ","")))
smd['director'] = smd['director'].apply(lambda x: [x,x, x])

##### Keywords

##### We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we calculate the frequency counts of every keyword that appears in the dataset.

In [41]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1,drop=True)
s.name = 'keyword'

In [42]:
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

##### Keywords occur in frequencies ranging from 1 to 610. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as Dogs and Dog are considered the same.
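
##### The frequency filter described above can be illustrated on a tiny hypothetical set of keyword lists. This sketch uses collections.Counter instead of pandas and leaves out the stemming step, so it only shows how keywords that occur once are dropped.

```python
from collections import Counter

# Hypothetical keyword lists for three movies
keyword_lists = [
    ["murder", "independent film"],
    ["murder", "dogs"],
    ["independent film", "time travel"],
]

# Count how often each keyword occurs across all movies
counts = Counter(k for kws in keyword_lists for k in kws)

# Keep only keywords seen more than once
frequent = {k for k, c in counts.items() if c > 1}
filtered = [[k for k in kws if k in frequent] for kws in keyword_lists]
print(filtered)
```

##### A keyword that appears in only one movie can never match another movie, so removing it shrinks the vocabulary without changing any similarity between two different movies that share keywords.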

In [43]:
s = s[s > 1]

In [44]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [45]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [46]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ","")) for i in x])

In [47]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [48]:
# min_df=1 keeps every term that occurs in at least one document (same vocabulary as min_df=0)
count = CountVectorizer(analyzer='word',ngram_range=(1,2),min_df=1,stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [49]:
cosine_sim = cosine_similarity(count_matrix,count_matrix)

In [50]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

##### We will reuse the get_recommendations function that we had written earlier. Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results. Let us check for The Dark Knight again and see what recommendations I get this time around.

In [51]:
get_recommendations('The Dark Knight').to_frame()

Unnamed: 0,title
8031,The Dark Knight Rises
6218,Batman Begins
6623,The Prestige
2085,Following
7648,Inception
4145,Insomnia
3381,Memento
8613,Interstellar
7659,Batman: Under the Red Hood
1134,Batman Returns


In [52]:
get_recommendations('Rambo').to_frame()

Unnamed: 0,title
7671,The Expendables
4317,Staying Alive
1936,Rocky II
1937,Rocky III
1938,Rocky IV
6653,Rocky Balboa
281,The Specialist
2448,Nighthawks
8026,Bullet to the Head
8713,The Expendables 3


##### Popularity and Ratings

##### One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that Batman and Robin shares many characters with The Dark Knight, but it was a terrible movie that shouldn't be recommended to anyone.

##### Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

##### I will take the top 25 movies based on similarity scores and calculate the vote of the 60th percentile movie. Then, using this as the value of  m , we will calculate the weighted rating of each movie using IMDB's formula like we did in the Simple Recommender section.

In [53]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title','vote_count','vote_average','year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count']>=m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
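    # Note: weighted_rating reads the notebook-level m and C defined earlier, not the local m and C computed above
    qualified['wr'] = qualified.apply(weighted_rating,axis=1)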
    qualified = qualified.sort_values('wr',ascending=False).head(10)
    return qualified

In [54]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.917588
8613,Interstellar,11187,8,2014,7.897107
6623,The Prestige,4510,8,2006,7.758148
3381,Memento,4168,8,2000,7.740175
8031,The Dark Knight Rises,9263,7,2012,6.921448
6218,Batman Begins,7511,7,2005,6.904127
1134,Batman Returns,1706,6,1992,5.846862
132,Batman Forever,1529,5,1995,5.054144
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.013943
1260,Batman & Robin,1447,4,1997,4.287233


In [55]:
improved_recommendations('Mission: Impossible - Ghost Protocol')

Unnamed: 0,title,vote_count,vote_average,year,wr
5686,The Incredibles,5290,7,2004,6.866926
8673,Mission: Impossible - Rogue Nation,3274,7,2015,6.794575
2228,The Iron Giant,1470,7,1999,6.59994
7140,Quantum of Solace,3015,6,2008,5.904983
8952,Tomorrowland,2904,6,2015,5.901823
561,Mission: Impossible,2677,6,1996,5.894659
6484,Mission: Impossible III,2062,6,2006,5.868704
3669,Jurassic Park III,2109,5,2001,5.041795
6785,Fantastic 4: Rise of the Silver Surfer,2648,5,2007,5.034486
8854,Terminator Genisys,3677,5,2015,5.025854


In [56]:
improved_recommendations('Rambo')

Unnamed: 0,title,vote_count,vote_average,year,wr
1930,First Blood,1523,7,1982,6.610774
8039,Mission: Impossible - Ghost Protocol,4026,6,2011,5.926521
7671,The Expendables,2977,6,2010,5.903924
8027,The Expendables 2,2940,6,2012,5.902871
8713,The Expendables 3,1830,6,2014,5.85525
1938,Rocky IV,984,6,1985,5.768889
1936,Rocky II,948,6,1979,5.762869
1937,Rocky III,894,6,1982,5.753227
1929,Rambo: First Blood Part II,884,6,1985,5.751354
6653,Rocky Balboa,858,6,2006,5.746351


In [57]:
improved_recommendations('The Notebook')

Unnamed: 0,title,vote_count,vote_average,year,wr
734,Breakfast at Tiffany's,1082,7,1961,6.49755
5539,Before Sunset,734,7,2004,6.347847
7337,My Sister's Keeper,614,7,2009,6.273173
3980,John Q,604,7,2002,6.266171
4354,Frida,397,7,2002,6.083376
97,The Bridges of Madison County,397,7,1995,6.083376
8645,The Other Woman,1467,6,2014,5.827609
1264,My Best Friend's Wedding,606,6,1997,5.68489
6675,Alpha Dog,463,6,2006,5.634655
3898,Kate & Leopold,430,6,2001,5.6207


##### Unfortunately, Batman and Robin does not disappear from our recommendation list. This is probably because it is rated a 4, which is only slightly below average on TMDB. It certainly doesn't deserve a 4 when amazing movies like The Dark Knight Rises have only a 7. However, there is not much we can do about this. Therefore, we will conclude our Content Based Recommender section here and come back to it when we build a hybrid engine.

# Collaborative Filtering Recommendation System

##### Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

##### Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

##### Therefore, in this section, we will use a technique called Collaborative Filtering to make recommendations to Movie Watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service those users have used/experienced but I have not.

##### I will not be implementing Collaborative Filtering from scratch. Instead, I will use the Surprise library, which uses extremely powerful algorithms like Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) and give great recommendations.
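
##### To give a feel for what such a model does, here is a hand-rolled sketch of a biased matrix factorisation fitted with stochastic gradient descent, which to my understanding is the family of model behind Surprise's SVD. It is a stand-in written for illustration, not Surprise's implementation; the function names, hyperparameters and toy ratings are all assumptions.

```python
import random

def train_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i with plain SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating
    b_u = {u: 0.0 for u in users}                       # user biases
    b_i = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                p_uf, q_if = p[u][f], q[i][f]
                p[u][f] += lr * (err * q_if - reg * p_uf)
                q[i][f] += lr * (err * p_uf - reg * q_if)
    return mu, predict

def training_rmse(ratings, predict):
    """Root mean square error of predict over the given (user, item, rating) triples."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical (userId, movieId, rating) triples
toy_ratings = [
    (1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0), (2, 12, 2.0), (3, 11, 3.5),
    (3, 12, 1.5), (4, 10, 4.5), (4, 11, 3.0), (4, 12, 2.0),
]

mu, predict = train_biased_mf(toy_ratings)
baseline_rmse = training_rmse(toy_ratings, lambda u, i: mu)   # always predict the mean
fitted_rmse = training_rmse(toy_ratings, predict)
print(baseline_rmse, fitted_rmse)
```

##### The fitted model should beat the predict-the-mean baseline on the training triples. Note that this is training error only, so it says nothing about generalisation; that is what cross-validation is for.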

In [58]:
reader = Reader()

In [59]:
ratings = pd.read_csv('Desktop/ML DATA/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [60]:
from surprise import dataset

In [69]:
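from sklearn.model_selection import KFold
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
# The generator returned below is never consumed, so this KFold has no effect on the evaluation;
# cross_validate in the next cell performs its own 5-fold split by default.
kf = KFold(n_splits=5)
kf.split(data)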

<generator object _BaseKFold.split at 0x000002184019C048>

In [90]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE','MAE'])

{'test_rmse': array([0.8979407 , 0.90763852, 0.88639994, 0.88909262, 0.89960978]),
 'test_mae': array([0.69397906, 0.69677664, 0.6834638 , 0.68349759, 0.69078676]),
 'fit_time': (9.755292177200317,
  9.993813753128052,
  10.339694738388062,
  9.978724718093872,
  11.131991863250732),
 'test_time': (0.29247379302978516,
  0.2975788116455078,
  0.38968992233276367,
  0.2580740451812744,
  0.26847290992736816)}

In [91]:
mean_rmse = (0.8979407 + 0.90763852 + 0.88639994 + 0.88909262 + 0.89960978)/5
mean_mae = (0.69397906 + 0.69677664 + 0.6834638 + 0.68349759 + 0.690786761)/5

In [92]:
print('MEAN RMSE: ', mean_rmse)
print('MEAN MAE: ',mean_mae)

MEAN RMSE:  0.896136312
MEAN MAE:  0.6897007702


##### We get a mean Root Mean Square Error of 0.8961, which is more than good enough for our case. Let us now train on our full dataset and arrive at predictions.
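
##### For reference, RMSE and MAE can be computed by hand as in the sketch below. The actual and predicted ratings are hypothetical; the five fold scores are copied from the cross-validation output above, and averaging them reproduces the mean RMSE.

```python
import math

def rmse(actual, predicted):
    """Root mean square error: penalises large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual_toy = [4.0, 3.0, 5.0, 2.0]       # hypothetical true ratings
predicted_toy = [3.5, 3.0, 4.0, 3.0]    # hypothetical model estimates
print(rmse(actual_toy, predicted_toy), mae(actual_toy, predicted_toy))

# Per-fold RMSE values copied from the cross_validate output above
fold_rmse = [0.8979407, 0.90763852, 0.88639994, 0.88909262, 0.89960978]
mean_fold_rmse = sum(fold_rmse) / len(fold_rmse)
print(mean_fold_rmse)
```

##### On a rating scale with half-star steps, an RMSE of about 0.9 means the typical prediction is off by a little less than one star.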

In [93]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x21845a1f788>

##### Let us pick a user and check the ratings s/he has given.

In [94]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [104]:
svd.predict(1,302,3)

Prediction(uid=1, iid=302, r_ui=3, est=2.6661628843245015, details={'was_impossible': False})

##### For the movie with ID 302, we get an estimated rating of 2.666. One startling feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.

# Hybrid Recommender System

##### In this section, I will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:

##### 1. Input: User ID and the Title of a Movie
##### 2. Output: Similar movies sorted on the basis of expected ratings by that particular user.
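
##### The idea can be sketched independently of the trained models: take a list of content-similar candidates and re-rank them by the rating a collaborative model predicts for the given user. In the sketch below, hybrid_rank, toy_predict and the candidate list are hypothetical stand-ins for the similarity lookup and the SVD model.

```python
def hybrid_rank(user_id, candidates, predict_rating, top_n=10):
    """Sort content-similar candidates by the user's predicted rating."""
    scored = [(title, predict_rating(user_id, movie_id)) for title, movie_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical predicted ratings, standing in for a trained model
toy_estimates = {(1, 101): 3.1, (1, 102): 4.2, (1, 103): 2.5}

def toy_predict(user_id, movie_id):
    # Fall back to a neutral 3.0 for unseen (user, movie) pairs
    return toy_estimates.get((user_id, movie_id), 3.0)

# (title, movieId) pairs that a content-based step might return
candidates_toy = [("Movie A", 101), ("Movie B", 102), ("Movie C", 103)]
ranking = hybrid_rank(1, candidates_toy, toy_predict, top_n=2)
print(ranking)
```

##### The content step decides which movies are eligible and the collaborative step decides their order, so two users asking about the same movie can receive differently ordered lists.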

In [95]:
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):  # NaN or non-numeric ids cannot be converted
        return np.nan

In [96]:
id_map = pd.read_csv('Desktop/ML DATA/links_small.csv')[['movieId','tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId','id']
id_map = id_map.merge(smd[['title','id']], on='id').set_index('title')

In [97]:
indices_map = id_map.set_index('id')

In [98]:
def hybrid(userId,title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    movie_id = id_map.loc[title]['movieId']
    
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title','vote_count','vote_average','year','id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId,indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est',ascending=False)
    return movies.head(10)

In [99]:
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
974,Aliens,3282.0,7.7,1986,679,3.052847
922,The Abyss,822.0,7.1,1989,2756,3.02532
1011,The Terminator,4208.0,7.4,1984,218,2.976584
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,2.942065
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,2.89231
2014,Fantastic Planet,140.0,7.6,1973,16306,2.833797
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,2.827136
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,2.810231
1376,Titanic,7770.0,7.5,1997,597,2.757346
3060,Sinbad and the Eye of the Tiger,39.0,6.3,1977,11940,2.709489


In [100]:
hybrid(500, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
4347,Piranha Part Two: The Spawning,41.0,3.9,1981,31646,3.322933
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,3.314176
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.282249
1376,Titanic,7770.0,7.5,1997,597,3.262686
3060,Sinbad and the Eye of the Tiger,39.0,6.3,1977,11940,3.238267
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.232843
7265,Dragonball Evolution,475.0,2.9,2009,14164,3.152962
922,The Abyss,822.0,7.1,1989,2756,3.110923
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,3.078523
831,Escape to Witch Mountain,60.0,6.5,1975,14821,3.042114


In [105]:
hybrid(10098, 'Rambo')

Unnamed: 0,title,vote_count,vote_average,year,id,est
8039,Mission: Impossible - Ghost Protocol,4026.0,6.8,2011,56292,3.766514
4418,Escape to Victory,163.0,6.7,1981,17360,3.715429
6653,Rocky Balboa,858.0,6.5,2006,1246,3.668667
1930,First Blood,1523.0,7.2,1982,1368,3.636806
1936,Rocky II,948.0,6.9,1979,1367,3.601809
8713,The Expendables 3,1830.0,6.1,2014,138103,3.522488
658,Daylight,385.0,5.8,1996,11228,3.519906
2448,Nighthawks,87.0,6.4,1981,21610,3.509734
2883,The Lords of Flatbush,18.0,5.7,1974,38925,3.457191
8027,The Expendables 2,2940.0,6.1,2012,76163,3.403476
