# Movies Recommender System

In this assignment, you will implement several recommendation algorithms (content-based, popularity-based and collaborative filtering) and combine them into an ensemble that serves as the final recommendation system.

**1. Dataset**

These files contain metadata for all 45,000 movies listed in the Film Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, movie vote counts and vote averages.
This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5.

This dataset consists of the following files:

• movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full Film dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

• keywords.csv: Contains the movie plot keywords for the movies. Available in the form of a stringified JSON Object.

• credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

• links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

• links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

• ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

**2. Task**

Your task is to build a Simple Recommender using movies from the Full Dataset. All personalized recommender systems (Content, Collaborative and Hybrid) will use the small dataset instead, because of the computing power required.

• Simple Recommender: Please make sure that this system uses the overall TMDB Vote Count and Vote Averages to build Top Movies Charts, both in general and for a specific genre.
(Hint: The IMDB Weighted Rating System can be used to calculate ratings)

• Content Based Recommender: You have to build two content based engines; one that takes movie overview and taglines as input and the other which takes metadata such as cast, crew, genre and keywords to come up with predictions.
(Hint: You can further use a simple filter to give greater preference to movies with more votes and higher ratings.)

• Collaborative Filtering: Build a collaborative filter based on singular value decomposition (SVD).
(Hint: The RMSE obtained must be at least less than 1 and the engine should give estimated ratings for a given user and movie.)

• Hybrid Engine: Now here you will use ideas from content and collaborative filtering to build an engine that gives movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.
(Hint: Input: User ID and the title of a movie. Output: similar movies sorted by the ratings expected from that particular user. For the same movie, different users should get different recommendations.)

**Important**

You can use Jupyter notebook locally or Google Colab for this assignment. Please ensure that you print out the data at every step (Recommendation, Prediction, EDA) and use libraries like seaborn or matplotlib to plot the findings wherever relevant. Write your own conclusions according to your findings. If you are using Jupyter notebook, please upload your solution repo to GitHub; if you are using Colab, please share the Colab notebook.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

from ast import literal_eval

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on a single metric score.
The implementation of this model is simple: I will sort the movies by their rating score and display the top movies of the list. 

In [2]:
md = pd.read_csv('H:/Movie Recommendation Kredent/Dataset/movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
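
The cell above relies on the fact that columns such as genres are stored as stringified lists of dictionaries. As a minimal sketch of what happens to a single cell, the snippet below parses one such string and keeps only the names. The helper name `parse_names` is an assumption made for illustration and is not part of the notebook.

```python
from ast import literal_eval

# One raw cell, in the same shape as the genres column of movies_metadata.csv.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def parse_names(cell):
    """Return the 'name' of each dict in a stringified list, or [] for empty cells."""
    if not isinstance(cell, str) or not cell:
        return []                       # NaN or empty string: nothing to parse
    parsed = literal_eval(cell)         # safely turns the string into Python objects
    return [d['name'] for d in parsed] if isinstance(parsed, list) else []

genres = parse_names(raw)
missing = parse_names(float('nan'))
print(genres, missing)
```

Using `literal_eval` instead of `eval` matters here because it only accepts Python literals, so a malformed cell cannot execute arbitrary code.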

In [4]:
md.describe()

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0


The describe() output counts 45,460 movies with a recorded vote count and vote average.  
The average number of votes per movie is about 109.9.  
The average rating of a movie is about 5.62.


In [5]:
min_vote_count = md["vote_count"].min()
max_vote_count = md["vote_count"].max()

print("Minimum number of vote counts :", min_vote_count)
print("Maximum number of vote counts :", max_vote_count)


Minimum number of vote counts : 0.0
Maximum number of vote counts : 14075.0


In [6]:
md[md["vote_count"] == 14075]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
15480,False,,160000000,"[Action, Thriller, Science Fiction, Mystery, A...",http://inceptionmovie.warnerbros.com/,27205,tt1375666,en,Inception,"Cobb, a skilled thief who commits corporate es...",...,2010-07-14,825532764.0,148.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Your mind is the scene of the crime.,Inception,False,8.1,14075.0


In [7]:
md[md["vote_count"] == 0]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
83,False,,0,[],,188588,tt0113612,en,Last Summer in the Hamptons,"Filmed entirely on location in East Hampton, L...",...,1995-11-22,0.0,108.0,[],Released,,Last Summer in the Hamptons,False,0.0,0.0
107,False,,0,[Crime],,96357,tt0113276,en,Headless Body in Topless Bar,An ex-con holds a group of people hostage in a...,...,1995-05-20,0.0,110.0,[],Released,,Headless Body in Topless Bar,False,0.0,0.0
126,False,,0,[],,290157,tt0110217,en,Jupiter's Wife,"Michel Negroponte, a documentary filmmaker, me...",...,1995-01-01,0.0,87.0,[],Released,A Haunting Real Life Mystery,Jupiter's Wife,False,0.0,0.0
132,False,,0,"[Music, Documentary]",,124636,tt0114500,en,Sonic Outlaws,Within days after the release of Negativland's...,...,1995-08-01,0.0,87.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Sonic Outlaws,False,0.0,0.0
137,False,,0,[],,124639,tt0114618,en,Target,A subtle yet violent commentary on feudal lords.,...,1995-08-01,0.0,122.0,[],Released,,Target,False,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45432,False,,0,[Documentary],,323132,tt0074124,en,Altar of Fire,This film records a 12 day ritual performed by...,...,1976-01-01,0.0,45.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Altar of Fire,False,0.0,0.0
45434,False,,0,[],,325439,tt0055178,en,Le Meraviglie di Aladino,Young Aladdin (Donald O'Connor) has a series o...,...,1961-10-31,0.0,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,The Wonders of Aladdin,False,0.0,0.0
45452,False,,0,[Documentary],,276895,tt3054038,en,Deep Hearts,"Deep Hearts is a film about the Bororo Fulani,...",...,1981-01-01,0.0,58.0,"[{'iso_639_1': 'ff', 'name': 'Fulfulde'}, {'is...",Released,,Deep Hearts,False,0.0,0.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


The highest number of votes (14,075) went to the movie Inception.  
About 2,899 movies have no votes at all.

To build a simple recommender system, we need a metric or score to rate and compare the movies. We can calculate this score for every movie and then sort by it to get the top-rated movies.  
The "vote_average" could serve as that score. However, as seen above, many movies have no votes, and a movie can have a high "vote_average" but a very low "vote_count".  

That would not be fair: a movie with an average rating of 8.2 from only 4 votes cannot be considered better than a movie with an average rating of 7.6 from 50 votes. 

As per the hint, I will use the IMDB weighted rating formula to rate the movies.

**The formula is :**

Weighted Rating (WR) = (v / (v + m)) x R + (m / (v + m)) x C

where,

v is the number of votes for the movie.   
m is the minimum votes required to be listed in the chart.  
R is the average rating of the movie.  
C is the mean vote across the whole report.

Here,

v = "vote_count"  
R = "vote_average"    
C = mean of "vote_average" = 5.6

The next step is to determine an appropriate value for m, the minimum number of votes required to be listed in the chart. I will use the 95th percentile as the cutoff. In other words, for a movie to be featured in the recommendation, it must have at least as many votes as 95% of the movies in the list.

In [8]:
v = md["vote_count"]
R = md["vote_average"]
C = md["vote_average"].mean()
m = md["vote_count"].quantile(0.95)

print("Value of C:", C)
print("Value of m:", m)

Value of C: 5.618207215133889
Value of m: 434.0
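
To make the effect of the formula concrete, here is a small worked example using the two hypothetical movies mentioned earlier (8.2 from 4 votes versus 7.6 from 50 votes). It assumes m = 434 and the rounded C = 5.6 from above; the standalone four-argument `weighted_rating` is written only for this illustration.

```python
def weighted_rating(v, R, m, C):
    """IMDB weighted rating: blend a movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434   # 95th-percentile vote count reported above
C = 5.6   # rounded global mean of vote_average

few_votes = weighted_rating(v=4, R=8.2, m=m, C=C)     # high average, almost no votes
more_votes = weighted_rating(v=50, R=7.6, m=m, C=C)   # lower average, more votes

print(round(few_votes, 3), round(more_votes, 3))
print("Movie with 50 votes ranks higher:", more_votes > few_votes)
```

With so few votes, both scores are pulled strongly towards C, but the movie backed by 50 votes ends up ahead. A movie with zero votes would score exactly C.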


In [9]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

In [10]:
# x != np.nan is always True, so missing dates are detected with pd.notnull instead.
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [11]:
md['genres']

0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres, Length: 45466, dtype: object

In [12]:
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
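# Note: the int cast truncates the average (e.g. 8.1 becomes 8), hence the whole numbers in the chart below.
qualified['vote_average'] = qualified['vote_average'].astype('int')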
qualified.shape

(2274, 6)

Hence, to be featured in the recommendation list, a movie must have at least 434 votes. The average rating across all movies is 5.6. A total of 2,274 movies are eligible for the recommendation chart.

In [13]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [14]:
qualified['IMDB Weighted Rate'] = qualified.apply(weighted_rating, axis=1)

In [15]:
qualified = qualified.sort_values('IMDB Weighted Rate', ascending=False).head(250)

### Top Movies

In [16]:
qualified[['title', 'vote_count', 'vote_average', 'IMDB Weighted Rate']].head(20)

Unnamed: 0,title,vote_count,vote_average,IMDB Weighted Rate
15480,Inception,14075,8,7.928755
12481,The Dark Knight,12269,8,7.918626
22879,Interstellar,11187,8,7.911049
2843,Fight Club,9678,8,7.897775
4863,The Lord of the Rings: The Fellowship of the Ring,8892,8,7.88916
292,Pulp Fiction,8670,8,7.886457
314,The Shawshank Redemption,8358,8,7.882427
7000,The Lord of the Rings: The Return of the King,8226,8,7.880635
351,Forrest Gump,8147,8,7.879536
5814,The Lord of the Rings: The Two Towers,7641,8,7.871988


The chart above lists the top movies ranked by the IMDB weighted rating system.
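
The steps of the Simple Recommender can be condensed into a single helper. The sketch below runs on a tiny made-up table, and the function name `build_chart`, the column name `wr` and the toy values are all assumptions for illustration. It uses vectorised column arithmetic instead of a row-wise `apply`, which gives the same scores.

```python
import pandas as pd

def build_chart(df, percentile=0.95):
    """Rank movies by IMDB weighted rating, keeping only those above the vote cutoff."""
    C = df["vote_average"].mean()                 # global mean rating
    m = df["vote_count"].quantile(percentile)     # minimum votes to qualify
    qualified = df[df["vote_count"] >= m].copy()  # copy avoids chained-assignment warnings
    v = qualified["vote_count"]
    R = qualified["vote_average"]
    qualified["wr"] = (v / (v + m)) * R + (m / (v + m)) * C
    return qualified.sort_values("wr", ascending=False)

# Made-up data: two well-voted movies and two with very few votes.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_count": [1000, 10, 500, 2],
    "vote_average": [7.0, 9.5, 8.0, 10.0],
})

chart = build_chart(toy, percentile=0.5)
print(chart[["title", "vote_count", "vote_average", "wr"]])
```

Movies B and D have the highest raw averages but fall below the vote cutoff, so they never reach the chart.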

## Content Based Recommender

The recommender we built gives the same recommendations to everyone, regardless of the user's personal taste. To personalise the recommendations, I am going to build an engine that computes the similarity between movies based on certain metrics and suggests the movies most similar to one that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

I will use the smaller dataset for this purpose.

In [17]:
links_small = pd.read_csv('H:/Movie Recommendation Kredent/Dataset/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [18]:
links_small.head()

0      862
1     8844
2    15602
3    31357
4    11862
Name: tmdbId, dtype: int32

In [19]:
md["id"] = md["id"].astype(int)

ValueError: invalid literal for int() with base 10: '1997-08-20'

The id column contains date strings such as 1997-08-20 and 2012-09-29 in place of numeric IDs. These rows are malformed, so I will drop them entirely using their index values.
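
One way to locate such rows without reading the error messages one at a time is to coerce the column to numbers and look for the values that fail. This is a minimal sketch on a made-up four-row frame, not the procedure used in this notebook.

```python
import pandas as pd

# Made-up frame: three valid ids and one date string that ended up in the id column.
toy = pd.DataFrame({"id": ["862", "8844", "1997-08-20", "15602"]})

# errors="coerce" turns anything that is not a number into NaN instead of raising.
numeric_id = pd.to_numeric(toy["id"], errors="coerce")
bad_rows = toy.index[numeric_id.isna()].tolist()

clean = toy.drop(bad_rows)
clean["id"] = clean["id"].astype(int)
print(bad_rows, clean["id"].tolist())
```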

In [20]:
md = md.drop([19730, 29503, 35587])

In [21]:
md['id'] = md['id'].astype('int')

In [22]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9099, 25)

This small_movie dataset has about 9,000 movies (9,099 rows), and we will use it from here on.

### Movie Description Based Recommender

We will use the overview and the tagline of each movie to build the content-based recommendation system.  

First, we create a new column in the small_movie dataset that combines the overview text and the tagline text. We then convert that text to a vector using TF-IDF (Term Frequency - Inverse Document Frequency).  

Let  
r = one text (document) from the data corpus,  
w = a word in that text,  
D = the data corpus, i.e. the collection of all texts.

TF(w, r) = (number of times w occurs in r) / (total number of words in r), so 0 <= TF(w, r) <= 1  
IDF(w, D) = log(N / n), where N = number of documents in D and n = number of documents that contain w.

The TF-IDF weight of w in r is calculated as:

TF(w, r) x IDF(w, D)


Then, we will calculate similarity between these vectors using Cosine Similarity.
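
The definitions above can be checked by hand on a toy corpus. The sketch below implements the plain textbook formulas in pure Python. The three sentences are made up, and the numbers will not match sklearn's TfidfVectorizer exactly, because by default it uses a smoothed IDF and L2-normalizes every row.

```python
import math

# Made-up corpus of three short "descriptions".
corpus = [
    "a toy comes to life",
    "a boy finds a magic board game",
    "a toy story about a boy",
]
docs = [text.split() for text in corpus]
N = len(docs)

def tf(word, doc):
    """Share of the words in doc that are equal to word."""
    return doc.count(word) / len(doc)

def idf(word):
    """log(N / n); assumes the word occurs in at least one document."""
    n = sum(1 for doc in docs if word in doc)
    return math.log(N / n)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

for word in ["a", "boy", "magic"]:
    print(word, round(tfidf(word, docs[1]), 4))
```

The word "a" occurs in every document, so its IDF is log(1) = 0 and it carries no weight, while "magic" occurs in only one document and gets the highest weight of the three.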

In [23]:
smd["tagline"].head(10)

0                                                  NaN
1            Roll the dice and unleash the excitement!
2    Still Yelling. Still Fighting. Still Ready for...
3    Friends are the people who let you be yourself...
4    Just When His World Is Back To Normal... He's ...
5                             A Los Angeles Crime Saga
6    You are cordially invited to the most surprisi...
7                               The Original Bad Boys.
8                           Terror goes into overtime.
9                 No limits. No fears. No substitutes.
Name: tagline, dtype: object

In [24]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [25]:
smd["description"].head(10)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
5    Obsessive master thief, Neil McCauley leads a ...
6    An ugly duckling having undergone a remarkable...
7    A mischievous young boy, Tom Sawyer, witnesses...
8    International action superstar Jean Claude Van...
9    James Bond must unmask the mysterious head of ...
Name: description, dtype: object

In [26]:
# min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [27]:
tfidf_matrix.shape

(9099, 268124)

The vocabulary of the small_movie dataset has 268,124 terms (unigrams and bigrams, since ngram_range=(1, 2)). Each of the 9,099 movie descriptions is represented as a 268,124-dimensional vector.  

Now, I will find similarities between two movies using cosine similarity.

The cosine similarity between two vectors x and y is defined as:

cosine(x, y) = (x . y) / (||x|| ||y||)

By default, the TF-IDF Vectorizer normalizes every row to unit length, so ||x|| = ||y|| = 1 and the dot product alone gives the cosine similarity score. Let's therefore use sklearn's linear_kernel instead of cosine_similarity, since it skips the normalization step and is faster.
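
The claim that the dot product equals the cosine similarity for TF-IDF vectors can be verified on a small example. The sketch below uses three made-up sentences and relies on the default `norm='l2'` setting of TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions; the first and third share the word "toys".
docs = [
    "toys come to life when their owner leaves",
    "a board game unleashes a jungle adventure",
    "toys and a jungle adventure",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Every row has unit L2 norm, so dividing by the norms changes nothing.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # dot products divided by the norms

print(row_norms)
print(np.allclose(dot, cos))
```

The shortcut only holds while the rows are normalized. With a CountVectorizer matrix, as in the metadata recommender, `cosine_similarity` has to be used.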

In [28]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [29]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

We now have a similarity score for every pair of movies.   
Next, we define a recommendation function that returns the 30 most similar movies based on the cosine similarity score.

The process is as follows:  

1. Build a lookup from movie title to row position, and use it to get the position of the given movie.
2. Get the cosine similarity scores of that movie with all movies. Convert them into a list of tuples where the first element is the position and the second is the similarity score.
3. Sort the list of tuples by similarity score, that is, by the second element, in descending order.
4. Take the top 30 elements of this list, skipping the first one because it refers to the movie itself (the movie most similar to any movie is that movie).
5. Return the titles corresponding to the positions of those top elements.
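
The steps above can be traced on a toy example before applying them to the real matrix. In the sketch below, the 4 x 4 similarity matrix is made up by hand and the function name `top_similar` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

titles = pd.Series(["Toy Story", "Jumanji", "Toy Story 2", "Heat"])
indices = pd.Series(titles.index, index=titles)   # step 1: title -> row position

# Hand-written symmetric similarity matrix with 1.0 on the diagonal.
sim = np.array([
    [1.0, 0.1, 0.8, 0.0],
    [0.1, 1.0, 0.2, 0.0],
    [0.8, 0.2, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])

def top_similar(title, k=2):
    idx = indices[title]
    # steps 2 and 3: (position, score) pairs sorted by score, highest first
    scores = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    # step 4: skip the first entry (the movie itself) and keep the next k
    top = [position for position, _ in scores[1:k + 1]]
    # step 5: map positions back to titles
    return titles.iloc[top].tolist()

print(top_similar("Toy Story"))
```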

In [30]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [31]:
indices

title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
Shin Godzilla                                         9094
The Beatles: Eight Days a Week - The Touring Years    9095
Pokémon: Spell of the Unknown                         9096
Pokémon 4Ever: Celebi - Voice of the Forest           9097
Force Majeure                                         9098
Length: 9099, dtype: int64

In [32]:
smd.head()

Unnamed: 0,index,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year,description
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ..."
1,1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...
2,2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,...,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995,A family wedding reignites the ancient feud be...
3,3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,...,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom..."
4,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,...,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995,Just when George Banks has recovered from his ...


We can see that after resetting the index, the old index has been added as a column and new indices have been assigned.  
All the titles are stored in "titles".  
"indices" is a series of the indices of the small_movie dataset. The indices of the series "indices" have been set as the title of the small_movie dataset.

In [33]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [34]:
get_recommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

The content-based recommendation system recognised similar movies: nearly all of the top results are other Batman films.  

Next, I will take **genre**, **keywords**, **cast** and **crew** into consideration when building the recommendation system.

### Metadata Based Recommender

To build the standard metadata based content recommender, I will need to merge our current dataset with the credits and the keywords datasets.

In [35]:
credits = pd.read_csv('H:/Movie Recommendation Kredent/Dataset/credits.csv')
keywords = pd.read_csv('H:/Movie Recommendation Kredent/Dataset/keywords.csv')

In [36]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [37]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [38]:
#Changing all id to int.
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [39]:
md.shape

(45463, 25)

In [40]:
#Merging credits and keywords with md.
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [41]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9219, 28)

We now have the cast, crew and keywords merged into the movie metadata, all in one dataframe.

In [42]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,1995,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."



In [43]:
smd["crew"].head(1)

0    [{'credit_id': '52fe4284c3a36847f8024f49', 'de...
Name: crew, dtype: object

In [44]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

After exploring the csv files, we can come to the following conclusions about crew and cast:

**1. Crew:** Keep only the director as a feature, on the assumption that the other crew members contribute little to how similar two movies feel.  
**2. Cast:** A movie credits many actors, and not all of them influence the audience equally. As a somewhat arbitrary choice, we keep the top 3 actors that appear in the credits list. 
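
As a minimal sketch of these two choices, the snippet below parses simplified, hand-written crew and cast strings in the same stringified format as credits.csv. The helper names `director_of` and `lead_actors` are assumptions for illustration, and the records contain only the fields needed here.

```python
from ast import literal_eval

# Simplified hand-written records; the real rows carry many more fields.
crew_raw = "[{'job': 'Producer', 'name': 'Bonnie Arnold'}, {'job': 'Director', 'name': 'John Lasseter'}]"
cast_raw = "[{'name': 'Tom Hanks'}, {'name': 'Tim Allen'}, {'name': 'Don Rickles'}, {'name': 'Jim Varney'}]"

def director_of(crew):
    """Name of the first crew member whose job is 'Director', or None if there is none."""
    return next((person['name'] for person in crew if person.get('job') == 'Director'), None)

def lead_actors(cast, k=3):
    """Names of the first k cast members, in credits order."""
    return [person['name'] for person in cast[:k]]

director = director_of(literal_eval(crew_raw))
leads = lead_actors(literal_eval(cast_raw))
print(director, leads)
```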

In [45]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [46]:
smd['director'] = smd['crew'].apply(get_director)

In [47]:
smd.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,video,vote_average,vote_count,year,cast,crew,keywords,cast_size,crew_size,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,False,7.7,5415.0,1995,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",13,106,John Lasseter
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,False,6.9,2413.0,1995,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",26,16,Joe Johnston


In [48]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [49]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I will create a metadata dump for every movie consisting of its **genres, director, main actors and keywords.** I then use a **Count Vectorizer** to build a count matrix, in place of the TF-IDF matrix used in the Description Recommender. The remaining steps are the same as before: we calculate the cosine similarities and return the movies that are most similar.

These are the steps I follow to prepare the genres and credits data:

1. **Strip Spaces and Convert to Lowercase** in all our features. Each name then becomes a single token, so the engine will not treat **Johnny Depp** and **Johnny Galecki** as related just because they share a first name.
2. **Mention Director 3 times** to give the director more weight relative to the cast.
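
The effect of stripping spaces can be seen directly in the vocabulary that a Count Vectorizer learns. The sketch below is illustrative only, and the helper name `squash` is an assumption. It compares the tokens produced for the raw names with those produced for the squashed names.

```python
from sklearn.feature_extraction.text import CountVectorizer

def squash(name):
    """Remove spaces and lowercase, so a full name becomes one token."""
    return name.replace(" ", "").lower()

names = ["Johnny Depp", "Johnny Galecki"]

# With raw names, 'johnny' is a shared token and makes the two actors look related.
raw_vocab = CountVectorizer().fit(names).vocabulary_

# With squashed names, each actor is a distinct token.
squashed_vocab = CountVectorizer().fit([squash(n) for n in names]).vocabulary_

# Repeating the director three times triples that token's count in the matrix.
director_tokens = [squash("John Lasseter")] * 3

print(sorted(raw_vocab), sorted(squashed_vocab), director_tokens)
```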

In [50]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [51]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x, x])

In [52]:
smd.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,video,vote_average,vote_count,year,cast,crew,keywords,cast_size,crew_size,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,False,7.7,5415.0,1995,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...",13,106,"[johnlasseter, johnlasseter, johnlasseter]"


#### Keywords

I will do a small amount of pre-processing of the keywords before putting them to any use. As a first step, I will calculate the frequency count of every keyword that appears in the dataset.

In [53]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s

0           jealousy
0                toy
0                boy
0         friendship
0            friends
            ...     
41391    destruction
41391          kaiju
41391          toyko
41669          music
41669    documentary
Name: keyword, Length: 64407, dtype: object

In [54]:
s = s.value_counts()
s[:5]
s

independent film          610
woman director            550
murder                    399
duringcreditsstinger      327
based on novel            318
                         ... 
stoning                     1
accident victim             1
class distinction           1
individuality               1
story within the story      1
Name: keyword, Length: 12940, dtype: int64

Keywords occur in frequencies ranging from 1 to 610. A keyword that occurs only once cannot link two movies, so these can be safely removed. Finally, we will convert every word to its stem, so that variants such as *dogs* and *dog* are treated as the same keyword.


In [55]:
s = s[s > 1]

In [56]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [57]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [58]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer (namely keywords, actors, director and genres).

In [59]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [60]:
# min_df=1 keeps every term (recent scikit-learn versions reject an integer min_df below 1)
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [61]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
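
The effect of counting terms and comparing them by cosine similarity can be illustrated with a hand-rolled sketch. It assumes unigram counts only (the notebook's vectorizer also uses bigrams), and the soups and film names below are made up rather than taken from the dataset.

```python
import math
from collections import Counter


def count_vector(soup):
    """Map a space-separated soup string to unigram counts."""
    return Counter(soup.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical soups; the director token appears three times, as in the notebook.
soups = {
    "film_a": "hero citi crime directorx directorx directorx action",
    "film_b": "hero crime directorx directorx directorx thriller",
    "film_c": "toy friendship directory directory directory animation",
}
vectors = {name: count_vector(text) for name, text in soups.items()}

print(round(cosine(vectors["film_a"], vectors["film_b"]), 3))  # 0.881
print(cosine(vectors["film_a"], vectors["film_c"]))            # 0.0, no shared tokens
```

In this toy case the repeated director token contributes 9 of the 11 units in the dot product between `film_a` and `film_b`, which is how mentioning the director three times gives it extra weight.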

In [62]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the get_recommendations function that we had written earlier with the changed cosine similarities.

In [63]:
get_recommendations('The Dark Knight').head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

We see that our recommender has been successful in capturing more information due to more metadata and has given us (arguably) better recommendations.

#### Popularity and Ratings

So far, this recommendation system returns movies purely by similarity, regardless of their ratings and popularity.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

I will take the top 25 movies based on similarity scores and calculate the vote count at the 60th percentile of these movies. Then, using this as the value of $m$, we will calculate the weighted rating of each movie using IMDB's formula like we did in the Simple Recommender section.
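
As a reminder, IMDB's weighted rating is $WR = \frac{v}{v+m} R + \frac{m}{v+m} C$, where $v$ is the movie's vote count, $R$ its vote average, $m$ the minimum vote count and $C$ the mean vote average. The sketch below is a standalone version with explicit arguments. The name `weighted_rating_sketch` and its signature are assumptions for illustration; the notebook's own `weighted_rating` is defined in the Simple Recommender section and may differ.

```python
def weighted_rating_sketch(v, R, m, C):
    """IMDB-style weighted rating.

    v: number of votes for the movie
    R: average vote of the movie
    m: minimum number of votes required to be listed
    C: mean vote across the movies being compared
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# With v == m the result sits halfway between the movie's average and the mean.
print(weighted_rating_sketch(v=100, R=8.0, m=100, C=6.0))  # 7.0

# With many more votes than m, the result approaches the movie's own average.
print(weighted_rating_sketch(v=10_000, R=8.0, m=100, C=6.0))
```

The formula pulls movies with few votes towards the mean $C$, so a high average built on a handful of votes does not dominate the chart.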

In [64]:
def improved_recommendations(title):
    idx = indices[title]
    # Rank all movies by similarity and keep the top 25, skipping the movie itself
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    # C: mean vote average of the candidates; m: vote count at their 60th percentile.
    # Note: both are local to this function, so weighted_rating (called with only the row)
    # resolves its own m and C from the scope in which it was defined.
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    # .copy() avoids pandas' SettingWithCopyWarning on the column assignments below
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [65]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.928755
8613,Interstellar,11187,8,2014,7.911049
6623,The Prestige,4510,8,2006,7.790919
3381,Memento,4168,8,2000,7.775381
8031,The Dark Knight Rises,9263,7,2012,6.938156
6218,Batman Begins,7511,7,2005,6.924519
1134,Batman Returns,1706,6,1992,5.922571
132,Batman Forever,1529,5,1995,5.13668
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.035196
1260,Batman & Robin,1447,4,1997,4.373366


## Collaborative Filtering

Our content based engine is only capable of suggesting movies which are *close* to a certain movie.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. 

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the **Surprise** library, which provides algorithms such as **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and produce good recommendations.
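
As background, the SVD algorithm in Surprise estimates a rating as the global mean plus a user bias, an item bias and the dot product of a user factor vector and an item factor vector. The following is a minimal sketch of that prediction rule only, not of the training procedure. The function name, the numbers and the 1 to 5 clipping bounds are assumptions chosen for illustration.

```python
def predict_rating_sketch(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Estimate a rating as mu + b_u + b_i + q_i . p_u, clipped to [low, high].

    mu:  global mean rating
    b_u: bias of the user
    b_i: bias of the item
    p_u: latent factor vector of the user
    q_i: latent factor vector of the item
    """
    dot = sum(p * q for p, q in zip(p_u, q_i))
    est = mu + b_u + b_i + dot
    return min(max(est, low), high)


# Made-up parameters: a slightly harsh user, a slightly liked movie.
est = predict_rating_sketch(mu=3.5, b_u=-0.3, b_i=0.2, p_u=[0.1, -0.2], q_i=[0.5, 0.4])
print(round(est, 2))  # 3.37
```

Training consists of finding the biases and factor vectors that minimise the squared error on the known ratings, which is why RMSE is the natural measure to report.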

In [66]:
reader = Reader()

In [67]:
ratings = pd.read_csv('H:/Movie Recommendation Kredent/Dataset/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [68]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)


algo = SVD()

# Run 5-fold cross-validation and then print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8992  0.9025  0.8972  0.8956  0.8916  0.8972  0.0036  
MAE (testset)     0.6927  0.6939  0.6944  0.6887  0.6867  0.6913  0.0030  
Fit time          4.63    4.77    4.70    4.91    4.50    4.70    0.14    
Test time         0.12    0.13    0.21    0.14    0.12    0.15    0.03    


{'test_rmse': array([0.89916202, 0.90248174, 0.89721187, 0.89560084, 0.89155529]),
 'test_mae': array([0.69270298, 0.69388984, 0.69443922, 0.68872131, 0.68672953]),
 'fit_time': (4.628800630569458,
  4.770929574966431,
  4.704632997512817,
  4.906178951263428,
  4.496537685394287),
 'test_time': (0.124664306640625,
  0.1315920352935791,
  0.21106958389282227,
  0.14059233665466309,
  0.12470412254333496)}
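
For reference, the two error measures reported above can be computed by hand as follows. This is a minimal sketch on made-up ratings, not a reimplementation of Surprise's `cross_validate`.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: penalises large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true ratings and model estimates for three user-movie pairs.
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]

print(round(rmse(actual, predicted), 4))  # 0.6455
print(mae(actual, predicted))             # 0.5
```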

We get a mean **Root Mean Square Error** of 0.8972. Let us now train on our dataset and arrive at predictions.

In [69]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x212521ff9e8>

Let us pick user 1 and check the ratings s/he has given.

In [70]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [71]:
algo.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.6533977453532533, details={'was_impossible': False})

For the movie with ID 302, we get an estimated rating of **2.653**. The recommender system works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have rated the movie.

## Hybrid Recommender

In this section, I will try to build a simple hybrid recommender. It works as follows:

* **Input:** User ID and the Title of a Movie
* **Output:** Similar movies sorted on the basis of expected ratings by that particular user.
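
The idea can be sketched in a few lines before looking at the real implementation: shortlist candidates by content similarity, then re-rank the shortlist by the rating a given user is expected to give. Everything below (the function name, the titles, the similarity values and the per-user estimates) is hypothetical and only stands in for `cosine_sim` and the trained SVD model.

```python
def hybrid_sketch(similarities, predict, top_n=3, candidates=5):
    """Shortlist by similarity, then re-rank by a user's estimated rating.

    similarities: dict mapping title -> similarity to the query movie
    predict:      callable mapping title -> estimated rating for one user
    """
    shortlist = sorted(similarities, key=similarities.get, reverse=True)[:candidates]
    return sorted(shortlist, key=predict, reverse=True)[:top_n]


# Made-up similarity scores to one query movie.
similarities = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.2}

# Made-up estimated ratings for two different users.
user_1 = {"A": 2.5, "B": 4.0, "C": 3.0, "D": 5.0}
user_2 = {"A": 4.5, "B": 2.0, "C": 3.5, "D": 1.0}

print(hybrid_sketch(similarities, user_1.get, top_n=2, candidates=3))  # ['B', 'C']
print(hybrid_sketch(similarities, user_2.get, top_n=2, candidates=3))  # ['A', 'C']
```

Note that movie `D` never appears for the first user despite its high estimated rating, because it does not survive the similarity shortlist. The same query movie therefore yields different lists for different users, but always within the set of similar movies.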

In [72]:
def convert_int(x):
    # Return x as an int, or NaN when it is missing or not convertible
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [73]:
id_map = pd.read_csv('H:/Movie Recommendation Kredent/Dataset/links_small.csv')[['movieId', 'tmdbId']]

In [74]:
id_map.head()

Unnamed: 0,movieId,tmdbId
0,1,862.0
1,2,8844.0
2,3,15602.0
3,4,31357.0
4,5,11862.0


In [75]:
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']

In [76]:
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [77]:
indices_map = id_map.set_index('id')

In [78]:
def hybrid(userId, title):
    idx = indices[title]
    # Looked up but not used below; both raise a KeyError if the title is missing from id_map
    tmdbId = id_map.loc[title]['id']
    
    movie_id = id_map.loc[title]['movieId']
    
    # Content step: the 25 movies most similar to the query, skipping the movie itself
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    # Collaborative step: map each TMDB id to its movieId and estimate this user's rating
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: algo.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [79]:
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.241714
1011,The Terminator,4208.0,7.4,1984,218,3.068841
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,2.921418
344,True Lies,1138.0,6.8,1994,36955,2.880287
922,The Abyss,822.0,7.1,1989,2756,2.869618
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,2.869315
974,Aliens,3282.0,7.7,1986,679,2.866691
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,2.853986
4347,Piranha Part Two: The Spawning,41.0,3.9,1981,31646,2.762691
1668,Return from Witch Mountain,38.0,5.6,1978,14822,2.745464


In [80]:
hybrid(500, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
974,Aliens,3282.0,7.7,1986,679,3.388408
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.344586
1011,The Terminator,4208.0,7.4,1984,218,3.305514
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,3.23033
831,Escape to Witch Mountain,60.0,6.5,1975,14821,3.203448
8724,Jupiter Ascending,2816.0,5.2,2015,76757,3.140756
4347,Piranha Part Two: The Spawning,41.0,3.9,1981,31646,3.124457
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,3.118421
7265,Dragonball Evolution,475.0,2.9,2009,14164,3.085677
2014,Fantastic Planet,140.0,7.6,1973,16306,3.073313


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.

## Conclusion

In this notebook, I have built 4 different recommendation engines based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used overall TMDB Vote Count and Vote Averages to build Top Movies Charts. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.
2. **Content Based Recommender:** I have built two content based engines; one that took movie overview and taglines as input and the other which took metadata such as cast, crew, genre and keywords to come up with predictions. I also devised a simple filter to give greater preference to movies with more votes and higher ratings.
3. **Collaborative Filtering:** I used the Surprise library to build a collaborative filter based on singular value decomposition. The RMSE obtained was less than 1 and the engine gave estimated ratings for a given user and movie.
4. **Hybrid Engine:** I brought together ideas from content and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.
