# Movie Recommendation System

![](https://i.kym-cdn.com/entries/icons/original/000/026/825/movies-tiles.jpg)

# Recommender System

In our movie recommender system, we will implement Simple, Content Based and Collaborative filtering. Combining these models, we will then build the final model, a Hybrid recommender. We will use two datasets: the Full Dataset and the Small Dataset.

- **Full Dataset:** 26,000,000 ratings and 750,000 tag applications submitted by 270,000 users to 45,000 movies. It also includes tag genome data with 12 million relevance scores across 1,100 tags.

- **Small Dataset:** 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.


We will build our Simple Recommender using the Full Dataset, while the Small Dataset will be used for all personalized recommenders (Content Based, Collaborative and Hybrid) because the computing power available to me is very limited. Let us build the simple recommender as a first step.

# Simple Recommender

The Simple Recommender provides every user with generalized recommendations based on movie popularity and, optionally, genre. The underlying principle is that movies that are more popular and more critically acclaimed are more likely to be enjoyed by the average viewer. This model does not offer personalized recommendations.

Implementing this model is straightforward. All we need to do is sort our movies by rating and popularity and display the top of the list. As an added step, we can pass in a genre argument to get the top movies of a specific genre.

In [1]:
import pandas as pd
import numpy as np
import warnings; warnings.simplefilter('ignore')

movie_data = pd.read_csv("movies_metadata.csv")
movie_data.head(10)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,1995-12-15,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0
6,False,,58000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,11860,tt0114319,en,Sabrina,An ugly duckling having undergone a remarkable...,...,1995-12-15,0.0,127.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,You are cordially invited to the most surprisi...,Sabrina,False,6.2,141.0
7,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,45325,tt0112302,en,Tom and Huck,"A mischievous young boy, Tom Sawyer, witnesses...",...,1995-12-22,0.0,97.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Original Bad Boys.,Tom and Huck,False,5.4,45.0
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,1995-12-22,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0
9,False,"{'id': 645, 'name': 'James Bond Collection', '...",58000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.mgm.com/view/movie/757/Goldeneye/,710,tt0113189,en,GoldenEye,James Bond must unmask the mysterious head of ...,...,1995-11-16,352194034.0,130.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,No limits. No fears. No substitutes.,GoldenEye,False,6.6,1194.0


In [2]:
movie_data.shape

(45466, 24)

In [3]:
movie_data.describe()

| | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|
| count | 45460.0 | 45203.0 | 45460.0 | 45460.0 |
| mean | 11209350.0 | 94.128199 | 5.618207 | 109.897338 |
| std | 64332250.0 | 38.40781 | 1.924216 | 491.310374 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 85.0 | 5.0 | 3.0 |
| 50% | 0.0 | 95.0 | 6.0 | 10.0 |
| 75% | 0.0 | 107.0 | 6.8 | 34.0 |
| max | 2787965000.0 | 1256.0 | 10.0 | 14075.0 |


In [4]:
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null objec

We will drop a few features that are not useful for our recommender system.

In [5]:
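# Preview the data without the columns we do not need.
# drop() returns a new DataFrame; movie_data itself is left unchanged here.
movie_data.drop(['homepage', 'adult', 'overview', 'poster_path', 'tagline', 'video'], axis=1)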

Unnamed: 0,belongs_to_collection,budget,genres,id,imdb_id,original_language,original_title,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,Toy Story,21.9469,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,Jumanji,17.0155,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Jumanji,6.9,2413.0
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,en,Grumpier Old Men,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Grumpier Old Men,6.5,92.0
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,en,Waiting to Exhale,3.85949,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Waiting to Exhale,6.1,34.0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,en,Father of the Bride Part II,8.38752,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",439050,tt6209470,fa,رگ خواب,0.072051,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Subdue,4.0,1.0
45462,,0,"[{'id': 18, 'name': 'Drama'}]",111109,tt2028550,tl,Siglo ng Pagluluwal,0.178241,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,Century of Birthing,9.0,3.0
45463,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",67758,tt0303758,en,Betrayal,0.903007,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Betrayal,3.8,6.0
45464,,0,[],227506,tt0008536,en,Satana likuyushchiy,0.003503,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,Satan Triumphant,0.0,0.0


### Features of our dataset


* **belongs_to_collection:** Name of the franchise the movie belongs to.
* **budget:** The budget of the movie in dollars.
* **genres:** A stringified list of dictionaries that list out all the genres associated with the movie.
* **id:** The ID of the movie.
* **imdb_id:** The IMDB ID of the movie.
* **original_language:** The language in which the movie was originally shot.
* **original_title:** The original title of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **production_companies:** A stringified list of production companies involved with the making of the movie.
* **production_countries:** A stringified list of countries where the movie was shot or produced.
* **release_date:** Theatrical Release Date of the movie.
* **revenue:** The total revenue of the movie in dollars.
* **runtime:** The runtime of the movie in minutes.
* **spoken_languages:** A stringified list of spoken languages in the film.
* **status:** The status of the movie (Released, To Be Released, Announced, etc.)
* **title:** The Official Title of the movie.
* **vote_average:** The average rating of the movie.
* **vote_count:** The number of votes by users, as counted by TMDB.

The genres column in our dataset is a stringified list of dictionaries, so we need to convert it into a plain list of genre names.

In [6]:
import ast
from ast import literal_eval

#converted list of dictionaries to list
movie_data['genres'] = movie_data['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] 
                                                                       if isinstance(x, list) else [])
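As a quick illustration of what this conversion does to a single cell, the sketch below parses one stringified genre list. The sample string and the helper name `extract_names` are made up for this example and do not appear in the notebook.

```python
from ast import literal_eval

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    parsed = literal_eval(raw)
    # Anything that does not parse to a list yields an empty list
    return [d['name'] for d in parsed] if isinstance(parsed, list) else []

# Invented sample in the same shape as the genres column
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
genres = extract_names(sample)
empty = extract_names('[]')
print(genres)
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a CSV file.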

We will use the TMDB ratings to come up with our **Top Movies Chart.** To rank the movies we will use IMDB's *weighted rating* formula. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole dataset
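To get a feel for how the formula behaves, here is a minimal sketch with invented numbers. The function name `weighted_rating_value` and the values of `m`, `C`, `v` and `R` are assumptions chosen for illustration only; they are not taken from the dataset.

```python
def weighted_rating_value(v, R, m, C):
    """Weighted rating of one movie: a blend of its own rating R and the mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical cutoff and mean, chosen only to illustrate the behaviour
m, C = 500, 5.0

# Two movies with the same average rating but very different vote counts
few_votes = weighted_rating_value(v=50, R=9.0, m=m, C=C)
many_votes = weighted_rating_value(v=50000, R=9.0, m=m, C=C)
print(round(few_votes, 3), round(many_votes, 3))
```

A movie with few votes is pulled strongly towards the mean *C*, while a movie with many votes keeps a score close to its own average *R*. This is what stops a film with a handful of perfect ratings from topping the chart.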

The next step is to determine an appropriate value for *m*, the minimum number of votes required to be listed in the chart. We will use the **95th percentile** as our cutoff. In other words, for a movie to feature in the chart, it must have at least as many votes as 95% of the movies in the list.
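To make the percentile cutoff concrete, here is a small sketch on invented vote counts. The numbers are hypothetical; only `Series.quantile` and boolean filtering are the same tools the notebook uses.

```python
import pandas as pd

# Hypothetical vote counts for ten movies
votes = pd.Series([1, 3, 5, 8, 10, 20, 34, 90, 400, 5000])

# 95th percentile (pandas interpolates linearly between neighbouring values)
cutoff = votes.quantile(0.95)

# Only movies at or above the cutoff make it into the chart
qualified = votes[votes >= cutoff]
print(cutoff, list(qualified))
```

Because vote counts are heavily skewed, a high percentile leaves only a small set of widely rated movies in the running.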

We will first build our overall Top 250 chart and then define a function to build charts for a particular genre.

In [7]:
vote_count = movie_data[movie_data['vote_count'].notnull()]['vote_count'].astype('int')
vote_average = movie_data[movie_data['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_average.mean()
C

5.244896612406511

In [8]:
m = vote_count.quantile(0.95)
m

434.0

In [9]:
# Extract the release year; pd.notnull is needed because `x != np.nan` is always True
movie_data['year'] = pd.to_datetime(movie_data['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
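The sketch below shows how `errors='coerce'` turns unparseable dates into `NaT` and how the year can be read off afterwards. It uses the `.dt.year` accessor instead of string splitting, which is a swapped-in alternative rather than the approach of the cell above, and the sample dates are invented.

```python
import pandas as pd

# Invented sample: one valid date, one garbage string, one missing value
raw_dates = pd.Series(['1995-10-30', 'not a date', None])

# Values that cannot be parsed become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors='coerce')

# .dt.year gives NaN wherever the date is missing
years = parsed.dt.year
print(years.tolist())
```

Note that this variant yields numeric years, whereas the cell above stores the year as a string.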

In [10]:
top_movies = movie_data[(movie_data['vote_count'] >= m) & (movie_data['vote_count'].notnull()) & (movie_data['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
top_movies['vote_count'] = top_movies['vote_count'].astype('int') # converted float to integer
top_movies['vote_average'] = top_movies['vote_average'].astype('int')
top_movies.shape

(2274, 6)

A movie therefore needs at least **434 votes** on TMDB to be considered for the chart. The mean rating *C* is about **5.24** on a scale of 10. Note that it is computed after the ratings were cast to integers in cell 7, so it is lower than the mean of 5.62 reported by `describe()`. **2274** movies qualify for our list.

In [11]:
#applying the IMDB's weighted rating formula 
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [12]:
top_movies['weighted_rating'] = top_movies.apply(weighted_rating, axis=1)

In [13]:
top_movies = top_movies.sort_values('weighted_rating', ascending=False).head(250)

## Top Movies

In [14]:
top_movies.head(15)

| | title | year | vote_count | vote_average | popularity | genres | weighted_rating |
|---|---|---|---|---|---|---|---|
| 15480 | Inception | 2010 | 14075 | 8 | 29.1081 | [Action, Thriller, Science Fiction, Mystery, A... | 7.917588 |
| 12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167 | [Drama, Action, Crime, Thriller] | 7.905871 |
| 22879 | Interstellar | 2014 | 11187 | 8 | 32.2135 | [Adventure, Drama, Science Fiction] | 7.897107 |
| 2843 | Fight Club | 1999 | 9678 | 8 | 63.8696 | [Drama] | 7.881753 |
| 4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.0707 | [Adventure, Fantasy, Action] | 7.871787 |
| 292 | Pulp Fiction | 1994 | 8670 | 8 | 140.95 | [Thriller, Crime] | 7.86866 |
| 314 | The Shawshank Redemption | 1994 | 8358 | 8 | 51.6454 | [Drama, Crime] | 7.864 |
| 7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.3244 | [Adventure, Fantasy, Action] | 7.861927 |
| 351 | Forrest Gump | 1994 | 8147 | 8 | 48.3072 | [Comedy, Drama, Romance] | 7.860656 |
| 5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.4235 | [Adventure, Fantasy, Action] | 7.851924 |


We observe that at the very top of our list are three Christopher Nolan movies, **Inception**, **The Dark Knight** and **Interstellar**. A strong bias of TMDB users towards specific genres and directors is also seen in the table.

Let us now write a function that builds charts for individual genres. To do so, we first expand the data so that each movie has one row per genre.

In [15]:
s = movie_data.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
genre_movie = movie_data.drop('genres', axis=1).join(s)
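For readers on a recent pandas version, the same one-row-per-genre reshaping can be sketched with `DataFrame.explode` (available since pandas 0.25). This is an alternative to the `stack`/`join` idiom used in the cell above, shown on an invented two-movie frame.

```python
import pandas as pd

# Invented toy data with a list-valued genres column
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'genres': [['Animation', 'Comedy'], ['Drama']],
})

# One row per (movie, genre) pair; the original index is repeated
exploded = toy.explode('genres').rename(columns={'genres': 'genre'})
print(exploded)
```

As with the notebook's version, the index value of a movie appears once for each of its genres, so the index is no longer unique after this step.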

In [16]:
genre_movie.head()

Unnamed: 0,adult,belongs_to_collection,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year,genre
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995,Animation
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995,Comedy
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995,Family
1,False,,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,Adventure
1,False,,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,Fantasy


In [17]:
def simple_recommender(genre, percentile=0.95):
    movie_data2 = genre_movie[genre_movie['genre'] == genre]
    vote_count = movie_data2[movie_data2['vote_count'].notnull()]['vote_count'].astype('int')
    vote_average = movie_data2[movie_data2['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_average.mean()
    m = vote_count.quantile(percentile)
    
    top_movies = movie_data2[(movie_data2['vote_count'] >= m) & (movie_data2['vote_count'].notnull()) & (movie_data2['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    top_movies['vote_count'] = top_movies['vote_count'].astype('int')  # cast the filtered rows, not movie_data2
    top_movies['vote_average'] = top_movies['vote_average'].astype('int')
    
    top_movies['wr'] = top_movies.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    top_movies = top_movies.sort_values('wr', ascending=False).head(250)
    
    return top_movies

Let's see our system in practice by showing the Top 15 Romance Movies (Romance hardly featured anywhere in our Generic Top Chart despite being one of the most common genres of movies).

## Top Romance Movies

In [18]:
simple_recommender('Romance').head(15)

| | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 351 | Forrest Gump | 1994 | 8147 | 8 | 48.3072 | 7.86986 |
| 10309 | Dilwale Dulhania Le Jayenge | 1995 | 661 | 9 | 34.457 | 7.582757 |
| 876 | Vertigo | 1958 | 1162 | 8 | 18.2082 | 7.298862 |
| 40251 | Your Name. | 2016 | 1030 | 8 | 34.461252 | 7.235471 |
| 883 | Some Like It Hot | 1959 | 835 | 8 | 11.8451 | 7.117619 |
| 1132 | Cinema Paradiso | 1988 | 834 | 8 | 14.177 | 7.116921 |
| 19901 | Paperman | 2012 | 734 | 8 | 7.19863 | 7.041055 |
| 37863 | Sing Street | 2016 | 669 | 8 | 10.672862 | 6.984338 |
| 1639 | Titanic | 1997 | 7770 | 7 | 26.8891 | 6.916316 |
| 19731 | Silver Linings Playbook | 2012 | 4840 | 7 | 14.4881 | 6.869789 |


The top romance movie according to our metric is **Forrest Gump**, followed by the Bollywood movie **Dilwale Dulhania Le Jayenge**.

### Data Pre-Processing for Personalized Recommender Engines

To build our personalized recommenders, we need to merge our current dataset with the credits and keywords datasets. Let us prepare this data as our first step.
As mentioned in the introduction, we will use only a subset of the available movies because of limited computing power.

In [19]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')
links = pd.read_csv('links.csv')
links = links[links['tmdbId'].notnull()]['tmdbId'].astype('int')

In [20]:
credits.info()
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
cast    45476 non-null object
crew    45476 non-null object
id      45476 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.0+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
id          46419 non-null int64
keywords    46419 non-null object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


The **credits.csv** file contains the cast and crew information for each movie, while **keywords.csv** contains the keywords used to describe each movie.

In [21]:
movie_data = movie_data.drop([19730, 29503, 35587])
movie_data['id'] = movie_data['id'].astype('int')

small_data = movie_data[movie_data['id'].isin(links)]
small_data.shape

keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')

In [22]:
# merging the credits and keywords datasets with our small dataset
small_data = small_data.merge(credits, on='id')
small_data = small_data.merge(keywords, on='id')
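One behaviour of `merge` worth keeping in mind here: the default is an inner join, and an `id` that appears more than once on the right-hand side produces one output row per match. The toy frames below are invented to show this; whether the real credits or keywords files contain repeated ids is not something this sketch establishes.

```python
import pandas as pd

# Invented frames: id 1 is repeated on the right, id 2 is missing there
movies = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
extra = pd.DataFrame({'id': [1, 1, 3], 'keywords': ['x', 'x', 'z']})

# how='inner' is the default: unmatched ids are dropped, repeated ids multiply rows
merged = movies.merge(extra, on='id')
print(merged.shape)
```

If duplicate rows are a concern, calling `drop_duplicates` on the right-hand frame before merging is one way to guard against them.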

In [23]:
small_data.shape

(9219, 28)

In [24]:
small_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9219 entries, 0 to 9218
Data columns (total 28 columns):
adult                    9219 non-null object
belongs_to_collection    1688 non-null object
budget                   9219 non-null object
genres                   9219 non-null object
homepage                 2001 non-null object
id                       9219 non-null int32
imdb_id                  9219 non-null object
original_language        9219 non-null object
original_title           9219 non-null object
overview                 9207 non-null object
popularity               9219 non-null object
poster_path              9216 non-null object
production_companies     9219 non-null object
production_countries     9219 non-null object
release_date             9219 non-null object
revenue                  9219 non-null float64
runtime                  9219 non-null float64
spoken_languages         9219 non-null object
status                   9217 non-null object
tagline           

We can see that the cast, crew and keywords columns have been added to our small dataset.

In [25]:
small_data

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,1995,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9214,False,,8000000,[Drama],,159550,tt0255313,en,The Last Brickmaker in America,A man must cope with the loss of his wife and ...,...,Released,,The Last Brickmaker in America,False,7.0,1.0,2001,"[{'cast_id': 1, 'character': 'Henry Cobb', 'cr...","[{'credit_id': '544475aac3a36819fb000578', 'de...","[{'id': 6054, 'name': 'friendship'}, {'id': 20..."
9215,False,,1000000,"[Thriller, Romance]",,392572,tt5165344,hi,रुस्तम,"Rustom Pavri, an honourable officer of the Ind...",...,Released,Decorated Officer. Devoted Family Man. Defendi...,Rustom,False,7.3,25.0,2016,"[{'cast_id': 0, 'character': 'Rustom Pavri', '...","[{'credit_id': '5951baf692514129c4016600', 'de...","[{'id': 10540, 'name': 'bollywood'}]"
9216,False,,15050000,"[Adventure, Drama, History, Romance]",,402672,tt3859980,hi,Mohenjo Daro,"Village lad Sarman is drawn to big, bad Mohenj...",...,Released,,Mohenjo Daro,False,6.7,26.0,2016,"[{'cast_id': 0, 'character': 'Sarman', 'credit...","[{'credit_id': '57cd5d3592514179d50018e8', 'de...","[{'id': 10540, 'name': 'bollywood'}]"
9217,False,,15000000,"[Action, Adventure, Drama, Horror, Science Fic...",,315011,tt4262980,ja,シン・ゴジラ,From the mind behind Evangelion comes a hit la...,...,Released,A god incarnate. A city doomed.,Shin Godzilla,False,6.6,152.0,2016,"[{'cast_id': 4, 'character': 'Rando Yaguchi : ...","[{'credit_id': '560892fa92514177550018b2', 'de...","[{'id': 1299, 'name': 'monster'}, {'id': 7671,..."


We now have cast, crew, genres and keywords in one data frame. Let's wrangle this a bit more using the following rules:

1) **Crew**: We will only keep the director as a feature, as the other crew members contribute far less to the feel of a movie.

2) **Cast**: Choosing the cast is a little trickier. Lesser-known actors and small roles have little impact on how people perceive a film, so we only want the main characters and their actors. We will pick the first three actors that appear in the credits list.

In [26]:
# Converting columns which are in the form of stringified lists of dictionaries to Python lists

small_data['cast'] = small_data['cast'].apply(literal_eval)
small_data['crew'] = small_data['crew'].apply(literal_eval)
small_data['keywords'] = small_data['keywords'].apply(literal_eval)
small_data['cast_size'] = small_data['cast'].apply(lambda x: len(x))
small_data['crew_size'] = small_data['crew'].apply(lambda x: len(x))

In [27]:
#defining a function to get directors from the crew list

def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
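To see how this lookup behaves, here is a self-contained sketch on an invented crew list. The function name `find_director` and the crew records are hypothetical; the logic mirrors the `director` function above, with `math.nan` standing in for `np.nan` so that no third-party import is needed.

```python
import math

def find_director(crew):
    """Return the name of the first crew member whose job is 'Director', else NaN."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return math.nan

# Invented crew records in the same shape as the parsed 'crew' column
crew = [
    {'job': 'Screenplay', 'name': 'Writer One'},
    {'job': 'Director', 'name': 'Director One'},
    {'job': 'Director', 'name': 'Director Two'},
]
first = find_director(crew)
missing = find_director([])
print(first, missing)
```

When a movie lists several directors, only the first one is returned, and a movie without a director entry yields NaN.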

In [28]:
small_data['director'] = small_data['crew'].apply(director)
small_data['cast'] = small_data['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_data['cast'] = small_data['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)
small_data['keywords'] = small_data['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_data['cast'] = small_data['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
small_data['director'] = small_data['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
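# Listing the director twice doubles its term count, giving it more weight in the combined features
small_data['director'] = small_data['director'].apply(lambda x: [x,x])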

In [29]:
small_data = small_data.reset_index()
titles = small_data['title']
indices = pd.Series(small_data.index, index=small_data['title'])

We must do a little pre-processing of our keywords before we put them to good use. We measure the frequency counts of each keyword appearing in the dataset as a first step.

In [30]:
c = small_data.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
c.name = 'keyword'
c = c.value_counts()
c[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Keyword frequencies range from 1 to 610. Keywords that occur only once are of no use to us and can safely be removed. Finally, we will convert every word to its stem so that words like Dogs and Dog are treated as the same.

In [31]:
c = c[c > 1]
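The effect of this frequency filter can be sketched in plain Python with `collections.Counter`. This is an equivalent illustration on invented keyword lists, not the pandas code the notebook runs.

```python
from collections import Counter

# Invented keyword lists, one per movie
movie_keywords = [['murder', 'detective'], ['murder', 'robot'], ['robot', 'moon']]

# Count how often each keyword occurs across all movies
counts = Counter(kw for kws in movie_keywords for kw in kws)

# Keep only keywords that occur more than once
frequent = {kw for kw, n in counts.items() if n > 1}

# Drop the singleton keywords from every movie
filtered = [[kw for kw in kws if kw in frequent] for kws in movie_keywords]
print(filtered)
```

A keyword that appears in only one movie can never match another movie, so removing it shrinks the vocabulary without losing any similarity signal.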

In [32]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english') #reduces to root word
stemmer.stem('running')

'run'

In [33]:
def keywords(x):
    words = []
    for i in x:
        if i in c:
            words.append(i)
    return words

In [34]:
small_data['keywords'] = small_data['keywords'].apply(keywords)
small_data['keywords'] = small_data['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
small_data['keywords'] = small_data['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

Now that data pre-processing is done, we can move on to our next recommender engine, the **Content Based Recommender System**.

# Content Based Recommender System

<img src="https://miro.medium.com/max/1026/1*BME1JjIlBEAI9BV5pOO5Mg.png" alt="Drawing" style="width: 300px;"/>


The simple recommendation engine has some major limitations. It offers the same suggestions to everyone, regardless of personal taste. If a person who loves romantic films (and hates action) looked at our top 15 list, they would probably not like any of the films.

To personalise our recommendations, we are going to build an engine that takes as input a movie the user already likes. It analyzes the contents of that movie (storyline, genre, cast, director etc.) to find other movies with similar content, ranks them by similarity score and recommends the most relevant ones to the user. Since we use the contents of the movies to build this engine, the approach is known as **Content Based Filtering.**



In [35]:
small_data.columns

Index(['index', 'adult', 'belongs_to_collection', 'budget', 'genres',
       'homepage', 'id', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'year', 'cast', 'crew', 'keywords',
       'cast_size', 'crew_size', 'director'],
      dtype='object')

Each movie in this dataset comes with a lot of detail, and we don't need all of it. As our feature set (the "content" of the movie), we will use the keywords, cast, director and genres columns.

To do that, we will combine these features into one column.

In [36]:
small_data['combined_features'] = small_data['keywords'] + small_data['cast'] + small_data['director'] + small_data['genres']
small_data['combined_features'] = small_data['combined_features'].apply(lambda x: ' '.join(x))
small_data['combined_features']

0       jealousi toy boy friendship friend rivalri boy...
1       boardgam disappear basedonchildren'sbook newho...
2       fish bestfriend duringcreditssting waltermatth...
3       basedonnovel interracialrelationship singlemot...
4       babi midlifecrisi confid age daughter motherda...
                              ...                        
9214    friendship sidneypoitier wendycrewson jayo.san...
9215    bollywood akshaykumar ileanad'cruz eshagupta t...
9216    bollywood hrithikroshan poojahegde kabirbedi a...
9217    monster godzilla giantmonst destruct kaiju hir...
9218    music documentari paulmccartney ringostarr joh...
Name: combined_features, Length: 9219, dtype: object

Next, we need to represent the combined features as vectors. We will use the TfidfVectorizer class from the sklearn.feature_extraction.text module to do that.

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

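# Define a TF-IDF vectorizer over unigrams and bigrams, removing English stop words.
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0.
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')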
tfidf_matrix = tfidf.fit_transform(small_data['combined_features'])

#### Cosine Similarity

We will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since the TF-IDF Vectorizer L2-normalises each row by default, every vector has unit length, so the dot product of two rows directly gives their cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it is faster.

In [38]:
from sklearn.metrics.pairwise import linear_kernel
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
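
A quick NumPy check of this shortcut on made-up vectors: once rows are L2-normalised (which `TfidfVectorizer` does by default via `norm='l2'`), plain dot products match the full cosine formula. This is an illustrative sketch, not part of the original notebook.

```python
import numpy as np

# Made-up feature vectors, one row per "movie".
raw = np.array([[1.0, 2.0, 0.0],
                [0.0, 1.0, 3.0],
                [2.0, 0.0, 1.0]])

# L2-normalise each row, mimicking TfidfVectorizer's default norm='l2'.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Plain dot products between the normalised rows.
dot_products = unit @ unit.T

# Full cosine formula applied to the un-normalised rows.
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))
```

Skipping the division by the norms is what makes the linear kernel cheaper on already-normalised data.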

In [39]:
cosine_similarity

array([[1.        , 0.00545459, 0.00235834, ..., 0.        , 0.        ,
        0.        ],
       [0.00545459, 1.        , 0.        , ..., 0.00568909, 0.00483432,
        0.        ],
       [0.00235834, 0.        , 1.        , ..., 0.00507731, 0.        ,
        0.        ],
       ...,
       [0.        , 0.00568909, 0.00507731, ..., 1.        , 0.02244657,
        0.        ],
       [0.        , 0.00483432, 0.        , ..., 0.02244657, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the most similar movies based on the cosine similarity score.

In [40]:
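def get_recommendations(title):
    # indices maps a title to its row position; titles holds the title column
    # (both are assumed to be defined earlier in the notebook)
    index = indices[title]
    similarity_scores = list(enumerate(cosine_similarity[index]))
    # Sort by similarity, highest first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (the movie itself) and keep the next 30
    similarity_scores = similarity_scores[1:31]
    movie_indices = [i[0] for i in similarity_scores]
    return titles.iloc[movie_indices]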
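
The function relies on two lookups, `indices` and `titles`, that are not defined in this cell. The sketch below shows one plausible way to build them on a toy frame and runs the same ranking idea against a hand-written similarity matrix. The data and the `recommend` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy catalogue and a hand-written, symmetric similarity matrix.
toy = pd.DataFrame({"title": ["A", "B", "C", "D"]})
sim = np.array([[1.0, 0.2, 0.9, 0.1],
                [0.2, 1.0, 0.3, 0.8],
                [0.9, 0.3, 1.0, 0.4],
                [0.1, 0.8, 0.4, 1.0]])

titles = toy["title"]                               # row position -> title
indices = pd.Series(toy.index, index=toy["title"])  # title -> row position

def recommend(title, k=2):
    """Return the k titles most similar to `title`, excluding itself."""
    row = sim[indices[title]]
    # argsort on the negated scores gives positions in descending order.
    order = [i for i in np.argsort(-row) if i != indices[title]]
    return titles.iloc[order[:k]].tolist()

print(recommend("A"))  # ['C', 'B']
```

Excluding the query by position, rather than assuming it sorts first, also covers the case where another movie ties with a similarity of 1.0.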

In [41]:
get_recommendations('The Dark Knight').head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
1134                Batman Returns
7659    Batman: Under the Red Hood
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
1260                Batman & Robin
Name: title, dtype: object

The recommender has picked up other Christopher Nolan movies (due to the heavy weighting given to the director) and placed them at the top of the list.

In [42]:
get_recommendations('Pulp Fiction').head(10)

1381            Jackie Brown
8905       The Hateful Eight
5200       Kill Bill: Vol. 2
4903       Kill Bill: Vol. 1
898           Reservoir Dogs
8310        Django Unchained
7280    Inglourious Basterds
6788             Death Proof
616     The Great White Hype
4595                   Basic
Name: title, dtype: object

**Popularity and Ratings**

One thing we notice is that our system recommends movies regardless of their ratings and popularity. It is true that **Batman & Robin** shares many characters with **The Dark Knight**, but it was a poorly received movie that should not be recommended to anyone.

To address that, we will add a step that filters out poorly rated movies and returns only movies that are popular and have had a strong critical response.

We will take the top 50 movies by similarity score and compute the 75th percentile of their vote counts. Using this as the value of $m$, we will then apply the IMDB formula to calculate the weighted rating of each movie, as we did in the Simple Recommender System.
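
For reference, the IMDB weighted rating is commonly written as $WR = \frac{v}{v+m} R + \frac{m}{v+m} C$, where $v$ is the movie's vote count, $R$ its average rating, $m$ the minimum vote threshold and $C$ the mean rating of the comparison set. The sketch below is a stand-alone version of that formula. The name `weighted_rating_sketch` and the numbers are illustrative, and the notebook's own `weighted_rating` helper may differ in signature.

```python
def weighted_rating_sketch(v, R, m, C):
    """IMDB-style weighted rating.

    v: number of votes for the movie
    R: average rating of the movie
    m: minimum votes required to be listed
    C: mean rating across the comparison set
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# A movie with many votes stays closer to its own average...
many_votes = weighted_rating_sketch(v=1000, R=8.0, m=500, C=6.0)
# ...while one with few votes is pulled towards the mean C.
few_votes = weighted_rating_sketch(v=50, R=8.0, m=500, C=6.0)
print(round(many_votes, 4), round(few_votes, 4))
```

The larger $m$ is relative to $v$, the more the score shrinks towards $C$, which is why sparsely rated movies rarely top the list.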

In [43]:
def content_based_recommendations(title):
    index = indices[title]
    similarity_scores = list(enumerate(cosine_similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Keep the 50 most similar movies, skipping the movie itself
    similarity_scores = similarity_scores[1:51]
    movie_indices = [i[0] for i in similarity_scores]

    movies = small_data.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    # m is the 75th percentile of the candidates' vote counts
    m = vote_counts.quantile(0.75)
    # Keep candidates with at least m votes; .copy() avoids assigning into a slice of movies
    top_movies = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    top_movies['vote_count'] = top_movies['vote_count'].astype('int')
    top_movies['vote_average'] = top_movies['vote_average'].astype('int')
    top_movies['weighted_rating'] = top_movies.apply(weighted_rating, axis=1)
    # Return the ten candidates with the highest weighted rating
    top_movies = top_movies.sort_values('weighted_rating', ascending=False).head(10)
    return top_movies

In [44]:
content_based_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
7648,Inception,14075,8,2010,7.917588
8613,Interstellar,11187,8,2014,7.897107
6623,The Prestige,4510,8,2006,7.758148
8871,Deadpool,11444,7,2016,6.935872
8031,The Dark Knight Rises,9263,7,2012,6.921448
6218,Batman Begins,7511,7,2005,6.904127
8872,Captain America: Civil War,7462,7,2016,6.903532
8869,Ant-Man,6029,7,2015,6.882142
7583,Kick-Ass,4747,7,2010,6.852979
7600,Iron Man 2,6969,6,2010,5.955732


We can see that the recommended movies now have good ratings and a strong critical response. The earlier recommendation of **Batman & Robin** no longer appears, as it has a poor vote_average of 4.

In [45]:
content_based_recommendations('Pulp Fiction')

Unnamed: 0,title,vote_count,vote_average,year,weighted_rating
898,Reservoir Dogs,3821,8,1992,7.718986
8310,Django Unchained,10297,7,2012,6.929017
7280,Inglourious Basterds,6598,7,2009,6.891679
8846,Kingsman: The Secret Service,6069,7,2015,6.882867
4903,Kill Bill: Vol. 1,5091,7,2003,6.862133
8905,The Hateful Eight,4405,7,2015,6.842588
5200,Kill Bill: Vol. 2,4061,7,2004,6.830542
1381,Jackie Brown,1580,7,1997,6.62179
3214,Unbreakable,1994,6,2000,5.865027
6788,Death Proof,1359,6,2007,5.817225


# Collaborative Filtering Recommender System

<img src="https://miro.medium.com/max/345/1*x8gTiprhLs7zflmEn1UjAQ.png" alt="Drawing" style="width: 400px;"/>

Our content-based recommender suffers from some serious limitations. It can only recommend movies that are **close** to a given movie. That is, it cannot capture tastes across genres and make recommendations accordingly.

It also does not give truly personal recommendations, since it does not take the user's individual tastes and biases into account.

We will therefore use a technique called **Collaborative Filtering** to make more personalized suggestions to movie lovers. Collaborative filtering is based on the idea that users similar to me can be used to predict which movies I would like: movies that those users have already watched, but I have not.

We will implement two different techniques for the Collaborative Filtering Recommender Engine:
 
 1) Using the Surprise library (scikit-surprise)
 
 2) Using Pearson's Correlation

### Using the Surprise Library

We will be using the **Surprise** library, which provides algorithms such as **Singular Value Decomposition (SVD)** that are trained to minimise RMSE (Root Mean Square Error) and can give strong recommendations.

In [46]:
ratings = pd.read_csv("ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [47]:
ratings.rating.value_counts()


4.0    28750
3.0    20064
5.0    15095
3.5    10538
4.5     7723
2.0     7271
2.5     4449
1.0     3326
1.5     1687
0.5     1101
Name: rating, dtype: int64

From here we can see that 4.0 is the most frequent rating, meaning that users gave a 4.0 more often than any other score.

In [48]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
userId       100004 non-null int64
movieId      100004 non-null int64
rating       100004 non-null float64
timestamp    100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [49]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [50]:
from surprise import Reader, Dataset

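# Ratings in this dataset range from 0.5 to 5.0 (Reader defaults to a 1-5 scale)
reader = Reader(rating_scale=(0.5, 5.0))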

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [51]:
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(data, test_size=0.25)

In [52]:
from surprise import SVD, accuracy
svd = SVD()

from surprise.model_selection import cross_validate
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8857  0.8991  0.8970  0.9031  0.8948  0.8959  0.0058  
MAE (testset)     0.6800  0.6940  0.6924  0.6952  0.6917  0.6906  0.0055  
Fit time          3.72    3.73    3.74    3.72    3.74    3.73    0.01    
Test time         0.10    0.18    0.10    0.10    0.10    0.12    0.03    


{'test_rmse': array([0.88565157, 0.89911118, 0.8969547 , 0.90314233, 0.89481328]),
 'test_mae': array([0.68001171, 0.69395337, 0.69238824, 0.69520955, 0.69168572]),
 'fit_time': (3.7172257900238037,
  3.731935739517212,
  3.7403717041015625,
  3.7234628200531006,
  3.7383670806884766),
 'test_time': (0.10200214385986328,
  0.1836094856262207,
  0.10000491142272949,
  0.1000216007232666,
  0.10000038146972656)}

In our case, we get a mean Root Mean Square Error of 0.8959, which is good enough for our purposes. Let us now train the model on the training set and make some predictions.
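
To make the two reported metrics concrete, the sketch below computes RMSE and MAE by hand on made-up ratings. The arrays are illustrative and unrelated to the cross-validation output above.

```python
import numpy as np

# Made-up true and predicted ratings, for illustration only.
true_ratings = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

errors = predicted - true_ratings

# RMSE penalises large errors more heavily because of the squaring.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE is the average absolute deviation.
mae = np.mean(np.abs(errors))

print(rmse, mae)
```

An RMSE below 1 therefore means that predictions are typically off by less than one point on the rating scale.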

In [53]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1611e9b2c48>

In [54]:
predictions = svd.test(testset)

Let us pick user 30 and check the ratings s/he has given.

In [55]:
ratings[ratings['userId'] == 30]

Unnamed: 0,userId,movieId,rating,timestamp
5048,30,1,4.0,944943070
5049,30,2,2.0,945277634
5050,30,6,4.0,945276746
5051,30,8,4.0,968786809
5052,30,11,4.0,948141296
...,...,...,...,...
6054,30,6436,5.0,1055788276
6055,30,6440,4.0,1055797723
6056,30,6452,5.0,1055788246
6057,30,6473,4.0,1055797789


In [56]:
svd.predict(30, 302, 3)

Prediction(uid=30, iid=302, r_ui=3, est=3.8342822025140038, details={'was_impossible': False})

We get an estimated rating of 3.834 for the film with ID 302. One surprising aspect of this recommender method is that what the film is (or what it contains) doesn't matter. It operates solely on the basis of the assigned film ID, and attempts to predict ratings based on how other users have rated the film.
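
Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector. The sketch below evaluates that rule with made-up parameters. In the real model these values are learned from the training set.

```python
import numpy as np

# Made-up model parameters (learned from data in the real model).
global_mean = 3.5                      # mu: average of all ratings
user_bias = 0.2                        # b_u: this user rates slightly high
item_bias = -0.1                       # b_i: this film is rated slightly low
user_factors = np.array([0.5, -0.2])   # p_u: latent tastes of the user
item_factors = np.array([0.4, 0.1])    # q_i: latent traits of the film

# Prediction rule: r_hat = mu + b_u + b_i + q_i . p_u
estimate = global_mean + user_bias + item_bias + item_factors @ user_factors

# Clip the estimate to the rating range seen in the data (0.5 to 5.0).
estimate = float(np.clip(estimate, 0.5, 5.0))
print(round(estimate, 2))
```

Nothing in this rule refers to the film's content. Only the IDs select which biases and factor vectors are used.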

### Collaborative Filtering using Pearson Correlation


Pearson's correlation coefficient measures the linear relationship between two variables. In this collaborative filtering recommender, we will use it to measure how strongly the ratings of two movies are correlated across users, which helps us produce more personalized movie recommendations.
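
As a minimal sketch of what the coefficient measures, the example below computes Pearson's r by hand for two made-up rating columns and compares it with the value from pandas. The data is illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ratings of two movies by the same five users.
toy = pd.DataFrame({
    "movie_a": [5.0, 4.0, 1.0, 2.0, 4.0],
    "movie_b": [4.0, 5.0, 2.0, 1.0, 3.0],
})

a, b = toy["movie_a"], toy["movie_b"]

# Pearson's r: sum of products of the centred ratings, divided by the
# square root of the product of their sums of squares.
a_c, b_c = a - a.mean(), b - b.mean()
manual_r = (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# pandas computes the same quantity for every pair of columns.
pandas_r = toy.corr(method="pearson").loc["movie_a", "movie_b"]

print(round(manual_r, 4), round(pandas_r, 4))
```

A value near 1 means users who rate one movie highly tend to rate the other highly as well.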

In [57]:
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')
ratings = pd.merge(movies,ratings).drop(['genres','timestamp'], axis = 1)
ratings.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),7,3.0
1,1,Toy Story (1995),9,4.0
2,1,Toy Story (1995),13,5.0
3,1,Toy Story (1995),15,2.0
4,1,Toy Story (1995),19,3.0


In [58]:
user_ratings = ratings.pivot_table(index=['userId'],columns=['title'],values='rating')
user_ratings.head()

title,'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),1-900 (06) (1994),...,Zoom (2006),Zootopia (2016),Zulu (1964),Zulu (2013),[REC] (2007),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [59]:
# Drop movies rated by fewer than 10 users, then fill NaN with 0
user_ratings = user_ratings.dropna(thresh=10,axis=1).fillna(0)
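
In `dropna`, `thresh` counts non-missing values, so `thresh=10` with `axis=1` keeps only movies rated by at least ten users. The toy example below uses a threshold of 2 on made-up data to show the effect.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix; NaN means "not rated".
toy = pd.DataFrame({
    "popular": [4.0, 5.0, 3.0, np.nan],
    "niche":   [np.nan, np.nan, 2.0, np.nan],
}, index=pd.Index([1, 2, 3, 4], name="userId"))

# Keep only columns (movies) with at least 2 non-missing ratings,
# then replace the remaining gaps with 0.
filtered = toy.dropna(thresh=2, axis=1).fillna(0)
print(filtered.columns.tolist())
```

One caveat of this design is that filling gaps with 0 treats "not rated" like a very low rating when correlations are computed later.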

In [60]:
user_ratings

title,"'burbs, The (1989)",(500) Days of Summer (2009),...And Justice for All (1979),10 Things I Hate About You (1999),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Young Guns II (1990),Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
item_similarity = user_ratings.corr(method='pearson')
item_similarity.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),...And Justice for All (1979),10 Things I Hate About You (1999),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Young Guns II (1990),Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"'burbs, The (1989)",1.0,0.049358,0.183453,0.082564,0.101948,0.162694,-0.002466,0.067309,0.020832,0.105458,...,0.297098,0.015872,0.040455,0.203206,0.082473,0.086927,0.191698,0.057206,0.099502,0.308778
(500) Days of Summer (2009),0.049358,1.0,0.026848,0.230095,0.078582,0.040188,0.068425,0.156769,0.117532,0.380868,...,0.000182,0.172106,0.345785,-0.034352,0.279399,0.348115,0.161343,-0.029451,0.107851,0.013626
...And Justice for All (1979),0.183453,0.026848,1.0,0.040358,0.141549,0.334189,0.005991,0.283803,-0.017748,0.069345,...,0.013836,-0.018287,0.009675,0.081699,0.036289,0.007654,0.036299,0.132239,-0.01425,0.220807
10 Things I Hate About You (1999),0.082564,0.230095,0.040358,1.0,0.180185,0.182573,0.163474,0.065632,0.0276,0.106938,...,0.031471,0.178147,0.123387,0.121883,0.17097,0.126995,0.256138,0.101245,0.173165,0.171767
101 Dalmatians (1996),0.101948,0.078582,0.141549,0.180185,1.0,0.353796,0.118462,0.045265,-0.030411,0.037087,...,0.09824,0.008197,0.015425,0.089189,0.053503,0.011,0.037184,0.024336,0.023662,0.123471


In [62]:
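def collaborative_recommender(movie_name, user_rating):
    # Centre the rating on 2.5 (roughly the middle of the rating scale) so that
    # liked movies boost similar titles and disliked movies push them down
    similarity_score = item_similarity[movie_name] * (user_rating - 2.5)
    similarity_score = similarity_score.sort_values(ascending=False)

    return similarity_score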

In [63]:
collaborative_recommender("17 Again (2009)", 5).head(10)

title
17 Again (2009)                                        2.500000
How to Lose a Guy in 10 Days (2003)                    1.162404
Green Lantern (2011)                                   1.154159
Tangled (2010)                                         1.138901
13 Going on 30 (2004)                                  1.095069
Princess Diaries, The (2001)                           1.020039
27 Dresses (2008)                                      1.008918
Mean Girls (2004)                                      0.918612
Harry Potter and the Deathly Hallows: Part 2 (2011)    0.894475
Holiday, The (2006)                                    0.876778
Name: 17 Again (2009), dtype: float64

In [64]:
my_rating = [("17 Again (2009)", 5), ("101 Dalmatians (1996)", 3), ("(500) Days of Summer (2009)", 2)]

# Collect one weighted similarity row per rated movie
# (DataFrame.append was removed in pandas 2.0, so build the frame in one step)
similar_movies = pd.DataFrame([collaborative_recommender(movie, rating) for movie, rating in my_rating])

# Sum the scores across the rated movies and show the ten highest totals
similar_movies.sum().sort_values(ascending=False).head(10)

17 Again (2009)                        2.409609
How to Lose a Guy in 10 Days (2003)    1.090288
13 Going on 30 (2004)                  1.069531
Tangled (2010)                         1.017275
Green Lantern (2011)                   0.989302
Princess Diaries, The (2001)           0.980114
27 Dresses (2008)                      0.970204
101 Dalmatians (1996)                  0.829600
D2: The Mighty Ducks (1994)            0.821359
Holiday, The (2006)                    0.797397
dtype: float64
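
Note that the summed scores above still include titles the user has already rated, such as 17 Again and 101 Dalmatians. One possible refinement, sketched here on a hypothetical 4x4 similarity table, is to drop those titles before ranking. The helper name `recommend_unseen` and the numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical item-item similarity table (symmetric, 1.0 on the diagonal).
names = ["A", "B", "C", "D"]
toy_similarity = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.3],
     [0.6, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.4],
     [0.3, 0.5, 0.4, 1.0]],
    index=names, columns=names,
)

def recommend_unseen(rated, similarity, midpoint=2.5):
    """Sum similarity scores weighted by (rating - midpoint), skipping rated titles."""
    # One weighted similarity row per rated movie.
    rows = [similarity[title] * (rating - midpoint) for title, rating in rated]
    totals = pd.DataFrame(rows).sum()
    # Remove movies the user has already rated before ranking.
    seen = [title for title, _ in rated]
    return totals.drop(labels=seen).sort_values(ascending=False)

scores = recommend_unseen([("A", 5), ("C", 2)], toy_similarity)
print(scores)
```

Here the user liked A and disliked C, so B ranks above D: it is more similar to A and less similar to C.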

# Hybrid Recommender System

The hybrid recommender system brings together the techniques we implemented in the content-based and collaborative filtering engines. It works as follows:

- **Input**: User ID and the title of a movie

- **Output**: Similar movies, sorted by the ratings predicted for that particular user.

In [65]:
def convert_int(x):
    # Return x as an int, or NaN when it is missing or not a number
    try:
        return int(x)
    except (TypeError, ValueError):
        return np.nan

In [66]:
id_map = pd.read_csv('links.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(small_data[['title', 'id']], on='id').set_index('title')

# Map each TMDB id to the dataset's movieId
indices_map = id_map.set_index('id')

We will build a hybrid function that combines the techniques used in the content-based recommender and the Surprise-based collaborative filtering recommender.

In [67]:
def hybrid(userId, title):

    # Look up the row position of the movie in the cosine similarity matrix
    index = indices[title]

    # Pair every movie's row position with its similarity to the given movie
    similarity_scores = list(enumerate(cosine_similarity[int(index)]))

    # Sort the (position, score) pairs in decreasing order of similarity
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Select the top 25 pairs, excluding the first (the movie itself)
    similarity_scores = similarity_scores[1:26]

    # Store the row positions of the top 25 movies in a list
    movie_indices = [i[0] for i in similarity_scores]

    # Extract the metadata of those movies
    movies = small_data.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]

    # Predict this user's rating for each candidate with the SVD model
    movies['predicted_ratings'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)

    # Sort the movies in decreasing order of predicted rating
    movies = movies.sort_values('predicted_ratings', ascending=False)

    # Return the top 10 movies as recommendations
    return movies.head(10)

In [68]:
hybrid(1, 'The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,id,predicted_ratings
3381,Memento,4168.0,8.1,2000,77,3.783995
8613,Interstellar,11187.0,8.1,2014,157336,3.543562
7648,Inception,14075.0,8.1,2010,27205,3.488185
6623,The Prestige,4510.0,8.0,2006,1124,3.398754
6218,Batman Begins,7511.0,7.5,2005,272,3.275979
8001,Batman: Year One,255.0,7.1,2011,69735,3.187929
7583,Kick-Ass,4747.0,7.1,2010,23483,3.116289
2131,Superman,1042.0,6.9,1978,1924,3.060907
8334,"Batman: The Dark Knight Returns, Part 2",426.0,7.9,2013,142061,3.046984
2085,Following,363.0,7.2,1998,11660,3.006152


In [69]:
hybrid(500, 'The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,id,predicted_ratings
3381,Memento,4168.0,8.1,2000,77,3.408939
7648,Inception,14075.0,8.1,2010,27205,3.340383
8613,Interstellar,11187.0,8.1,2014,157336,3.321429
7583,Kick-Ass,4747.0,7.1,2010,23483,3.287319
8031,The Dark Knight Rises,9263.0,7.6,2012,49026,3.231979
8334,"Batman: The Dark Knight Returns, Part 2",426.0,7.9,2013,142061,3.20152
8001,Batman: Year One,255.0,7.1,2011,69735,3.074207
4145,Insomnia,1181.0,6.8,2002,320,3.029359
8478,Justice League: Crisis on Two Earths,152.0,7.1,2010,30061,2.937685
6733,TMNT,349.0,6.0,2007,1273,2.910097


We see that our hybrid recommender gives different suggestions to different users even though the film is the same. Our recommendations are therefore more personalised and tailored to individual users.
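
The two-stage idea can be sketched without the real data: a content stage shortlists similar movies, and a collaborative stage re-ranks the shortlist per user. Everything below is hypothetical, including the titles, the similarity row, and the `stub_predictions` table standing in for the SVD model's estimates.

```python
import numpy as np

# Hypothetical inputs: a similarity row for the query movie and a stub
# standing in for the per-user rating estimates of a trained model.
titles = ["Query", "M1", "M2", "M3", "M4"]
similarity_row = np.array([1.0, 0.9, 0.7, 0.2, 0.8])

stub_predictions = {
    ("alice", "M1"): 2.0, ("alice", "M2"): 4.5, ("alice", "M4"): 3.0,
    ("bob", "M1"): 4.0, ("bob", "M2"): 2.5, ("bob", "M4"): 3.5,
}

def hybrid_sketch(user, query_pos=0, n_candidates=3):
    # Stage 1 (content): take the n most similar movies, skipping the query.
    order = [i for i in np.argsort(-similarity_row) if i != query_pos]
    candidates = [titles[i] for i in order[:n_candidates]]
    # Stage 2 (collaborative): re-rank the candidates by predicted rating.
    return sorted(candidates, key=lambda t: stub_predictions[(user, t)], reverse=True)

print(hybrid_sketch("alice"))  # ['M2', 'M4', 'M1']
print(hybrid_sketch("bob"))    # ['M1', 'M4', 'M2']
```

Both users get the same shortlist, but the order differs because it depends on each user's predicted ratings.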

## Conclusion:
Thus, we have built four different recommendation systems based on different ideas and algorithms. They are as follows:

1) **Simple Recommender**: This system used overall TMDB vote counts and vote averages to build top movie charts, both in general and for a particular genre. The IMDB weighted rating formula was used to compute the ratings on which the final sorting was performed.

2) **Content Based Recommender**: We used the contents of a movie (storyline, genre, cast, director, etc.) to find other movies with similar content. It ranks similar movies according to their similarity scores and recommends the most relevant ones to the user.

3) **Collaborative Filtering Recommender:** We built two collaborative engines. The first uses the Surprise library to create a collaborative filter based on Singular Value Decomposition; the obtained RMSE was less than 1, and the engine produced estimated ratings for a given user and film. The second uses Pearson's correlation coefficient to recommend movies whose ratings are correlated with those of movies the user liked.

4) **Hybrid Recommender:** We combined ideas from content-based and collaborative filtering to build an engine that recommends movies to a specific user based on the ratings it estimated internally for that user.