# Movies Recommender System

In this notebook, I will implement a few recommendation algorithms (content based, popularity based and collaborative filtering) and try to build an ensemble of these models to arrive at a final recommendation system. We have two MovieLens datasets to work with.

* **The Full Dataset:** Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
* **The Small Dataset:** Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

I will build a Simple Recommender using movies from the *Full Dataset* whereas all personalised recommender systems will make use of the small dataset (due to the computing power I possess being very limited). As a first step, I will build my simple recommender system.

In [2]:
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity


## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is extremely trivial. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre. 

In [2]:
try:
    md = pd.read_csv('../input/movies-data/metadata_small.csv',
                     dtype={'id': int, 'vote_count': int, 'vote_average': float})
except FileNotFoundError:
    cols = ['id', 'title', 'release_date', 'genres', 'vote_count',
            'vote_average', 'popularity']
    md = pd.read_csv('../input/the-movies-dataset/movies_metadata.csv',
                     skiprows=[19731, 29504, 35588],  #skip error data
                     usecols=cols)
    #extract genres
    md['genres'] = md['genres'].apply(lambda x: [i['name'] for i in literal_eval(x)])
    md = md[md['title'].notnull()].astype({'vote_count': int})
    md = md[cols]
    md.to_csv('../input/movies-data/metadata_small.csv', index=False)

md.head()


I use the TMDB Ratings to come up with our **Top Movies Chart.** I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across all movies in the dataset

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the **95th percentile** as our cutoff. In other words, for a movie to feature in the charts, it must have at least as many votes as 95% of the movies in the list.
I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [None]:
print(f"C = {md['vote_average'].mean()}")
print(f"m95 = {md['vote_count'].quantile(0.95)}")
md[md['vote_count'] >= 434].copy().shape

C = 5.618207215133889
m95 = 434.0


(2274, 7)

Therefore, to be considered for the chart, a movie must have at least **434 votes** on TMDB. We also see that the average rating for a movie on TMDB is **5.618** on a scale of 10, and that **2274** movies qualify for our chart.
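To get a feel for how the formula treats vote counts, here is a minimal sketch that plugs the values found above ($m = 434$, $C \approx 5.618$) into the weighted rating. The two movies and the helper name `weighted_rating_value` are made up for illustration; they are not rows of the dataset.

```python
def weighted_rating_value(v, R, m=434, C=5.618):
    """IMDB-style weighted rating for a single movie.

    v: number of votes, R: average rating,
    m: minimum votes cutoff, C: mean rating over all movies.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Two hypothetical movies (not taken from the dataset).
few_votes = weighted_rating_value(v=500, R=9.0)      # high rating, few votes
many_votes = weighted_rating_value(v=10000, R=8.5)   # lower rating, many votes

print(f"few votes:  {few_votes:.3f}")
print(f"many votes: {many_votes:.3f}")
```

With only 500 votes, the 9.0 average is pulled strongly towards $C$, so that movie ends up ranked below the one with 10,000 votes and an 8.5 average.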

In [None]:
def weighted_rating(df, percentile=0.95):
    C = df['vote_average'].mean()
    m = df['vote_count'].quantile(percentile)
    qualified = df[df['vote_count'] >= m].copy()
    R = qualified['vote_average']
    v = qualified['vote_count']
    qualified['weighted_rating'] = (v / (v + m) * R) + (m / (m + v) * C)
    qualified = qualified.sort_values('weighted_rating', ascending=False)
    return qualified

### Top Movies

In [None]:
weighted_rating(md).head(10)

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
314,278,The Shawshank Redemption,1994-09-23,"['Drama', 'Crime']",8358,8.5,51.645403,8.357746
834,238,The Godfather,1972-03-14,"['Drama', 'Crime']",6024,8.5,41.109264,8.306334
12481,155,The Dark Knight,2008-07-16,"['Drama', 'Action', 'Crime', 'Thriller']",12269,8.3,123.167259,8.208376
2843,550,Fight Club,1999-10-15,['Drama'],9678,8.3,63.869599,8.184899
292,680,Pulp Fiction,1994-09-10,"['Thriller', 'Crime']",8670,8.3,140.950236,8.172155
351,13,Forrest Gump,1994-07-06,"['Comedy', 'Drama', 'Romance']",8147,8.2,48.307194,8.069421
522,424,Schindler's List,1993-11-29,"['Drama', 'History', 'War']",4436,8.3,41.725123,8.061007
23671,244786,Whiplash,2014-10-10,['Drama'],4376,8.3,64.29999,8.058025
5481,129,Spirited Away,2001-07-20,"['Fantasy', 'Adventure', 'Animation', 'Family']",3968,8.3,41.048867,8.035598
1154,1891,The Empire Strikes Back,1980-05-17,"['Adventure', 'Action', 'Science Fiction']",5998,8.2,19.470959,8.025793


We see that three crime dramas, **The Shawshank Redemption**, **The Godfather** and **The Dark Knight**, sit at the very top of our chart. The chart also suggests a strong bias of TMDB users towards particular genres and directors.

Let us now construct our function that builds charts for particular genres. For this, we will relax our default cutoff to the **85th** percentile instead of the 95th.

In [None]:
def build_chart(genre, percentile=0.85):
    df = md[md.genres.apply(lambda x: genre in x)]
    return weighted_rating(df, percentile)

Let us see our method in action by displaying the Top 10 Romance Movies (Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).

### Top Romance Movies

In [None]:
build_chart('Romance').head(10)

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
10309,19404,Dilwale Dulhania Le Jayenge,1995-10-20,"['Comedy', 'Drama', 'Romance']",661,9.1,34.457024,8.701372
40245,372058,Your Name.,2016-08-26,"['Romance', 'Animation', 'Drama']",1030,8.5,34.461252,8.281258
351,13,Forrest Gump,1994-07-06,"['Comedy', 'Drama', 'Romance']",8147,8.2,48.307194,8.173547
1132,11216,Cinema Paradiso,1988-11-17,"['Drama', 'Romance']",834,8.2,14.177005,7.964387
40876,313369,La La Land,2016-11-29,"['Comedy', 'Drama', 'Music', 'Romance']",4745,7.9,19.681686,7.860576
22166,152601,Her,2013-12-18,"['Romance', 'Science Fiction', 'Drama']",4215,7.9,13.829515,7.855724
7208,38,Eternal Sunshine of the Spotless Mind,2004-03-19,"['Science Fiction', 'Drama', 'Romance']",3758,7.9,12.906327,7.850467
876,426,Vertigo,1958-05-09,"['Mystery', 'Romance', 'Thriller']",1162,8.0,18.20822,7.840579
3189,901,City Lights,1931-01-30,"['Comedy', 'Drama', 'Romance']",444,8.2,10.891524,7.7926
15530,31011,Mr. Nobody,2009-09-11,"['Science Fiction', 'Drama', 'Romance', 'Fanta...",1616,7.9,11.817059,7.788307


The top romance movie according to our metrics is Bollywood's **Dilwale Dulhania Le Jayenge**. This Shahrukh Khan starrer also happens to be one of my personal favorites.

## Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our top chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves *Dilwale Dulhania Le Jayenge*, *My Name is Khan* and *Kabhi Khushi Kabhi Gham*. One inference we can obtain is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.

In [None]:
try:  #small_movies_data
    smd = pd.read_csv('../input/movies-data/description.csv')
except FileNotFoundError:
    md = pd.read_csv('../input/the-movies-dataset/movies_metadata.csv',
                     skiprows=[19731, 29504, 35588],  #skip error data
                     dtype={'id': int},
                     usecols=['title', 'id', 'overview', 'tagline'])
    links_small = pd.read_csv('../input/the-movies-dataset/links_small.csv')['tmdbId']
    links_small = links_small.dropna().astype(int)
    smd = md[md['id'].isin(links_small)].copy()
    smd['description'] = smd['overview'].fillna('') + ' ' + smd['tagline'].fillna('')
    smd = smd[['id', 'title', 'description']].drop_duplicates()
    smd.to_csv('../input/movies-data/description.csv', index=False)
    smd = smd.reset_index(drop=True)

smd.shape

(9082, 3)


We have **9082** movies in our small metadata dataset, about 5 times fewer than the 45,000 movies in the original dataset.

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [None]:
smd['description'] = smd['description'].fillna('')
# min_df=1 keeps every term; newer scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=1, stop_words='english')

#### TF-IDF
[TF-IDF wiki](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

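The term frequency $tf$ is the raw count of a term in a document.

The inverse document frequency is smoothed, where $n$ is the number of documents and $df(t)$ is the number of documents containing the term $t$:

$idf(t) = \ln {\frac{1+n}{1+df(t)}}+1$

Each document vector of $tf \cdot idf$ weights is then normalized to a unit vector.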
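As a rough illustration of the weighting described above, the sketch below computes TF-IDF vectors by hand on a three-document toy corpus using only the standard library. The corpus and the helper `tfidf_vector` are made up for this example, and sklearn's tokenization, n-grams and stop-word handling are not reproduced.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "crime family drama",
    "family comedy",
    "space drama",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n = len(docs)

# Document frequency: number of documents containing each term.
df = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed idf, following the formula above.
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(doc):
    tf = Counter(doc)                           # raw term counts
    vec = [tf[t] * idf[t] for t in vocab]       # tf * idf
    norm = math.sqrt(sum(x * x for x in vec))   # L2 norm
    return [x / norm for x in vec]


vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Terms that appear in a single document (such as "crime") receive a larger idf than terms shared by several documents (such as "family"), so they contribute more to the similarity between movies.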

In [None]:
tfidf_matrix = tf.fit_transform(smd['description'])
tfidf_matrix.shape

(9082, 267952)

#### Cosine Similarity

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since the TF-IDF Vectorizer already normalizes each row to unit length, the dot product directly gives the cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it skips the normalization step and is faster.
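The claim that a plain dot product equals the cosine similarity for unit-length vectors can be checked with a short sketch. The two vectors and the helper functions below are arbitrary choices for illustration only.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


def cosine(x, y):
    """Cosine similarity from the definition above."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


# Two arbitrary example vectors.
x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]

ux, uy = normalize(x), normalize(y)
# Both values agree up to floating point rounding.
print(cosine(x, y), dot(ux, uy))
```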

In [None]:
cosine_sim = linear_kernel(tfidf_matrix)
cosine_sim

array([[1.        , 0.00680302, 0.        , ..., 0.        , 0.        ,
        0.00477808],
       [0.00680302, 1.        , 0.01530688, ..., 0.        , 0.00175214,
        0.00367921],
       [0.        , 0.01530688, 1.        , ..., 0.00192587, 0.00221235,
        0.        ],
       ...,
       [0.        , 0.        , 0.00192587, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.00175214, 0.00221235, ..., 0.        , 1.        ,
        0.00146392],
       [0.00477808, 0.00367921, 0.        , ..., 0.        , 0.00146392,
        1.        ]])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the movies most similar to a given movie, ranked by cosine similarity score.


In [None]:
def recommend(title):
    # Look up the movie by exact title; titles are not unique in the dataset.
    movie = smd[smd['title'] == title]
    if movie.empty:
        raise KeyError(f"No movie titled {title!r} in the dataset")
    if len(movie) > 1:
        print("Several movies share this title. Pick an index and call get_recommendations(idx)")
        print(movie)
    else:
        indexes = get_recommendations(movie.index[0])
        recommend_movies = smd.iloc[indexes]
        # The first row is the query movie itself (similarity 1), so skip it.
        return recommend_movies[1:].set_index('id')


def get_recommendations(idx):
    # Return the positions of movies whose similarity score exceeds 0.01,
    # ordered from most to least similar.
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    return [i[0] for i in sim_scores if i[1] > 0.01]

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [None]:
recommend('The Godfather').head(10)

id,title,description
240,The Godfather: Part II,In the continuing saga of the Corleone crime f...
112205,The Family,"The Manzoni family, a notorious mafia clan, is..."
15745,Made,Two aspiring boxers lifelong friends get invol...
16806,Johnny Dangerously,"Set in the 1930s, an honest, goodhearted man i..."
37557,Shanghai Triad,A provincial boy related to a Shanghai crime f...
14615,Fury,When a prisoner barely survives a lynch mob at...
14242,American Movie,AMERICAN MOVIE is the story of filmmaker Mark ...
242,The Godfather: Part III,In the midst of trying to legitimize his busin...
1958,8 Women,Eight women gather to celebrate Christmas in a...
10279,Summer of Sam,"Spike Lee's take on the ""Son of Sam"" murders i..."


In [None]:
recommend('The Dark Knight').head(10)

id,title,description
49026,The Dark Knight Rises,Following the death of District Attorney Harve...
414,Batman Forever,The Dark Knight of Gotham City confronts a das...
364,Batman Returns,"Having defeated the Joker, Batman now faces th..."
142061,"Batman: The Dark Knight Returns, Part 2",Batman has stopped the reign of terror that Th...
40662,Batman: Under the Red Hood,Batman faces his ultimate challenge as the mys...
268,Batman,The Dark Knight of Gotham City begins his war ...
69735,Batman: Year One,Two men come to Gotham City: Bruce Wayne after...
14919,Batman: Mask of the Phantasm,An old flame of Bruce Wayne's strolls into tow...
820,JFK,New Orleans District Attorney Jim Garrison dis...
123025,"Batman: The Dark Knight Returns, Part 1",Batman has not been seen for ten years. A new ...


We see that for **The Dark Knight**, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked **The Dark Knight** probably likes it more because of Nolan and would hate **Batman Forever** and every other substandard movie in the Batman Franchise.

Therefore, we are going to use much more suggestive metadata than **Overview** and **Tagline**. In the next subsection, we will build a more sophisticated recommender that takes **genre**, **keywords**, **cast** and **crew** into consideration.

### Metadata Based Recommender

To build our standard metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.

In [None]:
try:
    credits = pd.read_csv('../input/movies-data/credits_small.csv')
except FileNotFoundError:
    credits = pd.read_csv('../input/the-movies-dataset/credits.csv')
    links_small = pd.read_csv('../input/the-movies-dataset/links_small.csv')['tmdbId']
    links_small = links_small.dropna().astype(int)
    credits = credits[credits['id'].isin(links_small)]


    def get_director(x):
        for i in literal_eval(x):
            if i['job'] == 'Director':
                return i['name']
        return ''


    credits['crew'] = credits['crew'].apply(get_director)
    credits = credits.rename(columns={'crew': 'director'})
    credits['cast'] = credits['cast'].apply(lambda x: [i['name'] for i in literal_eval(x)[:3]])
    credits = credits.astype(str).drop_duplicates()

    keywords = pd.read_csv('../input/the-movies-dataset/keywords.csv')
    keywords = keywords[keywords['id'].isin(links_small)].drop_duplicates()
    keywords['keywords'] = keywords['keywords'].apply(lambda x: [i['name'] for i in literal_eval(x)])

    credits = keywords.astype(str).merge(credits)
    credits.to_csv('../input/movies-data/credits_small.csv', index=False)

We now have our keywords, cast and director all in one dataframe. Let us wrangle this a little more using the following intuitions:

1. **Crew:** From the crew, we will only pick the director as our feature since the others don't contribute that much to the *feel* of the movie.
2. **Cast:** Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list. 


In [None]:
# Both columns are stored as strings in the CSV; parse them back into lists.
for col in ['cast', 'keywords']:
    credits[col] = credits[col].apply(literal_eval)
credits.head()

Unnamed: 0,id,keywords,cast,director
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter
1,8844,"[board game, disappearance, based on children'...","[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch
3,31357,"[based on novel, interracial relationship, sin...","[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer


My approach to building the recommender is going to be extremely *hacky*. What I plan on doing is creating a metadata dump for every movie which consists of **genres, director, main actors and keywords.** I then use a **Count Vectorizer** to create our count matrix as we did in the Description Recommender. The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return movies that are most similar.

These are the steps I follow in the preparation of my genres and credits data:

1. **Strip Spaces and Convert to Lowercase** from all our features. This way, our engine will not confuse **Johnny Depp** with **Johnny Galecki.**
2. **Mention Director 3 times** to give it more weight relative to the entire cast.
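The two preparation steps can be sketched on a single record. The cast and director below are taken from the first row of the `credits` preview shown earlier; the variable names are only for illustration and the keywords and genres are left out to keep the example short.

```python
def strip(name):
    """Remove spaces and lowercase, so a full name becomes a single token."""
    return str(name).replace(" ", "").lower()


# First row of the credits preview above.
cast = ["Tom Hanks", "Tim Allen", "Don Rickles"]
director = "John Lasseter"

cast_tokens = [strip(name) for name in cast]
director_tokens = [strip(director)] * 3   # repeated to give the director more weight
soup = " ".join(cast_tokens + director_tokens)
print(soup)

# Without stripping, two different actors would share the token "johnny".
shared_raw = set("johnny depp".split()) & set("johnny galecki".split())
shared_stripped = {strip("Johnny Depp")} & {strip("Johnny Galecki")}
print(shared_raw, shared_stripped)
```

After stripping, each person maps to one unique token, so a count-based vectorizer only matches movies that share the same person rather than the same first name.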

In [None]:
strip = lambda x: str(x).replace(" ", "").lower()
cast = credits['cast'].apply(lambda x: [strip(i) for i in x])
director = credits['director'].apply(lambda x: [strip(x)] * 3)

#### Keywords

We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we calculate the frequency counts of every keyword that appears in the dataset.

In [None]:
s = pd.DataFrame(np.concatenate(credits['keywords'])).value_counts()
s[:5]

independent film        603
woman director          541
murder                  397
duringcreditsstinger    327
based on novel          309
dtype: int64

Keywords occur in frequencies ranging from 1 to 603. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as *Dogs* and *Dog* are considered the same.

In [None]:
s = s[s > 1]

from nltk.stem.snowball import SnowballStemmer

stem = SnowballStemmer('english').stem


def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            for a in i.split():
                words.append(stem(a))
    return words

In [None]:
keywords = credits['keywords'].apply(filter_keywords)
md = pd.read_csv('../input/movies-data/metadata_small.csv', dtype=
{'id': int}, usecols=['id', 'genres'])
genres = credits.merge(md)[['id', 'genres']].drop_duplicates()
genres = genres['genres'].reset_index(drop=True).apply(literal_eval)
soup = keywords + cast + director + genres
soup = soup.apply(lambda x: ' '.join(x))
soup.head()

0    jealousi toy boy friendship friend rivalri boy...
1    board game disappear base on children book new...
2    fish best friend duringcreditssting waltermatt...
3    base on novel interraci relationship singl mot...
4    babi midlif crisi confid age daughter mother d...
dtype: object

In [None]:
# min_df=1 keeps every term; newer scikit-learn versions reject an integer min_df of 0
count = CountVectorizer(ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(soup)

cosine_sim = cosine_similarity(count_matrix)

We will reuse the get_recommendations function that we had written earlier. Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results. Let us check for **The Dark Knight** again and see what recommendations I get this time around.

In [None]:
recommend('The Dark Knight').head(10)

id,title,description
272,Batman Begins,"Driven by tragedy, billionaire Bruce Wayne ded..."
49026,The Dark Knight Rises,Following the death of District Attorney Harve...
40662,Batman: Under the Red Hood,Batman faces his ultimate challenge as the mys...
364,Batman Returns,"Having defeated the Joker, Batman now faces th..."
415,Batman & Robin,Along with crime-fighting partner Robin and ne...
1124,The Prestige,A mysterious story of two magicians whose inte...
11660,Following,"A struggling, unemployed young writer takes to..."
268,Batman,The Dark Knight of Gotham City begins his war ...
414,Batman Forever,The Dark Knight of Gotham City confronts a das...
1924,Superman,Mild-mannered Clark Kent works as a reporter a...


I am much more satisfied with the results I get this time around. The recommendations seem to have recognized other Christopher Nolan movies (due to the high weightage given to director) and put them as top recommendations. I enjoyed watching **The Dark Knight** as well as some of the other ones in the list including **Batman Begins**, **The Prestige** and **The Dark Knight Rises**. 

We can of course experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

Let me also get recommendations for another movie, **Mean Girls** which happens to be my girlfriend's favorite movie.

In [None]:
recommend('Mean Girls').head(10)

id,title,description
24940,Head Over Heels,Ordinary single girl Amanda Pierce (Monica Pot...
10330,Freaky Friday,Mother and daughter bicker over everything -- ...
9007,Just Like Heaven,Shortly after David Abbott moves into his new ...
272693,The DUFF,Bianca's universe turns upside down when she l...
12556,Ghosts of Girlfriends Past,When notorious womanizer Connor Mead attends h...
58224,Mr. Popper's Penguins,"Jim Carrey stars as Tom Popper, a successful b..."
33344,The House of Yes,Jackie-O is anxiously awaiting the visit of he...
40205,16 Wishes,"The story about Abby Jensen, a girl who's been..."
16996,17 Again,"On the brink of a midlife crisis, 30-something..."
57214,Project X,Three high school seniors throw a party to mak...


#### Ratings and Popularity

One thing that we notice about our recommendation system is that it recommends movies regardless of *ratings* and *popularity*.

Therefore, we will add a mechanism to reorder and return movies which are popular and have had a good critical response.

First, I will take the top 25 movies based on similarity scores. Then we will calculate the weighted rating of each movie using IMDB's formula, as we did in the Simple Recommender section, taking the 60th percentile of the vote counts of these similar movies as the value of $m$. Finally, we reorder the movies by weighted rating.

In [None]:
def improved_recommendations(title):
    movies = recommend(title)[:25]
    md_s = pd.read_csv('../input/movies-data/metadata_small.csv',
                       dtype={'id': int, 'vote_count': int, 'vote_average': float})
    md_s = md_s[md_s['id'].isin(movies.index)]
    return weighted_rating(md_s, 0.6)

In [None]:
improved_recommendations('The Dark Knight')

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
15480,27205,Inception,2010-07-14,"['Action', 'Thriller', 'Science Fiction', 'Mys...",14075,8.1,29.108149,7.962012
11354,1124,The Prestige,2006-10-19,"['Drama', 'Mystery', 'Thriller']",4510,8.0,16.94556,7.672174
18252,49026,The Dark Knight Rises,2012-07-16,"['Action', 'Crime', 'Drama', 'Thriller']",9263,7.6,20.58258,7.474523
10122,272,Batman Begins,2005-06-10,"['Action', 'Crime', 'Drama']",7511,7.5,28.505341,7.367953
15017,23483,Kick-Ass,2010-03-22,"['Action', 'Crime']",4747,7.1,17.26045,7.011273
585,268,Batman,1989-06-23,"['Fantasy', 'Action']",2145,7.0,19.10673,6.892344
1328,364,Batman Returns,1992-06-19,"['Action', 'Fantasy']",1706,6.6,15.001681,6.671623
21066,49521,Man of Steel,2013-06-12,"['Action', 'Adventure', 'Fantasy', 'Science Fi...",6462,6.5,18.538834,6.549214
21415,59859,Kick-Ass 2,2013-07-17,"['Action', 'Adventure', 'Crime']",2275,6.3,13.570264,6.484967
31068,209112,Batman v Superman: Dawn of Justice,2016-03-23,"['Action', 'Adventure', 'Fantasy']",7189,5.7,31.435879,5.890764


Let me also get the recommendations for **Mean Girls**, my girlfriend's favorite movie.

In [None]:
improved_recommendations('Mean Girls')

Unnamed: 0,id,title,release_date,genres,vote_count,vote_average,popularity,weighted_rating
27864,308369,Me and Earl and the Dying Girl,2015-06-12,"['Comedy', 'Drama']",962,7.7,12.503333,7.058764
27663,272693,The DUFF,2015-02-20,"['Romance', 'Comedy']",1372,6.8,8.592449,6.576531
15814,37735,Easy A,2010-09-10,['Comedy'],2282,6.7,15.138144,6.568039
18667,57214,Project X,2012-03-01,"['Comedy', 'Crime']",1624,6.5,9.803023,6.386496
11034,10947,High School Musical,2006-01-20,"['Comedy', 'Drama', 'Family', 'Music', 'TV Mov...",1048,6.1,10.187478,6.1
13044,11887,High School Musical 3: Senior Year,2008-10-22,"['Comedy', 'Drama', 'Family', 'Music', 'Romance']",858,6.1,7.343504,6.1
13602,16996,17 Again,2009-03-11,['Comedy'],1388,6.1,11.362762,6.1
6444,10330,Freaky Friday,2003-08-03,['Comedy'],919,6.0,7.867999,6.04118
17282,58224,Mr. Popper's Penguins,2011-06-17,"['Comedy', 'Family']",775,5.7,15.214342,5.881444
13843,12556,Ghosts of Girlfriends Past,2009-05-01,"['Fantasy', 'Comedy', 'Romance']",716,5.6,8.401493,5.836649


There is little more we can do with content features alone. Therefore, we will conclude our Content Based Recommender section here and come back to it when we build a hybrid engine.

## Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are *close* to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations. It is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

In [None]:
from surprise import Reader, Dataset, SVD
from surprise import accuracy
from surprise.model_selection import train_test_split

ratings = pd.read_csv('../input/the-movies-dataset/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


I will use **Singular Value Decomposition (SVD)**, a matrix factorization algorithm, and measure its accuracy with RMSE (Root Mean Square Error); the lower the RMSE, the better the predicted ratings.

In [None]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], Reader())

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.8957


0.8957343356552033

We get a **Root Mean Square Error** of about 0.90 on the held-out test set, which is more than good enough for our case. Let us now train on the full dataset and arrive at predictions.
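For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The following small sketch computes it by hand; the ratings and predictions are made up, and the helper `rmse` is not part of Surprise.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings and predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

An RMSE of 0.75 here means the predictions are off by roughly three quarters of a rating point on average, with larger errors penalized more heavily than small ones.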

In [None]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f327579d290>

Let us pick user 1 and predict the rating s/he would give to the movie with ID 302.

In [None]:
algo.predict(1, 302)

Prediction(uid=1, iid=302, r_ui=None, est=2.8052981581491636, details={'was_impossible': False})

For the movie with ID 302, we get an estimated rating of **2.8**. One startling feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.
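To make the "IDs only" point concrete, here is a minimal pure-Python sketch of the kind of model behind Surprise's `SVD`: each rating is estimated as a global mean plus a user bias, an item bias and the dot product of learned user and item factor vectors, all fitted by stochastic gradient descent. This is an illustrative toy, not the library's implementation; the ratings, the hyperparameter values and all names are assumptions made for the example.

```python
import random

random.seed(0)

# Made-up (user_id, movie_id, rating) triples.
ratings = [
    (1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0),
    (2, 30, 2.0), (3, 20, 4.0), (3, 30, 1.0),
]
# Hypothetical hyperparameters chosen for this toy example.
n_factors, lr, reg, n_epochs = 2, 0.01, 0.02, 200

mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating
b_u = {u: 0.0 for u, _, _ in ratings}               # user biases
b_i = {i: 0.0 for _, i, _ in ratings}               # item biases
p = {u: [random.gauss(0, 0.1) for _ in range(n_factors)] for u in b_u}
q = {i: [random.gauss(0, 0.1) for _ in range(n_factors)] for i in b_i}


def predict(u, i):
    """Estimate = global mean + user bias + item bias + factor dot product."""
    return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))


def train_rmse():
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


rmse_before = train_rmse()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Regularized gradient steps on biases and factors.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        for f in range(n_factors):
            pu, qi = p[u][f], q[i][f]
            p[u][f] += lr * (err * qi - reg * pu)
            q[i][f] += lr * (err * pu - reg * qi)
rmse_after = train_rmse()

print(rmse_before, rmse_after)
print(predict(1, 30))   # user 1 never rated movie 30
```

Nothing in the model refers to titles, genres or cast: the only inputs are user IDs, movie IDs and the ratings that connect them.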

## Hybrid Recommender

![](https://www.toonpool.com/user/250/files/hybrid_20095.jpg)

In this section, I will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:

* **Input:** User ID and the Title of a Movie
* **Output:** Similar movies sorted on the basis of expected ratings by that particular user.

In [None]:
id_map = pd.read_csv('../input/the-movies-dataset/links_small.csv',
                     usecols=['movieId', 'tmdbId'])
id_map = id_map.dropna().astype(int).set_index('tmdbId')


def hybrid(userid, title):
    movies = recommend(title)
    movies['est'] = [algo.predict(userid, id_map.loc[x]['movieId']).est for x in movies.index]
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [None]:
hybrid(1, 'Avatar')

id,title,description,est
3090,The Treasure of the Sierra Madre,"Fred C. Dobbs and Bob Curtin, both down on the...",3.662898
598,City of God,Cidade de Deus is a shantytown that started du...,3.567629
346,Seven Samurai,A samurai answers a village's request for prot...,3.55637
629,The Usual Suspects,"Held in an L.A. interrogation room, Verbal Kin...",3.544813
10683,Happiness,The lives of many individuals connected by the...,3.52275
424,Schindler's List,The true story of how businessman Oskar Schind...,3.515265
981,The Philadelphia Story,Philadelphia heiress Tracy Lord throws out her...,3.498918
16219,Gladiator 1992,A story of two teenagers trapped in the world ...,3.490696
55,Amores perros,Three different people in Mexico City are cata...,3.487625
122,The Lord of the Rings: The Return of the King,Aragorn is revealed as the heir to the ancient...,3.483339


In [None]:
hybrid(500, 'Avatar')

id,title,description,est
637,Life Is Beautiful,A touching story of an Italian book seller of ...,4.442631
8392,My Neighbor Totoro,Two sisters move to the country with their fat...,4.111443
10445,Shadowlands,"C.S. Lewis, a world-renowned writer and profes...",4.019706
55,Amores perros,Three different people in Mexico City are cata...,3.978867
423,The Pianist,The Pianist is a film adapted from the biograp...,3.966658
713,The Piano,"After a long voyage from Scotland, pianist Ada...",3.937278
1913,The Sea Inside,The Sea Inside is about Spaniard Ramón Sampedr...,3.93117
1092,The Third Man,"Set in postwar Vienna, Austria, ""The Third Man...",3.913648
983,The Man Who Would Be King,A robust adventure about two British adventure...,3.909377
103,Taxi Driver,A mentally unstable Vietnam War veteran works ...,3.899645


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.

## Conclusion

In this notebook, I have built 4 different recommendation engines based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used overall TMDB Vote Count and Vote Averages to build Top Movies Charts, in general and for a specific genre. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.
2. **Content Based Recommender:** We built two content based engines; one that took movie overview and taglines as input and the other which took metadata such as cast, crew, genre and keywords to come up with predictions. We also devised a simple filter to give greater preference to movies with more votes and higher ratings.
3. **Collaborative Filtering:** We used the Surprise library to build a collaborative filter based on singular value decomposition. The RMSE obtained was less than 1 and the engine gave estimated ratings for a given user and movie.
4. **Hybrid Engine:** We brought together ideas from content and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.

Previous -> [The Story of Film](https://www.kaggle.com/rounakbanik/the-story-of-film/)



