**Collaborative and Content based Approach for Movie Recommendation System Source Code File**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!unzip /content/drive/MyDrive/data.zip

Archive:  /content/drive/MyDrive/data.zip
replace credits.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

**1. Import libraries**

In [None]:
!pip install scikit-surprise



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import ast
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import SVD, Reader
from surprise import Dataset
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

**2. Load dataset**

In [None]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')
links_small = pd.read_csv('links_small.csv')
md = pd.read_csv('movies_metadata.csv')
ratings = pd.read_csv('ratings_small.csv')

**3. Understand dataset**

In [None]:
#Credits dataframe
credits.head()
#credits.iloc[0:3]
#credits['cast'].iloc[0:3]
#credits.iloc[:,0:2]

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [None]:
credits.columns

Index(['cast', 'crew', 'id'], dtype='object')

**cast**: Casting information: each actor's name, gender and character name in the movie.

**crew**: Crew information, such as who directed or edited the movie.

**id**: The movie ID assigned by TMDb.

In [None]:
credits.shape

(45476, 3)

In [None]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


**Keywords dataframe**

In [None]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [None]:
keywords.columns

Index(['id', 'keywords'], dtype='object')

**id**: The movie ID assigned by TMDb.

**keywords**: A list of tags/keywords for the movie.

In [None]:
keywords.shape

(46419, 2)

In [None]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


**Link dataframe**

In [None]:
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
links_small.columns

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

**movieId**: A serial number for the movie.

**imdbId**: The movie ID on the IMDb platform.

**tmdbId**: The movie ID on the TMDb platform.

In [None]:
links_small.shape

(9125, 3)

In [None]:
links_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9125 non-null   int64  
 1   imdbId   9125 non-null   int64  
 2   tmdbId   9112 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 214.0 KB


**Metadata dataframe**

In [None]:
md.iloc[0:3].transpose()

Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [None]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

Features

**adult**: Indicates if the movie is X-Rated or Adult.

**belongs_to_collection**: A stringified dictionary that gives information on
 the movie series the particular film belongs to.

**budget**: The budget of the movie in dollars.

**genres**: A stringified list of dictionaries that list out all the genres associated with the movie.

**homepage**: The Official Homepage of the movie.

**id**: The ID of the movie.

**imdb_id**: The IMDB ID of the movie.

**original_language**: The language in which the movie was originally shot.

**original_title**: The original title of the movie.

**overview**: A brief blurb of the movie.

**popularity**: The Popularity Score assigned by TMDB.

**poster_path**: The URL of the poster image.

**production_companies**: A stringified list of production companies involved with the making of the movie.

**production_countries**: A stringified list of countries where the movie was shot/produced in.

**release_date**: Theatrical Release Date of the movie.

**revenue**: The total revenue of the movie in dollars.

**runtime**: The runtime of the movie in minutes.

**spoken_languages**: A stringified list of spoken languages in the film.

**status**: The status of the movie (Released, To Be Released, Announced, etc.)

**tagline**: The tagline of the movie.

**title**: The Official Title of the movie.

**video**: Indicates if there is a video present of the movie with TMDB.

**vote_average**: The average rating of the movie.

**vote_count**: The number of votes by users, as counted by TMDB.

In [None]:
md.shape

(45466, 24)

In [None]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

**Ratings dataframe**

In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [None]:
ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

**4. Pre-processing**

We will perform pre-processing as and when needed throughout the notebook.

In [None]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i[
    'name'] for i in x] if isinstance(x, list) else [])

In [None]:
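# v: the number of votes for each movie
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

# R: the average rating of each movie (truncated to an integer by astype)
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

# C: the mean rating across all movies
C = vote_averages.mean()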
C

5.244896612406511

In [None]:
m = vote_counts.quantile(0.95)
m

434.0

In [None]:
# Pre-processing step: extract the year from the release date by splitting on '-'.
# Note: `x != np.nan` is always True, so unparseable dates end up as the string 'NaT'.

md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(
    lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

In [None]:
qualified = md[(md['vote_count'] >= m) &
               (md['vote_count'].notnull()) &
               (md['vote_average'].notnull())][['title',
                                                'year',
                                                'vote_count',
                                                'vote_average',
                                                'popularity',
                                                'genres']]

qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

In [None]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
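
The function above computes an IMDB-style weighted rating, `WR = (v / (v + m)) * R + (m / (v + m)) * C`, which pulls a movie's average `R` towards the global mean `C` when its vote count `v` is small relative to the cutoff `m`. As a rough sanity check, the sketch below plugs in the values of `m` and `C` printed earlier together with Inception's vote count and truncated average from the top-movies output. The standalone function name `weighted_rating_value` and the second, low-vote movie are made up for this illustration.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating: shrink R towards the global mean C."""
    return (v / (v + m)) * R + (m / (m + v)) * C

# Values taken from the notebook output: m is the 95th-percentile vote count,
# C is the mean of the integer-truncated vote averages.
m = 434.0
C = 5.244896612406511

# Inception: 14075 votes with a truncated average of 8.
wr_many_votes = weighted_rating_value(14075, 8, m, C)

# A hypothetical movie with the same average but only 434 votes (v == m)
# lands exactly halfway between R and C.
wr_few_votes = weighted_rating_value(434, 8, m, C)

print(round(wr_many_votes, 6), round(wr_few_votes, 6))
```

The more votes a movie has, the closer its weighted rating stays to its own average; with few votes it drifts towards `C`.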

In [None]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [None]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

**Top Movies**

In [None]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.950236,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.645403,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.307194,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,"[Adventure, Fantasy, Action]",7.851924


In [None]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
gen_md.head(3).transpose()

Unnamed: 0,0,0.1,0.2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...","{'id': 10194, 'name': 'Toy Story Collection', ...","{'id': 10194, 'name': 'Toy Story Collection', ..."
budget,30000000,30000000,30000000
homepage,http://toystory.disney.com/toy-story,http://toystory.disney.com/toy-story,http://toystory.disney.com/toy-story
id,862,862,862
imdb_id,tt0114709,tt0114709,tt0114709
original_language,en,en,en
original_title,Toy Story,Toy Story,Toy Story
overview,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ..."
popularity,21.946943,21.946943,21.946943




In [None]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)

    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) &
                   (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')

    qualified['wr'] = qualified.apply(lambda x:
                        (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C),
                        axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)

    return qualified

**Top 15 Animation Movies**

In [None]:
build_chart('Animation').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
359,The Lion King,1994,5520,8,21.605761,7.909339
5481,Spirited Away,2001,3968,8,41.048867,7.875933
9698,Howl's Moving Castle,2004,2049,8,16.136048,7.772103
2884,Princess Mononoke,1997,2041,8,17.166725,7.771305
5833,My Neighbor Totoro,1988,1730,8,13.507299,7.735274
40251,Your Name.,2016,1030,8,34.461252,7.58982
5553,Grave of the Fireflies,1988,974,8,0.010902,7.570962
19901,Paperman,2012,734,8,7.198633,7.465676
39386,Piper,2016,487,8,11.243161,7.285132
20779,Wolf Children,2012,483,8,10.249498,7.281198


**5.1 Content based recommendation system**

In [None]:
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [None]:
# Pre-processing step: convert an ID to int, or NaN if it is not a valid number

def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan


In [None]:
md['id'] = md['id'].apply(convert_int)
md[md['id'].isnull()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[Carousel Productions, Vision View Entertainme...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,,,,,,,,,,NaT
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[Aniplex, GoHands, BROSTA TV, Mardock Scramble...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,,,,,,,,,,NaT
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[Odyssey Media, Pulser Productions, Rogue Stat...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,,,,,,,,,,NaT


In [None]:
md = md.drop([19730, 29503, 35587])

In [None]:
md['id'] = md['id'].astype('int')

In [None]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9099, 25)

Our small movies metadata dataset contains 9099 movies, about five times fewer than the roughly 45000 movies in the original dataset.

**Content based recommendation system : Using movie description and taglines**

Let us first try to build a recommender using movie descriptions and taglines.

We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [None]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [None]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0.0, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [None]:
tfidf_matrix.shape

(9099, 268124)

Since we have used the TF-IDF vectorizer, which L2-normalizes each document vector by default, calculating the dot product will directly give us the cosine similarity score.

Therefore, we will use sklearn's linear_kernel instead of cosine_similarity, since it skips the redundant normalization step and is faster.
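
To see why a plain dot product is enough, note that once two vectors are scaled to unit length, the denominator of the cosine formula becomes 1. The dependency-free sketch below checks this on two made-up weight vectors; the helper names (`dot`, `l2_normalise`, `cosine`) are illustrative and are not part of scikit-learn.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Full cosine similarity formula on unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up TF-IDF-like weight vectors (not taken from the dataset).
doc_a = [0.0, 2.0, 1.0, 3.0]
doc_b = [1.0, 0.5, 0.0, 2.0]

unit_a = l2_normalise(doc_a)
unit_b = l2_normalise(doc_b)

cos_raw = cosine(doc_a, doc_b)   # cosine formula on the raw vectors
dot_unit = dot(unit_a, unit_b)   # plain dot product on the unit vectors

print(cos_raw, dot_unit)
```

Both numbers agree up to floating-point rounding, which is the reason the dot product can stand in for cosine similarity on normalised TF-IDF rows.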

In [None]:
# http://scikit-learn.org/stable/modules/metrics.html#linear-kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim[0]
#cosine_sim.shape

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset.

The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])
#indices.head(2)

In [None]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]


* Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [None]:
get_recommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

In [None]:
get_recommendations('Superman II').head(10)

2114           Superman IV: The Quest for Peace
7718                          All Star Superman
6447                           Superman Returns
2113                               Superman III
2111                                   Superman
8227    Batman: The Dark Knight Returns, Part 2
8371       Justice League: Crisis on Two Earths
5045                         The Pick-up Artist
6984                                  Meet Dave
3724                                  Rock Star
Name: title, dtype: object

We see that for Superman II, our system is able to identify it as a Superman film and subsequently recommend other Superman films as its top recommendations.


But unfortunately, that is all this system can do at the moment.


This is not of much use to most people as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie.


Someone who liked The Dark Knight probably likes it more because of Nolan and would hate Batman Forever and every other substandard movie in the Batman Franchise.


Therefore, we are going to use much more suggestive metadata than Overview and Tagline.

In the next subsection, we will build a more sophisticated recommender that takes genre, keywords, cast and crew into consideration.

**Content based RS : Using movie description, taglines, keywords, cast, director and genres**

To build our standard metadata based content recommender, we will need to merge our current dataset with the credits and keywords datasets.

Let us prepare this data as our first step.

In [None]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [None]:
md.shape

(45463, 25)

In [None]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [None]:
smd = md[md['id'].isin(links_small)]
smd.shape

# smd = md[md['id'].isin(links_small['tmdbId'])]
# smd.shape

(9219, 28)

We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the following intuitions:


1. **Crew**: From the crew, we will only pick the director as our feature since the others don't contribute that much to the feel of the movie.


2. **Cast**: Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list.

In [None]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
smd['director'] = smd['crew'].apply(get_director)
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

Our approach to building the recommender is going to be extremely hacky.

I plan to create a metadata dump for every movie, consisting of its genres, director, main actors and keywords.

I then use a Count Vectorizer to create our count matrix.

The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return the movies that are most similar.

These are the steps I follow in preparing the genres and credits data:

1. Strip spaces and convert to lowercase in all our features. This way, our engine will not confuse Johnny Depp with Johnny Galecki.

2. Mention the director 3 times to give it more weight relative to the entire cast.
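
As a small illustration of these two steps, the sketch below shows how removing spaces turns each full name into a single token, and how repeating the director token (three times here, as in the code cell that follows) raises its count in the soup. The helper names `clean_name` and `build_soup`, the `director_weight` parameter and all input values other than the two actor names from the text are hypothetical.

```python
def clean_name(name):
    """Lowercase a name and remove spaces so it becomes a single token."""
    return name.replace(" ", "").lower()

def build_soup(keywords, cast, director, genres, director_weight=3):
    """Join keywords, top-3 cast, a repeated director token and genres."""
    tokens = (
        list(keywords)
        + [clean_name(actor) for actor in cast[:3]]
        + [clean_name(director)] * director_weight
        + list(genres)
    )
    return " ".join(tokens)

# Without cleaning, the two actors share the token "johnny".
shared_before = set("Johnny Depp".lower().split()) & set("Johnny Galecki".lower().split())
# After cleaning, each actor is one distinct token and nothing is shared.
shared_after = {clean_name("Johnny Depp")} & {clean_name("Johnny Galecki")}

# Made-up metadata for a single movie.
soup = build_soup(
    keywords=["pirate", "ship"],
    cast=["Johnny Depp", "Actor Two", "Actor Three", "Actor Four"],
    director="Some Director",
    genres=["Adventure", "Fantasy"],
)

print(shared_before, shared_after)
print(soup)
```

Because the count vectorizer counts token occurrences, a director token that appears three times contributes more to the similarity score than an actor token that appears once.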

In [None]:
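# Strip spaces and lowercase so that each name becomes a single token
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
# Repeat the director three times to give it more weight than any single actor
smd['director'] = smd['director'].apply(lambda x: [x, x, x])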

**Keywords**

We will do a small amount of pre-processing of our keywords before putting them to any use.

First, we calculate the frequency counts of every keyword that appears in the dataset.

In [None]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 610.

We do not have any use for keywords that occur only once.
Therefore, these can be safely removed.

Finally, we will convert every word to its stem so that words such as Dogs and Dog are considered the same.
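
The frequency filter described above can be sketched without pandas. The example below counts keywords across a few made-up movies with `collections.Counter` and keeps only those that occur more than once. The keyword lists are hypothetical, and stemming is left out to keep the sketch free of external dependencies.

```python
from collections import Counter

# Hypothetical keyword lists for four movies (not taken from the dataset).
movie_keywords = [
    ["independent film", "murder"],
    ["murder", "based on novel"],
    ["independent film", "time travel"],
    ["based on novel", "murder"],
]

# Count how often each keyword appears across all movies.
counts = Counter(kw for kws in movie_keywords for kw in kws)

# Keep only keywords that occur more than once.
frequent = {kw for kw, n in counts.items() if n > 1}

# Drop the rare keywords from every movie's list, preserving order.
filtered = [[kw for kw in kws if kw in frequent] for kws in movie_keywords]

print(counts.most_common())
print(filtered)
```

A keyword that appears in only one movie can never be shared with another movie, so removing it shrinks the vocabulary without losing any link between movies.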

In [None]:
s = s[s > 1]

In [None]:
# Just an example
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [None]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [None]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [None]:
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0.0, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the get_recommendations function that we had written earlier.

Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results.

Let us check The Prestige and see what recommendations I get this time around.

In [None]:
get_recommendations('The Prestige').head(10)

6981          The Dark Knight
3381                  Memento
4145                 Insomnia
2085                Following
7648                Inception
8031    The Dark Knight Rises
6218            Batman Begins
8613             Interstellar
4553                   Spider
3724               Bad Timing
Name: title, dtype: object

I am much more satisfied with the results I get this time around.

The recommendations seem to have recognized other Christopher Nolan movies (due to the high weightage given to director) and put them as top recommendations.

I enjoyed watching The Prestige as well as some of the other ones in the list, including The Dark Knight, Batman Begins and The Dark Knight Rises.

**Improvement**

We can of course experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

get_recommendations('Insomnia').head(10)

In [None]:
get_recommendations('The DUFF').head(10)

7377                    I Love You, Beth Cooper
3712                       The Princess Diaries
5207                                 Mean Girls
6698                      It's a Boy Girl Thing
2181                               American Pie
7494    American Pie Presents: The Book of Love
7134         High School Musical 3: Senior Year
8440                        The Spectacular Now
8906                                Paper Towns
5163                       Just One of the Guys
Name: title, dtype: object

In [None]:
get_recommendations('Pulp Fiction').head(10)

1381            Jackie Brown
8905       The Hateful Eight
5200       Kill Bill: Vol. 2
898           Reservoir Dogs
4903       Kill Bill: Vol. 1
7280    Inglourious Basterds
6788             Death Proof
8310        Django Unchained
4595                   Basic
4764                S.W.A.T.
Name: title, dtype: object

**Add Popularity and Ratings**

One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that Batman and Robin shares many characters with The Dark Knight, but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

I will take the top 25 movies based on similarity scores and calculate the vote count of the 60th percentile movie. Then, using this as the value of m, we will calculate the weighted rating of each movie using IMDB's formula, as we did for the top movies chart earlier.

In [None]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) &
                       (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
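    # Note: weighted_rating reads the global m and C, so the local m above
    # only affects the vote-count cutoff and the local C is unused.
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)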
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [None]:
improved_recommendations('Batman Begins')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.917588
6981,The Dark Knight,12269,8,2008,7.905871
8613,Interstellar,11187,8,2014,7.897107
6623,The Prestige,4510,8,2006,7.758148
3381,Memento,4168,8,2000,7.740175
8031,The Dark Knight Rises,9263,7,2012,6.921448
1134,Batman Returns,1706,6,1992,5.846862
8725,Teenage Mutant Ninja Turtles,2677,5,2014,5.034164
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.013943
1260,Batman & Robin,1447,4,1997,4.287233


In [None]:
improved_recommendations('Jackie Brown')

Unnamed: 0,title,vote_count,vote_average,year,wr
266,Pulp Fiction,8670,8,1994,7.86866
898,Reservoir Dogs,3821,8,1992,7.718986
8310,Django Unchained,10297,7,2012,6.929017
7280,Inglourious Basterds,6598,7,2009,6.891679
4903,Kill Bill: Vol. 1,5091,7,2003,6.862133
8905,The Hateful Eight,4405,7,2015,6.842588
5200,Kill Bill: Vol. 2,4061,7,2004,6.830542
8465,We're the Millers,3053,6,2013,5.906018
6788,Death Proof,1359,6,2007,5.817225
4764,S.W.A.T.,780,5,2003,5.08755


Unfortunately, Batman and Robin does not disappear from our recommendation list.

This is probably because it is rated a 4, which is only slightly below average on TMDB.

It certainly doesn't deserve a 4 when amazing movies like The Dark Knight Rises have only a 7.

However, there is not much we can do about this. Therefore, we will conclude our Content Based Recommender section here.

**CF based recommendation system**

**Our content based engine suffers from some severe limitations.**

It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who (s)he is.

Therefore, in this section, we will use Collaborative Filtering to make recommendations to Movie Watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used/experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the Surprise library, which uses extremely powerful algorithms like Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) and give great recommendations.

In [None]:
# surprise reader API to read the dataset
!pip install surprise
from surprise import Reader
reader = Reader()



In [None]:
from surprise.model_selection import train_test_split

# Load your data as a DatasetAutoFolds object
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Split the data into a trainset and a testset
trainset, testset = train_test_split(data, test_size=0.2)  # You can adjust the test_size as needed


In [None]:
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Define the Reader object
reader = Reader(rating_scale=(1, 5))

# Load your data as a Dataset object
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Initialize the SVD algorithm
svd = SVD()

# Perform cross-validation
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Print the results
for measure in ['test_rmse', 'test_mae']:
    print(f'{measure}: {results[measure].mean()}')


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8961  0.9006  0.8933  0.8979  0.8991  0.8974  0.0025  
MAE (testset)     0.6885  0.6944  0.6881  0.6906  0.6933  0.6910  0.0025  
Fit time          1.60    1.70    1.49    1.72    2.47    1.80    0.35    
Test time         0.13    0.13    0.14    0.14    0.22    0.15    0.04    
test_rmse: 0.8973730626481708
test_mae: 0.6909843034021865


In [None]:
ratings[ratings['userId'] == 2]

Unnamed: 0,userId,movieId,rating,timestamp
20,2,10,4.0,835355493
21,2,17,5.0,835355681
22,2,39,5.0,835355604
23,2,47,4.0,835355552
24,2,50,4.0,835355586
...,...,...,...,...
91,2,592,5.0,835355395
92,2,593,3.0,835355511
93,2,616,3.0,835355932
94,2,661,4.0,835356141


In [None]:
svd.predict(1, 302)

Prediction(uid=1, iid=302, r_ui=None, est=2.8137772590462817, details={'was_impossible': False})

For the movie with ID 302, we get an estimated rating of about 2.81 for user 1 (the exact value changes from run to run because the SVD factors are initialised randomly). One notable feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and predicts ratings from how other users have rated the movie.

**Hybrid recommendation system**

In this section, we will try to build a simple hybrid recommender that brings together the techniques we have implemented in the content-based and collaborative-filtering engines. This is how it will work:

Input: User ID and the Title of a Movie

Output: Similar movies sorted on the basis of expected ratings by that particular user.
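
The input/output contract above can be sketched on toy data as follows: first shortlist by content similarity, then re-rank the shortlist by the user's predicted rating. Everything here (`hybrid_rerank`, `toy_similarities`, `toy_estimates`, `toy_predict`) is hypothetical and stands in for the real similarity matrix and SVD model used by this notebook.

```python
def hybrid_rerank(similarities, predict_fn, user_id, n_candidates=3, n_final=2):
    """Shortlist the most similar titles, then sort them by predicted rating.

    similarities: {title: content similarity to the seed movie}
    predict_fn:   callable (user_id, title) -> estimated rating
    """
    # Step 1 (content-based): keep the n_candidates most similar titles
    candidates = sorted(similarities, key=similarities.get, reverse=True)[:n_candidates]
    # Step 2 (collaborative): estimate this user's rating for each candidate
    scored = [(title, predict_fn(user_id, title)) for title in candidates]
    # Step 3: return the candidates with the highest estimated ratings
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n_final]

# Hypothetical similarity scores to some seed movie
toy_similarities = {"Aliens": 0.9, "The Abyss": 0.8, "True Lies": 0.7, "Titanic": 0.1}
# Hypothetical per-user rating estimates (stand-in for a trained model)
toy_estimates = {"Aliens": 3.1, "The Abyss": 3.6, "True Lies": 2.9, "Titanic": 4.8}

def toy_predict(user_id, title):
    return toy_estimates[title]

toy_result = hybrid_rerank(toy_similarities, toy_predict, user_id=1)
print(toy_result)  # [('The Abyss', 3.6), ('Aliens', 3.1)]
```

"Titanic" has the highest estimated rating but never makes the list, because it is filtered out in the similarity step. That is the design choice of this kind of hybrid: content similarity decides what is eligible, and the collaborative model only decides the order.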

In [None]:
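def convert_int(x):
    # Return x as an int, or NaN when it is missing or not numeric
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return np.nan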

In [None]:
# Map MovieLens movieId to TMDB id, keeping only movies that are present in smd
id_map = pd.read_csv('links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [None]:
indices_map = id_map.set_index('id')

In [None]:
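def hybrid(userId, title):
    # Row position of the seed movie in the cosine similarity matrix
    idx = indices[title]
    # Rank all movies by content similarity and keep the top 25, skipping the first hit (the seed itself)
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    # Copy so that adding the 'est' column does not write into a slice of smd
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'release_date', 'id']].copy()
    # Predict this user's rating for each candidate (TMDB id -> MovieLens movieId via indices_map)
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    # Return the 10 candidates with the highest predicted rating
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)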

In [None]:
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,release_date,id,est
522,Terminator 2: Judgment Day,4274.0,7.7,1991-07-01,280,3.438859
2014,Fantastic Planet,140.0,7.6,1973-05-01,16306,3.201594
922,The Abyss,822.0,7.1,1989-08-09,2756,3.167948
1011,The Terminator,4208.0,7.4,1984-10-26,218,3.137461
974,Aliens,3282.0,7.7,1986-07-18,679,3.084472
8401,Star Trek Into Darkness,4479.0,7.4,2013-05-05,54138,2.978432
8658,X-Men: Days of Future Past,6155.0,7.5,2014-05-15,127585,2.945409
1668,Return from Witch Mountain,38.0,5.6,1978-03-10,14822,2.921771
3060,Sinbad and the Eye of the Tiger,39.0,6.3,1977-07-15,11940,2.912727
344,True Lies,1138.0,6.8,1994-07-14,36955,2.860703


In [None]:
hybrid(1000, 'Replicant')

Unnamed: 0,title,vote_count,vote_average,release_date,id,est
8864,Mad Max: Fury Road,9629.0,7.3,2015-05-13,76341,3.80581
1131,Star Trek II: The Wrath of Khan,688.0,7.3,1982-06-03,154,3.738947
1816,Six-String Samurai,36.0,5.8,1998-09-18,24746,3.720884
9005,Independence Day: Resurgence,2550.0,4.9,2016-06-22,47933,3.653534
6576,The Covenant,295.0,5.2,2006-09-08,9954,3.582982
5931,Darkman II: The Return of Durant,44.0,4.5,1995-07-11,18998,3.532014
3650,Kickboxer,257.0,6.3,1989-04-20,10222,3.529719
7508,Blood: The Last Vampire,94.0,5.1,2009-04-02,1450,3.481354
817,Maximum Risk,104.0,5.3,1996-09-13,10861,3.476536
4017,Hawk the Slayer,13.0,4.5,1980-08-27,25628,3.47514


In [None]:
hybrid(3423, "The Terminator")

Unnamed: 0,title,vote_count,vote_average,release_date,id,est
522,Terminator 2: Judgment Day,4274.0,7.7,1991-07-01,280,3.978033
974,Aliens,3282.0,7.7,1986-07-18,679,3.962344
7502,The Book of Eli,2207.0,6.6,2010-01-14,20504,3.836348
922,The Abyss,822.0,7.1,1989-08-09,2756,3.674278
7488,Avatar,12114.0,7.2,2009-12-10,19995,3.673137
6394,District B13,572.0,6.5,2004-11-09,10045,3.663491
6622,Children of Men,2120.0,7.4,2006-09-22,9693,3.630543
2412,RoboCop,1494.0,7.1,1987-07-17,5548,3.600806
7296,Terminator Salvation,2496.0,5.9,2009-05-20,534,3.569167
5296,Zardoz,106.0,5.8,1974-02-06,4923,3.569036
