# CF (SVD) movie recommendation system

```
1. Import our dependencies
2. Load dataset
3. Understand dataset
4. Data Wrangling (Already done)
5. Pre-processing steps (We will perform the pre-processing steps as and when needed)
6. Build recommendation system
    6.1 Simple recommendation system
        6.1.1 Implement model
        6.1.2 Evaluate Result
    6.2 CF based recommendation system
        6.2.1 Implement model
        6.2.2 Evaluate Result
```   

## 1. Import libraries

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import ast 
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD #, evaluate

import warnings; warnings.simplefilter('ignore')

## 2. Load dataset

We use the MovieLens datasets.

**The Full Dataset:** Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.

**The Small Dataset:** Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

We will build our Simple Recommender using movies from the Full Dataset.

In [2]:
credits = pd.read_csv('./input_data/movie_dataset/credits.csv')
keywords = pd.read_csv('./input_data/movie_dataset/keywords.csv')
links_small = pd.read_csv('./input_data/movie_dataset/links_small.csv')
md = pd.read_csv('./input_data/movie_dataset/movies_metadata.csv')
ratings = pd.read_csv('./input_data/movie_dataset/ratings_small.csv')

## 3. Understand dataset

#### Credits dataframe

In [3]:
credits.head()
#credits.iloc[0:3]
#credits['cast'].iloc[0:3]
#credits.iloc[:,0:2]

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [4]:
credits.columns

Index(['cast', 'crew', 'id'], dtype='object')

* **cast:** Information about the cast: the name and gender of each actor and the name of the character they play in the movie
* **crew:** Information about crew members, such as who directed or edited the movie
* **id:** The movie ID assigned by TMDb

In [5]:
credits.shape

(45476, 3)

In [6]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
cast    45476 non-null object
crew    45476 non-null object
id      45476 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


#### Keywords dataframe

In [7]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [8]:
keywords.columns

Index(['id', 'keywords'], dtype='object')

* **id:** The movie ID assigned by TMDb
* **keywords:** A list of tags/keywords for the movie

In [9]:
keywords.shape

(46419, 2)

In [10]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
id          46419 non-null int64
keywords    46419 non-null object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


#### Link dataframe

In [11]:
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [12]:
links_small.columns

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

* **movieId:** Serial number of the movie
* **imdbId:** ID of the movie on the IMDb platform
* **tmdbId:** ID of the movie on the TMDb platform

In [13]:
links_small.shape

(9125, 3)

In [14]:
links_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
movieId    9125 non-null int64
imdbId     9125 non-null int64
tmdbId     9112 non-null float64
dtypes: float64(1), int64(2)
memory usage: 214.0 KB


#### Metadata dataframe

In [15]:
md.iloc[0:3].transpose()

Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [16]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

**Features**

* **adult:** Indicates if the movie is X-Rated or Adult.
* **belongs_to_collection:** A stringified dictionary that gives information on the movie series the particular film belongs to.
* **budget:** The budget of the movie in dollars.
* **genres:** A stringified list of dictionaries that list out all the genres associated with the movie.
* **homepage:** The Official Homepage of the movie.
* **id:** The ID of the movie.
* **imdb_id:** The IMDB ID of the movie.
* **original_language:** The language in which the movie was originally shot.
* **original_title:** The original title of the movie.
* **overview:** A brief blurb of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **poster_path:** The URL of the poster image.
* **production_companies:** A stringified list of production companies involved with the making of the movie.
* **production_countries:** A stringified list of countries where the movie was shot/produced in.
* **release_date:** Theatrical Release Date of the movie.
* **revenue:** The total revenue of the movie in dollars.
* **runtime:** The runtime of the movie in minutes.
* **spoken_languages:** A stringified list of spoken languages in the film.
* **status:** The status of the movie (Released, To Be Released, Announced, etc.)
* **tagline:** The tagline of the movie.
* **title:** The Official Title of the movie.
* **video:** Indicates if there is a video present of the movie with TMDB.
* **vote_average:** The average rating of the movie.
* **vote_count:** The number of votes by users, as counted by TMDB.

In [17]:
md.shape

(45466, 24)

In [18]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

#### Ratings dataframe

In [19]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [20]:
ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

* **userId:** ID of the user
* **movieId:** MovieLens movie ID (the `movieId` column of the links dataframe maps it to the IMDb and TMDb IDs)
* **rating:** Rating given to the movie by the user
* **timestamp:** Time at which the user gave the rating

In [21]:
ratings.shape

(100004, 4)

In [22]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
userId       100004 non-null int64
movieId      100004 non-null int64
rating       100004 non-null float64
timestamp    100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


## 4. Data Wrangling

* The Movie database [TMDb](https://www.themoviedb.org/movie/269149-zootopia?language=en)
* The data has already been converted from JSON to CSV format

## 5. Pre-processing 

* We will perform pre-processing steps as and when they are needed.

## 6. Build recommendation system

### 6.1. Simple recommendation system

**Approach:**

* The Simple Recommender offers __generalized recommendations__ to every user __based on movie popularity and (sometimes) genre__. 

* The __basic idea__ behind this recommender is that __movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.__ 

* This model __does not give personalized recommendations__ based on the user.



**What we are actually doing:**

* The implementation of this model is straightforward.
* All we have to do is __sort our movies based on ratings and popularity__ and display the top movies of our list.
* As an added step, we can __pass in a genre argument to get the top movies of a particular genre.__

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [34]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(
    lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
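
To make the parsing step above concrete, here is a minimal sketch of what happens to a single cell. The raw string and the helper name `parse_genres` are hypothetical and only illustrate the shape of the `genres` column.

```python
from ast import literal_eval

# Hypothetical raw value, shaped like one cell of the `genres` column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def parse_genres(value):
    """Turn a stringified list of dicts into a plain list of genre names."""
    parsed = literal_eval(value)
    # Anything that is not a list (for example a stray number) yields no genres
    return [d['name'] for d in parsed] if isinstance(parsed, list) else []

genre_names = parse_genres(raw)
empty = parse_genres('[]')
print(genre_names)
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of stringified data.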

* I use the TMDB ratings to build our Top Movies chart.
* I will use IMDB's weighted rating formula to construct the chart.
* Mathematically, it is represented as follows:



$\large \text{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$
```
where,
    v is the number of votes for the movie
    m is the minimum votes required to be listed in the chart
    R is the average rating of the movie
    C is the mean vote across the whole report
```
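
The formula is a weighted blend of a movie's own average `R` and the global mean `C`, where the weight shifts towards `R` as the vote count `v` grows. The sketch below illustrates this with made-up numbers; the values of `m`, `C`, `v` and `R` are assumptions chosen for illustration, not values from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blend a movie's own mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (not taken from the dataset)
m, C = 100, 5.0
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)       # pulled strongly towards C
equal_votes = weighted_rating(v=100, R=9.0, m=m, C=C)    # exactly halfway between R and C
many_votes = weighted_rating(v=10000, R=9.0, m=m, C=C)   # close to R
print(few_votes, equal_votes, many_votes)
```

A movie with few votes therefore cannot reach the top of the chart on the strength of a handful of very high ratings.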

In [35]:
# this is V
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

# this is R
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

# this is C
C = vote_averages.mean()
C

5.244896612406511

* Next, we need to determine an appropriate value for `m`, the minimum number of votes required to be listed in the chart.

* We will use the 95th percentile as our cutoff. In other words, for a movie to feature in the chart, its vote count must be at least the 95th percentile of vote counts across all movies in the list.
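
As a small illustration of how a percentile cutoff behaves, the sketch below applies `Series.quantile` to toy vote counts. The series `toy_counts` is hypothetical and stands in for the real `vote_counts`.

```python
import pandas as pd

# Toy vote counts 1..100 (illustrative, not the real data)
toy_counts = pd.Series(range(1, 101))

# 95th percentile, using pandas' default linear interpolation
m_toy = toy_counts.quantile(0.95)

# Movies that would qualify for the chart under this cutoff
qualifying = toy_counts[toy_counts >= m_toy]
print(m_toy, len(qualifying))
```

With 100 toy movies, only the five with the highest vote counts pass the cutoff.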



In [36]:
m = vote_counts.quantile(0.95)
m

434.0

In [37]:
# Pre-processing step: extract the year from the release date by splitting on '-'
# (pd.notnull is used because a comparison such as `x != np.nan` is always True)

md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(
    lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
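
An alternative way to extract the year is the vectorized `.dt.year` accessor. The sketch below uses hypothetical dates; note that, unlike the cell above, it yields floating point years with `NaN` for unparseable entries rather than strings.

```python
import pandas as pd

# Hypothetical release dates, including one malformed and one missing entry
dates = pd.Series(['1995-10-30', 'not a date', None])

# Unparseable entries become NaT, and .dt.year turns NaT into NaN
years = pd.to_datetime(dates, errors='coerce').dt.year
print(years.tolist())
```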

In [38]:
qualified = md[(md['vote_count'] >= m) & 
               (md['vote_count'].notnull()) & 
               (md['vote_average'].notnull())][['title', 
                                                'year', 
                                                'vote_count', 
                                                'vote_average', 
                                                'popularity', 
                                                'genres']]

qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

* Therefore, to qualify for the chart, a movie must have at least __434 votes__ on TMDB.
* We also see that the __average rating__ for __a movie on TMDB__ is about __5.24 on a scale of 10__.
* Only __2274 movies__ qualify for our chart.

In [39]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [40]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [41]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

**Top Movies**

In [42]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.4235,"[Adventure, Fantasy, Action]",7.851924
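
As a sanity check, the `wr` value of the first row can be reproduced by hand from the values of `m` and `C` reported earlier in the notebook and the vote count and (integer) vote average shown for Inception.

```python
# Values reported earlier in the notebook
m = 434.0                 # 95th percentile of vote counts
C = 5.244896612406511     # mean vote across all movies

# First row of the chart: Inception
v, R = 14075, 8

wr = (v / (v + m)) * R + (m / (v + m)) * C
print(round(wr, 6))
```

The result agrees with the `wr` column of the table to the displayed precision.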


* We see that three Christopher Nolan Films, __Inception__, __The Dark Knight__ and __Interstellar__ occur at the very top of our chart. 
* The chart also indicates a strong bias of TMDB Users towards particular genres and directors.

* Let us now construct our __function that builds charts for particular genres.__

* For this, we __relax__ our __default cutoff to the 85th percentile instead of the 95th.__

In [43]:
'''
>>> s
     a   b
one  1.  2.
two  3.  4.

>>> s.stack()
one a    1
    b    2
two a    3
    b    4
'''
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
gen_md.head(3).transpose()

Unnamed: 0,0,0.1,0.2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...","{'id': 10194, 'name': 'Toy Story Collection', ...","{'id': 10194, 'name': 'Toy Story Collection', ..."
budget,30000000,30000000,30000000
homepage,http://toystory.disney.com/toy-story,http://toystory.disney.com/toy-story,http://toystory.disney.com/toy-story
id,862,862,862
imdb_id,tt0114709,tt0114709,tt0114709
original_language,en,en,en
original_title,Toy Story,Toy Story,Toy Story
overview,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ..."
popularity,21.9469,21.9469,21.9469
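
The same "one row per movie and genre" reshaping can also be sketched with `DataFrame.explode`, available in pandas 0.25 and later. The toy frame below is hypothetical; it is an alternative to the `stack`-based approach above, not the method used in this notebook.

```python
import pandas as pd

# Toy frame with a list-valued `genres` column (hypothetical rows)
toy = pd.DataFrame({
    'title': ['Toy Story', 'Jumanji'],
    'genres': [['Animation', 'Comedy'], ['Adventure']],
})

# One row per (movie, genre) pair; the original index is repeated
toy_gen = toy.explode('genres').rename(columns={'genres': 'genre'})
print(toy_gen)
```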


In [44]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & 
                   (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: 
                        (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C),
                        axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the __Top 15 Romance Movies__ (Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).


**Top 15 Romantic Movies**

In [45]:
build_chart('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,8.565285
351,Forrest Gump,1994,8147,8,48.3072,7.971357
876,Vertigo,1958,1162,8,18.2082,7.811667
40251,Your Name.,2016,1030,8,34.461252,7.789489
883,Some Like It Hot,1959,835,8,11.8451,7.745154
1132,Cinema Paradiso,1988,834,8,14.177,7.744878
19901,Paperman,2012,734,8,7.19863,7.713951
37863,Sing Street,2016,669,8,10.672862,7.689483
882,The Apartment,1960,498,8,11.9943,7.599317
38718,The Handmaiden,2016,453,8,16.727405,7.566166


### 6.2. CF based recommendation system

**Our content based engine suffers from some severe limitations.**

* It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

* Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who (s)he is.

* Therefore, in this section, we will use Collaborative Filtering to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

* I will not be implementing Collaborative Filtering from scratch. Instead, I will use the __[Surprise library](http://surpriselib.com/)__, which provides powerful algorithms such as __Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) and give great recommendations__.

* The implementation of SVD in the Surprise library is available at this [link](https://github.com/NicolasHug/Surprise/blob/master/surprise/prediction_algorithms/matrix_factorization.pyx)

In [6]:
from surprise.model_selection import cross_validate
from surprise import SVD, Dataset

In [7]:
# Surprise Reader API to read the dataset
# Note: Reader() defaults to rating_scale=(1, 5); MovieLens ratings start at 0.5
reader = Reader()

In [8]:
# An example is available here: https://surprise.readthedocs.io/en/stable/getting_started.html#getting-started

In [9]:
# < this is the part that did not work: >
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
# data.split.KFold(n_folds=5)
#evaluate(svd, data, measures=['RMSE', 'MAE'])   

In [11]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8937  0.9045  0.8973  0.8961  0.8970  0.8977  0.0036  
MAE (testset)     0.6893  0.6957  0.6924  0.6895  0.6900  0.6914  0.0024  
Fit time          4.86    5.02    4.89    4.95    5.79    5.10    0.35    
Test time         0.23    0.25    0.18    0.26    0.18    0.22    0.03    


{'test_rmse': array([0.89366035, 0.90450731, 0.89732041, 0.89609743, 0.89702498]),
 'test_mae': array([0.68931768, 0.69566012, 0.69242544, 0.6894626 , 0.69003179]),
 'fit_time': (4.863731384277344,
  5.015045881271362,
  4.8904709815979,
  4.95223069190979,
  5.786862850189209),
 'test_time': (0.23265314102172852,
  0.25200486183166504,
  0.1753230094909668,
  0.2562243938446045,
  0.18491411209106445)}
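
For reference, the two reported metrics can be computed by hand as follows. The ratings and predictions below are hypothetical and only illustrate the definitions of RMSE and MAE.

```python
import math

# Hypothetical true ratings and model predictions for five (user, movie) pairs
actual = [4.0, 3.0, 5.0, 2.5, 4.5]
predicted = [3.5, 3.0, 4.0, 3.0, 4.0]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE penalises large errors more heavily than MAE
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)
print(round(rmse, 4), round(mae, 4))
```

Both metrics are in rating units, so an RMSE of roughly 0.9 in the table above means predictions are typically off by a little under one star.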

In [13]:
trainset = data.build_full_trainset()
# svd.train(trainset)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fe5ff140410>

In [14]:
ratings[ratings['userId'] == 1][:5]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [94]:
np.mean(ratings[(ratings['movieId'] == 322) & (ratings['movieId'] > 0)]['rating'].values)
ratings[(ratings['movieId'] == 322) & (ratings['movieId'] > 0)]

Unnamed: 0,userId,movieId,rating,timestamp
1045,15,322,2.5,1120209787
10286,73,322,4.5,1311319243
23217,165,322,4.5,1111609877
23786,168,322,4.0,848879585
35481,254,322,3.0,845157913
44400,312,322,3.0,959933985
49653,363,322,2.0,938890637
51179,380,322,4.0,1055871639
53408,387,322,5.0,974674960
53736,388,322,4.0,946528365


In [96]:
user_ids = [363, 608, 168, 387]

# svd.predict takes the user id first, then the item (movie) id
for uid in user_ids:
    print(svd.predict(uid, 1))

user: 363        item: 1          r_ui = None   est = 4.62   {'was_impossible': False}
user: 608        item: 1          r_ui = None   est = 4.10   {'was_impossible': False}
user: 168        item: 1          r_ui = None   est = 3.80   {'was_impossible': False}
user: 387        item: 1          r_ui = None   est = 4.14   {'was_impossible': False}


* For the movie with ID 1, the four users above get estimated ratings between 3.80 and 4.62. One notable feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have perceived the movie.
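
To show why only IDs are needed, here is a minimal sketch of how an SVD-style estimate is assembled: a global mean, a user bias, an item bias, and the dot product of the user and item factor vectors. All numbers are hypothetical. In a fitted Surprise model the corresponding quantities would come from `trainset.global_mean`, `svd.bu`, `svd.bi`, `svd.pu` and `svd.qi`, and the rating scale of 1 to 5 is an assumption.

```python
import numpy as np

# Toy parameters (hypothetical values, not learned from the data)
global_mean = 3.5                    # mu: mean of all training ratings
b_u = 0.2                            # bias of one user
b_i = 0.4                            # bias of one item
p_u = np.array([0.1, -0.3, 0.5])     # latent factors of the user
q_i = np.array([0.2, 0.1, 0.4])      # latent factors of the item

# est = mu + b_u + b_i + q_i . p_u
est = global_mean + b_u + b_i + float(q_i @ p_u)

# Clip to the rating scale, assumed here to be 1 to 5
est = min(5.0, max(1.0, est))
print(round(est, 2))
```

Nothing in this computation refers to the movie's title, genre or cast; everything is learned from the rating matrix alone.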

### Top n movies for given movie based on SVD matrices

In [21]:
import sklearn

In [22]:
user_matrix = svd.pu
movies_matrix = svd.qi

In [93]:
# def get_top_similar_movies_cf(movieId):
movies_similarities = sklearn.metrics.pairwise.cosine_similarity(movies_matrix)

title = 'Toy Story'
title_idx = indices.to_dict()
idx_title = dict([[v,k] for k,v in title_idx.items()])

idx = title_idx[title]
print(idx)
sim_scores = list(enumerate(movies_similarities[int(idx)]))
similarities = sorted(sim_scores, key=lambda x: x[1], reverse=True)
similarities[:10]

[(idx_title[x], y) for (x, y) in similarities[:10]]

0


[('Toy Story', 0.9999999999999998),
 ('Son of Rambow', 0.3861924955353094),
 ('Angels and Insects', 0.3406517895500779),
 ('Air America', 0.33922604719373073),
 ('Hans Christian Andersen', 0.3373456959071658),
 ("Don't Be a Menace to South Central While Drinking Your Juice in the Hood",
  0.3364723991194337),
 ('The Magnificent Seven', 0.33593056174179525),
 ('Kung Fu Hustle', 0.33174882428323554),
 ("Gregory's Girl", 0.32894314423927856),
 ('Waiting...', 0.3262511892499058)]
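
Because the rows of the item factor matrix are ordered by Surprise's inner item ids rather than by raw `movieId`, a lookup of similar movies needs a translation step in both directions. The sketch below shows the idea on a toy factor matrix; `item_factors`, `inner_to_raw` and `most_similar` are hypothetical names, and with a fitted model the mapping would come from `trainset.to_inner_iid` and `trainset.to_raw_iid`.

```python
import numpy as np

# Toy item-factor matrix: one row per inner item id (stand-in for svd.qi)
item_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

# Hypothetical inner id -> raw movieId mapping (Surprise: trainset.to_raw_iid)
inner_to_raw = {0: 31, 1: 1029, 2: 1061, 3: 1129}
raw_to_inner = {raw: inner for inner, raw in inner_to_raw.items()}

def most_similar(raw_movie_id, n=2):
    """Return the n raw movieIds whose factor vectors are closest by cosine similarity."""
    unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
    sims = unit @ unit[raw_to_inner[raw_movie_id]]
    order = np.argsort(-sims)
    # Skip the query movie itself, then translate inner ids back to raw ids
    return [inner_to_raw[int(i)] for i in order if inner_to_raw[int(i)] != raw_movie_id][:n]

neighbours = most_similar(31)
print(neighbours)
```

The raw ids returned this way can then be joined against the links and metadata dataframes to recover titles.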