**What is a Recommender System**

- A system that predicts ratings or preferences a user might give to an item
- Often these are sorted and presented as "top-N" recommendations
- Also known as recommender engines, recommendation systems, recommendation platforms

**Examples**

- Amazon
- Netflix
- Amazon Prime
- Google Search
- YouTube

**Understanding users is important**

- *Asking for feedback (Explicit)*

> Pros:  
>- More data

> Cons:  
>- Low-quality data
>- Random data
>- Not standardised

- *Implicit Data*

> 1. Clicked


>> Pros:  
>> - Huge amount of data

>> Cons:
>>- Low-quality data
>>- Not an exact reflection of preference

> 2. Purchased


>> Pros:  
>> - Very high-quality data
>> - Exact reflection of preference

>> Cons:
>>- Less data

> 3. Consumed


>> Pros:  
>> - Good amount of data
>> - Good reflection of preference

>> Cons:
>>- Somewhat lower-quality data


**Different Recommendation Systems**

- Recommending Things -- Amazon
- Recommending Content -- YouTube, Netflix
- Recommending Music -- Raaga, Saavn
- Recommending People -- Facebook
- Recommending Search Results -- Google

**Top N Recommendations**

- Recommendations are ranked
- Only the top N recommendations are shown
- Examples: Amazon and Netflix
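
As a minimal sketch of the idea, assuming we already have a predicted score per item for one user, the top N can be picked by ranking the scores and keeping the first N. The item names, scores and the `top_n` helper below are illustrative and not part of the dataset used later.

```python
import heapq

# Hypothetical predicted scores for one user (illustrative values only)
predicted_scores = {
    "Movie A": 4.6,
    "Movie B": 3.1,
    "Movie C": 4.9,
    "Movie D": 2.4,
    "Movie E": 4.2,
}

def top_n(scores, n):
    """Return the n highest-scoring (item, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda pair: pair[1])

print(top_n(predicted_scores, 3))
# [('Movie C', 4.9), ('Movie A', 4.6), ('Movie E', 4.2)]
```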


**Evaluating Recommender Systems**

- Train - Test Model

> Full data set (movie ratings, etc.)

>> Training set  --> Machine learning  --> Predictions

>> Test set --> Predictions --> Measure Accuracy


- K-fold cross validation

> Full data set (movie ratings, etc.)

>> fold 1  --> machine learning  --> measure accuracy  --> take average

>> fold 2  --> machine learning  --> measure accuracy  --> take average

>> fold k-1  --> machine learning  --> measure accuracy  --> take average

>> test set --> measure accuracy  --> take average
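
The k-fold flow above can be sketched in plain Python. This is only an illustration under simple assumptions: the folds are contiguous (real pipelines usually shuffle first), the "model" is a baseline that always predicts the mean training rating, and accuracy is measured with MAE. The helper names `k_fold_indices` and `cross_validate_mean` and the toy ratings are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate_mean(ratings, k):
    """Average the per-fold MAE of a 'predict the training mean' baseline."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ratings), k):
        train_mean = sum(ratings[i] for i in train_idx) / len(train_idx)
        mae = sum(abs(ratings[i] - train_mean) for i in test_idx) / len(test_idx)
        fold_errors.append(mae)
    # "take average" step: one score summarising all folds
    return sum(fold_errors) / len(fold_errors)

toy_ratings = [5, 3, 4, 5, 2, 1, 4, 3, 5, 2]  # illustrative data
print(cross_validate_mean(toy_ratings, k=5))
```

Each rating is used for testing exactly once and for training in the other k-1 rounds, which is why the averaged score is less sensitive to one lucky or unlucky split.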


**Measuring Accuracy**

- Mean Absolute Error (MAE)

$$MAE=\sum_{i=1}^n \frac{ |y_{i}-x_i|}{n}$$

MAE - mean absolute error  
$y_i$ - prediction  
$x_i$ - true value  
n   - total number of data points  


| Predicted (y) | Actual (x) | Error |
| --- | --- | --- |
| 5| 4 | 1 |
| 3| 5 | -2 |
| 4| 4 | 0 |
| 5| 3 | 2 |
| 2| 3 | -1 |
| 1| 2 | -1 |

Mean Absolute Error = 7/6 ≈ 1.167

- Root Mean Square Error/Deviation (RMSE/RMSD)

$$RMSD = \sqrt{\sum_{i=1}^N \frac{ (x_{i}-\hat x_i)^2}{N}}$$

RMSD - root-mean-square deviation  
i    - index of the data point  
N    - number of non-missing data points  
$x_{i}$ - actual (observed) value  
$\hat x_i$ - estimated value

| Estimated ($\hat x$) | Actual ($x$) | Error |
| --- | --- | --- |
| 5| 4 | 1 |
| 3| 5 | -2 |
| 4| 4 | 0 |
| 5| 3 | 2 |
| 2| 3 | -1 |
| 1| 2 | -1 |

Root Mean Square Error/Deviation = 1.354006
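
Both figures can be checked with a few lines of plain Python. This sketch uses the six predicted/actual pairs from the tables above; the helper names `mae` and `rmse` are illustrative.

```python
import math

# Values taken from the tables above
predicted = [5, 3, 4, 5, 2, 1]
actual = [4, 5, 4, 3, 3, 2]

def mae(pred, true):
    """Mean of the absolute errors."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred, true):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

print(round(mae(predicted, actual), 3))   # 1.167
print(round(rmse(predicted, actual), 3))  # 1.354
```

RMSE comes out larger than MAE here because squaring penalises the two errors of size 2 more heavily than the errors of size 1.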

**Top N Movies**

- Recommend only popular movies
- Popularity can be restricted to a particular condition (for example, a genre)
- Require a minimum number of votes

## Importing Libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
import warnings; warnings.simplefilter('ignore')

## Top N Recommendations

In [None]:
md=pd.read_csv('Dataset/movies_metadata.csv')
md.head()

In [None]:
# preprocessing
md['genres']=md['genres'].fillna('[]')
md.head(100)

In [None]:
# Eval Example
list1='[1,2,3,4,5]'
list1

In [None]:
list1[0]

In [None]:
list_eval=eval(list1)
list_eval

In [None]:
list_eval[0]

In [None]:
literal_eval(list1)[0]
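
The cells above show that `eval` and `literal_eval` both turn the string into a list. A short sketch of why `literal_eval` is the safer choice for data read from a file: it only accepts Python literals and rejects anything that would execute code. The `genres_text` string below is a made-up sample in the same shape as the `genres` column.

```python
from ast import literal_eval

# Illustrative string in the same shape as one 'genres' cell
genres_text = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(genres_text)      # list of dicts
names = [g['name'] for g in parsed]     # keep only the genre names
print(names)  # ['Animation', 'Comedy']

# literal_eval refuses expressions that are not plain literals
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```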

In [None]:
# converting genres as list
md['genres']=md['genres'].apply(literal_eval)

In [None]:
md.head()

In [None]:
# keep only the genre names from each list of {'id', 'name'} dicts
md['genres']=md['genres'].apply(lambda x: [i['name'] for i in x] if isinstance(x,list) else [])

In [None]:
md.head()

In [None]:
md[md['vote_count'].notnull()]

In [None]:
vote_count=md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_count

In [None]:
vote_average=md[md['vote_average'].notnull()]['vote_average'].astype('int')
vote_average

In [None]:
# recommending top movies
top_movies=md.copy()

In [None]:
# sorting in order of vote average
top_movies1=top_movies.sort_values('vote_average',ascending=False).head(250)

In [None]:
top_movies1

In [None]:
# min number of votes 1000
top_movies2=top_movies[top_movies['vote_count']>1000]

In [None]:
top_movies2

In [None]:
top_movies2.sort_values('vote_average',ascending=False).head(250)

**Weighted Rating**


$$W=\frac{R \cdot v + C \cdot m}{v + m}$$

where:  

- $W$ = Weighted rating
- $R$ = average rating for the movie, a number from 1 to 10 (mean) = (Rating)
- $v$ = number of votes for the movie = (votes)
- $m$ = minimum votes required to be listed in the Top 250 (currently 25,000)
- $C$ = the mean vote across the whole report (currently 7.0)

The $W$ in this formula is equivalent to a Bayesian posterior mean (see *Bayesian statistics*).

1. The next step is to determine an appropriate value of $m$, the minimum votes required to be listed in the chart. We will use the **95th percentile** as our cutoff. In other words, for a movie to feature in the chart, it must have more votes than at least 95% of the movies in the list.
2. We will calculate $C$ as the mean of the vote averages across the dataset.
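
To see what the weighting does, here is a small worked sketch of $W = \frac{v}{v+m} R + \frac{m}{v+m} C$, the same form used by the `weighted_rating` function in the cells that follow. The numbers and the names `weighted_rating_demo`, `C_demo` and `m_demo` are made up for illustration and are not taken from the dataset.

```python
def weighted_rating_demo(R, v, m, C):
    """Weighted rating: pulls R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C_demo = 5.0   # assumed mean rating across all movies
m_demo = 400   # assumed minimum-vote threshold

# A highly rated movie with few votes vs. a slightly lower-rated one with many
niche = weighted_rating_demo(R=9.0, v=100, m=m_demo, C=C_demo)
popular = weighted_rating_demo(R=8.0, v=10000, m=m_demo, C=C_demo)

print(round(niche, 2), round(popular, 2))  # 5.8 7.88
```

The movie with only 100 votes is pulled most of the way toward the global mean, while the movie with 10,000 votes keeps a score close to its own average, so it ranks higher despite the lower raw rating.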

In [None]:
C=vote_average.mean()
C

In [None]:
m = vote_count.quantile(0.95)
m

In [None]:
# extract the release year; unparseable dates (NaT) become NaN
top_movies['year']=pd.to_datetime(top_movies['release_date'],errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [None]:
top_movies3=top_movies[(top_movies['vote_count']>=m) & (top_movies['vote_count'].notnull()) & (top_movies['vote_average'].notnull())][['title','year','vote_count','vote_average','popularity','genres']]
top_movies3['vote_count']=top_movies3['vote_count'].astype('int')
top_movies3['vote_average']=top_movies3['vote_average'].astype('int')
top_movies3.shape

In [None]:
def weighted_rating(x):
    v=x['vote_count']
    R=x['vote_average']
    return (v/(v+m)*R)+(m/(v+m)*C)

In [None]:
top_movies3['weight_rate']=top_movies3.apply(weighted_rating,axis=1)

In [None]:
top_movies3.head()

In [None]:
top_movies3=top_movies3.sort_values('weight_rate', ascending=False).head(10)
top_movies3.head(10)

## Top Movies

In [None]:
# one row per (movie, genre) pair, so charts can be built per genre
genre_TM = top_movies.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1,drop=True)
genre_TM.name='genre'
genre_top_movies = top_movies.drop('genres',axis=1).join(genre_TM)

In [None]:
genre_top_movies

In [None]:
def build_chart(genre,percentile=0.85):
    # movies tagged with the requested genre
    df=genre_top_movies[genre_top_movies['genre']==genre]
    vote_counts=df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages=df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C=vote_averages.mean()
    m=vote_counts.quantile(percentile)
    
    # 'genres' was dropped when genre_top_movies was built, so select 'genre'
    qualified=df[(df['vote_count']>=m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title','year','vote_count','vote_average','popularity','genre']]
    qualified['vote_count']=qualified['vote_count'].astype('int')
    qualified['vote_average']=qualified['vote_average'].astype('int')
    
    # same weighted rating as before, with genre-specific m and C
    qualified['wr']=qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m)*x['vote_average'])+(m/(m+x['vote_count'])*C),axis=1)
    qualified=qualified.sort_values('wr',ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the top 10 movies for a few genres.  
The same call works for any genre, such as Romance, which almost didn't feature at all in our generic top chart despite being one of the most popular movie genres.

## Top Genres Movies

In [None]:
build_chart('Animation').head(10)

In [None]:
build_chart('Family').head(10)

In [None]:
build_chart('Action').head(10)

## Content Based Recommender

In [None]:
links_small = pd.read_csv("Dataset/links_small.csv")
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [None]:
# check EDA Notebook for how and why I got these indices
top_movies = top_movies.drop([19730,29503,35587])

In [None]:
top_movies['id']=top_movies['id'].astype('int')

In [None]:
# .copy() so the column assignments below modify an independent DataFrame
top_movies4=top_movies[top_movies['id'].isin(links_small)].copy()
top_movies4.shape

In [None]:
top_movies4.head()

We have **9099** movies available in our small movies metadata dataset, which is 5 times smaller than our original dataset of 45000 movies.

## Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance, so this will have to be done qualitatively.

In [None]:
top_movies4['tagline']=top_movies4['tagline'].fillna('')
# join overview and tagline with a space so the boundary words stay separate
top_movies4['description']=top_movies4['overview'].fillna('')+' '+top_movies4['tagline']
top_movies4['description']=top_movies4['description'].fillna('')

In [None]:
# stop_words='english' removes common English words; min_df=1 keeps every term
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,2),min_df=1,stop_words='english')
tfidf_matrix=tf.fit_transform(top_movies4['description'])

In [None]:
tfidf_matrix

In [None]:
tfidf_matrix.shape

**Cosine Similarity**

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$$cosine(x,y)=\frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$$

Since we have used the TF-IDF Vectorizer, whose output rows are L2-normalised by default, calculating the dot product directly gives us the cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of cosine_similarity since it is much faster.
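
The reason the dot product is enough can be checked on toy vectors: once each vector is scaled to unit length, the denominator of the cosine formula is 1. The sketch below uses plain Python rather than sklearn, and the vectors `a` and `b` and the helper names are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy term-weight vectors (illustrative values)
a = [1.0, 2.0, 0.0, 3.0]
b = [0.0, 1.0, 4.0, 1.0]

same = math.isclose(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
print(same)  # True
```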

In [None]:
cosine_sim = linear_kernel(tfidf_matrix,tfidf_matrix)

In [None]:
cosine_sim

In [None]:
cosine_sim[0]

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score

In [None]:
top_movies4 = top_movies4.reset_index()
titles=top_movies4['title']
indices=pd.Series(top_movies4.index, index=top_movies4['title'])

In [None]:
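def get_recommendation(title):
    # row position of the queried movie in the similarity matrix
    idx=indices[title]
    sim_score=list(enumerate(cosine_sim[idx]))
    # most similar first
    sim_score=sorted(sim_score, key=lambda x: x[1], reverse=True)
    # skip position 0 (the movie itself) and keep the next 30
    sim_score=sim_score[1:31]
    movie_indices=[i[0] for i in sim_score]
    return titles.iloc[movie_indices]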

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are

In [None]:
get_recommendation('GoldenEye').head(10)

In [None]:
get_recommendation('The Apartment').head(10)

In [None]:
get_recommendation('The Godfather').head(10)

In [None]:
get_recommendation('The Dark Knight').head(10)

## Collaborative Filtering

Our content-based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it cannot capture a user's tastes and recommend anything beyond movies similar to the one queried.

Also, the engine that we built is not really personal, in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who they are.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie watchers. It is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.
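
The idea can be illustrated with a tiny user-based sketch: a rating I have not given is estimated as a similarity-weighted average of the ratings given by other users. This is a simple neighbourhood method for intuition only, not the SVD approach used by Surprise, and the users, ratings and helper names are made up.

```python
import math

# Hypothetical ratings: user -> {movie: rating}
toy_user_ratings = {
    "me":    {"A": 5, "B": 4, "C": 1},
    "alice": {"A": 5, "B": 5, "C": 1, "D": 5},
    "bob":   {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine_on_common(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(u[i] ** 2 for i in common))
           * math.sqrt(sum(v[i] ** 2 for i in common)))
    return num / den

def predict(user, item):
    """Similarity-weighted average of other users' ratings for item."""
    num = den = 0.0
    for other, their in toy_user_ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_on_common(toy_user_ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

print(predict("me", "D"))
```

Because "me" rates movies much more like "alice" than like "bob", the estimate for movie D lands closer to alice's 5 than to bob's 1.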

We will not be implementing Collaborative Filtering from scratch. Instead, we will use the **Surprise** library, which uses extremely powerful algorithms like **Singular Value Decomposition (SVD)** to minimise RMSE and give great recommendations.

In [None]:
reader=Reader()

In [None]:
rating = pd.read_
rating.head()