# Recommendation Engine

#### Reference:
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/

#### Similarity Formulas

Content-Based Filtering

Here, similarity is measured as the cosine of the angle between two vectors $A$ and $B$:

$$sim(A,B)=cos(\theta)=\frac{A\cdot B}{\lVert A \rVert\lVert B \rVert}$$
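
As a minimal sketch of the cosine formula above, the snippet below computes the similarity of two small vectors with NumPy. The function name `cosine_similarity`, the genre vectors, and the choice to return 0 for an all-zero vector are assumptions made for illustration; they are not part of the referenced guide.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return float(a.dot(b) / norm_product)

# Hypothetical genre vectors (Action, Comedy, Drama) for two movies.
movie_a = [1, 0, 1]
movie_b = [1, 1, 0]
sim_ab = cosine_similarity(movie_a, movie_b)  # 1 / (sqrt(2) * sqrt(2)) = 0.5
```

The two movies share one of their two genres, so the similarity lands halfway between orthogonal (0) and identical direction (1).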

Distance-Based Measurement

$$\text{Euclidean Distance}=\sqrt{(x_{1}-y_{1})^2+...+(x_{N}-y_{N})^2}$$
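
The distance formula above can be sketched in a few lines of NumPy. The function name `euclidean_distance` and the rating vectors are hypothetical values chosen so the arithmetic is easy to check by hand. Unlike cosine similarity, a smaller value here means the two vectors are more alike.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# Hypothetical ratings that two users gave to the same three movies.
user_x = [5, 3, 4]
user_y = [1, 3, 1]
dist_xy = euclidean_distance(user_x, user_y)  # sqrt(16 + 0 + 9) = 5.0
```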

Pearson's Correlation Coefficient

$$sim(u,v)=\frac{\sum(r_{ui}-\bar{r_{u}})(r_{vi}-\bar{r_{v}})}{\sqrt{\sum(r_{ui}-\bar{r_{u}})^2}\sqrt{\sum(r_{vi}-\bar{r_{v}})^2}}$$
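
A minimal sketch of the correlation formula above, assuming both vectors hold ratings for the same set of co-rated items. The function name `pearson_similarity`, the sample ratings, and the choice to return 0 when a user's ratings have no variance are assumptions for illustration.

```python
import numpy as np

def pearson_similarity(r_u, r_v):
    """Pearson correlation between two users' ratings of the same items."""
    r_u = np.asarray(r_u, dtype=float)
    r_v = np.asarray(r_v, dtype=float)
    dev_u = r_u - r_u.mean()  # deviations from user u's mean rating
    dev_v = r_v - r_v.mean()  # deviations from user v's mean rating
    denom = np.sqrt(np.sum(dev_u ** 2)) * np.sqrt(np.sum(dev_v ** 2))
    if denom == 0.0:
        # Assumption: a user with constant ratings gets similarity 0.
        return 0.0
    return float(np.sum(dev_u * dev_v) / denom)

# Hypothetical users: v rates every movie exactly one point higher than u.
ratings_u = [4, 3, 2]
ratings_v = [5, 4, 3]
sim_uv = pearson_similarity(ratings_u, ratings_v)
```

Because each user's mean is subtracted first, the two users come out as perfectly correlated even though one rates consistently higher than the other.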


### Collaborative Filtering

$$P_{u,i}=\frac{\sum_{v}(r_{v,i}\cdot s_{u,v})}{\sum_{v} s_{u,v}}$$

where:

$P_{u,i}$ is the predicted rating of user $u$ for item $i$ <br>
$r_{v,i}$ is the rating given by a user $v$ for movie $i$<br>
$s_{u,v}$ is the similarity between users $u$ and $v$<br>
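
To make the prediction concrete, here is a small worked sketch: the predicted rating is the average of other users' ratings for the item, weighted by how similar each of those users is to user $u$. The function name `predict_rating` and the numbers are hypothetical, and returning 0 when all similarities are zero is an assumption.

```python
import numpy as np

def predict_rating(neighbor_ratings, similarities):
    """Similarity-weighted average of other users' ratings for one item."""
    r = np.asarray(neighbor_ratings, dtype=float)
    s = np.asarray(similarities, dtype=float)
    total = s.sum()
    if total == 0.0:
        # Assumption: with no similar users there is nothing to predict from.
        return 0.0
    return float((r * s).sum() / total)

# Hypothetical: three users v rated movie i, each with a similarity to user u.
neighbor_ratings = [5, 3, 4]
similarities = [0.5, 0.25, 0.25]
pred = predict_rating(neighbor_ratings, similarities)  # (2.5 + 0.75 + 1.0) / 1.0 = 4.25
```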

In [1]:
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users_df = pd.read_csv('./data/ml-100k/u.user', sep='|', names=u_cols,encoding='latin-1')

In [13]:
print(users_df.shape)
users_df.head()

(943, 5)


| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


In [6]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_df = pd.read_csv('./data/ml-100k/u.data', sep='\t', names=r_cols,encoding='latin-1')

In [14]:
print(ratings_df.shape)
ratings_df.head()

(100000, 4)


| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [8]:
#Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 
          'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 
          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 
          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items_df = pd.read_csv('./data/ml-100k/u.item', sep='|', names=i_cols, encoding='latin-1')

In [12]:
print(items_df.shape)
items_df.head()

(1682, 24)


| | movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [15]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_train_df = pd.read_csv('./data/ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test_df = pd.read_csv('./data/ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')
ratings_train_df.shape, ratings_test_df.shape

((90570, 4), (9430, 4))

In [16]:
n_users = ratings_df.user_id.unique().shape[0]
n_items = ratings_df.movie_id.unique().shape[0]

In [19]:
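# Build a dense user-item matrix; entries for unrated movies stay 0
data_matrix = np.zeros((n_users, n_items))

# IDs in the file are 1-based, so subtract 1 to get 0-based array indices
for row in ratings_df.itertuples():
    data_matrix[row.user_id - 1, row.movie_id - 1] = row.rating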

In [25]:
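from sklearn.metrics.pairwise import pairwise_distances

# pairwise_distances with metric='cosine' returns cosine *distance* (1 - similarity),
# so subtract it from 1 to obtain similarity
user_similarity = 1 - pairwise_distances(data_matrix, metric='cosine')
item_similarity = 1 - pairwise_distances(data_matrix.T, metric='cosine')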

In [29]:
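def predict(ratings, similarity, type='user'):
    """Predict ratings as a similarity-weighted average (type: 'user' or 'item')."""
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns the means into a column so they broadcast across each user's row
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        # Add the similarity-weighted deviations of the other users to each user's mean
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weight each user's ratings by item-item similarity, normalized per item
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred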

In [30]:
user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

In [43]:
# Label the item-based predictions with movie titles (rows are users, columns are movies)
test = pd.DataFrame(item_prediction)
test.columns = items_df['movie title']

### Matrix Factorization

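Matrix factorization lets us predict ratings by extracting **latent features**: the user-item rating matrix is approximated by the product of two smaller matrices, one describing users and one describing items, using an eigenvalue-decomposition-style approach.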
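
As a rough illustration of latent feature extraction, the sketch below factorizes a tiny rating matrix with a truncated singular value decomposition (SVD), which is closely related to eigenvalue decomposition: the singular values are the square roots of the eigenvalues of $R^TR$. The toy matrix, the choice of `k = 2` latent features, and the variable names are assumptions. The sketch also treats every rating as observed, which real rating data such as MovieLens does not satisfy, so this is only one possible approach.

```python
import numpy as np

# Hypothetical 4-user x 3-movie rating matrix with every entry observed.
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

k = 2  # number of latent features to keep (an assumption for this toy example)

# Thin SVD: ratings == U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the k strongest latent features
user_factors = U[:, :k] * s[:k]  # shape (4, k): each user's latent features
item_factors = Vt[:k, :]         # shape (k, 3): each movie's latent features

# Rank-k approximation of the original ratings
approx = user_factors @ item_factors
error = np.linalg.norm(ratings - approx)  # Frobenius norm of the residual
```

Multiplying a row of `user_factors` by a column of `item_factors` gives the approximate rating of that user for that movie. The residual `error` equals the one singular value that was dropped.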