# <span style="color:blue">Recommendation Systems</span>

### Why we want recommender systems

<td>
    <p style="text-align:center;"> <b> Long Tail Phenomenon </b></p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/8/8a/Long_tail.svg" style="width: 600px;" align="right"/>
</td>

* The Netflix Prize offered US$1M for improving the accuracy of Netflix's recommendation system by 10%
* Recommendations can drive a 10-15% increase in revenue

### How can we make recommendations
* Recommend the most popular items
* Use a binary classifier to make recommendations
* Recommendation algorithms
    * Content-Based: Recommend items to person P that are similar to items P has previously rated.
    * Collaborative Filtering
    * Latent factorization
* Hybrid Algorithms
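
The first option in the list, recommending the most popular items, makes a useful baseline. The sketch below is one possible way to do it with pandas. The toy ratings table, the function name `top_popular` and the `min_ratings` threshold are assumptions made for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Toy ratings table (hypothetical data, for illustration only).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 4.0, 4.0],
})

def top_popular(ratings_df, k=2, min_ratings=2):
    """Rank movies by mean rating, keeping only movies with enough ratings."""
    stats = ratings_df.groupby("movieId")["rating"].agg(["count", "mean"])
    stats = stats[stats["count"] >= min_ratings]
    return stats.sort_values(["mean", "count"], ascending=False).head(k)

popular = top_popular(toy_ratings)
print(popular)
```

Every user gets the same list, so this baseline ignores personal taste. The `min_ratings` filter keeps a movie with a single 5-star rating from topping the ranking.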

### Similarity Measures
* <a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine Similarity</a>
* <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson Similarity</a>
* <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Similarity</a>


In [108]:
# https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
import numpy as np
from scipy.spatial import distance

a = np.asarray([1, 0, 2, 3])
b = np.asarray([1, 1, 0, 2])
# scipy returns distances, so similarity = 1 - distance
print(1 - distance.cosine(a, b))
print(1 - distance.correlation(a, b))
print(1 - distance.jaccard(a, b))

0.763762615826
0.316227766017
0.25
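
To see where the first two numbers come from, the same vectors can be worked through by hand. This is a minimal sketch using the standard definitions: cosine similarity is the dot product divided by the product of the norms, and Pearson similarity is the cosine similarity of the mean-centred vectors. The helper names `cosine_sim` and `pearson_sim` are illustrative.

```python
import numpy as np

a = np.asarray([1, 0, 2, 3], dtype=float)
b = np.asarray([1, 1, 0, 2], dtype=float)

def cosine_sim(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_sim(u, v):
    """Cosine similarity of the mean-centred vectors."""
    return cosine_sim(u - u.mean(), v - v.mean())

cos_ab = cosine_sim(a, b)        # 7 / sqrt(14 * 6)
pearson_ab = pearson_sim(a, b)   # 1 / sqrt(5 * 2)
print(cos_ab, pearson_ab)
```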


### 1. Collaborative filtering
Collaborative filtering finds users with similar preferences, as shown by their behaviour (e.g. movies watched, movies rated), and uses the ratings of those similar users to make recommendations.

<td>
<p style="text-align:center;"> Collaborative filtering</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif" />
</td>

### <span style="color:blue"> Movies Dataset</span>

In [42]:
import pandas as pd
import numpy as np

In [23]:
movies = pd.read_csv('./ml-latest-small/movies.csv')
ratings = pd.read_csv('./ml-latest-small/ratings.csv')

In [24]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [32]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


## Let's combine ratings and movies

In [26]:
ratings = pd.merge(ratings, movies, how='left', on='movieId')

In [31]:
ratings.head()

,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,1,1029,3.0,1260759179,Dumbo (1941),Animation|Children|Drama|Musical
2,1,1061,3.0,1260759182,Sleepers (1996),Thriller
3,1,1129,2.0,1260759185,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
4,1,1172,4.0,1260759205,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama


## Convert the data into a user-movie rating matrix

In [33]:
data = ratings.pivot_table(columns=['movieId'], index=['userId'], values='rating')

In [34]:
data.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,


### How to deal with missing ratings, i.e. NaN values in the matrix
1. Fill with 0
2. Find the set of N users most similar to user x who have rated movie i <br>
    a) Average rating of the N most similar users, where $r_{ki}$ is the rating user k gave to movie i
    $$rating_{xi} = \frac{1}{N}\sum_{k=1}^{N} r_{ki}$$
    b) Weighted average over the N most similar users, where $S_{xk}$ is the similarity between user x and user k
    $$rating_{xi} = \frac{\sum_{k=1}^{N} S_{xk}\, r_{ki}}{\sum_{k=1}^{N} S_{xk}}$$
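
The two averaging options can be sketched as follows. The similarity values and neighbour ratings are made-up numbers, and `predict_rating` is a hypothetical helper name. The fallback to a plain mean when all similarities are zero is an assumption, not something the formula prescribes.

```python
import numpy as np

def predict_rating(sims, neighbour_ratings):
    """Similarity-weighted average of the neighbours' ratings for one movie."""
    sims = np.asarray(sims, dtype=float)
    neighbour_ratings = np.asarray(neighbour_ratings, dtype=float)
    total = sims.sum()
    if total == 0:
        # No informative neighbours: fall back to the plain mean (an assumption).
        return float(neighbour_ratings.mean())
    return float((sims * neighbour_ratings).sum() / total)

# Hypothetical neighbours of user x who have rated movie i.
sims = [0.5, 0.3, 0.2]       # similarity of each neighbour to user x
ratings_i = [4.0, 3.0, 5.0]  # each neighbour's rating of movie i

simple_avg = float(np.mean(ratings_i))          # option (a)
weighted_avg = predict_rating(sims, ratings_i)  # option (b)
print(simple_avg, weighted_avg)
```

The weighted version pulls the prediction towards the ratings of the closest neighbours, which is why it comes out slightly below the plain average here.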

In [37]:
data = data.fillna(0)

In [67]:
data.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Calculate user-user similarity

In [39]:
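from scipy.spatial import distance

In [62]:
# Condensed vector of pairwise cosine distances between users (rows of data)
xx = distance.pdist(data, metric='cosine')

In [72]:
# Expand to a square user x user matrix; note these are distances, not similarities
DF_cosine = pd.DataFrame(distance.squareform(xx), columns=data.index, index=data.index)

In [73]:
xx.shape  # 671 * 670 / 2 = 224785 user pairs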

(224785,)
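
The shape above follows from how `pdist` stores its result: for n rows it returns only the n(n-1)/2 distances between distinct pairs, and `squareform` expands that condensed vector into a symmetric n x n matrix with zeros on the diagonal. A small sketch on a made-up 4 x 2 array:

```python
import numpy as np
from scipy.spatial import distance

# Four toy "users" with two features each (hypothetical data).
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

condensed = distance.pdist(toy, metric="cosine")  # one entry per pair of rows
square = distance.squareform(condensed)           # symmetric 4 x 4 matrix
n = toy.shape[0]

print(condensed.shape)  # n * (n - 1) / 2 entries
print(square.shape)
```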

In [75]:
DF_cosine.head()

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
userId
1,0.0,1.0,1.0,0.925518,0.983182,1.0,0.916116,1.0,0.987157,1.0,...,1.0,1.0,0.985526,0.956281,1.0,1.0,1.0,0.937083,1.0,0.982534
2,1.0,0.0,0.875705,0.881179,0.896354,1.0,0.787015,0.88681,0.886667,0.956787,...,0.522694,0.936798,0.922255,0.835838,0.533719,0.574538,0.915354,0.97586,0.829405,0.886825
3,1.0,0.875705,0.0,0.91836,0.848469,0.939309,0.845286,0.750219,0.865525,0.885328,...,0.838795,0.935802,0.823866,0.841643,0.822902,0.875438,0.875089,0.919016,0.863394,0.829807
4,0.925518,0.881179,0.91836,0.0,0.869351,0.920352,0.680255,0.808987,0.969583,0.862814,...,0.885681,0.952772,0.863421,0.74597,0.878095,0.911265,0.931517,0.895691,0.945488,0.788391
5,0.983182,0.896354,0.848469,0.869351,0.0,0.936204,0.904112,0.834288,0.913384,0.96763,...,0.808971,0.978858,0.853827,0.775755,0.860279,0.941748,0.957074,0.961642,0.937358,0.774914


In [81]:
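def similarity_func(u, v):
    """Cosine similarity, i.e. 1 - cosine distance."""
    return 1 - distance.cosine(u, v)

# pdist accepts a callable metric; squareform sets the diagonal (self-similarity) to 0
xx = distance.pdist(data, similarity_func)
DF_cosine = pd.DataFrame(distance.squareform(xx), columns=data.index, index=data.index)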




### Find the top N similar users

In [86]:
def similar_user(userId, n):
    """Return the n users most similar to userId, most similar first."""
    all_users = DF_cosine.loc[userId, :].sort_values(ascending=False)
    top_users = all_users.iloc[:n]
    return top_users

In [90]:
similar_user(userId=310, n=10)

userId
325    0.187890
1      0.157524
290    0.141114
604    0.139093
276    0.133189
207    0.130650
330    0.128619
35     0.126856
167    0.110321
259    0.109939
Name: 310, dtype: float64

### Top-rated movies of a user

In [109]:
def top_rated_movies(userId, no_rating):
    """Return the no_rating highest-rated movies of the given user."""
    user_rating = ratings.loc[ratings['userId'] == userId].sort_values(by='rating', ascending=False)
    return user_rating.iloc[:no_rating]

top_rated_movies(10, 10)

,userId,movieId,rating,timestamp,title,genres
744,10,50,5.0,942766420,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
763,10,1358,5.0,942766420,Sling Blade (1996),Drama
786,10,2926,5.0,942767121,Hairspray (1988),Comedy|Drama
779,10,2826,5.0,942766109,"13th Warrior, The (1999)",Action|Adventure|Fantasy
778,10,2571,5.0,942766515,"Matrix, The (1999)",Action|Sci-Fi|Thriller
774,10,2344,5.0,942766991,Runaway Train (1985),Action|Adventure|Drama|Thriller
772,10,1923,5.0,942766515,There's Something About Mary (1998),Comedy|Romance
770,10,1719,5.0,942766472,"Sweet Hereafter, The (1997)",Drama
767,10,1611,5.0,942767029,My Own Private Idaho (1991),Drama|Romance
789,10,3019,4.0,942767571,Drugstore Cowboy (1989),Crime|Drama
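
The ideas behind `similar_user` and `top_rated_movies` can be combined into actual recommendations: find a user's nearest neighbours, then score the movies the user has not rated using the neighbours' ratings. The sketch below does this on a small made-up matrix with NumPy only. The function names, the toy data and the choice of two neighbours are assumptions for illustration, not the notebook's own method.

```python
import numpy as np

# Toy user x movie matrix; 0 means "not rated" (hypothetical data).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 3.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [5.0, 4.0, 4.0, 0.0],
])

def cosine_similarity_matrix(M):
    """User-user cosine similarity with the diagonal set to 0."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # a user is not their own neighbour
    return sim

def recommend(R, user, n_neighbours=2):
    """Return (movie index, predicted rating) for unrated movies, best first."""
    sim = cosine_similarity_matrix(R)
    neighbours = np.argsort(sim[user])[::-1][:n_neighbours]
    predictions = []
    for movie in np.flatnonzero(R[user] == 0):
        raters = [int(k) for k in neighbours if R[k, movie] > 0]
        if not raters:
            continue  # none of the neighbours rated this movie
        weights = sim[user, raters]
        if weights.sum() == 0:
            continue
        score = float(weights @ R[raters, movie] / weights.sum())
        predictions.append((int(movie), score))
    return sorted(predictions, key=lambda p: p[1], reverse=True)

recs = recommend(R, user=0)
print(recs)
```

Here user 0 has not rated movie 2. The two nearest neighbours (users 1 and 3) rated it 3 and 4, so the prediction falls between those values.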


### Exercise: Do the same for item-item collaborative filtering
1. In practice, item-item outperforms user-user in many cases
2. Items are simpler than users <br>
    a) Items belong to a small set of genres, while users have varied tastes<br>
    b) Item similarity is more meaningful than user similarity

# 2. Content-based recommender systems
#### Idea: Recommend to customer x items similar to those x has previously rated highly
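
One simple way to measure how similar two items are is to compare their content features, for example the genre sets of two movies with Jaccard similarity. The sketch below uses the genres from the `movies.head()` output shown earlier. The function names and the choice of Jaccard on genres are assumptions for illustration.

```python
# Movie title -> set of genres, taken from the movies.head() output.
catalogue = {
    "Toy Story (1995)": {"Adventure", "Animation", "Children", "Comedy", "Fantasy"},
    "Jumanji (1995)": {"Adventure", "Children", "Fantasy"},
    "Grumpier Old Men (1995)": {"Comedy", "Romance"},
    "Waiting to Exhale (1995)": {"Comedy", "Drama", "Romance"},
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_items(liked_title, catalogue, k=2):
    """Return the k movies whose genre sets are closest to the liked movie."""
    liked = catalogue[liked_title]
    scores = [(title, jaccard(liked, genres))
              for title, genres in catalogue.items() if title != liked_title]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

recs = similar_items("Toy Story (1995)", catalogue)
print(recs)
```

Unlike collaborative filtering, this needs no ratings from other users, so it also works for a brand-new movie as long as its genres are known.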

# 3. Latent factorization

In [96]:
# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# SVD
U, s, VT = svd(A)
print(U)
print(s)
print(VT)

[[1 2]
 [3 4]
 [5 6]]
[[-0.2298477   0.88346102  0.40824829]
 [-0.52474482  0.24078249 -0.81649658]
 [-0.81964194 -0.40189603  0.40824829]]
[ 9.52551809  0.51430058]
[[-0.61962948 -0.78489445]
 [-0.78489445  0.61962948]]
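
The factors relate to the original matrix through A = U Σ Vᵀ, and keeping only the k largest singular values gives a low-rank approximation, which is the idea behind latent factorization. The sketch below checks both on the same matrix. It uses `numpy.linalg.svd` with `full_matrices=False` (the reduced form, so U is 3 x 2), which is a choice made here for convenience and differs from the full form printed above.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Reduced SVD: U is 3 x 2, s holds the 2 singular values, VT is 2 x 2.
U, s, VT = np.linalg.svd(A, full_matrices=False)

# Multiplying the factors back together recovers A.
A_full = U @ np.diag(s) @ VT

# Rank-1 approximation: keep only the largest singular value.
k = 1
A_rank1 = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# The Frobenius norm of the error equals the discarded singular value.
error = np.linalg.norm(A - A_rank1)
print(A_rank1.round(3))
print(error, s[1])
```

In a recommender setting the rows of `U[:, :k]` can be read as user factors and the columns of `VT[:k, :]` as item factors. Real rating matrices are mostly missing, though, so practical systems usually fit the factors to the observed ratings only rather than running a plain SVD.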


### Problem:
https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine/#problem_statement

### Reference links
1. SVD: http://www.d.umn.edu/~mhampton/m4326svd_example.pdf
2. Long Tail phenomenon: https://en.wikipedia.org/wiki/Long_tail
3. The Long Tail (Wired): https://www.wired.com/2004/10/tail/
4. MovieLens Dataset: http://files.grouplens.org/datasets/movielens/ml-100k.zip
5. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks: http://www.bgu.ac.il/~shanigu/Publications/JMLRSurvey.pdf
6. ALS Implicit Collaborative Filtering: https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe