# Collaborative filtering with embedding

The goal of this notebook is to review [lesson 5 of the fast.ai course](http://forums.fast.ai/t/wiki-lesson-5/9403) ([notes](https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-5-dd904506bee8)).

I will try to reproduce the lesson's notebook without looking at it (as best I can). I will also train on a larger dataset and add some experiments of my own.

The collaborative filtering model and its embedding layers are built entirely in PyTorch and trained (fitted) with the fast.ai framework, so this is not really 'from scratch' like the previous notebooks in this repository. However, the core of the model is a [PyTorch neural network](https://github.com/anhquan0412/basic_model_scratch/blob/master/NN_pytorch.ipynb) optimized with [gradient descent](https://github.com/anhquan0412/basic_model_scratch/blob/master/linear_regression.ipynb), both of which I have already covered.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.learner import * # fastai fit/predict function
from fastai.column_data import * # fastai columnar (structured) data loader
PATH = Path('data/large_ds/ml-latest')

The dataset contains more than 26 million ratings and 753,000 tag applications across more than 45,000 movies. Download it with:
```
wget http://files.grouplens.org/datasets/movielens/ml-latest.zip
```

In [None]:
!ls data/large_ds/ml-latest

In [4]:
ratings = pd.read_csv(PATH/'ratings.csv')
movies= pd.read_csv(PATH/'movies.csv')

In [5]:
ratings.tail()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 26024284 | 270896 | 58559 | 5.0 | 1257031564 |
| 26024285 | 270896 | 60069 | 5.0 | 1257032032 |
| 26024286 | 270896 | 63082 | 4.5 | 1257031764 |
| 26024287 | 270896 | 64957 | 4.5 | 1257033990 |
| 26024288 | 270896 | 71878 | 2.0 | 1257031858 |


## Simple EDA 

In [6]:

# top 10 users with the most reviews, and their average ratings
top = ratings.groupby(['userId']).rating.agg(['mean','count']).reset_index()
top_users = top.sort_values('count',ascending=False).head(10).reset_index(drop=True)
top_users

| | userId | mean | count |
|---|---|---|---|
| 0 | 45811 | 3.198758 | 18276 |
| 1 | 8659 | 3.278424 | 9279 |
| 2 | 270123 | 2.597473 | 7638 |
| 3 | 179792 | 3.208317 | 7515 |
| 4 | 228291 | 3.220175 | 7410 |
| 5 | 243443 | 1.576028 | 6320 |
| 6 | 98415 | 2.804972 | 6094 |
| 7 | 229879 | 3.498257 | 6024 |
| 8 | 98787 | 2.43808 | 5814 |
| 9 | 172224 | 3.747851 | 5701 |


In [7]:
# top 10 most-reviewed movies and their average ratings
top = ratings.groupby(['movieId']).rating.agg(['mean','count']).reset_index()
top_movies = top.sort_values('count',ascending=False).head(10).reset_index(drop=True)

In [8]:
top_movies.merge(movies,on='movieId',how='left')

| | movieId | mean | count | title | genres |
|---|---|---|---|---|---|
| 0 | 356 | 4.052926 | 91921 | Forrest Gump (1994) | Comedy\|Drama\|Romance\|War |
| 1 | 318 | 4.429015 | 91082 | Shawshank Redemption, The (1994) | Crime\|Drama |
| 2 | 296 | 4.169975 | 87901 | Pulp Fiction (1994) | Comedy\|Crime\|Drama\|Thriller |
| 3 | 593 | 4.152246 | 84078 | Silence of the Lambs, The (1991) | Crime\|Horror\|Thriller |
| 4 | 2571 | 4.154098 | 77960 | Matrix, The (1999) | Action\|Sci-Fi\|Thriller |
| 5 | 260 | 4.132299 | 77045 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Sci-Fi |
| 6 | 480 | 3.660238 | 74355 | Jurassic Park (1993) | Action\|Adventure\|Sci-Fi\|Thriller |
| 7 | 527 | 4.266531 | 67662 | Schindler's List (1993) | Drama\|War |
| 8 | 110 | 4.016057 | 66512 | Braveheart (1995) | Action\|Drama\|War |
| 9 | 1 | 3.888157 | 66008 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |


In [10]:
crosstab = pd.merge(ratings,top_users,how='inner',on='userId')

In [11]:
crosstab = pd.merge(crosstab,top_movies,how='inner',on='movieId')

In [12]:
pd.crosstab(crosstab.userId, crosstab.movieId, crosstab.rating, aggfunc=np.sum)

| userId (rows) / movieId (columns) | 1 | 110 | 260 | 296 | 318 | 356 | 480 | 527 | 593 | 2571 |
|---|---|---|---|---|---|---|---|---|---|---|
| 8659 | 4.0 | 5.0 | 4.0 | 4.5 | 4.0 | 4.0 | 3.0 | 4.0 | 4.0 | 4.0 |
| 45811 | 4.0 | 3.5 | 4.0 | 4.5 | 4.5 | 3.5 | 4.0 | 4.5 | 4.0 | 4.0 |
| 98415 | 3.0 | 4.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.5 | 4.0 |
| 98787 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 2.0 | 2.5 | 3.5 | 4.0 | 4.0 |
| 172224 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| 179792 | 5.0 | 4.0 | 5.0 | 5.0 | 4.5 | 4.5 | 4.0 | 5.0 | 5.0 | 5.0 |
| 228291 | 3.0 | 4.0 | 4.0 | 4.0 | 4.5 | 3.0 | 4.0 | 4.0 | 4.0 | 4.5 |
| 229879 | 5.0 | 1.0 | 5.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 5.0 |
| 243443 | 5.0 | 3.5 | 5.0 | 5.0 | 5.0 | 2.0 | 3.5 | 5.0 | 2.0 | 5.0 |
| 270123 | 3.5 | 4.0 | 0.5 | 5.0 | 5.0 | 3.5 | 3.0 | 4.5 | 5.0 | 1.5 |


# Embedding

The idea is to describe each movie and each user with a vector of size n_factors. Movies can be characterized along latent dimensions (for example genre: action, romance, ...), and so can users' tastes. We can therefore represent all users as a matrix of shape (n_users, n_factors) and all movies as a matrix of shape (n_movies, n_factors). Multiplying the user matrix by the transpose of the movie matrix gives an (n_users, n_movies) matrix of predicted ratings, and we fit both factor matrices by minimizing the loss against the known ratings with gradient descent.
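To make the idea concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). It is not the lesson's model or fastai's implementation: the names `fit_factors`, `dot`, `mse` and the toy ratings are made up for illustration, and the loss is assumed to be squared error trained with per-rating SGD. In PyTorch, the two factor matrices would typically be `nn.Embedding` layers.

```python
import random

def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))

def mse(ratings, users, movies):
    """Mean squared error of predicted vs. known ratings."""
    return sum((dot(users[u], movies[m]) - r) ** 2 for u, m, r in ratings) / len(ratings)

def fit_factors(ratings, n_users, n_movies, n_factors=3, lr=0.02, epochs=2000, seed=0):
    """Learn (n_users, n_factors) and (n_movies, n_factors) matrices with SGD."""
    rng = random.Random(seed)
    # small random initial factors for every user and every movie
    users = [[rng.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    movies = [[rng.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            # prediction error for this (user, movie) pair
            err = dot(users[u], movies[m]) - r
            for k in range(n_factors):
                # gradients of 0.5 * err**2 w.r.t. each factor, computed before updating
                grad_user = err * movies[m][k]
                grad_movie = err * users[u][k]
                users[u][k] -= lr * grad_user
                movies[m][k] -= lr * grad_movie
    return users, movies

# Hypothetical toy data: (user index, movie index, rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.5)]

users, movies = fit_factors(toy_ratings, n_users=3, n_movies=3)
final_mse = mse(toy_ratings, users, movies)
# predicted rating for a pair that was never observed (user 0, movie 2)
unseen_prediction = dot(users[0], movies[2])
```

Only the observed (user, movie) pairs contribute to the loss, so the full (n_users, n_movies) matrix never has to be materialized; a prediction for an unseen pair is just the dot product of the two learned vectors.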

In [13]:
#build a movie rating dataset
from torch.utils.data import Dataset

In [22]:
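def get_val_idxs(n,val_perc):
    # fixed seed so the same validation rows are picked on every run
    np.random.seed(42)
    val_size = int(n*val_perc)
    # shuffle row indices 0..n-1 and keep the first val_size of them
    return np.random.permutation(n)[:val_size]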

In [23]:
val_idxs = get_val_idxs(len(ratings),.2)
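As a rough illustration of how such indices are used, the sketch below splits a list of rows into training and validation sets. It is a dependency-free stand-in, not the notebook's code: `sample_val_idxs` mimics `get_val_idxs` with the standard `random` module instead of `np.random.permutation`, and `split_rows` is a hypothetical helper.

```python
import random

def sample_val_idxs(n, val_perc, seed=42):
    """Return a reproducible random subset of row indices for validation."""
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)  # seeded shuffle: same order on every call
    return idxs[:int(n * val_perc)]

def split_rows(rows, val_idxs):
    """Partition rows into (train, valid) according to the validation indices."""
    val_set = set(val_idxs)  # set lookup keeps the membership test O(1)
    train = [row for i, row in enumerate(rows) if i not in val_set]
    valid = [rows[i] for i in sorted(val_set)]
    return train, valid

# Hypothetical rows standing in for the ratings table
rows = list("abcdefghij")
val = sample_val_idxs(len(rows), 0.2)
train, valid = split_rows(rows, val)
```

Every row ends up in exactly one of the two sets, and re-running with the same seed reproduces the same split.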

### Create the fastai model data object

In [24]:
cf = CollabFilterDataset.from_csv(PATH, 'ratings.csv', 'userId', 'movieId', 'rating')
# learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
# learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2)

In [27]:
#TODO: learn what get_data, get_model do. 
# OR at least write PSEUDO-CODE on what they do and try to explain each line

array([     0,      0,      0, ..., 270895, 270895, 270895])
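The array above starts at 0 and ends at 270895, while the last `userId` in `ratings.csv` is 270896. That is consistent with raw IDs having been re-mapped to contiguous indices, which an embedding lookup needs since it indexes rows of a matrix. This is my reading of the output, not something confirmed from the fastai source. A minimal sketch of such a mapping, with the hypothetical helper `build_index` and made-up IDs:

```python
def build_index(raw_ids):
    """Map each distinct raw ID to a contiguous index 0..n-1, in first-seen order."""
    id2idx = {}
    for raw in raw_ids:
        if raw not in id2idx:
            id2idx[raw] = len(id2idx)  # next unused row of the embedding matrix
    return id2idx

# Hypothetical raw user IDs with gaps, as they might appear in a ratings file
raw_user_ids = [1, 1, 7, 7, 7, 270896]

user2idx = build_index(raw_user_ids)
user_idx = [user2idx[u] for u in raw_user_ids]  # contiguous indices, one per rating row
n_users = len(user2idx)                          # number of rows the user embedding needs
```

With this mapping the embedding matrix only needs `n_users` rows, no matter how large or sparse the raw IDs are.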