# Collaborative Filtering with Neural Networks

In this notebook we will write a matrix factorization model in PyTorch to solve a recommendation problem. Then we will write a more general neural network model for the same problem.

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](https://grouplens.org/datasets/movielens/), a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [2]:
from pathlib import Path
import pandas as pd
import numpy as np

In [3]:
#PATH = Path("/data2/yinterian/ml-latest-small/")
PATH = Path("/Users/yinterian/teaching/deeplearning/data/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/links.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/movies.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/ratings.csv'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/README.txt'),
 PosixPath('/Users/yinterian/teaching/deeplearning/data/ml-latest-small/tags.csv')]

In [4]:
! head /Users/yinterian/teaching/deeplearning/data/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [5]:
data = pd.read_csv(PATH/"ratings.csv")

In [6]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
This is similar to what you did for hw1 in ML-2. We encode the data so that users and movies have contiguous ids. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.
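To make the idea concrete, here is a minimal sketch of contiguous encoding on a toy column. The raw ids are made-up values and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical raw movie ids with gaps and repeats
raw_ids = pd.Series([31, 1029, 31, 1061, 1029])

# map each distinct raw id to 0, 1, 2, ... in order of first appearance
name2idx = {raw: idx for idx, raw in enumerate(raw_ids.unique())}
encoded = np.array([name2idx[x] for x in raw_ids])

print(encoded)  # [0 1 0 2 1]
```

The encoded ids can be used directly as row indices into an embedding matrix, which is why the gaps in the raw ids have to go.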

In [7]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [8]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.
    If train_col is given, values of col not seen in train_col are encoded as -1.
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [9]:
def encode_data(df, train=None):
    """Encodes rating data with contiguous user and movie ids.
    If train is provided, encodes df with the same encoding as train
    and drops rows whose user or movie does not appear in train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df
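
The sketch below illustrates what happens to validation rows whose user or movie never appears in training: they get id -1 and are dropped. The data frames are toy data, and `encode_like_train` is a hypothetical compact helper written only for this illustration; it follows the same idea as the functions above but is not the same code.

```python
import pandas as pd

# toy data: user 99 and movie 42 appear only in validation
train_toy = pd.DataFrame({"userId": [11, 11, 27], "movieId": [5, 9, 5], "rating": [4.0, 5.0, 3.0]})
val_toy = pd.DataFrame({"userId": [27, 99, 11], "movieId": [9, 5, 42], "rating": [2.0, 4.0, 1.0]})

def encode_like_train(df, train):
    """Hypothetical helper: encode df with ids learned from train, dropping unseen ids."""
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        name2idx = {o: i for i, o in enumerate(train[col_name].unique())}
        # ids missing from train become -1
        df[col_name] = df[col_name].map(name2idx).fillna(-1).astype(int)
        df = df[df[col_name] >= 0].copy()
    return df

val_enc = encode_like_train(val_toy, train_toy)
print(val_enc)  # only the (user 27, movie 9) row survives, encoded as (1, 1)
```

Dropping these rows means the validation loss says nothing about cold-start users or movies, which is worth keeping in mind when reading the numbers later on.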

In [13]:
# to check my new implementation
df_t = pd.read_csv(PATH/"tiny_training2.csv")
df_v = pd.read_csv(PATH/"tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [14]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [19]:
# an Embedding module holding 10 users (or items), each with an embedding of size 3
# the embeddings are initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[ 0.1274, -0.9213, -0.3831],
        [ 0.0266, -2.3243, -1.3237],
        [ 1.4337, -0.2975,  0.9381],
        [-1.3270, -0.6006, -2.0270],
        [-1.7503, -1.0709,  0.9440],
        [ 1.4559, -0.5809,  1.2154],
        [-2.0870,  0.6142,  0.4349],
        [ 0.4187, -1.1870,  1.5891],
        [ 0.5743,  1.5168,  0.3606],
        [-0.6813, -1.4604,  0.1742]])

In [21]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same?
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[ 0.0266, -2.3243, -1.3237],
         [ 0.1274, -0.9213, -0.3831],
         [ 0.0266, -2.3243, -1.3237],
         [-1.7503, -1.0709,  0.9440],
         [ 1.4559, -0.5809,  1.2154],
         [ 0.0266, -2.3243, -1.3237]]])
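
As a quick check of what "look up" means, the sketch below compares an embedding lookup with plain row indexing into the weight matrix. The seed and the ids are arbitrary choices made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(10, 3)
ids = torch.LongTensor([1, 0, 1, 4, 5, 1])

looked_up = embed(ids)        # shape (6, 3)
indexed = embed.weight[ids]   # plain row indexing into the weight matrix

print(torch.equal(looked_up, indexed))  # True
print(looked_up.shape)                  # torch.Size([6, 3])
```

Repeated ids return the same row, so their gradients accumulate into that single row during training.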

## Matrix factorization model

In [23]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   
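
In equation form, the model above predicts a rating as the dot product of a user embedding and an item embedding, and the training loop later in this notebook minimizes the mean squared error over the observed ratings. The symbols $u_i$, $v_j$, $r_{ij}$, $K$ (embedding size) and $\mathcal{D}$ (set of observed user-item pairs) are notation introduced here for illustration.

```latex
\hat{r}_{ij} = u_i \cdot v_j = \sum_{k=1}^{K} u_{ik}\, v_{jk},
\qquad
\mathcal{L} = \frac{1}{|\mathcal{D}|} \sum_{(i,j) \in \mathcal{D}} \left( \hat{r}_{ij} - r_{ij} \right)^2
```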

## Debugging MF model

In [24]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [25]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [26]:
U = user_emb(users)
V = item_emb(items)

In [27]:
U

tensor([[-2.4946, -0.7714, -0.8066],
        [-2.4946, -0.7714, -0.8066],
        [ 0.0052, -1.6827, -0.5875],
        [ 0.0052, -1.6827, -0.5875],
        [-1.3836, -0.5361, -0.8511],
        [-1.3836, -0.5361, -0.8511],
        [-0.5880, -0.0892,  0.5410],
        [-0.5880, -0.0892,  0.5410],
        [-2.1033,  0.1085,  0.6517],
        [-2.1033,  0.1085,  0.6517],
        [-0.9263, -0.1871, -0.7512],
        [-0.4121, -1.5100, -0.2265],
        [-0.4121, -1.5100, -0.2265]])

In [28]:
# element wise multiplication
U*V 

tensor([[ 0.2949,  0.9180, -0.0335],
        [-1.6675, -0.2541,  0.8226],
        [ 0.0035, -0.5542,  0.5992],
        [ 0.0088,  0.0916,  1.0724],
        [ 0.1636,  0.6379, -0.0354],
        [-0.9249, -0.1766,  0.8681],
        [ 0.0695,  0.1062,  0.0225],
        [-0.7514,  0.0763, -0.8405],
        [ 0.2486, -0.1292,  0.0271],
        [-2.6878, -0.0928, -1.0125],
        [-1.1837,  0.1599,  1.1672],
        [-0.2755, -0.4973,  0.2310],
        [-0.5266,  1.2911,  0.3519]])

In [29]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([ 1.1793, -1.0990,  0.0485,  1.1729,  0.7661, -0.2334,  0.1982,
        -1.5156,  0.1466, -3.7931,  0.1434, -0.5418,  1.1164])
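
The row-wise dot product can be cross-checked against two equivalent formulations. This is a small sketch on random tensors, not on the notebook's data; the shapes and the seed are arbitrary.

```python
import torch

torch.manual_seed(0)
U = torch.randn(5, 3)   # 5 (user, item) pairs, embedding size 3
V = torch.randn(5, 3)

rowwise = (U * V).sum(1)                      # the approach used above
via_einsum = torch.einsum("ij,ij->i", U, V)   # same row-wise dot product
via_matmul = torch.diagonal(U @ V.t())        # diagonal of the full product

print(torch.allclose(rowwise, via_einsum, atol=1e-6))
print(torch.allclose(rowwise, via_matmul, atol=1e-6))
```

The matrix-product version computes every user-item combination and keeps only the diagonal, so it wastes memory and time on large batches; the element-wise version avoids that.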

## Training MF model

In [30]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [32]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [33]:
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item()) # used to be loss.data[0]
    test_loss(model, unsqueeze)

In [37]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
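
The shape matters because the neural network model later in this notebook returns predictions of shape `(N, 1)`. If the targets keep shape `(N,)`, the subtraction inside the loss broadcasts to an `N x N` matrix and the loss is silently wrong. A minimal sketch with made-up values:

```python
import torch

y_hat = torch.tensor([[3.0], [4.0], [5.0]])   # shape (3, 1), like the output of a final nn.Linear
ratings = torch.tensor([3.0, 4.0, 5.0])       # shape (3,)

# mismatched shapes broadcast to a 3 x 3 matrix of differences
diff = y_hat - ratings
print(diff.shape)                             # torch.Size([3, 3])
bad = (diff ** 2).mean().item()
print(bad)                                    # about 1.33, although the predictions are perfect

# after unsqueeze(1) both tensors have shape (3, 1) and are compared element by element
good = ((y_hat - ratings.unsqueeze(1)) ** 2).mean().item()
print(good)                                   # 0.0
```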

In [38]:
def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    # no gradients are needed to evaluate on the validation set
    with torch.no_grad():
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())

In [39]:
train_epocs(model, epochs=10, lr=0.1)

13.232446670532227
5.1240081787109375
2.3716681003570557
3.452590227127075
0.9085177779197693
1.8066924810409546
2.7481913566589355
2.2786033153533936
1.1577228307724
0.9235450029373169
test loss 1.949 


In [40]:
train_epocs(model, epochs=15, lr=0.01)

1.7041014432907104
1.0517804622650146
0.7496703863143921
0.6944168210029602
0.7590113878250122
0.8394144773483276
0.8818832039833069
0.8758020401000977
0.8340171575546265
0.7773081064224243
0.7250978350639343
0.6902596950531006
0.6768329739570618
0.6804254651069641
0.6914937496185303
test loss 0.894 


In [41]:
train_epocs(model, epochs=15, lr=0.01)

0.7001344561576843
0.6618912816047668
0.6680132746696472
0.6450618505477905
0.637506902217865
0.6443958878517151
0.6401748657226562
0.6250552535057068
0.6138514876365662
0.6125682592391968
0.6133320927619934
0.6076730489730835
0.596286952495575
0.5853856205940247
0.5785586833953857
test loss 0.823 


## MF with bias

In [42]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        # squeeze(1) drops only the last dimension, so a batch with a single row keeps shape (1,)
        b_u = self.user_bias(u).squeeze(1)
        b_v = self.item_bias(v).squeeze(1)
        return (U*V).sum(1) +  b_u  + b_v
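
Written as an equation, the forward pass above adds one learned scalar per user and one per item to the dot product. The symbols $u_i$, $v_j$, $b_i$ and $c_j$ are notation introduced here for illustration.

```latex
\hat{r}_{ij} = u_i \cdot v_j + b_i + c_j
```

The bias $b_i$ can absorb the tendency of a user to rate high or low overall, and $c_j$ the overall popularity of a movie, leaving the embeddings to model the interaction.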

In [43]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [44]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

13.231574058532715
4.370060443878174
3.4933559894561768
2.466240167617798
0.7887011170387268
1.8164491653442383
2.523233413696289
2.1421215534210205
1.2755482196807861
0.9039565324783325
test loss 1.536 


In [45]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2822842597961426
0.8576214909553528
0.6944698095321655
0.6960582137107849
0.7555546164512634
0.800784707069397
0.8071447014808655
0.7803676724433899
0.73752760887146
0.6959859132766724
test loss 0.824 


In [46]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6678686141967773
0.659873366355896
0.653201699256897
0.6477892994880676
0.6435018181800842
0.6401415467262268
0.6375073790550232
0.6354086995124817
0.6336813569068909
0.6322091817855835
test loss 0.810 


In [47]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6308879256248474
0.6289032697677612
0.6273546814918518
0.6259253025054932
0.6244949102401733
0.623039186000824
0.6215739846229553
0.6200994253158569
0.6186307668685913
0.6171799302101135
test loss 0.810 


Note that these models are sensitive to weight initialization, the optimization algorithm, and regularization.
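
Because of this sensitivity, it can help to fix the random seed when comparing settings, so that differences come from the hyperparameters and not from the initialization. A minimal sketch; `make_embedding` is a hypothetical helper and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

def make_embedding(seed):
    """Hypothetical helper: build a small embedding with a fixed seed."""
    torch.manual_seed(seed)              # fixes the random initialization
    emb = nn.Embedding(4, 3)
    emb.weight.data.uniform_(0, 0.05)    # same init range as the MF models above
    return emb.weight.data.clone()

same = torch.equal(make_embedding(0), make_embedding(0))
different = torch.equal(make_embedding(0), make_embedding(1))
print(same, different)
```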

## Neural Network Model

In [49]:
# Note that there is no dot product between the embeddings here, so the user and item embeddings could have different sizes.
# We could probably get better results by tuning the regularization further.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = F.relu(torch.cat([U, V], dim=1))
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
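
Since the embeddings are concatenated and not multiplied, their sizes do not have to match. The sketch below is a hypothetical variant, `CollabFNetAsym`, with separate embedding sizes; the sizes are arbitrary, and dropout and the ReLU on the embeddings are left out to keep it short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabFNetAsym(nn.Module):
    """Hypothetical variant: user and item embeddings of different sizes."""
    def __init__(self, num_users, num_items, user_emb_size=50, item_emb_size=20, n_hidden=10):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, user_emb_size)
        self.item_emb = nn.Embedding(num_items, item_emb_size)
        # the first linear layer takes the concatenated embeddings
        self.lin1 = nn.Linear(user_emb_size + item_emb_size, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)

    def forward(self, u, v):
        x = torch.cat([self.user_emb(u), self.item_emb(v)], dim=1)
        x = F.relu(self.lin1(x))
        return self.lin2(x)

model_asym = CollabFNetAsym(num_users=5, num_items=4)
u = torch.LongTensor([0, 1, 4])
v = torch.LongTensor([3, 0, 2])
out = model_asym(u, v)
print(out.shape)  # torch.Size([3, 1])
```

The output has shape `(N, 1)`, which is why the ratings are unsqueezed when training this kind of model.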

In [50]:
model = CollabFNet(num_users, num_items, emb_size=100) #.cuda()

In [51]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-5, unsqueeze=True) 

14.313246726989746
9.313093185424805
4.695026397705078
1.8279616832733154
1.5242719650268555
3.1435651779174805
4.060601234436035
3.6443469524383545
2.6306838989257812
1.7230418920516968
1.2659281492233276
1.251680612564087
1.4934701919555664
1.7863976955413818
1.988066554069519
2.037182569503784
1.93730890750885
1.72995924949646
1.480161428451538
1.253540277481079
test loss 1.100 


In [93]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6, unsqueeze=True)

1.5990087985992432
1.0634599924087524
1.3196831941604614
1.271923303604126
1.0663458108901978
0.9884128570556641
1.050430417060852
1.0983532667160034
1.0568546056747437
0.9677290916442871
0.904969334602356
0.9057765603065491
0.940522313117981
0.9427617788314819
0.9031198620796204
0.8513400554656982
0.8315029144287109
0.8408262729644775
0.8549620509147644
0.8462169170379639
test loss 0.858 


In [94]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-6, unsqueeze=True)

0.8195101618766785
0.8015755414962769
0.7948648929595947
0.7984756827354431
0.7993371486663818
0.7992619276046753
0.7974718809127808
0.793518602848053
0.7894296050071716
0.7882153987884521
test loss 0.829 


In [95]:
train_epocs(model, epochs=20, lr=0.001, wd=1e-6, unsqueeze=True)

0.7880702614784241
0.7873403429985046
0.7867722511291504
0.7839091420173645
0.7838999032974243
0.7837383151054382
0.782160222530365
0.7795288562774658
0.7787625193595886
0.7775472402572632
0.7760753631591797
0.7758849859237671
0.7750654816627502
0.7734890580177307
0.7722294330596924
0.7731775045394897
0.7680322527885437
0.767857551574707
0.7685062289237976
0.7683979272842407
test loss 0.817 


## TODO
* use t-SNE to visualize embeddings
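
One possible starting point, assuming scikit-learn is installed: project the embedding matrix to two dimensions with `sklearn.manifold.TSNE`. The random matrix below is a stand-in for the trained item embeddings, and the perplexity value is an arbitrary choice.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for model.item_emb.weight.detach().cpu().numpy()
item_vectors = rng.normal(size=(50, 100))

# perplexity must be smaller than the number of rows
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coords = tsne.fit_transform(item_vectors)
print(coords.shape)  # (50, 2)
```

The two columns of `coords` can then be drawn as a scatter plot and labelled with movie titles from `movies.csv`.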

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try the transformation it applies https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a dataloader if your data is larger. Can you construct one? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip
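
For the dataloader exercise, one possible sketch uses `TensorDataset` and `DataLoader` from `torch.utils.data`. The data frame is toy data in the encoded format used above, and the batch size is an arbitrary choice.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy ratings, already encoded with contiguous ids
df_toy = pd.DataFrame({
    "userId": [0, 0, 1, 1, 2],
    "movieId": [0, 1, 1, 2, 0],
    "rating": [4.0, 5.0, 5.0, 3.0, 4.0],
})

dataset = TensorDataset(
    torch.LongTensor(df_toy.userId.values),
    torch.LongTensor(df_toy.movieId.values),
    torch.FloatTensor(df_toy.rating.values),
)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

batch_sizes = []
for users, items, ratings in loader:
    # each batch could be passed to model(users, items) in a training step
    batch_sizes.append(len(ratings))
print(batch_sizes)  # [2, 2, 1]
```

With mini-batches the optimizer takes several steps per epoch, so learning rates that worked for full-batch training may need to be revisited.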

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)