# Collaborative Filtering with Neural Networks

In this notebook we will write a matrix factorization model in PyTorch to solve a recommendation problem. Then we will write a more general neural model for the same problem.

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. See https://grouplens.org/datasets/movielens/ for details. To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
#PATH = Path("/data2/yinterian/ml-latest-small/")
PATH = Path("../data/ml-latest-small/")
list(PATH.iterdir())

[PosixPath('../data/ml-latest-small/links.csv'),
 PosixPath('../data/ml-latest-small/movies.csv'),
 PosixPath('../data/ml-latest-small/ratings.csv'),
 PosixPath('../data/ml-latest-small/README.txt'),
 PosixPath('../data/ml-latest-small/tags.csv')]

In [3]:
! head ../data/ml-latest-small/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [4]:
data = pd.read_csv(PATH/"ratings.csv")

In [5]:
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Encoding data
This is similar to what you did for hw1 in ML-2. We encode the data so that users and movies have contiguous ids. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.
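
As a minimal sketch of what this encoding does, the toy example below maps sparse raw ids to contiguous indices and marks ids that never appear in training with -1. The toy ids and the name `id2idx` are made up for illustration; the notebook's own helpers follow in the next cells.

```python
import numpy as np
import pandas as pd

# Toy columns: raw ids are sparse, and the validation column has an unseen id (99).
train_ids = pd.Series([31, 1029, 31, 1061])
val_ids = pd.Series([1061, 99, 31])

# Build the mapping from training ids only, in order of first appearance.
id2idx = {raw: i for i, raw in enumerate(train_ids.unique())}

train_encoded = np.array([id2idx[x] for x in train_ids])
# Ids missing from the training mapping get -1 so they can be filtered out later.
val_encoded = np.array([id2idx.get(x, -1) for x in val_ids])

print(train_encoded)  # [0 1 0 2]
print(val_encoded)    # [ 2 -1  0]
```

Rows encoded as -1 have no learned embedding, which is why they are dropped from the validation set.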

In [6]:
# split train and validation before encoding
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [7]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids. 
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [8]:
def encode_data(df, train=None):
    """ Encodes rating data with contiguous user and movie ids. 
    If train is provided, encodes df with the same encoding as train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df

In [10]:
# to check my new implementation
# positional 80/20 split; integer positions avoid slicing the index with a float
n_train = int(0.8 * len(data))
df_t = data.iloc[:n_train].copy()
df_v = data.iloc[n_train:].copy()
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,2.5,1260759144
1,0,1,3.0,1260759179
2,0,2,3.0,1260759182
3,0,3,2.0,1260759185
4,0,4,4.0,1260759205
5,0,5,2.0,1260759151
6,0,6,2.0,1260759187
7,0,7,2.0,1260759148
8,0,8,3.5,1260759125
9,0,9,2.0,1260759131


In [11]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [12]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [13]:
# an Embedding module holding 10 embeddings (for users or items), each of size 3
# embedding will be initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[ 0.4607,  0.4401, -0.2978],
        [-0.1626,  0.0059, -1.2982],
        [-1.7284,  0.8867,  0.1278],
        [-0.8376,  0.5045, -0.6134],
        [-0.0422,  2.3787, -0.3094],
        [-0.6806,  0.0942, -2.5577],
        [-0.9507, -2.3724,  1.5018],
        [-0.1955,  2.2122, -0.3637],
        [-0.8683,  0.7169, -0.7770],
        [-0.2438,  0.9544, -0.0693]])

In [14]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same?
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[-0.1626,  0.0059, -1.2982],
         [ 0.4607,  0.4401, -0.2978],
         [-0.1626,  0.0059, -1.2982],
         [-0.0422,  2.3787, -0.3094],
         [-0.6806,  0.0942, -2.5577],
         [-0.1626,  0.0059, -1.2982]]])
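
The lookup above is the same as indexing rows of the weight matrix. The sketch below checks this on a fresh random table; the seed and the variable names `looked_up` and `indexed` are illustrative choices, so the numbers will not match the output above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # fixed seed so the random table is reproducible
embed = nn.Embedding(10, 3)          # lookup table with 10 rows of size 3
ids = torch.LongTensor([[1, 0, 1, 4, 5, 1]])

looked_up = embed(ids)               # shape (1, 6, 3): one vector per id
indexed = embed.weight[ids]          # plain row indexing into the weight matrix

print(looked_up.shape)                                 # torch.Size([1, 6, 3])
print(torch.equal(looked_up, indexed))                 # True
# Positions 0, 2 and 5 all hold id 1, so they share the same vector.
print(torch.equal(looked_up[0, 0], looked_up[0, 2]))   # True
```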

## Matrix factorization model

In [15]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   
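
In equation form, the forward pass of `MF` predicts a rating as the dot product of a user vector and an item vector. The symbols below ($\mathbf{p}_u$ for the user embedding, $\mathbf{q}_i$ for the item embedding, $d$ for `emb_size`) are notation introduced here, not taken from the notebook:

$$
\hat{r}_{ui} = \mathbf{p}_u \cdot \mathbf{q}_i = \sum_{k=1}^{d} p_{uk}\, q_{ik}
$$

Training with `F.mse_loss` then minimizes the mean squared error over the set $\mathcal{D}$ of observed (user, item) pairs:

$$
L = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left( r_{ui} - \hat{r}_{ui} \right)^2
$$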

## Debugging MF model

In [16]:
df_t_e

Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,2.5,1260759144
1,0,1,3.0,1260759179
2,0,2,3.0,1260759182
3,0,3,2.0,1260759185
4,0,4,4.0,1260759205
5,0,5,2.0,1260759151
6,0,6,2.0,1260759187
7,0,7,2.0,1260759148
8,0,8,3.5,1260759125
9,0,9,2.0,1260759131


In [28]:
num_users = len(df_t_e.userId.unique())
num_items = len(df_t_e.movieId.unique())
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [30]:
U = user_emb(users)
V = item_emb(items)

In [31]:
U

tensor([[ 1.9982,  0.5643,  0.8779],
        [ 1.9982,  0.5643,  0.8779],
        [ 1.9982,  0.5643,  0.8779],
        ...,
        [ 0.3930, -0.7139,  0.9420],
        [ 0.3930, -0.7139,  0.9420],
        [ 0.3930, -0.7139,  0.9420]])

In [32]:
# element-wise multiplication
U*V 

tensor([[-3.5188e-01,  2.5098e-01,  2.2399e-01],
        [-3.0469e+00, -4.1524e-01,  1.4471e+00],
        [ 9.2608e-01, -1.7749e-02,  5.6746e-01],
        ...,
        [ 6.4791e-01, -1.3206e+00, -2.8638e-01],
        [-4.9191e-01, -6.3665e-01, -1.8377e+00],
        [ 1.1919e-01, -3.0660e-01,  1.1518e+00]])

In [33]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([ 1.2308e-01, -2.0151e+00,  1.4758e+00,  ..., -9.5903e-01,
        -2.9663e+00,  9.6438e-01])
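
To see that `(U*V).sum(1)` really is one dot product per row, the sketch below compares it on small random tensors (made up for illustration) with the diagonal of the full product `U @ V.t()` and with an `einsum` formulation.

```python
import torch

torch.manual_seed(0)
U = torch.randn(4, 3)   # 4 user vectors (one per rating) of size 3
V = torch.randn(4, 3)   # 4 item vectors aligned row by row with U

row_dots = (U * V).sum(1)   # shape (4,): dot product of U[i] with V[i]
full = U @ V.t()            # shape (4, 4): every user row against every item row

same_as_diag = torch.allclose(row_dots, torch.diagonal(full), atol=1e-6)
same_as_einsum = torch.allclose(row_dots, torch.einsum("ij,ij->i", U, V), atol=1e-6)
print(row_dots.shape)                  # torch.Size([4])
print(same_as_diag, same_as_einsum)    # True True
```

The element-wise version only computes the pairs that were actually rated, whereas the full matrix product computes every user-item combination.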

## Training MF model

In [34]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [35]:
model = MF(num_users, num_items, emb_size=100)  # append .cuda() if you have a GPU

In [36]:
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item()) # used to be loss.data[0]
    test_loss(model, unsqueeze)

In [37]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
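
The `unsqueeze` flag matters for models whose output has shape `(N, 1)`, such as a network ending in `nn.Linear(n_hidden, 1)`. The sketch below, with toy values made up for illustration, shows what can go wrong otherwise: when the shapes differ, `F.mse_loss` broadcasts both tensors to `(N, N)` (and emits a warning), so the loss compares every prediction with every rating.

```python
import warnings
import torch
import torch.nn.functional as F

y_hat = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like the output of a final nn.Linear
ratings = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")           # silence the size-mismatch warning for this demo
    mismatched = F.mse_loss(y_hat, ratings)   # broadcasts to (3, 3) before averaging

matched = F.mse_loss(y_hat, ratings.unsqueeze(1))  # both tensors are (3, 1)

print(mismatched.item())  # about 1.333, even though every prediction is exact
print(matched.item())     # 0.0
```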

In [38]:
def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    # gradients are not needed for evaluation
    with torch.no_grad():
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())

In [39]:
train_epocs(model, epochs=10, lr=0.1)

13.232620239257812
5.12353515625
2.366079092025757
3.451049566268921
0.9083771109580994
1.806140661239624
2.7457008361816406
2.2743043899536133
1.1538151502609253
0.9226882457733154
test loss 1.948 


In [40]:
train_epocs(model, epochs=15, lr=0.01)

1.704465627670288
1.0514826774597168
0.7490484118461609
0.6938858032226562
0.7588717341423035
0.8396615982055664
0.8823750019073486
0.8763357996940613
0.8344537019729614
0.7775422930717468
0.7251110672950745
0.6901363134384155
0.6766709089279175
0.6802999973297119
0.6914540529251099
test loss 0.894 


In [41]:
train_epocs(model, epochs=15, lr=0.01)

0.7001524567604065
0.6620396971702576
0.6681154370307922
0.6450682282447815
0.637492835521698
0.6444087624549866
0.6401302814483643
0.624942421913147
0.6137296557426453
0.6124340891838074
0.6131712794303894
0.607420802116394
0.5959214568138123
0.5849270820617676
0.5780478119850159
test loss 0.822 


## MF with bias

In [42]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        b_u = self.user_bias(u).squeeze()
        b_v = self.item_bias(v).squeeze()
        return (U*V).sum(1) +  b_u  + b_v
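
Written as a formula, `MF_bias` adds one learned scalar per user and one per item to the dot product. The symbols are notation introduced here, not taken from the notebook: $\mathbf{p}_u$ and $\mathbf{q}_i$ are the user and item embeddings, $b_u$ is the user bias and $c_i$ is the item bias.

$$
\hat{r}_{ui} = \mathbf{p}_u \cdot \mathbf{q}_i + b_u + c_i
$$

Roughly speaking, the bias terms can absorb effects such as a user who rates everything highly or a movie that is generally well liked, leaving the dot product to model the user-item interaction.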

In [43]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [44]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

13.232122421264648
4.367374420166016
3.5029284954071045
2.4658331871032715
0.7884842157363892
1.8176424503326416
2.5257182121276855
2.1455800533294678
1.2787482738494873
0.9046235680580139
test loss 1.537 


In [45]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2809542417526245
0.8578965067863464
0.6950302720069885
0.695842981338501
0.7546069622039795
0.799615204334259
0.8063098192214966
0.7802164554595947
0.7380183339118958
0.6968126893043518
test loss 0.826 


In [46]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6685947775840759
0.6605850458145142
0.6538867354393005
0.6484557390213013
0.6441375017166138
0.6407522559165955
0.6380965709686279
0.6359747648239136
0.6342198848724365
0.6327106356620789
test loss 0.811 


In [47]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6313657164573669
0.629359245300293
0.6277955770492554
0.6263596415519714
0.6249339580535889
0.6234814524650574
0.622002899646759
0.6205369830131531
0.6190817952156067
0.6176358461380005
test loss 0.811 


Note that these models are sensitive to weight initialization, the choice of optimization algorithm, and regularization.
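
As a rough illustration of the initialization point, the sketch below estimates the scale of the predictions produced by the `uniform_(0, 0.05)` initialization with `emb_size=100`. Each factor has mean 0.025, so a dot product over 100 dimensions starts near 100 × 0.025 × 0.025 = 0.0625, far below typical ratings. This may help explain the large first-epoch loss, although that reading is an assumption and not something verified here. The table size of 1000 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_size = 100
user_emb = nn.Embedding(1000, emb_size)   # 1000 rows is an arbitrary toy size
item_emb = nn.Embedding(1000, emb_size)
user_emb.weight.data.uniform_(0, 0.05)    # same initialization as in the MF models
item_emb.weight.data.uniform_(0, 0.05)

ids = torch.arange(1000)
with torch.no_grad():
    preds = (user_emb(ids) * item_emb(ids)).sum(1)   # one initial prediction per id pair

expected = emb_size * 0.025 * 0.025   # expected dot product, about 0.0625
print(expected)
print(preds.mean().item())            # close to the expected value
```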

## Neural Network Model

In [48]:
# Note that the embeddings are concatenated here instead of combined with a dot product,
# so the user and item embeddings could have different sizes.
# Better results may be possible by tuning the regularization further.
    
class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = F.relu(torch.cat([U, V], dim=1))
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
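
Because the two embeddings are concatenated rather than multiplied, they do not need to share a size. The sketch below is a hypothetical variant (the class name `TwoSizeCollabNet` and the sizes 8 and 4 are made up, and dropout and the ReLU on the embeddings are left out for brevity) that traces the tensor shapes through the network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoSizeCollabNet(nn.Module):
    """Hypothetical variant of CollabFNet with separate embedding sizes."""
    def __init__(self, num_users, num_items, user_size=8, item_size=4, n_hidden=10):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, user_size)
        self.item_emb = nn.Embedding(num_items, item_size)
        # the first linear layer takes the concatenated vector as input
        self.lin1 = nn.Linear(user_size + item_size, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)

    def forward(self, u, v):
        x = torch.cat([self.user_emb(u), self.item_emb(v)], dim=1)  # (batch, 8 + 4)
        x = F.relu(self.lin1(x))                                     # (batch, n_hidden)
        return self.lin2(x)                                          # (batch, 1)

model = TwoSizeCollabNet(num_users=5, num_items=7)
u = torch.LongTensor([0, 1, 4])
v = torch.LongTensor([6, 2, 3])
out = model(u, v)
print(out.shape)  # torch.Size([3, 1])
```

The `(batch, 1)` output shape is the reason this kind of model is trained with `unsqueeze=True`.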

In [49]:
model = CollabFNet(num_users, num_items, emb_size=100) #.cuda()

In [50]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-5, unsqueeze=True) 

16.525218963623047
12.997052192687988
10.415555953979492
7.905101299285889
5.584404468536377
3.608074188232422
2.1496787071228027
1.3737196922302246
1.3723626136779785
1.988907814025879
2.7540407180786133
3.1817948818206787
3.154493570327759
2.7792327404022217
2.252976894378662
1.7543985843658447
1.3854148387908936
1.181867241859436
1.1361830234527588
1.1938914060592651
test loss 1.298 


In [51]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6, unsqueeze=True)

1.3106966018676758
1.1218172311782837
1.2025728225708008
1.0749306678771973
1.0072535276412964
1.030681848526001
1.0328874588012695
0.9771342277526855
0.9197388887405396
0.9066900014877319
0.9182590842247009
0.9045007824897766
0.8664149641990662
0.83624267578125
0.8346632122993469
0.8377017974853516
0.827072262763977
0.8043923377990723
0.7875887155532837
0.7855656147003174
test loss 0.834 


In [52]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-6, unsqueeze=True)

0.7866000533103943
0.7743768692016602
0.7719104290008545
0.7727627158164978
0.7748465538024902
0.7732114195823669
0.769763708114624
0.7679502964019775
0.7653291821479797
0.7663291692733765
test loss 0.816 


In [53]:
train_epocs(model, epochs=20, lr=0.001, wd=1e-6, unsqueeze=True)

0.7628862261772156
0.762824535369873
0.7636146545410156
0.7605041265487671
0.7590332627296448
0.7604009509086609
0.7590035200119019
0.7582464814186096
0.7596232295036316
0.7557948231697083
0.7566511631011963
0.7551475167274475
0.7540274858474731
0.7542350888252258
0.7531538605690002
0.7521809935569763
0.7521690726280212
0.7507472634315491
0.7515261173248291
0.7497649788856506
test loss 0.807 


## TODO
* use t-SNE to visualize embeddings

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try the transformation it uses: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a dataloader if your data is larger. Can you construct one? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)

In [75]:
import torch
# note: this import rebinds the name `data`, which held the ratings DataFrame above
from torch.utils import data

class Dataset(data.Dataset):
    '''Characterizes a dataset for PyTorch'''
    def __init__(self, df):
        '''Initialization'''
        self.df = df

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.df)

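    def __getitem__(self, index):
        'Generates one sample of data'
        # Select the sample by position: after the train/val split the index
        # labels are no longer contiguous, so .loc[index] can raise a KeyError
        row = self.df.iloc[index]
        X = np.array(row[['userId', 'movieId', 'timestamp']])
        y = row['rating']

        return X, y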

In [76]:
d = Dataset(train)

In [77]:
d[0]

(array([1.00000000e+00, 3.10000000e+01, 1.26075914e+09]), 2.5)

In [79]:
dl = data.DataLoader(d, batch_size=16)

In [80]:
dl

<torch.utils.data.dataloader.DataLoader at 0x10ebb5b38>
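
A DataLoader becomes useful once it drives a mini-batch training loop. The sketch below is one possible way to do that, under several assumptions: the data is synthetic, `TinyMF` is a made-up stand-in for the `MF` model, and `TensorDataset` is used in place of a custom dataset class to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
num_users, num_items, n = 20, 30, 256
# Synthetic stand-ins for the encoded userId, movieId and rating columns.
users = torch.randint(0, num_users, (n,))
items = torch.randint(0, num_items, (n,))
ratings = torch.randint(1, 11, (n,)).float() / 2   # values from 0.5 to 5.0

class TinyMF(nn.Module):
    """Small dot-product model used only for this illustration."""
    def __init__(self, num_users, num_items, emb_size=8):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)

    def forward(self, u, v):
        return (self.user_emb(u) * self.item_emb(v)).sum(1)

model = TinyMF(num_users, num_items)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(users, items, ratings), batch_size=16, shuffle=True)

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for u, v, r in loader:                 # one parameter update per mini-batch
        loss = F.mse_loss(model(u, v), r)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item() * len(r)      # weight by batch size for the epoch average
    epoch_losses.append(total / n)
    print(epoch, epoch_losses[-1])
```

Unlike `train_epocs`, which takes one optimizer step per pass over the full training set, this loop takes one step per batch of 16 ratings.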