# Collaborative Filtering with Neural Networks

In this notebook we will write a matrix factorization model in PyTorch to solve a recommendation problem. Then we will write a more general neural model for the same problem.

The MovieLens dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies (https://grouplens.org/datasets/movielens/). To get the data:

`wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip`

## MovieLens dataset

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
#PATH = Path("/data2/yinterian/ml-latest-small/")
PATH = Path("./data/")
list(PATH.iterdir())

[PosixPath('data/movie_titles.csv'),
 PosixPath('data/combined_data_3.txt'),
 PosixPath('data/combined_data_4.txt'),
 PosixPath('data/combined_data_1.txt'),
 PosixPath('data/probe.txt'),
 PosixPath('data/combined_data_2.txt'),
 PosixPath('data/README'),
 PosixPath('data/qualifying.txt')]

In [3]:
! head -n 5 ./data/combined_data_1.txt

1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26


In [4]:
df1 = pd.read_csv(PATH/"combined_data_1.txt", names = ['userId','rating','date'], index_col = None)
df2 = pd.read_csv(PATH/"combined_data_2.txt", names = ['userId','rating','date'], index_col = None)
df3 = pd.read_csv(PATH/"combined_data_3.txt", names = ['userId','rating','date'], index_col = None)
df4 = pd.read_csv(PATH/"combined_data_4.txt", names = ['userId','rating','date'], index_col = None)

In [5]:
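def append_frames(*args):
    # DataFrame.append returned a new frame rather than modifying df in place
    # (and was removed in pandas 2.0), so the original loop discarded its
    # result; pd.concat stacks all frames in one call
    return pd.concat(list(args))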

In [6]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the four parts
df = pd.concat([df1, df2, df3, df4])
df1 = df2 = df3 = df4 = None #deallocation

In [9]:
df.shape

(100498277, 3)

In [11]:
movie_list = []
movie_id = None
for row in df.itertuples(index=False):
    # header rows such as "1:" carry the movie id and have no rating
    if pd.isnull(row.rating):
        movie_id = row.userId
    movie_list.append(movie_id)

KeyboardInterrupt: 

In [None]:
movie_list

### Encoding data
This is similar to what you did for your hw1 in ML-2. We encode the data to have contiguous ids for users and movies. You can think of this as a categorical encoding of our two categorical variables, userId and movieId.
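
As a small illustration of what "contiguous ids" means, the sketch below applies `pd.factorize` to a made-up column of raw ids. The values are hypothetical and chosen only for this example, and `pd.factorize` is an alternative to the hand-written encoder used in this notebook, not the method the notebook relies on.

```python
import pandas as pd

# Hypothetical raw ids; not taken from the dataset
raw_ids = pd.Series([1488844, 822109, 1488844, 30878])

# factorize assigns 0, 1, 2, ... in order of first appearance
codes, uniques = pd.factorize(raw_ids)
print(codes)         # one contiguous code per row
print(len(uniques))  # number of distinct raw ids
```

Repeated raw ids receive the same code, so the codes can be used directly as row indices into an embedding matrix.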

In [22]:
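# split train and validation before encoding
# NOTE: `data` is not defined in the cells above; it is assumed here to be the
# ratings DataFrame with userId, movieId and rating columns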
np.random.seed(3)
msk = np.random.rand(len(data)) < 0.8
train = data[msk].copy()
val = data[~msk].copy()

In [23]:
# here is a handy function modified from fast.ai
def proc_col(col, train_col=None):
    """Encodes a pandas column with contiguous ids.
    """
    if train_col is not None:
        uniq = train_col.unique()
    else:
        uniq = col.unique()
    name2idx = {o:i for i,o in enumerate(uniq)}
    return name2idx, np.array([name2idx.get(x, -1) for x in col]), len(uniq)

In [24]:
def encode_data(df, train=None):
    """ Encodes rating data with contiguous user and movie ids.
    If train is provided, encodes df with the same encoding as train.
    """
    df = df.copy()
    for col_name in ["userId", "movieId"]:
        train_col = None
        if train is not None:
            train_col = train[col_name]
        _,col,_ = proc_col(df[col_name], train_col)
        df[col_name] = col
        df = df[df[col_name] >= 0]
    return df
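
The point of passing `train` is that validation rows whose user or movie never appears in the training set cannot be looked up in an embedding table, so they are dropped. Below is a minimal, self-contained sketch of the same idea on toy data. The function name `encode_like_train` and the two small frames are hypothetical, and the sketch uses `Series.map` instead of the `-1` sentinel used above.

```python
import pandas as pd

# Toy frames, made up for illustration
train = pd.DataFrame({"userId": [10, 10, 42], "movieId": [7, 9, 7], "rating": [4, 5, 3]})
val = pd.DataFrame({"userId": [42, 99], "movieId": [9, 7], "rating": [2, 5]})

def encode_like_train(df, train):
    """Encode df with the id mapping learned from train; drop unseen ids."""
    out = df.copy()
    for col in ["userId", "movieId"]:
        mapping = {raw: idx for idx, raw in enumerate(train[col].unique())}
        # ids missing from the mapping become NaN
        out[col] = out[col].map(mapping)
    out = out.dropna(subset=["userId", "movieId"])
    return out.astype({"userId": "int64", "movieId": "int64"})

val_enc = encode_like_train(val, train)
print(val_enc)
```

User 99 does not occur in the toy training frame, so only the first validation row survives.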

In [25]:
# to check my new implementation
df_t = pd.read_csv(PATH/"tiny_training2.csv")
df_v = pd.read_csv(PATH/"tiny_val2.csv")
df_t_e = encode_data(df_t)
df_v_e = encode_data(df_v, df_t)
df_v_e
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [26]:
df_train = encode_data(train)
df_val = encode_data(val, train)

## Embedding layer

In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [28]:
# an Embedding module holding 10 embeddings (for users or items), each of size 3
# the embeddings are initialized at random
embed = nn.Embedding(10, 3)
embed.weight

Parameter containing:
tensor([[ 0.5668,  1.1124, -0.1699],
        [ 1.1765,  0.5040,  0.9035],
        [ 1.4946,  1.5960,  0.8000],
        [ 0.0909,  0.0676,  0.7423],
        [-1.2926, -0.8995, -1.8530],
        [ 2.1562, -0.4480, -0.5879],
        [-0.3005, -1.3247, -0.6917],
        [ 1.0012,  1.2667, -0.0332],
        [ 1.0857,  0.9327, -1.1834],
        [ 1.5228,  1.1021,  1.2582]])

In [29]:
# given a list of ids we can "look up" the embedding corresponding to each id
# can you see that some vectors are the same?
a = torch.LongTensor([[1,0,1,4,5,1]])
embed(a)

tensor([[[ 1.1765,  0.5040,  0.9035],
         [ 0.5668,  1.1124, -0.1699],
         [ 1.1765,  0.5040,  0.9035],
         [-1.2926, -0.8995, -1.8530],
         [ 2.1562, -0.4480, -0.5879],
         [ 1.1765,  0.5040,  0.9035]]])
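
One way to think about the lookup: it gives the same result as multiplying a one-hot encoding of the ids by the weight matrix, just without building the one-hot vectors. The following sketch checks this on a fresh, randomly initialized embedding (the variable names are chosen for this example only).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed = nn.Embedding(10, 3)
ids = torch.LongTensor([1, 0, 1, 4])

# Direct lookup: one row of embed.weight per id
looked_up = embed(ids)

# Same result via one-hot vectors times the weight matrix
one_hot = F.one_hot(ids, num_classes=10).float()
via_matmul = one_hot @ embed.weight

print(torch.allclose(looked_up, via_matmul))
```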

## Matrix factorization model

In [30]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        # initializing weights
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        
    def forward(self, u, v):
        u = self.user_emb(u)
        v = self.item_emb(v)
        return (u*v).sum(1)   

## Debugging MF model

In [31]:
df_t_e

Unnamed: 0,userId,movieId,rating
0,0,0,4
1,0,1,5
2,1,1,5
3,1,2,3
4,2,0,4
5,2,1,4
6,3,0,5
7,3,3,2
8,4,0,1
9,4,3,4


In [32]:
num_users = 7
num_items = 4
emb_size = 3

user_emb = nn.Embedding(num_users, emb_size)
item_emb = nn.Embedding(num_items, emb_size)
users = torch.LongTensor(df_t_e.userId.values)
items = torch.LongTensor(df_t_e.movieId.values)

In [33]:
U = user_emb(users)
V = item_emb(items)

In [34]:
U

tensor([[-2.1413,  1.7033,  1.6539],
        [-2.1413,  1.7033,  1.6539],
        [-0.1758, -0.7780,  0.2131],
        [-0.1758, -0.7780,  0.2131],
        [ 0.8613,  0.0918,  0.9533],
        [ 0.8613,  0.0918,  0.9533],
        [ 1.4302, -0.0488,  0.0058],
        [ 1.4302, -0.0488,  0.0058],
        [ 0.1648, -0.0850, -0.1325],
        [ 0.1648, -0.0850, -0.1325],
        [ 0.4948, -0.3853, -0.4136],
        [-0.0212,  0.2011,  0.0991],
        [-0.0212,  0.2011,  0.0991]])

In [35]:
# element wise multiplication
U*V 

tensor([[ 0.1917,  1.7381,  4.1773],
        [ 0.9848,  0.3952,  1.3052],
        [ 0.0809, -0.1805,  0.1682],
        [-0.0652,  0.8857,  0.2049],
        [-0.0771,  0.0937,  2.4077],
        [-0.3961,  0.0213,  0.7523],
        [-0.1280, -0.0498,  0.0147],
        [ 0.9522, -0.0419, -0.0011],
        [-0.0148, -0.0868, -0.3346],
        [ 0.1097, -0.0731,  0.0248],
        [ 0.3294, -0.3311,  0.0775],
        [ 0.0097,  0.0467,  0.0782],
        [-0.0141,  0.1729, -0.0186]])

In [36]:
# what we want is a dot product per row
(U*V).sum(1) 

tensor([ 6.1071,  2.6853,  0.0685,  1.0254,  2.4243,  0.3775, -0.1631,
         0.9092, -0.4362,  0.0615,  0.0758,  0.1346,  0.1402])
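
To double-check that `(U*V).sum(1)` really is a row-wise dot product, the sketch below compares it with two other formulations on random tensors (the shapes are arbitrary and chosen only for this example).

```python
import torch

torch.manual_seed(0)
U = torch.randn(5, 3)
V = torch.randn(5, 3)

# Element-wise product followed by a sum over the embedding dimension
row_dots = (U * V).sum(1)

# Same values: the diagonal of the full product U V^T
diag = torch.diag(U @ V.t())

# Same values again, written with einsum
ein = torch.einsum("ij,ij->i", U, V)

print(torch.allclose(row_dots, diag), torch.allclose(row_dots, ein))
```

The diagonal version computes every user-item pair and then throws most of them away, which is why the element-wise form is preferable when only matching pairs are needed.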

## Training MF model

In [37]:
num_users = len(df_train.userId.unique())
num_items = len(df_train.movieId.unique())
print(num_users, num_items) 

671 8442


In [38]:
model = MF(num_users, num_items, emb_size=100)  # if you have a GPU .cuda()

In [39]:
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr, weight_decay=wd)
    model.train()
    for i in range(epochs):
        users = torch.LongTensor(df_train.userId.values)  #.cuda()
        items = torch.LongTensor(df_train.movieId.values) #.cuda()
        ratings = torch.FloatTensor(df_train.rating.values)  #.cuda()
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items)
        loss = F.mse_loss(y_hat, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(loss.item()) # used to be loss.data[0]
    test_loss(model, unsqueeze)

In [40]:
# Here is what unsqueeze does
ratings = torch.FloatTensor(df_train.rating.values)
print(ratings.shape)
ratings = ratings.unsqueeze(1) #.cuda()
ratings.shape

torch.Size([79799])


torch.Size([79799, 1])
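
The shape matters because a model that ends in `nn.Linear(..., 1)` returns predictions of shape `(N, 1)`. If the targets have shape `(N,)`, `F.mse_loss` broadcasts the two tensors against each other (and emits a warning), so the loss is averaged over all N×N pairs instead of the N matching ones. A minimal sketch with made-up values:

```python
import warnings
import torch
import torch.nn.functional as F

y_hat = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like a Linear output
ratings = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

# Shapes match after unsqueeze: predictions are perfect, loss is zero
matched = F.mse_loss(y_hat, ratings.unsqueeze(1))

# Shapes differ: broadcasting compares every prediction with every rating
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mismatched = F.mse_loss(y_hat, ratings)

print(matched.item(), mismatched.item())
```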

In [41]:
def test_loss(model, unsqueeze=False):
    model.eval()
    users = torch.LongTensor(df_val.userId.values) # .cuda()
    items = torch.LongTensor(df_val.movieId.values) #.cuda()
    ratings = torch.FloatTensor(df_val.rating.values) #.cuda()
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    y_hat = model(users, items)
    loss = F.mse_loss(y_hat, ratings)
    print("test loss %.3f " % loss.item())

In [42]:
train_epocs(model, epochs=10, lr=0.1)

13.231671333312988
5.121272563934326
2.371032238006592
3.449235439300537
0.9082512259483337
1.8069171905517578
2.746382474899292
2.2749269008636475
1.1544880867004395
0.9236623048782349
test loss 1.951 


In [43]:
train_epocs(model, epochs=15, lr=0.01)

1.7053922414779663
1.0522794723510742
0.7495632171630859
0.6941799521446228
0.7591373324394226
0.8399957418441772
0.882774293422699
0.8768144249916077
0.834937334060669
0.7779412865638733
0.7253367900848389
0.690159261226654
0.6765652298927307
0.6802441477775574
0.6915913820266724
test loss 0.894 


In [44]:
train_epocs(model, epochs=15, lr=0.01)

0.7005525231361389
0.662117063999176
0.6683859825134277
0.6455367803573608
0.6378748416900635
0.6447293758392334
0.6405875086784363
0.625559389591217
0.6143425107002258
0.6130089163780212
0.6137982606887817
0.6082141399383545
0.5968793034553528
0.5859522223472595
0.5790725946426392
test loss 0.822 


## MF with bias

In [45]:
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.item_bias = nn.Embedding(num_items, 1)
        # init 
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        # squeeze(1) turns (N, 1) into (N,) and keeps a batch of size 1 intact
        b_u = self.user_bias(u).squeeze(1)
        b_v = self.item_bias(v).squeeze(1)
        return (U*V).sum(1) +  b_u  + b_v
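
Written out, the forward pass above computes the following for a user index $u$ and an item index $v$. The symbols $\hat{r}$, $p$, $q$ and $d$ are notation introduced here for illustration only:

$$\hat{r}_{uv} = \sum_{k=1}^{d} p_{uk}\, q_{vk} + b_u + b_v$$

Here $p_u$ is row $u$ of `user_emb`, $q_v$ is row $v$ of `item_emb`, $d$ is `emb_size`, and $b_u$, $b_v$ come from `user_bias` and `item_bias`. Dropping the two bias terms gives the plain `MF` model.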

In [46]:
model = MF_bias(num_users, num_items, emb_size=100) #.cuda()

In [47]:
train_epocs(model, epochs=10, lr=0.1, wd=1e-5)

13.241252899169922
4.386600971221924
3.456671714782715
2.4759907722473145
0.7868954539299011
1.8075655698776245
2.514157772064209
2.132805585861206
1.266887903213501
0.900151252746582
test loss 1.536 


In [48]:
train_epocs(model, epochs=10, lr=0.01, wd=1e-5)

1.2837954759597778
0.8581016063690186
0.6939929127693176
0.6953377723693848
0.7551701664924622
0.8009182810783386
0.8077378273010254
0.7811856269836426
0.7382373213768005
0.6962903738021851
test loss 0.825 


In [49]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6676116585731506
0.6595560908317566
0.6528341174125671
0.6473916172981262
0.6430826783180237
0.6397225856781006
0.6371058225631714
0.6350427865982056
0.6333566904067993
0.6319129467010498
test loss 0.811 


In [50]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-5)

0.6306264996528625
0.6286192536354065
0.6270679235458374
0.6256557703018188
0.6242409348487854
0.6227988004684448
0.6213244795799255
0.6198606491088867
0.6184088587760925
0.6169559359550476
test loss 0.811 


Note that these models are sensitive to weight initialization, the choice of optimization algorithm, and regularization.
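
As a rough illustration of the initialization point, the sketch below compares the initial predictions of a dot-product model under the default `nn.Embedding` initialization (standard normal) with the small uniform initialization used in the `MF` class. The table size of 1000 is an arbitrary choice for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_size = 100
ids = torch.arange(1000)

# Default nn.Embedding init draws weights from N(0, 1)
user_default = nn.Embedding(1000, emb_size)
item_default = nn.Embedding(1000, emb_size)
pred_default = (user_default(ids) * item_default(ids)).sum(1)

# Small uniform init, as in the MF class of this notebook
user_small = nn.Embedding(1000, emb_size)
item_small = nn.Embedding(1000, emb_size)
user_small.weight.data.uniform_(0, 0.05)
item_small.weight.data.uniform_(0, 0.05)
pred_small = (user_small(ids) * item_small(ids)).sum(1)

print("default init, std of initial predictions:", pred_default.std().item())
print("small init, max initial prediction:", pred_small.max().item())
```

With the default initialization the starting predictions are spread far outside the rating range, while the small uniform initialization starts them close to zero.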

## Neural Network Model

In [51]:
# Note that the user and item embeddings are concatenated here rather than multiplied together, so they could have different sizes.
# We could likely get better results by tuning the regularization further.

class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
        self.drop2 = nn.Dropout(0.0)
        
    def forward(self, u, v):
        U = self.user_emb(u)
        V = self.item_emb(v)
        x = F.relu(torch.cat([U, V], dim=1))
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        x = self.lin2(x)
        return x
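
Because the embeddings are concatenated, nothing forces them to share a size. The following is a minimal sketch of a variant with separate user and item embedding sizes. The class name `CollabFNetAsym` and the default sizes are hypothetical, picked only to show that the shapes work out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabFNetAsym(nn.Module):
    """Hypothetical variant of CollabFNet with separate embedding sizes."""
    def __init__(self, num_users, num_items, user_emb_size=50, item_emb_size=20, n_hidden=10):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, user_emb_size)
        self.item_emb = nn.Embedding(num_items, item_emb_size)
        # The first linear layer takes the concatenated embeddings
        self.lin1 = nn.Linear(user_emb_size + item_emb_size, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)

    def forward(self, u, v):
        x = torch.cat([self.user_emb(u), self.item_emb(v)], dim=1)
        x = self.drop1(F.relu(x))
        x = F.relu(self.lin1(x))
        return self.lin2(x)

model = CollabFNetAsym(num_users=7, num_items=4)
users = torch.LongTensor([0, 1, 2, 6])
items = torch.LongTensor([0, 3, 1, 2])
out = model(users, items)
print(out.shape)  # one prediction per (user, item) pair, with a trailing dim of 1
```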

In [52]:
model = CollabFNet(num_users, num_items, emb_size=100) #.cuda()

In [53]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-5, unsqueeze=True) 

12.267468452453613
7.332353115081787
3.9531960487365723
1.809013843536377
1.2874162197113037
2.171198606491089
3.0952234268188477
3.1948397159576416
2.6650784015655518
1.9631448984146118
1.4193719625473022
1.1771756410598755
1.1993615627288818
1.3730884790420532
1.5763182640075684
1.720729947090149
1.76218843460083
1.7043043375015259
1.5659388303756714
1.386714220046997
test loss 1.211 


In [54]:
train_epocs(model, epochs=20, lr=0.01, wd=1e-6, unsqueeze=True)

1.2145981788635254
1.2607530355453491
1.103450059890747
0.9955425262451172
1.0669831037521362
1.0572514533996582
0.9663847088813782
0.9212606549263
0.9510968327522278
0.9548993706703186
0.9007816910743713
0.8605215549468994
0.8662114143371582
0.8781496286392212
0.8602548241615295
0.8229435086250305
0.8100877404212952
0.8181719183921814
0.8187110424041748
0.7972913384437561
test loss 0.826 


In [55]:
train_epocs(model, epochs=10, lr=0.001, wd=1e-6, unsqueeze=True)

0.7805511355400085
0.7833136916160583
0.7782210111618042
0.778822124004364
0.7788718342781067
0.7764604091644287
0.7743875980377197
0.7731351256370544
0.7729506492614746
0.7721666097640991
test loss 0.821 


In [56]:
train_epocs(model, epochs=20, lr=0.001, wd=1e-6, unsqueeze=True)

0.7696303725242615
0.7755081057548523
0.7689151763916016
0.7694798111915588
0.7690362334251404
0.7689986228942871
0.7659075856208801
0.7651707530021667
0.7668241262435913
0.7639697790145874
0.7629351019859314
0.7614362239837646
0.7608759999275208
0.7593751549720764
0.7577551007270813
0.759416401386261
0.7596520185470581
0.7581142783164978
0.7591732144355774
0.7557303309440613
test loss 0.813 


# Building a Dataset

In [61]:
!ls data

links.csv   ratings.csv  tags.csv	     tiny_val2.csv
movies.csv  README.txt	 tiny_training2.csv


In [66]:
pd.read_csv('data/ratings.csv')

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [162]:
from torch.utils import data
class CFData(data.Dataset):
    def __init__(self, df):
        self.df = df
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self,index):
        row = self.df.iloc[index]    
        X = [row.userId,row.movieId]
        y = row.rating
        return X,y

In [163]:
users = pd.read_csv('data/ratings.csv')

In [164]:
len(users)

100004

In [165]:
users_train = users.loc[:80003]
users_test = users.loc[80004:]

In [166]:
train_ds = CFData(users_train)
test_ds = CFData(users_test)

In [167]:
from torch.utils.data import Dataset, DataLoader
batch_size = 1
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
# for test we use shuffle=False
test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

In [168]:
train_dl = iter(train_loader)
x, y = next(train_dl)

In [169]:
print(x)

[tensor([ 531.], dtype=torch.float64), tensor([ 41997.], dtype=torch.float64)]
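
The batch above comes back as `float64` because `df.iloc[index]` on a frame with both integer and float columns upcasts the whole row to float, while `nn.Embedding` expects integer indices. One possible fix is to convert the columns to tensors once, in `__init__`. The sketch below assumes the ids have already been encoded to a contiguous range. The class name `CFDataInt` and the toy frame are hypothetical.

```python
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class CFDataInt(Dataset):
    """Hypothetical variant of CFData that returns integer id tensors."""
    def __init__(self, df):
        self.users = torch.tensor(df["userId"].values, dtype=torch.long)
        self.items = torch.tensor(df["movieId"].values, dtype=torch.long)
        self.ratings = torch.tensor(df["rating"].values, dtype=torch.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, index):
        return self.users[index], self.items[index], self.ratings[index]

# Toy, already-encoded data for illustration
toy = pd.DataFrame({"userId": [0, 0, 1, 2],
                    "movieId": [0, 1, 1, 2],
                    "rating": [4.0, 5.0, 3.0, 2.5]})
loader = DataLoader(CFDataInt(toy), batch_size=2, shuffle=False)

u, v, r = next(iter(loader))
user_emb = nn.Embedding(3, 5)
print(u.dtype, r.dtype, user_emb(u).shape)
```

Storing tensors up front also avoids a pandas row lookup on every `__getitem__` call, which tends to be slow for large frames.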


## TODO
* Use t-SNE to visualize embeddings

# Lab
* Can you use `tags.csv` and `timestamp` to improve your predictions?
* Play with the hyperparameters
* Look at the fastai version of this network and try its transformation: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb
* You may need a dataloader if your data is larger. Can you construct one? Here is an example:
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel.html
* Work with the largest dataset http://files.grouplens.org/datasets/movielens/ml-latest.zip

# References
* This notebook is based on [lesson 5 of Jeremy Howard's Deep Learning Course](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb)