# Collaborative Filtering model on MovieLens

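Download the 20M [MovieLens dataset](http://files.grouplens.org/datasets/movielens/ml-20m.zip).

You can use `aria2c` or `wget` to download it. Note that the cells below read from `/data/ml-latest-small/`, the small variant with 100,004 ratings, so set `DATA` to the directory you actually unpacked.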

In [1]:
# %cd /data
# !!aria2c -x5 http://files.grouplens.org/datasets/movielens/ml-20m.zip
# !!unzip ml-20m.zip

In [2]:
import pandas as pd
import numpy as np
import os
import torch
from p3self.matchbox import Trainer

In [3]:
DATA = "/data/ml-latest-small/"
BS = 4000
DIM = 50
CUDA = torch.cuda.is_available()
print(CUDA)

True


In [4]:
files = os.listdir(DATA)
files

['movies.csv', 'links.csv', 'tags.csv', 'ratings.csv', 'README.txt']

In [5]:
data = dict()
for f in files:
    if f[-3:]=="csv":
        data[f.split(".")[0]] = pd.read_csv(DATA+f)

### Check Data

In [6]:
from IPython.display import display
list(display(k,v.sample(5)) for k,v in data.items())

'movies'

Unnamed: 0,movieId,title,genres
337,373,Red Rock West (1992),Thriller
7349,71535,Zombieland (2009),Action|Comedy|Horror
546,620,Scream of Stone (Cerro Torre: Schrei aus Stein...,Drama
7876,89203,Magic Trip (2011),Documentary
3796,4867,Riding in Cars with Boys (2001),Comedy|Drama


'links'

Unnamed: 0,movieId,imdbId,tmdbId
6195,34517,61398,42689.0
6710,53138,455596,14208.0
4088,5354,64117,28289.0
7211,68194,1226271,21641.0
1129,1391,116996,75.0


'tags'

Unnamed: 0,userId,movieId,tag,timestamp
530,364,118997,musical,1444530098
37,94,64957,original plot,1291781246
367,364,1176,Krzysztof Kieslowski,1444528941
1173,547,103372,toplist13,1383625950
160,212,60684,dystopia,1253926517


'ratings'

Unnamed: 0,userId,movieId,rating,timestamp
43312,310,628,3.0,1414188046
53037,384,27851,3.5,1154367666
15114,99,588,3.0,938586006
29225,212,60040,3.0,1227938927
16704,105,7149,3.5,1100606316


[None, None, None, None]

## Model on rating

In [7]:
data["ratings"].sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
91262,605,3717,1.0,980174247
62952,456,6365,4.5,1432308271
78232,544,79132,4.0,1435786926
76078,529,953,3.0,959965606
92837,615,81562,3.5,1454913597


In [8]:
len(data["ratings"])

100004

In [9]:
userId = list(set(data["ratings"]["userId"]))
movieId = list(set(data["ratings"]["movieId"]))
print(len(userId),len(movieId))

671 9066


### Mapping
Build lookup tables in both directions: user ID to index, movie ID to index, index to user ID, and index to movie ID.

In [10]:
u2i = dict((v,k) for k,v in enumerate(userId))
m2i = dict((v,k) for k,v in enumerate(movieId))
i2u = dict((k,v) for k,v in enumerate(userId))
i2m = dict((k,v) for k,v in enumerate(movieId))
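
Python does not guarantee the iteration order of a set, and the embedding rows saved later only make sense together with the mapping that produced them. A minimal sketch of a deterministic variant, assuming sorted IDs are acceptable; `build_index` and the toy ID list are hypothetical and not part of the notebook:

```python
def build_index(ids):
    """Return (id -> row, row -> id) lookup tables over the sorted unique ids."""
    unique_ids = sorted(set(ids))
    to_index = {raw: row for row, raw in enumerate(unique_ids)}
    to_id = dict(enumerate(unique_ids))
    return to_index, to_id

# Toy stand-in for data["ratings"]["movieId"]
sample_movie_ids = [31, 1029, 31, 1061, 1029, 2]
m2i_demo, i2m_demo = build_index(sample_movie_ids)

# Round trip: every raw id maps to a row and back to itself
assert all(i2m_demo[m2i_demo[raw]] == raw for raw in sample_movie_ids)
print(m2i_demo)  # {2: 0, 31: 1, 1029: 2, 1061: 3}
```

Saving the mapping next to the embedding array (for example as JSON) would let a later session interpret the rows without rebuilding it.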

In [11]:
from torch.utils.data import DataLoader,Dataset

### Separate train/valid dataset

In [12]:
train_pick = np.random.rand(len(data["ratings"]))>.2
valid_pick = ~train_pick

In [13]:
train_pick,valid_pick

(array([ True, False, False, ..., False, False,  True]),
 array([False,  True,  True, ...,  True,  True, False]))

In [14]:
train_df = data["ratings"][train_pick].reset_index()
valid_df = data["ratings"][valid_pick].reset_index()
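
The mask above is drawn from the global NumPy random state without a seed, so every run produces a different split. The training logs further down show both 20 and 21 training batches per epoch, which is consistent with the split having been re-drawn between runs. A sketch of a reproducible version, assuming a fixed seed is acceptable; `SEED` is an illustrative value and `n_rows` stands in for `len(data["ratings"])`:

```python
import numpy as np

SEED = 0          # illustrative; any fixed integer works
n_rows = 100004   # len(data["ratings"]) in this notebook

rng = np.random.default_rng(SEED)
train_pick = rng.random(n_rows) > 0.2   # roughly 80% of rows
valid_pick = ~train_pick                # the remaining rows

print(train_pick.sum(), valid_pick.sum())
```

With a fixed seed, the comparison between model variants is made on the same validation rows each time.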

### Data generator

In [15]:
class reco_data(Dataset):
    def __init__(self,df):
        self.df=df
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self,idx):
        row = self.df.loc[idx]
        return u2i[int(row["userId"])],m2i[int(row["movieId"])],row["rating"]/5

In [16]:
train = reco_data(train_df)
valid = reco_data(valid_df)
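
`__getitem__` above performs a `df.loc` lookup and two dictionary lookups for every sample, which tends to be slow when batches hold thousands of rows. One option is to convert the columns to arrays once in `__init__`. The sketch below is a plain class on toy data so that it runs on its own; `RecoArrays`, `toy_u2i`, and `toy_m2i` are hypothetical names, and in the notebook the class would subclass `Dataset` as `reco_data` does.

```python
import numpy as np
import pandas as pd

class RecoArrays:
    """Map-style dataset that precomputes index and target arrays once."""
    def __init__(self, df, u2i, m2i):
        self.users = df["userId"].map(u2i).to_numpy(dtype=np.int64)
        self.movies = df["movieId"].map(m2i).to_numpy(dtype=np.int64)
        # Same scaling as reco_data: ratings divided by 5
        self.targets = (df["rating"] / 5).to_numpy(dtype=np.float32)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.users[idx], self.movies[idx], self.targets[idx]

toy = pd.DataFrame({"userId": [7, 7, 9],
                    "movieId": [31, 2, 31],
                    "rating": [2.5, 4.0, 5.0]})
toy_u2i = {7: 0, 9: 1}
toy_m2i = {2: 0, 31: 1}
ds = RecoArrays(toy, toy_u2i, toy_m2i)
print(len(ds), ds[1])
```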

## Basic Collaborative Filtering

In [17]:
from torch import nn

In [18]:
class embeddings(nn.Module):
    def __init__(self):
        super(embeddings,self).__init__()
        self.emb_u = nn.Embedding(len(userId), DIM)
        self.emb_m = nn.Embedding(len(movieId), DIM)
        
    def forward(self,u,m):
        return self.emb_u(u),self.emb_m(m)

In [29]:
class cf(nn.Module):
    def __init__(self):
        super(cf,self).__init__()
        self.ebd = embeddings()
    
    def forward(self,u,m):
        u_vec,m_vec = self.ebd(u,m)
        return u_vec * m_vec
    
class cfnn(nn.Module):
    def __init__(self):
        super(cfnn,self).__init__()
        self.cf = cf()
        self.fcb = nn.Sequential(*[nn.Linear(DIM,512,bias=False),
                                   nn.BatchNorm1d(512),
                                   nn.LeakyReLU(inplace=True),
                                   nn.Linear(512,1,bias=False),
                                   nn.BatchNorm1d(1),
                                   nn.Sigmoid()
                                  ],
                                )
    
    def forward(self,u,m):
        x = self.cf(u,m)
        return self.fcb(x)

In [30]:
cfmodel = cfnn()

In [31]:
from torch.optim import Adam
mse = nn.MSELoss()
opt = Adam(cfmodel.parameters(),amsgrad=True)
if CUDA:
    cfmodel.cuda()

Step functions for training and validation

In [32]:
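def action(*args,**kwargs):
    # One training step; the batch is read from the first positional argument
    u,m,y = args[0]
    opt.zero_grad()
    if CUDA:
        u,m,y  = u.cuda(),m.cuda(),y.cuda()
        
    y_ = cfmodel(u,m) # prediction, shape (batch, 1)
    
    # y has shape (batch,), so add a trailing dimension to match the prediction
    loss = mse(y_,y.unsqueeze(-1).float())
    
    loss.backward()
    opt.step()
    
    return {"mse":loss.item()}

def val_action(*args,**kwargs):
    # One validation step: same forward pass and loss, without backward() or opt.step()
    u,m,y = args[0]
    if CUDA:
        u,m,y = u.cuda(),m.cuda(),y.cuda()
    y_ = cfmodel(u,m)
    loss = mse(y_,y.unsqueeze(-1).float())
    
    return {"mse":loss.item()}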
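
`val_action` runs the forward pass with the model in whatever mode it is already in, and it builds an autograd graph that is never used. The `Trainer` in the next cell is given only the datasets, not `cfmodel`, so it presumably cannot switch the model to evaluation mode itself. That matters for `BatchNorm1d` and `Dropout`, which behave differently under `eval()`. A minimal sketch of a validation step that handles both, using a tiny stand-in model so that it runs on its own (`evaluate_batch` and `TinyModel` are hypothetical names):

```python
import torch
from torch import nn

def evaluate_batch(model, loss_fn, u, m, y):
    """Score one batch in eval mode without tracking gradients."""
    was_training = model.training
    model.eval()                      # BatchNorm uses running stats, Dropout is off
    with torch.no_grad():             # no autograd graph is built
        pred = model(u, m)
        loss = loss_fn(pred, y.unsqueeze(-1).float())
    model.train(was_training)         # restore the previous mode
    return {"mse": loss.item()}

class TinyModel(nn.Module):
    """Stand-in with the same call signature as the notebook's models."""
    def __init__(self, n_users=4, n_movies=6, dim=8):
        super().__init__()
        self.emb_u = nn.Embedding(n_users, dim)
        self.emb_m = nn.Embedding(n_movies, dim)
        self.head = nn.Sequential(nn.Dropout(0.3), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, u, m):
        return self.head(self.emb_u(u) * self.emb_m(m))

model = TinyModel()
u = torch.tensor([0, 1, 2])
m = torch.tensor([5, 3, 1])
y = torch.tensor([0.6, 0.8, 1.0])
first = evaluate_batch(model, nn.MSELoss(), u, m, y)
second = evaluate_batch(model, nn.MSELoss(), u, m, y)
print(first, second)
```

Because dropout is disabled inside `evaluate_batch`, scoring the same batch twice gives the same loss, which is not the case when the model is left in training mode.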

In [33]:
trainer = Trainer(train, val_dataset=valid, batch_size=BS, print_on = 5)

trainer.action = action
trainer.val_action = val_action

In [34]:
trainer.train(10)

⭐[ep_0_i_19]	mse	0.094: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_0_i_5]	mse	0.092: 100%|██████████| 6/6 [00:03<00:00,  1.79it/s]
⭐[ep_1_i_19]	mse	0.088: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]
😎[val_ep_1_i_5]	mse	0.087: 100%|██████████| 6/6 [00:03<00:00,  1.77it/s]
⭐[ep_2_i_19]	mse	0.084: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_2_i_5]	mse	0.084: 100%|██████████| 6/6 [00:03<00:00,  1.79it/s]
⭐[ep_3_i_19]	mse	0.082: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]
😎[val_ep_3_i_5]	mse	0.082: 100%|██████████| 6/6 [00:03<00:00,  1.78it/s]
⭐[ep_4_i_19]	mse	0.079: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_4_i_5]	mse	0.079: 100%|██████████| 6/6 [00:03<00:00,  1.77it/s]
⭐[ep_5_i_19]	mse	0.075: 100%|██████████| 20/20 [00:13<00:00,  1.46it/s]
😎[val_ep_5_i_5]	mse	0.077: 100%|██████████| 6/6 [00:03<00:00,  1.78it/s]
⭐[ep_6_i_19]	mse	0.073: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_6_i_5]	mse	0.075: 100%|██████████| 6/6 [00:03<00:

### Wider NN

Widen the hidden layer from 512 to 1024 units.

In [23]:
class cfnn2(nn.Module):
    def __init__(self):
        super(cfnn2,self).__init__()
        self.cf = cf()
        self.fcb = nn.Sequential(*[nn.Linear(DIM,1024,bias=False),
                                   nn.BatchNorm1d(1024),
                                   nn.LeakyReLU(inplace=True),
                                   nn.Linear(1024,1,bias=False),
                                   nn.BatchNorm1d(1),
                                   nn.Sigmoid()
                                  ],
                                )
    
    def forward(self,u,m):
        x = self.cf(u,m)
        return self.fcb(x)

In [24]:
cfmodel = cfnn2()

from torch.optim import Adam
mse = nn.MSELoss()
opt = Adam(cfmodel.parameters(),amsgrad=True)
if CUDA:
    cfmodel.cuda()

trainer = Trainer(train, val_dataset=valid, batch_size=BS, print_on = 5)

trainer.action = action
trainer.val_action = val_action

trainer.train(10)

⭐[ep_0_i_19]	mse	0.107: 100%|██████████| 21/21 [00:13<00:00,  1.54it/s]
😎[val_ep_0_i_4]	mse	0.108: 100%|██████████| 5/5 [00:03<00:00,  1.49it/s]
⭐[ep_1_i_19]	mse	0.100: 100%|██████████| 21/21 [00:13<00:00,  1.53it/s]
😎[val_ep_1_i_4]	mse	0.101: 100%|██████████| 5/5 [00:03<00:00,  1.44it/s]
⭐[ep_2_i_19]	mse	0.092: 100%|██████████| 21/21 [00:13<00:00,  1.53it/s]
😎[val_ep_2_i_4]	mse	0.095: 100%|██████████| 5/5 [00:03<00:00,  1.50it/s]
⭐[ep_3_i_19]	mse	0.087: 100%|██████████| 21/21 [00:13<00:00,  1.54it/s]
😎[val_ep_3_i_4]	mse	0.090: 100%|██████████| 5/5 [00:03<00:00,  1.47it/s]
⭐[ep_4_i_19]	mse	0.082: 100%|██████████| 21/21 [00:13<00:00,  1.52it/s]
😎[val_ep_4_i_4]	mse	0.087: 100%|██████████| 5/5 [00:03<00:00,  1.48it/s]
⭐[ep_5_i_19]	mse	0.079: 100%|██████████| 21/21 [00:13<00:00,  1.54it/s]
😎[val_ep_5_i_4]	mse	0.084: 100%|██████████| 5/5 [00:03<00:00,  1.49it/s]
⭐[ep_6_i_19]	mse	0.076: 100%|██████████| 21/21 [00:13<00:00,  1.55it/s]
😎[val_ep_6_i_4]	mse	0.081: 100%|██████████| 5/5 [00:03<00:

### Add dropout

In [25]:
class cfnn3(nn.Module):
    def __init__(self):
        super(cfnn3,self).__init__()
        self.cf = cf()
        self.fcb = nn.Sequential(*[nn.Linear(DIM,512,bias=False),
                                   nn.BatchNorm1d(512),
                                   nn.LeakyReLU(inplace=True),
                                   nn.Dropout(.3),
                                   nn.Linear(512,1,bias=False),
                                   nn.BatchNorm1d(1),
                                   nn.Sigmoid()
                                  ],
                                )
    
    def forward(self,u,m):
        x = self.cf(u,m)
        return self.fcb(x)

In [26]:
cfmodel = cfnn3()

from torch.optim import Adam
mse = nn.MSELoss()
opt = Adam(cfmodel.parameters(),amsgrad=True)
if CUDA:
    cfmodel.cuda()

trainer = Trainer(train, val_dataset=valid, batch_size=BS, print_on = 5)

trainer.action = action
trainer.val_action = val_action

trainer.train(10)

⭐[ep_0_i_19]	mse	0.097: 100%|██████████| 21/21 [00:13<00:00,  1.55it/s]
😎[val_ep_0_i_4]	mse	0.098: 100%|██████████| 5/5 [00:03<00:00,  1.48it/s]
⭐[ep_1_i_19]	mse	0.093: 100%|██████████| 21/21 [00:13<00:00,  1.57it/s]
😎[val_ep_1_i_4]	mse	0.094: 100%|██████████| 5/5 [00:03<00:00,  1.52it/s]
⭐[ep_2_i_19]	mse	0.089: 100%|██████████| 21/21 [00:13<00:00,  1.59it/s]
😎[val_ep_2_i_4]	mse	0.091: 100%|██████████| 5/5 [00:03<00:00,  1.54it/s]
⭐[ep_3_i_19]	mse	0.086: 100%|██████████| 21/21 [00:13<00:00,  1.58it/s]
😎[val_ep_3_i_4]	mse	0.087: 100%|██████████| 5/5 [00:03<00:00,  1.47it/s]
⭐[ep_4_i_19]	mse	0.082: 100%|██████████| 21/21 [00:13<00:00,  1.56it/s]
😎[val_ep_4_i_4]	mse	0.084: 100%|██████████| 5/5 [00:03<00:00,  1.52it/s]
⭐[ep_5_i_19]	mse	0.079: 100%|██████████| 21/21 [00:13<00:00,  1.57it/s]
😎[val_ep_5_i_4]	mse	0.081: 100%|██████████| 5/5 [00:03<00:00,  1.52it/s]
⭐[ep_6_i_19]	mse	0.076: 100%|██████████| 21/21 [00:13<00:00,  1.56it/s]
😎[val_ep_6_i_4]	mse	0.078: 100%|██████████| 5/5 [00:03<00:

### No Sigmoid as final activation

In [26]:
class cfnn4(nn.Module):
    def __init__(self):
        super(cfnn4,self).__init__()
        self.cf = cf()
        self.fcb = nn.Sequential(*[nn.Linear(DIM,512,bias=False),
                                   nn.BatchNorm1d(512),
                                   nn.LeakyReLU(inplace=True),
                                   nn.Dropout(.3),
                                   nn.Linear(512,1,bias=False),
                                   nn.BatchNorm1d(1),
                                  ],
                                )
    
    def forward(self,u,m):
        x = self.cf(u,m)
        return self.fcb(x)

In [27]:
cfmodel = cfnn4()

from torch.optim import Adam
mse = nn.MSELoss()
opt = Adam(cfmodel.parameters(),amsgrad=True)
if CUDA:
    cfmodel.cuda()

trainer = Trainer(train, val_dataset=valid, batch_size=BS, print_on = 5)

trainer.action = action
trainer.val_action = val_action

trainer.train(10)

⭐[ep_0_i_19]	mse	0.603: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_0_i_5]	mse	0.593: 100%|██████████| 6/6 [00:03<00:00,  1.76it/s]
⭐[ep_1_i_19]	mse	0.558: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_1_i_5]	mse	0.554: 100%|██████████| 6/6 [00:03<00:00,  1.78it/s]
⭐[ep_2_i_19]	mse	0.519: 100%|██████████| 20/20 [00:13<00:00,  1.45it/s]
😎[val_ep_2_i_5]	mse	0.517: 100%|██████████| 6/6 [00:03<00:00,  1.77it/s]
⭐[ep_3_i_19]	mse	0.482: 100%|██████████| 20/20 [00:13<00:00,  1.45it/s]
😎[val_ep_3_i_5]	mse	0.483: 100%|██████████| 6/6 [00:03<00:00,  1.78it/s]
⭐[ep_4_i_19]	mse	0.450: 100%|██████████| 20/20 [00:13<00:00,  1.47it/s]
😎[val_ep_4_i_5]	mse	0.452: 100%|██████████| 6/6 [00:03<00:00,  1.77it/s]
⭐[ep_5_i_19]	mse	0.424: 100%|██████████| 20/20 [00:13<00:00,  1.46it/s]
😎[val_ep_5_i_5]	mse	0.422: 100%|██████████| 6/6 [00:03<00:00,  1.72it/s]
⭐[ep_6_i_19]	mse	0.394: 100%|██████████| 20/20 [00:13<00:00,  1.46it/s]
😎[val_ep_6_i_5]	mse	0.395: 100%|██████████| 6/6 [00:03<00:

### No hidden layers: a single linear layer with a sigmoid

In [28]:
class cf_model(nn.Module):
    def __init__(self):
        super(cf_model,self).__init__()
        self.cf = cf()
        self.fcb = nn.Sequential(*[nn.Linear(DIM,1,bias=False),
                                   nn.Sigmoid(),])
        
    def forward(self,u,m):
        x = self.cf(u,m)
        return self.fcb(x)
    
cfmodel = cf_model()

from torch.optim import Adam
mse = nn.MSELoss()
opt = Adam(cfmodel.parameters(),amsgrad=True)
if CUDA:
    cfmodel.cuda()

trainer = Trainer(train, val_dataset=valid, batch_size=BS, print_on = 5)

trainer.action = action
trainer.val_action = val_action

trainer.train(10)

⭐[ep_0_i_19]	mse	0.101: 100%|██████████| 20/20 [00:13<00:00,  1.50it/s]
😎[val_ep_0_i_5]	mse	0.100: 100%|██████████| 6/6 [00:03<00:00,  1.81it/s]
⭐[ep_1_i_19]	mse	0.095: 100%|██████████| 20/20 [00:13<00:00,  1.50it/s]
😎[val_ep_1_i_5]	mse	0.096: 100%|██████████| 6/6 [00:03<00:00,  1.78it/s]
⭐[ep_2_i_19]	mse	0.092: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]
😎[val_ep_2_i_5]	mse	0.093: 100%|██████████| 6/6 [00:03<00:00,  1.71it/s]
⭐[ep_3_i_19]	mse	0.090: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]
😎[val_ep_3_i_5]	mse	0.091: 100%|██████████| 6/6 [00:03<00:00,  1.80it/s]
⭐[ep_4_i_19]	mse	0.089: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]
😎[val_ep_4_i_5]	mse	0.090: 100%|██████████| 6/6 [00:03<00:00,  1.74it/s]
⭐[ep_5_i_19]	mse	0.089: 100%|██████████| 20/20 [00:13<00:00,  1.49it/s]
😎[val_ep_5_i_5]	mse	0.089: 100%|██████████| 6/6 [00:03<00:00,  1.78it/s]
⭐[ep_6_i_19]	mse	0.088: 100%|██████████| 20/20 [00:13<00:00,  1.48it/s]
😎[val_ep_6_i_5]	mse	0.089: 100%|██████████| 6/6 [00:03<00:
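
With no hidden layer, this variant is close to classic matrix factorization. The `cf` module returns the elementwise product of the user and movie vectors, and a bias-free `Linear(DIM, 1)` takes a weighted sum of those products, that is, a dot product in which each latent dimension has its own learned weight. A small NumPy check of that identity on random stand-in vectors (all array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
u_vec = rng.normal(size=dim)     # stand-in for one row of emb_u
m_vec = rng.normal(size=dim)     # stand-in for one row of emb_m
w = rng.normal(size=dim)         # stand-in for the Linear(DIM, 1) weight row

# What the model computes before the sigmoid: Linear applied to u * m
linear_on_product = np.dot(w, u_vec * m_vec)
# The same value written as a weighted dot product
weighted_dot = np.sum(w * u_vec * m_vec)
assert np.isclose(linear_on_product, weighted_dot)

# With all weights equal to 1 it reduces to the plain dot product
assert np.isclose(np.dot(np.ones(dim), u_vec * m_vec), np.dot(u_vec, m_vec))

prediction = 1.0 / (1.0 + np.exp(-linear_on_product))   # sigmoid, in (0, 1)
print(prediction)
```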

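So far the first model (`cfnn`) has the lowest validation MSE in the logs shown: 0.075 at epoch 6, against 0.081 (`cfnn2`), 0.078 (`cfnn3`), 0.395 (`cfnn4`), and 0.089 (`cf_model`).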
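
Because `reco_data` divides every rating by 5, the logged MSE values are on a 0 to 1 scale. Taking the square root and multiplying by 5 gives an RMSE in stars, which is easier to interpret. A small sketch using the epoch-6 validation values from the logs above (`to_star_rmse` is a hypothetical helper, and the dictionary keys are just labels for the five variants):

```python
import math

def to_star_rmse(scaled_mse, scale=5.0):
    """Convert an MSE computed on rating/scale back to an RMSE in stars."""
    return math.sqrt(scaled_mse) * scale

# Epoch-6 validation MSE for each variant, copied from the training logs
val_mse_epoch6 = {
    "cfnn": 0.075,
    "cfnn2": 0.081,
    "cfnn3": 0.078,
    "cfnn4": 0.395,
    "cf_model": 0.089,
}

for name, mse_value in val_mse_epoch6.items():
    print(f"{name}: {to_star_rmse(mse_value):.2f} stars")

best = min(val_mse_epoch6, key=val_mse_epoch6.get)
print("lowest validation MSE:", best)
```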

### Visualize

In [40]:
movie_arr = cfmodel.cf.ebd.emb_m.weight.data.cpu().numpy()

In [42]:
np.save("/data/ml-latest-small/movie_arr.npy",movie_arr)

In [46]:
len(i2m),movie_arr.shape

(9066, (9066, 50))
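
One common way to look at embeddings such as `movie_arr` is to project them to two dimensions. A minimal sketch using PCA computed with NumPy's SVD; it uses a random array of the same shape as a stand-in so that it runs without the trained model, and `pca_2d` is a hypothetical helper rather than something the notebook defines:

```python
import numpy as np

def pca_2d(arr):
    """Project the rows of arr onto their first two principal components."""
    centered = arr - arr.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_movie_arr = rng.normal(size=(9066, 50)).astype(np.float32)  # stand-in for movie_arr
coords = pca_2d(fake_movie_arr)
print(coords.shape)  # (9066, 2)
```

The two columns of `coords` could then be passed to any scatter-plot function, with row `i` labelled through `i2m[i]` and the `movies` table.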

In [49]:
data["movies"].sample(5)

Unnamed: 0,movieId,title,genres
5683,25850,Holiday (1938),Comedy|Drama|Romance
4818,6800,Cobra (1986),Action|Crime
3105,3886,Steal This Movie! (2000),Drama
6137,33380,25 Watts (2001),Comedy|Drama
1148,1414,Mother (1996),Comedy


In [51]:
ratings = data["ratings"]

In [82]:
ratings_avg = pd.pivot_table(ratings,values="rating",index=["movieId"],aggfunc="mean")

In [83]:
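# Caution: ratings_avg is indexed by movieId but the right-hand Series is indexed 0..N-1, so label alignment leaves most arr_id values NaN; ratings_avg.index.map(m2i) avoids this
ratings_avg["arr_id"] = ratings_avg.reset_index()["movieId"].apply(lambda x:m2i[x])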

In [86]:
mvrt = pd.merge(ratings_avg.reset_index(),data["movies"],on="movieId")

In [87]:
mvrt.sort_values(by="rating",ascending=False)

Unnamed: 0,movieId,rating,arr_id,title,genres
9065,163949,5.0,,The Beatles: Eight Days a Week - The Touring Y...,Documentary
7297,71180,5.0,,Padre padrone (1977),Drama
6629,51471,5.0,,Amazing Grace (2006),Drama|Romance
6662,52617,5.0,,Woman on the Beach (Haebyeonui yeoin) (2006),Comedy|Drama
6704,53887,5.0,,O Lucky Man! (1973),Comedy|Drama|Fantasy|Musical
6717,54251,5.0,,Dorian Blues (2004),Comedy
6726,54328,5.0,,My Best Friend (Mon meilleur ami) (2006),Comedy
6785,55555,5.0,,"Edge of Heaven, The (Auf der anderen Seite) (2...",Drama
6836,56869,5.0,,Drained (O cheiro do Ralo) (2006),Comedy
6843,57038,5.0,,To the Left of the Father (Lavoura Arcaica) (2...,Drama
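
The `arr_id` column is empty in these rows because of index alignment. `ratings_avg` is indexed by `movieId`, whereas `ratings_avg.reset_index()["movieId"].apply(...)` returns a Series indexed by row position, and pandas matches the two by label on assignment. Only movie IDs that happen to coincide with a row position receive a value, and that value belongs to a different movie. A sketch on toy data showing the effect and one way around it (the toy frame and `toy_m2i` are illustrative):

```python
import pandas as pd

toy_ratings = pd.DataFrame({"movieId": [1, 1, 50, 900],
                            "rating": [4.0, 5.0, 3.0, 2.0]})
toy_m2i = {1: 0, 50: 1, 900: 2}

avg = pd.pivot_table(toy_ratings, values="rating", index=["movieId"], aggfunc="mean")

# As in the notebook: the right-hand side is indexed 0, 1, 2, not by movieId
misaligned = avg.copy()
misaligned["arr_id"] = misaligned.reset_index()["movieId"].apply(lambda x: toy_m2i[x])
n_missing = int(misaligned["arr_id"].isna().sum())
print(n_missing)  # 2: movieId 50 and 900 have no matching label

# Mapping the index directly keeps each value on its own movieId
fixed = avg.copy()
fixed["arr_id"] = fixed.index.map(toy_m2i)
print(fixed)
```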
