### Collaborative Filtering with Neural Nets
* Users rate a subset of movies on a scale from 0.5 to 5
* The goal is to build a model that predicts the ratings a user would give to movies they have not rated
* We use an embedding-based approach:
  * An embedding matrix for users holds one row per user, so each user is represented by a vector of learned factors
  * A second embedding matrix does the same for movies
* We use the [MovieLens data](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip)
* Code adapted from Jeremy Howard's fastai MOOC
* The final model reaches a validation RMSE of about 0.885 (computed at the end of the notebook)
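
The embedding idea in the list above can be pictured as a plain table lookup. The following is a minimal sketch in pure Python, not the fastai or PyTorch implementation; the toy sizes and the `lookup` helper are assumptions made only for illustration.

```python
import random

random.seed(0)

n_users, n_movies, n_factors = 4, 3, 5  # toy sizes, not the real dataset's

# One row of n_factors learnable numbers per user and per movie.
user_emb = [[random.uniform(0, 0.05) for _ in range(n_factors)] for _ in range(n_users)]
movie_emb = [[random.uniform(0, 0.05) for _ in range(n_factors)] for _ in range(n_movies)]

def lookup(matrix, idx):
    """Return the embedding vector (row) for a contiguous integer index."""
    return matrix[idx]

u_vec = lookup(user_emb, 2)   # embedding of the user with index 2
m_vec = lookup(movie_emb, 0)  # embedding of the movie with index 0
print(len(u_vec), len(m_vec))
```

During training the numbers in these rows are adjusted like any other weights, which is how the model learns a representation for each user and movie.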

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
cd ~/fastai

/home/paperspace/fastai


In [3]:
from fastai.learner import *
from fastai.column_data import *

In [4]:
cd ~

/home/paperspace


In [7]:
path='collab_filter/ml-latest-small/'

In [8]:
ratings=pd.read_csv(path+'ratings.csv')
ratings.head()

   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205


In [9]:
# map user IDs and movie IDs to contiguous integer indices (0, 1, 2, ...)
u_uniq=ratings.userId.unique()
user2idx={o:i for i,o in enumerate(u_uniq)}
ratings.userId=ratings.userId.apply(lambda x:user2idx[x])

m_uniq=ratings.movieId.unique()
movie2idx={o:i for i,o in enumerate(m_uniq)}
ratings.movieId=ratings.movieId.apply(lambda x:movie2idx[x])

n_users=int(ratings.userId.nunique())
n_movies=int(ratings.movieId.nunique())
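
The remapping matters because an embedding table with `n` rows can only be indexed with integers from `0` to `n-1`, while the raw IDs have gaps. A minimal sketch of the same idea on a toy list, in pure Python rather than pandas; the raw IDs below are made up.

```python
raw_user_ids = [15, 15, 42, 7, 42]  # hypothetical raw IDs with gaps

# Unique values in order of first appearance, as pandas' unique() returns them.
uniq = list(dict.fromkeys(raw_user_ids))
id2idx = {raw: i for i, raw in enumerate(uniq)}
contiguous = [id2idx[raw] for raw in raw_user_ids]

print(id2idx)      # {15: 0, 42: 1, 7: 2}
print(contiguous)  # [0, 0, 1, 2, 1]
```

Keeping the `id2idx` dictionary around makes it possible to translate predictions back to the original IDs later.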

### Using a Neural Net for Collaborative Filtering
* The typical approach is some form of matrix factorization, where the predicted rating is the dot product of a user vector and a movie vector
* A neural net is more flexible. In the simplest version, the user and movie embeddings are concatenated and fed into the network as its input
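
To make the contrast concrete, here is a small pure-Python sketch of the two ways of combining a user vector and a movie vector. The function names, weights and biases are hypothetical and chosen only so the arithmetic is easy to follow; the real model in this notebook uses PyTorch layers.

```python
def dot_product_score(u_vec, m_vec):
    """Matrix-factorization style: the score is the dot product of the two vectors."""
    return sum(u * m for u, m in zip(u_vec, m_vec))

def concat_hidden_layer(u_vec, m_vec, weights, biases):
    """NN style: concatenate the vectors, then apply one linear layer and ReLU."""
    x = u_vec + m_vec  # list concatenation, length = 2 * n_factors
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

u_vec, m_vec = [1.0, 2.0], [3.0, -1.0]
weights = [[1.0, 0.0, 1.0, 0.0],   # hypothetical weights: 2 hidden units, 4 inputs
           [0.0, 0.0, 0.0, 1.0]]
biases = [0.5, 0.0]

print(dot_product_score(u_vec, m_vec))                     # 1.0
print(concat_hidden_layer(u_vec, m_vec, weights, biases))  # [4.5, 0.0]
```

The dot product forces factor `i` of the user to interact only with factor `i` of the movie, whereas the hidden layer can learn arbitrary weighted combinations of all factors.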

In [24]:
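# An embedding-based neural net as a PyTorch module.
# Note: the class reads n_factors, min_rating and max_rating from notebook-level
# globals, so the cells that define them must run before the model is built.

def get_emb(ni,nf):
    # embedding table with ni rows of nf factors, initialised to small positive values
    e=nn.Embedding(ni,nf)
    e.weight.data.uniform_(0,0.05)
    return e

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, nh=10, p1=0.5, p2=0.5):
        super().__init__()
        (self.u, self.m)=[get_emb(*o) for o in [(n_users, n_factors), (n_movies, n_factors)]]
        self.lin1=nn.Linear(n_factors*2, nh)
        self.lin2=nn.Linear(nh,1)
        self.drop1=nn.Dropout(p1)
        self.drop2=nn.Dropout(p2)
    
    def forward(self, cats, conts):
        # cats holds the categorical columns: user index, then movie index; conts is unused
        users, movies=cats[:,0], cats[:,1]
        x=self.drop1(torch.cat([self.u(users), self.m(movies)], dim=1))
        x=self.drop2(F.relu(self.lin1(x)))
        # scale the sigmoid to (min_rating-0.5, max_rating+0.5) so the extreme ratings are reachable
        return F.sigmoid(self.lin2(x)) * (max_rating-min_rating+1) + min_rating-0.5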
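
The last line of `forward` rescales a sigmoid so that predictions land in a range slightly wider than the observed ratings. A minimal worked example of that arithmetic in pure Python, assuming rating bounds of 0.5 and 5.0 (the notebook reads the actual bounds from the data); `scaled_sigmoid` is a name introduced here for illustration.

```python
import math

def scaled_sigmoid(z, min_rating, max_rating):
    """Squash a raw score z into (min_rating - 0.5, max_rating + 0.5)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (max_rating - min_rating + 1) + min_rating - 0.5

# Assumed rating bounds (half-star scale); the notebook reads them from the data.
lo, hi = 0.5, 5.0
for z in (-10.0, 0.0, 10.0):
    print(z, round(scaled_sigmoid(z, lo, hi), 3))
```

With these bounds the output lies strictly between 0 and 5.5, and a raw score of 0 maps to the midpoint 2.75. The half-point margin on each side means the network does not need a fully saturated sigmoid to predict the lowest or highest rating.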

In [12]:
min_rating, max_rating=ratings.rating.min(), ratings.rating.max()

In [13]:
# get validation indices
val_idxs=get_cv_idxs(len(ratings))
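
The validation indices are simply a random subset of row numbers held out from training. The sketch below shows the general idea in pure Python; it is not fastai's `get_cv_idxs`, and the 20% fraction, the seed and the name `holdout_indices` are assumptions for illustration.

```python
import random

def holdout_indices(n, val_pct=0.2, seed=42):
    """Return a random sample of row indices to use as the validation set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    idxs = list(range(n))
    rng.shuffle(idxs)
    return idxs[:int(val_pct * n)]

val = holdout_indices(1000)
print(len(val))  # 200
```

Fixing the seed keeps the split identical across runs, so validation losses from different experiments stay comparable.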

In [14]:
# size of embedding vector
n_factors=50

In [15]:
# create model data object
x=ratings.drop(['rating','timestamp'], axis=1)
y=ratings['rating'].astype(np.float32)

In [25]:
data=ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], bs=64)

In [30]:
wd=2e-5
model=EmbeddingNet(n_users, n_movies).cuda()
opt=optim.Adam(model.parameters(), 1e-3, weight_decay=wd)

In [31]:
fit(model, data, 5, opt, F.mse_loss)

epoch      trn_loss   val_loss                                  
    0      0.892676   0.813727  
    1      0.825806   0.794682                                  
    2      0.82637    0.788509                                  
    3      0.815075   0.790227                                  
    4      0.798862   0.789157                                  



[array([ 0.78916])]

In [32]:
# set the learning rate for the next run (same 1e-3 value that was passed to Adam above)
set_lrs(opt, 1e-3)

In [33]:
fit(model, data, 10, opt, F.mse_loss)

epoch      trn_loss   val_loss                                  
    0      0.774881   0.788783  
    1      0.769469   0.784497                                  
    2      0.791912   0.781929                                  
    3      0.802668   0.78494                                   
    4      0.768773   0.783111                                  
    5      0.820132   0.785482                                  
    6      0.769839   0.780364                                  
    7      0.796886   0.783825                                  
    8      0.773444   0.782315                                  
    9      0.788565   0.784105                                  



[array([ 0.7841])]

In [35]:
# RMSE: square root of the final validation MSE
math.sqrt(.784105)

0.8854970355681605
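
Because the model is trained with mean squared error, the RMSE is just the square root of the reported validation loss. A small sketch of that relationship in pure Python; the `rmse` helper and the toy predictions are illustrative and not part of the notebook.

```python
import math

def rmse(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)
    return math.sqrt(mse)

# Converting the final validation MSE reported above:
print(math.sqrt(0.784105))

# Toy check with made-up numbers: errors of -1 and +1 give an RMSE of 1.
print(rmse([3.0, 4.0], [4.0, 3.0]))  # 1.0
```

RMSE is in the same units as the ratings, so a value near 0.885 means predictions are typically off by a little under one star.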