## Preprocessing

In [1]:
%reload_ext autoreload
%autoreload 2
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from scipy import sparse

In [2]:
ratings = pd.read_csv('foo.csv')
ratings.drop(['index'],axis=1, inplace=True)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,44,1,4
1,61,1,3
2,67,1,4
3,72,1,3
4,86,1,5


Here I encode the user and movie ids. They're already integers, but I want them to start at zero and run contiguously to make life with matrices easier later on. The unused `index` column was already dropped when the data was loaded.

In [3]:
from sklearn.preprocessing import LabelEncoder
user_enc = LabelEncoder()
movie_enc = LabelEncoder()
user_enc.fit(ratings.userId.unique())
movie_enc.fit(ratings.movieId.unique())
n_users = len(user_enc.classes_)
n_movies = len(movie_enc.classes_)

In [4]:
ratings.userId = user_enc.transform(ratings.userId)
ratings.movieId = movie_enc.transform(ratings.movieId)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,43,0,4
1,60,0,3
2,66,0,4
3,71,0,3
4,85,0,5
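
As a rough illustration of what the encoding does: `LabelEncoder` maps each distinct id to its rank among the sorted unique ids. The sketch below mimics that with `np.unique` on a handful of made-up ids (the ids and variable names are only for illustration, they are not taken from the ratings file).

```python
import numpy as np

# Toy ids, made up for illustration.
raw_user_ids = np.array([44, 61, 67, 44, 86])

# np.unique returns the sorted distinct ids and, for every element,
# its position in that sorted list: a zero-based, contiguous code.
classes, codes = np.unique(raw_user_ids, return_inverse=True)

print(classes)  # [44 61 67 86]
print(codes)    # [0 1 2 0 3]
```

The codes can be used directly as row or column indices, which is the point of the encoding.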


I randomly select 20% of the data to hold out as a validation set. The models never see this set during training; it is used only to measure how well they perform on unseen data. The dataframes `ratings_val` and `ratings_trn` hold the validation and training sets respectively.

In [5]:
def get_val_idxs(n, pc=0.2):
    """Randomly selects idxs for validation set"""
    np.random.seed(42)
    idxs = np.random.permutation(n)
    return idxs[:int(n*pc)]

In [6]:
val_idxs = get_val_idxs(len(ratings))
mask = np.zeros(len(ratings), dtype=bool)
mask[val_idxs] = True
ratings_val = ratings[mask]
ratings_trn = ratings[~mask]
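
A quick sanity check of the split logic on a toy size can be sketched like this. It is a variation on the function above, not the same code: it uses a local `RandomState` instead of seeding the global generator, and `n = 10` is an arbitrary choice.

```python
import numpy as np

def get_val_idxs(n, pc=0.2):
    """Randomly selects idxs for validation set (local RNG variant)."""
    rng = np.random.RandomState(42)
    idxs = rng.permutation(n)
    return idxs[:int(n * pc)]

n = 10
val_idxs = get_val_idxs(n)

# Boolean mask: True for validation rows, False for training rows.
mask = np.zeros(n, dtype=bool)
mask[val_idxs] = True

n_val, n_trn = int(mask.sum()), int((~mask).sum())
print(n_val, n_trn)  # 2 8
```

Because every row is either in the mask or out of it, the two sets cannot overlap.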

This function computes the RMSE score and displays it for us.

In [7]:
def rmse(pred,true):
    """Computes and prints the RMSE score"""
    score = np.sqrt(mean_squared_error(pred,true))
    print('RMSE = {:.3f}'.format(score))
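
For reference, here is the same computation written out by hand on three made-up predictions, so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up predictions and true ratings.
pred = np.array([3.0, 4.0, 5.0])
true = np.array([4.0, 4.0, 3.0])

# Errors are -1, 0 and 2; squared they are 1, 0 and 4;
# their mean is 5/3, and the square root of that is about 1.291.
score = np.sqrt(np.mean((pred - true) ** 2))
print('RMSE = {:.3f}'.format(score))  # RMSE = 1.291
```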

I also create a matrix representation of the ratings. Each row contains the ratings for a particular user and each column contains the ratings for a particular movie. Movies that a user hasn't rated are represented with a zero. Because there are so many possible combinations of movies and users, this matrix is very sparse. Conveniently, SciPy has a class specifically for representing matrices like these.

In [8]:
# Training ratings matrix
R_trn = sparse.csr_matrix((ratings_trn.rating,
                                (ratings_trn.userId,ratings_trn.movieId)),
                                shape=(n_users, n_movies))
# All ratings matrix
R = sparse.csr_matrix((ratings.rating,
                                (ratings.userId,ratings.movieId)),
                                shape=(n_users, n_movies))
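
To make the layout concrete, here is a small dense stand-in built from made-up (user, movie, rating) triples. It is only a sketch: the real matrices above are stored as `csr_matrix`, which keeps just the non-zero entries instead of the full grid.

```python
import numpy as np

# Made-up ratings for 3 users and 4 movies.
users = np.array([0, 0, 1, 2])
movies = np.array([0, 3, 1, 3])
ratings_toy = np.array([4, 5, 3, 2])

# Rows are users, columns are movies, zero means "not rated".
R_toy = np.zeros((3, 4), dtype=int)
R_toy[users, movies] = ratings_toy

# Fraction of cells that actually hold a rating.
density = np.count_nonzero(R_toy) / R_toy.size
print(R_toy)
print('{:.3f}'.format(density))  # 0.333
```

Even in this tiny example two thirds of the cells are empty, and the fraction of empty cells grows quickly with the number of users and movies.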

## Global, User, and Item Average

To start off, it's always a good idea to have some lower bound results to compare to. For user-item prediction tasks like this one, there are three that are commonly used: global average, user average, and item average. I tried them each below. It turns out that user average does quite a bit better than the other two, so I'll use this as a lower bound benchmark score.

In [None]:
global_average = np.mean(ratings_trn.rating)
pred = [global_average]*len(ratings_val)
rmse(pred,ratings_val.rating)

In [None]:
user_average = ratings_trn.groupby(['userId'])['rating'].mean()
# Fall back to the global average for users with no training ratings
pred = ratings_val.userId.map(user_average).fillna(global_average)
rmse(pred,ratings_val.rating)

In [None]:
item_avg_trn = ratings_trn.groupby(['movieId'])['rating'].mean()
item_avg = np.full(n_movies,global_average)
item_avg[item_avg_trn.index] = item_avg_trn.values
pred = ratings_val.apply(lambda row: item_avg[int(row.movieId)], axis=1)
rmse(pred,ratings_val.rating)
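
The fallback idea used for the item average (use the global average when a movie has no training ratings) applies to users too. Here is a minimal pure-Python sketch on made-up triples; `predict_user_avg` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import defaultdict

# Made-up (user, movie, rating) training triples.
train = [(0, 0, 4), (0, 1, 2), (1, 0, 5)]

global_avg = sum(r for _, _, r in train) / len(train)

# Accumulate the rating sum and count per user.
sums, counts = defaultdict(float), defaultdict(int)
for u, _, r in train:
    sums[u] += r
    counts[u] += 1
user_avg = {u: sums[u] / counts[u] for u in sums}

def predict_user_avg(u):
    """User average, or the global average for unseen users."""
    return user_avg.get(u, global_avg)

print(predict_user_avg(0))  # 3.0
print(predict_user_avg(1))  # 5.0
print(predict_user_avg(7))  # unseen user, falls back to 11/3
```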

## KNN

First up is a nearest neighbours model. This model handles both user-to-user and item-to-item collaborative filtering. I try out both and compare the results. In each case, the users or movies are mapped to a lower-rank vector representation using sklearn's `TruncatedSVD`. The measure used for determining the 'closeness' of users/movies is cosine similarity.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD

class KNN(object):
    def __init__(self, user_to_user=True):
        self.user_to_user = user_to_user
        
    def fit(self, ratings, n_components=10):
        # For user to user collaborative filtering,
        # make rows the user ratings.
        # For item to item collaborative filtering,
        # make rows the movie ratings.
        self.ratings = ratings if self.user_to_user else ratings.T
        
        # Boolean mask of the entries that hold a rating (non-zero)
        self.rated_idxs = self.ratings != 0

        # Compute the mean row ratings. Rows without any rating keep
        # a mean of zero (out= avoids uninitialised values there).
        sums = self.ratings.sum(axis=1)
        counts = self.rated_idxs.sum(axis=1)
        self.means = np.true_divide(sums, counts, where=counts!=0,
                                    out=np.zeros(len(counts)))
        
        # Center ratings about the mean
        ratings_centered = self.ratings - np.expand_dims(self.means, axis=1)
        ratings_centered[~self.rated_idxs] = 0
        
        # Perform dimensionality reduction with SVD.
        # Note: the SVD is fit on the raw ratings; ratings_centered
        # computed above is not used here.
        SVD = TruncatedSVD(n_components=n_components, random_state=17)
        collab_vectors = SVD.fit_transform(self.ratings)
        self.collab_vectors = sparse.csr_matrix(collab_vectors)

        # Some rows have no ratings, and therefore
        # a zero mean saved in self.means.
        # Replace those with the average mean.
        zero_mean_idxs = np.where(self.means==0)[0]
        sum_means = np.sum(self.means)
        count = len(self.means) - len(zero_mean_idxs)
        avg_means = sum_means / count
        self.means[zero_mean_idxs] = avg_means
        
    def get_similar(self, k, target_idx, about_idx):
        # Get rows with ratings at the about_idx column
        rated_idxs = np.where(self.rated_idxs[:,about_idx])[0]
        # If no row has a rating in that column, signal it to the caller
        if len(rated_idxs) == 0:
            raise ValueError('No similar users')
        # get the similarity values between target
        # and users who have seen the movie
        sims = cosine_similarity(self.collab_vectors[target_idx],
                                self.collab_vectors[rated_idxs])
        # get the k most similar users
        top_k = sorted(zip(sims.ravel(),rated_idxs))[-k:]
        top_k = [(sim,idx) for sim,idx in top_k if sim != 0]
        if not top_k:
            raise ValueError('No similar users')
        [sims,sim_idxs] = [i for i in zip(*top_k)]
        return list(sims), list(sim_idxs)
    
    def predict(self, k, user_idx, item_idx):
        # If user to user, then users are stored
        # in the rows and items in the columns.
        # Vice versa otherwise.
        if self.user_to_user:
            row, col = int(user_idx), int(item_idx)
        else:
            row, col = int(item_idx), int(user_idx)
        # Get the k most similar users 
        try:
            sims, sim_idxs = self.get_similar(k, row, col)
        except ValueError:
            # If there are no similar users
            # return the mean ratings
            rating = self.means[row]
        else:
            rating = np.sum(self.ratings[sim_idxs, col] * sims) / np.sum(sims)
        return rating
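
The prediction step boils down to a similarity-weighted average of the neighbours' ratings. Here is a stripped-down sketch of just that step, with made-up two-dimensional vectors standing in for the SVD output and a hand-written `cosine` helper (both are assumptions for illustration only).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up reduced vectors: one target and three neighbours
# that have all rated the item we want to predict.
target = np.array([1.0, 0.0])
neighbours = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
neighbour_ratings = np.array([5.0, 3.0, 1.0])

sims = np.array([cosine(target, v) for v in neighbours])

# Drop neighbours with zero similarity, then take the weighted average.
keep = sims != 0
pred = np.sum(neighbour_ratings[keep] * sims[keep]) / np.sum(sims[keep])
print('{:.3f}'.format(pred))  # 4.172
```

The identical neighbour (similarity 1) pulls the prediction towards its rating of 5, while the orthogonal neighbour is ignored entirely.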

### Item to Item Collaborative Filtering

Here I create the `KNN` model and fit it to the data. It takes a matrix representation of the ratings as input, so I feed in `R_trn` which we constructed above. `KNN.fit()` is doing all the heavy lifting of preparing the data for predictions later. It normalizes the matrix, deals with empty columns/rows, and performs the dimensionality reduction. The `user_to_user` parameter controls whether or not the model is doing user-to-user or item-to-item collaborative filtering.

In [None]:
knn = KNN(user_to_user=False)
knn.fit(R_trn.toarray())

In [None]:
k = 25 # number of neighbours to compare to
pred = ratings_val.apply(lambda row: knn.predict(k,row.userId,row.movieId), axis=1)
rmse(pred,ratings_val.rating)

In [None]:
k = 50 # number of neighbours to compare to
pred = ratings_val.apply(lambda row: knn.predict(k,row.userId,row.movieId), axis=1)
rmse(pred,ratings_val.rating)

In [None]:
k = 237 # number of neighbours to compare to
pred = ratings_val.apply(lambda row: knn.predict(k,row.userId,row.movieId), axis=1)
rmse(pred,ratings_val.rating)

### User to User Collaborative Filtering

Same thing as above but using user to user collaborative filtering now.

In [None]:
knn = KNN(user_to_user=True)
knn.fit(R_trn.toarray())

In [None]:
k = 25 # number of neighbours to compare to
pred = ratings_val.apply(lambda row: knn.predict(k,row.userId,row.movieId), axis=1)
rmse(pred,ratings_val.rating)

In [None]:
k = 50 # number of neighbours to compare to
pred = ratings_val.apply(lambda row: knn.predict(k,row.userId,row.movieId), axis=1)
rmse(pred,ratings_val.rating)

## Setup for PyTorch Models

The rest of the models are coded up using PyTorch. I start by importing everything I need and creating dataloaders for the training and validation sets.

In [9]:
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [30]:
# Training set dataloader
m = torch.from_numpy(ratings_trn.movieId.values).long()
u = torch.from_numpy(ratings_trn.userId.values).long()
y = torch.from_numpy(ratings_trn.rating.values).view(-1,1).float()
x = torch.stack([u,m],dim=1)
dataset = TensorDataset(x,y)
trainloader = DataLoader(dataset, batch_size=64, 
                         shuffle=True, num_workers=2)
# Validation set dataloader
m = torch.from_numpy(ratings_val.movieId.values).long()
u = torch.from_numpy(ratings_val.userId.values).long()
y = torch.from_numpy(ratings_val.rating.values).view(-1,1).float()
x = torch.stack([u,m],dim=1)
dataset = TensorDataset(x,y)
validloader = DataLoader(dataset, batch_size=64, 
                         shuffle=True, num_workers=2)

I also created some helper classes and functions to streamline the training process. I didn't include them in this notebook to keep things a little cleaner, and they're not necessary for understanding the models. But if you want to dig into those, you can find them in the same [repo](https://github.com/TipTop314/movie-lens) as this notebook. 

In [None]:
from learning import ModelOptimizer, CosAnneal, fit_model

`ModelOptimizer` wraps around a PyTorch optimizer. It's mostly just bookkeeping.<br>
`CosAnneal` is a learning rate scheduler which adapts the learning rate during training.<br>
`fit_model` takes the model and performs however many epochs of training you tell it to.<br>
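
The implementation of `CosAnneal` isn't shown in this notebook. Assuming it follows the usual cosine annealing schedule, the learning rate at a given step could be sketched as below. The function name, its parameters, and the default values are my own and hypothetical; the actual helper in the repo may differ.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=1e-2, lr_min=0.0):
    """Hypothetical cosine schedule: decays from lr_max to lr_min."""
    progress = step / total_steps  # goes from 0.0 to 1.0
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 100
lrs = [cosine_annealed_lr(s, total) for s in (0, 50, 100)]
print(lrs)  # starts at lr_max, passes roughly lr_max / 2, ends at lr_min
```

The schedule changes slowly at the start and end and fastest in the middle, which is the usual motivation for using a cosine shape.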

## Baseline

This model starts with a global average rating across all movies and users. During training, the model learns baseline offsets for each user and movie. These offsets represent how far the ratings of a particular user/movie tend to be away from the global average. The sum of the global average, user baseline, and movie baseline is the final prediction.

In [None]:
class Baseline(nn.Module):
    def __init__(self, mu, n_users, n_movies):
        super(Baseline, self).__init__()
        # Global average rating: a fixed constant, not a learned parameter
        self.register_buffer('mu', torch.tensor([float(mu)]))
        self.bu = nn.Parameter(torch.zeros(n_users))
        self.bi = nn.Parameter(torch.zeros(n_movies))
        
    def forward(self, userId, movieId):
        out = self.mu + self.bu[userId] + self.bi[movieId]
        return out
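
The forward pass is just indexing and addition. Here it is in plain NumPy with made-up offsets, so the numbers can be checked by hand (all values below are invented for illustration).

```python
import numpy as np

mu = 3.5                          # global average rating
bu = np.array([0.3, -0.5])        # one offset per user
bi = np.array([0.2, 0.0, -1.0])   # one offset per movie

# Three (user, movie) pairs to predict.
user_idx = np.array([0, 1, 1])
movie_idx = np.array([2, 0, 1])

# Prediction = global average + user offset + movie offset.
pred = mu + bu[user_idx] + bi[movie_idx]
print(pred)  # approximately [2.8 3.2 3.0]
```

User 0 rates a little above average (+0.3), but movie 2 is rated well below average (-1.0), so the first prediction lands at about 2.8.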

In [None]:
# Build model
mu = ratings_trn.rating.mean()
model = Baseline(mu, n_users, n_movies)
# Instantiate optimizer and learning rate scheduler
#opt = ModelOptimizer(optim.Adam, model, lr=1e-2, wd=2e-4)
#sched = CosAnneal(opt, len(trainloader))

In [29]:
opt = optim.Adam(lr=1e-2, weight_decay=2e-4, params=model.parameters())
loss_fn = torch.nn.MSELoss()

epochs = 20
for epoch in range(epochs):
    for i, data in enumerate(trainloader):
        inputs, labels = data
        # Column 0 holds the user ids and column 1 the movie ids.
        # Reshaping to (batch, 1) makes the output match the labels' shape.
        outputs = model(inputs[:,0].reshape(-1,1), inputs[:,1].reshape(-1,1))
        opt.zero_grad()
        loss = loss_fn(outputs, labels)
        loss.backward()
        opt.step()


In [None]:
# get prediction and score
u = Variable(torch.LongTensor(ratings_val.userId.values))
m = Variable(torch.LongTensor(ratings_val.movieId.values))
pred_val = model(u, m).data.numpy()
rmse(pred_val, ratings_val.rating.values)

## SVD

This model learns a vector representation for each user and movie. The idea here is that we can learn abstract features for the movies, and learn each user's preference for each of those features. If we know the type of movie a user likes (represented by a vector) and how much each movie corresponds to each movie type (also represented by a vector), these may be useful for predicting a user's preference for movies they haven't seen.

In [18]:
def get_emb(n_embeds, embed_size):
    embed = nn.Embedding(n_embeds, embed_size)
    embed.weight.data.uniform_(-0.05,0.05)
    return embed

class SVD(nn.Module):
    def __init__(self, n_users, n_movies, r_min, r_max, n_factors=150):
        super().__init__()
        self.u = get_emb(n_users, n_factors)
        self.m = get_emb(n_movies, n_factors)
        self.ub = nn.Parameter(torch.zeros(n_users))
        self.mb = nn.Parameter(torch.zeros(n_movies))
        self.r_min = r_min
        self.r_max = r_max
        
    def forward(self, user_idxs, movie_idxs):
        um = (self.u(user_idxs)*self.m(movie_idxs)).sum(1)
        r = um + self.ub[user_idxs] + self.mb[movie_idxs]
        return F.sigmoid(r) * (self.r_max - self.r_min) + self.r_min
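
Written out for a single user and movie, the forward pass is a dot product plus two biases, squashed into the rating range. Here is a NumPy sketch with made-up vectors (three factors instead of 150, and values chosen so the result is easy to verify).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

r_min, r_max = 1.0, 5.0

# Made-up learned parameters for one user and one movie.
user_vec = np.array([0.5, -1.0, 0.0])
movie_vec = np.array([2.0, 1.0, 3.0])
user_bias, movie_bias = 0.1, -0.1

# Dot product of the factors plus the two biases:
# (1.0 - 1.0 + 0.0) + 0.1 - 0.1 = 0.0
raw = user_vec @ movie_vec + user_bias + movie_bias

# sigmoid(0) = 0.5, which maps to the middle of the rating range.
rating = sigmoid(raw) * (r_max - r_min) + r_min
print('{:.1f}'.format(rating))  # 3.0
```

The scaled sigmoid guarantees that predictions stay strictly between `r_min` and `r_max`, whatever the raw score is.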

In [19]:
# Build model
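# Cast the rating bounds to Python floats: multiplying a tensor by a
# numpy.int64 can raise a TypeError in mul().
r_min, r_max = float(ratings_trn.rating.min()), float(ratings_trn.rating.max())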
model = SVD(n_users, n_movies, r_min, r_max)
# Instantiate optimizer and learning rate scheduler
#opt = ModelOptimizer(optim.Adam, model, lr=1e-2, wd=2e-4)
#sched = CosAnneal(opt, len(trainloader))

In [36]:
opt = optim.Adam(lr=1e-2, weight_decay=2e-4, params=model.parameters())
loss_fn = torch.nn.MSELoss()

epochs = 20
for epoch in range(epochs):
    for i, data in enumerate(trainloader):
        inputs, labels = data
        # nn.Embedding needs Long indices, so pass the id columns as-is
        # (casting them with .float() raises a RuntimeError).
        outputs = model(inputs[:,0], inputs[:,1])
        opt.zero_grad()
        # The model returns shape (batch,) while labels are (batch, 1);
        # flatten the labels so MSELoss does not broadcast them.
        loss = loss_fn(outputs, labels.view(-1))
        loss.backward()
        opt.step()

In [None]:
# get prediction and score
u = Variable(torch.LongTensor(ratings_val.userId.values))
m = Variable(torch.LongTensor(ratings_val.movieId.values))
pred_val = model(u, m).data.numpy()
rmse(pred_val, ratings_val.rating.values)

## SVDnet

This model just builds off the same idea as SVD. It starts by learning vector representations for the users and movies, but instead of multiplying the vectors together, we feed them into a neural network. Once again, we're trying to learn useful user and movie vectors, but now we're also learning a network which can take those vectors as input and output movie ratings. This whole stack can be optimized through backpropagation.

In [None]:
class SVDNet(nn.Module):
    def __init__(self, n_users, n_movies, r_min, r_max,
                 n_factors=50, nh=10, p1=0.05, p2=0.5):
        super().__init__()
        self.r_min = r_min
        self.r_max = r_max
        # User and Movie Embeddings
        self.u = get_emb(n_users, n_factors)
        self.m = get_emb(n_movies, n_factors)
        # Network layers
        self.lin1 = nn.Linear(n_factors*2, nh)
        self.lin2 = nn.Linear(nh, 1)
        self.drop1 = nn.Dropout(p1)
        self.drop2 = nn.Dropout(p2)

    def forward(self, user_idxs, movie_idxs):
        # concatenate user and movie embeddings
        x = torch.cat([self.u(user_idxs), self.m(movie_idxs)], dim=1)
        # feed through network
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.drop2(x)
        # force output to be within the ratings range
        out = F.sigmoid(self.lin2(x)) * (self.r_max - self.r_min) + self.r_min
        return out
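
To see how the shapes flow through the network, here is the same forward pass in NumPy with random stand-in weights. Dropout is left out (as it would be at evaluation time), and the sizes and weights are arbitrary choices for illustration, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
r_min, r_max = 1.0, 5.0
n_factors, nh, batch = 4, 3, 5

# Stand-ins for the looked-up embeddings: one row per example.
user_emb = rng.normal(size=(batch, n_factors))
movie_emb = rng.normal(size=(batch, n_factors))

# Stand-ins for the two linear layers.
W1, b1 = rng.normal(size=(2 * n_factors, nh)), np.zeros(nh)
W2, b2 = rng.normal(size=(nh, 1)), np.zeros(1)

x = np.concatenate([user_emb, movie_emb], axis=1)  # (batch, 2*n_factors)
h = np.maximum(x @ W1 + b1, 0)                     # ReLU, (batch, nh)
logits = h @ W2 + b2                               # (batch, 1)
out = 1.0 / (1.0 + np.exp(-logits)) * (r_max - r_min) + r_min

print(x.shape, h.shape, out.shape)  # (5, 8) (5, 3) (5, 1)
```

Note that the index tensors need to be one-dimensional for this to work: with indices of shape `(batch, 1)` the embeddings come out as `(batch, 1, n_factors)` and the concatenation no longer matches the input size of the first linear layer.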

In [None]:
# Build model
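# Cast the rating bounds to Python floats: multiplying a tensor by a
# numpy.int64 can raise a TypeError in mul().
r_min, r_max = float(ratings_trn.rating.min()), float(ratings_trn.rating.max())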
model = SVDNet(n_users, n_movies, r_min, r_max, nh=10)
# Instantiate optimizer and learning rate scheduler

In [None]:
opt = optim.Adam(lr=1e-2, weight_decay=2e-4, params=model.parameters())
loss_fn = torch.nn.MSELoss()

epochs = 20
for epoch in range(epochs):
    for i, data in enumerate(trainloader):
        inputs,labels = data
        # Pass 1-D Long index tensors so the concatenated embeddings
        # have shape (batch, 2*n_factors), as lin1 expects.
        outputs = model(inputs[:,0], inputs[:,1])
        opt.zero_grad()
        loss = loss_fn(outputs, labels)
        loss.backward()
        opt.step()

In [None]:
# get prediction and score
model.eval()  # switch off dropout for evaluation
u = torch.LongTensor(ratings_val.userId.values)
m = torch.LongTensor(ratings_val.movieId.values)
with torch.no_grad():
    pred_val = model(u, m).numpy()
rmse(pred_val, ratings_val.rating.values)

## Results

| Model | RMSE |
|  -- | -- |
| UserAvg | 0.961 |
| Benchmark | 0.899 |
| kNN | 0.912 |
| Baseline | 0.894 |
| SVD | 0.881 |
| SVDnet | 0.890 |

All the models performed better than a blanket prediction of the user average, so that's a good start! The first thing I noticed was the clear gap between `kNN` and the last three models, which all learn user- and movie-specific representations (note that `Baseline` also learns a representation, it's just a scalar one). `SVDnet` performed best on some validation sets (not shown here) but not on others. Even when it did manage to perform better, it often took some hyperparameter tweaking to get it there. So plain `SVD` wins this round. Not only did it achieve the lowest RMSE, but it was also the most consistent, with a simple and clean architecture to boot!