# Collaborative filtering
-----

In this example, we'll build a quick explicit feedback recommender system: that is, a model that takes into account explicit feedback signals (like ratings) to recommend new content.

We'll use an approach first made popular by the [Netflix prize](http://www.netflixprize.com/) contest: [matrix factorization](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf). 

The basic idea is very simple:

1. Start with user-item-rating triplets, conveying the information that user _i_ gave some item _j_ rating _r_.
2. Represent both users and items as fixed-length vectors of numbers (latent factors). For example, a user could be represented by `[0.3, -1.2, 0.5]` and an item by `[1.0, -0.3, -0.6]`.
3. The representations should be chosen so that, when they are multiplied together (via [dot products](https://en.wikipedia.org/wiki/Dot_product)), we recover the original ratings.
4. The utility of the model is then derived from the fact that if we multiply the user vector of a user with the item vector of some item they _have not_ rated, we hope to obtain a prediction for the rating they would have given it had they seen it.
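
As a minimal sketch of steps 2 to 4, the dot product of the two example vectors above can be computed with NumPy. The stacked matrices and the extra "users" and "items" in them are hypothetical and only there to show that all predictions can be obtained with one matrix product.

```python
import numpy as np

# Example vectors taken from the list above (3 latent dimensions).
user_vector = np.array([0.3, -1.2, 0.5])
item_vector = np.array([1.0, -0.3, -0.6])

# The predicted rating for this (user, item) pair is the dot product.
predicted_rating = float(np.dot(user_vector, item_vector))

# Stacking vectors into matrices gives every prediction at once:
# rows index users, columns index items (hypothetical extra rows).
user_matrix = np.stack([user_vector, -user_vector])
item_matrix = np.stack([item_vector, 2 * item_vector])
all_predictions = user_matrix @ item_matrix.T  # shape (2, 2)

print(predicted_rating)
```

In a trained model the vectors are learned so that these products approximate the observed ratings; here they are just the illustrative numbers from the text.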

![collaborative filtering](matrix_factorization.png)
Source: [ampcamp.berkeley](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html)

## 1. Preparations

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os.path as op

from zipfile import ZipFile
try:
    from urllib.request import urlretrieve
except ImportError:  # Python 2 compat
    from urllib import urlretrieve

# change this path to a folder on your machine:
data_folder = '/home/lelarge/data/'

ML_100K_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
ML_100K_FILENAME = op.join(data_folder,ML_100K_URL.rsplit('/', 1)[1])
ML_100K_FOLDER = op.join(data_folder,'ml-100k')

We start by importing a famous dataset, the [Movielens 100k dataset](https://grouplens.org/datasets/movielens/100k/). It contains 100,000 ratings (between 1 and 5) given to 1682 movies by 943 users:

In [None]:
if not op.exists(ML_100K_FILENAME):
    print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))
    urlretrieve(ML_100K_URL, ML_100K_FILENAME)

if not op.exists(ML_100K_FOLDER):
    print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))
    ZipFile(ML_100K_FILENAME).extractall(data_folder)

For other datasets, see [Movielens](https://grouplens.org/datasets/movielens/).

Possible software for benchmarking: [Lenskit](http://lenskit.org/)

## 2. Data analysis and formatting

[Python Data Analysis Library](http://pandas.pydata.org/)

In [None]:
import pandas as pd

all_ratings = pd.read_csv(op.join(ML_100K_FOLDER, 'u.data'), sep='\t',
                          names=["user_id", "item_id", "ratings", "timestamp"])
all_ratings.head()

Let's check out a few macro-stats of our dataset

In [None]:
#number of entries
len(all_ratings)

In [None]:
all_ratings['ratings'].describe()

In [None]:
# number of unique rating values
len(all_ratings['ratings'].unique())

In [None]:
all_ratings['user_id'].describe()

In [None]:
# number of unique users
total_user_id = len(all_ratings['user_id'].unique())
print(total_user_id)

In [None]:
all_ratings['item_id'].describe()

In [None]:
# number of unique rated items
total_item_id = len(all_ratings['item_id'].unique())
print(total_item_id)

For splitting the data into _train_ and _test_ we'll be using a pre-defined function from [scikit-learn](http://scikit-learn.org/stable/)

In [None]:
from sklearn.model_selection import train_test_split

ratings_train, ratings_test = train_test_split(
    all_ratings, test_size=0.2, random_state=42)

user_id_train = ratings_train['user_id']
item_id_train = ratings_train['item_id']
rating_train = ratings_train['ratings']

user_id_test = ratings_test['user_id']
item_id_test = ratings_test['item_id']
rating_test = ratings_test['ratings']

In [None]:
len(user_id_train)

In [None]:
len(user_id_train.unique())

In [None]:
len(item_id_train.unique())

We see that not all the movies are rated in the train set.
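
To make this concrete, the items that appear only in the test split can be listed with `np.setdiff1d`. The sketch below uses small hypothetical arrays in place of `item_id_train` and `item_id_test`, so the values are illustrative only.

```python
import numpy as np

# Toy item ids standing in for item_id_train / item_id_test (hypothetical values).
toy_item_id_train = np.array([1, 2, 2, 5, 7])
toy_item_id_test = np.array([2, 3, 7, 9])

# Items that occur in the test split but never in the train split.
# setdiff1d returns the sorted unique values of the first array
# that are absent from the second.
unseen_items = np.setdiff1d(toy_item_id_test, toy_item_id_train)
print(unseen_items)
```

For such items the model has no training signal, so their embeddings stay at their initial values.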

In [None]:
user_id_train.iloc[:5]

In [None]:
item_id_train.iloc[:5]

In [None]:
rating_train.iloc[:5]

## 3. The model

We can feed our dataset to the `FactorizationModel` class - a sklearn-like object that allows us to train and evaluate the explicit factorization models.

Internally, the model uses the `DotModel` class to represent users and items. It is composed of four `embedding` layers:

- a `(num_users x latent_dim)` embedding layer to represent users,
- a `(num_items x latent_dim)` embedding layer to represent items,
- a `(num_users x 1)` embedding layer to represent user biases, and
- a `(num_items x 1)` embedding layer to represent item biases.
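
One possible way to combine these four layers is sketched below. It is not the notebook's `DotModel`: the class name `PlainDotModel` is hypothetical, and plain `nn.Embedding` layers are assumed in place of the `ScaledEmbedding` and `ZeroEmbedding` helpers from `torch_utils.py`.

```python
import torch
import torch.nn as nn


class PlainDotModel(nn.Module):
    """Hypothetical stand-in using plain nn.Embedding layers."""

    def __init__(self, num_users, num_items, embedding_dim=32):
        super().__init__()
        self.user_embeddings = nn.Embedding(num_users, embedding_dim)
        self.item_embeddings = nn.Embedding(num_items, embedding_dim)
        self.user_biases = nn.Embedding(num_users, 1)
        self.item_biases = nn.Embedding(num_items, 1)

    def forward(self, user_ids, item_ids):
        user_vecs = self.user_embeddings(user_ids)   # (batch, dim)
        item_vecs = self.item_embeddings(item_ids)   # (batch, dim)
        # Row-wise dot product between user and item vectors.
        dot = (user_vecs * item_vecs).sum(dim=1)     # (batch,)
        user_bias = self.user_biases(user_ids).squeeze(1)
        item_bias = self.item_biases(item_ids).squeeze(1)
        return dot + user_bias + item_bias


toy_net = PlainDotModel(num_users=10, num_items=20, embedding_dim=4)
scores = toy_net(torch.tensor([1, 2, 3, 1]), torch.tensor([1, 3, 4, 1]))
print(scores.shape)
```

The first and last pairs in the toy batch are identical, so they receive the same score; the biases let the model capture users who rate generously and items that are rated highly regardless of who rates them.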

In [None]:
import torch.nn as nn
import torch

Let's generate [Embeddings](http://pytorch.org/docs/master/nn.html#embedding) for the users, _i.e._ a fixed-sized vector describing the user

In [None]:
embedding_dim = 3
embedding_user = nn.Embedding(total_user_id, embedding_dim)
input = torch.LongTensor([[1,2,4,5],[4,3,2,0]])
embedding_user(input)

Make sure to check out the `torch_utils.py` file to find the helper functions used in this notebook.

In [None]:
import imp
import torch_utils; imp.reload(torch_utils)

from torch_utils import ScaledEmbedding, ZeroEmbedding

class DotModel(nn.Module):
    
    def __init__(self,
                 num_users,
                 num_items,
                 embedding_dim=32):
        
        super(DotModel, self).__init__()
        
        self.embedding_dim = embedding_dim
        
        self.user_embeddings = ScaledEmbedding(num_users, embedding_dim)
        self.item_embeddings = ScaledEmbedding(num_items, embedding_dim)
        self.user_biases = ZeroEmbedding(num_users, 1)
        self.item_biases = ZeroEmbedding(num_items, 1)
                
        
    def forward(self, user_ids, item_ids):
        
        #
        # your code here
        #

        return dot + user_bias + item_bias


In [None]:
dot_net = DotModel(total_user_id,total_item_id)

In [None]:
user_id_tensor = torch.from_numpy(np.array([1,2,3,1]))
item_id_tensor = torch.from_numpy(np.array([1,3,4,1]))

In [None]:
dot_net(user_id_tensor,item_id_tensor)

In [None]:
import imp
import numpy as np

import torch.optim as optim

import torch_utils; imp.reload(torch_utils)
from torch_utils import gpu, minibatch, shuffle, regression_loss

class FactorizationModel(object):
    
    def __init__(self,
                 embedding_dim=32,
                 n_iter=10,
                 batch_size=256,
                 l2=0.0,
                 learning_rate=1e-2,
                 use_cuda=False,
                 net=None,
                 num_users=None,
                 num_items=None, 
                 random_state=None):
        
        self._embedding_dim = embedding_dim
        self._n_iter = n_iter
        self._learning_rate = learning_rate
        self._batch_size = batch_size
        self._l2 = l2
        self._use_cuda = use_cuda
        
        self._num_users = num_users
        self._num_items = num_items
        self._net = net
        self._optimizer = None
        self._loss_func = None
        self._random_state = random_state or np.random.RandomState()
             
        
    def _initialize(self):
        if self._net is None:
            self._net = gpu(DotModel(self._num_users, self._num_items, self._embedding_dim),self._use_cuda)
        
        self._optimizer = optim.Adam(
                self._net.parameters(),
                lr=self._learning_rate,
                weight_decay=self._l2
            )
        
        self._loss_func = regression_loss
    
    @property
    def _initialized(self):
        return self._optimizer is not None
    
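    def __repr__(self):
        # _repr_model is not defined in this notebook, so build the string directly.
        return '<{}: {}>'.format(self.__class__.__name__, repr(self._net))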
    
    def fit(self, user_ids, item_ids, ratings, verbose=True):
        
        user_ids = user_ids.astype(np.int64)
        item_ids = item_ids.astype(np.int64)
        
        if not self._initialized:
            self._initialize()
            
        for epoch_num in range(self._n_iter):
            users, items, ratingss = shuffle(user_ids,
                                            item_ids,
                                            ratings)

            user_ids_tensor = gpu(torch.from_numpy(users),
                                  self._use_cuda)
            item_ids_tensor = gpu(torch.from_numpy(items),
                                  self._use_cuda)
            ratings_tensor = gpu(torch.from_numpy(ratingss),
                                 self._use_cuda)
            epoch_loss = 0.0

            for (minibatch_num,
                 (batch_user,
                  batch_item,
                  batch_rating)) in enumerate(minibatch(self._batch_size,
                                                         user_ids_tensor,
                                                         item_ids_tensor,
                                                         ratings_tensor)):
                
                
                # to be completed...
                predictions = 
                #
                loss = 
                epoch_loss = 
                #
                #
                
            
            epoch_loss = epoch_loss / (minibatch_num + 1)

            if verbose:
                print('Epoch {}: loss {}'.format(epoch_num, epoch_loss))
        
            if np.isnan(epoch_loss) or epoch_loss == 0.0:
                raise ValueError('Degenerate epoch loss: {}'
                                 .format(epoch_loss))
    
    
    def test(self,user_ids, item_ids, ratings):
        self._net.train(False)
        user_ids = user_ids.astype(np.int64)
        item_ids = item_ids.astype(np.int64)
        
        user_ids_tensor = gpu(torch.from_numpy(user_ids),
                                  self._use_cuda)
        item_ids_tensor = gpu(torch.from_numpy(item_ids),
                                  self._use_cuda)
        ratings_tensor = gpu(torch.from_numpy(ratings),
                                 self._use_cuda)
               
        predictions = self._net(user_ids_tensor, item_ids_tensor)
        
        loss = self._loss_func(ratings_tensor, predictions)
        return loss.data.item()
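
The general shape of a minibatch training loop with Adam is sketched below on a toy problem. Everything in it is an assumption made for illustration: the linear model, the synthetic data, the `mse` helper and the batch size are hypothetical and stand in for the notebook's network, ratings and `regression_loss`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: a linear model fitted to synthetic targets.
net = nn.Linear(3, 1)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
features = torch.randn(256, 3)
targets = features @ torch.tensor([[1.0], [-2.0], [0.5]])


def mse(observed, predicted):
    # Mean squared error, used here as a stand-in regression loss.
    return ((observed - predicted) ** 2).mean()


epoch_losses = []
for epoch in range(20):
    epoch_loss = 0.0
    num_batches = 0
    for start in range(0, len(features), 64):
        batch_x = features[start:start + 64]
        batch_y = targets[start:start + 64]

        predictions = net(batch_x)        # forward pass
        optimizer.zero_grad()             # clear gradients from the last step
        loss = mse(batch_y, predictions)
        loss.backward()                   # backpropagate
        optimizer.step()                  # update the parameters

        epoch_loss += loss.item()
        num_batches += 1
    epoch_losses.append(epoch_loss / num_batches)

print(epoch_losses[0], epoch_losses[-1])
```

The same four steps (forward pass, zeroing gradients, backward pass, optimizer step) apply whatever the network is; only the inputs and the loss function change.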

In [None]:
model = FactorizationModel(embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   learning_rate=1e-3,
                                   l2=1e-9,  # strength of L2 regularization
                                   use_cuda=torch.cuda.is_available(),
                                   num_users=total_user_id+1,
                                   num_items=total_item_id+1)

In [None]:
# Series.as_matrix() was removed from pandas; .values works across versions.
user_ids_train_np = user_id_train.values.astype(np.int32)
item_ids_train_np = item_id_train.values.astype(np.int32)
ratings_train_np = rating_train.values.astype(np.float32)

In [None]:
model.fit(user_ids_train_np, item_ids_train_np, ratings_train_np)

In [None]:
print(model._net)

In [None]:
# Series.as_matrix() was removed from pandas; .values works across versions.
user_ids_test_np = user_id_test.values.astype(np.int64)
item_ids_test_np = item_id_test.values.astype(np.int64)
ratings_test_np = rating_test.values.astype(np.float32)
model.test(user_ids_test_np, item_ids_test_np, ratings_test_np)

It looks like we are already overfitting...

## 4. Analysing and interpreting the results

In [None]:
# .cpu() is needed before .numpy() when the model was trained on a GPU
user_emb_np = model._net.user_embeddings.weight.data.cpu().numpy()

In [None]:
item_emb_np = model._net.item_embeddings.weight.data.cpu().numpy()

[How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)

In [None]:
from sklearn.manifold import TSNE

item_tsne = TSNE(perplexity=30).fit_transform(item_emb_np)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.scatter(item_tsne[:, 0], item_tsne[:, 1]);
plt.xticks(()); plt.yticks(());
plt.show()

Getting the name of the movies (there must be a better way, please provide alternate solutions!)

In [None]:
# u.item has 24 '|'-separated columns; only the first two (id and title) are needed.
df = pd.read_csv(op.join(ML_100K_FOLDER, 'u.item'), sep='|', header=None,
                 usecols=[0, 1], names=['item_id', 'item_name'],
                 encoding="ISO-8859-1")
movies_names = df.loc[:,['item_id', 'item_name']]
movies_names = movies_names.set_index(['item_id'])
movies_names.head()

In [None]:
item_bias_np = model._net.item_biases.weight.data.cpu().numpy()

In [None]:
movies_names['biases'] = pd.Series(item_bias_np[1:].T[0], index=movies_names.index)

In [None]:
movies_names.head()

In [None]:
movies_names.shape

In [None]:
indices_item_train = np.sort(item_id_train.unique())
movies_names = movies_names.loc[indices_item_train]
movies_names.shape

In [None]:
movies_names = movies_names.sort_values(ascending=False,by=['biases'])

Best movies

In [None]:
movies_names.head(10)

Worst movies

In [None]:
movies_names.tail(10)

## 5. SPOTLIGHT

The code written above is a simplified version of [SPOTLIGHT](https://github.com/maciejkula/spotlight)

Once you have installed it with `conda install -c maciejkula -c pytorch spotlight=0.1.5`, you can compare the results.

In [None]:
from spotlight.datasets.movielens import get_movielens_dataset

dataset = get_movielens_dataset(variant='100K')

from spotlight.cross_validation import random_train_test_split

train, test = random_train_test_split(dataset, random_state=np.random.RandomState(42))

In [None]:
model = FactorizationModel(embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   learning_rate=1e-3,
                                   l2=1e-9,  # strength of L2 regularization
                                   use_cuda=torch.cuda.is_available(),
                                   num_users=total_user_id+1,
                                   num_items=total_item_id+1)

In [None]:
model.fit(train.user_ids,train.item_ids,train.ratings)

In [None]:
import torch

from spotlight.factorization.explicit import ExplicitFactorizationModel

model_spot = ExplicitFactorizationModel(loss='regression',
                                   embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   l2=1e-9,  # strength of L2 regularization
                                   learning_rate=1e-3,
                                   use_cuda=torch.cuda.is_available())

In [None]:
model_spot.fit(train, verbose=True)

In [None]:
# .cpu() is needed before .numpy() when the model was trained on a GPU
item_emb_spot_np = model_spot._net.item_embeddings.weight.data.cpu().numpy()
item_bias_spot_np = model_spot._net.item_biases.weight.data.cpu().numpy()

In [None]:
movies_names = df.loc[:,['item_id', 'item_name']]
movies_names = movies_names.set_index(['item_id'])
movies_names['biases_S'] = pd.Series(item_bias_spot_np[1:].T[0], index=movies_names.index)

In [None]:
indices_item_train = np.sort(item_id_train.unique())
movies_names = movies_names.loc[indices_item_train]
movies_names.shape

In [None]:
movies_names = movies_names.sort_values(ascending=False,by=['biases_S'])

In [None]:
movies_names.head(10)

In [None]:
movies_names.tail(10)

## 6. Further Analysis

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
movie_pca = pca.fit(item_emb_np.T).components_

In [None]:
fac0 = movie_pca[0]

In [None]:
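# fac0 is indexed by item id (row of the embedding matrix), so select the train ids explicitly
item_comp = [(f, movies_names.loc[i,'item_name']) for f,i in zip(fac0[indices_item_train], indices_item_train)]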

In [None]:
import operator
sorted(item_comp, key=operator.itemgetter(0))[:10]

In [None]:
sorted(item_comp, key=operator.itemgetter(0), reverse=True)[:10]

In [None]:
fac1 = movie_pca[1]
# fac1 is indexed by item id, so select the train ids explicitly
item_comp = [(f, movies_names.loc[i,'item_name']) for f,i in zip(fac1[indices_item_train], indices_item_train)]

In [None]:
sorted(item_comp, key=operator.itemgetter(0))[:10]

In [None]:
sorted(item_comp, key=operator.itemgetter(0), reverse=True)[:10]

In [None]:
fac2 = movie_pca[2]

In [None]:
start=50; end=100
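# index the factors by item id so that points and labels refer to the same movies
X = fac0[indices_item_train[start:end]]
Y = fac2[indices_item_train[start:end]]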
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
for i, x, y in zip(indices_item_train[start:end], X, Y):
    plt.text(x,y,movies_names.loc[i,'item_name'], color=np.random.rand(3)*0.7, fontsize=14)
plt.show()

## Exercise

The previous analysis is not very convincing. Redo it with the dataset ml-latest-small.zip.

This dataset contains many more movies and fewer users. Each user rated at least 20 movies.

To make the analysis of the factors more interesting, restrict it to the top 2000 most popular movies.

You should obtain something like:

![PCA movies](pca_movies.png)
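
The restriction to the most popular movies suggested in this exercise can be sketched with pandas as follows. The data frame is a hypothetical toy example, and the column names follow this notebook (`user_id`, `item_id`, `ratings`), which is an assumption: the files in ml-latest-small may use different names.

```python
import pandas as pd

# Hypothetical toy ratings; column names follow this notebook.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "item_id": [10, 20, 10, 30, 10, 20, 40, 10],
    "ratings": [4, 3, 5, 2, 4, 4, 1, 5],
})

top_n = 2  # the exercise suggests 2000 on the real data

# value_counts sorts items by number of ratings, most rated first.
popular_items = toy_ratings["item_id"].value_counts().head(top_n).index

# Keep only the ratings given to those popular items.
popular_ratings = toy_ratings[toy_ratings["item_id"].isin(popular_items)]
print(popular_ratings)
```

Popularity is measured here by the number of ratings per movie; other definitions (for example mean rating with a minimum count) would need a different aggregation.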

## 7. Neural Net

In [None]:
import models; imp.reload(models)
from models import DeepModel
import ExplicitFactorizationModel; imp.reload(ExplicitFactorizationModel)
from ExplicitFactorizationModel import ExplicitFactorizationModel
from torch_utils import l1_loss

model = ExplicitFactorizationModel(embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   learning_rate=1e-3,
                                   l2=1e-9,  # strength of L2 regularization
                                   use_cuda=torch.cuda.is_available(),
                                   num_users=total_user_id+1,
                                   num_items=total_item_id+1,
                                  loss = l1_loss)

In [None]:
model.fit(user_ids_train_np, item_ids_train_np, ratings_train_np,
          user_ids_test_np, item_ids_test_np, ratings_test_np)

In [None]:
model_deep = ExplicitFactorizationModel(embedding_dim=128,  # latent dimensionality
                                   n_iter=10,  # number of epochs of training
                                   batch_size=1024,  # minibatch size
                                   learning_rate=1e-4,
                                   l2=1e-9,  # strength of L2 regularization
                                   use_cuda=torch.cuda.is_available(),
                                   num_users=total_user_id+1,
                                   num_items=total_item_id+1,
                                   net=DeepModel(total_user_id+1,total_item_id+1,128),
                                       loss=l1_loss)

In [None]:
model_deep.fit(user_ids_train_np, item_ids_train_np, ratings_train_np,
          user_ids_test_np, item_ids_test_np, ratings_test_np)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

test_preds = model.predict(user_ids_test_np, item_ids_test_np)
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))

In [None]:
test_preds_deep = model_deep.predict(user_ids_test_np, item_ids_test_np)
print("Final test MSE: %0.3f" % mean_squared_error(test_preds_deep, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds_deep, rating_test))

In [None]:
print(model_deep._net)

## Exercise

Add another layer and compare.

### Finding most similar items
Finding k most similar items to a point in embedding space

- Write in numpy a function to compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between two points in embedding space
- Write a function which computes the euclidean distance between a point in embedding space and all other points
- Write a most similar function, which returns the k item names with lowest euclidean distance / highest cosine similarity.
- Try with a movie index, such as 181 (Return of the Jedi). What do you observe? Don't expect miracles on such a small training set but still give your results on the forum!

Notes:
- you may use `np.linalg.norm` to compute the norm of a vector, and you may specify `axis=` to compute one norm per row of a matrix
- the numpy function `np.argsort(...)` returns the indices that would sort a vector
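
The two NumPy helpers from the notes behave as in the short illustration below. The embeddings are hypothetical 2-D toy values, and the snippet only shows the building blocks, not a full solution to the exercise.

```python
import numpy as np

# Hypothetical 2-D embeddings, one row per item.
toy_embeddings = np.array([[3.0, 4.0],
                           [1.0, 0.0],
                           [0.0, 2.0]])

# axis=1 computes one norm per row instead of a single norm for the whole array.
row_norms = np.linalg.norm(toy_embeddings, axis=1)

# Euclidean distance from item 0 to every item (including itself).
distances = np.linalg.norm(toy_embeddings - toy_embeddings[0], axis=1)

# argsort returns the indices that would sort the array in ascending order,
# so the closest items come first.
nearest_first = np.argsort(distances)
print(row_norms, nearest_first)
```

Note that the query item is always its own nearest neighbour at distance zero, so a most-similar function usually skips the first index.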