In this notebook we explore the k-means models defined in `Report.ipynb`.

### Preprocessing

Let us first load the data and carry out a few preprocessing steps:

In [1]:
# necessary import - helper functions we implemented
from lib import preprocessing
from lib import utils
from lib import MEMF_k_means as Mk
import pandas as pd

In [2]:
# sample ~1 M ratings
data = preprocessing.sample_data("ratings.csv", random_state = 0)

In [3]:
# compute the total number of ratings
tot_ratings = utils.total_ratings(data)
# compute the number of users
tot_users = utils.unique_users(data)
# compute the number of items
tot_items = utils.unique_movies(data)

We hash the item and user ids so that the set of unique hashed ids becomes `range(tot_items)` and `range(tot_users)` for more convenient manipulation in matrix factorization:
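As an illustration of the idea, here is a minimal sketch of such an id mapping in plain Python. It is not the implementation of `utils.hash_data`; the helper name `build_id_map` and the toy ids are hypothetical and only show how arbitrary raw ids can be mapped onto `range(n)` while keeping a table to recover the originals.

```python
def build_id_map(raw_ids):
    """Map each distinct raw id to a consecutive integer, in order of first appearance."""
    id_map = {}
    for raw in raw_ids:
        if raw not in id_map:
            id_map[raw] = len(id_map)
    return id_map


# Toy raw movie ids (hypothetical values).
raw_item_ids = [500, 500, 31, 1200, 31]

item_map = build_id_map(raw_item_ids)
hashed_item_ids = [item_map[raw] for raw in raw_item_ids]

# Inverse table, playing the role of a hashing table used to retrieve true ids.
inverse_item_map = {new: old for old, new in item_map.items()}

print(hashed_item_ids)
print(inverse_item_map)
```

With ids in `range(n)`, each user or item can index a row of a factor matrix directly, which is presumably why the notebook performs this step before matrix factorization.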

In [4]:
# hash the data
hashed_data = utils.hash_data(data)

In [5]:
# create train and validation sets
train, test = preprocessing.custom_sampled_train_test_split(hashed_data,
                                                            random_state = 0)

In [6]:
# create the hashing table used to retrieve true ids
hashing_table = pd.DataFrame({"old_item": train["old_item"],
                              "item": train["item"]})

In [7]:
# create rating vectors
y_train = train['rating']
y_test =  test['rating']

### K-means MEMF with one cluster per movie (K-means-MEMF-1)

Tuning on the validation-set RMSE led to the following number of clusters (see `model_tuning.py` for the code used to obtain the RMSE score for each number of clusters tried):
* `n_clusters` = 8

Such a small number of clusters suggests that this model may not benefit much from dividing the data into clusters. We check below whether this is indeed the case.

Let us instantiate this model and fit it to the training data. Then we peek at the distribution of the number of movies per cluster.

In [8]:
M1 = Mk.MEMF_k_means(n_clusters = 8,
                  tot_users = tot_users, 
                  tot_items = tot_items)

In [9]:
# create the clusters

# for this model, reduce_dimension only converts the data into a sparse matrix
reduced_sparse_train = M1.reduce_dimension(train)

M1.compute_membership(hashing_table, reduced_sparse_train)
M1.group_by_cluster(hashing_table)

# peek at the distribution of clusters
for i in range(len(M1.clusters_)):
    print(len(M1.clusters_[i]))

100%|██████████| 22243/22243 [16:15<00:00, 22.79it/s]
22243it [00:18, 1197.98it/s]

1
342
1
1
1
1
21895
1





This result suggests that the clustering phase did poorly: one cluster holds 21,895 of the 22,243 movies, and six of the eight clusters contain a single movie. Our first hypothesis was that a few outliers systematically form their own clusters. However, when we removed those movies and reran the clustering, the distribution was just as imbalanced. We believe the imbalance is due to the high sparsity of the movie ratings: sparse movie vectors differ considerably from one another, so isolating a few movies in their own clusters always yields a better clustering score.

**We will address this when extending our models at a later stage**.
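To quantify the imbalance, the short sketch below takes the cluster sizes printed above for K-means-MEMF-1 and computes the share of movies in the largest cluster and the number of singleton clusters. The variable names are ours and not part of `lib`.

```python
# Cluster sizes printed above for K-means-MEMF-1 (n_clusters = 8).
cluster_sizes = [1, 342, 1, 1, 1, 1, 21895, 1]

total_movies = sum(cluster_sizes)
largest_share = max(cluster_sizes) / total_movies
n_singletons = sum(1 for size in cluster_sizes if size == 1)

print(f"movies clustered: {total_movies}")
print(f"share of the largest cluster: {largest_share:.1%}")
print(f"singleton clusters: {n_singletons} of {len(cluster_sizes)}")
```

A well-balanced partition into 8 clusters would put roughly 12.5% of the movies in each cluster, so a largest-cluster share above 98% means the partition is close to having no clusters at all.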

We now fit the model to the entire training data:

In [10]:
M1.fit(train, y_train)

Let us see how our model performs on different segments. We will compare the RMSE for different groups and the RMSE on the whole data.
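The sketch below shows one way such a segment RMSE could be computed. It is an assumption about the general approach, not the code of `utils.predict_popular`: the function `segment_rmse`, the row format, and the choice of counting ratings in the training set are all hypothetical. The calls below negate the helper's return value, which suggests it follows the "higher is better" convention of returning a negated RMSE; the sketch returns the plain RMSE instead.

```python
import math
from collections import Counter


def segment_rmse(train_rows, test_rows, predict, key, threshold, ascending=False):
    """RMSE over the test rows whose `key` (user or item id) has more than
    `threshold` training ratings, or fewer than `threshold` if `ascending`."""
    counts = Counter(row[key] for row in train_rows)
    if ascending:
        selected = [row for row in test_rows if counts[row[key]] < threshold]
    else:
        selected = [row for row in test_rows if counts[row[key]] > threshold]
    if not selected:
        return float("nan")  # empty segment: the RMSE is undefined
    squared_errors = [(predict(row) - row["rating"]) ** 2 for row in selected]
    return math.sqrt(sum(squared_errors) / len(selected))


# Toy data: user 0 has three training ratings, user 1 has one.
train_rows = [
    {"user": 0, "item": 0, "rating": 4.0},
    {"user": 0, "item": 1, "rating": 3.0},
    {"user": 0, "item": 2, "rating": 5.0},
    {"user": 1, "item": 0, "rating": 2.0},
]
test_rows = [
    {"user": 0, "item": 3, "rating": 4.0},
    {"user": 1, "item": 1, "rating": 1.0},
]


def constant_predictor(row):
    """Stand-in for a fitted model: always predicts a rating of 3."""
    return 3.0


active_rmse = segment_rmse(train_rows, test_rows, constant_predictor,
                           key="user", threshold=2)
rare_rmse = segment_rmse(train_rows, test_rows, constant_predictor,
                         key="user", threshold=2, ascending=True)
print(active_rmse, rare_rmse)
```

Passing `key="item"` instead of `key="user"` would give the movie-based segments, mirroring the `item` flag used in the calls below.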

*Users with many ratings: users with more than 40 ratings*

In [11]:
rmse1 = -utils.predict_popular(M1, train, test, y_test, item = False, threshold = 40)
print(rmse1)

100%|██████████| 20641/20641 [13:10<00:00, 26.10it/s]

0.8826976685021714





*The most popular movies: movies with more than 1000 ratings*

In [12]:
rmse2 = -utils.predict_popular(M1, train, test, y_test, item = True, threshold = 1000)
print(rmse2)

100%|██████████| 29195/29195 [12:48<00:00, 36.97it/s]

0.8957380139400553





*The most scarcely rated movies: movies with fewer than 10 ratings*

In [13]:
rmse3 = -utils.predict_popular(M1, train, test, y_test, item = True, threshold = 10, ascending = True)
print(rmse3)

100%|██████████| 4347/4347 [03:13<00:00, 22.12it/s]

1.2206227176643751





The RMSE over the whole data is:

In [14]:
rmse_tot = -M1.score(test, y_test, hashing_table)
print(rmse_tot)

100%|██████████| 173374/173374 [1:33:04<00:00, 31.05it/s]

0.9432428841405675





As expected, the model performs better on users and items with many training ratings (RMSE of about 0.88 and 0.90) and much worse on scarcely rated movies (about 1.22). We will keep these results in mind when comparing with the other models explored below.

### K-means MEMF with one main cluster per movie and weighted biases (K-means-MEMF-2)

The parameters of this model are the following:
* `n_clusters`: the number of movie clusters (`k`)
* `n_clusters_prediction`: the number of closest clusters considered in the prediction weighted sum (`k'` in the discussion of the model definition in `Report.ipynb`)
* `weight_penalty`: the parameter of the weight function used to penalize irrelevant clusters ($\lambda$ in the discussion of the model definition in `Report.ipynb`)
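To make the roles of `n_clusters_prediction` and `weight_penalty` concrete, here is a minimal sketch of a weighted sum over the closest clusters. The actual weight function is defined in `Report.ipynb` and is not reproduced here; the exponential decay `exp(-weight_penalty * distance)` used below is purely an assumption for illustration, as are the function name and the toy numbers.

```python
import math


def weighted_prediction(cluster_predictions, cluster_distances,
                        n_clusters_prediction, weight_penalty):
    """Combine per-cluster predictions for one (user, movie) pair.

    Only the `n_clusters_prediction` clusters closest to the movie contribute.
    ASSUMPTION: weights decay as exp(-weight_penalty * distance); the real
    weight function of the model may differ.
    """
    # Indices of the k' clusters closest to the movie.
    nearest = sorted(range(len(cluster_distances)),
                     key=lambda c: cluster_distances[c])[:n_clusters_prediction]
    weights = [math.exp(-weight_penalty * cluster_distances[c]) for c in nearest]
    total = sum(weights)
    return sum(w * cluster_predictions[c] for w, c in zip(weights, nearest)) / total


# Toy example with three clusters (hypothetical values).
predictions = [4.0, 2.0, 1.0]   # rating predicted by each cluster's model
distances = [0.0, 1.0, 5.0]     # distance from the movie to each centroid

only_closest = weighted_prediction(predictions, distances, 1, 1.0)
plain_average = weighted_prediction(predictions, distances, 2, 0.0)
penalized = weighted_prediction(predictions, distances, 2, 1.0)
print(only_closest, plain_average, penalized)
```

Under this assumed weighting, a penalty of zero averages the closest clusters equally, while a larger penalty pulls the prediction towards the closest cluster alone.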

Tuning on the validation-set RMSE led to the following choice of parameters (see `model_tuning.py` for the code used to obtain the RMSE score for the small set of parameter combinations tried):
* `n_clusters` = 12
* `n_clusters_prediction` = 2
* `weight_penalty` = 1.

This model performs better with a larger number of clusters than the previous one, although 12 is still very small compared to the number of movies (more than 20,000). The number of closest clusters used in the prediction seems reasonable. Thinking of genres, for example, it seems reasonable to use two or three genres to determine the ratings for a movie, although we do not know yet whether our clusters correspond to anything genre-like.

Let us instantiate this model and fit it to the training data:

In [15]:
M2 = Mk.MEMF_k_means(n_clusters = 12,
                  n_clusters_prediction = 2,
                  weight_penalty = 1.,
                  tot_users = tot_users, 
                  tot_items = tot_items)

In [16]:
# create the clusters

# for this model, reduce_dimension only converts the data into a sparse matrix
reduced_sparse_train = M2.reduce_dimension(train)

M2.compute_membership(hashing_table, reduced_sparse_train)
M2.group_by_cluster(hashing_table)

# peek at the distribution of clusters
for i in range(len(M2.clusters_)):
    print(len(M2.clusters_[i]))

100%|██████████| 22243/22243 [27:02<00:00, 12.93it/s]
22243it [00:21, 1020.88it/s]

1
22232
1
1
1
1
1
1
1
1
1
1





**The distribution of movies per cluster shows that this model does not leverage clustering at all: 22,232 of the 22,243 movies fall into a single cluster, and each of the other eleven clusters holds one movie. There is clearly something fundamentally wrong with how we cluster movies. We believe the sparsity of the matrix, combined with numerous outliers, prevents a useful partition of movies into clusters. This is something we address when extending the model.**

The prediction score should nevertheless differ somewhat from basic matrix factorization because the two nearest clusters are used in the predictions. This means that eleven or fewer movies (one movie in each of the eleven singleton clusters) affect all the other ratings through the user biases on these movies. This does not seem like a good idea, but let us see how the model performs.

In [17]:
M2.fit(train, y_train)

*Users with many ratings: users with more than 40 ratings*

In [18]:
rmse1 = -utils.predict_popular(M2, train, test, y_test, item = False, threshold = 40)
print(rmse1)

100%|██████████| 20641/20641 [18:34<00:00, 18.52it/s]

0.922616505417432





*The most popular movies: movies with more than 1000 ratings*

In [19]:
rmse2 = -utils.predict_popular(M2, train, test, y_test, item = True, threshold = 1000)
print(rmse2)
# 0.9189

100%|██████████| 29195/29195 [26:10<00:00, 19.07it/s]

0.9176434575082244





*The most scarcely rated movies: movies with fewer than 10 ratings*

In [20]:
rmse3 = -utils.predict_popular(M2, train, test, y_test, item = True, threshold = 10, ascending = True)
print(rmse3)

100%|██████████| 4347/4347 [04:18<00:00, 16.84it/s]

1.151220562694912





The RMSE over the whole data is:

In [21]:
rmse_tot = -M2.score(test, y_test, hashing_table)
print(rmse_tot)

100%|██████████| 173374/173374 [1:59:10<00:00, 24.32it/s] 

0.9493229192226168





Below is a table summarizing the RMSE of the two models on each segment and over the whole testing set:

|  Segment              | K-means-MEMF-1| K-means-MEMF-2 |
|:---------------------:|:-------------:|:--------------:|
| Popular movies        | **0.8957**        | 0.9176         |
| Scarcely rated movies | 1.2206        | **1.1512**         |
| Active users          | **0.8827**        | 0.9226         |
| Overall               | **0.9432**        | 0.9493         |

The second model performs better than the first on scarcely rated movies. This might be an artefact of the adjustments from second-closest clusters, which benefit scarcely rated movies (especially the outliers that form a cluster of their own) but bring little to the other movies.

The first model performs better on all other segments and overall, although the overall gap is small (0.9432 versus 0.9493).

### K-means MEMF with weighted predictions (K-means-MEMF-3)

The parameters for this model are the same as for the previous one. We only have to set the boolean parameter `fit_all` from its default `False` to `True` so that each movie is included in the models of all its closest clusters.

Tuning on the validation-set RMSE led to the following choice of parameters (see `model_tuning.py` for the code used to obtain the RMSE score for the small set of parameter combinations tried):
* `n_clusters` = 8
* `n_clusters_prediction` = 3
* `weight_penalty` = 1.

Let us instantiate this model and fit it to the training data:

In [22]:
M3 = Mk.MEMF_k_means(n_clusters = 8,
                  n_clusters_prediction = 3,
                  weight_penalty = 1.,
                  fit_all = True,
                  tot_users = tot_users, 
                  tot_items = tot_items)

In [23]:
# create the clusters

# for this model, reduce_dimension only converts the data into a sparse matrix
reduced_sparse_train = M3.reduce_dimension(train)

M3.compute_membership(hashing_table, reduced_sparse_train)
M3.group_by_cluster(hashing_table)

M3.fit(train, y_train)

100%|██████████| 22243/22243 [20:35<00:00, 18.01it/s]
100%|██████████| 22243/22243 [00:00<00:00, 487210.05it/s]


*Users with many ratings: users with more than 40 ratings*

In [24]:
rmse1 = -utils.predict_popular(M3, train, test, y_test, item = False, threshold = 40)
print(rmse1)

100%|██████████| 20641/20641 [55:05<00:00,  5.65it/s]

0.8625250124869523





*The most popular movies: movies with more than 1000 ratings*

In [25]:
rmse2 = -utils.predict_popular(M3, train, test, y_test, item = True, threshold = 1000)
print(rmse2)

100%|██████████| 29195/29195 [1:06:50<00:00,  6.69it/s]

0.8852002191368674





*The most scarcely rated movies: movies with fewer than 10 ratings*

In [26]:
rmse3 = -utils.predict_popular(M3, train, test, y_test, item = True, threshold = 10, ascending = True)
print(rmse3)

100%|██████████| 4347/4347 [10:59<00:00,  6.62it/s]

1.2158495105911074





The RMSE over the whole data is:

In [27]:
rmse_tot = -M3.score(test, y_test, hashing_table)
print(rmse_tot)

100%|██████████| 173374/173374 [5:43:44<00:00,  8.74it/s]  

0.9185784269336655





Below is a table summarizing the RMSE of the three k-means models on each segment and over the whole testing set:

|  Segment              | K-means-MEMF-1| K-means-MEMF-2 | K-means-MEMF-3 |
|:---------------------:|:-------------:|:--------------:|:--------------:|
| Popular movies        | 0.8957        | 0.9176         | **0.8852**         |
| Scarcely rated movies | 1.2206        | **1.1512**         | 1.2158         |
| Active users          | 0.8827        | 0.9226         | **0.8625**         |
| Overall               | 0.9432        | 0.9493         | **0.9186**         |

The third model is the most accurate in terms of RMSE. Its overall RMSE of 0.9186 is roughly 2.6% lower than that of model 1 (0.9432) and 3.2% lower than that of model 2 (0.9493).
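The comparison in the table can be reproduced from the reported scores. The sketch below copies the rounded RMSE values from the table into a plain dictionary (the variable names are ours), picks the best model per segment, and computes the relative improvement of model 3 on the overall RMSE.

```python
# RMSE values copied from the summary table above (rounded to 4 decimals).
rmse = {
    "Popular movies":        {"K-means-MEMF-1": 0.8957, "K-means-MEMF-2": 0.9176, "K-means-MEMF-3": 0.8852},
    "Scarcely rated movies": {"K-means-MEMF-1": 1.2206, "K-means-MEMF-2": 1.1512, "K-means-MEMF-3": 1.2158},
    "Active users":          {"K-means-MEMF-1": 0.8827, "K-means-MEMF-2": 0.9226, "K-means-MEMF-3": 0.8625},
    "Overall":               {"K-means-MEMF-1": 0.9432, "K-means-MEMF-2": 0.9493, "K-means-MEMF-3": 0.9186},
}

# Lower RMSE is better, so the best model minimizes the score in each segment.
best = {segment: min(scores, key=scores.get) for segment, scores in rmse.items()}

overall = rmse["Overall"]
gain_vs_m1 = (overall["K-means-MEMF-1"] - overall["K-means-MEMF-3"]) / overall["K-means-MEMF-1"]
gain_vs_m2 = (overall["K-means-MEMF-2"] - overall["K-means-MEMF-3"]) / overall["K-means-MEMF-2"]

for segment, model in best.items():
    print(f"{segment}: {model}")
print(f"overall gain of model 3 vs model 1: {gain_vs_m1:.1%}")
print(f"overall gain of model 3 vs model 2: {gain_vs_m2:.1%}")
```

These are relative differences between single runs with `random_state = 0`; they do not by themselves establish statistical significance.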

Model 3 barely improves on model 1 for scarcely rated movies (1.2158 versus 1.2206). This suggests that the fault lies in the sparsity of the data itself: it might be very hard to improve performance on those movies significantly without a strong risk of overfitting the few available ratings.

All three models perform better on active users than overall. This is reassuring, since those users are among those we wish to target with this recommender system, which was motivated by customer retention.