In this notebook, we explore the MEMF-genres model defined in `Report.ipynb`.

### Preprocessing

Let us first load the data and carry out a few preprocessing steps.

In [1]:
from lib import MEMF
from lib import MF
from lib import utils
from lib import preprocessing as prepro
from lib import metrics

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import math
from tqdm import tqdm

In [3]:
ratings = pd.read_csv('../ml-latest/ratings_small.csv')

In [4]:
ratings.head()

Unnamed: 0,user,item,rating
0,18,6,3.0
1,18,21,3.0
2,18,161,4.0
3,18,216,3.0
4,18,230,4.0


In [5]:
hashed_ratings = utils.hash_data(ratings)

In [6]:
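# set aside the items with too few ratings (threshold 1); they are added back to the training set below
t, reduced_ratings = prepro.separate_elements_with_few_ratings(1, hashed_ratings, element="item")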

In [7]:
X_train, X_test,_,_ = train_test_split(reduced_ratings, 
                                       reduced_ratings['item'], 
                                       test_size = 0.15, 
                                       stratify = reduced_ratings['item'],
                                       random_state = 596)

In [8]:
X_train = pd.concat([t, X_train])

In [9]:
# test rows whose user never appears in the training set
X_test[~X_test['user'].isin(X_train['user'].unique())]

Unnamed: 0,old_user,old_item,user,item,rating
2212,7026,590,254,24,4.0
2213,7026,592,254,25,3.0
2211,7026,161,254,6,5.0


In [10]:
# move one rating of user 254 from the test set to the training set,
# so that this user is known at training time
X_train = pd.concat([pd.DataFrame({'old_user' : [7026], 'old_item' : [590],
                                   'user': [254], 'item' : [24], 
                                   'rating' : [4.0]}), X_train], 
                    ignore_index = True)
X_test = X_test[~(np.logical_and(X_test['user'] == 254, X_test['item'] == 24))]

In [11]:
y_train = X_train['rating']
y_test = X_test['rating']
X_train = X_train.drop('rating', axis = 1)
X_test = X_test.drop('rating', axis = 1)

### MEMF-genres

Now let's use the MEMF-genres class.
In this setting, the clusters are the genres a movie can be assigned to. Each movie can belong to several genres, and therefore to several clusters.
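
To make the multi-cluster membership concrete, here is a minimal sketch in plain Python. It assumes the movie descriptor lists genres as a pipe-separated string, as in the MovieLens files. The real layout of `clean_movies.csv` and the internals of `compute_membership` are not shown in this notebook, so the rows and the helper `build_membership` below are hypothetical.

```python
from collections import defaultdict

# Hypothetical descriptor rows: (movieId, pipe-separated genres).
# The actual columns of clean_movies.csv may differ.
movies = [
    (1, "Adventure|Animation|Children"),
    (2, "Adventure|Fantasy"),
    (3, "Comedy"),
]


def build_membership(rows):
    """Map each genre (cluster) to the set of movie ids that belong to it."""
    clusters = defaultdict(set)
    for movie_id, genres in rows:
        for genre in genres.split("|"):
            clusters[genre].add(movie_id)
    return dict(clusters)


membership = build_membership(movies)
for genre in sorted(membership):
    print(genre, sorted(membership[genre]))
```

Movies 1 and 2 both land in the Adventure cluster while also belonging to other clusters, which is the overlap described above.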

In [13]:
# getting the movie descriptor file
movie_file = "../ml-latest/clean_movies.csv"

In [14]:
# Initializing the MEMF-genres model with our movie descriptor
base = MEMF.MEMF_genres(genre_file=movie_file)

In [15]:
# number of items  
base.unique_items

58020

In [16]:
base.data.movieId.size

106107

In [17]:
# defining the clusters, here only reading the movie descriptor file
base.define_clusters()

Number of Clusters: 20


In [18]:
# each movie is linked to a set of genres, i.e. a set of clusters
base.compute_membership()

100%|██████████| 58098/58098 [00:58<00:00, 986.46it/s] 


In [29]:
# As explained, each cluster is its own MF problem
# We have 20 of those in this case
base.fit(X_train, y_train.values)

Adventure matrix factorization is over.
Animation matrix factorization is over.
Children matrix factorization is over.
Comedy matrix factorization is over.
Fantasy matrix factorization is over.
Romance matrix factorization is over.
Drama matrix factorization is over.
Action matrix factorization is over.
Crime matrix factorization is over.
Thriller matrix factorization is over.
Horror matrix factorization is over.
Mystery matrix factorization is over.
Sci-Fi matrix factorization is over.
IMAX matrix factorization is over.
Documentary matrix factorization is over.
War matrix factorization is over.
Musical matrix factorization is over.
Western matrix factorization is over.
Film-Noir matrix factorization is over.
(no genres listed) matrix factorization is over.


In [30]:
y_pred = base.predict(X_test)

100%|██████████| 13325/13325 [01:00<00:00, 219.36it/s]


In [31]:
scores = base.score(X_test, y_test)

100%|██████████| 13325/13325 [00:59<00:00, 224.84it/s]


In [32]:
print(scores)

{'RMSE': 0.9479060358071487, 'bias': -0.052623715989096216, 'standard deviation': 0.9464441860115801}


The RMSE is about 0.948 and comes almost entirely from the standard deviation of the errors (0.946). The bias (-0.053) contributes almost nothing.
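
This can be checked with the usual decomposition RMSE² = bias² + (standard deviation)². The sketch below assumes that `bias` is the mean prediction error and `standard deviation` is the population standard deviation of the errors. That is an assumption about `lib.metrics`, but the printed scores are consistent with it.

```python
import math

# Scores printed above, copied verbatim.
scores = {'RMSE': 0.9479060358071487,
          'bias': -0.052623715989096216,
          'standard deviation': 0.9464441860115801}

mse = scores['RMSE'] ** 2
bias_sq = scores['bias'] ** 2
variance = scores['standard deviation'] ** 2

# Under the assumption above, the mean squared error splits into these two terms.
print("MSE:", mse)
print("bias^2 + variance:", bias_sq + variance)

# Share of the squared error explained by each term.
bias_share = bias_sq / mse
variance_share = variance / mse
print(f"bias share: {bias_share:.2%}, variance share: {variance_share:.2%}")
```

The squared bias accounts for roughly 0.3% of the mean squared error, so reducing the spread of the errors is what would lower the RMSE.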

Now, to evaluate the diversity of our model, we can use the chi-square statistic as defined in `Report.ipynb`.

In [41]:
diversity = metrics.compute_diversity(base, X_test)

100%|██████████| 467/467 [00:01<00:00, 283.29it/s]
100%|██████████| 467/467 [00:01<00:00, 291.05it/s]
100%|██████████| 467/467 [00:01<00:00, 303.93it/s]
100%|██████████| 467/467 [00:01<00:00, 305.14it/s]
100%|██████████| 467/467 [00:01<00:00, 295.77it/s]
100%|██████████| 467/467 [00:01<00:00, 293.17it/s]


In [42]:
print(diversity)

151456.4602480644


This value is only meaningful relative to other models, so we will compare it across models rather than interpret it on its own.
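
For reference, the textbook Pearson chi-square statistic can be sketched as follows. The exact definition used by `metrics.compute_diversity` is given in `Report.ipynb` and is not reproduced here, so the choice of observed and expected counts below (hypothetical recommendation counts compared with a uniform spread) is purely illustrative.

```python
def chi_square(observed, expected):
    """Pearson chi-square: sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of recommendations falling into three categories.
observed = [50, 30, 20]

# Illustrative reference: the same total spread evenly over the categories.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

stat = chi_square(observed, expected)
uniform_stat = chi_square(expected, expected)
print(stat, uniform_stat)
```

The statistic is zero when the observed counts match the reference exactly and grows as they drift away from it, which is why it is read in relative terms.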

### Normalized Discounted Cumulative Gain


We also compute the NDCG of our model to get a better idea of its ranking performance.
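
As a reminder of what this metric measures, here is a minimal sketch of the standard NDCG for a single user, using linear gains and a log2 discount. The internals of `metrics.NDCG` are not shown in this notebook and may differ (for example in the gain function or in how users are averaged). The predicted scores below are made up for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(true_ratings, predicted_scores):
    """NDCG of the ranking induced by predicted_scores, for one user."""
    # Rank items by decreasing predicted score.
    order = sorted(range(len(true_ratings)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_ratings[i] for i in order]
    # The ideal ranking sorts items by their true ratings.
    ideal_dcg = dcg(sorted(true_ratings, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# True ratings of user 254 for items 6, 24 and 25 (from the tables above).
true_ratings = [5.0, 4.0, 3.0]
# Hypothetical predictions that swap the first two items.
predicted = [3.9, 4.2, 3.1]

print(ndcg(true_ratings, predicted))
print(ndcg(true_ratings, true_ratings))
```

A perfect ordering gives an NDCG of 1, and any misordering of highly rated items lowers it.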

In [35]:
##### with the in-class function
NDCG = metrics.NDCG()

In [36]:
NDCG.create_ranking(X_train, X_test, y_train, y_test)

100%|██████████| 10000/10000 [00:06<00:00, 1632.02it/s]


In [37]:
NDCG.dico_ranking_[254]

{6: 1, 24: 2, 25: 3}

In [38]:
NDCG.NDCG(X_train, X_test, y_train, y_test, y_pred)

100%|██████████| 10000/10000 [00:06<00:00, 1575.76it/s]
100%|██████████| 10000/10000 [00:06<00:00, 1583.01it/s]


0.9544001902587957