In [1]:
import multiprocessing as mp
import pandas as pd
import numpy as np

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from funk_svd.dataset import fetch_ml_ratings
from funk_svd.utils import _timer
from funk_svd import SVD

## Import data from MovieLens 20M dataset

[MovieLens 20M Dataset Research Paper](http://files.grouplens.org/papers/harper-tiis2015.pdf)

In [2]:
%%time

df = fetch_ml_ratings(variant='20m', verbose=True)
print()

In [3]:
df.head()

In [4]:
df.tail()

## Perform a train/val/test split

There are 138,493 different users in the MovieLens 20M dataset, each of whom has rated at least 20 movies. Let's take the last 4 ratings of each user and randomly split them between the validation and test sets.

To do so, we need to query our DataFrame for each user and then select their last 4 ratings. With so many users, this is naturally quite expensive. Fortunately, the iterations are independent, so the work can be parallelized to save some time (especially if you have good computing resources). I'm using an Intel Core i7-8565U CPU (4 physical cores) on a 16GB laptop.

<img src="https://www.dlapiper.com/~/media/images/insights/publications/2015/warning.jpg?la=en&hash=6F2E30889FD9E0B11016A1712E6E583575717C54" width="23" align="left">

&nbsp; If you want to run this notebook on **Windows**, `multiprocessing.Pool` will not work as written here, because Windows lacks the `fork` start method. For simplicity, you can just compute the mask sequentially.
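A sequential version does not have to loop over users at all. As a minimal sketch, assuming the rows of each user are already in the order you want to cut from, pandas' `groupby(...).tail(n)` returns the last `n` rows of every group. The helper name `last_n_ratings_index` and the toy DataFrame are illustrative and not part of funk-svd.

```python
import pandas as pd


def last_n_ratings_index(data, n_rate=4, user_col='u_id'):
    """Return the index labels of the last `n_rate` rows of each user."""
    # tail() keeps rows in their original DataFrame order.
    return data.groupby(user_col, sort=False).tail(n_rate).index.tolist()


# Toy data (hypothetical): two users with five ratings each, interleaved.
toy = pd.DataFrame({
    'u_id':   [1, 1, 1, 2, 2, 1, 2, 2, 1, 2],
    'i_id':   [10, 11, 12, 10, 13, 14, 15, 16, 17, 18],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5, 5.0, 3.5],
})

toy_mask = last_n_ratings_index(toy, n_rate=2)
print(toy_mask)
```

On the real data, `last_n_ratings_index(df)` would play the role of `val_test_mask`. The labels come back in DataFrame order rather than grouped by user, so a later seeded random split may pick different rows than the parallel version does.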

In [5]:
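@_timer(text='')
def compute_val_test_mask(data, i, n_process, n_rate=4):
    # Collect the index labels of the last `n_rate` ratings of every user in `data`.
    # `i` is the worker number (used for logging only); `n_process` is unused.
    val_test_mask = []
    users = data['u_id'].unique()

    for u_id in users:
        # Rows keep the DataFrame's order, so the tail holds the user's last rows.
        u_subset = data[data['u_id'] == u_id]
        val_test_mask += u_subset.iloc[-n_rate:].index.tolist()

    print(f'Process {i} done in', end=' ')
    return val_test_mask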

In [6]:
users = df['u_id'].unique()

seed = 3
np.random.seed(seed)
np.random.shuffle(users)

n_process = 12
pool = mp.Pool(processes=n_process)

df_splitted = [
    df.query('u_id.isin(@users_subset)')
    for users_subset in np.array_split(users, n_process)
]

results = [
    pool.apply_async(compute_val_test_mask, args=(data, i, n_process))
    for i, data in zip(range(n_process), df_splitted)
]

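# Wait for every worker, then release the pool's processes.
results = [p.get() for p in results]
pool.close()
pool.join()

# Flatten the per-worker lists into a single list of index labels.
val_test_mask = [item for sublist in results for item in sublist]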

In [7]:
train = df.drop(val_test_mask)
val = df.loc[val_test_mask].sample(frac=0.5, random_state=seed)
test = df.loc[val_test_mask].drop(val.index.tolist())
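The split logic is easy to sanity-check on a small frame. The sketch below wraps the same three steps in a hypothetical helper, `split_by_mask`, and runs it on made-up toy data. Neither the helper nor the data is part of the notebook's pipeline.

```python
import pandas as pd


def split_by_mask(data, mask, seed=3):
    """Split `data` into train/val/test given the held-out index labels."""
    train = data.drop(mask)
    held_out = data.loc[mask]
    # Half of the held-out rows go to validation, the rest to test.
    val = held_out.sample(frac=0.5, random_state=seed)
    test = held_out.drop(val.index)
    return train, val, test


# Toy data (hypothetical): two users with six ratings each.
toy = pd.DataFrame({
    'u_id':   [1] * 6 + [2] * 6,
    'i_id':   list(range(6)) * 2,
    'rating': [3.0, 4.0, 5.0, 2.0, 4.5, 3.5] * 2,
})
toy_mask = [2, 3, 4, 5, 8, 9, 10, 11]  # last 4 rows of each toy user

train_toy, val_toy, test_toy = split_by_mask(toy, toy_mask)

# The three parts must be disjoint and together cover every row.
print(len(train_toy), len(val_toy), len(test_toy))
```

Note that the random half is drawn over all held-out rows at once, so a given user's 4 ratings are not necessarily divided 2/2 between validation and test.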

## Modeling

Let's fit our model.

In [8]:
svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
          early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

svd.fit(X=train, X_val=val)
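To give an intuition for what `lr`, `reg` and `n_factors` control, here is a minimal NumPy sketch of the textbook biased matrix factorization trained with SGD. It is an illustration under assumptions, not funk-svd's actual implementation. The function names, the toy ratings and the larger learning rate (chosen so the toy example converges quickly) are all made up for this example.

```python
import numpy as np


def predict_one(u, i, pu, qi, bu, bi, global_mean):
    """Predicted rating: global mean + user bias + item bias + dot product."""
    return global_mean + bu[u] + bi[i] + pu[u] @ qi[i]


def sgd_epoch(ratings, pu, qi, bu, bi, global_mean, lr, reg):
    """One SGD pass over (user, item, rating) triples; arrays change in place."""
    for u, i, r in ratings:
        err = r - predict_one(u, i, pu, qi, bu, bi, global_mean)
        # `lr` scales every step, `reg` shrinks parameters toward zero.
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        pu_old = pu[u].copy()  # use the pre-update user factors for the item step
        pu[u] += lr * (err * qi[i] - reg * pu[u])
        qi[i] += lr * (err * pu_old - reg * qi[i])


def rmse_on(ratings, pu, qi, bu, bi, global_mean):
    sq = [(r - predict_one(u, i, pu, qi, bu, bi, global_mean)) ** 2
          for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))


# Toy problem (hypothetical): 3 users, 3 items, 2 latent factors.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
pu = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
qi = rng.normal(0, 0.1, (n_items, n_factors))   # item factors
bu = np.zeros(n_users)                          # user biases
bi = np.zeros(n_items)                          # item biases
global_mean = float(np.mean([r for _, _, r in toy_ratings]))

rmse_before = rmse_on(toy_ratings, pu, qi, bu, bi, global_mean)
for _ in range(200):
    sgd_epoch(toy_ratings, pu, qi, bu, bi, global_mean, lr=0.05, reg=0.005)
rmse_after = rmse_on(toy_ratings, pu, qi, bu, bi, global_mean)

print(f'Train RMSE before: {rmse_before:.2f}, after: {rmse_after:.2f}')
```

In this formulation, `n_factors` sets the width of `pu` and `qi`, and early stopping would amount to monitoring the same RMSE on a validation set after each epoch.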

Predict on the test set and compute the results.

In [9]:
%%time

pred = svd.predict(test)

rmse = np.sqrt(mean_squared_error(test['rating'], pred))
mae = mean_absolute_error(test['rating'], pred)

print(f'Test RMSE: {rmse:.2f}')
print(f'Test MAE:  {mae:.2f}')
print()
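For reference, both metrics are simple averages over the prediction errors. The sketch below computes them by hand with NumPy on a few made-up values (the arrays are illustrative, not notebook output).

```python
import numpy as np

# Hypothetical true ratings and predictions.
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_pred = np.array([3.5, 3.0, 4.0, 3.0])

errors = y_pred - y_true

# RMSE: square root of the mean squared error (penalizes large errors more).
rmse_manual = float(np.sqrt(np.mean(errors ** 2)))

# MAE: mean of the absolute errors.
mae_manual = float(np.mean(np.abs(errors)))

print(f'RMSE: {rmse_manual:.3f}')
print(f'MAE:  {mae_manual:.3f}')
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions.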

## Comparison with Surprise library

In [10]:
from surprise import Dataset
from surprise import Reader
from surprise import SVD  # note: shadows the funk_svd SVD imported earlier

Format the data the way Surprise expects.

In [11]:
%%time

reader = Reader(rating_scale=(1, 5))

trainset = Dataset.load_from_df(train[['u_id', 'i_id', 'rating']],
                               reader=reader).build_full_trainset()

testset = Dataset.load_from_df(test[['u_id', 'i_id', 'rating']], reader=reader)
testset = testset.construct_testset(testset.raw_ratings)

Fit the model with the same parameters.

In [12]:
%%time

svd = SVD(lr_all=.001, reg_all=0.005, n_epochs=46, n_factors=15, verbose=True)
svd.fit(trainset)
print()

Predict on the test set and compute the results.

In [13]:
%%time

pred = svd.test(testset)
y_true = [p.r_ui for p in pred]
y_hat = [p.est for p in pred]

rmse = np.sqrt(mean_squared_error(y_true, y_hat))
mae = mean_absolute_error(y_true, y_hat)

print(f'Test RMSE: {rmse:.2f}')
print(f'Test MAE:  {mae:.2f}')
print()

Accuracy is equivalent, as expected. The difference lies in computation time: with `Numba`, funk-svd runs more than 10 times faster than Surprise's Cython implementation.

| MovieLens 20M | RMSE   | MAE    | Time          |
|:--------------|:------:|:------:|--------------:|
| Surprise      |  0.88  |  0.68  | 10 min 40 sec |
| Funk-svd      |  0.88  |  0.68  |        42 sec |