In [1]:
import pandas as pd
import numpy as np

from scipy import spatial

## Collaborative Filtering - MovieLens 100k

- Find similar users or items
- Predict the ratings of the items that are not yet rated by a user

1. How do we determine which users or items are similar to one another?

2. Given that we know which users are similar, how do we determine the rating a user would give to an item based on the ratings of similar users?

3. How do we measure the accuracy of the ratings we calculate?
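For the third question, two common error measures are the root mean squared error (RMSE) and the mean absolute error (MAE), the same two measures requested in the grid searches in this notebook. A minimal sketch in plain Python follows; the `actual` and `predicted` lists are made-up values for illustration only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)

# Hypothetical held-out ratings and a model's predictions for them
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))
print(mae(actual, predicted))
```

Lower is better for both; RMSE is never smaller than MAE on the same data.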

### Memory based algorithms

To find the rating *R* that a user *U* would give to an item *I*, the approach includes:

- Finding users similar to *U* who have rated the item *I*
  - Calculate the similarity between users from their rating vectors (e.g. using cosine similarity)
- Calculating the rating *R* based on the ratings of users found in the previous step
  - Calculate a weighted average of those users' ratings, using the similarities as weights
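The two steps above can be sketched in plain Python. This is a simplified illustration, not how Surprise implements it: the helper names `cosine_similarity` and `predict_rating`, and the toy `ratings` dictionary, are assumptions made for this example, and the similarity is computed only over items both users have rated.

```python
import math

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue  # step 1: only users who have rated the item
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        similarity_sum += sim
    if similarity_sum == 0:
        return None  # no similar user has rated the item
    return weighted_sum / similarity_sum  # step 2: weighted average

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "U": {"i1": 5, "i2": 3},
    "V": {"i1": 4, "i2": 2, "I": 4},
    "W": {"i1": 1, "i2": 5, "I": 2},
}

estimate = predict_rating(ratings, "U", "I")
print(estimate)
```

User V rates items much like U does, so V's rating of 4 pulls the estimate more strongly than W's rating of 2.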

#### User-based vs Item-based collaborative filtering

- User-based: Users are compared through their rating vectors (the ratings each user has given to items). To predict user *U*'s rating for an unrated item *I*, pick the N users most similar to *U* who have rated *I* and calculate the rating from those N ratings.

- Item-based: Items are compared through their rating vectors (the ratings each item has received from users). To predict user *U*'s rating for an unrated item *I*, pick the N items most similar to *I* that *U* has rated and calculate the rating from *U*'s N ratings of those items.
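The difference between the two approaches comes down to which rating vectors are compared. The sketch below builds both kinds of vector from the same (user, item, rating) triples, using the toy ratings from the Surprise example in this notebook; the names `user_vectors` and `item_vectors` are chosen for illustration.

```python
from collections import defaultdict

# (user, item, rating) triples
triples = [
    ("A", 1, 1), ("A", 2, 2), ("B", 1, 2), ("B", 2, 4), ("C", 1, 2.5),
    ("C", 2, 4), ("D", 1, 4.5), ("D", 2, 5), ("E", 1, 3),
]

user_vectors = defaultdict(dict)  # user -> {item: rating}, compared in user-based filtering
item_vectors = defaultdict(dict)  # item -> {user: rating}, compared in item-based filtering

for user, item, rating in triples:
    user_vectors[user][item] = rating
    item_vectors[item][user] = rating

print(user_vectors["E"])  # ratings given by user E
print(item_vectors[2])    # ratings received by item 2
```

With 5 users and 2 items there are 10 user pairs to compare but only 1 item pair.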

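In a system where there are more users than items, item-based filtering is usually faster and more stable than user-based filtering: there are fewer item pairs than user pairs to compare, and an item's average rating tends to change more slowly than a user's. It is also known to perform better than the user-based approach when the ratings matrix is sparse (which is very likely when there are many items).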

However, the item-based approach tends to perform poorly in browsing- or entertainment-based scenarios. These are better handled with matrix factorisation techniques, or with hybrid recommenders that also use content-based filtering to take item content, such as genre, into account.

### Model based algorithms

These involve a step that reduces or compresses the large but sparse user-item matrix, e.g. using singular value decomposition (SVD), principal component analysis (PCA), non-negative matrix factorisation (NMF), or autoencoders.
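As an illustration of the compression idea, the sketch below applies a truncated SVD with NumPy to a small, fully observed matrix. The matrix `R` and the choice `k = 2` are hypothetical. Real user-item matrices have missing entries, which plain SVD does not handle, so this shows only the low-rank approximation step and is not a complete recommender.

```python
import numpy as np

# Hypothetical dense 4x3 ratings matrix (rows: users, columns: items)
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: R = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # number of latent factors to keep
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_k, 2))
print("max abs reconstruction error:", np.abs(R - R_k).max())
```

Each user and each item is now described by `k` numbers instead of a full row or column of ratings.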

In [2]:
from surprise import Dataset
from surprise import Reader

ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))

# Load the pandas DataFrame; the columns must be in (user, item, rating) order
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)
# Load the built-in MovieLens 100k dataset (downloaded on first use)
movielens = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to C:\Users\chloe/.surprise_data/ml-100k


In [3]:
movielens

<surprise.dataset.DatasetAutoFolds at 0x1bebd8d7fa0>

In [4]:
from surprise import KNNWithMeans

# use item-based cosine-similarity
sim_options = {
    "name": "cosine",
    "user_based": False
}

algo = KNNWithMeans(sim_options=sim_options)

In [5]:
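# Train on every rating in the toy dataset (no held-out test set)
trainingSet = data.build_full_trainset()

algo.fit(trainingSet)
# Estimate the rating user 'E' would give item 2, which 'E' has not rated
prediction = algo.predict('E', 2)
prediction.est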

Computing the cosine similarity matrix...
Done computing similarity matrix.


4.15
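The estimate of 4.15 can be checked by hand. As described in the Surprise documentation, the item-based `KNNWithMeans` prediction is the target item's mean rating plus a similarity-weighted average of how far the user's ratings of neighbouring items deviate from those items' means. The sketch below redoes that arithmetic in plain Python under that assumption; it does not call Surprise, and the variable names are illustrative.

```python
# Toy ratings from the cell above: {item: {user: rating}}
item_ratings = {
    1: {"A": 1, "B": 2, "C": 2.5, "D": 4.5, "E": 3},
    2: {"A": 2, "B": 4, "C": 4, "D": 5},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

item_means = {item: mean(r.values()) for item, r in item_ratings.items()}

# User E has rated only item 1, so item 1 is the single neighbour of item 2.
# With one neighbour, the (positive) similarity weight cancels out of the
# weighted average, leaving just E's deviation from item 1's mean.
deviation = item_ratings[1]["E"] - item_means[1]
estimate = item_means[2] + deviation

print(round(estimate, 2))
```

Item 2's mean is 3.75 and user E rated item 1 0.4 above its mean of 2.6, which gives 3.75 + 0.4 = 4.15.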

### Grid search for model selection

In [7]:
from surprise.model_selection import GridSearchCV

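# Candidate similarity settings; every combination of these values is tried
sim_options = {
    "name": ["msd", "cosine"],    # similarity measure
    "min_support": [3, 4, 5],     # minimum number of common ratings for a non-zero similarity
    "user_based": [False, True]   # item-based vs user-based
}

param_grid = {"sim_options": sim_options}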

gs = GridSearchCV(KNNWithMeans,
                  param_grid,
                  measures=["rmse", "mae"],
                  cv=3
                  )

gs.fit(movielens)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
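A grid search fits one model for each parameter combination and each cross-validation fold, which is why the similarity matrix is computed many times. The sketch below enumerates the combinations for the grid above with `itertools.product`; it is a plain-Python illustration of the counting, not a description of Surprise's internals.

```python
from itertools import product

sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}
cv = 3  # number of cross-validation folds

# One dictionary per combination, taking one value from each option
combinations = [
    dict(zip(sim_options, values))
    for values in product(*sim_options.values())
]

print(len(combinations))       # number of candidate configurations
print(len(combinations) * cv)  # number of model fits across all folds
print(combinations[0])
```

Here 2 × 3 × 2 = 12 configurations and 3 folds give 36 fits, so adding values to the grid multiplies the running time.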

### Grid search for the best SVD model

In [8]:
from surprise import SVD

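param_grid = {
    "n_epochs": [5, 10],        # number of SGD passes over the training data
    "lr_all": [0.002, 0.005],   # learning rate for all parameters
    "reg_all": [0.4, 0.6]       # regularisation term for all parameters
}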
gs = GridSearchCV(SVD,
                  param_grid,
                  measures=["rmse", "mae"],
                  cv=3)

gs.fit(movielens)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.9630865575647292
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
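To show what the three searched parameters control, here is a simplified sketch of matrix factorisation trained with stochastic gradient descent, in the style of the biased model behind Surprise's `SVD` (global mean, user and item biases, and latent factors). It is not the library's implementation: `fit_sgd`, `rmse`, `n_factors=2`, and the integer-indexed toy ratings are assumptions made for this example.

```python
import numpy as np

def fit_sgd(ratings, n_users, n_items, n_factors=2,
            n_epochs=10, lr_all=0.005, reg_all=0.4, seed=0):
    """Learn biases and latent factors from (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])          # global mean rating
    bu = np.zeros(n_users)                            # user biases
    bi = np.zeros(n_items)                            # item biases
    pu = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    qi = rng.normal(0, 0.1, (n_items, n_factors))     # item factors
    for _ in range(n_epochs):                         # n_epochs: passes over the data
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + qi[i] @ pu[u])
            # lr_all scales each step; reg_all shrinks parameters towards zero
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            pu_old = pu[u].copy()
            pu[u] += lr_all * (err * qi[i] - reg_all * pu[u])
            qi[i] += lr_all * (err * pu_old - reg_all * qi[i])
    return mu, bu, bi, pu, qi

def rmse(ratings, model):
    """Training RMSE of a fitted model on the given triples."""
    mu, bu, bi, pu, qi = model
    errors = [r - (mu + bu[u] + bi[i] + qi[i] @ pu[u]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy ratings from this notebook, with users A-E mapped to 0-4 and items 1-2 to 0-1
ratings = [(0, 0, 1), (0, 1, 2), (1, 0, 2), (1, 1, 4), (2, 0, 2.5),
           (2, 1, 4), (3, 0, 4.5), (3, 1, 5), (4, 0, 3)]

untrained = fit_sgd(ratings, n_users=5, n_items=2, n_epochs=0)
trained = fit_sgd(ratings, n_users=5, n_items=2, n_epochs=10)
print(rmse(ratings, untrained), rmse(ratings, trained))
```

More epochs or a larger learning rate fit the training data more closely, while a larger regularisation term holds the parameters back; the grid search looks for the combination that gives the lowest cross-validated error.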
