# Test KNN and SVD algorithms in surprise

This notebook follows the Real Python tutorial on collaborative filtering: https://realpython.com/build-recommendation-engine-collaborative-filtering/

Open question: how to evaluate the fitted models on held-out test data. That part is not complete.

In [1]:
from scipy import spatial

In [2]:
a = [1, 2]
b = [2, 4]
c = [2.5, 4]
d = [4.5, 5]

In [3]:
print(spatial.distance.euclidean(c, a))

print(spatial.distance.euclidean(c, b))

print(spatial.distance.euclidean(c, d))

2.5
0.5
2.23606797749979


In [4]:
print(spatial.distance.cosine(c,a))

print(spatial.distance.cosine(c,b))

print(spatial.distance.cosine(c,d))

print(spatial.distance.cosine(a,b))

0.004504527406047898
0.004504527406047898
0.015137225946083022
0.0
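The outputs above can be reproduced from the definitions, which also explains the 0.0 for `a` and `b`: `b` is exactly twice `a`, so the two vectors point in the same direction and differ only in length. A minimal plain-Python sketch (the helper names `euclidean` and `cosine_distance` are illustrative and not part of SciPy):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    """1 - cos(angle between u and v); insensitive to vector length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

a, b, c, d = [1, 2], [2, 4], [2.5, 4], [4.5, 5]

print(euclidean(c, a))                               # distance depends on magnitude
print(cosine_distance(c, a), cosine_distance(c, b))  # same angle, so same value
print(cosine_distance(a, b))                         # parallel vectors
```

Euclidean distance says `c` is closest to `b`, while cosine distance treats `a` and `b` as identical in direction. This matters for ratings because two users can have the same taste but rate on different parts of the scale.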


## Matrix factorization

- singular value decomposition (SVD)

In [5]:
import pandas as pd
from surprise import Dataset
from surprise import Reader

# This is the same data that was plotted for similarity earlier
# with one new user "E" who has rated only movie 1
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))

In [6]:
df

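| | item | user | rating |
|---|---|---|---|
| 0 | 1 | A | 1.0 |
| 1 | 2 | A | 2.0 |
| 2 | 1 | B | 2.0 |
| 3 | 2 | B | 4.0 |
| 4 | 1 | C | 2.5 |
| 5 | 2 | C | 4.0 |
| 6 | 1 | D | 4.5 |
| 7 | 2 | D | 5.0 |
| 8 | 1 | E | 3.0 |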


In [7]:
reader

<surprise.reader.Reader at 0x7f9d9281d190>

In [8]:
# Loads Pandas dataframe
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)
# Loads the builtin Movielens-100k data
movielens = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/dongzhang/.surprise_data/ml-100k


In [10]:
data

<surprise.dataset.DatasetAutoFolds at 0x7f9d929b1dc0>

## K-Nearest Neighbours (k-NN)

- **name** contains the similarity metric to use. Options are cosine, msd, pearson, or pearson_baseline. The default is msd.

- **user_based** is a boolean that tells whether the approach will be user-based or item-based. The default is True, which means the user-based approach will be used.

- **min_support** is the minimum number of items two users must have in common for their similarity to be computed; below that threshold the similarity is treated as 0. For the item-based approach, this is the minimum number of common users for two items.
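To make the interaction between `name` and `min_support` concrete, here is a plain-Python sketch of the `msd` metric. It assumes the definition given in the Surprise documentation (similarity = 1 / (MSD + 1), where MSD is the mean squared difference over common items, and the similarity is 0 when fewer than `min_support` items are shared). The function `msd_similarity` is a hypothetical helper, not a Surprise API.

```python
def msd_similarity(ratings_u, ratings_v, min_support=1):
    """Similarity of two users given as {item: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    # Too few shared items: the pair is not considered similar at all
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# Users A and B from the toy data used in this notebook
user_a = {1: 1.0, 2: 2.0}
user_b = {1: 2.0, 2: 4.0}

print(msd_similarity(user_a, user_b))                 # two common items are enough
print(msd_similarity(user_a, user_b, min_support=3))  # not enough common items
```

Raising `min_support` discards similarities that rest on very few shared ratings, at the cost of leaving some pairs with no usable similarity.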

In [11]:
from surprise import KNNWithMeans

# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": False,  # Compute similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)

In [12]:
trainingSet = data.build_full_trainset()

In [14]:
algo.fit(trainingSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f9d92a0d760>

### Case study: how user E would rate movie 2

In [17]:
prediction = algo.predict('E', 2)
prediction

Prediction(uid='E', iid=2, r_ui=None, est=4.15, details={'actual_k': 1, 'was_impossible': False})
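The estimate can be checked by hand. The sketch below assumes the item-based `KNNWithMeans` formula from the Surprise documentation: the target item's mean rating plus a similarity-weighted average of how far the user's other ratings sit from those items' means. It ignores the `k` and `min_k` limits, which do not matter here because user E has rated only one other item (`actual_k` is 1). All helper names are illustrative.

```python
import math

# Ratings from the toy DataFrame, grouped by item: {item: {user: rating}}
ratings = {
    1: {"A": 1, "B": 2, "C": 2.5, "D": 4.5, "E": 3},
    2: {"A": 2, "B": 4, "C": 4, "D": 5},
}

def item_mean(item):
    vals = ratings[item].values()
    return sum(vals) / len(vals)

def cosine_sim(i, j):
    """Cosine similarity of two items over the users who rated both."""
    common = set(ratings[i]) & set(ratings[j])
    dot = sum(ratings[i][u] * ratings[j][u] for u in common)
    norm_i = math.sqrt(sum(ratings[i][u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings[j][u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def predict(user, item):
    # Neighbours are the other items this user has already rated
    neighbours = [j for j in ratings if j != item and user in ratings[j]]
    num = sum(cosine_sim(item, j) * (ratings[j][user] - item_mean(j))
              for j in neighbours)
    den = sum(cosine_sim(item, j) for j in neighbours)
    return item_mean(item) + num / den

estimate = predict("E", 2)
print(round(estimate, 2))
```

With a single neighbour the similarity cancels out, so the estimate is the mean of movie 2 (3.75) plus E's offset from the mean of movie 1 (3 - 2.6), which agrees with the `est` reported above.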

In [18]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
...
Done computi
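The message repeats because the grid search fits one model per parameter combination per cross-validation fold. The following plain-Python sketch enumerates the same grid to show how many fits that implies. It only illustrates the counting and is not Surprise's internal code.

```python
from itertools import product

sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}
cv = 3  # number of folds passed to GridSearchCV above

keys = list(sim_options)
# Cartesian product of all option values, one dict per combination
combinations = [dict(zip(keys, values))
                for values in product(*sim_options.values())]

print(len(combinations))       # parameter combinations
print(len(combinations) * cv)  # total model fits
print(combinations[0])
```

Each extra value added to any list multiplies the number of fits, so large grids on `ml-100k` with k-NN can take a long time.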

In [21]:
# How to do the test?
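One possible answer to the question above is hold-out evaluation: fit on part of the ratings, predict the rest, and compare predictions with the true ratings. In Surprise this is typically done with `train_test_split` from `surprise.model_selection`, `algo.test(testset)`, and `accuracy.rmse` from `surprise`. The sketch below shows only the arithmetic behind the two measures used in this notebook, in plain Python. The `held_out` pairs are made-up numbers for illustration, not results from any model here.

```python
import math

# Hypothetical (true rating, predicted rating) pairs for held-out ratings
held_out = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]

def rmse(pairs):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error: average size of the miss."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

print(rmse(held_out))
print(mae(held_out))
```

Both measures are in rating units, so on a 1 to 5 scale an RMSE near 1 means predictions are typically off by about one star.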

## SVD

- **n_epochs** is the number of passes that stochastic gradient descent (SGD) makes over the training ratings. SGD is an iterative method for minimizing a function, here the regularized prediction error.

- **lr_all** is the learning rate for all parameters: the step size that controls how much each parameter is adjusted at every update.

- **reg_all** is the regularization strength for all parameters: the weight of the penalty term added to the error to prevent overfitting.
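The role of the three parameters is easier to see in a small SGD loop. This is a minimal sketch of a biased matrix factorization model (prediction = global mean + user bias + item bias + dot product of latent factors), assuming the update rule described in the Surprise documentation for `SVD`. The function `sgd_factorize` and its defaults are illustrative and this is not Surprise's implementation.

```python
import math
import random

def sgd_factorize(ratings, n_factors=2, n_epochs=20,
                  lr_all=0.005, reg_all=0.02, seed=0):
    """Fit biases and latent factors on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):              # n_epochs: passes over the data
        for u, i, r in ratings:
            err = r - predict(u, i)
            # lr_all scales every step; reg_all shrinks parameters toward 0
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                pu_f, qi_f = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qi_f - reg_all * pu_f)
                q[i][f] += lr_all * (err * pu_f - reg_all * qi_f)
    return predict

# Toy ratings from earlier in this notebook
toy_ratings = [("A", 1, 1.0), ("A", 2, 2.0), ("B", 1, 2.0), ("B", 2, 4.0),
               ("C", 1, 2.5), ("C", 2, 4.0), ("D", 1, 4.5), ("D", 2, 5.0),
               ("E", 1, 3.0)]

def train_rmse(predict):
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in toy_ratings)
                     / len(toy_ratings))

rmse_before = train_rmse(sgd_factorize(toy_ratings, n_epochs=0))
rmse_after = train_rmse(sgd_factorize(toy_ratings, n_epochs=200))
print(rmse_before, rmse_after)
```

More epochs or a larger learning rate fit the training data more closely, while a larger `reg_all` holds the parameters back. The grid search in this section looks for the combination that gives the best cross-validated error, not the best training error.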

In [22]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")

param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)

gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.9635607198455595
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
