# Using `surprise`

See the [getting started guide](https://surprise.readthedocs.io/en/stable/getting_started.html) in the `surprise` documentation.

In [1]:
import surprise
from surprise.prediction_algorithms import KNNBasic, SVD, NMF
import pandas as pd
import numpy as np
import datetime as dt

## Agenda

Students will be able to:

- use the `surprise` package to build recommendation engines.

In [2]:
data = surprise.Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/jonathanfetterolf/.surprise_data/ml-100k


The download is saved to the hidden directory `~/.surprise_data`, so we can also read the raw ratings file directly with pandas:

In [3]:
df = pd.read_csv('~/.surprise_data/ml-100k/ml-100k/u.data',
                 sep='\t', names=['user', 'item', 'rating', 'timestamp'])
df

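| | user | item | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
| ... | ... | ... | ... | ... |
| 99995 | 880 | 476 | 3 | 880175444 |
| 99996 | 716 | 204 | 5 | 879795543 |
| 99997 | 276 | 1090 | 1 | 874795795 |
| 99998 | 13 | 225 | 2 | 882399156 |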


## Data Exploration

In [4]:
df['user'].nunique()

943

In [5]:
df['item'].nunique()

1682

In [6]:
stats = df[['rating', 'timestamp']].describe()
stats

| | rating | timestamp |
|---|---|---|
| count | 100000.0 | 100000.0 |
| mean | 3.52986 | 883528900.0 |
| std | 1.125674 | 5343856.0 |
| min | 1.0 | 874724700.0 |
| 25% | 3.0 | 879448700.0 |
| 50% | 4.0 | 882826900.0 |
| 75% | 4.0 | 888260000.0 |
| max | 5.0 | 893286600.0 |
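
The counts above say how sparse the user-item matrix is, which is the reason we need a recommender at all. A minimal sketch of that arithmetic, with the three figures copied from the outputs above (the variable names are only illustrative):

```python
# Figures copied from the notebook outputs above.
n_users = 943
n_items = 1682
n_ratings = 100_000

# Every possible (user, item) pair is one cell of the rating matrix.
n_cells = n_users * n_items
density = n_ratings / n_cells

print(f"possible pairs: {n_cells:,}")
print(f"observed share: {density:.2%}")
```

Only about 6% of the cells hold a rating, so the models below have to estimate the rest.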


In [7]:
print(dt.datetime.fromtimestamp(stats.loc['min', 'timestamp']))
print(dt.datetime.fromtimestamp(stats.loc['max', 'timestamp']))

1997-09-19 23:05:10
1998-04-22 19:10:38
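
Note that `datetime.fromtimestamp` without a `tz` argument converts to the local timezone of the machine running the notebook, so the dates printed above can shift by a few hours elsewhere. A small sketch of a timezone-independent conversion follows. The helper name `to_utc` is hypothetical, and the timestamp is the one from row 0 of the ratings table.

```python
import datetime as dt


def to_utc(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(ts, tz=dt.timezone.utc)


first_row_ts = 881250949  # timestamp of row 0 in the ratings table
stamp = to_utc(first_row_ts)
print(stamp.isoformat())
```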


In [8]:
read = surprise.Reader('ml-100k')

In [9]:
read.rating_scale

(1, 5)

## Modeling

In [11]:
train, test = surprise.model_selection.train_test_split(data, random_state=42)

In [12]:
model = KNNBasic().fit(train)

Computing the msd similarity matrix...
Done computing similarity matrix.


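`KNNBasic` predicts a rating as a similarity-weighted average over the $k$ nearest neighbors. With the default user-based similarity, the neighbors are the users $v$ most similar to $u$ who have rated item $i$:

$\hat{r}_{ui} = \frac{
    \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot r_{vi}}
    {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}$

With item-based similarity (`sim_options={'user_based': False}`), the neighbors are the items $j$ most similar to $i$ that user $u$ has rated:

$\hat{r}_{ui} = \frac{
    \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot r_{uj}}
    {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}$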
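
To make the weighted average concrete, here is a minimal pure-Python sketch of the user-based formula. It is not the library's implementation: `knn_predict` and the toy similarities are made up for illustration, and ignoring non-positive similarities is an assumption of this sketch.

```python
def knn_predict(neighbors, k=40):
    """Similarity-weighted average of the k most similar neighbors' ratings.

    neighbors: list of (similarity, rating) pairs, one per user v who rated item i.
    """
    # Keep only positively similar neighbors, then take the k most similar.
    positive = [pair for pair in neighbors if pair[0] > 0]
    top_k = sorted(positive, key=lambda pair: pair[0], reverse=True)[:k]
    sim_total = sum(sim for sim, _ in top_k)
    if sim_total == 0:
        raise ValueError("no neighbors with positive similarity")
    return sum(sim * rating for sim, rating in top_k) / sim_total


# Toy data: three neighbors with (sim(u, v), r_vi).
neighbors = [(0.9, 5.0), (0.5, 3.0), (0.1, 1.0)]
pred_all = knn_predict(neighbors, k=3)   # (4.5 + 1.5 + 0.1) / 1.5
pred_top2 = knn_predict(neighbors, k=2)  # (4.5 + 1.5) / 1.4
print(round(pred_all, 4), round(pred_top2, 4))
```

Dropping the least similar neighbor pulls the estimate toward the ratings of the closest users.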

In [13]:
model2 = SVD().fit(train)

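`SVD` predicts $\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$ and learns the biases and factors by minimizing the regularized squared error:

$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 +
    \lambda\left(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2\right)$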
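
The objective can be checked by hand on a single rating. The sketch below assumes the prediction $\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u$ from the `surprise` documentation and applies the penalty once per observed rating. All function names and numbers are hypothetical toy values, not output from `model2`.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def svd_predict(mu, b_u, b_i, q_i, p_u):
    """Global mean plus user bias, item bias and the factor interaction."""
    return mu + b_u + b_i + dot(q_i, p_u)


def regularized_loss(ratings, mu, bu, bi, p, q, lam):
    """Squared error plus an L2 penalty, summed over the observed ratings."""
    total = 0.0
    for u, i, r in ratings:
        err = r - svd_predict(mu, bu[u], bi[i], q[i], p[u])
        penalty = bi[i] ** 2 + bu[u] ** 2 + dot(q[i], q[i]) + dot(p[u], p[u])
        total += err ** 2 + lam * penalty
    return total


# One toy rating: user "u1" gave item "i1" a 4.
ratings = [("u1", "i1", 4.0)]
mu, bu, bi = 3.5, {"u1": 0.2}, {"i1": -0.1}
p, q = {"u1": [1.0, 0.5]}, {"i1": [0.4, 0.2]}

pred = svd_predict(mu, bu["u1"], bi["i1"], q["i1"], p["u1"])
loss = regularized_loss(ratings, mu, bu, bi, p, q, lam=0.02)
print(round(pred, 4), round(loss, 4))
```

Here the prediction is 4.1, so the squared error is 0.01, and the penalty adds 0.02 × 1.5 = 0.03.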

In [14]:
model3 = NMF().fit(train)

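`NMF` uses the same factorization idea without bias terms by default, and keeps the user and item factors non-negative:

$\hat{r}_{ui} = q_i^Tp_u$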

In [15]:
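# get_neighbors takes and returns inner ids (not raw ids). KNNBasic is
# user-based by default, so these ids refer to users in the trainset.
model.get_neighbors(iid=51, k=1)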

[65]

In [16]:
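# Caution: the ids above are inner ids from a user-based model, while this
# filter matches raw item ids in `df`. The trainset's to_raw_uid / to_inner_uid
# methods translate between the two.
df.loc[df['item'].isin([51, 65])].sort_values('user')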

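| | user | item | rating | timestamp |
|---|---|---|---|---|
| 17220 | 1 | 65 | 4 | 875072125 |
| 7180 | 1 | 51 | 4 | 878543275 |
| 34873 | 7 | 51 | 2 | 891352984 |
| 19068 | 11 | 51 | 4 | 891906439 |
| 20877 | 13 | 51 | 3 | 882399419 |
| ... | ... | ... | ... | ... |
| 69366 | 916 | 65 | 3 | 880845327 |
| 71730 | 916 | 51 | 2 | 880845658 |
| 90292 | 922 | 51 | 4 | 891448451 |
| 83681 | 934 | 65 | 4 | 891192914 |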


## Evaluation

In [17]:
model.test(test)

[Prediction(uid='907', iid='143', r_ui=5.0, est=4.039960584359155, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='371', iid='210', r_ui=4.0, est=4.017925064716712, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='218', iid='42', r_ui=4.0, est=3.7671897065953712, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='829', iid='170', r_ui=4.0, est=4.196945437050507, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='733', iid='277', r_ui=1.0, est=3.3353958388714653, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='363', iid='1512', r_ui=1.0, est=4.463116702100285, details={'actual_k': 4, 'was_impossible': False}),
 Prediction(uid='193', iid='487', r_ui=5.0, est=3.959646386658832, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='808', iid='313', r_ui=5.0, est=4.482811176968667, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='557', iid='682', r_ui=2.0, est
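
Each `Prediction` is a named tuple, so the list is easy to inspect. The sketch below ranks predictions by absolute error. It uses a hypothetical stand-in `Prediction` type filled with four of the values printed above, so that it runs without the fitted model; in the notebook you would pass the result of `model.test(test)` instead.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields printed above.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

predictions = [
    Prediction("907", "143", 5.0, 4.039960584359155,
               {"actual_k": 40, "was_impossible": False}),
    Prediction("371", "210", 4.0, 4.017925064716712,
               {"actual_k": 40, "was_impossible": False}),
    Prediction("733", "277", 1.0, 3.3353958388714653,
               {"actual_k": 40, "was_impossible": False}),
    Prediction("363", "1512", 1.0, 4.463116702100285,
               {"actual_k": 4, "was_impossible": False}),
]


def abs_error(pred):
    """Absolute gap between the true rating and the estimate."""
    return abs(pred.r_ui - pred.est)


# Largest errors first.
worst_first = sorted(predictions, key=abs_error, reverse=True)
for pred in worst_first[:2]:
    print(pred.uid, pred.iid, round(abs_error(pred), 3), pred.details["actual_k"])
```

In this small sample the largest miss is also the one estimate built from only 4 neighbors (`actual_k`) rather than 40.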

In [18]:
surprise.accuracy.mae(model.test(test))

MAE:  0.7727


0.7726923699816388

In [19]:
surprise.accuracy.mae(model2.test(test))

MAE:  0.7387


0.7387193198681798

In [20]:
surprise.accuracy.mae(model3.test(test))

MAE:  0.7528


0.7528091415119881

In [21]:
surprise.accuracy.rmse(model.test(test))

RMSE: 0.9802


0.980150596704479

In [22]:
surprise.accuracy.rmse(model2.test(test))

RMSE: 0.9364


0.9364250865984487

In [23]:
surprise.accuracy.rmse(model3.test(test))

RMSE: 0.9574


0.9574261291882717
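
For reference, both metrics are simple to compute by hand. This is a minimal sketch on made-up (true rating, estimate) pairs. The `mae` and `rmse` helpers here are illustrative stand-ins, not the `surprise.accuracy` functions.

```python
import math


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


# Toy pairs: errors of 1, 0 and 2 rating points.
pairs = [(5.0, 4.0), (4.0, 4.0), (1.0, 3.0)]
toy_mae = mae(pairs)
toy_rmse = rmse(pairs)
print(round(toy_mae, 4), round(toy_rmse, 4))
```

Because RMSE squares each error before averaging, large misses weigh more heavily. RMSE is therefore never smaller than MAE, as with all three models above.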