# surprise

[API documentation](https://surprise.readthedocs.io/en/stable/getting_started.html)<br>
[Loading data](#Dataset)<br>
[Model training](#fit)<br>
[Cross-validation](#CV)<br>

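Algorithm | Description
-: | :-
random_pred.**NormalPredictor** | Predicts a random rating, assuming the ratings follow a normal distribution
**BaselineOnly** | Predicts the statistical baseline estimate for a given user and item
knns.**KNNBasic** | Basic collaborative filtering algorithm
knns.**KNNWithMeans** | Collaborative filtering variant that takes each user's mean rating into account
knns.**KNNWithZScore** | Collaborative filtering variant that takes the z-score normalization of each user's ratings into account
knns.**KNNBaseline** | Collaborative filtering variant that takes a baseline rating into account
matrix_factorization.**SVD** | SVD matrix factorization algorithm
matrix_factorization.**SVDpp** | SVD++ matrix factorization algorithm
matrix_factorization.**NMF** | Collaborative filtering algorithm based on non-negative matrix factorization
**SlopeOne** | SlopeOne collaborative filtering algorithm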

In [1]:
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split

import pandas as pd

## Dataset
[Back to top](#surprise)

In [48]:
data=Dataset.load_builtin('ml-100k')

In [5]:
data_value=data.raw_ratings
data_value  # the fourth field of each tuple is the timestamp

[('196', '242', 3.0, '881250949'),
 ('186', '302', 3.0, '891717742'),
 ('22', '377', 1.0, '878887116'),
 ('244', '51', 2.0, '880606923'),
 ('166', '346', 1.0, '886397596'),
 ('298', '474', 4.0, '884182806'),
 ('115', '265', 2.0, '881171488'),
 ('253', '465', 5.0, '891628467'),
 ('305', '451', 3.0, '886324817'),
 ('6', '86', 3.0, '883603013'),
 ('62', '257', 2.0, '879372434'),
 ('286', '1014', 5.0, '879781125'),
 ('200', '222', 5.0, '876042340'),
 ('210', '40', 3.0, '891035994'),
 ('224', '29', 3.0, '888104457'),
 ('303', '785', 3.0, '879485318'),
 ('122', '387', 5.0, '879270459'),
 ('194', '274', 2.0, '879539794'),
 ('291', '1042', 4.0, '874834944'),
 ('234', '1184', 2.0, '892079237'),
 ('119', '392', 4.0, '886176814'),
 ('167', '486', 4.0, '892738452'),
 ('299', '144', 4.0, '877881320'),
 ('291', '118', 2.0, '874833878'),
 ('308', '1', 4.0, '887736532'),
 ('95', '546', 2.0, '879196566'),
 ('38', '95', 5.0, '892430094'),
 ('102', '768', 2.0, '883748450'),
 ('63', '277', 4.0, '875747401
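
Each entry of `raw_ratings` is a `(user id, item id, rating, timestamp)` tuple. The snippet below is a small sketch of how the fourth field can be made readable, assuming it is a Unix epoch in seconds stored as a string (which is what the values above look like). The helper name `to_datetime` is made up for this note.

```python
from datetime import datetime, timezone

# First tuple from the output above: (user id, item id, rating, timestamp)
sample = ('196', '242', 3.0, '881250949')

def to_datetime(raw_timestamp):
    """Convert a Unix-epoch string (seconds, assumed) to a UTC datetime."""
    return datetime.fromtimestamp(int(raw_timestamp), tz=timezone.utc)

uid, iid, rating, timestamp = sample
rated_at = to_datetime(timestamp)
print(uid, iid, rating, rated_at.date())
```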

In [53]:
reader=Reader(name=None,
             line_format='user item rating',
             sep=',',
             rating_scale=(1,10),
             skip_lines=1)
data1=Dataset.load_from_file(r'data_set\ratings.csv', reader=reader)
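
To make the `Reader` settings concrete, here is a rough standard-library imitation of what `line_format`, `sep` and `skip_lines` mean. It is not Surprise's parser: the function `parse_ratings` is hypothetical, and the file content (including the header row) is invented, with ids borrowed from this notebook's ratings file.

```python
import csv
import io

# Hypothetical file content: one header line, then user,item,rating rows.
text = "user,item,rating\n276726,0155061224,5\n276729,052165615X,3\n"

def parse_ratings(handle, sep=',', skip_lines=1):
    """Yield (user, item, rating, timestamp) tuples in the raw_ratings layout."""
    rows = csv.reader(handle, delimiter=sep)
    for _ in range(skip_lines):
        next(rows, None)  # drop header lines
    for user, item, rating in rows:  # order given by 'user item rating'
        yield user, item, float(rating), None  # no timestamp column

parsed = list(parse_ratings(io.StringIO(text)))
print(parsed)
```

Ids stay strings and the missing timestamp becomes `None`, which matches the shape of the tuples that `load_from_file` produces.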

In [54]:
data1.raw_ratings

[('276726', '0155061224', 5.0, None),
 ('276729', '052165615X', 3.0, None),
 ('276729', '0521795028', 6.0, None),
 ('276744', '038550120X', 7.0, None),
 ('276747', '0060517794', 9.0, None),
 ('276747', '0671537458', 9.0, None),
 ('276747', '0679776818', 8.0, None),
 ('276747', '0943066433', 7.0, None),
 ('276747', '1885408226', 7.0, None),
 ('276748', '0747558167', 6.0, None),
 ('276751', '3596218098', 8.0, None),
 ('276754', '0684867621', 8.0, None),
 ('276755', '0451166892', 5.0, None),
 ('276762', '0380711524', 5.0, None),
 ('276762', '3453092007', 8.0, None),
 ('276772', '0553572369', 7.0, None),
 ('276772', '3499230933', 10.0, None),
 ('276772', '3596151465', 10.0, None),
 ('276774', '3442136644', 9.0, None),
 ('276786', '8437606322', 8.0, None),
 ('276786', '8478442588', 6.0, None),
 ('276788', '0345443683', 8.0, None),
 ('276788', '043935806X', 7.0, None),
 ('276788', '055310666X', 10.0, None),
 ('276796', '0330332775', 5.0, None),
 ('276798', '0006379702', 5.0, None),
 ('276798

In [55]:
ratings=pd.read_csv(r'data_set\ratings.csv')
# load_from_df expects the columns in the order: user, item, rating
reader=Reader(rating_scale=(0,10))
data2=Dataset.load_from_df(ratings, reader)

In [56]:
data2.raw_ratings

[(276726, '0155061224', 5.0, None),
 (276729, '052165615X', 3.0, None),
 (276729, '0521795028', 6.0, None),
 (276744, '038550120X', 7.0, None),
 (276747, '0060517794', 9.0, None),
 (276747, '0671537458', 9.0, None),
 (276747, '0679776818', 8.0, None),
 (276747, '0943066433', 7.0, None),
 (276747, '1885408226', 7.0, None),
 (276748, '0747558167', 6.0, None),
 (276751, '3596218098', 8.0, None),
 (276754, '0684867621', 8.0, None),
 (276755, '0451166892', 5.0, None),
 (276762, '0380711524', 5.0, None),
 (276762, '3453092007', 8.0, None),
 (276772, '0553572369', 7.0, None),
 (276772, '3499230933', 10.0, None),
 (276772, '3596151465', 10.0, None),
 (276774, '3442136644', 9.0, None),
 (276786, '8437606322', 8.0, None),
 (276786, '8478442588', 6.0, None),
 (276788, '0345443683', 8.0, None),
 (276788, '043935806X', 7.0, None),
 (276788, '055310666X', 10.0, None),
 (276796, '0330332775', 5.0, None),
 (276798, '0006379702', 5.0, None),
 (276798, '3442131340', 7.0, None),
 (276798, '3548603203', 6

## fit
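[Back to top](#surprise)

**Splitting into a training set and a test set**

If you do not want to run cross-validation, you can use `train_test_split()` to split the data into a training set and a test set of a given size, and then evaluate with the accuracy metric of your choice.<br>
The `fit()` method trains the algorithm on the training set, and the `test()` method returns the predictions made for the test set.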

In [49]:
trainset, testset=train_test_split(data, test_size=0.25)
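
The idea behind a hold-out split can be sketched on a plain list. This is only an illustration of `test_size=0.25`, not what Surprise does internally (its `train_test_split` returns a `Trainset` object plus a list of test tuples). The function `split_ratings` and the toy ratings are invented for this note.

```python
import random

def split_ratings(ratings, test_size=0.25, seed=0):
    """Shuffle a copy of the ratings and cut off the last test_size share."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Toy (user, item, rating) tuples, invented for this illustration
toy = [(str(u), str(i), float((u + i) % 5 + 1)) for u in range(4) for i in range(5)]
train, test = split_ratings(toy)
print(len(train), len(test))
```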

In [50]:
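algo=SVD(
    n_factors=100,  # number of latent factors
    n_epochs=20,  # number of SGD epochs
    biased=True,  # use baselines (biased SVD)
    init_mean=0,  # mean of the normal distribution used to initialise the factors
    init_std_dev=0.1,  # standard deviation of that distribution
    lr_all=0.005,  # learning rate for all parameters
    reg_all=0.02,  # regularisation term for all parameters
    lr_bu=None,  # learning rate for the user biases (falls back to lr_all when None)
    lr_bi=None,  # learning rate for the item biases
    lr_pu=None,  # learning rate for the user factors
    lr_qi=None,  # learning rate for the item factors
    reg_bu=None,  # regularisation term for the user biases (falls back to reg_all when None)
    reg_bi=None,  # regularisation term for the item biases
    reg_pu=None,  # regularisation term for the user factors
    reg_qi=None,  # regularisation term for the item factors
    random_state=None,  # seed for reproducible initialisation
    verbose=True,  # print the current epoch while training
)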
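
To see what the learning-rate and regularisation parameters control, the following is a minimal sketch of one SGD update for a biased SVD model, where the estimate is the global mean plus the user bias, the item bias and the dot product of the user and item factors. It uses a single `lr` and `reg` (the role played by `lr_all` and `reg_all`); the function names and toy numbers are invented here and this is not Surprise's actual code.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased-SVD estimate: global mean + biases + dot product of factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """Apply one regularised SGD update for a single observed rating."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy values, chosen only for illustration
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]
r_ui = 5.0

before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

After the step the estimate moves slightly towards the true rating; `n_epochs` repeats such updates over every rating in the training set.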

In [51]:
algo.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x24db3eaf108>

In [52]:
pred=algo.test(testset)
pred

[Prediction(uid='655', iid='880', r_ui=2.0, est=2.5485123982674875, details={'was_impossible': False}),
 Prediction(uid='655', iid='533', r_ui=2.0, est=2.843042018086738, details={'was_impossible': False}),
 Prediction(uid='727', iid='234', r_ui=2.0, est=3.765125525466264, details={'was_impossible': False}),
 Prediction(uid='551', iid='51', r_ui=5.0, est=3.5248906550914354, details={'was_impossible': False}),
 Prediction(uid='393', iid='739', r_ui=3.0, est=3.3108269215039132, details={'was_impossible': False}),
 Prediction(uid='372', iid='581', r_ui=5.0, est=3.8160436793939914, details={'was_impossible': False}),
 Prediction(uid='577', iid='546', r_ui=3.0, est=3.425196520930973, details={'was_impossible': False}),
 Prediction(uid='14', iid='213', r_ui=5.0, est=4.045541302336228, details={'was_impossible': False}),
 Prediction(uid='790', iid='258', r_ui=3.0, est=3.4747227171920105, details={'was_impossible': False}),
 Prediction(uid='314', iid='202', r_ui=5.0, est=3.785753605917301, det

In [57]:
accuracy.rmse(pred)

RMSE: 0.9474


0.9473743797731916

In [58]:
accuracy.mae(pred)

MAE:  0.7447


0.744674223312412

In [59]:
accuracy.mse(pred)

MSE: 0.8975


0.8975182154506396

In [60]:
accuracy.fcp(pred)

FCP:  0.7000


0.6999565012827683
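
FCP stands for "fraction of concordant pairs": for each user, it looks at pairs of items and checks whether the predicted ratings order them the same way as the true ratings. The sketch below is a simplified version that pools the pair counts over all users, so it illustrates the idea rather than reproducing the exact value returned by `accuracy.fcp`. The function `fcp_sketch` and the toy data are invented for this note.

```python
from collections import defaultdict
from itertools import combinations

def fcp_sketch(predictions):
    """predictions: iterable of (uid, true_rating, estimated_rating)."""
    by_user = defaultdict(list)
    for uid, r_ui, est in predictions:
        by_user[uid].append((r_ui, est))
    concordant = discordant = 0
    for pairs in by_user.values():
        for (r_a, e_a), (r_b, e_b) in combinations(pairs, 2):
            if r_a == r_b:
                continue  # equal true ratings carry no ordering information
            if (r_a - r_b) * (e_a - e_b) > 0:
                concordant += 1  # same order in truth and in estimate
            else:
                discordant += 1
    total = concordant + discordant
    return concordant / total if total else 0.0

# Toy predictions, invented for this illustration
demo = [('u1', 5, 4.5), ('u1', 3, 3.0), ('u1', 1, 3.5),
        ('u2', 4, 2.0), ('u2', 2, 3.0)]
print(fcp_sketch(demo))
```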

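**Without a train/test split**

You can also simply fit the algorithm on the whole dataset. Build the training set with the `build_full_trainset()` method, then call `predict()` directly to estimate the rating for a given user and item.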

In [62]:
trainset=data.build_full_trainset()

In [66]:
algo=SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x24dc792ccc8>

In [67]:
algo.predict('196', '302')

Prediction(uid='196', iid='302', r_ui=None, est=3.7977464700656007, details={'was_impossible': False})

## CV
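[Back to top](#surprise)

Surprise ships with a set of built-in functions and datasets, so running **cross-validation** takes only a few lines of code.<br>

So what is cross-validation? Cross-validation (CV), sometimes called rotation estimation, is a statistical technique that partitions the data sample into smaller subsets. In practice a trained model usually fits its training set well but fits data outside the training set less satisfactorily. We therefore do not train on all of the data: a part is held out, takes no part in training, and is used to test the parameters learned from the training set, which gives a relatively objective measure of how well they fit unseen data. In k-fold cross-validation this is repeated k times, with each subset serving once as the test set.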
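
The partitioning can be sketched in plain Python. This is a minimal illustration that cuts the sample indices into contiguous folds without shuffling (real implementations usually shuffle first); the function name `k_fold_indices` is invented for this note.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, k=5))
for train, test in folds:
    print(test, len(train))
```

Every sample lands in exactly one test fold, so each rating is used for evaluation once and for training in the other rounds.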

- RMSE: root mean squared error
- MAE: mean absolute error
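
Both metrics can be computed by hand from pairs of true and estimated ratings. The sketch below uses only the standard library; the toy pairs are invented, and the helper names are chosen for this note (they are not the `accuracy` module's functions, which take `Prediction` objects).

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Toy (true rating, estimate) pairs, invented for this illustration
pairs = [(4.0, 3.0), (2.0, 3.0), (5.0, 5.0), (3.0, 1.0)]
rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(rmse_value, mae_value)
```

Because the errors are squared before averaging, RMSE penalises the single large miss (an error of 2) more heavily than MAE does.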

![](https://note.youdao.com/yws/api/personal/file/21324908C9F449388C30F2563680234E?method=download&shareKey=97d0390a58302ca966f63e38c8e78990)

In [68]:
cross_validate(algo, data, measures=['RMSE', 'MSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9386  0.9401  0.9273  0.9339  0.9396  0.9359  0.0048  
MSE (testset)     0.8810  0.8838  0.8598  0.8722  0.8828  0.8760  0.0090  
MAE (testset)     0.7407  0.7413  0.7294  0.7385  0.7389  0.7378  0.0043  
Fit time          5.20    5.27    5.64    5.76    5.54    5.48    0.21    
Test time         0.15    0.16    0.20    0.18    0.68    0.27    0.20    


{'test_rmse': array([0.93864198, 0.94012688, 0.92727885, 0.93392929, 0.93959316]),
 'test_mse': array([0.88104876, 0.88383855, 0.85984606, 0.87222392, 0.8828353 ]),
 'test_mae': array([0.74073835, 0.74134536, 0.7293954 , 0.73846963, 0.73890337]),
 'fit_time': (5.204080581665039,
  5.273918628692627,
  5.636896133422852,
  5.757598876953125,
  5.535195589065552),
 'test_time': (0.15162301063537598,
  0.15755319595336914,
  0.19547724723815918,
  0.17752552032470703,
  0.6761946678161621)}
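
The Mean and Std columns of the table can be recovered from the returned dictionary. A small sketch, assuming the Std column is a population standard deviation (an assumption that reproduces the figures printed above):

```python
from statistics import mean, pstdev

# test_rmse values copied from the cross_validate output above
test_rmse = [0.93864198, 0.94012688, 0.92727885, 0.93392929, 0.93959316]

rmse_mean = mean(test_rmse)
rmse_std = pstdev(test_rmse)  # population standard deviation
print(f"RMSE: {rmse_mean:.4f} +/- {rmse_std:.4f}")
```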

In [69]:
algo.predict('87', '384', r_ui=4)

Prediction(uid='87', iid='384', r_ui=4, est=3.2736326792187964, details={'was_impossible': False})

In [70]:
algo.predict?

Signature: algo.predict(uid, iid, r_ui=None, clip=True, verbose=False)
Docstring:
Compute the rating prediction for given user and item.

The ``predict`` method converts raw ids to inner ids and then calls the
``estimate`` method which is defined in every derived class. If the
prediction is impossible (e.g. because the user and/or the item is
unkown), the prediction is set according to :meth:`default_prediction()
<surprise.prediction_algorithms.algo_base.AlgoBase.default_prediction>`.

Args:
    uid: (Raw) id of the user. See :ref:`this note<raw_inner_note>`.
    iid: (Raw) id of the item. See :ref:`this note<raw_inner_note>`.
    r_ui(float): The true rating :math:`r_{ui}`. Optional, default is
        ``None``.
    clip(bool): Whether to cli

In [71]:
algo.predict('87', '384', r_ui=4).est

3.2736326792187964
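
The `clip=True` default in the signature above means the estimate is limited to the dataset's rating scale. A minimal sketch of that clamping, assuming a scale of 1 to 5 as in ml-100k; `clip_estimate` is a made-up helper, not part of Surprise.

```python
def clip_estimate(est, rating_scale=(1, 5)):
    """Clamp an estimate into the closed interval given by rating_scale."""
    low, high = rating_scale
    return min(high, max(low, est))

print(clip_estimate(5.7))                 # above the scale
print(clip_estimate(0.2))                 # below the scale
print(clip_estimate(3.2736326792187964))  # already inside the scale
```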