### Recommender systems with Surprise

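To build the recommender model we use the Python library [Surprise (Simple Python RecommendatIon System Engine)](https://github.com/NicolasHug/Surprise). It is one of the scikits, the same family as scikit-learn and scikit-image, which many readers will already have used.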

### Easy to use, with support for many recommendation algorithms:
* [Baseline algorithms](http://surprise.readthedocs.io/en/stable/basic_algorithms.html)
* [Neighborhood methods (collaborative filtering)](http://surprise.readthedocs.io/en/stable/knn_inspired.html)
* [Matrix factorization-based methods (SVD, PMF, SVD++, NMF)](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)

| Algorithm class | Description |
| ------------- |:-----|
|[random_pred.NormalPredictor](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.random_pred.NormalPredictor)|Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.|
|[baseline_only.BaselineOnly](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.baseline_only.BaselineOnly)|Algorithm predicting the baseline estimate for given user and item.|
|[knns.KNNBasic](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBasic)|A basic collaborative filtering algorithm.|
|[knns.KNNWithMeans](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans)|A basic collaborative filtering algorithm, taking into account the mean ratings of each user.|
|[knns.KNNBaseline](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline)|A basic collaborative filtering algorithm taking into account a baseline rating.|	
|[matrix_factorization.SVD](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)|The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.|
|[matrix_factorization.SVDpp](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp)|The SVD++ algorithm, an extension of SVD taking into account implicit ratings.|
|[matrix_factorization.NMF](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF)|A collaborative filtering algorithm based on Non-negative Matrix Factorization.|
|[slope_one.SlopeOne](http://surprise.readthedocs.io/en/stable/slope_one.html#surprise.prediction_algorithms.slope_one.SlopeOne)|A simple yet accurate collaborative filtering algorithm.|
|[co_clustering.CoClustering](http://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering)|A collaborative filtering algorithm based on co-clustering.|

### The neighborhood-based methods (collaborative filtering) accept different similarity measures.

| Similarity measure | Description |
| ------------- |:-----|
|[cosine](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.cosine)|Compute the cosine similarity between all pairs of users (or items).|
|[msd](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.msd)|Compute the Mean Squared Difference similarity between all pairs of users (or items).|
|[pearson](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson)|Compute the Pearson correlation coefficient between all pairs of users (or items).|
|[pearson_baseline](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson_baseline)|Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.|
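
As a rough illustration of what the first three measures compute, the sketch below evaluates cosine, MSD, and Pearson similarity for two items over the users who rated both. It is plain Python, not Surprise's implementation. The function names and the toy ratings are made up for this example, and the MSD similarity uses the `1 / (msd + 1)` form described in the Surprise documentation.

```python
import math

def common_ratings(ratings_a, ratings_b):
    """Return paired ratings for the users who rated both items."""
    common = sorted(set(ratings_a) & set(ratings_b))
    return [(ratings_a[u], ratings_b[u]) for u in common]

def cosine_sim(pairs):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def msd_sim(pairs):
    """Mean squared difference, turned into a similarity in (0, 1]."""
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)

def pearson_sim(pairs):
    """Pearson correlation: cosine similarity after mean-centering."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0

# Toy data (hypothetical): user id -> rating, for two items
item_x = {'u1': 5, 'u2': 3, 'u3': 4, 'u4': 1}
item_y = {'u1': 4, 'u2': 2, 'u3': 5, 'u5': 3}

pairs = common_ratings(item_x, item_y)  # only u1, u2, u3 rated both
print('cosine :', round(cosine_sim(pairs), 4))
print('msd    :', round(msd_sim(pairs), 4))
print('pearson:', round(pearson_sim(pairs), 4))
```

Cosine stays close to 1 for any two vectors of positive ratings, which is one reason mean-centered measures such as Pearson tend to separate items more clearly.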

### Different evaluation metrics are supported
| Metric | Description |
| ------------- |:-----|
|[rmse](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.rmse)|Compute RMSE (Root Mean Squared Error).|
|[mae](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.mae)|Compute MAE (Mean Absolute Error).|
|[fcp](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.fcp)|Compute FCP (Fraction of Concordant Pairs).|
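
To make the first two metrics concrete, here is a minimal plain-Python sketch of RMSE and MAE over (true rating, estimated rating) pairs. The helper names and the four sample pairs are invented for illustration; in Surprise itself these metrics are computed from a list of predictions by the functions in `surprise.accuracy`.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Hypothetical (true rating, estimated rating) pairs
sample_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]

print('RMSE:', round(rmse(sample_pairs), 4))
print('MAE :', round(mae(sample_pairs), 4))
```

RMSE squares each error before averaging, so a single large miss weighs more heavily than it does in MAE.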

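We test on the MovieLens dataset and recommend the top-N movies most similar to a given movie.  
MovieLens is a dataset of users' movie ratings, collected by the GroupLens research group through the MovieLens website; its movie metadata file includes an IMDb URL for each title.  
Dataset size: 100,000 ratings by 943 users on 1,682 items.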

In [1]:
import io
from surprise import KNNBaseline
from surprise import Dataset, Reader

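The code needs a mapping between movie ids and movie titles. Here `rid` stands for raw id, the original id each movie has in the dataset files. When the Pearson similarity matrix is computed during training, every movie is remapped to an inner id; `to_inner_iid()` converts a raw id into the inner id used by the similarity matrix. The nearest neighbors come back as inner ids, so turning them into movie titles again goes through the raw id.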
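
The round trip described above can be sketched without Surprise. The helper below is hypothetical and only mimics the idea: raw ids get consecutive integer inner ids in order of first appearance, and a reverse table maps them back. The raw ids and titles are made up; raw ids are kept as strings because that is how they arrive when ratings are read from a file.

```python
def build_item_index(raw_item_ids):
    """Assign consecutive inner ids to raw ids in order of first appearance."""
    raw_to_inner = {}
    inner_to_raw = []
    for raw_id in raw_item_ids:
        if raw_id not in raw_to_inner:
            raw_to_inner[raw_id] = len(inner_to_raw)
            inner_to_raw.append(raw_id)
    return raw_to_inner, inner_to_raw

# Hypothetical raw item ids in the order they appear in the rating rows
raw_ids_in_ratings = ['242', '302', '377', '242', '51']
# Made-up titles standing in for the rid_2_name mapping
titles = {'242': 'Movie A', '302': 'Movie B', '377': 'Movie C', '51': 'Movie D'}

raw_to_inner, inner_to_raw = build_item_index(raw_ids_in_ratings)

inner_id = raw_to_inner['377']     # plays the role of trainset.to_inner_iid('377')
raw_id = inner_to_raw[inner_id]    # plays the role of trainset.to_raw_iid(inner_id)
print(inner_id, raw_id, titles[raw_id])
```

The point of the sketch is that an inner id is only meaningful for the trainset that produced it, which is why titles are always looked up through the raw id.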

In [2]:
def read_iter_names():
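    # Build the title -> raw id and raw id -> title mappings
    # Input: u.item, one movie per line:
    #        movie id|movie title|release date|video release date|IMDb URL|genre flags
    # Return: rid_2_name {}: maps raw movie id to movie title
    #         name_2_rid {}: maps movie title to raw movie id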
    item_file = 'ml-100k/u.item'
    rid_2_name = {}
    name_2_rid = {}
    with io.open(item_file, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_2_name[line[0]] = line[1]
            name_2_rid[line[1]] = line[0]
    return rid_2_name, name_2_rid

In [17]:
# u.data is tab-separated: user id, item id, rating, timestamp
reader = Reader(line_format='user item rating timestamp', sep='\t')
file_path = 'ml-100k'
data = Dataset.load_from_file(file_path=file_path + '/u.data', reader=reader)
# Train on the full dataset (no held-out test split)
train_set = data.build_full_trainset()
print('data size is', train_set.n_ratings)

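Here we use `KNNBaseline()`:
>`surprise.prediction_algorithms.knns.KNNBaseline(k=40, min_k=1, sim_options={}, bsl_options={}, verbose=True, **kwargs)`

- `k` (int): the (maximum) number of neighbors to take into account for aggregation. Default is 40.
- `min_k` (int): the minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero. Default is 1.
- `sim_options` (dict): a dictionary of options for the similarity measure. See [similarity measure configuration](http://surprise.readthedocs.io/en/stable/prediction_algorithms.html#similarity-measures-configuration). The `pearson_baseline` similarity measure is recommended.
  - Keys accepted in `sim_options`:
    - `'name'`: the name of the similarity measure, as defined in the similarities module. Default is `'MSD'`.
    - `'user_based'`: whether similarities are computed between users or between items. This has a large impact on the performance of the prediction algorithm. Default is `True`.
    - `'min_support'`: the minimum number of common items (when `'user_based'` is `True`) or common users (when `'user_based'` is `False`) required for the similarity to be non-zero.
    - `'shrinkage'`: the shrinkage parameter to apply (only relevant for the `pearson_baseline` similarity). Default is 100.
- `bsl_options` (dict): a dictionary of options for the baseline estimates computation. See the baseline estimates configuration section of the Surprise documentation for the accepted options.
- `verbose` (bool): whether to print trace messages for bias estimation, similarity computation, and so on. Default is `True`.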
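
To give a feel for how `'min_support'` and `'shrinkage'` interact, the sketch below applies the shrinkage factor `(n - 1) / (n - 1 + shrinkage)` that the Surprise documentation gives for `pearson_baseline`, where `n` is the number of common raters. It is a plain-Python illustration, not the library's code: `shrunk_similarity` is a hypothetical helper, the raw correlation of 0.8 and the `min_support` value of 5 are arbitrary, and the fallback values passed to `.get()` are assumptions of this sketch.

```python
# Options dictionary in the shape accepted by KNNBaseline(sim_options=...)
sim_options = {
    'name': 'pearson_baseline',
    'user_based': False,   # item-item similarities
    'min_support': 5,      # arbitrary value chosen for this sketch
    'shrinkage': 100,
}

def shrunk_similarity(raw_corr, n_common, options):
    """Shrink a raw correlation toward zero when few raters are shared."""
    if n_common < options.get('min_support', 1):
        return 0.0  # too little overlap: similarity is forced to zero
    shrinkage = options.get('shrinkage', 100)
    return (n_common - 1) / (n_common - 1 + shrinkage) * raw_corr

# The same raw correlation, backed by more and more common raters
for n_common in (3, 11, 101, 1001):
    print(n_common, round(shrunk_similarity(0.8, n_common, sim_options), 4))
```

With few common raters the similarity is pulled strongly toward zero, so a high correlation based on a handful of shared ratings cannot dominate the neighbor list.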

### Item-based collaborative filtering

In [22]:
sim_options = {'name': 'pearson_baseline', 'user_based': False}
# Basic collaborative filtering; item-based because 'user_based' is False
item_baesd = KNNBaseline(sim_options=sim_options)
item_baesd.fit(train_set)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x1f17b4c2128>

In [34]:
# Build both mappings (raw id -> title and title -> raw id); the id conversion needs both directions
rid_2_name, name_2_rid = read_iter_names()

In [35]:
# Map the raw id to the inner id
toy_story_raw_id = name_2_rid['Toy Story (1995)']
toy_story_inner_id = item_baesd.trainset.to_inner_iid(toy_story_raw_id)

In [36]:
# Use Toy Story's inner id to get its k nearest neighbors; the neighbors are inner ids too
toy_story_neighbors = item_baesd.get_neighbors(toy_story_inner_id, k=10)


In [37]:
# Convert the neighbors' inner ids to titles (via raw ids)
toy_story_neighbors = (item_baesd.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_2_name[rid] for rid in toy_story_neighbors)

In [38]:
print('The 10 movies nearest to Toy Story (item-based, pearson_baseline similarity):\n')
for movie in toy_story_neighbors:
    print(movie)

The 10 movies nearest to Forrest Gump (item-based, pearson_baseline similarity):

Field of Dreams (1989)
Firm, The (1993)
It's a Wonderful Life (1946)
Braveheart (1995)
Dances with Wolves (1990)
Shawshank Redemption, The (1994)
Top Gun (1986)
In the Line of Fire (1993)
Fugitive, The (1993)
Pinocchio (1940)


### User-based collaborative filtering

In [40]:
sim_options = {'name': 'pearson_baseline'}
user_based = KNNBaseline(sim_options=sim_options)
user_based.fit(train_set)

rid_2_name, name_2_rid = read_iter_names()


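# Map the raw id to the inner id
toy_story_raw_id = name_2_rid['Toy Story (1995)']
toy_story_inner_id = user_based.trainset.to_inner_iid(toy_story_raw_id)

# Get the k nearest neighbors of that inner id; the neighbors are inner ids too.
# Caveat: with 'user_based' left at its default (True) the similarity matrix is
# computed between users, so get_neighbors() treats its argument as a user inner
# id and returns user inner ids. Converting them with to_raw_iid() below reads
# them as item ids, so the resulting titles are not true neighbors of Toy Story.
toy_story_neighbors = user_based.get_neighbors(toy_story_inner_id, k=10)

# Convert the neighbors' inner ids to titles (via raw ids)
toy_story_neighbors = (user_based.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_2_name[rid] for rid in toy_story_neighbors)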

print('The 10 movies nearest to Toy Story (user-based, pearson_baseline similarity):\n')
for movie in toy_story_neighbors:
    print(movie)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
The 10 movies nearest to Toy Story (user-based, pearson_baseline similarity):

Matilda (1996)
Striking Distance (1993)
My Fellow Americans (1996)
Relic, The (1997)
Under Siege (1992)
House of Yes, The (1997)
Foreign Correspondent (1940)
Last Supper, The (1995)
Get Shorty (1995)
Basic Instinct (1992)
