## Surprise: a Python recommender system library

![](./Surprise.png)

To build the recommender models we use the Python library [Surprise(Simple Python RecommendatIon System Engine)](https://github.com/NicolasHug/Surprise), one of the SciKits (many readers will already have used scikit-learn or scikit-image from the same family).

### Easy to use, with support for many recommendation algorithms:
* [Baseline algorithms](http://surprise.readthedocs.io/en/stable/basic_algorithms.html)
* [Neighborhood methods (collaborative filtering)](http://surprise.readthedocs.io/en/stable/knn_inspired.html)
* [Matrix factorization-based methods (SVD, PMF, SVD++, NMF)](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)

| Algorithm class | Description |
| ------------- |:-----|
|[random_pred.NormalPredictor](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.random_pred.NormalPredictor)|Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.|
|[baseline_only.BaselineOnly](http://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.baseline_only.BaselineOnly)|Algorithm predicting the baseline estimate for given user and item.|
|[knns.KNNBasic](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBasic)|A basic collaborative filtering algorithm.|
|[knns.KNNWithMeans](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans)|A basic collaborative filtering algorithm, taking into account the mean ratings of each user.|
|[knns.KNNBaseline](http://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNBaseline)|A basic collaborative filtering algorithm taking into account a baseline rating.|
|[matrix_factorization.SVD](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)|The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.|
|[matrix_factorization.SVDpp](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVDpp)|The SVD++ algorithm, an extension of SVD taking into account implicit ratings.|
|[matrix_factorization.NMF](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF)|A collaborative filtering algorithm based on Non-negative Matrix Factorization.|
|[slope_one.SlopeOne](http://surprise.readthedocs.io/en/stable/slope_one.html#surprise.prediction_algorithms.slope_one.SlopeOne)|A simple yet accurate collaborative filtering algorithm.|
|[co_clustering.CoClustering](http://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering)|A collaborative filtering algorithm based on co-clustering.|

### The neighborhood methods (collaborative filtering) can use different similarity measures

| Similarity measure | Description |
| ------------- |:-----|
|[cosine](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.cosine)|Compute the cosine similarity between all pairs of users (or items).|
|[msd](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.msd)|Compute the Mean Squared Difference similarity between all pairs of users (or items).|
|[pearson](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson)|Compute the Pearson correlation coefficient between all pairs of users (or items).|
|[pearson_baseline](http://surprise.readthedocs.io/en/stable/similarities.html#surprise.similarities.pearson_baseline)|Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.|
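
To make the table more concrete, here is a minimal pure-Python sketch of two of these measures, cosine and MSD, for a single pair of users, computed over the items both users have rated. The helper names and the toy ratings are assumptions for illustration only; Surprise's own implementations work on the whole training set at once and offer extra options.

```python
import math


def common_items(ratings_u, ratings_v):
    """Items rated by both users, sorted for reproducibility."""
    return sorted(set(ratings_u) & set(ratings_v))


def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity over co-rated items; 0.0 when it is undefined."""
    items = common_items(ratings_u, ratings_v)
    dot = sum(ratings_u[i] * ratings_v[i] for i in items)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in items))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in items))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def msd_sim(ratings_u, ratings_v):
    """1 / (MSD + 1), where MSD is the mean squared rating difference."""
    items = common_items(ratings_u, ratings_v)
    if not items:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in items) / float(len(items))
    return 1.0 / (msd + 1.0)


# Toy ratings (made up): user -> {item: rating}
alice = {"m1": 5.0, "m2": 3.0, "m3": 4.0}
bob = {"m1": 4.0, "m2": 3.0, "m4": 2.0}
print(cosine_sim(alice, bob), msd_sim(alice, bob))
```

Both functions only look at co-rated items, which is why two users with a single item in common can look deceptively similar.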

### Jaccard similarity
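Jaccard similarity is the size of the intersection divided by the size of the union.
This is the similarity currently used for our data, because the ratings only take the values 0 and 1.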
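
A minimal sketch of this definition in plain Python follows. The function name and the two toy playlists are made up for illustration; the similarity table above does not list a Jaccard option, so this is a standalone helper rather than a Surprise call.

```python
def jaccard_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets: define the similarity as 0
    return len(set_a & set_b) / float(len(union))


# Two toy playlists, each described by the songs it contains (rating 1 = present)
playlist_a = ["song1", "song2", "song3"]
playlist_b = ["song2", "song3", "song4", "song5"]
print(jaccard_similarity(playlist_a, playlist_b))  # 2 shared songs out of 5 distinct ones
```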

### Several evaluation metrics are supported
| Metric | Description |
| ------------- |:-----|
|[rmse](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.rmse)|Compute RMSE (Root Mean Squared Error).|
|[mae](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.mae)|Compute MAE (Mean Absolute Error).|
|[fcp](http://surprise.readthedocs.io/en/stable/accuracy.html#surprise.accuracy.fcp)|Compute FCP (Fraction of Concordant Pairs).|
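
For reference, RMSE and MAE can be written in a few lines of plain Python. This is only a sketch of the definitions over made-up (true rating, estimated rating) pairs, not Surprise's `accuracy` module.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / float(len(pairs)))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / float(len(pairs))


# Made-up predictions: (true rating, estimated rating)
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]
print(rmse(pairs), mae(pairs))
```

RMSE squares each error before averaging, so it penalizes large errors more heavily than MAE and is never smaller than MAE on the same predictions.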

### Usage examples

#### Basic usage

```python
# Any of the recommendation algorithms listed above can be used here
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf

# Load the built-in MovieLens 100k dataset
data = Dataset.load_builtin('ml-100k')
# k-fold cross-validation (k=3)
data.split(n_folds=3)
# Try SVD matrix factorization
algo = SVD()
# Evaluate it on the dataset
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
# Print the results
print_perf(perf)
```
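
The `evaluate`, `print_perf` and `data.split` helpers used here belong to older Surprise releases. As far as we know, releases from 1.1 onward replace them with `surprise.model_selection.cross_validate`. The sketch below assumes such a release is installed; the wrapper function `cross_validate_svd` is our own name, and the imports are placed inside it so that nothing is loaded or downloaded until it is called.

```python
def cross_validate_svd(n_folds=3):
    """Run k-fold cross-validation of SVD on MovieLens 100k (newer Surprise API)."""
    # Imported lazily: these lines need the scikit-surprise package.
    from surprise import SVD, Dataset
    from surprise.model_selection import cross_validate

    data = Dataset.load_builtin('ml-100k')  # may ask to download the dataset
    results = cross_validate(SVD(), data, measures=['RMSE', 'MAE'],
                             cv=n_folds, verbose=True)
    # results is a dict; 'test_rmse' and 'test_mae' hold one score per fold
    return results
```

Calling `cross_validate_svd()` prints a per-fold table, which plays the role of `print_perf` in the older API.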

In [1]:
from surprise import SVD,KNNWithMeans
from surprise import Dataset
from surprise import evaluate, print_perf

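# Load the built-in MovieLens 100k dataset
data = Dataset.load_builtin('ml-100k')
# k-fold cross-validation (k=3)
data.split(n_folds=3)
# Try KNNWithMeans (collaborative filtering with user means)
algo = KNNWithMeans()
# Evaluate it on the dataset
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
# Print the results
print_perf(perf)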



Evaluating RMSE, MAE of algorithm KNNWithMeans.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9540
MAE:  0.7503
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9625
MAE:  0.7583
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9538
MAE:  0.7524
------------
------------
Mean RMSE: 0.9568
Mean MAE : 0.7536
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
MAE     0.7503  0.7583  0.7524  0.7536  
RMSE    0.9540  0.9625  0.9538  0.9568  
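
The three folds in this output come from k-fold splitting: the ratings are shuffled, cut into k parts, and each part serves once as the test set while the rest is used for training. A minimal sketch of the idea, working on rating indices only (the function name and the seed are assumptions, and Surprise's own splitting code may differ in detail):

```python
import random


def k_fold_indices(n_ratings, n_folds=3, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    folds = []
    for k in range(n_folds):
        test = indices[k::n_folds]          # every n_folds-th shuffled index
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        folds.append((train, test))
    return folds


folds = k_fold_indices(10, n_folds=3)
for train, test in folds:
    print(len(train), len(test))
```

Every rating lands in exactly one test set, so the mean over folds uses each rating once for testing.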


#### Loading your own dataset

```python
import os
from surprise import Dataset, Reader

# Path to the ratings file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# Tell the reader how each line of the file is laid out
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data
data = Dataset.load_from_file(file_path, reader=reader)
# Split manually into 5 folds (for cross-validation)
data.split(n_folds=5)
```
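
The `line_format` string simply names the fields of each line in order, and `sep` is the separator between them. The following sketch mimics that parsing in plain Python to show what the reader is being told; `parse_rating_line` is a hypothetical helper, not part of Surprise, and the sample line is made up in the layout of `u.data`.

```python
def parse_rating_line(line, line_format='user item rating timestamp', sep='\t'):
    """Split one line into named fields and convert the rating to a float."""
    fields = line_format.split()
    values = line.strip().split(sep)
    record = dict(zip(fields, values))
    record['rating'] = float(record['rating'])
    return record


# One tab-separated line: user, item, rating, timestamp
record = parse_rating_line('196\t242\t3\t881250949')
print(record)
```

For a comma-separated file such as the music data used later, only `sep=','` changes.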

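#### Hyperparameter tuning (to get better recommendations)

The algorithms implemented here are trained with methods such as SGD, so they have hyperparameters that affect the final result. As with scikit-learn, we can use grid search with cross-validation (GridSearchCV in scikit-learn) to select the best parameters. A simple example: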

```python
from surprise import SVD, Dataset, GridSearch

# Define the grid of parameters to search
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],  # lr_all = learning rate
              'reg_all': [0.4, 0.6]}  # reg_all = regularization strength; n_epochs = number of epochs
# Grid search with cross-validation
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
# Find the best parameters on the dataset
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
grid_search.evaluate(data)

# Best RMSE score
print(grid_search.best_score['RMSE'])
# >>> 0.96117566386

# Parameters that gave the best RMSE
print(grid_search.best_params['RMSE'])
# >>> {'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}

# Best FCP score
print(grid_search.best_score['FCP'])
# >>> 0.702279736531

# Parameters that gave the best FCP
print(grid_search.best_params['FCP'])
# >>> {'reg_all': 0.6, 'lr_all': 0.005, 'n_epochs': 10}
```
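
Under the hood, a grid search enumerates every combination of the listed values and keeps the one with the best cross-validated score. The sketch below shows that enumeration in plain Python; `expand_grid` and `toy_rmse` are hypothetical names, and `toy_rmse` is a made-up stand-in for a real cross-validated RMSE. Newer Surprise releases expose the same idea as `surprise.model_selection.GridSearchCV`.

```python
import itertools


def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = sorted(param_grid)
    combos = itertools.product(*(param_grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def toy_rmse(params):
    """Hypothetical score, NOT a real evaluation: lower is better."""
    return params['reg_all'] + 1.0 / params['n_epochs']


param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}
combinations = expand_grid(param_grid)
best = min(combinations, key=toy_rmse)
print(len(combinations), best)
```

The number of combinations is the product of the list lengths (2 x 2 x 2 here), and each one is trained once per fold, so the cost of a grid search grows quickly with the grid size.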

## Training models on our dataset

## Building and saving models

### 1. Building a model with collaborative filtering and making predictions

#### 1.1 MovieLens example

In [4]:
# Any of the recommendation algorithms listed above can be used here
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import evaluate, print_perf

# Load the built-in MovieLens 100k dataset
data = Dataset.load_builtin('ml-100k')
# k-fold cross-validation (k=3)
data.split(n_folds=3)
# Try KNNWithMeans (collaborative filtering with user means)
algo = KNNWithMeans()
# Evaluate it on the dataset
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
# Print the results
print_perf(perf)



Evaluating RMSE, MAE of algorithm KNNWithMeans.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9636
MAE:  0.7580
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9511
MAE:  0.7500
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9533
MAE:  0.7528
------------
------------
Mean RMSE: 0.9560
Mean MAE : 0.7536
------------
------------
        Fold 1  Fold 2  Fold 3  Mean    
MAE     0.7580  0.7500  0.7528  0.7536  
RMSE    0.9636  0.9511  0.9533  0.9560  


In [5]:
data.raw_ratings[0]

(u'875', u'179', 5.0, u'876465188')

In [6]:
data.raw_ratings[1]  # each entry is (user, item, rating, timestamp); here user 770 gave movie 111 a rating of 5.0

(u'770', u'111', 5.0, u'875972059')

In [18]:
"""
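The following snippet shows how, once a collaborative filtering model has been trained, to retrieve the items most similar to a given item. It relies mainly on algo.get_neighbors().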
"""

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
import os
import io

from surprise import KNNBaseline
from surprise import Dataset


def read_item_names():
    """
    Build the movie id -> movie name and movie name -> movie id mappings.
    """

    file_name = (os.path.expanduser('~') +
                 '/.surprise_data/ml-100k/ml-100k/u.item')
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:  # encoding specified by the official example
        for line in f:
            print("****line is *** :",line)  # debug: echo every raw line
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
            #print(line)

    return rid_to_name, name_to_rid


# First, train the algorithm so that it computes the similarities between items
data     = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False} # item-based collaborative filtering (user_based=False) with the pearson_baseline similarity
algo        = KNNBaseline(sim_options=sim_options)
algo.train(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x4315e10>

In [19]:
# Build the movie id -> movie name and movie name -> movie id mappings
rid_to_name, name_to_rid = read_item_names()


****line is *** : 1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

****line is *** : 2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

****line is *** : 3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

****line is *** : 4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0

****line is *** : 5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

****line is *** : 6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0

****line is *** : 7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkey


****line is *** : 574|Boxing Helena (1993)|01-Jan-1993||http://us.imdb.com/M/title-exact?Boxing%20Helena%20(1993)|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|0|1|0|0

****line is *** : 575|City Slickers II: The Legend of Curly's Gold (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?City%20Slickers%20II:%20The%20Legend%20of%20Curly's%20Gold%20(1994)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|1

****line is *** : 576|Cliffhanger (1993)|01-Jan-1993||http://us.imdb.com/M/title-exact?Cliffhanger%20(1993)|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0

****line is *** : 577|Coneheads (1993)|01-Jan-1993||http://us.imdb.com/M/title-exact?Coneheads%20(1993)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|1|0|0|0

****line is *** : 578|Demolition Man (1993)|01-Jan-1993||http://us.imdb.com/M/title-exact?Demolition%20Man%20(1993)|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0

****line is *** : 579|Fatal Instinct (1993)|01-Jan-1993||http://us.imdb.com/M/title-exact?Fatal%20Instinct%20(1993)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0

****line is *** : 5

****line is *** : 649|Once Upon a Time in America (1984)|01-Jan-1984||http://us.imdb.com/M/title-exact?Once%20Upon%20a%20Time%20in%20America%20(1984)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

****line is *** : 650|Seventh Seal, The (Sjunde inseglet, Det) (1957)|01-Jan-1957||http://us.imdb.com/M/title-exact?Sjunde%20inseglet,%20Det%20(1957)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0

****line is *** : 651|Glory (1989)|01-Jan-1989||http://us.imdb.com/M/title-exact?Glory%20(1989)|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0

****line is *** : 652|Rosencrantz and Guildenstern Are Dead (1990)|01-Jan-1990||http://us.imdb.com/M/title-exact?Rosencrantz%20and%20Guildenstern%20Are%20Dead%20(1990)|0|0|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0

****line is *** : 653|Touch of Evil (1958)|01-Jan-1958||http://us.imdb.com/M/title-exact?Touch%20of%20Evil%20(1958)|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|1|0|0

****line is *** : 654|Chinatown (1974)|01-Jan-1974||http://us.imdb.com/M/title-exact?Chinatown%20(1974)|0|0|0|0|0|0|0|0|0|0|1

****line is *** : 1073|Shallow Grave (1994)|01-Jan-1994||http://us.imdb.com/Title?Shallow+Grave+(1994)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

****line is *** : 1074|Reality Bites (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Reality%20Bites%20(1994)|0|0|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0

****line is *** : 1075|Man of No Importance, A (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Man%20of%20No%20Importance,%20A%20(1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0

****line is *** : 1076|Pagemaster, The (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Pagemaster,%20The%20(1994)|0|1|1|1|1|0|0|0|0|1|0|0|0|0|0|0|0|0|0

****line is *** : 1077|Love and a .45 (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Love%20and%20a%20.45%20(1994)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

****line is *** : 1078|Oliver & Company (1988)|29-Mar-1988||http://us.imdb.com/M/title-exact?Oliver%20&%20Company%20(1988)|0|0|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0

****line is *** : 1079|Joe's Apartment (1996

In [20]:
rid_to_name

{u'344': u'Apostle, The (1997)',
 u'345': u'Deconstructing Harry (1997)',
 u'346': u'Jackie Brown (1997)',
 u'347': u'Wag the Dog (1997)',
 u'340': u'Boogie Nights (1997)',
 u'341': u'Critical Care (1997)',
 u'342': u'Man Who Knew Too Little, The (1997)',
 u'343': u'Alien: Resurrection (1997)',
 u'348': u'Desperate Measures (1998)',
 u'349': u'Hard Rain (1998)',
 u'1653': u'Entertaining Angels: The Dorothy Day Story (1996)',
 u'298': u'Face/Off (1997)',
 u'299': u'Hoodlum (1997)',
 u'296': u'Promesse, La (1996)',
 u'297': u"Ulee's Gold (1997)",
 u'294': u'Liar Liar (1997)',
 u'295': u'Breakdown (1997)',
 u'292': u'Rosewood (1997)',
 u'293': u'Donnie Brasco (1997)',
 u'290': u'Fierce Creatures (1997)',
 u'291': u'Absolute Power (1997)',
 u'270': u'Gattaca (1997)',
 u'271': u'Starship Troopers (1997)',
 u'272': u'Good Will Hunting (1997)',
 u'273': u'Heat (1995)',
 u'274': u'Sabrina (1995)',
 u'275': u'Sense and Sensibility (1995)',
 u'276': u'Leaving Las Vegas (1995)',
 u'277': u'Restor

In [9]:
# Look up the item id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)'] # Toy Story (1995) is the movie title
toy_story_raw_id  # raw movie id

u'1'

In [11]:
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
toy_story_inner_id  # inner (internal) movie id

24
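
Raw ids are the identifiers found in the data file, while inner ids are consecutive integers that the training set assigns for internal use, which is why raw id `'1'` maps to inner id 24 here. A minimal sketch of such a mapping, assuming inner ids are handed out in order of first appearance (the function name and the toy id list are made up):

```python
def build_id_maps(raw_ids):
    """Assign consecutive inner ids to raw ids in order of first appearance."""
    raw_to_inner = {}
    for raw_id in raw_ids:
        if raw_id not in raw_to_inner:
            raw_to_inner[raw_id] = len(raw_to_inner)
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Item ids in the order they appear in a toy ratings file
raw_item_ids = ['242', '302', '1', '242', '51', '1']
raw_to_inner, inner_to_raw = build_id_maps(raw_item_ids)
print(raw_to_inner['1'], inner_to_raw[2])
```

In Surprise the two directions correspond to `to_inner_iid` / `to_raw_iid` for items and `to_inner_uid` / `to_raw_uid` for users, as used in the cells of this section.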

In [11]:
# Find the 10 nearest neighbors
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
toy_story_neighbors

[433, 101, 302, 309, 971, 95, 26, 561, 816, 347]

In [12]:
# Map the neighbor ids back to movie titles
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors) # convert to raw ids
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors) # look up the titles

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)


The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
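
Conceptually, finding the nearest neighbors of an item means reading its row of the similarity matrix and keeping the k largest entries, excluding the item itself. A minimal sketch on a made-up 4x4 similarity matrix (the function name is hypothetical, and this is not Surprise's actual code):

```python
import heapq


def k_nearest(sim_matrix, inner_id, k=10):
    """Return the inner ids of the k most similar items, most similar first."""
    others = [(sim, j) for j, sim in enumerate(sim_matrix[inner_id]) if j != inner_id]
    top = heapq.nlargest(k, others, key=lambda pair: pair[0])
    return [j for _, j in top]


# Toy symmetric similarity matrix between 4 items
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.7, 0.3, 1.0],
]
print(k_nearest(sim, 0, k=2))
```

The returned values are inner ids, so they still have to be mapped back to raw ids and then to titles, exactly as the cell above does.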


#### 1.2 Music prediction example

In [23]:
from __future__ import (absolute_import, division, print_function, unicode_literals)
import os
import io

from surprise import KNNBaseline, Reader
from surprise import Dataset

import cPickle as pickle
# Rebuild the playlist id -> playlist name dictionary
id_name_dic = pickle.load(open("popular_playlist.pkl","rb"))  # playlist ids and playlist info saved in part one
print("Loaded the playlist id -> playlist name dictionary...")
# Rebuild the playlist name -> playlist id dictionary
name_id_dic = {}
for playlist_id in id_name_dic:
    name_id_dic[id_name_dic[playlist_id]] = playlist_id
print("Loaded the playlist name -> playlist id dictionary...")


file_path = os.path.expanduser('./popular_music_suprise_format.txt')
# Describe the file format
reader = Reader(line_format='user item rating timestamp', sep=',')  # layout of popular_music_suprise_format.txt
# Read the data from the file
music_data = Dataset.load_from_file(file_path, reader=reader)
# Build the training set used to compute song-to-song similarities
print("Building the dataset...")
trainset = music_data.build_full_trainset()
#sim_options = {'name': 'pearson_baseline', 'user_based': False}

Loaded the playlist id -> playlist name dictionary...
Loaded the playlist name -> playlist id dictionary...
Building the dataset...
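
One caveat when inverting an id -> name dictionary as above: if two playlists share the same name, the later id silently overwrites the earlier one. A hedged sketch of an inversion that keeps every id, using made-up ids and names (`invert_mapping` is a hypothetical helper):

```python
def invert_mapping(id_to_name):
    """Invert id -> name into name -> list of ids, keeping duplicates."""
    name_to_ids = {}
    for item_id, name in id_to_name.items():
        name_to_ids.setdefault(name, []).append(item_id)
    return name_to_ids


# Made-up playlists: two of them share a name
toy_id_name = {'101': 'Road trip', '102': 'Late night jazz', '103': 'Road trip'}
toy_name_ids = invert_mapping(toy_id_name)
print(sorted(toy_name_ids['Road trip']))
```

Whether this matters depends on the data; if names are known to be unique, the plain loop above is sufficient.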


In [13]:
id_name_dic.keys()  # playlist ids (the keys of the id -> name dictionary)

['326644112',
 '374641035',
 '361197245',
 '79431768',
 '323144392',
 '135321902',
 '374012053',
 '467479257',
 '325440365',
 '137780248',
 '112540048',
 '43170093',
 '81889122',
 '141336913',
 '86559374',
 '131382061',
 '365123743',
 '120213287',
 '443881495',
 '75421929',
 '419935738',
 '101847979',
 '72360096',
 '83593528',
 '51552537',
 '366713739',
 '90930925',
 '707283621',
 '150297195',
 '163870337',
 '705368095',
 '412591020',
 '127782748',
 '167137598',
 '10311101',
 '119464204',
 '92187045',
 '18797764',
 '483873622',
 '80399618',
 '391358710',
 '54466374',
 '99328051',
 '389643724',
 '132518164',
 '313385807',
 '76347596',
 '367000697',
 '80631439',
 '106812968',
 '75247911',
 '369479222',
 '92509527',
 '69342695',
 '108668067',
 '636682363',
 '363974862',
 '392991828',
 '127781940',
 '3913771',
 '84056213',
 '58905445',
 '8617851',
 '40358497',
 '51685626',
 '17591484',
 '138932419',
 '65984144',
 '100329019',
 '39556861',
 '89117406',
 '325911034',
 '484199147',
 '48742458

In [26]:
print (id_name_dic[id_name_dic.keys()[2] ])

100种深情皆苦 | 你又不知道我难过


In [27]:
id_name_dic[id_name_dic.keys()[2] ]

'100\xe7\xa7\x8d\xe6\xb7\xb1\xe6\x83\x85\xe7\x9a\x86\xe8\x8b\xa6 | \xe4\xbd\xa0\xe5\x8f\x88\xe4\xb8\x8d\xe7\x9f\xa5\xe9\x81\x93\xe6\x88\x91\xe9\x9a\xbe\xe8\xbf\x87'
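
The cells above index `id_name_dic.keys()` directly, which works on Python 2, where `keys()` returns a list. On Python 3 it returns a view that does not support indexing. A minimal sketch of a form that runs on both, using a made-up dictionary and sorting the keys so that the result does not depend on dictionary order:

```python
# Made-up playlist dictionary: id -> name
playlists = {'101': 'Morning run', '102': 'Late night jazz', '103': 'Road trip'}

# Materialize and sort the keys before indexing
playlist_ids = sorted(playlists.keys())
third_id = playlist_ids[2]
print(third_id, playlists[third_id])
```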

In [19]:
trainset.n_items  # number of items (songs) in the training set

50539

In [20]:
trainset.n_users

1076

#### 1.2.1 Template: finding the nearest users (here, playlists)

In [28]:
print("Training the model...")
#sim_options = {'user_based': False}
#algo = KNNBaseline(sim_options=sim_options)
algo = KNNBaseline()  # KNNBaseline uses user-based collaborative filtering by default
algo.train(trainset)

current_playlist = name_id_dic.keys()[39]
print("Playlist name", current_playlist)

# Look up the neighbors
# Map the name to its id
playlist_id = name_id_dic[current_playlist]
print("Playlist id", playlist_id)
# Get the corresponding inner user id => to_inner_uid
playlist_inner_id = algo.trainset.to_inner_uid(playlist_id)
print("Inner id", playlist_inner_id)

playlist_neighbors = algo.get_neighbors(playlist_inner_id, k=10)

# Convert the neighbor playlist ids to playlist names
# Map back to raw ids with to_raw_uid
playlist_neighbors = (algo.trainset.to_raw_uid(inner_id)
                       for inner_id in playlist_neighbors)
playlist_neighbors = (id_name_dic[playlist_id]
                       for playlist_id in playlist_neighbors)

print()
print("The 10 playlists closest to 《", current_playlist, "》 are:\n")
for playlist in playlist_neighbors:
    print(playlist, algo.trainset.to_inner_uid(name_id_dic[playlist]))

Training the model...
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Playlist name 世事无常，唯愿你好
Playlist id 306948578
Inner id 427

The 10 playlists closest to 《 世事无常，唯愿你好 》 are:

【华语】暖心物语 纯白思念 3
暗暗作祟| 不甘朋友不敢恋人 15
专属你的周杰伦 18
「华语歌曲」 23
[小风收集]21世纪年轻人的音乐 24
十七岁那年，以为能和你永远 28
热门流行华语歌曲50首 31
最易上手吉他弹唱超精选 40
打开任意门，就有对的人 42
行车路上，一曲长歌 45


#### 1.2.2 Template: predicting for a user

In [29]:
import cPickle as pickle
# Rebuild the song id -> song name dictionary
song_id_name_dic = pickle.load(open("popular_song.pkl","rb"))
print("Loaded the song id -> song name dictionary...")
# Rebuild the song name -> song id dictionary
song_name_id_dic = {}
for song_id in song_id_name_dic:
    song_name_id_dic[song_id_name_dic[song_id]] = song_id
print("Loaded the song name -> song id dictionary...")

Loaded the song id -> song name dictionary...
Loaded the song name -> song id dictionary...


In [30]:
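# The user with inner id 4
user_inner_id = 4
user_rating = trainset.ur[user_inner_id]
items = map(lambda x:x[0], user_rating)
# Note: predict() expects raw ids. Inner ids are passed here, so they are most likely treated as unknown raw ids;
# the matrix factorization cell in section 2 converts them with to_raw_uid / to_raw_iid first.
for song in items:
    print(algo.predict(user_inner_id, song, r_ui=1), song_id_name_dic[algo.trainset.to_raw_iid(song)])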

user: 4          item: 478        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 听见下雨的声音	魏如昀
user: 4          item: 429        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 梦一场	萧敬腾
user: 4          item: 936        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 干杯	西瓜Kune
user: 4          item: 937        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 给自己的歌 (Live) - live	纵贯线
user: 4          item: 938        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 小半	陈粒
user: 4          item: 939        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 思念是一种病(Live) - live	张震岳
user: 4          item: 940        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 可以不可以	丁当
user: 4          item: 941        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 秋酿	房东的猫
user: 4          item: 616        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 退后	周杰伦
user: 4          item: 942        r_ui = 1.00   est = 1.00   {u'was_impossible': False} 阴天	莫文蔚
user:
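
Every `est` in this output equals 1.00. One plausible reason, given that every rating in this dataset is 1, is that any average of such ratings (a similarity-weighted mean over neighbors, or the global mean) is itself 1. The sketch below illustrates this with a plain weighted-average predictor; it is a simplified stand-in under that assumption, not the actual KNNBaseline formula, and the neighbor lists are made up.

```python
def knn_predict(neighbors):
    """Similarity-weighted mean of neighbor ratings.

    neighbors: list of (similarity, rating) pairs with positive similarity.
    """
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        raise ValueError("no usable neighbors")
    return sum(sim * rating for sim, rating in neighbors) / total_sim


# Implicit-style data: every observed rating is 1
implicit_neighbors = [(0.9, 1.0), (0.4, 1.0), (0.1, 1.0)]
# Explicit ratings on a 1-5 scale, for contrast
explicit_neighbors = [(0.9, 5.0), (0.4, 3.0), (0.1, 1.0)]
print(knn_predict(implicit_neighbors), knn_predict(explicit_neighbors))
```

With constant ratings, RMSE and MAE say little about ranking quality, which is worth keeping in mind when reading the evaluation results for this dataset.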

### 2. Predicting with matrix factorization

In [10]:
### Use NMF
import os
from surprise import NMF, Reader, evaluate
from surprise import Dataset

file_path = os.path.expanduser('./popular_music_suprise_format.txt')
# Describe the file format
reader = Reader(line_format='user item rating timestamp', sep=',')
# Read the data from the file
music_data = Dataset.load_from_file(file_path, reader=reader)
# Build the training set and train the model
algo = NMF()
trainset = music_data.build_full_trainset()
algo.train(trainset)

In [17]:
user_inner_id = 4
user_rating = trainset.ur[user_inner_id]
items = map(lambda x:x[0], user_rating)
for song in items:
    print(algo.predict(algo.trainset.to_raw_uid(user_inner_id), algo.trainset.to_raw_iid(song), r_ui=1), song_id_name_dic[algo.trainset.to_raw_iid(song)])

user: 92509527   item: 27724082   r_ui = 1.00   est = 1.00   {u'was_impossible': False} 听见下雨的声音	魏如昀
user: 92509527   item: 167916     r_ui = 1.00   est = 1.00   {u'was_impossible': False} 梦一场	萧敬腾
user: 92509527   item: 408307325  r_ui = 1.00   est = 1.00   {u'was_impossible': False} 干杯	西瓜Kune
user: 92509527   item: 394618     r_ui = 1.00   est = 1.00   {u'was_impossible': False} 给自己的歌 (Live) - live	纵贯线
user: 92509527   item: 421423806  r_ui = 1.00   est = 1.00   {u'was_impossible': False} 小半	陈粒
user: 92509527   item: 394485     r_ui = 1.00   est = 1.00   {u'was_impossible': False} 思念是一种病(Live) - live	张震岳
user: 92509527   item: 5239563    r_ui = 1.00   est = 1.00   {u'was_impossible': False} 可以不可以	丁当
user: 92509527   item: 30635613   r_ui = 1.00   est = 1.00   {u'was_impossible': False} 秋酿	房东的猫
user: 92509527   item: 185884     r_ui = 1.00   est = 1.00   {u'was_impossible': False} 退后	周杰伦
user: 92509527   item: 276936     r_ui = 1.00   est = 1.00   {u'was_impossible': False} 阴天	莫文蔚
user:

## Saving the model

In [31]:
import surprise
surprise.dump.dump('./recommendation.model', algo=algo)  # save the model to disk
# It can be loaded again as follows; load() returns a (predictions, algo) tuple
_, algo = surprise.dump.load('./recommendation.model')  # reload the model
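
To our understanding, `surprise.dump` is a thin wrapper around Python's `pickle`. The sketch below shows the same save and reload round trip with `pickle` directly, using a plain dictionary as a stand-in for a trained model; the helper names, the stand-in model and the temporary file path are all assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_model(path, model):
    """Serialize the model object to disk."""
    with open(path, 'wb') as f:
        pickle.dump({'algo': model}, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load the model object back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)['algo']


# A stand-in "model": in practice this would be the trained algorithm object
toy_model = {'global_mean': 3.5, 'item_bias': {'242': 0.3, '302': -0.1}}

model_path = os.path.join(tempfile.mkdtemp(), 'recommendation.model')
save_model(model_path, toy_model)
restored = load_model(model_path)
print(restored == toy_model)
```

As with any pickle file, a saved model should only be loaded from a trusted source, and it may not load under a different library version.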

## Evaluating different recommendation algorithms

### Load the data first

In [32]:
import os
from surprise import Reader, Dataset
# Path to the data file
file_path = os.path.expanduser('./popular_music_suprise_format.txt')
# Describe the file format
reader = Reader(line_format='user item rating timestamp', sep=',')
# Read the data from the file
music_data = Dataset.load_from_file(file_path, reader=reader)
# Split into 5 folds
music_data.split(n_folds=5)

In [None]:
music_data

In [None]:
music_data.raw_ratings[:20]

In [None]:
### Use NormalPredictor
from surprise import NormalPredictor, evaluate
algo = NormalPredictor()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use BaselineOnly
from surprise import BaselineOnly, evaluate
algo = BaselineOnly()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use basic collaborative filtering (KNNBasic)
from surprise import KNNBasic, evaluate
algo = KNNBasic()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use collaborative filtering with means (KNNWithMeans)
from surprise import KNNWithMeans, evaluate
algo = KNNWithMeans()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use collaborative filtering with baselines (KNNBaseline)
from surprise import KNNBaseline, evaluate
algo = KNNBaseline()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use SVD
from surprise import SVD, evaluate
algo = SVD()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use SVD++
from surprise import SVDpp, evaluate
algo = SVDpp()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])

In [None]:
### Use NMF
from surprise import NMF, evaluate, print_perf
algo = NMF()
perf = evaluate(algo, music_data, measures=['RMSE', 'MAE'])
print_perf(perf)
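
Once every algorithm has been evaluated, the per-fold scores can be collected and ranked by their mean. A minimal sketch follows; `rank_by_mean` is a hypothetical helper, the KNNWithMeans scores are the MovieLens fold RMSEs printed earlier in this document, and the second entry is made up purely to have something to compare against.

```python
def rank_by_mean(results):
    """Return (name, mean score) pairs sorted from lowest to highest mean."""
    means = [(sum(scores) / float(len(scores)), name)
             for name, scores in results.items()]
    return [(name, mean) for mean, name in sorted(means)]


fold_rmse = {
    'KNNWithMeans': [0.9540, 0.9625, 0.9538],    # fold RMSEs from the earlier run
    'hypothetical_algo': [1.02, 1.01, 1.03],     # made-up scores for illustration
}
for name, mean in rank_by_mean(fold_rmse):
    print('%-20s %.4f' % (name, mean))
```

Lower is better for RMSE and MAE, so the first row is the best performer on the chosen metric.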