# Building a Recommender System with Spark

by [@寒小阳](http://blog.csdn.net/han_xiaoyang) hanxiaoyang.ml@gmail.com

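This notebook uses the [MovieLens dataset](http://grouplens.org/datasets/movielens/) as an example to show how to build a recommender system with [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering), using [Spark's Alternating Least Squares algorithm](https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html).

To keep things clear, the content is organized into two parts:
* Part 1 covers how to parse the data, once we have it, into suitable Spark RDDs
* Part 2 covers how to build the recommender model

A further note: the reason for using the MovieLens dataset is simple. It is a widely recognized research dataset for recommender systems. If you look at today's open-source recommendation engines, many of them are built around this kind of standardized format, and real-world data is often organized in a similar structure to make later processing easier.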

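## 1. Data acquisition and preprocessing

With data-driven methods, the first step is to get the data you need in order. That covers both acquiring the data (its volume and richness directly affect the final results) and preprocessing it.

The processing here consists of the following steps:

- Load and parse the data, and persist it as RDDs
- Build the model and carry out the follow-up steps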

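### Downloading the data files

For background on the MovieLens dataset used in this example, see the official [MovieLens web site](http://movielens.org); the datasets can be downloaded [here](http://grouplens.org/datasets/movielens/).

In this example we use the latest MovieLens datasets, which come in two sizes:

- Small: 100,000 ratings and 2,488 tag applications applied to 8,570 movies by 706 users, last updated April 2015
- Full: 21,000,000 ratings and 470,000 tag applications applied to 27,000 movies by 230,000 users, last updated April 2015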

In [1]:
complete_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest.zip'
small_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

Downloading with Python is a bit like writing a crawler, but you can also just pull the datasets down with wget.

In [2]:
import os

datasets_path = os.path.join('/tmp', 'datasets')

complete_dataset_path = os.path.join(datasets_path, 'ml-latest.zip')
small_dataset_path = os.path.join(datasets_path, 'ml-latest-small.zip')

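Fetch the data

In [3]:
import urllib

# Create the target directory first; urlretrieve fails if it does not exist
if not os.path.exists(datasets_path):
    os.makedirs(datasets_path)

small_f = urllib.urlretrieve(small_dataset_url, small_dataset_path)
complete_f = urllib.urlretrieve(complete_dataset_url, complete_dataset_path)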

Extract the zip files

In [4]:
import zipfile

with zipfile.ZipFile(small_dataset_path, "r") as z:
    z.extractall(datasets_path)

with zipfile.ZipFile(complete_dataset_path, "r") as z:
    z.extractall(datasets_path)
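
The cells above use Python 2's `urllib.urlretrieve`. For Python 3, a possible equivalent is sketched below. The helper name `fetch_and_extract` is hypothetical, and skipping the download when the archive is already on disk is an added assumption, not something the notebook does.

```python
import os
import zipfile
from urllib.request import urlretrieve


def fetch_and_extract(url, dest_dir):
    """Download a zip archive into dest_dir (if missing) and extract it there."""
    # Create the target directory if needed
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    # Only download when the archive is not already present
    if not os.path.exists(archive_path):
        urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path, "r") as z:
        z.extractall(dest_dir)
    return archive_path
```

With the variables defined earlier, the call would look like `fetch_and_extract(small_dataset_url, datasets_path)`.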

### Loading and parsing the datasets

We read each file line by line and build an RDD of parsed records.

In the ratings file (`ratings.csv`), each line has the following format:

`userId,movieId,rating,timestamp`  

In the movies file (`movies.csv`), each line has the following format:

`movieId,title,genres`  

where *genres* has the following format:

`Genre1|Genre2|Genre3...`

In the tags file (`tags.csv`), each line has the following format:

`userId,movieId,tag,timestamp`  

In the links file (`links.csv`), each line has the following format:

`movieId,imdbId,tmdbId`  

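These formats are very regular, so we can use Python's [`split()`](https://docs.python.org/2/library/stdtypes.html#str.split) function to parse the lines and load them into RDDs. For the movies and ratings files we do the following:

* For the ratings file, we produce tuples of the form `(UserID, MovieID, Rating)`, dropping the timestamp since we do not use it for now
* For the movies file, we produce tuples of the form `(MovieID, Title)`, ignoring the genres column for now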
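
The ratings step described above can be sketched in plain Python, without Spark. The function name `parse_rating_line` and the sample lines (including their timestamps) are made up for illustration. Note that the notebook keeps the fields as strings for the small dataset and only converts them to numbers for the full dataset, whereas this sketch converts them right away.

```python
def parse_rating_line(line):
    """Turn 'userId,movieId,rating,timestamp' into (user_id, movie_id, rating)."""
    user_id, movie_id, rating, _timestamp = line.split(",")
    return int(user_id), int(movie_id), float(rating)


header = "userId,movieId,rating,timestamp"
lines = [header, "1,6,2.0,980730861", "1,22,3.0,980731380"]  # made-up sample rows

# Skip the header, then parse every remaining line
ratings = [parse_rating_line(l) for l in lines if l != header]
print(ratings)
```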

First, read the file with Spark and take a quick look

In [3]:
#small_ratings_file = os.path.join(datasets_path, 'ml-latest-small', 'ratings.csv')
small_ratings_file = "file:///tmp/datasets/ml-latest-small/ratings.csv"
small_ratings_raw_data = sc.textFile(small_ratings_file)
small_ratings_raw_data_header = small_ratings_raw_data.take(1)[0]

Parse the data into a new RDD

In [4]:
small_ratings_data = small_ratings_raw_data.filter(lambda line: line!=small_ratings_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (tokens[0],tokens[1],tokens[2])).cache()

To check that the processing is right, we take a few items and look at them

In [5]:
small_ratings_data.take(3)

[(u'1', u'6', u'2.0'), (u'1', u'22', u'3.0'), (u'1', u'32', u'2.0')]

The `movies.csv` file gets much the same treatment

In [6]:
#small_movies_file = os.path.join(datasets_path, 'ml-latest-small', 'movies.csv')
small_movies_file = "file:///tmp/datasets/ml-latest-small/movies.csv"
small_movies_raw_data = sc.textFile(small_movies_file)
small_movies_raw_data_header = small_movies_raw_data.take(1)[0]

small_movies_data = small_movies_raw_data.filter(lambda line: line!=small_movies_raw_data_header)\
.map(lambda line: line.split(",")).map(lambda tokens: (tokens[0],tokens[1])).cache()
    
small_movies_data.take(3)

[(u'1', u'Toy Story (1995)'),
 (u'2', u'Jumanji (1995)'),
 (u'3', u'Grumpier Old Men (1995)')]
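
One caveat with `split(",")`: in `movies.csv`, titles that contain a comma are wrapped in double quotes, so a plain split cuts them in the middle. A possible alternative, sketched here in plain Python, is to parse each line with the standard `csv` module. `parse_movie_line` is a hypothetical helper and the sample line is illustrative.

```python
import csv


def parse_movie_line(line):
    """Turn 'movieId,title,genres' into (movie_id, title), honoring quoted titles."""
    movie_id, title, _genres = next(csv.reader([line]))
    return int(movie_id), title


line = '11,"American President, The (1995)",Comedy|Drama|Romance'
print(line.split(",")[1])      # naive split keeps only the first part of the title
print(parse_movie_line(line))  # the csv module keeps the full title
```

In a Spark job, a function like this could be passed to `map()` in place of the `split(",")` lambda.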

The following sections introduce *Collaborative Filtering* and explain how to use *Spark MLlib* to build a recommender model. We will close the tutorial by explaining how a model such as this is used to make recommendations, and how to persist it for later use (e.g. in our Python/flask web-service).

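## Collaborative filtering

For the theory behind collaborative filtering, please consult the relevant machine learning material; this section gives only a brief overview.

The figure below illustrates what collaborative filtering does in broad terms.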

![collaborative filtering](https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif)

Spark's MLlib also implements [Collaborative Filtering](https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html), using [Alternating Least Squares](http://dl.acm.org/citation.cfm?id=1608614). A brief note on some of its parameters:

- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model
- iterations is the number of iterations to run
- lambda is the regularization strength in ALS
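
To make the roles of `rank`, `iterations`, and `lambda` more concrete, here is a toy, pure-Python sketch of alternating least squares with the rank fixed at 1. It is not Spark's implementation (MLlib runs a blocked, distributed version and may scale the regularization differently); the function names and the tiny rating list are invented for illustration. Each user and each item gets a single latent factor, and the two sets of factors are updated in turn with the closed-form regularized least-squares solution.

```python
def objective(ratings, u, v, lam):
    """Squared error on the known ratings plus an L2 penalty on all factors."""
    err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    reg = lam * (sum(x * x for x in u) + sum(x * x for x in v))
    return err + reg


def als_rank1(ratings, n_users, n_items, iterations=10, lam=0.1):
    """Alternating least squares with one latent factor per user and per item."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor in closed form
        for i in range(n_users):
            rated = [(j, r) for a, j, r in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in rated) / (lam + sum(v[j] ** 2 for j, r in rated))
        # Fix the user factors and solve for each item factor in closed form
        for j in range(n_items):
            rated = [(i, r) for i, b, r in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in rated) / (lam + sum(u[i] ** 2 for i, r in rated))
    return u, v


# (user, item, rating) triples for 3 users and 3 items; some pairs are unobserved
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
u, v = als_rank1(ratings, n_users=3, n_items=3)
print("predicted rating of user 0 for item 2:", u[0] * v[2])
```

Because each half-step exactly minimizes the objective over one set of factors while the other is held fixed, the objective never increases from one iteration to the next. A higher rank replaces each scalar factor with a vector and each division with a small linear solve.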

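## Choosing ALS parameters and training

To get the whole pipeline running first, we start with the small dataset, split it into training, validation, and test sets, and do one full pass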

In [7]:
training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6, 2, 2], seed=0L)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))
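
`randomSplit([6, 2, 2])` normalizes the weights, so the three RDDs receive roughly 60%, 20%, and 20% of the ratings. The plain-Python sketch below mimics that behavior on a list. `random_split` is a hypothetical helper, not a Spark API; it assigns each element independently, so the split sizes are only approximately proportional to the weights.

```python
import random
from itertools import accumulate


def random_split(items, weights, seed=0):
    """Assign each item to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    bounds = [w / total for w in accumulate(weights)]  # cumulative, normalized
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()
        for k, bound in enumerate(bounds):
            if x < bound:
                splits[k].append(item)
                break
        else:
            # Guard against rounding in the last bound
            splits[-1].append(item)
    return splits


training, validation, test = random_split(range(1000), [6, 2, 2], seed=0)
print(len(training), len(validation), len(test))
```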

Start training

In [8]:
from pyspark.mllib.recommendation import ALS
import math

seed = 5L
iterations = 10
regularization_parameter = 0.1
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
tolerance = 0.02

min_error = float('inf')
best_rank = -1
best_iteration = -1
for rank in ranks:
    model = ALS.train(training_RDD, rank, seed=seed, iterations=iterations,
                      lambda_=regularization_parameter)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < min_error:
        min_error = error
        best_rank = rank

print 'The best model was trained with rank %s' % best_rank

For rank 4 the RMSE is 0.963681878574
For rank 8 the RMSE is 0.96250475933
For rank 12 the RMSE is 0.971647563632
The best model was trained with rank 8


Let's take a few of the predictions and look at them before explaining

In [9]:
predictions.take(3)

[((32, 4018), 3.280114696166238),
 ((375, 4018), 2.7365714977314086),
 ((674, 4018), 2.510684514310653)]

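As you can see, each prediction consists of the UserID, the MovieID, and the predicted Rating.

We then join the actual ratings of the validation set with the predicted ratings, which makes the error easy to compute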

In [10]:
rates_and_preds.take(3)

[((558, 788), (3.0, 3.0419325487471403)),
 ((176, 3550), (4.5, 3.3214065001580986)),
 ((302, 3908), (1.0, 2.4728711204440765))]
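
The error computed in the training loop is the root mean squared error over `(actual, predicted)` pairs joined on the `(user, movie)` key. A minimal plain-Python sketch of the same computation follows, with hypothetical names and made-up numbers. As with `join()`, a pair that has no prediction is simply left out of the average.

```python
import math


def rmse(actual, predicted):
    """RMSE over the (user, movie) keys present in both dicts (an inner join)."""
    keys = actual.keys() & predicted.keys()
    squared = [(actual[k] - predicted[k]) ** 2 for k in keys]
    return math.sqrt(sum(squared) / len(squared))


actual = {(1, 10): 3.0, (1, 20): 4.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 2.0, (3, 30): 1.0}  # (2, 10) has no prediction
print(rmse(actual, predicted))
```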

Then we retrain with the best rank and compute the root mean squared error (RMSE) on the test set

In [11]:
model = ALS.train(training_RDD, best_rank, seed=seed, iterations=iterations,
                      lambda_=regularization_parameter)
predictions = model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    
print 'For testing data the RMSE is %s' % (error)

For testing data the RMSE is 0.972342381898


## Building the model on the full dataset

Now that the small dataset runs end to end, we repeat the process with the full dataset

In [12]:
# Load the complete dataset file
complete_ratings_file = os.path.join(datasets_path, 'ml-latest', 'ratings.csv')
complete_ratings_raw_data = sc.textFile(complete_ratings_file)
complete_ratings_raw_data_header = complete_ratings_raw_data.take(1)[0]

# Parse
complete_ratings_data = complete_ratings_raw_data.filter(lambda line: line!=complete_ratings_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (int(tokens[0]),int(tokens[1]),float(tokens[2]))).cache()
    
print "There are %s recommendations in the complete dataset" % (complete_ratings_data.count())

There are 21063128 recommendations in the complete dataset


Split into training and test sets, and build the model on the training set

In [13]:
training_RDD, test_RDD = complete_ratings_data.randomSplit([7, 3], seed=0L)

complete_model = ALS.train(training_RDD, best_rank, seed=seed, 
                           iterations=iterations, lambda_=regularization_parameter)

Check the results on the test set

In [14]:
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))

predictions = complete_model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    
print 'For testing data the RMSE is %s' % (error)

For testing data the RMSE is 0.82183583368


As you can see, data plays a very large role in data-driven algorithms: with more data, the test RMSE dropped from about 0.97 to about 0.82.

## Making predictions

In [15]:
complete_movies_file = os.path.join(datasets_path, 'ml-latest', 'movies.csv')
complete_movies_raw_data = sc.textFile(complete_movies_file)
complete_movies_raw_data_header = complete_movies_raw_data.take(1)[0]

# Parse
complete_movies_data = complete_movies_raw_data.filter(lambda line: line!=complete_movies_raw_data_header)\
    .map(lambda line: line.split(",")).map(lambda tokens: (int(tokens[0]),tokens[1],tokens[2])).cache()

complete_movies_titles = complete_movies_data.map(lambda x: (int(x[0]),x[1]))
    
print "There are %s movies in the complete dataset" % (complete_movies_titles.count())

There are 27303 movies in the complete dataset


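One more thing: we only want to recommend movies that have at least a minimum number of ratings. This minimum is effectively a hyperparameter. To apply it later, we first count the ratings each movie has received, along with its average rating.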

In [17]:
def get_counts_and_averages(ID_and_ratings_tuple):
    nratings = len(ID_and_ratings_tuple[1])
    return ID_and_ratings_tuple[0], (nratings, float(sum(x for x in ID_and_ratings_tuple[1]))/nratings)

movie_ID_with_ratings_RDD = (complete_ratings_data.map(lambda x: (x[1], x[2])).groupByKey())
movie_ID_with_avg_ratings_RDD = movie_ID_with_ratings_RDD.map(get_counts_and_averages)
movie_rating_counts_RDD = movie_ID_with_avg_ratings_RDD.map(lambda x: (x[0], x[1][0]))
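
In plain Python, the per-movie count and average computed by `get_counts_and_averages` can be sketched with a running `(count, sum)` per movie, which avoids holding every rating of a movie in memory at once. The names and sample data below are illustrative only.

```python
from collections import defaultdict


def counts_and_averages(ratings):
    """Map movie_id -> (number of ratings, average rating)."""
    totals = defaultdict(lambda: [0, 0.0])  # movie_id -> [count, sum]
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += 1
        totals[movie_id][1] += rating
    return {m: (n, s / n) for m, (n, s) in totals.items()}


ratings = [(1, 10, 4.0), (2, 10, 2.0), (1, 20, 5.0)]  # (user, movie, rating)
print(counts_and_averages(ratings))
```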

### Adding a new user's ratings

Add ratings for an arbitrary new user

In [18]:
new_user_ID = 0

# Format: (userID, movieID, rating)
new_user_ratings = [
     (0,260,9), # Star Wars (1977)
     (0,1,8), # Toy Story (1995)
     (0,16,7), # Casino (1995)
     (0,25,8), # Leaving Las Vegas (1995)
     (0,32,9), # Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
     (0,335,4), # Flintstones, The (1994)
     (0,379,3), # Timecop (1994)
     (0,296,7), # Pulp Fiction (1994)
     (0,858,10) , # Godfather, The (1972)
     (0,50,8) # Usual Suspects, The (1995)
    ]
new_user_ratings_RDD = sc.parallelize(new_user_ratings)
print 'New user ratings: %s' % new_user_ratings_RDD.take(10)

New user ratings: [(0, 260, 9), (0, 1, 8), (0, 16, 7), (0, 25, 8), (0, 32, 9), (0, 335, 4), (0, 379, 3), (0, 296, 7), (0, 858, 10), (0, 50, 8)]


We add them to complete_ratings_data with Spark's `union()` transformation

In [19]:
complete_data_with_new_ratings_RDD = complete_ratings_data.union(new_user_ratings_RDD)

We train ALS again, reusing the parameters tuned on the small dataset

In [20]:
from time import time

t0 = time()
new_ratings_model = ALS.train(complete_data_with_new_ratings_RDD, best_rank, seed=seed, 
                              iterations=iterations, lambda_=regularization_parameter)
tt = time() - t0

print "New model trained in %s seconds" % round(tt,3)

New model trained in 56.61 seconds


### Getting the top recommendations

We now predict recommendations for the newly added user

In [21]:
new_user_ratings_ids = map(lambda x: x[1], new_user_ratings) # get just movie IDs
# keep just those not on the ID list
new_user_unrated_movies_RDD = (complete_movies_data.filter(lambda x: x[0] not in new_user_ratings_ids).map(lambda x: (new_user_ID, x[0])))

# Use the input RDD, new_user_unrated_movies_RDD, with new_ratings_model.predictAll() to predict new ratings for the movies
new_user_recommendations_RDD = new_ratings_model.predictAll(new_user_unrated_movies_RDD)

In [22]:
# Transform new_user_recommendations_RDD into pairs of the form (Movie ID, Predicted Rating)
new_user_recommendations_rating_RDD = new_user_recommendations_RDD.map(lambda x: (x.product, x.rating))
new_user_recommendations_rating_title_and_count_RDD = \
    new_user_recommendations_rating_RDD.join(complete_movies_titles).join(movie_rating_counts_RDD)
new_user_recommendations_rating_title_and_count_RDD.take(3)

[(87040, ((6.834512984654888, u'"Housemaid'), 14)),
 (8194, ((5.966704041954459, u'Baby Doll (1956)'), 79)),
 (130390, ((0.6922328127396398, u'Contract Killers (2009)'), 1))]

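The movie IDs by themselves are not very readable, so we reshape the result into the form `(Title, Rating, Ratings Count)`. Titles that contain a comma, such as `"Housemaid` above, are cut short because `split(",")` does not handle the quoted titles in `movies.csv`.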

In [23]:
new_user_recommendations_rating_title_and_count_RDD = \
    new_user_recommendations_rating_title_and_count_RDD.map(lambda r: (r[1][0][1], r[1][0][0], r[1][1]))

Take the top 25 recommendations for the user, keeping only movies with at least 25 ratings

In [24]:
top_movies = new_user_recommendations_rating_title_and_count_RDD.filter(lambda r: r[2]>=25).takeOrdered(25, key=lambda x: -x[1])

print ('TOP recommended movies (with more than 25 reviews):\n%s' %
        '\n'.join(map(str, top_movies)))

TOP recommended movies (with more than 25 reviews):
(u'"Godfather: Part II', 8.503749129186701, 29198)
(u'"Civil War', 8.386497469089297, 257)
(u'Frozen Planet (2011)', 8.372705479107108, 31)
(u'"Shawshank Redemption', 8.258510064442426, 67741)
(u'Cosmos (1980)', 8.252254825768972, 948)
(u'Band of Brothers (2001)', 8.225114960311624, 4450)
(u'Generation Kill (2008)', 8.206487040524653, 52)
(u"Schindler's List (1993)", 8.172761674773625, 53609)
(u'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)', 8.166229786764168, 23915)
(u"One Flew Over the Cuckoo's Nest (1975)", 8.15617022970577, 32948)
(u'Casablanca (1942)', 8.141303207981174, 26114)
(u'Seven Samurai (Shichinin no samurai) (1954)', 8.139633165142612, 11796)
(u'Goodfellas (1990)', 8.12931139039048, 27123)
(u'Star Wars: Episode V - The Empire Strikes Back (1980)', 8.124225700242096, 47710)
(u'Jazz (2001)', 8.078538221315313, 25)
(u"Long Night's Journey Into Day (2000)", 8.050176820606127, 34)
(u'Lawrence of
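
The filter-then-`takeOrdered` step has a direct plain-Python analogue in `heapq.nlargest`. The sketch below uses invented titles and numbers, and `top_recommendations` is a hypothetical helper.

```python
import heapq


def top_recommendations(rows, n=25, min_count=25):
    """rows are (title, predicted_rating, ratings_count) tuples."""
    # Drop movies with too few ratings, then keep the n highest predictions
    popular = (row for row in rows if row[2] >= min_count)
    return heapq.nlargest(n, popular, key=lambda row: row[1])


rows = [("Movie A", 8.1, 300), ("Movie B", 9.5, 3), ("Movie C", 7.4, 25), ("Movie D", 8.9, 40)]
print(top_recommendations(rows, n=2))
```

Here "Movie B" has the highest predicted rating but is excluded because it has only 3 ratings, which is the purpose of the minimum-count filter.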

### Predicting the rating for a single movie

If we want the predicted rating of a particular user for a particular movie, here is a simple way to get it

In [25]:
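my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
# Predict only for the single (user, movie) pair defined above
individual_movie_rating_RDD = new_ratings_model.predictAll(my_movie)
individual_movie_rating_RDD.take(1)

The recorded output below came from a run that passed `new_user_unrated_movies_RDD` instead of `my_movie`, so it shows a rating for movie 122880 rather than movie 500:

[Rating(user=0, product=122880, rating=4.955831875971526)]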

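## Persisting the model

The model can be persisted as shown below, so that an online service can load the pre-trained model directly and use it for predictions.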

In [27]:
from pyspark.mllib.recommendation import MatrixFactorizationModel

model_path = os.path.join('..', 'models', 'movie_lens_als')

# Save and load model
model.save(sc, model_path)
same_model = MatrixFactorizationModel.load(sc, model_path)