## Training an ALS-based matrix factorization model on user-category preference scores

Behavior counts you have tallied + scoring rules ==> preference-score dataset ==> ALS-based matrix factorization model


In [1]:
# Spark configuration
from pyspark import SparkConf
from pyspark.sql import SparkSession

SPARK_APP_NAME = "createUserCateRatingALSModel"
SPARK_URL = "yarn"

conf = SparkConf()    # create the Spark config object
config = (
    ("spark.app.name", SPARK_APP_NAME),    # app name; a random name is generated if none is given
    ("spark.executor.memory", "2g"),    # memory per executor, default 1g
    ("spark.master", SPARK_URL),    # Spark master address
    ("spark.executor.cores", "1"),    # CPU cores per executor
    ("spark.executor.instances", "1")    # number of executors; only takes effect on YARN
)
# Full list of options and their descriptions: https://spark.apache.org/docs/latest/configuration.html
conf.setAll(config)

# Create the Spark session from the config object
spark = SparkSession.builder.config(conf=conf).getOrCreate()

In [2]:
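# Spark ML training is memory-based: with large data, little memory, and many iterations, it may run out of memory and fail
# With a checkpoint directory set, intermediate data is written to disk, so after an abnormal exit the job can resume from the last checkpoint instead of starting over
# This treats the symptom rather than the cause: it cannot prevent the out-of-memory error itself
# For large datasets, consider adding memory, or limiting the iteration count and the training data volume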
spark.sparkContext.setCheckpointDir("hdfs://hadoop-master:9000/workspace/3.rs_project/project1/checkPoint/")

In [2]:
!hadoop fs -ls /workspace/3.rs_project/project1/trained_result

Found 2 items
drwxr-xr-x   - root supergroup          0 2019-03-21 11:07 /workspace/3.rs_project/project1/trained_result/models
drwxr-xr-x   - root supergroup          0 2019-03-21 11:21 /workspace/3.rs_project/project1/trained_result/preprocessing-datasets


In [4]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType, FloatType

# Build the schema
schema = StructType([
    StructField("userId", IntegerType()),
    StructField("cateId", IntegerType()),
    StructField("pv", IntegerType()),
    StructField("fav", IntegerType()),
    StructField("cart", IntegerType()),
    StructField("buy", IntegerType())
])

# Load the CSV file from HDFS
cate_count_df = spark.read.csv("hdfs://hadoop-master:9000/workspace/3.rs_project/project1/trained_result/preprocessing-datasets/cate_count.csv", header=True, schema=schema)
cate_count_df.printSchema()
cate_count_df.first()    # first row

root
 |-- userId: integer (nullable = true)
 |-- cateId: integer (nullable = true)
 |-- pv: integer (nullable = true)
 |-- fav: integer (nullable = true)
 |-- cart: integer (nullable = true)
 |-- buy: integer (nullable = true)



Row(userId=301977, cateId=4280, pv=42, fav=None, cart=3, buy=None)

In [6]:
def process_row(r):
    # Process one row: r is a Row object

    # Preference scoring rules:
    #     m: number of times the user performed the behavior
    #     The weights and the count cap are for reference only; tune them for the business scenario
    #     pv: if m<=20: score=0.2*m; else score=4
    #     fav: if m<=20: score=0.4*m; else score=8
    #     cart: if m<=20: score=0.6*m; else score=12
    #     buy: if m<=20: score=1*m; else score=20

    # Note: convert everything to float; Spark is strict about types, so keep them consistent
    pv_count = float(r.pv) if r.pv else 0.0
    fav_count = float(r.fav) if r.fav else 0.0
    cart_count = float(r.cart) if r.cart else 0.0
    buy_count = float(r.buy) if r.buy else 0.0

    pv_score = 0.2*pv_count if pv_count<=20 else 4.0
    fav_score = 0.4*fav_count if fav_count<=20 else 8.0
    cart_score = 0.6*cart_count if cart_count<=20 else 12.0
    buy_score = 1.0*buy_count if buy_count<=20 else 20.0

    rating = pv_score + fav_score + cart_score + buy_score
    # Return the user ID, the category ID, and the user's preference score for the category
    return r.userId, r.cateId, rating
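
The scoring rule can be checked without a cluster. The sketch below is an assumption-labelled restatement, not the notebook's own code: `CateCount`, `WEIGHTS`, `CAP` and `preference_score` are hypothetical names, and the rule is written as `weight * min(count, 20)`, which gives the same values as the if/else form above. The sample row is the first row printed earlier.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row, so the rule can run without Spark.
CateCount = namedtuple("CateCount", ["userId", "cateId", "pv", "fav", "cart", "buy"])

# Weight per behavior and the shared count cap, taken from the rule above.
WEIGHTS = {"pv": 0.2, "fav": 0.4, "cart": 0.6, "buy": 1.0}
CAP = 20


def preference_score(row):
    """Sum weight * min(count, CAP) over the four behaviors; None counts as 0."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        count = getattr(row, field) or 0
        total += weight * min(count, CAP)
    return total


# First row of cate_count_df as printed above: pv is capped at 20, cart contributes 0.6 * 3.
sample = CateCount(userId=301977, cateId=4280, pv=42, fav=None, cart=3, buy=None)
score = preference_score(sample)
print(round(score, 2))
```

For this row the page-view score is capped at 4 and the cart score is 1.8, so the rating is about 5.8.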

In [7]:
# map returns a PythonRDD lazily; toDF wraps it in a DataFrame, and the full computation has not run yet
cate_count_df.rdd.map(process_row).toDF(["userId", "cateId", "rating"])

DataFrame[userId: bigint, cateId: bigint, rating: double]

In [8]:
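# Users' rating data for product categories
# map returns an RDD, so call toDF to convert it to a DataFrame
cate_rating_df = cate_count_df.rdd.map(process_row).toDF(["userId", "cateId", "rating"])
# Note: toDF is not a built-in RDD method; PySpark attaches it to RDDs once a SparkSession exists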

In [None]:
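# A user-cate matrix can be obtained this way,
# but cateId has many distinct values, so the pivot is expensive and needs a lot of memory to finish
# Use with caution

# Fortunately, training the ALS model does not need the user-cate matrix, so this step can be skipped
# cate_rating_df.groupBy("userId").pivot("cateId").min("rating")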

In [7]:
# Users' preference scores for categories
cate_rating_df

DataFrame[userId: bigint, cateId: bigint, rating: double]

#### USER-ITEM rating data is normally converted to a USER-ITEM-MATRIX as follows

![CF introduction](images/CF介绍.png)

#### Here, however, CF recommendation uses Spark's ALS model, so the input does not need to be converted to a matrix beforehand; it is passed directly as USER-ITEM-RATE data
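
To make the difference between the two input shapes concrete, here is a small plain-Python sketch. The triples are hypothetical values, not rows from the dataset; the point is only that the dense matrix has one cell per (user, category) pair while the triple form stores observed ratings only.

```python
# Hypothetical (userId, cateId, rating) triples, the USER-ITEM-RATE shape.
ratings = [
    (1, 1610, 5.8),
    (1, 5607, 2.0),
    (2, 5579, 4.4),
    (3, 1610, 12.0),
]

users = sorted({u for u, _, _ in ratings})
cates = sorted({c for _, c, _ in ratings})

# Dense USER-ITEM matrix: one cell per (user, category) pair, None where unobserved.
lookup = {(u, c): r for u, c, r in ratings}
matrix = [[lookup.get((u, c)) for c in cates] for u in users]

dense_cells = len(users) * len(cates)
observed = len(ratings)
print(f"{observed} observed ratings vs {dense_cells} matrix cells")
```

Even in this toy case more than half of the matrix cells are empty; with many users and thousands of categories the gap grows quickly, which is why the triple form is the more practical input.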

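#### CF rating prediction with Spark's ALS latent factor model

ALS stands for Alternating Least Squares. It is the model-based collaborative filtering (model-based CF) algorithm that Spark's machine learning library provides for recommender systems.

Like SVD, it is a matrix factorization technique that reduces the dimensionality of the data.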
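
The alternating idea can be illustrated with a deliberately simplified rank-1 version in plain Python. This is a sketch under stated assumptions, not Spark's implementation: the function names, the toy ratings, and the regularization value `lam` are all hypothetical, and Spark's ALS uses higher ranks and its own regularization scheme. With one latent factor per user and per item, each least-squares sub-problem has a closed-form solution.

```python
def als_rank1(ratings, n_iters=30, lam=0.01):
    """Rank-1 ALS on (user, item, rating) triples; returns (user_factor, item_factor) dicts."""
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    x = {u: 1.0 for u in users}   # user factors
    y = {i: 1.0 for i in items}   # item factors

    for _ in range(n_iters):
        # Fix the item factors and solve each user's 1-D ridge regression in closed form.
        for u in users:
            num = sum(r * y[i] for uu, i, r in ratings if uu == u)
            den = sum(y[i] ** 2 for uu, i, _ in ratings if uu == u) + lam
            x[u] = num / den
        # Fix the user factors and solve each item's problem the same way.
        for i in items:
            num = sum(r * x[u] for u, ii, r in ratings if ii == i)
            den = sum(x[u] ** 2 for u, ii, _ in ratings if ii == i) + lam
            y[i] = num / den
    return x, y


def squared_error(ratings, x, y):
    """Sum of squared differences between observed and reconstructed ratings."""
    return sum((r - x[u] * y[i]) ** 2 for u, i, r in ratings)


# Toy data with exact rank-1 structure (rating = user scale * item scale);
# the pair (user 3, item 20) is left out, and its true value would be 6.
train = [(1, 10, 1.0), (1, 20, 2.0), (2, 10, 2.0), (2, 20, 4.0), (3, 10, 3.0)]

initial_error = squared_error(train, {1: 1.0, 2: 1.0, 3: 1.0}, {10: 1.0, 20: 1.0})
user_f, item_f = als_rank1(train)
final_error = squared_error(train, user_f, item_f)
held_out_prediction = user_f[3] * item_f[20]
print(initial_error, final_error, held_out_prediction)
```

The training error drops sharply from its initial value, and the prediction for the held-out pair lands close to 6, which is the sense in which the factorization fills in missing ratings.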

#### Detailed usage: [pyspark.ml.recommendation.ALS](https://spark.apache.org/docs/2.2.2/api/python/pyspark.ml.html?highlight=vectors#module-pyspark.ml.recommendation)

Note: because the dataset is very large, memory-based CF algorithms are not considered here.

Reference: [Why Spark only has ALS](https://www.cnblogs.com/mooba/p/6539142.html)

In [9]:
# Use the ALS matrix factorization method in pyspark for CF rating prediction
# Documentation: https://spark.apache.org/docs/2.2.2/api/python/pyspark.ml.html?highlight=vectors#module-pyspark.ml.recommendation
from pyspark.ml.recommendation import ALS   # ml: DataFrame-based, mllib: RDD-based

# Train the ALS model on the rating data
als = ALS(userCol='userId', itemCol='cateId', ratingCol='rating', checkpointInterval=2)

# Training takes a long time here
model = als.fit(cate_rating_df)

#### Once the model is trained, call its methods to use it; [see the API reference](https://spark.apache.org/docs/2.2.2/api/python/pyspark.ml.html?highlight=alsmodel#pyspark.ml.recommendation.ALSModel)

In [10]:
# model.recommendForAllUsers(N) recommends the top-N items for every user
ret = model.recommendForAllUsers(3)
# Because it recommends for all users, this also takes a long time
ret.show()
# The results are stored in the recommendations column
ret.select("recommendations").show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   148|[[5607,8.091523],...|
|   463|[[1610,8.860008],...|
|   471|[[1610,13.1980295...|
|   496|[[3347,6.303711],...|
|   833|[[5607,10.028404]...|
|  1088|[[5731,6.969639],...|
|  1238|[[1610,16.75008],...|
|  1342|[[5607,9.428972],...|
|  1580|[[5579,8.038961],...|
|  1591|[[5607,11.379921]...|
|  1645|[[201,12.506715],...|
|  1829|[[1610,19.828497]...|
|  1959|[[5631,10.744259]...|
|  2122|[[5737,11.620426]...|
|  2142|[[1610,12.57279],...|
|  2366|[[1610,13.826477]...|
|  2659|[[1610,14.002829]...|
|  2866|[[1610,11.263525]...|
|  3175|[[11568,1.8160022...|
|  3749|[[1610,3.5862575]...|
+------+--------------------+
only showing top 20 rows

+--------------------+
|     recommendations|
+--------------------+
|[[5607,8.091523],...|
|[[1610,8.860008],...|
|[[1610,13.1980295...|
|[[3347,6.303711],...|
|[[5607,10.028404]...|
|[[5731,6.969639],...|
|[[1610,16.75008],...|
|[[5607,9.428972],...|
|

In [12]:
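# model.recommendForUserSubset recommends the top-N items for a subset of users

# Important: the recommendForUserSubset API is not available in Spark 2.2.2 (it was added in 2.3.0)
dataset = spark.createDataFrame([[1],[2],[3]])
dataset = dataset.withColumnRenamed("_1", "userId")
ret = model.recommendForUserSubset(dataset, 3)

# Recommending for only a few users, so this runs quickly
ret.show()
ret.collect()    # Note: collect loads all the data into driver memory; use with caution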

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[[1610, 25.4989],...|
|     3|[[5607, 13.665942...|
|     2|[[5579, 5.9051886...|
+------+--------------------+



[Row(userId=1, recommendations=[Row(cateId=1610, rating=25.498899459838867), Row(cateId=5737, rating=24.901548385620117), Row(cateId=3347, rating=20.736785888671875)]),
 Row(userId=3, recommendations=[Row(cateId=5607, rating=13.665942192077637), Row(cateId=1610, rating=11.770171165466309), Row(cateId=3347, rating=10.35690689086914)]),
 Row(userId=2, recommendations=[Row(cateId=5579, rating=5.90518856048584), Row(cateId=2447, rating=5.624575138092041), Row(cateId=5690, rating=5.2555742263793945)])]

In [None]:
# Given userId and cateId, transform predicts the rating; sorting by the predicted rating also yields a TOP-N recommendation
model.transform
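
The ranking step that follows prediction can be sketched in plain Python. Everything here is hypothetical: the rank-2 factor vectors are made-up values standing in for a trained model's user and item factors, and `predict` and `top_n` are illustrative helper names. A factorization model scores a (user, category) pair as the dot product of the two factor vectors, and TOP-N is then a sort over those scores.

```python
import heapq

# Hypothetical rank-2 latent factors, standing in for a trained model's factors.
user_factors = {1: [1.0, 0.5], 2: [0.2, 1.5]}
cate_factors = {
    1610: [2.0, 0.1],
    5607: [0.5, 2.0],
    3347: [1.0, 0.8],
    5579: [0.1, 0.2],
}


def predict(user_id, cate_id):
    """Predicted preference: dot product of the user and category factor vectors."""
    return sum(a * b for a, b in zip(user_factors[user_id], cate_factors[cate_id]))


def top_n(user_id, n):
    """Score every category for one user and keep the n highest-scoring category IDs."""
    scored = ((predict(user_id, c), c) for c in cate_factors)
    return [c for _, c in heapq.nlargest(n, scored)]


print(top_n(1, 2))
print(top_n(2, 2))
```

Scoring every category for every user is what makes the all-users recommendation slow; restricting the candidate pairs before scoring is the usual way to keep it cheap.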

In [11]:
# Save the model
model.save("hdfs://hadoop-master:9000/workspace/3.rs_project/project1/trained_result/models/userCateRatingALSModel.obj")

In [12]:
# List the saved model files
!hadoop fs -ls /workspace/3.rs_project/project1/trained_result/models

Found 3 items
drwxr-xr-x   - root supergroup          0 2019-03-21 11:07 /workspace/3.rs_project/project1/trained_result/models/CTRModel_AllOneHot.obj
drwxr-xr-x   - root supergroup          0 2019-03-21 11:07 /workspace/3.rs_project/project1/trained_result/models/CTRModel_Normal.obj
drwxr-xr-x   - root supergroup          0 2019-03-21 11:07 /workspace/3.rs_project/project1/trained_result/models/userCateRatingALSModel.obj


In [3]:
# Test the saved model
from pyspark.ml.recommendation import ALSModel
# Load the previously saved model from HDFS
als_model = ALSModel.load("hdfs://hadoop-master:9000/workspace/3.rs_project/project1/trained_result/models/userCateRatingALSModel.obj")
als_model

ALS_4bd58e754c7dc776d7b0

In [14]:
# model.recommendForAllUsers(N) recommends the top-N items for every user
# This takes a long time to run
result = als_model.recommendForAllUsers(3)
result.show()

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   148|[[5607,8.091523],...|
|   463|[[1610,8.860008],...|
|   471|[[1610,13.1980295...|
|   496|[[3347,6.303711],...|
|   833|[[5607,10.028404]...|
|  1088|[[5731,6.969639],...|
|  1238|[[1610,16.75008],...|
|  1342|[[5607,9.428972],...|
|  1580|[[5579,8.038961],...|
|  1591|[[5607,11.379921]...|
|  1645|[[201,12.506715],...|
|  1829|[[1610,19.828497]...|
|  1959|[[5631,10.744259]...|
|  2122|[[5737,11.620426]...|
|  2142|[[1610,12.57279],...|
|  2366|[[1610,13.826477]...|
|  2659|[[1610,14.002829]...|
|  2866|[[1610,11.263525]...|
|  3175|[[11568,1.8160022...|
|  3749|[[1610,3.5862575]...|
+------+--------------------+
only showing top 20 rows



In [5]:
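# Recall into redis
def recall_cate_by_cf(partition):
    # Imported inside the function because it runs on the executors;
    # the redis package must be installed on every executor node
    import json
    import redis

    host = "192.168.19.137"
    port = 6379

    # Create the redis connection pool
    pool = redis.ConnectionPool(host=host, port=port)
    # Create the redis client
    client = redis.Redis(connection_pool=pool)
    for row in partition:
        # Serialize the list explicitly; newer redis-py versions reject list values
        client.hset("recall_cate", row.userId, json.dumps([i.cateId for i in row.recommendations]))
# Process each partition
result.foreachPartition(recall_cate_by_cf)

# Note: what is recalled here is each user's top-n categories of interest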

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 13, hadoop-slave1, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/bigdata/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1553606288214_0001/container_1553606288214_0001_01_000002/pyspark.zip/pyspark/worker.py", line 178, in main
    process()
  File "/root/bigdata/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1553606288214_0001/container_1553606288214_0001_01_000002/pyspark.zip/pyspark/worker.py", line 173, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 2430, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 2430, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 2430, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 353, in func
    return f(iterator)
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 801, in func
    r = f(it)
  File "<ipython-input-5-8e0e916218d0>", line 6, in recall_cate_by_cf
ModuleNotFoundError: No module named 'redis'

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:194)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:235)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1533)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1521)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1520)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1520)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1748)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/bigdata/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1553606288214_0001/container_1553606288214_0001_01_000002/pyspark.zip/pyspark/worker.py", line 178, in main
    process()
  File "/root/bigdata/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1553606288214_0001/container_1553606288214_0001_01_000002/pyspark.zip/pyspark/worker.py", line 173, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 2430, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 2430, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 2430, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 353, in func
    return f(iterator)
  File "/miniconda2/envs/py365/lib/python3.6/site-packages/pyspark-2.2.2-py3.6.egg/pyspark/rdd.py", line 801, in func
    r = f(it)
  File "<ipython-input-5-8e0e916218d0>", line 6, in recall_cate_by_cf
ModuleNotFoundError: No module named 'redis'

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:194)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:235)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


In [10]:
# Total number of rows; check that the number of entries in redis matches
result.count()

1136340
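
The write-then-verify step can be rehearsed without a redis server. In the sketch below, `FakeRedis` is a hypothetical in-memory stand-in that mimics only the `hset`, `hget` and `hlen` calls, and storing the category list as JSON is an assumption rather than something the notebook prescribes. The two rows reuse the category IDs shown for users 1 and 3 in the `recommendForUserSubset` output.

```python
import json


class FakeRedis:
    """In-memory stand-in exposing only hset/hget/hlen, so the sketch runs without a server."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[str(key)] = value

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(str(key))

    def hlen(self, name):
        return len(self._hashes.get(name, {}))


# Rows shaped like the recommendation output, reduced to plain dicts for illustration.
rows = [
    {"userId": 1, "recommendations": [{"cateId": 1610}, {"cateId": 5737}, {"cateId": 3347}]},
    {"userId": 3, "recommendations": [{"cateId": 5607}, {"cateId": 1610}, {"cateId": 3347}]},
]

client = FakeRedis()
for row in rows:
    cate_ids = [rec["cateId"] for rec in row["recommendations"]]
    client.hset("recall_cate", row["userId"], json.dumps(cate_ids))

# Consistency check: one hash field per recommendation row.
stored = client.hlen("recall_cate")
restored = json.loads(client.hget("recall_cate", 1))
print(stored, restored)
```

Against a real server, the same check would compare the hash length of `recall_cate` with the value returned by `result.count()`.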