## 1 Business Problem

- 1 Study (personalized) recommender systems
- 2 Recommend movies to users based on a dataset of their movie ratings

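## 2 Data Understanding

The data comes from the public [**movielens**](http://files.grouplens.org/datasets/movielens/) dataset.

A brief description of the dataset:

1 100,000 ratings (on a 1-5 scale) from 943 users on 1682 movies

2 Each user has rated at least 20 movies

3 Simple demographic information is also recorded for each user (e.g. age, gender, occupation, zip code)

For a detailed description of the dataset, see [**ml-100k-README**](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt)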
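
The demographic fields listed above are stored separately from the ratings. As a minimal sketch, assuming the pipe-separated `u.user` layout described in the README (user id | age | gender | occupation | zip code), such a file could be read as shown here. The two sample rows are made up for illustration, and the column names are chosen for this example only.

```python
import io
import pandas as pd

# Hypothetical sample rows in the u.user layout; not taken from the real file.
sample_u_user = io.StringIO("1|30|F|engineer|12345\n2|45|M|teacher|67890\n")

# Column names are illustrative; the file itself has no header row.
user_columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_csv(sample_u_user, sep='|', names=user_columns,
                    dtype={'zip_code': str})  # keep zip codes as text

print(users)
```

Reading the zip code as a string avoids losing leading zeros or failing on non-numeric codes.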

## 3 Data Preparation

In [1]:
# Import Python libraries
import numpy as np
import pandas as pd

In [9]:
# Load the ratings file (tab-separated, no header row)
header = ['user_id', 'item_id', 'rating', 'timestamp']
user_rate_movies_data = pd.read_csv("../data/ml-100k/u.data", sep = '\t', names = header)

In [10]:
# Exploratory look at the data
# How many users?
users_number = user_rate_movies_data.user_id.unique().shape[0]
# How many movies?
items_number = user_rate_movies_data.item_id.unique().shape[0]

print('Number of users = ' + str(users_number) + ' | Number of movies = ' + str(items_number))

Number of users = 943 | Number of movies = 1682


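In [11]:
# Split the data into a training set and a test set.
# No random_state is passed, so the split (and the RMSE values reported later) changes from run to run.
from sklearn import model_selection as ms
train_data, test_data = ms.train_test_split(user_rate_movies_data, test_size = 0.25)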
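
If repeatable numbers are wanted, one option is to fix the seed of the split. The sketch below uses a small made-up ratings frame rather than `u.data`; the `random_state=42` value is an assumption added here and is not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings frame with the same kind of columns as u.data (values are made up).
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'item_id': [1, 2, 1, 3, 2, 3, 1, 4],
    'rating':  [5, 3, 4, 2, 1, 5, 3, 4],
})

# random_state is an added assumption; any fixed integer makes the split repeatable.
train_a, test_a = train_test_split(toy_ratings, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(toy_ratings, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```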

### Memory-Based Collaborative Filtering

Two strategies:

- 1 user-item filtering
- 2 item-item filtering

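In [12]:
# Build the user-item matrices (rows = users, columns = items, 0 = no rating).
# itertuples() yields (index, user_id, item_id, rating, timestamp);
# ids in the file are 1-based, so subtract 1 to get array indices.
train_data_matrix = np.zeros((users_number, items_number))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((users_number, items_number))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]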
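
The row-by-row loop can also be written with NumPy integer-array indexing. The following is a sketch on a made-up three-user, three-item frame that checks both approaches give the same matrix; the variable names are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings with 1-based ids, as in u.data (values are made up).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'item_id': [1, 3, 2, 3],
    'rating':  [5, 2, 4, 1],
})
n_users, n_items = 3, 3

# Loop version, mirroring the cell above.
looped = np.zeros((n_users, n_items))
for row in toy.itertuples(index=False):
    looped[row.user_id - 1, row.item_id - 1] = row.rating

# Vectorized version using integer-array indexing.
vectorized = np.zeros((n_users, n_items))
rows = toy['user_id'].to_numpy() - 1
cols = toy['item_id'].to_numpy() - 1
vectorized[rows, cols] = toy['rating'].to_numpy()

print(np.array_equal(looped, vectorized))
```

Both versions assume each (user, item) pair appears at most once; with duplicates, the last assignment wins.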

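In [14]:
# Similarity computation
# Note: pairwise_distances with metric='cosine' returns the cosine distance (1 - cosine similarity),
# so larger values here mean less similar rows. The matrices are used as-is in the cells that follow.
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric = 'cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric = 'cosine')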
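
To make the relationship between cosine distance and cosine similarity concrete, here is a small sketch on a made-up matrix. It only uses `pairwise_distances` and `cosine_similarity` from scikit-learn; the toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Made-up 3x3 ratings matrix.
toy_matrix = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 5.0],
])

cosine_dist = pairwise_distances(toy_matrix, metric='cosine')
cosine_sim = cosine_similarity(toy_matrix)

# Distance and similarity are complements of each other.
print(np.allclose(1.0 - cosine_dist, cosine_sim))
# Each row has distance 0 (similarity 1) to itself.
print(np.diag(cosine_dist))
```

If true similarities are wanted as weights, `1 - pairwise_distances(..., metric='cosine')` or `cosine_similarity(...)` would provide them; doing so would change the RMSE values recorded in this notebook.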

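In [15]:
# Memory-based prediction
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Per-user mean over all items; unrated entries (zeros) are included in the mean
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns the means into a column vector so they broadcast against ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # Weighted sum of mean-centered ratings, normalized by each user's total absolute weight
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted sum of each user's ratings, normalized by each item's total absolute weight
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred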
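
As a worked example of the user-based branch, the sketch below recomputes one predicted entry term by term and compares it with the vectorized expression. The ratings and the weight matrix are made up by hand; they are not derived from MovieLens.

```python
import numpy as np

# Made-up ratings for 2 users and 3 items (0 = no rating).
ratings = np.array([
    [4.0, 0.0, 2.0],
    [5.0, 1.0, 0.0],
])
# Hypothetical user-user weights, chosen by hand for the example.
similarity = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
])

# Vectorized form, following the type='user' branch of predict().
user_means = ratings.mean(axis=1)  # means include the zeros
centered = ratings - user_means[:, np.newaxis]
weights_total = np.abs(similarity).sum(axis=1)[:, np.newaxis]
vectorized = user_means[:, np.newaxis] + similarity.dot(centered) / weights_total

# The same prediction for user 0 and item 1, written out term by term.
u, i = 0, 1
numerator = sum(similarity[u, v] * (ratings[v, i] - user_means[v]) for v in range(2))
denominator = sum(abs(similarity[u, v]) for v in range(2))
by_hand = user_means[u] + numerator / denominator

print(vectorized[u, i], by_hand)
```

Here both user means are 2.0, so the numerator is 1.0 * (-2.0) + 0.5 * (-1.0) = -2.5, the denominator is 1.5, and the prediction is 2.0 - 2.5 / 1.5.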


In [16]:
# Prediction results
# 1 What else do users similar to you like?
user_prediction = predict(train_data_matrix, user_similarity, type='user')
# 2 What else do users who like this item like?
item_prediction = predict(train_data_matrix, item_similarity, type='item')
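
A prediction matrix becomes a recommendation list once items are ranked per user. The sketch below shows one way this might be done on made-up matrices; `top_n_items` is a hypothetical helper introduced here, not a function from the notebook or from any library.

```python
import numpy as np

# Toy stand-ins for train_data_matrix and a prediction matrix (values are made up).
train = np.array([
    [5.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 0.0],
])
predicted = np.array([
    [4.5, 2.0, 3.5, 3.0],
    [1.0, 4.0, 2.5, 3.0],
])

def top_n_items(user_index, predicted, train, n=2):
    """Return 0-based indices of the n highest-scored items the user has not rated."""
    scores = predicted[user_index].copy()
    scores[train[user_index] > 0] = -np.inf   # hide items already rated
    ranked = np.argsort(scores)[::-1]         # best score first
    n_unrated = int((train[user_index] == 0).sum())
    return ranked[:min(n, n_unrated)]

print(top_n_items(0, predicted, train))
print(top_n_items(1, predicted, train))
```

The returned indices are 0-based; adding 1 maps them back to the `item_id` values used in `u.data`.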

In [19]:
# Evaluate the predictions
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Compare only the entries that actually have a rating in ground_truth (nonzero cells)
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(ground_truth, prediction))
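
The masking step matters because most cells of the test matrix are empty. As a small sketch with made-up numbers, the same masked RMSE can be computed directly in NumPy:

```python
import numpy as np

prediction = np.array([
    [3.0, 4.0, 1.0],
    [2.0, 5.0, 4.0],
])
# Zeros mark entries with no rating in the held-out set.
ground_truth = np.array([
    [5.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
])

mask = ground_truth.nonzero()                 # positions of the held-out ratings
errors = prediction[mask] - ground_truth[mask]
masked_rmse = np.sqrt(np.mean(errors ** 2))

print(masked_rmse)
```

Only two cells are rated here, with errors of -2 and 1, so the result is the square root of (4 + 1) / 2 = 2.5.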

In [20]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.1221970840442044
Item-based CF RMSE: 3.4499340037189445


### Model-based Collaborative Filtering

In [21]:
# Compute the sparsity of the ratings matrix
sparsity=round(1.0-len(user_rate_movies_data)/float(users_number * items_number),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%
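
Sparsity here is the fraction of user-item cells that have no rating. The sketch below computes it on a made-up matrix and also repeats the arithmetic for the counts stated earlier (100,000 ratings, 943 users, 1682 movies); the variable names are illustrative.

```python
import numpy as np

# Made-up 3x4 ratings matrix (0 = no rating).
toy_matrix = np.array([
    [5.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

n_ratings = np.count_nonzero(toy_matrix)
toy_sparsity = 1.0 - n_ratings / toy_matrix.size
print(f'toy sparsity = {toy_sparsity:.1%}')

# Same formula with the dataset counts quoted in this notebook.
movielens_sparsity = 1.0 - 100000 / (943 * 1682)
print(round(movielens_sparsity, 3))
```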


In [22]:
# Matrix factorization
import scipy.sparse as sp
from scipy.sparse.linalg import svds

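In [23]:
# Rank-20 truncated SVD of the training matrix, then rebuild the low-rank approximation
u, s, vt = svds(train_data_matrix, k = 20)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
# Report the RMSE of the SVD-based predictions on the test set
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.713218521998522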
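
To see what the truncated SVD returns, here is a minimal sketch on a small random matrix. The matrix, the seed, and `k = 2` are assumptions for illustration; only `scipy.sparse.linalg.svds` and NumPy are used.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Small made-up ratings-like matrix with integer values 0..5.
rng = np.random.default_rng(0)
toy_matrix = rng.integers(0, 6, size=(6, 5)).astype(float)

k = 2  # number of latent factors; must be smaller than min(toy_matrix.shape)
u, s, vt = svds(toy_matrix, k=k)

# Rebuild the rank-k approximation from the three factors.
approx = u @ np.diag(s) @ vt

print(u.shape, s.shape, vt.shape)
print(np.linalg.matrix_rank(approx))
```

The approximation has the same shape as the input but rank at most `k`, which is what lets it fill in values for cells that were zero in the original matrix.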


### References:

1 [**Implementing your own recommender systems in Python**](https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html)