This walkthrough applies the similarity metrics in the sklearn library to the MovieLens dataset ([dataset description](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt)).

#### Read the raw dataset and build the matrices
Read the raw data and load it into a DataFrame.

In [1]:
import numpy as np
import pandas as pd

# The u.data file contains the full dataset.
# A forward slash keeps the path portable (it also works on Windows).
u_data_path = "ml-100k/"
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv(u_data_path+'u.data', sep='\t', names=header)
print(df.head(5))
print(len(df))
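# The first five rows are printed above. Next, count the distinct users and movies.
n_users = df.user_id.unique().shape[0]  # unique() drops duplicates; shape[0] is the number of distinct values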
n_items = df.item_id.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))
# Split the data into a training set and a test set (25% held out for testing)
from sklearn import model_selection as cv
train_data, test_data = cv.train_test_split(df, test_size=0.25)

   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596
100000
Number of users = 943 | Number of movies = 1682
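
Because `train_test_split` shuffles the rows randomly, the split (and every number computed from it) changes from run to run unless a seed is fixed; in sklearn this is done with the `random_state` parameter. The sketch below illustrates the same idea with NumPy only. The helper `split_indices` and the seed value are hypothetical names chosen for this illustration, not part of the notebook.

```python
import numpy as np

def split_indices(n_rows, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_rows)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # Round the test share up, then take it from the front of the shuffle.
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# 100000 is the number of ratings reported above.
train_idx, test_idx = split_indices(100000, test_size=0.25, seed=42)
print(len(train_idx), len(test_idx))
```

Calling the helper again with the same seed returns the same indices, which is what makes a reported RMSE reproducible.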


#### Create two matrices, one for the training set and one for the test set

In [2]:
#Create two user-item matrices, one for training and another for testing
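# The two differ only in their source rows (train_data vs. test_data).
# In each tuple, line[0] is the DataFrame index; user and item IDs are 1-based, hence the -1.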
train_data_matrix = np.zeros((n_users, n_items))
print(train_data_matrix.shape)
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

(943, 1682)
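
The loop above fills one cell per rating. As a minimal sketch of the same mapping, the example below builds a tiny user-item matrix with NumPy fancy indexing. The `triples` array is made-up toy data standing in for `u.data` rows, and the 3 x 3 size is an assumption for illustration.

```python
import numpy as np

# Toy (user_id, item_id, rating) triples with 1-based IDs.
triples = np.array([
    [1, 1, 5],
    [1, 3, 3],
    [2, 2, 4],
    [3, 1, 1],
])
n_users, n_items = 3, 3

matrix = np.zeros((n_users, n_items))
# Subtract 1 so that user/item ID k lands in row/column k - 1.
matrix[triples[:, 0] - 1, triples[:, 1] - 1] = triples[:, 2]
print(matrix)
```

Cells without a rating stay at 0, which is why later steps need to distinguish "no rating" from an actual score.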


#### Compute similarity

In [3]:
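# sklearn's pairwise_distances function computes the pairwise cosine measure.
# Note: metric='cosine' returns the cosine distance, i.e. 1 - cosine similarity.
# Because all ratings are non-negative, the values lie between 0 and 1.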
from sklearn.metrics.pairwise import pairwise_distances

user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
# Transposing the matrix compares items (movies) instead of users
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')
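
With `metric='cosine'`, `pairwise_distances` returns a distance, so identical rows score 0 rather than 1; sklearn also provides `sklearn.metrics.pairwise.cosine_similarity` for the similarity itself. The NumPy-only sketch below shows the relationship between the two on a toy matrix. The helper name `cosine_similarity_matrix` and the data are assumptions made for this illustration.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Row-by-row cosine similarity; assumes no row is all zeros."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms
    return unit @ unit.T

ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 1.0],
                    [0.0, 2.0, 5.0]])

similarity = cosine_similarity_matrix(ratings)
distance = 1.0 - similarity  # cosine distance is 1 minus the similarity

print(np.round(similarity, 3))
print(np.round(distance, 3))
```

The diagonal of `similarity` is 1 and the diagonal of `distance` is 0, so the two matrices weight neighbours in opposite order.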

#### Prediction
##### User-based CF prediction
- The following formula can be used to make a user-based CF prediction:
![title](img1.png)  
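
For reference, the user-based branch of the `predict` function in this section can be transcribed as the expression below. The notation is chosen here and may differ from the image: $r_{v,i}$ is the rating of user $v$ for item $i$ (0 if missing), $\bar{r}_u$ is the mean of row $u$ of the rating matrix, and $s_{u,v}$ is the user-user entry of the matrix passed as `similarity`.

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \left|s_{u,v}\right|}
$$

Note that, as coded, $\bar{r}_u$ averages over all items, including the zero entries that mark missing ratings.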

##### Item-based CF prediction
- The following formula can be used to make an item-based CF prediction:
![title](img2.png)
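
Likewise, the item-based branch of the `predict` function can be transcribed as follows, with notation chosen here: $r_{u,j}$ is the rating of user $u$ for item $j$ and $s_{i,j}$ is the item-item entry of the matrix passed as `similarity`.

$$
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \left|s_{i,j}\right|}
$$

No mean-centering is applied in this branch, since the prediction is built from the target user's own ratings.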

In [4]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns mean_user_rating into a column so it broadcasts across the rows of ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        # The outer parentheses let the expression continue onto the next line
        pred = (mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) /
                np.array([np.abs(similarity).sum(axis=1)]).T)
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
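
To make the matrix expression in the user-based branch easier to follow, the sketch below recomputes it with explicit loops on a toy example and checks that both versions agree. The `ratings` and `weights` arrays are made-up values for illustration; `weights` stands in for whatever matrix is passed as `similarity`.

```python
import numpy as np

ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 1.0],
                    [0.0, 2.0, 5.0]])
# Hypothetical symmetric user-user weights, chosen only for this example.
weights = np.array([[1.0, 0.5, 0.1],
                    [0.5, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])

# Vectorized form, mirroring the user-based branch of predict().
user_mean = ratings.mean(axis=1)
diff = ratings - user_mean[:, np.newaxis]
row_norm = np.abs(weights).sum(axis=1)[:, np.newaxis]
vectorized = user_mean[:, np.newaxis] + weights.dot(diff) / row_norm

# The same computation written out one cell at a time.
n_users, n_items = ratings.shape
looped = np.zeros_like(ratings)
for u in range(n_users):
    for i in range(n_items):
        weighted_sum = sum(weights[u, v] * (ratings[v, i] - user_mean[v])
                           for v in range(n_users))
        looped[u, i] = user_mean[u] + weighted_sum / np.abs(weights[u]).sum()

print(np.allclose(vectorized, looped))
```

The loop version is far slower on the full 943 x 1682 matrix, which is the reason the notebook uses the dot-product form.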

#### Predict with both methods

In [5]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

#### Evaluation
Many evaluation metrics exist, but one of the most popular for measuring prediction accuracy is the Root Mean Squared Error (RMSE), defined as:
![title](img3.png)
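
Written out in the standard form, with notation chosen here ($N$ is the number of ratings in the test set, $r_n$ the actual rating and $\hat{r}_n$ the predicted one):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(r_n - \hat{r}_n\right)^2}
$$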

In [6]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # nonzero() returns the indices of the non-zero entries, so only the cells
    # that actually hold a rating in the test matrix are compared.
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

print ('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print ('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.1239494091919484
Item-based CF RMSE: 3.4519912799951475
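
As a small worked example of the masking step, the NumPy-only sketch below computes the RMSE over just the cells that hold a rating in the ground truth. The function name `masked_rmse` and the 2 x 2 arrays are assumptions made for this illustration.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
pred = np.array([[4.0, 2.0],
                 [1.0, 5.0]])

# Only the two rated cells count: errors are -1 and 2, so RMSE = sqrt((1 + 4) / 2).
print(masked_rmse(pred, truth))
```

The predictions at the unrated cells (2.0 and 1.0 here) do not affect the score.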
