## CoClustering
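This is a clustering-based collaborative filtering algorithm, implemented following reference [1]. The main idea is to cluster users and items separately (similar to bucketing) and then predict ratings from the cluster means. The prediction function is: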
$$
\hat r_{ui}=\overline C_{ui}+(\mu _u-\overline C_u)+(\mu _i-\overline C_i)
$$
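Here, $\overline C_{ui}$ is the mean rating over all members of the co-cluster that contains user u and item i, where each member is a (u,i) pair. $\overline C_u$ is the mean rating over all members of the user cluster that contains user u. $\overline C_i$ is the mean rating over all members of the item cluster that contains item i. $\mu _u$ is the mean of all ratings that user u has given. $\mu _i$ is the mean of all ratings that item i has received.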
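As a minimal sketch of the prediction function above, the following pure-Python example computes each mean from a handful of made-up ratings and hard-coded cluster assignments. The ratings, the `user_cluster` and `item_cluster` mappings, and the helper names are all assumptions for illustration; in the actual algorithm the cluster assignments are learned during fitting rather than given.

```python
from collections import defaultdict

# Toy (user, item, rating) triples; every value here is made up.
ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 4.0),
    ("u2", "i1", 4.0), ("u2", "i3", 2.0),
    ("u3", "i2", 3.0), ("u3", "i3", 1.0),
]

# Hypothetical, fixed cluster assignments (learned in the real algorithm).
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
item_cluster = {"i1": 0, "i2": 0, "i3": 1}


def group_means(triples, key):
    """Average the ratings after grouping them by key(user, item)."""
    groups = defaultdict(list)
    for u, i, r in triples:
        groups[key(u, i)].append(r)
    return {k: sum(v) / len(v) for k, v in groups.items()}


mu_u = group_means(ratings, lambda u, i: u)                      # per-user mean
mu_i = group_means(ratings, lambda u, i: i)                      # per-item mean
avg_cltr_u = group_means(ratings, lambda u, i: user_cluster[u])  # user-cluster mean
avg_cltr_i = group_means(ratings, lambda u, i: item_cluster[i])  # item-cluster mean
avg_cocltr = group_means(                                        # co-cluster mean
    ratings, lambda u, i: (user_cluster[u], item_cluster[i])
)


def predict(u, i):
    """Co-cluster mean plus the user's and the item's offsets from their clusters."""
    cu, ci = user_cluster[u], item_cluster[i]
    return (
        avg_cocltr[(cu, ci)]
        + (mu_u[u] - avg_cltr_u[cu])
        + (mu_i[i] - avg_cltr_i[ci])
    )


estimate = predict("u2", "i2")
print(estimate)
```

The pair (`u2`, `i2`) does not appear in the toy ratings, so the estimate comes entirely from the co-cluster mean and the two offsets.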

In [1]:
from surprise import CoClustering, accuracy, Dataset
from surprise.model_selection import train_test_split

In [2]:
data = Dataset.load_builtin("ml-100k")
trainset, testset = train_test_split(data, test_size=.2, shuffle=True, random_state=10)

In [3]:
model = CoClustering(
    n_cltr_u=3,  # number of user clusters
    n_cltr_i=3,  # number of item clusters
    n_epochs=20,
    random_state=10,
    verbose=True,
)
%time model.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Wall time: 1.44 s


<surprise.prediction_algorithms.co_clustering.CoClustering at 0x1c41b6e5cc0>

In [4]:
cocltr_mean = model.avg_cocltr
cocltr_mean

array([[3.36830905, 3.70071406, 3.65944022],
       [3.57759377, 3.66831564, 2.92864125],
       [2.89403525, 3.78092675, 3.79628463]])

In [5]:
cltr_u_mean = model.avg_cltr_u
cltr_u_mean

array([3.56273375, 3.52025715, 3.49477763])

In [6]:
cltr_i_mean = model.avg_cltr_i
cltr_i_mean

array([3.30288814, 3.72144089, 3.5714011 ])

In [10]:
# cluster assignment vectors for users and items
cu, ci = model.cltr_u, model.cltr_i
cu.shape, ci.shape

((943,), (1653,))

In [11]:
pred = model.test(testset)
pred[:5]

[Prediction(uid='154', iid='302', r_ui=4.0, est=4.693372591340446, details={'was_impossible': False}),
 Prediction(uid='896', iid='484', r_ui=4.0, est=3.5753966810489017, details={'was_impossible': False}),
 Prediction(uid='230', iid='371', r_ui=4.0, est=3.810400856426408, details={'was_impossible': False}),
 Prediction(uid='234', iid='294', r_ui=3.0, est=2.3202648154470222, details={'was_impossible': False}),
 Prediction(uid='25', iid='729', r_ui=4.0, est=4.24822136924692, details={'was_impossible': False})]

In [12]:
accuracy.rmse(pred)

RMSE: 0.9513


0.9513316110326725
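`accuracy.rmse` reports the root mean squared error, that is, the square root of the mean of $(r_{ui}-\hat r_{ui})^2$ over the predictions. As a rough illustration, the sketch below applies that definition by hand to the five predictions printed above. The `pairs` list and the `rmse` helper are illustrative names, and a value computed from five samples will naturally differ from the RMSE over the full test set.

```python
import math

# (r_ui, est) pairs copied from the five predictions shown above.
pairs = [
    (4.0, 4.693372591340446),
    (4.0, 3.5753966810489017),
    (4.0, 3.810400856426408),
    (3.0, 2.3202648154470222),
    (4.0, 4.24822136924692),
]


def rmse(true_and_estimated):
    """Square root of the mean squared difference between r_ui and est."""
    squared_errors = [(r - est) ** 2 for r, est in true_and_estimated]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rmse_first_five = rmse(pairs)
print(rmse_first_five)
```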

In [13]:
trainset.to_inner_iid('302'), trainset.to_inner_uid('154')

(171, 738)

In [14]:
ur_738 = trainset.ur[738]
ir_171 = trainset.ir[171]

In [17]:
# check that the manual result matches the first row of pred
ur_738_mean = sum([rating for _, rating in ur_738])/len(ur_738)
ir_171_mean = sum([rating for _, rating in ir_171])/len(ir_171)
est = (
    cocltr_mean[cu[738], ci[171]]
    + (ur_738_mean - cltr_u_mean[cu[738]])
    + (ir_171_mean - cltr_i_mean[ci[171]])
)
est

4.693372591340445
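The manual result agrees with the `est` in the first row of `pred` except for the last printed digit, most likely because the floating-point operations are carried out in a slightly different order. A tolerance-based comparison is therefore a safer check than `==`. The sketch below uses the two values shown above; the variable names and the tolerance are illustrative choices.

```python
import math

library_est = 4.693372591340446  # est from the first row of pred
manual_est = 4.693372591340445   # value recomputed by hand

# Exact equality fails on the last digit, while a relative tolerance passes.
exactly_equal = library_est == manual_est
close_enough = math.isclose(library_est, manual_est, rel_tol=1e-12)
print(exactly_equal, close_enough)
```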