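Today's collaborative filtering recommendation algorithms fall into two main families: (1) KNN algorithms based on similarity computation, and (2) SVD algorithms based on matrix factorization. Let's look at KNN first.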
https://blog.csdn.net/weixin_42608414/article/details/87891057

# Collaborative filtering with the KNN algorithm

In [2]:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
I2=np.array([[0,5,5,0,0,5,0,3,0,2]])
I1=np.array([[0,4,5,0,4,0,0,0,0,0]])
dist=euclidean_distances(I2,I1)
print('distance between I2 and I1:',dist)

distance between I2 and I1: [[7.41619849]]
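
As a sanity check on the number above, the same distance can be computed by hand with NumPy. This is only a minimal sketch; the names `diff` and `manual` are illustrative and not part of the original notebook.

```python
import numpy as np

I2 = np.array([0, 5, 5, 0, 0, 5, 0, 3, 0, 2])
I1 = np.array([0, 4, 5, 0, 4, 0, 0, 0, 0, 0])

# Euclidean distance: square root of the sum of squared differences
diff = I2 - I1
manual = np.sqrt(np.sum(diff ** 2))
print(manual)  # sqrt(55), about 7.416
```

The squared differences add up to 55, and the square root of 55 matches the value returned by `euclidean_distances`.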


Load the data

In [3]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
%matplotlib inline
 
movies = pd.read_csv( './data/douban/movies.csv')
print('Movies with a title: %d' % movies[~pd.isnull(movies.title)].shape[0])
print('Movies without a title: %d' % movies[pd.isnull(movies.title)].shape[0])
print('Movies in total: %d' % movies.shape[0])
movies.sample(10)

Movies with a title: 33258
Movies without a title: 24166
Movies in total: 57424


Unnamed: 0,movieId,title
49761,49761,特公 TOKKO
43755,43755,H.I.T（类　型：刑事）
42435,42435,
38600,38600,Team Picture
11816,11816,Rio Bravo
22118,22118,Huie\'s Sermon
41009,41009,Anderson Cooper 360°
2203,2203,蟲師
9510,9510,Murder by Death
47255,47255,骑牛难下


In [4]:
ratings = pd.read_csv('./data/douban/ratings.csv')
print('Number of users: %d' % ratings.userId.unique().shape[0])
print('Number of movies: %d' % ratings.movieId.unique().shape[0])
print('Number of ratings: %d' % ratings.shape[0])
ratings.head()

Number of users: 28718
Number of movies: 57424
Number of ratings: 2828500


Unnamed: 0,userId,movieId,rating,timestamp
0,0,0,5,1318222486
1,0,1,4,1313813583
2,0,2,5,1313458035
3,0,3,5,1313327802
4,0,4,3,1312126734


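There is a lot of rating data. If we fed all of it to the recommendation algorithm we would run out of memory and hit a MemoryError, so we only look at movies that get a lot of public attention, that is, movies with many ratings and good scores. Movies that are poorly rated or rarely rated are not worth recommending anyway.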
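
A rough back-of-the-envelope estimate shows why memory becomes a problem. The sketch below only does arithmetic on the user and movie counts printed earlier; it assumes the full movie-by-user matrix is stored densely as 8-byte floats and ignores any pandas overhead.

```python
n_movies = 57424      # number of distinct movieIds in ratings.csv
n_users = 28718       # number of distinct userIds in ratings.csv
bytes_per_value = 8   # assumption: float64 entries

cells = n_movies * n_users
dense_gb = cells * bytes_per_value / 1e9
print(f"{cells:,} cells, about {dense_gb:.1f} GB as a dense float64 matrix")
```

Under these assumptions the dense matrix alone would need on the order of 13 GB, which is more than many laptops can hold in RAM.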

To find out which movies get the most attention, we need to merge the movies table with the ratings table.

In [5]:
combine_movie_rating= pd.merge(ratings,movies,on='movieId')
combine_movie_rating=combine_movie_rating.drop(['timestamp'],axis = 1)
print(len(combine_movie_rating))
combine_movie_rating.head()

2828500


Unnamed: 0,userId,movieId,rating,title
0,0,0,5,
1,529,0,4,
2,1247,0,5,
3,1335,0,5,
4,1397,0,5,


We can see that many records have an empty title, so we first filter out the records that have no movie title.

In [6]:
combine_movie_rating = combine_movie_rating.dropna(axis = 0 ,subset=['title'])
print(len(combine_movie_rating))
combine_movie_rating.head()

2604995


Unnamed: 0,userId,movieId,rating,title
22,0,1,4,Harry Potter and the Deathly Hallows: Part II
23,21,1,4,Harry Potter and the Deathly Hallows: Part II
24,25,1,5,Harry Potter and the Deathly Hallows: Part II
25,34,1,4,Harry Potter and the Deathly Hallows: Part II
26,36,1,5,Harry Potter and the Deathly Hallows: Part II


In [7]:
combine_movie_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2604995 entries, 22 to 2828497
Data columns (total 4 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   userId   int64 
 1   movieId  int64 
 2   rating   int64 
 3   title    object
dtypes: int64(3), object(1)
memory usage: 99.4+ MB


In [8]:
# Rebuild a contiguous index after dropping rows
combine_movie_rating.index=range(combine_movie_rating.shape[0])
combine_movie_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2604995 entries, 0 to 2604994
Data columns (total 4 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   userId   int64 
 1   movieId  int64 
 2   rating   int64 
 3   title    object
dtypes: int64(3), object(1)
memory usage: 79.5+ MB


Next, let's count the total number of ratings for each movie:

In [9]:
movie_rating_count=pd.DataFrame(combine_movie_rating.
                    groupby(['movieId'])['rating'].
                    count().
                    reset_index().
                    rename(columns={'rating':'totalRatingCount'})                   
                   )
movie_rating_count.head()

Unnamed: 0,movieId,totalRatingCount
0,1,1703
1,2,1080
2,4,1898
3,5,2218
4,10,4981


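With the total rating count for each movie in hand, we can work out which movies are the most watched and most popular. First, let's look at how the total rating counts are distributed.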


In [10]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(movie_rating_count['totalRatingCount'].describe())

count   33258.000
mean       78.327
std       262.606
min         1.000
25%         3.000
50%        10.000
75%        38.000
max      6574.000
Name: totalRatingCount, dtype: float64


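We can see that there are 33,258 movies in total. Half of them have no more than 10 ratings, which means the other half have at least 10. This is where the popularity criterion in my previous blog post (more than 10 ratings) came from. But a movie with only about 10 ratings can hardly be called popular, so it should not be recommended, let alone fed to the recommendation algorithm. The algorithm consumes a lot of system resources when it runs, and data with little value should stay out of the computation; otherwise the system may run out of memory and raise a MemoryError. Next, let's look at the top 10% of the quantile table: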

In [11]:
print(movie_rating_count['totalRatingCount'].quantile(np.arange(.9,1,.01)))

0.900    158.000
0.910    184.000
0.920    211.440
0.930    253.000
0.940    303.580
0.950    375.150
0.960    462.000
0.970    590.000
0.980    814.860
0.990   1298.860
Name: totalRatingCount, dtype: float64


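We can see that 90% of the movies have fewer than 158 ratings, so the remaining 10% have 158 or more. About 9% of the movies have more than 184 ratings, 8% have more than 211, and 7% have more than 253. I think a movie with at least 158 ratings can reasonably be called popular, so for now we take 158 as the threshold for identifying popular movies. (You may see it differently; feel free to try other thresholds.) With 33,258 movies in total, 10% comes to roughly 3,325 movies, so we will recommend from that set.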
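
The threshold does not have to be hard-coded. As a minimal sketch on made-up numbers (the `counts` frame below is hypothetical toy data, not the Douban counts), the 90th percentile can be computed and applied directly:

```python
import pandas as pd

# Hypothetical toy counts; the real values live in movie_rating_count
counts = pd.DataFrame({
    "movieId": range(1, 11),
    "totalRatingCount": [1, 3, 5, 8, 10, 20, 38, 90, 158, 6574],
})

# Take the 90th percentile of the counts as the popularity threshold
threshold = counts["totalRatingCount"].quantile(0.9)

# Keep only the movies at or above the threshold
popular = counts[counts["totalRatingCount"] >= threshold]
print(threshold)
print(popular)
```

Changing `0.9` to another quantile is an easy way to experiment with stricter or looser definitions of "popular".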

In [12]:
rating_with_totalRatingCount = pd.merge(combine_movie_rating,movie_rating_count,on="movieId")
rating_with_totalRatingCount.sample(5)

Unnamed: 0,userId,movieId,rating,title,totalRatingCount
961823,1728,1206,3,The Mist,937
117466,1618,125,3,Into the Wild,1515
2576447,5823,33788,4,Poirot：The Double Clue,4
500018,1837,574,5,ドラえもん,1061
455263,1193,482,4,Transformers: Revenge of the Fallen,2964


In [13]:
(rating_with_totalRatingCount.title=='一一').sum()

1833

In [14]:
# About 10% of the movies have at least 158 ratings
popular_threshold=158
rating_popular_movies= rating_with_totalRatingCount.query('totalRatingCount>=@popular_threshold')
rating_popular_movies.head()

Unnamed: 0,userId,movieId,rating,title,totalRatingCount
0,0,1,4,Harry Potter and the Deathly Hallows: Part II,1703
1,21,1,4,Harry Potter and the Deathly Hallows: Part II,1703
2,25,1,5,Harry Potter and the Deathly Hallows: Part II,1703
3,34,1,4,Harry Potter and the Deathly Hallows: Part II,1703
4,36,1,5,Harry Potter and the Deathly Hallows: Part II,1703


# Implementing the KNN algorithm

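Now we build a movie-by-user rating matrix. Each row is a movie, each column is a user, and each value is the rating that user gave that movie. If a user has not rated a movie, the value is set to 0. We then convert the values (ratings) of the DataFrame into a sparse matrix so that the computation is more efficient.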
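
To make the shape of this matrix concrete, here is a small sketch on a hypothetical toy ratings table (the `toy` frame is made up for illustration; only `pivot`, `fillna` and `csr_matrix` mirror the real pipeline):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    "userId":  [0, 0, 1, 2, 2],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [5, 3, 4, 2, 5],
})

# Rows are movies, columns are users, missing ratings become 0
toy_pivot = toy.pivot(index="movieId", columns="userId", values="rating").fillna(0)
toy_sparse = csr_matrix(toy_pivot.values)

print(toy_pivot)
print(toy_sparse.shape, toy_sparse.nnz)  # only the non-zero ratings are stored
```

Note that `pivot` raises an error if the same (movieId, userId) pair appears more than once, so duplicate ratings would need to be dropped or aggregated first.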

In [15]:
from scipy.sparse import csr_matrix
ratings_pivot = rating_popular_movies.pivot(index='movieId', columns='userId',values='rating').fillna(0)
ratings_pivot_sparse = csr_matrix(ratings_pivot.values)

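Next we use NearestNeighbors from sklearn.neighbors, with the parameters (metric='cosine', algorithm='brute') so that it compares rating vectors by cosine distance (1 minus cosine similarity). Finally, we fit the model.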
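
For intuition about the metric, the sketch below reuses the two vectors from the Euclidean example at the top and checks that scikit-learn's cosine distance equals 1 minus the hand-computed cosine similarity. The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

I2 = np.array([[0, 5, 5, 0, 0, 5, 0, 3, 0, 2]], dtype=float)
I1 = np.array([[0, 4, 5, 0, 4, 0, 0, 0, 0, 0]], dtype=float)

# Cosine similarity by hand: dot product over the product of the norms
cos_sim = (I2 @ I1.T).item() / (np.linalg.norm(I2) * np.linalg.norm(I1))

# scikit-learn's cosine distance is 1 - cosine similarity
cos_dist = cosine_distances(I2, I1)[0, 0]
print(cos_sim, cos_dist)
```

A smaller cosine distance means the two rating vectors point in a more similar direction, regardless of how many ratings each movie has.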

In [16]:
model_nn_binary = NearestNeighbors(metric='cosine', algorithm='brute')
model_nn_binary.fit(ratings_pivot_sparse)

NearestNeighbors(algorithm='brute', metric='cosine')

# Testing and recommending

In [17]:
movieId = 2550
# Query with the rating vector of the chosen movie. The movie itself comes back
# as its own nearest neighbour, so ask for 11 neighbours to get 10 recommendations.
distances, indices = model_nn_binary.kneighbors(ratings_pivot.loc[[movieId]].values, n_neighbors = 11)
 
for i in range(0, len(distances.flatten())):
    likelymovieId=ratings_pivot.index[indices.flatten()[i]]
    if i == 0:
        print('Current movie:',movies[movies.movieId==movieId]['title'].values[0])
    else:
        print('Recommendation {0}: {1}, distance: {2}'.format(i, movies[movies.movieId==likelymovieId]['title'].values[0], 
                                                    distances.flatten()[i]))

Current movie: 黃飛鴻之三獅王爭霸
Recommendation 1: 黃飛鴻之二男兒當自強, distance: 0.2236617772379993
Recommendation 2: 黃飛鴻, distance: 0.2771905809331011
Recommendation 3: 方世玉, distance: 0.3036596141407937
Recommendation 4: 太极张三丰, distance: 0.3454806002742725
Recommendation 5: 方世玉续集, distance: 0.3489783330675462
Recommendation 6: 精武英雄, distance: 0.38593751478880156
Recommendation 7: 新少林五祖, distance: 0.4035055605033421
Recommendation 8: 倚天屠龍記之魔教教主, distance: 0.4150684119775777
Recommendation 9: 中南海保镖, distance: 0.45784800244154567
Recommendation 10: 我是谁, distance: 0.45895083257042657
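
The loop above treats the first neighbour as the query movie itself. The sketch below illustrates that behaviour on a hypothetical 4-item toy matrix (the data and the names `items`, `model` and `k` are made up for illustration): the query row comes back first with distance close to 0, which is why one extra neighbour is requested.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy item-by-user matrix: 4 items rated by 3 users
items = csr_matrix(np.array([
    [5, 4, 0],
    [5, 5, 0],
    [0, 0, 5],
    [1, 0, 4],
], dtype=float))

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(items)

# Ask for k + 1 neighbours because the query row is its own nearest neighbour
k = 2
distances, indices = model.kneighbors(items[0], n_neighbors=k + 1)
print(indices.flatten(), distances.flatten())
```

Dropping the first entry of `indices` leaves the `k` actual recommendations.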


# Collaborative filtering with matrix factorization (SVD)

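Singular Value Decomposition (SVD from here on) is widely used in machine learning. Besides feature decomposition in dimensionality reduction, it is also used in recommender systems, natural language processing and other areas, and it is a cornerstone of many machine learning algorithms. Here we use sklearn's TruncatedSVD to decompose and reduce the originally huge user-item rating matrix, taking it from shape (3329, 27895) down to (3329, 10). We keep only the 10 most important features of the original matrix and build the recommendations on top of them. Enough talk, let's roll up our sleeves and get to work!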

First we build a user-by-movie rating matrix and then convert it to a sparse matrix (which saves a great deal of storage space):



In [19]:
ratings_pivot2 = rating_popular_movies.pivot(index='userId', columns='movieId',values='rating').fillna(0)
ratings_pivot2_sparse = csr_matrix(ratings_pivot2.values)
print(ratings_pivot2.shape)
ratings_pivot2.head()

(27895, 3329)


movieId,1,2,4,5,10,12,13,15,17,18,...,12612,12634,13346,14821,15721,15741,15826,16155,16323,16660
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,4.0,5.0,3.0,4.0,5.0,4.0,2.0,4.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,3.0,4.0,3.0,0.0,5.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# The matrix has shape (27895, 3329): rows are userIds, columns are movieIds. Now transpose it:
X = ratings_pivot2.values.T
X.shape

(3329, 27895)

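We transpose the matrix because we are doing item-based collaborative filtering: we have to keep every movieId and extract the main features from the userId dimension. Now we apply SVD to extract those features, tentatively 10 of them.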

In [21]:
from sklearn.decomposition import TruncatedSVD
svd=TruncatedSVD(n_components=10,random_state=17)
matrix=svd.fit_transform(X)
matrix.shape

(3329, 10)

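We can now see that the matrix has been reduced from (3329, 27895) to (3329, 10) while keeping every movieId. Next, we use the reduced matrix to compute a correlation coefficient matrix: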
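
To see what the reduction does on a scale that fits on screen, here is a minimal sketch on a hypothetical random 6-by-8 toy matrix (the data and the names `X_toy`, `svd_toy` and `reduced` are made up for illustration). It also prints `explained_variance_ratio_`, which can help when deciding how many components to keep.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy matrix: 6 items (rows) rated by 8 users (columns)
rng = np.random.default_rng(17)
X_toy = rng.integers(0, 6, size=(6, 8)).astype(float)

svd_toy = TruncatedSVD(n_components=2, random_state=17)
reduced = svd_toy.fit_transform(X_toy)

print(reduced.shape)  # one 2-dimensional vector per item
# Fraction of the variance captured by each component
print(svd_toy.explained_variance_ratio_)
```

On the real data the same attribute is available on `svd` after `fit_transform`, so the choice of 10 components can be checked rather than guessed.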

In [22]:
import warnings
warnings.filterwarnings("ignore",category=RuntimeWarning)
corr=np.corrcoef(matrix)
print(corr.shape)
corr

(3329, 3329)


array([[1.        , 0.88719504, 0.94722286, ..., 0.80781994, 0.7544022 ,
        0.5269623 ],
       [0.88719504, 1.        , 0.86930487, ..., 0.61777898, 0.5339515 ,
        0.67181041],
       [0.94722286, 0.86930487, 1.        , ..., 0.79992334, 0.5870368 ,
        0.45608533],
       ...,
       [0.80781994, 0.61777898, 0.79992334, ..., 1.        , 0.74405944,
        0.38598591],
       [0.7544022 , 0.5339515 , 0.5870368 , ..., 0.74405944, 1.        ,
        0.33090199],
       [0.5269623 , 0.67181041, 0.45608533, ..., 0.38598591, 0.33090199,
        1.        ]])

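This correlation coefficient matrix has shape (3329, 3329). Each element is the correlation coefficient between two movies, computed from their 10-dimensional feature vectors, and takes a value in [-1, 1]. The elements on the main diagonal are 1, meaning that every movie is perfectly correlated with itself.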
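
As a small illustration of how `np.corrcoef` behaves, the sketch below uses three hypothetical item vectors (made-up numbers, not taken from the real `matrix`). Each row is treated as one variable, so the result is an item-by-item matrix.

```python
import numpy as np

# Hypothetical reduced item vectors: 3 items, 4 latent features each
vecs = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
])

# np.corrcoef treats each row as one variable, so the result is item-by-item
c = np.corrcoef(vecs)
print(c.shape)
print(np.round(c, 3))
```

The first two items move almost together and get a coefficient close to 1, while the third runs exactly opposite to the first and gets -1.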

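Next, we again use movieId 2550, one of the Wong Fei-hung films starring Jet Li, to test the recommendations. First we find the row of the correlation matrix that belongs to movie 2550; this row vector holds the correlation coefficients between movie 2550 and every other movie. We then sort the row in descending order, take the 11 movies with the largest coefficients (the first one is the movie itself), and print their movieIds and coefficients.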

In [24]:
movieIds=ratings_pivot2.columns 
movieIds_list = list(movieIds)
movieId_index = movieIds_list.index(movieId)
 
movieId_vec=corr[movieId_index]
argsort_idx =np.argsort(-movieId_vec)[:11]
coff=movieId_vec[argsort_idx]
similar_movie_Ids=movieIds[argsort_idx]
print(similar_movie_Ids.values)
print('--------------------------------------------------------------')
print(coff)

[2550 3874 2552 3143 3732 2553  639 2547 2555 2956 2551]
--------------------------------------------------------------
[1.         0.99637533 0.99598866 0.99554685 0.99453251 0.99409538
 0.99386483 0.99091663 0.98941127 0.98634359 0.98620206]


In [25]:
# Finally, convert the movieIds to movie titles and see how the recommendations look
for idx,mId in enumerate(similar_movie_Ids):
    name = movies[movies.movieId==mId]['title'].values[0]
    if idx==0:
        print('Current movie:',name)
    else:
        print('Recommendation {0}: {1}, correlation: {2}'.format(idx,name, coff[idx]))

Current movie: 黃飛鴻之三獅王爭霸
Recommendation 1: 太极张三丰, correlation: 0.9963753317262952
Recommendation 2: 黃飛鴻之二男兒當自強, correlation: 0.9959886617593086
Recommendation 3: 黃飛鴻, correlation: 0.9955468545480031
Recommendation 4: 方世玉续集, correlation: 0.9945325050301705
Recommendation 5: 新少林五祖, correlation: 0.9940953835981862
Recommendation 6: 方世玉, correlation: 0.9938648308354906
Recommendation 7: 倚天屠龍記之魔教教主, correlation: 0.9909166322984473
Recommendation 8: 赌神, correlation: 0.9894112727822968
Recommendation 9: 红番区, correlation: 0.986343588182445
Recommendation 10: 冒險王, correlation: 0.9862020560988379
