# 2-Recommender Systems

## Mainstream Recommendation Algorithms

![](./主流推荐算法.png)

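### Knowledge-Based Recommendation

Knowledge-based recommendation can, to some extent, be viewed as an inference technique; it does not make recommendations on the basis of user needs and preferences. Knowledge-based methods differ markedly in the functional knowledge they use. Functional knowledge is knowledge about how an item satisfies a particular user, so it can explain the relationship between a need and a recommendation. The user profile can therefore be any knowledge structure that supports inference: it may be a query the user has already formalized, or a more detailed representation of the user's needs.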

<img src="./基于知识推荐.jpg">


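### Content-Based Recommendation

In a content-based recommendation algorithm, we assume that we have some data about the items we want to recommend, namely the features of those items.

In our example, we assume each movie has two features: $x_1$ measures how romantic the movie is, and $x_2$ measures how much action it contains.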

![](./基于内容推荐.png)

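Each movie then has a feature vector; for example, $x^{(1)} = [0.9,\ 0]$ is the feature vector of the first movie.

Next we build a recommendation algorithm on top of these features.
Suppose we use a linear regression model. We can train one linear regression model per user; for example, ${{\theta }^{(1)}}$ holds the parameters of the first user's model.
We therefore have:

$\theta^{(j)}$: the parameter vector of user $j$ ($\theta^{(j)} \in \mathbb R^{n+1}$)

$x^{(i)}$: the feature vector of movie $i$ ($x^{(i)}_0, x^{(i)}_1, x^{(i)}_2, ... x^{(i)}_n$), where $x^{(i)}_0 = 1$ is the intercept feature

For user $j$ and movie $i$, the predicted rating is $(\theta^{(j)})^T x^{(i)}$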

Cost function

For user $j$, the cost of the linear regression model is the sum of squared prediction errors plus a regularization term:
$$
\min_{\theta^{(j)}}\frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{k=1}^n\left(\theta_{k}^{(j)}\right)^2
$$


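Here $i:r(i,j)=1$ means that we sum only over the movies that user $j$ has rated. In ordinary linear regression both the error term and the regularization term are multiplied by $\frac{1}{2m}$; here we drop $m$, which is a constant factor and does not change the minimizer. We also do not regularize the bias term $\theta_0$.

The cost function above covers a single user. To learn the parameters of all users, we sum the cost functions over all users: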
$$
\min_{\theta^{(1)},...,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2
$$
If we solve for the optimum with gradient descent, taking partial derivatives of the cost function gives the following update rules:

$$
\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_{k}^{(i)} \quad (\text{for} \, k = 0)
$$

$$
\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_{k}^{(i)}+\lambda\theta_k^{(j)}\right) \quad (\text{for} \, k\neq 0)
$$
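The two update rules above can be sketched for a single user as follows. This is a minimal illustration rather than the course's reference implementation: the function name `fit_user`, the toy feature matrix, the ratings, the learning rate and the iteration count are all assumptions made up for the example.

```python
import numpy as np

def fit_user(X, y, rated, lam=0.0, alpha=0.1, n_iters=20000):
    """Gradient descent for one user's parameter vector theta^(j).

    X     : (n_movies, n + 1) features, column 0 is the intercept x_0 = 1
    y     : (n_movies,) ratings of this user (ignored where rated is False)
    rated : (n_movies,) boolean mask, True where r(i, j) = 1
    """
    theta = np.zeros(X.shape[1])
    Xr, yr = X[rated], y[rated]          # keep only the rated movies
    for _ in range(n_iters):
        err = Xr @ theta - yr            # prediction error per rated movie
        grad = Xr.T @ err                # sum_i err_i * x_k^(i) for every k
        grad[1:] += lam * theta[1:]      # regularize all theta_k except k = 0
        theta -= alpha * grad
    return theta

# Hypothetical toy data: x_1 = romance, x_2 = action
X = np.array([[1.0, 0.90, 0.00],
              [1.0, 1.00, 0.01],
              [1.0, 0.99, 0.00],
              [1.0, 0.10, 1.00],
              [1.0, 0.00, 0.90]])
y = np.array([5.0, 5.0, 0.0, 0.0, 0.0])   # y[2] is a placeholder, not a rating
rated = np.array([True, True, False, True, True])

theta = fit_user(X, y, rated)
pred = X[2] @ theta   # predicted rating for the unrated romance movie
```

With these made-up ratings the user likes romance movies, so the prediction for the unrated romance movie should land close to the top of the scale.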

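### Collaborative Filtering

In the content-based recommender above, we had features available for every movie and used them to train the parameters of each user. Conversely, if we have the users' parameters, we can learn the features of the movies.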

$$
\mathop{min}\limits_{x^{(1)},...,x^{(n_m)}}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2
$$
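But if we have neither the users' parameters nor the movies' features, neither of these two methods works. The collaborative filtering algorithm can learn both at the same time.

Our optimization objective now runs over $x$ and $\theta$ simultaneously.
$$
J(x^{(1)},...x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})=\frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2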
$$


Taking partial derivatives of the cost function gives the following gradient descent updates:

$$
x_k^{(i)}:=x_k^{(i)}-\alpha\left(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(i)}\right)
$$

$$
\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda \theta_k^{(j)}\right)
$$



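Note: in collaborative filtering we usually do not use a bias term; if one is needed, the algorithm learns it on its own.
The collaborative filtering algorithm proceeds as follows:

1. Initialize $x^{(1)},x^{(2)},...,x^{(n_m)},\ \theta^{(1)},\theta^{(2)},...,\theta^{(n_u)}$ to small random values (no $x_{0}$ or $\theta_{0}$ is needed)

2. Minimize the cost function with gradient descent

3. After training, predict $(\theta^{(j)})^Tx^{(i)}$ as the rating user $j$ gives to movie $i$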
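The three steps can be sketched on a tiny rating matrix. This is an illustrative sketch under assumptions: the matrices `Y` and `R` mirror the small five-movie, four-user example used in this section (unrated entries are stored as 0 in `Y` and flagged by 0 in `R`), while the number of features, `lam`, `alpha`, the iteration count and the random seed are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings, shape (movies, users); R[i, j] = 1 where user j rated movie i
Y = np.array([[5, 5, 0, 0],
              [5, 0, 0, 0],
              [0, 4, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 0]], dtype=float)

n_features, lam, alpha = 2, 0.1, 0.01   # assumed hyperparameters

def cost(X, Theta):
    """Regularized collaborative filtering cost J(x, theta)."""
    err = (X @ Theta.T - Y) * R          # errors on rated entries only
    reg = (X ** 2).sum() + (Theta ** 2).sum()
    return 0.5 * (err ** 2).sum() + 0.5 * lam * reg

# Step 1: small random initial values (no x_0 / theta_0 column)
X = 0.1 * rng.standard_normal((Y.shape[0], n_features))
Theta = 0.1 * rng.standard_normal((Y.shape[1], n_features))
initial_cost = cost(X, Theta)

# Step 2: gradient descent, updating X and Theta simultaneously
for _ in range(2000):
    err = (X @ Theta.T - Y) * R
    X_grad = err @ Theta + lam * X
    Theta_grad = err.T @ X + lam * Theta
    X, Theta = X - alpha * X_grad, Theta - alpha * Theta_grad

final_cost = cost(X, Theta)

# Step 3: predicted rating of user j for movie i is predictions[i, j]
predictions = X @ Theta.T
```

Both gradients are computed from the same `err` before either matrix is changed, which is what makes the update simultaneous.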

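The feature matrix obtained through this learning process captures important information about the movies. These features are not always human-readable, but we can still use them as the basis for recommending movies to users.

For example, if a user is watching movie $x^{(i)}$, we can look for another movie $x^{(j)}$ for which the distance between the two feature vectors, $\left\| {{x}^{(i)}}-{{x}^{(j)}} \right\|$, is small.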
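A minimal sketch of this nearest-neighbor lookup is shown below. The helper name `most_similar` and the small feature matrix are hypothetical and exist only for illustration; in practice `X` would be the learned feature matrix.

```python
import numpy as np

def most_similar(X, i, k=3):
    """Return the indices of the k movies whose features are closest to movie i."""
    dist = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to every movie
    dist[i] = np.inf                          # never recommend the movie itself
    return np.argsort(dist)[:k]

# Hypothetical features: column 0 = romance, column 1 = action
X = np.array([[0.90, 0.00],
              [1.00, 0.01],
              [0.99, 0.00],
              [0.10, 1.00],
              [0.00, 0.90]])

neighbors = most_similar(X, 0, k=2)   # the two movies closest to movie 0
```

Here the two nearest neighbors of movie 0 are the other two romance-heavy rows, as expected.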

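#### Low Rank Matrix Factorization

Examples:

1. Given a product, can you find other products related to it?

2. A user has recently taken an interest in a product; are there other related products you can recommend to them?

What I am going to do is work out an alternative way of writing down the predictions of the collaborative filtering algorithm.

We have a data set of five movies. I will take all the users' ratings of these movies, group them, and store them in a matrix.

With five movies and four users, the matrix $Y$ is a 5 x 4 matrix that holds all the users' ratings of the movies: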

| **Movie**            | **Alice (1)** | **Bob (2)** | **Carol (3)** | **Dave (4)** |
| -------------------- | ------------- | ----------- | ------------- | ------------ |
| Love at last         | 5             | 5           | 0             | 0            |
| Romance forever      | 5             | ?           | ?             | 0            |
| Cute puppies of love | ?             | 4           | 0             | ?            |
| Nonstop car chases   | 0             | 0           | 5             | 4            |
| Swords vs. karate    | 0             | 0           | 5             | ?            |

![](./Y.png)

Predicted ratings:

![](./低秩矩阵.png)

Define the matrices
$$
X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\  (x^{(n_m)})^T  \end{bmatrix} = 
\begin{bmatrix} x^{(1)}_1  &  x^{(1)}_2   & x^{(1)}_3  & \cdots & x^{(1)}_n    \\ 
x^{(2)}_1  &  x^{(2)}_2   & x^{(2)}_3  & \cdots  & x^{(2)}_n \\  
\vdots  & \vdots & \vdots  & & \vdots\\
x^{(n_m)}_1  &  x^{(n_m)}_2   & x^{(n_m)}_3  & \cdots  & x^{(n_m)}_n   \end{bmatrix}
$$

$$
\Theta = \begin{bmatrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \vdots \\ (\theta^{(n_u)})^T  \end{bmatrix} = 
\begin{bmatrix} \theta^{(1)}_1 & \theta^{(1)}_2 & \theta^{(1)}_3 & \cdots & \theta^{(1)}_n \\
\theta^{(2)}_1 & \theta^{(2)}_2 & \theta^{(2)}_3 & \cdots & \theta^{(2)}_n \\
\vdots  & \vdots & \vdots  & & \vdots\\
\theta^{(n_u)}_1 & \theta^{(n_u)}_2 & \theta^{(n_u)}_3 & \cdots & \theta^{(n_u)}_n
\end{bmatrix}
$$
To obtain all the predicted ratings above, we only need to compute $X\Theta^T$. This algorithm is also called **low rank matrix factorization**.
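As a small illustration of why the product is low rank, the sketch below builds a prediction matrix from hand-picked, purely hypothetical $X$ and $\Theta$ with $n = 2$ features. The rank of $X\Theta^T$ can never exceed the number of features.

```python
import numpy as np

# Hypothetical factors: 3 movies and 2 users, n = 2 features each
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])       # one row per movie: (x^(i))^T
Theta = np.array([[5.0, 0.0],
                  [0.0, 4.0]])   # one row per user: (theta^(j))^T

predicted = X @ Theta.T          # entry (i, j) equals (theta^(j))^T x^(i)
rank = np.linalg.matrix_rank(predicted)
```

Entry `predicted[i, j]` is exactly the per-pair prediction $(\theta^{(j)})^T x^{(i)}$ used earlier, computed for every movie and user in one matrix product.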

Finding related movies:

![](./similar_movies.png)

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context="notebook", style="white", palette=sns.color_palette("RdBu"))
import numpy as np
import pandas as pd
import scipy.io as sio

# load data and setting up

Notes:
 
X - num_movies (1682) x num_features (10) matrix of movie features 

Theta - num_users (943) x num_features (10) matrix of user features 

Y - num_movies x num_users matrix of user ratings of movies 

R - num_movies x num_users matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user

In [2]:
movies_mat = sio.loadmat('./data/ex8_movies.mat')
Y, R = movies_mat.get('Y'), movies_mat.get('R')

Y.shape, R.shape   # (n_movies, n_users)

((1682, 943), (1682, 943))

In [3]:
m, u = Y.shape
# m: how many movies
# u: how many users

n = 10  # how many features for a movie

In [4]:
# load the pre-trained parameters: user preferences (Theta) and movie features (X)
param_mat = sio.loadmat('./data/ex8_movieParams.mat')
theta, X = param_mat.get('Theta'), param_mat.get('X')

theta.shape, X.shape

((943, 10), (1682, 10))

In [14]:
theta[:2]

array([[ 0.28544362, -1.68426509,  0.26293877, -0.28731731,  0.58572506,
         0.98018795, -0.06337453,  0.76723235, -1.10460164, -0.25186708],
       [ 0.50501321, -0.45464846,  0.31746244, -0.11508694,  0.56770367,
         0.81890506,  0.46164876,  1.09306336, -1.20029436, -0.39161676]])

In [20]:
X[:2]

array([[ 1.0486855 , -0.40023196,  1.19411945,  0.37112768,  0.40760718,
         0.97440691, -0.05841025,  0.861721  , -0.69728994,  0.28874563],
       [ 0.78085123, -0.38562591,  0.52119779,  0.22735522,  0.57010888,
         0.64126447, -0.55000555,  0.70402073, -0.48583521, -0.56462407]])

# cost
$$
J(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}) = 
\frac 1 2 \sum_{(i, j):r(i, j)=1}((\theta^{(j)})^T x^{(i)} - y^{(i, j)})^2
$$

In [6]:
def serialize(X, theta):
    """serialize 2 matrix
    """
    # X (movie, feature), (1682, 10): movie features
    # theta (user, feature), (943, 10): user preference
    return np.concatenate((X.ravel(), theta.ravel()))


def deserialize(param, n_movie, n_user, n_features):
    """into ndarray of X(1682, 10), theta(943, 10)"""
    return param[:n_movie * n_features].reshape(n_movie, n_features), \
           param[n_movie * n_features:].reshape(n_user, n_features)


# recommendation fn
def cost(param, Y, R, n_features):
    """
    Compute the unregularized cost over every (i, j) with r(i, j) = 1.
    Args:
        param: serialized X, theta
        Y (movie, user), (1682, 943): (movie, user) rating
        R (movie, user), (1682, 943): (movie, user) has rating
    """
    # theta (user, feature), (943, 10): user preference
    # X (movie, feature), (1682, 10): movie features
    n_movie, n_user = Y.shape
    X, theta = deserialize(param, n_movie, n_user, n_features)
    
    # (predicted rating - actual rating), kept only where a rating exists
    inner = np.multiply(X @ theta.T - Y, R)

    return np.power(inner, 2).sum() / 2


def gradient(param, Y, R, n_features):
    # gradient of the unregularized cost function
    # theta (user, feature), (943, 10): user preference
    # X (movie, feature), (1682, 10): movie features
    n_movies, n_user = Y.shape
    X, theta = deserialize(param, n_movies, n_user, n_features)

    # (predicted rating - actual rating), kept only where a rating exists
    inner = np.multiply(X @ theta.T - Y, R)  # (1682, 943)

    # X_grad (1682, 10)
    X_grad = inner @ theta

    # theta_grad (943, 10)
    theta_grad = inner.T @ X

    # roll them together and return
    return serialize(X_grad, theta_grad)


def regularized_cost(param, Y, R, n_features, l=1):
    # l: regularization coefficient lambda
    reg_term = np.power(param, 2).sum() * (l / 2)

    return cost(param, Y, R, n_features) + reg_term


def regularized_gradient(param, Y, R, n_features, l=1):
    grad = gradient(param, Y, R, n_features)
    reg_term = l * param

    return grad + reg_term


In [7]:
# use a small subset of the data to compute the cost, as in the exercise pdf
users = 4
movies = 5
features = 3

X_sub = X[:movies, :features]
theta_sub = theta[:users, :features]
Y_sub = Y[:movies, :users]
R_sub = R[:movies, :users]

param_sub = serialize(X_sub, theta_sub)
cost(param_sub, Y_sub, R_sub, features)

22.224603725685675

In [8]:
param = serialize(X, theta)  # total real params

cost(serialize(X, theta), Y, R, 10)  # this is real total cost

27918.64012454421

# gradient
$$
\frac {\partial J}{\partial {x_k^{(i)}}} = \sum_{j:r(i, j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})\theta_k^{(j)} \\
\frac {\partial J}{\partial {\theta_k^{(j)}}} = \sum_{i:r(i, j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})x_k^{(i)}
$$

In [9]:
n_movie, n_user = Y.shape

X_grad, theta_grad = deserialize(gradient(param, Y, R, 10),
                                      n_movie, n_user, 10)


$$X_{grad}(i, :) = (X(i, :) * Theta^T_{temp} - Y_{temp}) * Theta_{temp}$$

In [10]:
assert X_grad.shape == X.shape
assert theta_grad.shape == theta.shape

# regularized cost

In [11]:
# in ex8_cofi.m, lambda = 1.5, and it uses the sub data set
regularized_cost(param_sub, Y_sub, R_sub, features, l=1.5)

31.34405624427422

In [12]:
regularized_cost(param, Y, R, 10, l=1)  # total regularized cost

32520.682450229557

# regularized gradient

$$
\frac {\partial J}{\partial {x_k^{(i)}}} = \sum_{j:r(i, j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})\theta_k^{(j)} + \lambda x_k^{(i)} \\
\frac {\partial J}{\partial {\theta_k^{(j)}}} = \sum_{i:r(i, j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i, j)})x_k^{(i)} + \lambda \theta_k^{(j)}
$$

In [13]:
n_movie, n_user = Y.shape

X_grad, theta_grad = deserialize(regularized_gradient(param, Y, R, 10),
                                                                n_movie, n_user, 10)

assert X_grad.shape == X.shape
assert theta_grad.shape == theta.shape
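The shape assertions only confirm that the gradient has the right dimensions. A stricter sanity check, sketched below under assumptions, compares the analytic gradient with a central finite-difference estimate on small random data. The helper names `cofi_cost`, `cofi_grad` and `numerical_grad_X` are hypothetical stand-ins that restate the regularized formulas above so that the snippet runs on its own; they are not the notebook's functions.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam=0.0):
    """Regularized collaborative filtering cost."""
    err = (X @ Theta.T - Y) * R
    reg = (X ** 2).sum() + (Theta ** 2).sum()
    return 0.5 * (err ** 2).sum() + 0.5 * lam * reg

def cofi_grad(X, Theta, Y, R, lam=0.0):
    """Analytic gradients with respect to X and Theta."""
    err = (X @ Theta.T - Y) * R
    return err @ Theta + lam * X, err.T @ X + lam * Theta

def numerical_grad_X(X, Theta, Y, R, lam=0.0, eps=1e-5):
    """Central finite-difference estimate of dJ/dX, one entry at a time."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (cofi_cost(X + step, Theta, Y, R, lam)
                     - cofi_cost(X - step, Theta, Y, R, lam)) / (2 * eps)
    return grad

# Small random problem: 5 movies, 4 users, 3 features (all values made up)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Theta = rng.standard_normal((4, 3))
Y = rng.integers(0, 6, size=(5, 4)).astype(float)
R = (rng.random((5, 4)) > 0.3).astype(float)

analytic_X, _ = cofi_grad(X, Theta, Y, R, lam=1.5)
numeric_X = numerical_grad_X(X, Theta, Y, R, lam=1.5)
max_diff = np.abs(analytic_X - numeric_X).max()
```

If the analytic gradient is correct, `max_diff` should be tiny. The same check can be repeated for `Theta` by perturbing its entries instead.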

# parse `movie_ids.txt`

In [22]:
movie_list = []

with open('./data/movie_ids.txt', encoding='latin-1') as f:
    for line in f:
        tokens = line.strip().split(' ')
        movie_list.append(' '.join(tokens[1:]))

movie_list = np.array(movie_list)
movie_list

array(['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)', ...,
       'Sliding Doors (1998)', 'You So Crazy (1994)',
       'Scream of Stone (Schrei aus Stein) (1991)'], dtype='<U81')

# reproduce my ratings

Create my own rating vector

In [23]:
ratings = np.zeros(1682)

ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5

# prepare data

In [24]:
Y, R = movies_mat.get('Y'), movies_mat.get('R')

# new rating matrix, shape (n_movies, n_users + 1)
Y = np.insert(Y, 0, ratings, axis=1)  # now I become user 0
Y.shape

(1682, 944)

In [31]:
# update R(i, j) with the new user's column
R = np.insert(R, 0, ratings != 0, axis=1)
R.shape

(1682, 944)

In [32]:
# assume each movie has 50 features
n_features = 50
n_movie, n_user = Y.shape
l = 10

In [33]:
X = np.random.standard_normal((n_movie, n_features))
theta = np.random.standard_normal((n_user, n_features))

X.shape, theta.shape

((1682, 50), (944, 50))

In [34]:
param = serialize(X, theta)

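# normalized ratings

Apply mean normalization to the rating matrix $Y$: from every rating of a movie, subtract that movie's average rating across all users, so that each row of the new matrix has mean 0. In the cell below the average is taken over every entry of the row, including the zeros that stand for unrated movies.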

In [82]:
# mean-normalized Y
Y_norm = Y - Y.mean(axis=1)[:, np.newaxis]
Y_norm.mean(axis=1)

array([ 2.00875098e-16,  1.37366578e-16, -1.55125442e-16, ...,
       -4.08992299e-17, -3.02639417e-17, -3.36010792e-17])
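A common variant, sketched here as an assumption rather than as what the cell above does, computes each movie's mean over its rated entries only, using the indicator matrix `R`. The function name `normalize_ratings` and the tiny test matrices are hypothetical.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only.

    Y : (n_movies, n_users) ratings, arbitrary values where R == 0
    R : (n_movies, n_users) indicator, nonzero where a rating exists
    Returns (Y_norm, Y_mean); unrated entries of Y_norm are set to 0.
    """
    rated = R.astype(bool)
    counts = rated.sum(axis=1)
    sums = np.where(rated, Y, 0).sum(axis=1).astype(float)
    # Movies without any rating keep a mean of 0 instead of dividing by zero
    Y_mean = np.divide(sums, counts, out=np.zeros(Y.shape[0]), where=counts > 0)
    Y_norm = np.where(rated, Y - Y_mean[:, np.newaxis], 0.0)
    return Y_norm, Y_mean

# Tiny made-up example: movie 0 has two ratings, movie 1 has none
Y_demo = np.array([[5.0, 0.0, 3.0],
                   [0.0, 0.0, 0.0]])
R_demo = np.array([[1, 0, 1],
                   [0, 0, 0]])
Y_norm_demo, Y_mean_demo = normalize_ratings(Y_demo, R_demo)
```

With this variant, the per-movie mean that is added back to the predictions is the average of actual ratings, so a movie with few ratings is not pulled toward 0.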

# training

In [79]:
import scipy.optimize as opt

In [83]:
res = opt.minimize(fun=regularized_cost,
                   x0=param,
                   args=(Y_norm, R, n_features, l),
                   method='TNC',
                   jac=regularized_gradient)
# this step is slow

In [84]:
res

     fun: 63571.23700679088
     jac: array([6.17977962e-06, 5.92303659e-06, 3.23239212e-06, ...,
       4.27764606e-08, 1.62711187e-06, 1.23681476e-06])
 message: 'Converged (|f_n-f_(n-1)| ~= 0)'
    nfev: 2515
     nit: 83
  status: 1
 success: True
       x: array([ 0.16399667, -0.38208505,  0.04902922, ..., -0.45053505,
        0.46402138,  0.46301432])

In [85]:
X_trained, theta_trained = deserialize(res.x, n_movie, n_user, n_features)
X_trained.shape, theta_trained.shape

((1682, 50), (944, 50))

In [86]:
prediction = X_trained @ theta_trained.T

In [89]:
my_preds = prediction[:, 0] + Y.mean(axis=1)

In [90]:
idx = np.argsort(my_preds)[::-1]  # Descending order
idx.shape

(1682,)

In [91]:
# top ten idx
my_preds[idx][:10]

array([4.44234434, 4.17834012, 4.04470411, 4.04314502, 4.01796448,
       4.00462143, 3.92903682, 3.89953287, 3.87701144, 3.8342994 ])

In [92]:
for m in movie_list[idx][:10]:
    print(m)

Star Wars (1977)
Titanic (1997)
Shawshank Redemption, The (1994)
Raiders of the Lost Ark (1981)
Return of the Jedi (1983)
Forrest Gump (1994)
Godfather, The (1972)
Empire Strikes Back, The (1980)
Braveheart (1995)
Schindler's List (1993)
