## RNN-based ranking module

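- Main idea: convert each rating into like or dislike according to its value, then predict from a user's rating history whether they will like a given movie. The predicted class probability is used for point-wise ranking.
- Assumption: after watching a number of movies over a period of time, a user becomes particularly interested in certain specific movies
- Sample: (historical movie ratings, id of the movie to predict), rating
- Processing: the historical ratings are fed into an RNN, and the RNN's representation is combined with the embedding of the movie id to produce the prediction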
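
A minimal sketch of the point-wise ranking step, assuming the model has already produced a like-probability for every candidate movie. The helper `rank_pointwise` and the scores are made up for illustration; they are not part of the notebook.

```python
import numpy as np

def rank_pointwise(movie_ids, like_probs, top_k=None):
    """Sort candidates by predicted like-probability, highest first."""
    movie_ids = np.asarray(movie_ids)
    like_probs = np.asarray(like_probs, dtype=float)
    # A stable sort on the negated scores keeps the input order for ties.
    order = np.argsort(-like_probs, kind="stable")
    ranked = movie_ids[order]
    return ranked if top_k is None else ranked[:top_k]

candidates = [10, 20, 30, 40]
probs = [0.15, 0.80, 0.05, 0.60]   # hypothetical model outputs
top2 = rank_pointwise(candidates, probs, top_k=2)
print(top2)
```

Each candidate is scored independently of the others, which is what makes the ranking point-wise.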

### Constructing samples

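- Convert ratings into classes (like or dislike): choose a threshold for each user; ratings above the threshold count as like, ratings at or below it count as dislike. (Candidate thresholds include the midpoint between the maximum and minimum, the median, and the mean.)
- Sort each user's ratings by time, set a time window, and slide it to generate samples.
    - Sample features are split into U-side and I-side features: the U-side features are the historical ratings, the I-side feature is the movie ID, and the label is whether the user likes the movie
    - For example, with a window of size $n$, the $i$-th sample of each user has features $\langle r_i, r_{i+1}, \ldots, r_{i+n-1} \rangle$ and target $r_{i+n}$, with label $r_{i+n} > r_{mean}$ when the user's mean rating is used as the threshold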
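
The windowing scheme can be sketched for a single user in plain Python. This is an illustration only: `sliding_samples` is a hypothetical helper, the ratings are made up, and the mean rating is assumed as the threshold.

```python
def sliding_samples(ratings, window, threshold):
    """Build (history, target_rating, label) triples from time-ordered ratings."""
    samples = []
    for i in range(len(ratings) - window):
        history = ratings[i:i + window]      # r_i ... r_{i+window-1}
        target = ratings[i + window]         # the rating right after the window
        samples.append((history, target, int(target > threshold)))
    return samples

ratings = [4.0, 2.0, 5.0, 3.0, 4.5]          # made-up, already sorted by time
threshold = sum(ratings) / len(ratings)      # mean rating of this user
samples = sliding_samples(ratings, window=3, threshold=threshold)
for sample in samples:
    print(sample)
```

A user with `len(ratings)` ratings yields `len(ratings) - window` samples, so users with no more ratings than the window size contribute nothing.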

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Loading the data

In [3]:
movies_file_path = '../../data/ml-clean/movies_v2.csv'
ratings_file_path = '../../data/ml-clean/ratings_v2.csv'

In [4]:
df_movies = pd.read_csv(movies_file_path, index_col=0)

In [5]:
df_ratings = pd.read_csv(ratings_file_path, index_col=0)

In [6]:
df_movies.shape, df_ratings.shape

((15171, 14), (15654592, 5))

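## Constructing samples

#### Determining the threshold

- In real-world settings, the movies a user likes usually make up only a small share of everything recommended
- The like/dislike threshold is therefore set to each user's maximum rating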

In [7]:
df_threshold_score = df_ratings[['userId', 'rating']].groupby('userId').max()

In [8]:
df_threshold_score = df_threshold_score.reset_index()

In [9]:
df_threshold_score.columns = ['userId', 'threshold_rating']

- Label each rating (below the threshold: dislike (0); greater than or equal to the threshold: like (1))

In [10]:
df_ratings = pd.merge(df_ratings, df_threshold_score, on='userId')

In [11]:
df_ratings['is_like'] = df_ratings['rating'] >= df_ratings['threshold_rating']
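
As a hedged aside, the same labels can be derived without the intermediate threshold DataFrame by broadcasting the per-user maximum with `groupby(...).transform`. The toy data below is made up; only the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "rating": [3.0, 5.0, 5.0, 2.0, 4.0],
})
# Broadcast each user's maximum rating back onto that user's rows.
toy["threshold_rating"] = toy.groupby("userId")["rating"].transform("max")
# A rating counts as "like" only when it reaches the user's own maximum.
toy["is_like"] = toy["rating"] >= toy["threshold_rating"]
print(toy)
```

Because the threshold is the maximum, every user has at least one rating labeled like, and ties at the maximum are all labeled like.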

- The numbers of ratings labeled like and dislike are:

In [12]:
df_ratings['is_like'].value_counts()

False    13515943
True      2138649
Name: is_like, dtype: int64

- The fraction of ratings labeled like is:

In [13]:
(df_ratings['is_like'].sum() / df_ratings['is_like'].count())

0.13661480286423305

#### Number of ratings per user

In [14]:
df_rating_count = df_ratings[['userId', 'rating']].groupby('userId').count()

In [15]:
df_rating_count = df_rating_count.reset_index()

In [16]:
df_rating_count.columns = ['userId', 'rating_count']

In [17]:
df_rating_count['rating_count'].describe()

count    189911.000000
mean         82.431202
std         144.897259
min           4.000000
25%          17.000000
50%          35.000000
75%          84.000000
max        8573.000000
Name: rating_count, dtype: float64

- Every user has at least 4 ratings
- 75% of users have at least 17 ratings (the 25th percentile)
- The most active user has 8573 ratings

#### Splitting into training and test sets by user

In [18]:
df_userid = df_ratings[['userId']].drop_duplicates()

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
df_userid_train, df_userid_test = train_test_split(df_userid, test_size=0.3)

- Number of users in the training and test sets:

In [21]:
df_userid_train.shape, df_userid_test.shape

((132937, 1), (56974, 1))

In [22]:
df_ratings_train = pd.merge(df_ratings, df_userid_train, on='userId')

In [23]:
df_ratings_test = pd.merge(df_ratings, df_userid_test, on='userId')

- Number of ratings in the training and test sets:

In [24]:
df_ratings_train.shape, df_ratings_test.shape

((10944668, 7), (4709924, 7))
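
The notebook calls `train_test_split` without a fixed `random_state`, so the split changes between runs. The sketch below is an alternative under stated assumptions: `split_by_user` is a hypothetical helper that uses a seeded NumPy permutation, and the toy data is made up. It also checks the property that matters here, namely that no user appears on both sides.

```python
import numpy as np
import pandas as pd

def split_by_user(df, test_size=0.3, seed=0):
    """Split rows so that all ratings of a user land on the same side."""
    users = df["userId"].unique()
    shuffled = np.random.default_rng(seed).permutation(users)
    n_test = int(round(len(users) * test_size))
    is_test = df["userId"].isin(shuffled[:n_test])
    return df[~is_test], df[is_test]

# Toy data: 10 users with 3 ratings each.
toy = pd.DataFrame({
    "userId": np.repeat(np.arange(10), 3),
    "rating": np.arange(30) % 5,
})
train_df, test_df = split_by_user(toy)
print(train_df["userId"].nunique(), test_df["userId"].nunique())
```

Splitting by user rather than by row keeps a user's sliding windows from straddling the training and test sets.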

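#### Generating the sample data

- U side: the sliding window over the rating history has size 10, so there are 10 feature dimensions; the input value of each dimension is 2 * movieId + is_like
- I side: the feature is the movieId of the next rating after the history window
- Label: like (1) or dislike (0)
- Algorithm for generating the samples
    - Sort the ratings DataFrame by (userId, timestamp)
    - Shift the userId, movieId and is_like columns n times each, producing n * 3 new columns, where n is the window size.
    - For each row, check whether the original userId and the n newly generated userId values are all the same; drop the row if they are not
    - The first n movieId and is_like values form the U-side features, the (n+1)-th movieId is the I-side feature, and the (n+1)-th is_like is the label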
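
The U-side encoding 2 * movieId + is_like packs the movie and the like flag into one integer, so a single embedding table can hold a separate vector for "liked movie m" and "disliked movie m". A small sketch of the encoding and its inverse follows; the helper names are hypothetical.

```python
def encode_token(movie_id, is_like):
    """Even tokens mean 'disliked movie', odd tokens mean 'liked movie'."""
    return 2 * int(movie_id) + int(is_like)

def decode_token(token):
    """Recover (movie_id, is_like) from a token."""
    return token // 2, bool(token % 2)

def vocab_size(max_movie_id):
    """Number of distinct tokens, i.e. the embedding input_dim needed on the U side."""
    return (max_movie_id + 1) * 2

liked = encode_token(7, True)
disliked = encode_token(7, False)
print(liked, disliked, decode_token(liked), vocab_size(7))
```

The largest token is `2 * max_movie_id + 1`, which is why the U-side embedding needs `(max_movie_id + 1) * 2` rows.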

In [26]:
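def should_drop(row):
    # A window is valid only if every userId column in the row holds the same user.
    # Shifted values past the end of the table are NaN, so those rows are dropped too.
    return len(set(row.filter(regex='userId'))) > 1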

In [27]:
def make_samples(df_ratings, window_size):
    # Sort so that each user's ratings are contiguous and in time order.
    df_ratings = df_ratings.sort_values(by=['userId', 'timestamp'])
    # Offset 0 is the current row itself.
    df_ratings['userId_0'] = df_ratings['userId']
    df_ratings['movieId_0'] = df_ratings['movieId']
    df_ratings['is_like_0'] = df_ratings['is_like']
    # Offsets 1..window_size hold the values of the following rows (NaN past the end).
    for i in range(1, window_size + 1):
        df_shift = df_ratings[['userId', 'movieId', 'is_like']].shift(-i)
        df_shift.columns = ['userId_{}'.format(i), 'movieId_{}'.format(i), 'is_like_{}'.format(i)]
        df_ratings = pd.concat([df_ratings, df_shift], axis=1)
    # Drop windows that run into another user's ratings or past the last row.
    df_ratings['should_drop'] = df_ratings.apply(should_drop, axis=1)
    df_ratings = df_ratings[~df_ratings['should_drop']].copy()
    # U side: one token (2 * movieId + is_like) per position in the window.
    for i in range(0, window_size):
        df_ratings['x_user_{}'.format(i)] = df_ratings['movieId_{}'.format(i)] * 2 + df_ratings['is_like_{}'.format(i)]
    # I side and label: the rating that follows the window.
    df_ratings['x_item'] = df_ratings['movieId_{}'.format(window_size)]
    df_ratings['y'] = df_ratings['is_like_{}'.format(window_size)]
    X_user = df_ratings.filter(regex='^x_user_').values
    # Select by exact name: a regex such as 'y' would also match any other column containing "y".
    X_item = df_ratings[['x_item']].values
    y = df_ratings[['y']].values
    return X_user, X_item, y
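
The row-wise `apply` in `make_samples` is slow on millions of rows. A possible vectorized variant is sketched below, assuming the same column names as the notebook: shifting within each user via `groupby(...).shift` leaves NaN wherever a window would cross a user boundary, so no separate same-user check is needed. `make_samples_grouped` is a hypothetical name and the toy data is made up.

```python
import numpy as np
import pandas as pd

def make_samples_grouped(df, window_size):
    df = df.sort_values(["userId", "timestamp"]).reset_index(drop=True)
    token = 2 * df["movieId"] + df["is_like"].astype(int)
    by_user = df["userId"]
    # Column i holds the token i steps ahead within the same user (NaN past the user's end).
    hist = pd.concat(
        [token.groupby(by_user).shift(-i) for i in range(window_size)], axis=1
    )
    x_item = df["movieId"].groupby(by_user).shift(-window_size)
    y = df["is_like"].astype(int).groupby(by_user).shift(-window_size)
    # If the target exists, every earlier position in the window exists as well.
    keep = x_item.notna()
    X_user = hist[keep].to_numpy().astype(np.int64)
    X_item = x_item[keep].to_numpy().astype(np.int64).reshape(-1, 1)
    y = y[keep].to_numpy().astype(np.int64).reshape(-1, 1)
    return X_user, X_item, y

toy = pd.DataFrame({
    "userId":    [1, 1, 1, 1, 2, 2],
    "movieId":   [10, 11, 12, 13, 20, 21],
    "is_like":   [True, False, True, False, True, True],
    "timestamp": [1, 2, 3, 4, 1, 2],
})
X_user, X_item, y = make_samples_grouped(toy, window_size=2)
print(X_user, X_item, y, sep="\n")
```

User 2 has only two ratings, fewer than `window_size + 1`, so it produces no samples.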

In [1]:
X_user_train, X_item_train, y_train = make_samples(df_ratings_train, 10)

In [None]:
X_user_train.shape, X_item_train.shape, y_train.shape

In [None]:
X_user_test, X_item_test, y_test = make_samples(df_ratings_test, 10)

In [None]:
X_user_test.shape, X_item_test.shape, y_test.shape

In [None]:
!free

#### Saving as pickle files

In [35]:
joblib.dump((X_user_test, X_item_test, y_test), 'data_test.m')

['data_test.m']

In [45]:
!ls -all -h data_test.m

-rw-r--r-- 1 root root 370M Mar 14 14:16 data_test.m


In [42]:
joblib.dump((X_user_train, X_item_train, y_train), 'data_train.m')

['data_train.m']

In [44]:
!ls -all -h data_train.m

-rw-r--r-- 1 root root 855M Mar 14 15:19 data_train.m
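
If file size becomes a concern, one option is to store the arrays with compact integer dtypes in a compressed `.npz` archive instead of a pickle. This is only a sketch: `save_samples` and `load_samples` are hypothetical helpers, and it assumes every token (at most 2 * max movieId + 1) fits in `int32`.

```python
import os
import tempfile
import numpy as np

def save_samples(path, X_user, X_item, y):
    # Assumption: ids and tokens fit in int32, labels are 0/1.
    np.savez_compressed(
        path,
        X_user=np.asarray(X_user, dtype=np.int32),
        X_item=np.asarray(X_item, dtype=np.int32),
        y=np.asarray(y, dtype=np.int8),
    )

def load_samples(path):
    with np.load(path) as data:
        return data["X_user"], data["X_item"], data["y"]

X_user = np.array([[21, 22], [22, 25]])
X_item = np.array([[12], [13]])
y = np.array([[1], [0]])
path = os.path.join(tempfile.mkdtemp(), "data_toy.npz")
save_samples(path, X_user, X_item, y)
X_user2, X_item2, y2 = load_samples(path)
print(X_user2.dtype, X_item2.dtype, y2.dtype)
```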


## Loading the sample data

In [4]:
X_user_train, X_item_train, y_train = joblib.load('data_train.m')

In [5]:
X_user_train.shape, X_item_train.shape, y_train.shape

((9601120, 10), (9601120, 1), (9601120, 1))

In [6]:
X_user_test, X_item_test, y_test = joblib.load('data_test.m')

In [7]:
X_user_test.shape, X_item_test.shape, y_test.shape

((4154428, 10), (4154428, 1), (4154428, 1))

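## Building the model

- The U-side features go through an Embedding and then an RNN; the I-side feature goes through its own Embedding and is then concatenated with the RNN output, followed by an output layer
- The output layer uses a sigmoid to predict like or dislike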
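
To make the tensor shapes concrete, here is a NumPy walk-through of the two-sided architecture. It is not the Keras model: mean pooling stands in for the RNN, the weights are random, and all sizes are made-up assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
max_movie_id, window_size, emb_dim, batch = 50, 10, 16, 4

# Embedding tables: the U side has two tokens per movie, the I side one id per movie.
user_table = rng.normal(size=((max_movie_id + 1) * 2, emb_dim))
item_table = rng.normal(size=(max_movie_id + 1, emb_dim))

x_user = rng.integers(0, (max_movie_id + 1) * 2, size=(batch, window_size))
x_item = rng.integers(0, max_movie_id + 1, size=(batch, 1))

user_seq = user_table[x_user]          # (batch, window_size, emb_dim)
user_vec = user_seq.mean(axis=1)       # stand-in for the RNN: (batch, emb_dim)
item_vec = item_table[x_item[:, 0]]    # (batch, emb_dim)

joined = np.concatenate([user_vec, item_vec], axis=1)    # (batch, 2 * emb_dim)
w = rng.normal(size=(joined.shape[1], 1))
prob = 1.0 / (1.0 + np.exp(-(joined @ w)))                # (batch, 1), like-probability
print(user_seq.shape, joined.shape, prob.shape)
```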

In [8]:
import keras
from keras.layers import Input, Embedding, LSTM, Dense, Flatten
from keras.optimizers import Adam
from keras.models import Model

Using TensorFlow backend.


In [9]:
from keras_ex.metrics import precision_score, recall_score

In [13]:
def build_model(max_movie_id, window_size):
    # U side: sequence of tokens (2 * movieId + is_like) -> Embedding -> LSTM -> Dense
    input_user = Input(shape=(window_size,))
    embedding_user = Embedding(input_dim=(max_movie_id + 1) * 2, output_dim=16,
                               embeddings_initializer='he_normal', input_length=window_size)(input_user)
    lstm_user = LSTM(64)(embedding_user)
    output_user = Dense(128, activation='relu')(lstm_user)
    # I side: id of the candidate movie -> Embedding -> Dense
    input_item = Input(shape=(1,))
    embedding_item = Embedding(input_dim=max_movie_id + 1,
                               embeddings_initializer='he_normal', output_dim=16)(input_item)
    embedding_item = Flatten()(embedding_item)
    output_item = Dense(128, activation='relu')(embedding_item)
    # Concatenate both sides and predict the probability of "like"
    layer_user_item = keras.layers.concatenate([output_user, output_item])
    layer_output = Dense(1, activation='sigmoid')(layer_user_item)
    model = Model(inputs=[input_user, input_item], outputs=layer_output)
    # lr=0.1 is far above Adam's default of 0.001; ReduceLROnPlateau in train() lowers it when val_loss stalls
    adam = Adam(lr=0.1)
    model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['acc', precision_score, recall_score])
    return model

In [11]:
from keras.callbacks import TensorBoard, ReduceLROnPlateau, ModelCheckpoint

In [14]:
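def train(X_item_train, X_item_test, X_user_train, X_user_test, y_train, y_test):
    # The embeddings must cover every movieId, including ids that occur only in the
    # U-side history (tokens are 2 * movieId + is_like, hence the // 2).
    max_movie_id = max(X_item_train.max(), X_item_test.max(),
                       X_user_train.max() // 2, X_user_test.max() // 2)
    model = build_model(int(max_movie_id), X_user_train.shape[1])
    # Multiply the learning rate by 0.1 after 3 epochs without improvement in val_loss.
    reduce_lr = ReduceLROnPlateau(factor=0.1, patience=3, min_lr=0.0001)
    hist = model.fit([X_user_train, X_item_train], y_train, epochs=10, batch_size=256,
                     validation_data=([X_user_test, X_item_test], y_test), callbacks=[reduce_lr])
    return hist, model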
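
To see how the learning rate evolves with `factor=0.1`, `patience=3` and `min_lr=0.0001`, the plateau rule can be simulated in plain Python. This is a simplified sketch, not Keras's implementation: it assumes the callback monitors `val_loss` in "min" mode with `min_delta=1e-4` and no cooldown, and the loss values are made up.

```python
def simulate_reduce_lr(val_losses, lr, factor=0.1, patience=3,
                       min_lr=1e-4, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

val_losses = [0.40, 0.39, 0.39, 0.39, 0.39, 0.38, 0.38, 0.38, 0.38, 0.38]
lrs = simulate_reduce_lr(val_losses, lr=0.1)
print(lrs)
```

Under these assumptions a run of 10 epochs that starts at 0.1 can be reduced at most three times, reaching the floor of 0.0001 only if the validation loss stalls almost from the start.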

In [None]:
hist, model = train(X_item_train, X_item_test, X_user_train, X_user_test, y_train, y_test)

Train on 9601120 samples, validate on 4154428 samples
Epoch 1/10
 179456/9601120 [..............................] - ETA: 37:44 - loss: 0.4218 - acc: 0.8647 - precision_score: 0.1857 - recall_score: 0.0216

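## Evaluating the model (AUC)

- With the rmsprop optimizer, batch_size=1024 and epoch=10, the final AUC is 0.81
- After adding an initializer to the Embedding layers, the AUC also reaches 0.81 after 4 epochs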

In [97]:
from sklearn.metrics import roc_curve, auc, recall_score, precision_score

In [99]:
def estimate(model, X_user_test, X_item_test, y_test):
    y_test = y_test.astype(np.int32)
    pred_test = model.predict([X_user_test, X_item_test])
    fpr, tpr, thresholds = roc_curve(y_test[:,0], pred_test[:,0])
    auc_score = auc(fpr, tpr)
    return auc_score
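
For intuition about what the AUC measures, it equals the fraction of (liked, disliked) pairs in which the liked movie receives the higher score. The sketch below computes it by brute force on a tiny made-up example; `auc_by_pairs` is a hypothetical helper and its pairwise matrix is only practical for small inputs.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]          # one entry per (positive, negative) pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_by_pairs(y_true, scores)
print(auc_value)
```

Here three of the four pairs are ordered correctly (only 0.35 versus 0.4 is wrong), which gives 0.75.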