[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Wp-Zhang/HandyRec/blob/master/examples/DIEN.ipynb)

> This notebook runs DIEN on the MovieLens-1M dataset. YouTubeDNN generates the candidates and DIEN ranks them.

> Only movies that a user rated higher than 3 are treated as 'positive' samples for that user. Each user's last 10 'positive' movies are held out for testing.
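
The hold-out rule above can be sketched in plain Python. This is only an illustration under stated assumptions: `split_positive_history` is a hypothetical helper (not part of HandyRec), "last" is taken to mean latest by timestamp, and the actual `MovieMatchDataHelper` may treat ties or users with few positives differently.

```python
from collections import defaultdict


def split_positive_history(ratings, threshold=3, holdout=10):
    """Split each user's positive movies into a training history and a test set.

    ratings: iterable of (user_id, movie_id, rating, timestamp) tuples.
    """
    positives = defaultdict(list)
    for user_id, movie_id, rating, timestamp in ratings:
        if rating > threshold:  # keep only 'positive' samples
            positives[user_id].append((timestamp, movie_id))

    train, test = {}, {}
    for user_id, events in positives.items():
        events.sort()  # chronological order (assumed meaning of "last")
        movies = [movie_id for _, movie_id in events]
        train[user_id] = movies[:-holdout]  # everything before the hold-out
        test[user_id] = movies[-holdout:]   # the last `holdout` positives
    return train, test


# Toy data: user 1 has 12 positive ratings and one low rating that is ignored.
demo_ratings = [(1, 100 + i, 5, i) for i in range(12)] + [(1, 999, 2, 50)]
demo_train, demo_test = split_positive_history(demo_ratings)
print(demo_train[1], demo_test[1])
```

In this sketch a user with 10 or fewer positives ends up with an empty training history; a real pipeline would need to decide how to handle such users.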

## Table of Contents:
* [Prepare data for matching](#section-0)
* [Train match model and export embeddings](#section-1)
* [Use Faiss to generate candidates](#section-2)
* [Train rank model and predict](#section-3)

**Download dataset and install packages**

In [1]:
! git clone https://github.com/Wp-Zhang/HandyRec.git
! pip install faiss-cpu

Cloning into 'HandyRec'...
remote: Enumerating objects: 1575, done.
remote: Counting objects: 100% (1575/1575), done.
remote: Compressing objects: 100% (1181/1181), done.
remote: Total 1575 (delta 606), reused 1224 (delta 343), pack-reused 0
Receiving objects: 100% (1575/1575), 20.87 MiB | 6.29 MiB/s, done.
Resolving deltas: 100% (606/606), done.
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 11.4 MB/s
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


In [2]:
! wget https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ./ml-1m.zip
! unzip -o ml-1m.zip

--2022-04-10 16:34:15--  https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘./ml-1m.zip’


2022-04-10 16:34:16 (6.96 MB/s) - ‘./ml-1m.zip’ saved [5917549/5917549]

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


**Import relevant packages**

In [3]:
import sys
sys.path.append('./HandyRec/')

In [4]:
from handyrec.dataset.movielens import MovieMatchDataHelper, MovieRankDataHelper
from handyrec.models.match import YouTubeMatchDNN
from handyrec.models.rank import DIEN
from handyrec.features import DenseFeature, SparseFeature, SparseSeqFeature, FeatureGroup, EmbdFeatureGroup, FeaturePool
from handyrec.layers.utils import sampledsoftmaxloss
from handyrec.dataset.metrics import map_at_k, recall_at_k
from handyrec.models.utils import search_embedding

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.utils import plot_model
import pandas as pd
import numpy as np
import gc

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
MATCH_EMBEDDING_DIM = 256
RANK_EMBEDDING_DIM = 128
SEQ_LEN = 15
BATCH_SIZE = 2**12
NEPOCH = 50

NEG_NUM = 10
CANDIDATE_NUM = 100

# 0. Prepare data for matching<a name="section-0"></a>

In [7]:
match_dh = MovieMatchDataHelper('./ml-1m/')
data = match_dh.get_clean_data(sparse_features=['gender','age','occupation','zip','year'])

match_user_features = ['user_id','gender','age','occupation','zip']
match_movie_features = [f for f in data['item'].columns if f != 'title']
match_dh.gen_dataset(match_user_features+match_movie_features, data, seq_max_len=SEQ_LEN, negnum=0)

Encode User Sparse Feats: 100%|██████████| 4/4 [00:00<00:00, 126.17it/s]
Encode Item Sparse Feats: 100%|██████████| 1/1 [00:00<00:00, 91.82it/s]
Generate train set: 100%|██████████| 6040/6040 [00:03<00:00, 1540.82it/s]
Save user features: 100%|██████████| 4/4 [00:00<00:00,  4.80it/s]
Save item features: 100%|██████████| 2/2 [00:01<00:00,  1.86it/s]


In [8]:
match_train, match_train_label, match_test, match_test_label = match_dh.load_dataset(match_user_features, match_movie_features)

Load user features: 100%|██████████| 7/7 [00:00<00:00, 136.12it/s]
Load movie features: 100%|██████████| 3/3 [00:00<00:00, 148.00it/s]


In [9]:
match_feature_dim = match_dh.get_feature_dim(data, match_user_features, match_movie_features, [])

# 1. Train match model and export embeddings <a name="section-1"></a>

In [10]:
# * add example_age^2 as shown in the original paper
match_train['example_age_2'] = match_train['example_age']**2
match_test['example_age_2'] = match_test['example_age']**2

In [11]:
match_user_dense_feats = ['example_age','example_age_2']
match_user_sparse_feats = ['user_id','gender','age','occupation','zip']

In [12]:
match_item_dense_feats = []
match_item_sparse_feats = [f for f in match_movie_features if f!='genres']
all_item_model_input = {f:np.array(data['item'][f].tolist()) for f in match_movie_features}

In [13]:
feat_pool1 = FeaturePool()

In [14]:
match_item_features = [DenseFeature(x) for x in match_item_dense_feats] +\
                [SparseFeature(x, match_feature_dim[x], MATCH_EMBEDDING_DIM) for x in match_item_sparse_feats] +\
                [SparseSeqFeature(SparseFeature('genre_id', 19, MATCH_EMBEDDING_DIM), 'genres',6)]
item_feature_group = EmbdFeatureGroup(
    name='item', 
    id_name='movie_id', 
    features=match_item_features, 
    feature_pool=feat_pool1, 
    value_dict=all_item_model_input,
    embd_dim=MATCH_EMBEDDING_DIM)

In [15]:
match_user_features = [DenseFeature(x) for x in match_user_dense_feats] +\
                [SparseFeature(x, match_feature_dim[x], MATCH_EMBEDDING_DIM) for x in match_user_sparse_feats] +\
                [SparseSeqFeature(item_feature_group, 'hist_movie_id',SEQ_LEN)]
user_feature_group = FeatureGroup('user', match_user_features, feat_pool1)

In [16]:
match_model = YouTubeMatchDNN(
    user_feature_group, item_feature_group,
    dnn_hidden_units=(512,256,MATCH_EMBEDDING_DIM), 
    dnn_dropout=0.1,
    dnn_bn=True,
    num_sampled=100
)

In [17]:
# plot_model(match_model)

In [18]:
match_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4), loss=sampledsoftmaxloss)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='./match_checkpoint/',
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)
history = match_model.fit(match_train, match_train_label,
                            batch_size=BATCH_SIZE, 
                            epochs=NEPOCH,
                            verbose=1,
                            validation_split=0.1,
                            callbacks=[early_stop,checkpoint])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [19]:
match_model.load_weights('./match_checkpoint/')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f42a03bae90>

In [20]:
user_embedding_model = Model(inputs=match_model.user_input, outputs=match_model.user_embedding)
item_embedding_model = Model(inputs=match_model.item_input, outputs=match_model.item_embedding)

user_embs = user_embedding_model.predict(match_test, batch_size=2 ** 15)
item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 15)

print(user_embs.shape)
print(item_embs.shape)

(6040, 256)
(3883, 256)


# 2. Use Faiss to generate candidates <a name="section-2"></a>

## Test match model

In [21]:
candidates = search_embedding(
    MATCH_EMBEDDING_DIM, 
    item_embs, 
    user_embs,
    data['item']['movie_id'].values,
    CANDIDATE_NUM)
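
As a rough picture of what candidate generation does, the NumPy sketch below retrieves, for each user, the ids of the items whose embeddings score highest against the user embedding. It assumes inner-product similarity and exhaustive search; `top_k_inner_product` is a hypothetical name, and the actual `search_embedding` helper may use a different Faiss index or metric.

```python
import numpy as np


def top_k_inner_product(item_embs, user_embs, item_ids, k):
    """Return, for each user, the ids of the k items with the largest inner product."""
    scores = user_embs @ item_embs.T              # shape: (n_users, n_items)
    top_idx = np.argsort(-scores, axis=1)[:, :k]  # column indices, best first
    return item_ids[top_idx]                      # map indices back to item ids


# Toy example with 3 items and 2 users in a 2-dimensional embedding space.
demo_item_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demo_item_ids = np.array([10, 20, 30])
demo_user_embs = np.array([[2.0, 1.0], [-1.0, 1.0]])

demo_candidates = top_k_inner_product(demo_item_embs, demo_user_embs, demo_item_ids, k=2)
print(demo_candidates)
```

Brute-force scoring like this costs one full matrix product per query batch, which is fine for a few thousand movies. Faiss offers the same exact search through `IndexFlatIP`, as well as approximate indexes for larger catalogues.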

In [22]:
map_at_k(match_test_label, candidates, k=10)

0.0355188229265216

In [23]:
recall_at_k(match_test_label, candidates, k=10)

0.08423841059602648

In [24]:
recall_at_k(match_test_label, candidates, k=100)

0.44639072847682115
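
For readers unfamiliar with these metrics, here is a minimal pure-Python sketch of one common definition of recall@k and MAP@k. The function names are hypothetical and deliberately differ from the imported `recall_at_k` and `map_at_k`; HandyRec's implementations may normalise differently. The sketch assumes every user has at least one held-out item and that candidate lists contain no duplicates.

```python
def recall_at_k_sketch(labels, candidates, k):
    """Mean over users of (relevant items found in the top k) / (relevant items)."""
    total = 0.0
    for relevant, ranked in zip(labels, candidates):
        relevant = set(relevant)
        hits = sum(1 for item in ranked[:k] if item in relevant)
        total += hits / len(relevant)
    return total / len(labels)


def map_at_k_sketch(labels, candidates, k):
    """Mean over users of average precision truncated at rank k."""
    total = 0.0
    for relevant, ranked in zip(labels, candidates):
        relevant = set(relevant)
        hits, precision_sum = 0, 0.0
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this hit position
        total += precision_sum / min(k, len(relevant))
    return total / len(labels)


# One user with three held-out movies; two of them appear in the ranked list.
demo_labels = [[1, 2, 3]]
demo_ranked = [[1, 9, 2, 8]]
print(recall_at_k_sketch(demo_labels, demo_ranked, k=4))
print(map_at_k_sketch(demo_labels, demo_ranked, k=4))
```

Recall@k ignores the order within the top k, while MAP@k rewards placing relevant items earlier, which is why the two can move independently when a ranker only reorders a fixed candidate set.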

## Prepare data for ranking

In [25]:
test_user_embs = user_embedding_model.predict(match_test, batch_size=2 ** 15)
test_candidates = search_embedding(
    MATCH_EMBEDDING_DIM, 
    item_embs, 
    test_user_embs,
    data['item']['movie_id'].values,
    CANDIDATE_NUM)

test_candidates = {
    match_test['user_id'][i] : test_candidates[i]
    for i in range(test_candidates.shape[0])
}

In [26]:
del user_embs, item_embs, match_train, match_train_label, test_user_embs
gc.collect()

1983

In [27]:
rank_dh = MovieRankDataHelper('./ml-1m/')
rank_user_features = ['user_id','gender','age','occupation','zip']
rank_movie_features = [f for f in data['item'].columns if f != 'title']

rank_dh.gen_dataset(rank_user_features+rank_movie_features, data, test_candidates, seq_max_len=SEQ_LEN, negnum=NEG_NUM, neg_seq=True)

Generate train set: 100%|██████████| 6040/6040 [00:07<00:00, 787.95it/s]
Save user features: 100%|██████████| 4/4 [00:06<00:00,  1.68s/it]
Save item features: 100%|██████████| 2/2 [00:11<00:00,  5.84s/it]


In [28]:
rank_train, rank_train_label, rank_test = rank_dh.load_dataset(rank_user_features, rank_movie_features, neg_seq=True)

Load user features: 100%|██████████| 9/9 [00:01<00:00,  8.87it/s]
Load movie features: 100%|██████████| 3/3 [00:00<00:00, 17.35it/s]


In [29]:
rank_feature_dim = rank_dh.get_feature_dim(data, rank_user_features, rank_movie_features, [])

# 3. Train rank model and predict <a name="section-3"></a>

In [30]:
rank_user_dense_feats = []
rank_user_sparse_feats = ['user_id','gender','age','occupation', 'zip']
rank_item_dense_feats = []
rank_item_sparse_feats = [f for f in rank_movie_features if f!='genres']

In [31]:
feat_pool2 = FeaturePool()

In [32]:
all_item_model_input = {f:np.array(data['item'][f].tolist()) for f in rank_movie_features}
rank_item_feats = [SparseFeature(x, rank_feature_dim[x], RANK_EMBEDDING_DIM) for x in rank_item_sparse_feats] +\
                [DenseFeature(x) for x in rank_item_dense_feats] +\
                [SparseSeqFeature(SparseFeature('genre_id', 19, RANK_EMBEDDING_DIM), 'genres',6)]
item_feature_group = EmbdFeatureGroup(
    name='item', 
    id_name='movie_id', 
    features=rank_item_feats, 
    feature_pool=feat_pool2, 
    value_dict=all_item_model_input,
    embd_dim=RANK_EMBEDDING_DIM
)

In [33]:
item_seq_features = [SparseSeqFeature(item_feature_group, 'hist_movie_id', SEQ_LEN)]
item_seq_feat_group = FeatureGroup('item_seq', item_seq_features, feat_pool2)
neg_item_seq_features = [SparseSeqFeature(item_feature_group, 'neg_hist_movie_id', SEQ_LEN)]
neg_item_seq_feat_group = FeatureGroup('neg_item_seq', neg_item_seq_features, feat_pool2)

In [34]:
other_feats = [DenseFeature(x) for x in rank_user_dense_feats] +\
                [SparseFeature(x, rank_feature_dim[x], RANK_EMBEDDING_DIM) for x in rank_user_sparse_feats]
other_feature_group = FeatureGroup('others', other_feats, feat_pool2)

In [35]:
rank_model = DIEN(
    item_seq_feat_group, neg_item_seq_feat_group, other_feature_group,
    gru_units=RANK_EMBEDDING_DIM, gru_dropout=0.1,
    lau_dnn_hidden_units=(36,1), lau_dnn_activation='dice', lau_dnn_dropout=0., lau_l2_dnn=0.2, lau_dnn_bn=False,
    augru_units=RANK_EMBEDDING_DIM,
    dnn_hidden_units=(256,128,1), dnn_activation='dice', dnn_dropout=0.2, l2_dnn=0.2, dnn_bn=True,
)

In [36]:
rank_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=binary_crossentropy)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='./rank_checkpoint/',
    save_weights_only=True,
    monitor='loss',
    mode='min',
    save_best_only=True)
history = rank_model.fit(rank_train, rank_train_label,
                    batch_size=BATCH_SIZE*2, 
                    epochs=NEPOCH,
                    verbose=1,
                    validation_split=0.1,
                    callbacks=[early_stop,checkpoint])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [37]:
rank_model.load_weights('./rank_checkpoint/')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f429511e550>

In [38]:
del rank_train
gc.collect()

2369

In [39]:
pred = rank_model.predict(rank_test, batch_size=BATCH_SIZE*2)

In [40]:
pred_df = pd.DataFrame(columns=['user_id','movie_id','pred'])
pred_df['user_id'] = rank_test['user_id']
pred_df['movie_id'] = rank_test['movie_id']
pred_df['pred'] = pred

pred_df = pred_df.sort_values(by=['user_id','pred'], ascending=False).reset_index(drop=True)
pred_df = pred_df.groupby('user_id')['movie_id'].apply(list).reset_index()
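
The sort-then-group step above relies on `groupby` keeping the row order inside each group. The toy example below illustrates that behaviour on made-up data; the column names mirror the cell above, but the values and the `demo_` names are purely illustrative.

```python
import pandas as pd

# Made-up predictions for two users.
demo_df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 1],
    "movie_id": [10, 11, 20, 21, 12],
    "pred": [0.2, 0.9, 0.5, 0.7, 0.4],
})

# Sort so that, within each user, the highest score comes first.
demo_sorted = demo_df.sort_values(by=["user_id", "pred"], ascending=False).reset_index(drop=True)

# Collect each user's movies into a list; rows keep their sorted order within a group.
ranked_demo = demo_sorted.groupby("user_id")["movie_id"].apply(list).reset_index()
print(ranked_demo)
```

Each resulting list is ordered from the highest to the lowest predicted score, so its first k entries can be passed straight to a top-k metric.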

In [41]:
test_label_df = pd.DataFrame(columns=['user_id','label'])
test_label_df['user_id'] = match_test['user_id']
test_label_df['label'] = match_test_label.tolist()

In [42]:
test_label_df = pd.merge(test_label_df, pred_df, on=['user_id'], how='left')

In [43]:
map_at_k(test_label_df['label'], test_label_df['movie_id'], k=10)

0.018196612530221805

In [44]:
recall_at_k(test_label_df['label'], test_label_df['movie_id'], k=10)

0.05559602649006623

In [45]:
recall_at_k(test_label_df['label'], test_label_df['movie_id'], k=100)

0.44639072847682115