[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Wp-Zhang/HandyRec/blob/master/examples/FMLPRec.ipynb)

> This notebook runs FMLP-Rec on the MovieLens-1M dataset. We'll use YouTubeDNN to generate candidates and FMLP-Rec to rank them.

> Only movies rated higher than 3 are treated as 'positive' samples for each user. The last 10 'positive' movies of each user are held out for testing.
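
The positive-sample filter and hold-out rule described above can be sketched in pandas as follows. This is a minimal illustration, not the code HandyRec's data helpers run: the function name `split_last_n_positives` is hypothetical, and the column names `user_id`, `movie_id`, `rating`, and `timestamp` are assumed to mirror the fields of `ratings.dat`.

```python
import pandas as pd

def split_last_n_positives(ratings, n_test=10, min_rating=3):
    """Hypothetical helper: keep positives, hold out each user's last n_test."""
    # 'Positive' means a rating strictly larger than min_rating
    pos = ratings[ratings["rating"] > min_rating]
    # Order each user's history chronologically
    pos = pos.sort_values(["user_id", "timestamp"], kind="stable")
    # Position counted from the end of each user's history (0 = most recent)
    rank_from_end = pos.groupby("user_id").cumcount(ascending=False)
    train = pos[rank_from_end >= n_test]
    test = pos[rank_from_end < n_test]
    return train, test

# Toy data: user 1 has three positives, user 2 has two
toy_ratings = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "movie_id":  [10, 11, 12, 13, 10, 14],
    "rating":    [5, 2, 4, 5, 4, 5],
    "timestamp": [1, 2, 3, 4, 1, 2],
})
toy_train, toy_test = split_last_n_positives(toy_ratings, n_test=1)
```

With `n_test=1`, each user's most recent positive movie ends up in `toy_test` and the earlier positives stay in `toy_train`; the notebook's setting corresponds to `n_test=10`.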

## Table of Contents:
* [Prepare data for matching](#section-0)
* [Train match model and export embeddings](#section-1)
* [Use Faiss to generate candidates](#section-2)
* [Train rank model and predict](#section-3)

**Download dataset and install packages**

In [1]:
! git clone https://github.com/Wp-Zhang/HandyRec.git
! pip install faiss-cpu

Cloning into 'HandyRec'...
remote: Enumerating objects: 1658, done.
remote: Counting objects: 100% (1658/1658), done.
remote: Compressing objects: 100% (1236/1236), done.
remote: Total 1658 (delta 656), reused 1280 (delta 370), pack-reused 0
Receiving objects: 100% (1658/1658), 20.90 MiB | 11.26 MiB/s, done.
Resolving deltas: 100% (656/656), done.
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 4.2 MB/s
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


In [2]:
! wget https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ./ml-1m.zip
! unzip -o ml-1m.zip

--2022-04-13 03:59:09--  https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘./ml-1m.zip’


2022-04-13 03:59:11 (4.17 MB/s) - ‘./ml-1m.zip’ saved [5917549/5917549]

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


**Import the required packages**

In [3]:
import sys
sys.path.append('./HandyRec/')

In [4]:
from handyrec.dataset.movielens import MovieMatchDataHelper, MovieRankSeqDataHelper
from handyrec.models.match import YouTubeMatchDNN
from handyrec.models.rank import FMLPRec
from handyrec.features import DenseFeature, SparseFeature, SparseSeqFeature, FeatureGroup, EmbdFeatureGroup, FeaturePool
from handyrec.layers.utils import sampledsoftmaxloss
from handyrec.dataset.metrics import map_at_k, recall_at_k
from handyrec.models.utils import search_embedding

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.utils import plot_model
import pandas as pd
import numpy as np
import gc

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
MATCH_EMBEDDING_DIM = 64
RANK_EMBEDDING_DIM = 64
SEQ_LEN = 40
BATCH_SIZE = 2**12
NEPOCH = 50

NEG_NUM = 10
CANDIDATE_NUM = 100

# 0. Prepare data for matching <a name="section-0"></a>

In [7]:
match_dh = MovieMatchDataHelper('./ml-1m/')
data = match_dh.get_clean_data(sparse_features=['gender','age','occupation','zip','year'])

match_user_features = ['user_id','gender','age','occupation','zip']
match_movie_features = [f for f in data['item'].columns if f != 'title']
match_dh.gen_dataset(match_user_features+match_movie_features, data, seq_max_len=SEQ_LEN, negnum=0)

Encode User Sparse Feats: 100%|██████████| 4/4 [00:00<00:00, 118.89it/s]
Encode Item Sparse Feats: 100%|██████████| 1/1 [00:00<00:00, 232.29it/s]
Generate train set: 100%|██████████| 6040/6040 [00:05<00:00, 1199.54it/s]
Save user features: 100%|██████████| 4/4 [00:00<00:00,  4.31it/s]
Save item features: 100%|██████████| 2/2 [00:01<00:00,  1.83it/s]


In [8]:
match_train, match_train_label, match_test, match_test_label = match_dh.load_dataset(match_user_features, match_movie_features)

Load user features: 100%|██████████| 7/7 [00:00<00:00, 63.44it/s]
Load movie features: 100%|██████████| 3/3 [00:00<00:00, 151.28it/s]


In [9]:
match_feature_dim = match_dh.get_feature_dim(data, match_user_features, match_movie_features, [])

# 1. Train match model and export embeddings <a name="section-1"></a>

In [10]:
# * add example_age^2 as shown in the original paper
match_train['example_age_2'] = match_train['example_age']**2
match_test['example_age_2'] = match_test['example_age']**2

In [11]:
match_user_dense_feats = ['example_age','example_age_2']
match_user_sparse_feats = ['user_id','gender','age','occupation','zip']

In [12]:
match_item_dense_feats = []
match_item_sparse_feats = [f for f in match_movie_features if f!='genres']
all_item_model_input = {f:np.array(data['item'][f].tolist()) for f in match_movie_features}

In [13]:
feat_pool1 = FeaturePool()

In [14]:
match_item_features = [DenseFeature(x) for x in match_item_dense_feats] +\
                [SparseFeature(x, match_feature_dim[x], MATCH_EMBEDDING_DIM) for x in match_item_sparse_feats] +\
                [SparseSeqFeature(SparseFeature('genre_id', 19, MATCH_EMBEDDING_DIM), 'genres',6)]
item_feature_group = EmbdFeatureGroup(
    name='item', 
    id_name='movie_id', 
    features=match_item_features, 
    feature_pool=feat_pool1, 
    value_dict=all_item_model_input,
    embd_dim=MATCH_EMBEDDING_DIM)

In [15]:
match_user_features = [DenseFeature(x) for x in match_user_dense_feats] +\
                [SparseFeature(x, match_feature_dim[x], MATCH_EMBEDDING_DIM) for x in match_user_sparse_feats] +\
                [SparseSeqFeature(item_feature_group, 'hist_movie_id',SEQ_LEN)]
user_feature_group = FeatureGroup('user', match_user_features, feat_pool1)

In [16]:
match_model = YouTubeMatchDNN(
    user_feature_group, item_feature_group,
    dnn_hidden_units=(512,256,MATCH_EMBEDDING_DIM), 
    dnn_dropout=0.1,
    dnn_bn=True,
    num_sampled=100
)

In [17]:
match_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4), loss=sampledsoftmaxloss)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='./match_checkpoint/',
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)
history = match_model.fit(match_train, match_train_label,
                            batch_size=BATCH_SIZE, 
                            epochs=NEPOCH,
                            verbose=1,
                            validation_split=0.1,
                            callbacks=[early_stop,checkpoint])

Epoch 1/50
...
Epoch 50/50


In [18]:
match_model.load_weights('./match_checkpoint/')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3c353e3410>

In [19]:
user_embedding_model = Model(inputs=match_model.user_input, outputs=match_model.user_embedding)
item_embedding_model = Model(inputs=match_model.item_input, outputs=match_model.item_embedding)

user_embs = user_embedding_model.predict(match_test, batch_size=2 ** 15)
item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 15)

print(user_embs.shape)
print(item_embs.shape)

(6040, 64)
(3883, 64)


# 2. Use Faiss to generate candidates <a name="section-2"></a>

## Test match model

In [20]:
candidates = search_embedding(
    MATCH_EMBEDDING_DIM, 
    item_embs, 
    user_embs,
    data['item']['movie_id'].values,
    CANDIDATE_NUM)
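
The internals of `search_embedding` are not shown in this notebook. As a rough mental model, candidate generation of this kind amounts to a top-k inner-product search between each user embedding and all item embeddings (Faiss offers an exact index for this, `IndexFlatIP`). The NumPy sketch below is a brute-force stand-in under that assumption; `topk_inner_product` is a hypothetical name, and the similarity measure actually used by HandyRec may differ.

```python
import numpy as np

def topk_inner_product(item_embs, user_embs, item_ids, k):
    """Hypothetical brute-force top-k search by inner product."""
    # Score every (user, item) pair: shape (n_users, n_items)
    scores = user_embs @ item_embs.T
    # Select the k best items per user without sorting the full row
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Sort only those k items by descending score
    top_scores = np.take_along_axis(scores, top, axis=1)
    order = np.argsort(-top_scores, axis=1)
    top = np.take_along_axis(top, order, axis=1)
    # Map row positions back to item ids
    return item_ids[top]

toy_items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_item_ids = np.array([10, 20, 30])
toy_users = np.array([[2.0, 1.0], [-1.0, 2.0]])
toy_candidates = topk_inner_product(toy_items, toy_users, toy_item_ids, k=2)
```

Each row of `toy_candidates` holds one user's item ids ordered from highest to lowest score, which is the shape the metric calls in the following cells expect from `candidates`.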

In [21]:
map_at_k(match_test_label, candidates, k=10)

0.027647469252601704

In [22]:
recall_at_k(match_test_label, candidates, k=10)

0.06869205298013245

In [23]:
recall_at_k(match_test_label, candidates, k=100)

0.3984602649006622
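
For reference, here is a sketch of how recall@k and MAP@k are commonly defined. The functions `recall_at_k_sketch` and `map_at_k_sketch` are illustrative names, not HandyRec's `recall_at_k` and `map_at_k`; the library's normalization may differ, so the numbers above should not be expected to match this sketch exactly. The sketch assumes each ranked list contains no duplicate items.

```python
import numpy as np

def recall_at_k_sketch(labels, preds, k):
    """Mean fraction of each user's held-out items found in the top k."""
    recalls = []
    for truth, ranked in zip(labels, preds):
        hits = len(set(truth) & set(ranked[:k]))
        recalls.append(hits / len(truth))
    return float(np.mean(recalls))

def map_at_k_sketch(labels, preds, k):
    """Mean average precision at k, normalized by min(#relevant, k)."""
    aps = []
    for truth, ranked in zip(labels, preds):
        truth = set(truth)
        hits, score = 0, 0.0
        for rank, item in enumerate(ranked[:k], start=1):
            if item in truth:
                hits += 1
                score += hits / rank  # precision at this hit position
        aps.append(score / min(len(truth), k))
    return float(np.mean(aps))

toy_labels = [[1, 2], [3]]
toy_preds = [[1, 9, 2], [8, 3, 7]]
toy_recall = recall_at_k_sketch(toy_labels, toy_preds, k=3)
toy_map = map_at_k_sketch(toy_labels, toy_preds, k=3)
```

Recall only asks whether a held-out item appears among the top k, while MAP also rewards placing it near the top of the list.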

## Prepare data for ranking

In [24]:
test_user_embs = user_embedding_model.predict(match_test, batch_size=2 ** 15)
test_candidates = search_embedding(
    MATCH_EMBEDDING_DIM, 
    item_embs, 
    test_user_embs,
    data['item']['movie_id'].values,
    CANDIDATE_NUM)

test_candidates = {
    match_test['user_id'][i] : test_candidates[i]
    for i in range(test_candidates.shape[0])
}

In [25]:
del user_embs, item_embs, match_train, match_train_label, test_user_embs
gc.collect()

996

In [26]:
rank_dh = MovieRankSeqDataHelper('./ml-1m/')
rank_user_features = ['user_id']
rank_movie_features = ['movie_id']

rank_dh.gen_dataset(rank_user_features+rank_movie_features, data, test_candidates, seq_max_len=SEQ_LEN, negnum=NEG_NUM)

Generate train set: 100%|██████████| 6040/6040 [00:08<00:00, 714.05it/s]
Save user features: 0it [00:00, ?it/s]
Save item features: 0it [00:00, ?it/s]


In [27]:
rank_train, rank_test = rank_dh.load_dataset(rank_user_features, rank_movie_features)

Load user features: 100%|██████████| 2/2 [00:00<00:00,  2.05it/s]
Load movie features: 100%|██████████| 1/1 [00:00<00:00, 39.85it/s]


In [28]:
rank_feature_dim = rank_dh.get_feature_dim(data, rank_user_features, rank_movie_features, [])

# 3. Train rank model and predict <a name="section-3"></a>

In [29]:
feat_pool2 = FeaturePool()

In [30]:
rank_item_seq_features = [SparseSeqFeature(SparseFeature('movie_id', rank_feature_dim['movie_id'], RANK_EMBEDDING_DIM), 'hist_movie_id', SEQ_LEN)]
item_seq_feat_group = FeatureGroup('item_seq', rank_item_seq_features, feat_pool2)

In [31]:
rank_model = FMLPRec(
    item_seq_feat_group,
    dropout=0.5,
    block_num=3
)
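
The core building block of FMLP-Rec is a filter layer: the sequence of item embeddings is transformed to the frequency domain with an FFT along the sequence axis, multiplied element-wise by a learnable complex filter, and transformed back, followed by a residual connection and layer normalization. The NumPy sketch below illustrates that idea only. It is not HandyRec's implementation; `filter_layer` and its arguments are hypothetical, the filter is a fixed array rather than a trained weight, and dropout is omitted.

```python
import numpy as np

def filter_layer(x, w, eps=1e-12):
    """Hypothetical forward pass of one frequency-domain filter layer.

    x: real array of shape (batch, seq_len, dim)
    w: complex array of shape (seq_len // 2 + 1, dim), the filter
    """
    seq_len = x.shape[1]
    # FFT along the sequence axis (real input, so rfft is enough)
    freq = np.fft.rfft(x, axis=1)
    # Element-wise filtering in the frequency domain, then back to time domain
    filtered = np.fft.irfft(freq * w, n=seq_len, axis=1)
    # Residual connection (dropout omitted in this sketch)
    h = x + filtered
    # Layer normalization over the embedding axis, without scale and shift
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(2, 40, 8))             # (batch, SEQ_LEN, dim)
toy_w = np.ones((40 // 2 + 1, 8), dtype=complex)  # all-pass filter
toy_out = filter_layer(toy_x, toy_w)
```

With an all-ones filter the layer passes every frequency through unchanged, so the output is just the layer-normalized input; a trained filter would instead attenuate or amplify individual frequencies of the behavior sequence.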

In [32]:
rank_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4, clipvalue=1.), loss=None)
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=5)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='./rank_checkpoint/',
    save_weights_only=True,
    monitor='loss',
    mode='min',
    save_best_only=True)
history = rank_model.fit(rank_train, None,
                    batch_size=BATCH_SIZE*2, 
                    epochs=NEPOCH,
                    verbose=1,
                    validation_split=0.1,
                    callbacks=[early_stop,checkpoint])

Epoch 1/50
...
Epoch 50/50


In [33]:
rank_model.load_weights('./rank_checkpoint/')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3c37f96890>

In [34]:
del rank_train
gc.collect()

1615

In [35]:
pred_model = Model(inputs=rank_model.actual_inputs, outputs=rank_model.actual_outputs)

In [36]:
pred = pred_model.predict(rank_test, batch_size=BATCH_SIZE*2)

In [37]:
pred_df = pd.DataFrame(columns=['user_id','movie_id','pred'])
pred_df['user_id'] = rank_test['user_id']
pred_df['movie_id'] = rank_test['movie_id']
pred_df['pred'] = pred

pred_df = pred_df.sort_values(by=['user_id','pred'], ascending=False).reset_index(drop=True)
pred_df = pred_df.groupby('user_id')['movie_id'].apply(list).reset_index()
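
The sort-then-group step above turns one row per (user, candidate) pair into one ranked list per user. A toy example with made-up scores shows the effect; it relies on `groupby` keeping rows within each group in their existing order, so the descending sort on `pred` carries over into each list.

```python
import pandas as pd

# Made-up predictions for two users
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "movie_id": [10, 11, 12, 10, 13],
    "pred":     [0.2, 0.9, 0.5, 0.1, 0.7],
})

toy_ranked = (
    toy.sort_values(by=["user_id", "pred"], ascending=False)  # best score first
       .groupby("user_id")["movie_id"]  # groups keep the sorted row order
       .apply(list)
       .reset_index()
)
```

In `toy_ranked`, user 1 maps to `[11, 12, 10]` and user 2 to `[13, 10]`, that is, movie ids ordered from highest to lowest predicted score.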

In [38]:
test_label_df = pd.DataFrame(columns=['user_id','label'])
test_label_df['user_id'] = match_test['user_id']
test_label_df['label'] = match_test_label.tolist()

In [39]:
test_label_df = pd.merge(test_label_df, pred_df, on=['user_id'], how='left')

In [40]:
map_at_k(test_label_df['label'], test_label_df['movie_id'], k=10)

0.037148474456007566

In [41]:
recall_at_k(test_label_df['label'], test_label_df['movie_id'], k=10)

0.09082781456953641

In [42]:
recall_at_k(test_label_df['label'], test_label_df['movie_id'], k=100)

0.3984602649006622