[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Wp-Zhang/HandyRec/blob/master/examples/FMLPRec/FMLPRec.ipynb)

> This notebook runs FMLP-Rec on the MovieLens1M dataset. We'll use DSSM to generate candidates and FMLP-Rec to rank them.

> Only movies rated higher than 3 (i.e. 4 or 5) are treated as 'positive' samples for each user. Each user's last 10 'positive' movies are held out for testing.
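
The split rule above can be sketched in plain pandas. This is only an illustration on a made-up toy table, not HandyRec's implementation: the function name `split_positives` and the toy values are assumptions, and `test_num=2` stands in for the 10 used in this notebook.

```python
import pandas as pd

# Toy interaction log; column names mirror MovieLens but the values are made up.
ratings = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2, 2],
    "movie_id":  [10, 11, 12, 13, 10, 14, 15],
    "rating":    [5, 3, 4, 5, 4, 2, 5],
    "timestamp": [100, 200, 300, 400, 150, 250, 350],
})

def split_positives(df, threshold=4, test_num=2):
    """Keep ratings >= threshold, then hold out each user's last `test_num` positives."""
    pos = df[df["rating"] >= threshold].sort_values(["user_id", "timestamp"])
    # Rank from the end of each user's timeline: 0 is the most recent positive.
    rank_from_end = pos.groupby("user_id").cumcount(ascending=False)
    test = pos[rank_from_end < test_num]
    train = pos[rank_from_end >= test_num]
    return train, test

train, test = split_positives(ratings, threshold=4, test_num=2)
```

In this toy case user 1 keeps one positive for training and user 2 keeps none, which is why short-history users tend to drop out of the training set under a leave-last-k split.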

**Download dataset and install packages**

In [1]:
! git clone https://github.com/Wp-Zhang/HandyRec.git
! cd HandyRec && python setup.py install
! pip install faiss-gpu

Cloning into 'HandyRec'...
remote: Enumerating objects: 1704, done.
remote: Counting objects: 100% (1704/1704), done.
remote: Compressing objects: 100% (1269/1269), done.
remote: Total 1704 (delta 682), reused 1309 (delta 383), pack-reused 0
Receiving objects: 100% (1704/1704), 20.92 MiB | 10.31 MiB/s, done.
Resolving deltas: 100% (682/682), done.
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 15.2 MB/s
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


In [2]:
! wget https://files.grouplens.org/datasets/movielens/ml-1m.zip -O ./ml-1m.zip
! unzip -o ml-1m.zip

--2022-04-15 22:57:04--  https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘./ml-1m.zip’


2022-04-15 22:57:06 (3.45 MB/s) - ‘./ml-1m.zip’ saved [5917549/5917549]

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


**Import relevant packages**

In [4]:
from handyrec.data.movielens import MovielensDataHelper
from handyrec.data.utils import gen_sequence
from handyrec.data import PairWiseDataset
from handyrec.models.ranking import FMLPRec
from handyrec.config import ConfigLoader
from handyrec.data.metrics import map_at_k, recall_at_k
from handyrec.models.utils import search_embedding

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.utils import plot_model
import numpy as np
import pandas as pd
import gc

In [5]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
BATCH_SIZE = 2**12
NEPOCH = 50

TEST_NUM = 10
VALID_RATIO = 0.1
NEG_NUM = 10

**Load MovieLens1M data**

In [7]:
retrieve_dh = MovielensDataHelper('./ml-1m/')
data = retrieve_dh.get_clean_data(sparse_features=['gender','occupation','zip','age','year'])
data['inter']['hist_movie'] = gen_sequence(data['inter'], 'user_id', 'movie_id', seq_len=40)

Encode User Sparse Feats: 100%|██████████| 4/4 [00:00<00:00, 95.18it/s]
Encode Item Sparse Feats: 100%|██████████| 1/1 [00:00<00:00, 93.77it/s]
Generate movie_id sequence: 100%|██████████| 6040/6040 [00:10<00:00, 552.23it/s] 


## Prepare data for ranking

In [None]:
# * Use pre-trained embeddings and Faiss to generate candidates
user_embs = np.load("./HandyRec/examples/DSSM/DSSM_user_embd.npy")
item_embs = np.load("./HandyRec/examples/DSSM/DSSM_item_embd.npy")
user_ids = np.load("./HandyRec/examples/DSSM/user_ids.npy")

test_candidates = search_embedding(
    32, item_embs, user_embs, data["item"]["movie_id"].values, 100, gpu=True
)

test_candidates = {user_ids[i] : test_candidates[i] for i in range(len(user_ids))}
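
For intuition, the candidate search can be approximated with a brute-force NumPy top-k lookup. This is a sketch under assumptions: the internals of `search_embedding` are not shown in this notebook, inner-product similarity is assumed, and `topk_inner_product` plus the toy embeddings are hypothetical.

```python
import numpy as np

def topk_inner_product(user_embs, item_embs, item_ids, k):
    """Return, for each user row, the ids of the k items with the largest dot product."""
    scores = user_embs @ item_embs.T  # shape: (n_users, n_items)
    # argpartition selects the k best columns per row without a full sort.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Order those k columns by score, best first.
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    top = np.take_along_axis(top, order, axis=1)
    return item_ids[top]

# Toy 2-dimensional embeddings (made-up values).
user_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
item_embs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [-1.0, 0.0]])
item_ids = np.array([101, 102, 103, 104])
user_ids = np.array([7, 8])

candidates = topk_inner_product(user_embs, item_embs, item_ids, k=2)
# Same user -> candidate-list mapping as built in the cell above.
candidates_by_user = {user_ids[i]: candidates[i] for i in range(len(user_ids))}
```

An approximate-nearest-neighbour library such as Faiss replaces the dense `scores` matrix with an index, which matters once the item catalogue no longer fits a full user-by-item product in memory.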

In [26]:
user_features = ['user_id']
item_features = ['movie_id']
inter_features = ['hist_movie']

In [27]:
ranking_dataset = PairWiseDataset(
    "RankingDataset",
    task="ranking",
    data=data,
    uid_name="user_id",
    iid_name="movie_id",
    inter_name="interact",
    time_name="timestamp",
    neg_iid_name="neg_movie_id",
    threshold=4,
)

ranking_dataset.train_test_split(TEST_NUM)
ranking_dataset.negative_sampling(NEG_NUM)
ranking_dataset.train_valid_split(VALID_RATIO)
ranking_dataset.gen_dataset(user_features, item_features, inter_features, test_candidates)

Generate negative samples: 100%|██████████| 5923/5923 [00:08<00:00, 711.63it/s]
Regenerate test set: 100%|██████████| 5923/5923 [00:00<00:00, 683735.97it/s]
Save user features: 100%|██████████| 1/1 [00:05<00:00,  5.49s/it]
Save item features: 100%|██████████| 1/1 [00:05<00:00,  5.49s/it]
Save inter features: 100%|██████████| 1/1 [00:35<00:00, 35.20s/it]
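
As a rough picture of what negative sampling produces, the sketch below draws items a user has not interacted with, uniformly at random. The sampling strategy actually used by `PairWiseDataset.negative_sampling` is not shown here, so uniform sampling, `neg_num` negatives per positive, and the name `sample_negatives` are all assumptions.

```python
import random

def sample_negatives(user_pos, all_items, neg_num, seed=0):
    """For every positive of every user, draw `neg_num` items the user has not interacted with."""
    rng = random.Random(seed)
    negatives = {}
    for user, pos_items in user_pos.items():
        seen = set(pos_items)
        pool = [item for item in all_items if item not in seen]
        # One row of negatives per positive; requires neg_num <= len(pool).
        negatives[user] = [rng.sample(pool, neg_num) for _ in pos_items]
    return negatives

# Toy data: user -> positive item ids (made-up values).
user_pos = {1: [10, 12], 2: [11]}
all_items = list(range(10, 20))
negs = sample_negatives(user_pos, all_items, neg_num=3)
```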


In [28]:
train_data, valid_data, test_data, test_label = ranking_dataset.load_dataset(
    user_features, item_features, inter_features, BATCH_SIZE
)

Load user features: 100%|██████████| 1/1 [00:00<00:00,  1.39it/s]
Load item features: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]
Load inter features: 100%|██████████| 1/1 [00:01<00:00,  1.32s/it]


In [29]:
feature_dim = ranking_dataset.get_feature_dim(user_features, item_features, [])
feature_dim["genre_id"] = 19

## Train rank model and predict

In [30]:
cfg = ConfigLoader("./HandyRec/examples/FMLPRec/FMLPRec_cfg.yaml")
feature_groups = cfg.prepare_features(feature_dim, data)

In [32]:
rank_model = FMLPRec(
    feature_groups["item_seq_feat_group"],
    **cfg.config.Model
)

In [33]:
rank_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4, clipvalue=1.), loss=None)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='./rank_checkpoint/',
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True
)
history = rank_model.fit(
    x=train_data, 
    validation_data=valid_data,
    epochs=NEPOCH,
    callbacks=[checkpoint]
)

Epoch 1/50
...
Epoch 50/50


In [34]:
rank_model.load_weights('./rank_checkpoint/')
pred_model = Model(inputs=rank_model.real_inputs, outputs=rank_model.real_outputs)

In [35]:
del train_data
gc.collect()

2843

In [36]:
pred = pred_model.predict(test_data, batch_size=BATCH_SIZE)

In [37]:
pred_df = pd.DataFrame(columns=['user_id','movie_id','pred'])
pred_df['user_id'] = test_data['user_id']
pred_df['movie_id'] = test_data['movie_id']
pred_df['pred'] = pred

pred_df = pred_df.sort_values(by=['user_id','pred'], ascending=False).reset_index(drop=True)
pred_df = pred_df.groupby('user_id')['movie_id'].apply(list).reset_index()

In [38]:
test_label_df = pd.DataFrame(columns=['user_id','label'])
test_label_df['user_id'] = pd.Series(test_data['user_id']).drop_duplicates()
test_label_df['label'] = test_label.tolist()

In [39]:
test_label_df = pd.merge(test_label_df, pred_df, on=['user_id'], how='left')

In [40]:
map_at_k(test_label_df['label'], test_label_df['movie_id'], k=10)

0.02545580317781905

In [41]:
recall_at_k(test_label_df['label'], test_label_df['movie_id'], k=10)

0.05924362654060442

In [42]:
recall_at_k(test_label_df['label'], test_label_df['movie_id'], k=100)

0.1351848725308121
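
For reference, one common way to define these two metrics is sketched below in plain Python. The exact definitions in `handyrec.data.metrics` are not shown in this notebook and may normalise differently, so the `_sketch` functions are illustrative assumptions; they also assume each prediction list contains no duplicate ids.

```python
def recall_at_k_sketch(labels, preds, k):
    """Mean over users of |top-k predictions ∩ labels| / |labels|."""
    per_user = [len(set(p[:k]) & set(l)) / len(l) for l, p in zip(labels, preds)]
    return sum(per_user) / len(per_user)

def map_at_k_sketch(labels, preds, k):
    """Mean average precision at k, normalising each user by min(k, |labels|)."""
    per_user = []
    for l, p in zip(labels, preds):
        relevant = set(l)
        hits, precision_sum = 0, 0.0
        for rank, item in enumerate(p[:k], start=1):
            if item in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this hit position
        per_user.append(precision_sum / min(k, len(l)))
    return sum(per_user) / len(per_user)

# Two toy users: held-out labels and ranked predictions (made-up ids).
labels = [[1, 2], [3]]
preds = [[1, 9, 2], [8, 3, 7]]

recall_2 = recall_at_k_sketch(labels, preds, k=2)
map_3 = map_at_k_sketch(labels, preds, k=3)
```

Recall ignores where in the top k a hit lands, while MAP rewards hits near the top of the list, so the two can diverge noticeably when relevant items are retrieved but ranked low.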