## Introduction 

This notebook demonstrates a method for sequence-aware product recommendation. In particular, the [Short-term and Long-term preference Integrated Recommender system](https://www.microsoft.com/en-us/research/uploads/prod/2019/07/IJCAI19-ready_v1.pdf) (SLi-Rec) method is applied to the [Amazon Review Dataset](https://nijianmo.github.io/amazon/index.html). Specifically, we use the Movies and TV dataset, which contains 8,765,568 reviews of 203,970 products.

## Package Imports and Global Variables

In [1]:
import os
import wandb
import pandas as pd

from recommenders.utils.timer import Timer
from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.io.sequential_iterator import SequentialIterator
from recommenders.models.deeprec.models.sequential.sli_rec import SLI_RECModel as SeqModel

2022-07-07 14:15:22.637682: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-07 14:15:22.637719: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
DATA_PATH = "../data/amazon"
REVIEWS_FILE = 'reviews_Movies_and_TV_5.json'
META_FILE = 'meta_Movies_and_TV.json'

YAML_PATH = "../../recommenders/recommenders/models/deeprec/config/sli_rec.yaml"

EPOCHS = 10
BATCH_SIZE = 400
RANDOM_SEED = 42

train_num_ngs = 4
valid_num_ngs = 4

In [3]:
wandb.init(project='SLi-Rec', sync_tensorboard=True)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: jewelltaylor9430 (anomalydetection). Use `wandb login --relogin` to force relogin


In [4]:
# Directories to store train, validation and test splits
train_path = os.path.join(DATA_PATH, r'train_data')
valid_path = os.path.join(DATA_PATH, r'valid_data')
test_path = os.path.join(DATA_PATH, r'test_data')

# File paths to store the lists of existing ids for users, items and item categories
user_vocab_path = os.path.join(DATA_PATH, r'user_vocab.pkl')
item_vocab_path = os.path.join(DATA_PATH, r'item_vocab.pkl')
cate_vocab_path = os.path.join(DATA_PATH, r'category_vocab.pkl')
output_file_path = os.path.join(DATA_PATH, r'output.txt')

# File paths to store reviews and associated metadata
reviews_path = os.path.join(DATA_PATH, REVIEWS_FILE)
meta_path = os.path.join(DATA_PATH, META_FILE)

valid_num_ngs = 4 # number of negative instances per positive instance in the validation set
test_num_ngs = 9 # number of negative instances per positive instance in the test set

## Data Loading 

Since the data is preprocessed in the [amazon_preprocessing notebook](amazon_preprocessing.ipynb), no further processing is required. In this section, we briefly inspect the train, validation and test sets to get acquainted with the data we will be modelling. Furthermore, we define a data loader to iteratively fetch samples from the datasets during training and evaluation.

The train dataset consists of a dataframe where each record is a review of a product `item_id` in category `cate_id` at time `timestamp` by user `user_id`. Each record also contains the list of previous items the user interacted with, `prev_item_ids`, along with the corresponding categories `prev_cate_ids` and timestamps `prev_timestamps`.
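To make the record layout concrete, the sketch below splits the three history fields of one record into aligned `(item, category, timestamp)` tuples. It assumes the history fields are comma-separated strings, as the item and category columns in the output suggest. `parse_history` is a hypothetical helper written for illustration and is not part of the notebook or the recommenders package.

```python
# Illustrative only: `parse_history` is a hypothetical helper, not a library function.
def parse_history(record):
    """Split the comma-separated history fields of one record into aligned tuples."""
    items = record["prev_item_ids"].split(",")
    cates = record["prev_cate_ids"].split(",")
    times = [int(t) for t in record["prev_timestamps"].split(",")]
    # The three fields describe the same interactions, so their lengths must match.
    if not (len(items) == len(cates) == len(times)):
        raise ValueError("history fields must have the same length")
    return list(zip(items, cates, times))


# Values taken from the second row of the train dataframe (assumed comma-separated).
record = {
    "label": 1,
    "user_id": "AWF2S3UNW9UA0",
    "item_id": "B009AMANBA",
    "cate_id": "Movies",
    "timestamp": 1365033600,
    "prev_item_ids": "B005LAIHQS,B008220C38",
    "prev_cate_ids": "Movies,Movies",
    "prev_timestamps": "1361232000,1362441600",
}

history = parse_history(record)
for item_id, cate_id, ts in history:
    print(item_id, cate_id, ts)
```

Each history entry precedes the target interaction in time, which is what lets a sequential model predict `item_id` from what came before it.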

In [5]:
train_df = pd.read_csv(train_path, sep="\t", index_col=False, names=["label", "user_id", "item_id", "cate_id", "timestamp", "prev_item_ids", "prev_cate_ids", "prev_timestamps"])
train_df

Unnamed: 0,label,user_id,item_id,cate_id,timestamp,prev_item_ids,prev_cate_ids,prev_timestamps
0,1,AWF2S3UNW9UA0,B008220C38,Movies,1362441600,B005LAIHQS,Movies,1361232000
1,1,AWF2S3UNW9UA0,B009AMANBA,Movies,1365033600,"B005LAIHQS,B008220C38","Movies,Movies","1361232000,1362441600"
2,1,AWF2S3UNW9UA0,B00B74MJOS,Movies,1367625600,"B005LAIHQS,B008220C38,B009AMANBA","Movies,Movies,Movies","1361232000,1362441600,1365033600"
3,1,AWF2S3UNW9UA0,B0067EKYL8,Movies,1371686400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS","Movies,Movies,Movies,Movies","1361232000,1362441600,1365033600,1367625600"
4,1,AWF2S3UNW9UA0,0792839072,Movies,1372982400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies","1361232000,1362441600,1365033600,1367625600,13..."
...,...,...,...,...,...,...,...,...
16630,1,A1WZZDWYPVST2M,B008JFUUIA,Movies,1365552000,B005S9ELM6,Movies,1365552000
16631,1,A37K6TJ94ZFXVQ,B008JFUOWM,Movies,1390262400,B00B74MJOS,Movies,1368144000
16632,1,A16342W88H5YWK,B0090SI3ZW,Movies,1364256000,B007R6D74G,Movies,1348185600
16633,1,AA3UZRM4EFLK2,B0067EKYL8,Movies,1365465600,B005S9ELM6,Movies,1365465600


The validation and test datasets share the same schema as the train dataset. The only key distinction is that the evaluation sets contain negative samples, which are denoted by a label of 0. Negative samples are interactions between users and items that have not occurred. They are included so that we can compute metrics of how well the generated recommendations approximate the users' actual behaviour.
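As a rough illustration of why the negatives matter, the sketch below scores one evaluation group (one positive followed by four negatives) and computes the reciprocal rank and NDCG@k of the positive item. The scores are made up, and `reciprocal_rank` and `ndcg_at_k` are hypothetical helpers; the metric code inside the recommenders package may differ in its details.

```python
import math


def reciprocal_rank(labels, scores):
    """Return 1/rank of the first positive when items are sorted by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels, scores, k):
    """Return NDCG@k for binary relevance labels."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    dcg = sum(label / math.log2(i + 2) for i, label in enumerate(ranked[:k]))
    ideal = sorted(labels, reverse=True)
    idcg = sum(label / math.log2(i + 2) for i, label in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


# One group: the first entry is the positive, the other four are negatives.
labels = [1, 0, 0, 0, 0]
scores = [0.6, 0.9, 0.2, 0.1, 0.3]  # hypothetical model outputs

print("reciprocal rank:", reciprocal_rank(labels, scores))
print("ndcg@2:", ndcg_at_k(labels, scores, 2))
```

Here one negative outscores the positive, so the positive lands at rank 2 and both metrics fall below 1. Without negatives in the group there would be nothing to rank the positive against.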

In [6]:
# Visualize validation dataset dataframe
valid_df = pd.read_csv(valid_path, sep="\t", index_col=False, names=["label", "user_id", "item_id", "cate_id", "timestamp", "prev_item_ids", "prev_cate_ids", "prev_timestamps"])
valid_df

Unnamed: 0,label,user_id,item_id,cate_id,timestamp,prev_item_ids,prev_cate_ids,prev_timestamps
0,1,AWF2S3UNW9UA0,B00005K3OT,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
1,0,AWF2S3UNW9UA0,B0090SI3ZW,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
2,0,AWF2S3UNW9UA0,B00E8RK5OC,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
3,0,AWF2S3UNW9UA0,6305171769,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
4,0,AWF2S3UNW9UA0,B00005JPFX,Movies,1393718400,"B005LAIHQS,B008220C38,B009AMANBA,B00B74MJOS,B0...","Movies,Movies,Movies,Movies,Movies,Movies,Movi...","1361232000,1362441600,1365033600,1367625600,13..."
...,...,...,...,...,...,...,...,...
34360,1,A173F44ZGP878J,B00E8RK5OC,Movies,1383264000,B009AMANBA,Movies,1365811200
34361,0,A173F44ZGP878J,B00005JPS8,Movies,1383264000,B009AMANBA,Movies,1365811200
34362,0,A173F44ZGP878J,B009934S5M,Movies,1383264000,B009AMANBA,Movies,1365811200
34363,0,A173F44ZGP878J,B000E1MTYK,Movies,1383264000,B009AMANBA,Movies,1365811200


In [7]:
# Visualize test dataset dataframe
test_df = pd.read_csv(test_path, sep="\t", index_col=False, names=["label", "user_id", "item_id", "cate_id", "timestamp", "prev_item_ids", "prev_cate_ids", "prev_timestamps"])
test_df

Unnamed: 0,label,user_id,item_id,cate_id,timestamp,prev_item_ids,prev_cate_ids,prev_timestamps
0,1,A3R27T4HADWFFJ,B0000AZT3R,Movies,1389657600,B000J10EQU,Movies,1387756800
1,0,A3R27T4HADWFFJ,B0000VD02Y,Movies,1389657600,B000J10EQU,Movies,1387756800
2,0,A3R27T4HADWFFJ,B00005JPS8,Movies,1389657600,B000J10EQU,Movies,1387756800
3,0,A3R27T4HADWFFJ,B00003CXXO,Movies,1389657600,B000J10EQU,Movies,1387756800
4,0,A3R27T4HADWFFJ,B000C3L27K,Movies,1389657600,B000J10EQU,Movies,1387756800
...,...,...,...,...,...,...,...,...
169165,0,AGAWDSE1J20RI,B002ZG98R8,Movies,1405468800,B00H7KJTCG,Movies,1405468800
169166,0,AGAWDSE1J20RI,B00005JPFX,Movies,1405468800,B00H7KJTCG,Movies,1405468800
169167,0,AGAWDSE1J20RI,B000AE4QD8,TV,1405468800,B00H7KJTCG,Movies,1405468800
169168,0,AGAWDSE1J20RI,B000BTJDG2,Movies,1405468800,B00H7KJTCG,Movies,1405468800


When training and evaluating neural network models, we typically feed batches of input into the model to generate predictions. This involves iteratively sampling batches of data from the dataset. The [Microsoft Recommenders](https://github.com/microsoft/recommenders) package provides the `SequentialIterator` class, which acts as a data loader for sequential recommender systems such as SLi-Rec.

In [8]:
input_creator = SequentialIterator
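The batching idea described above can be sketched with a plain generator. This is only a minimal illustration: `iter_batches` is a hypothetical function, and the real `SequentialIterator` does considerably more, such as reading the data files and building the model inputs.

```python
# Illustrative only: a generic batching generator, not the SequentialIterator API.
def iter_batches(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = list(range(10))  # stand-in for parsed dataset rows
batches = list(iter_batches(records, batch_size=4))
print([len(batch) for batch in batches])
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size, which is why training code usually cannot assume a fixed batch length.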

## Model Definition

In [9]:
# NOTE: if you are using your own dataset rather than the demo Amazon dataset,
# remember to use `_create_vocab(train_file, user_vocab, item_vocab, cate_vocab)`
# to generate the user_vocab, item_vocab and cate_vocab files.
hparams = prepare_hparams(YAML_PATH, 
                          embed_l2=0., 
                          layer_l2=0., 
                          learning_rate=0.001,  # set to 0.01 if batch normalization is disabled
                          epochs=EPOCHS,
                          batch_size=BATCH_SIZE,
                          show_step=20,
                          MODEL_DIR=os.path.join(DATA_PATH, "model/"),
                          SUMMARIES_DIR=os.path.join(DATA_PATH, "summary/"),
                          user_vocab=user_vocab_path,
                          item_vocab=item_vocab_path,
                          cate_vocab=cate_vocab_path,
                          need_sample=True,
                          train_num_ngs=train_num_ngs,  # number of negative instances sampled per positive instance for loss computation
                          )
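With `need_sample=True`, negative instances for training are drawn on the fly, `train_num_ngs` per positive instance. The sketch below shows one simple way such sampling could work: a uniform draw over items the user has not interacted with. This strategy is an assumption made for illustration, since the notebook does not show how the library draws its negatives, and `sample_negatives` is a hypothetical helper.

```python
import random


# Illustrative only: uniform sampling over unseen items is an assumed strategy.
def sample_negatives(positive_item, seen_items, all_items, num_negatives, rng):
    """Draw `num_negatives` distinct items the user has not interacted with."""
    excluded = set(seen_items) | {positive_item}
    # Sort the candidates so that results are reproducible for a given seed.
    candidates = sorted(item for item in set(all_items) if item not in excluded)
    if len(candidates) < num_negatives:
        raise ValueError("not enough unseen items to sample from")
    return rng.sample(candidates, num_negatives)


rng = random.Random(42)
all_items = ["B005LAIHQS", "B008220C38", "B009AMANBA", "B00B74MJOS",
             "B0067EKYL8", "B0090SI3ZW", "B00E8RK5OC", "B00005JPFX"]
seen_items = ["B005LAIHQS", "B008220C38"]
positive_item = "B009AMANBA"

negatives = sample_negatives(positive_item, seen_items, all_items,
                             num_negatives=4, rng=rng)
print(negatives)
```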

In [10]:
model = SeqModel(hparams, input_creator, seed=RANDOM_SEED)

Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Colocations handled automatically by placer.


2022-07-07 14:15:32.474175: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-07 14:15:32.474300: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2022-07-07 14:15:32.474374: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2022-07-07 14:15:32.474445: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2022-07-07 14:15:32.474515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64

## Training and Validation

In [11]:
with Timer() as train_time:
    model = model.fit(train_path, valid_path, valid_num_ngs=valid_num_ngs) 

# valid_num_ngs is the number of negative lines that follow each positive line in valid_file.
# The model is evaluated on valid_file after every epoch.
print('Time cost for training is {0:.2f} mins'.format(train_time.interval/60.0))

2022-07-07 14:15:37.730633: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-07 14:15:37.730670: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


step 20 , total_loss: 1.6076, data_loss: 1.6076
step 40 , total_loss: 1.6033, data_loss: 1.6033
eval valid at epoch 1: auc:0.5075,logloss:0.6958,mean_mrr:0.4637,ndcg@2:0.3357,ndcg@4:0.5214,ndcg@6:0.5952,group_auc:0.5117
INFO:tensorflow:../data/amazon/model/epoch_1.index
INFO:tensorflow:0
INFO:tensorflow:../data/amazon/model/epoch_1.data-00000-of-00001
INFO:tensorflow:600
INFO:tensorflow:../data/amazon/model/epoch_1.meta
INFO:tensorflow:2500
INFO:tensorflow:../data/amazon/model/epoch_10.data-00000-of-00001
INFO:tensorflow:3100
INFO:tensorflow:../data/amazon/model/epoch_10.index
INFO:tensorflow:3100
INFO:tensorflow:../data/amazon/model/epoch_10.meta
INFO:tensorflow:5000
INFO:tensorflow:../data/amazon/model/best_model.meta
INFO:tensorflow:1900
INFO:tensorflow:../data/amazon/model/best_model.data-00000-of-00001
INFO:tensorflow:2500
INFO:tensorflow:../data/amazon/model/best_model.index
INFO:tensorflow:2500
step 20 , total_loss: 1.5672, data_loss: 1.5672
step 40 , total_loss: 1.5190, data_lo

In [12]:
wandb.finish()

0,1
data_loss,██████▇▇▆▆▅▅▄▅▄▄▄▄▃▃▃▃▃▃▂▃▂▃▂▄▂▁▃▃▂▃▂▃▂▂
global_step,▁▂▅▆████████████████████████████████████
loss,██████▇▇▆▆▅▅▄▅▄▄▄▄▃▃▃▃▃▃▂▃▂▃▂▄▂▁▃▃▂▃▂▃▂▂
regular_loss,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
data_loss,1.1889
global_step,41.0
loss,1.1889
regular_loss,0.0
