# CICERO -- SUBER with NRMS as the model to be trained

How can we take the NRMS model and use it as input for SUBER? By that I mean: how do we provide the data in the form that the original NRMS model uses?

The NRMS data comes out of the `recommenders` module itself, which is installed in the container.
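To make "the form that the original NRMS model uses" concrete, here is a minimal sketch of how one row of a MIND `behaviors.tsv` file breaks down. The column layout follows the published MIND format (tab-separated, no header row); the function name `parse_behavior_line` and the sample row are made up for illustration and are not part of `recommenders` or SUBER.

```python
from typing import Dict, List, Tuple

# Column layouts of the MIND files (tab-separated, no header row).
NEWS_COLUMNS = ["news_id", "category", "subcategory", "title",
                "abstract", "url", "title_entities", "abstract_entities"]
BEHAVIOR_COLUMNS = ["impression_id", "user_id", "time", "history", "impressions"]


def parse_behavior_line(line: str) -> Dict[str, object]:
    """Split one behaviors.tsv row into its five fields (hypothetical helper)."""
    impression_id, user_id, time_str, history, impressions = (
        line.rstrip("\n").split("\t")
    )
    candidates: List[Tuple[str, int]] = []
    for item in impressions.split():
        # Each candidate looks like "N12345-1": news ID, dash, click label.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time_str,
        "history": history.split(),  # previously clicked news IDs
        "candidates": candidates,    # (news ID, clicked?) pairs
    }


# A made-up row in the MIND layout, not taken from the real dataset.
sample = "1\tU13740\t11/11/2019 9:05:58 AM\tN55189 N42782\tN55689-1 N35729-0"
parsed = parse_behavior_line(sample)
print(parsed["history"])
print(parsed["candidates"])
```

Anything that feeds NRMS from SUBER would need to end up with the same ingredients: a user ID, a click history, and a list of labelled candidate articles.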

## Do imports and check things out

Are we using a GPU?



In [1]:
import time

# Start the timer
start_time = time.time()

# Remove warnings
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys

nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

import numpy as np
import pandas as pd
import pickle
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages




from environment.data_utils import generate_uid2index, load_glove_embeddings, create_word_dict_and_embeddings


from nltk.tokenize import word_tokenize
import nltk
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata





print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# List available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(device)

# Check if a GPU is detected
if tf.config.list_physical_devices('GPU'):
    print("GPU is available and TensorFlow is using it.")
else:
    print("GPU is NOT available. TensorFlow is using the CPU.")




System version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
Tensorflow version: 2.15.1
Available devices:
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
GPU is available and TensorFlow is using it.


## Prepare Parameters

Adjust these as needed. 



In [2]:
epochs = 5
seed = 42
batch_size = 64


# My modification for MINDsmall: I mount the data into the container. Furthermore, the small dataset is not accessible anymore.

# Options: MINDdemo, MINDsmall, MINDlarge



MIND_type = 'MINDsmall'

## Specify the dataset to use

There are three in the original notebook: `demo`, `small` and `large`.

According to the [original notebook](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb), the `demo` set is 5000 samples of the `small` dataset.

I was able to download the original demo, but that's not really needed. A sample of 5000 is enough to show that the algorithm works, but not enough to train the model well.


In [13]:
#tmpdir = TemporaryDirectory()
#data_path = tmpdir.name
data_path_base="/app/datasets/"
data_path = data_path_base + MIND_type

# Create the directory if it doesn't exist
os.makedirs(data_path, exist_ok=True)

print(f"Data Path is {data_path}")

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
dev_news_file = os.path.join(data_path,'dev','news.tsv')
dev_behaviors_file = os.path.join(data_path,'dev','behaviors.tsv')

wordEmb_file = os.path.join(data_path, 'utils',"embedding.npy")
userDict_file = os.path.join(data_path, 'utils',"uid2index.pkl")
wordDict_file = os.path.join(data_path, 'utils',"word_dict.pkl")
yaml_file = os.path.join(data_path, "utils",'nrms.yaml')


Data Path is /app/datasets/MINDsmall
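Because the data is mounted into the container rather than downloaded, a wrong mount only shows up later as a file-not-found error deep inside the iterator. A small pre-flight check can catch that earlier. This is a sketch only: `find_missing_files` is a hypothetical helper, and the demonstration uses a throwaway directory rather than the real `/app/datasets` mount.

```python
import os
import tempfile
from typing import Iterable, List


def find_missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist as regular files (hypothetical helper)."""
    return [p for p in paths if not os.path.isfile(p)]


# Demonstration on a temporary directory that mimics part of the layout above.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "train"))
    present = os.path.join(demo_root, "train", "news.tsv")
    absent = os.path.join(demo_root, "train", "behaviors.tsv")
    open(present, "w").close()  # create an empty placeholder file
    missing = find_missing_files([present, absent])
    print([os.path.basename(p) for p in missing])
```

In this notebook the call would take the path variables defined above, for example `find_missing_files([train_news_file, train_behaviors_file, dev_news_file, dev_behaviors_file])`.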


## Download GloVe embeddings

The original NRMS used GloVe embeddings. This cell downloads and extracts them.

In [4]:
from recommenders.datasets.mind import download_and_extract_glove

## All models can use the same glove embeddings

# Download and extract GloVe embeddings
glove_dir = data_path_base +'glove_embeddings'
download_and_extract_glove(glove_dir)
print(f"GloVe embeddings extracted to: {glove_dir}")

# Download the NLTK tokenizer data (stored in the default nltk_data directory)
nltk.download('punkt')
nltk.download('punkt_tab')
#nltk.data.path.append('app/SUBERX/datasets/nltk_data')


GloVe embeddings extracted to: /app/datasets/glove_embeddings


[nltk_data] Downloading package punkt to /home/suber/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/suber/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Build the user index, word dictionary and embedding matrix

In [5]:
# Generate uid2index.pkl
uid2index = generate_uid2index(train_behaviors_file, output_file= userDict_file)

Created /app/datasets/MINDsmall/utils/uid2index.pkl with 50000 users.
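The source of `generate_uid2index` (from `environment.data_utils`) is not shown in this notebook. A minimal sketch of what such a function might do is below. It assumes that the user ID is the second tab-separated column of `behaviors.tsv` and that index 0 is left free for unknown users; both the name `build_uid2index` and the sample rows are hypothetical.

```python
import io
import pickle
from typing import Dict, Iterable


def build_uid2index(behavior_lines: Iterable[str]) -> Dict[str, int]:
    """Map each user ID to a positive integer, in order of first appearance."""
    uid2index: Dict[str, int] = {}
    for line in behavior_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        user_id = fields[1]
        if user_id not in uid2index:
            # Assumption: 0 is reserved for users not seen in training.
            uid2index[user_id] = len(uid2index) + 1
    return uid2index


# Made-up rows in the behaviors.tsv layout.
rows = io.StringIO(
    "1\tU1\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0\n"
    "2\tU2\t11/11/2019 9:06:10 AM\tN2\tN5-0 N6-1\n"
    "3\tU1\t11/12/2019 8:00:00 AM\tN1 N2 N3\tN7-0 N8-1\n"
)
uid2index_demo = build_uid2index(rows)
print(uid2index_demo)

# Round-trip through pickle, the format implied by the .pkl extension.
restored = pickle.loads(pickle.dumps(uid2index_demo))
```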


In [6]:
# Path to GloVe file (e.g., glove.6B.300d.txt)
glove_file = os.path.join(glove_dir, "glove/glove.6B.300d.txt")
embedding_dim = 300
glove_embeddings = load_glove_embeddings(glove_file, embedding_dim=embedding_dim)


Loading GloVe embeddings: 400001it [00:23, 17387.56it/s]

Loaded 400001 word vectors.
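`load_glove_embeddings` is a project helper whose source is not shown here. As a rough sketch of what loading involves, the GloVe text files hold one word per line followed by its vector components, separated by single spaces. The function name `parse_glove_lines` and the three-dimensional sample vectors are made up for illustration.

```python
import io
from typing import Dict, Iterable

import numpy as np


def parse_glove_lines(lines: Iterable[str], embedding_dim: int) -> Dict[str, np.ndarray]:
    """Parse GloVe-style text rows into a word -> vector dictionary."""
    embeddings: Dict[str, np.ndarray] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != embedding_dim + 1:
            continue  # skip rows whose width does not match the expected dimension
        word = parts[0]
        embeddings[word] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return embeddings


# Tiny made-up file with 3-dimensional vectors; the last row is deliberately short.
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\nbroken 1.0\n")
glove_demo = parse_glove_lines(fake_glove, embedding_dim=3)
print(sorted(glove_demo))
print(glove_demo["cat"].shape)
```

Skipping rows of the wrong width is a design choice of this sketch; the real helper may handle them differently.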





In [7]:
# Generate word_dict.pkl and embedding.npy
word_dict, embedding_matrix = create_word_dict_and_embeddings(
    news_file= train_news_file,
    glove_embeddings=glove_embeddings,
    embedding_dim=embedding_dim,
    output_dir= data_path + '/utils/'
)


Saved word_dict.pkl and embedding.npy to /app/datasets/MINDsmall/utils/.
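The two files written above are what `wordDict_file` and `wordEmb_file` point to: a word-to-index dictionary and a matrix whose row `i` is the vector for word index `i`. The sketch below shows one way such a pair could be built. It is not the source of `create_word_dict_and_embeddings`: it swaps in a simple regex tokenizer instead of NLTK's `word_tokenize`, and it assumes that index 0 is the padding row and that words missing from GloVe keep an all-zero vector.

```python
import re
from typing import Dict, Iterable, Tuple

import numpy as np


def build_word_dict_and_matrix(
    titles: Iterable[str],
    glove: Dict[str, np.ndarray],
    embedding_dim: int,
) -> Tuple[Dict[str, int], np.ndarray]:
    """Build a word index and a matching embedding matrix (hypothetical helper)."""
    word_dict: Dict[str, int] = {}
    for title in titles:
        # Regex tokenizer used here only to keep the sketch dependency-free.
        for token in re.findall(r"\w+", title.lower()):
            if token not in word_dict:
                word_dict[token] = len(word_dict) + 1  # assumption: 0 is padding
    matrix = np.zeros((len(word_dict) + 1, embedding_dim), dtype=np.float32)
    for token, index in word_dict.items():
        vector = glove.get(token)
        if vector is not None:
            matrix[index] = vector  # out-of-vocabulary words stay at zero
    return word_dict, matrix


# Made-up titles and 2-dimensional "GloVe" vectors.
titles_demo = ["Cats rule the web", "The web grows"]
glove_2d = {"the": np.array([1.0, 0.0]), "web": np.array([0.0, 1.0])}
word_dict_demo, matrix_demo = build_word_dict_and_matrix(titles_demo, glove_2d, 2)
print(word_dict_demo)
print(matrix_demo.shape)
```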


## Define the hyper-parameters

In [15]:
hparams = prepare_hparams(None, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          model_type="nrms",
                          title_size=30,
                          his_size=50,
                          npratio=4,
                          data_format='news',
                          word_emb_dim=300,
                          head_num=20,
                          head_dim=20,
                          attention_hidden_dim=200,
                          loss='cross_entropy_loss',
                          dropout=0.2,
                          support_quick_scoring=True,
                          show_step=10, 
                          metrics=['group_auc', 'mean_mrr', 'ndcg@5;10'])
#print(hparams)

In [16]:
for name, value in hparams.values().items():
    print(f"{name} = {value}")

support_quick_scoring = True
dropout = 0.2
attention_hidden_dim = 200
head_num = 20
head_dim = 20
filter_num = 200
window_size = 3
vert_emb_dim = 100
subvert_emb_dim = 100
gru_unit = 400
type = ini
user_emb_dim = 50
learning_rate = 0.001
optimizer = adam
epochs = 5
batch_size = 64
show_step = 10
wordEmb_file = /app/datasets/MINDsmall/utils/embedding.npy
wordDict_file = /app/datasets/MINDsmall/utils/word_dict.pkl
userDict_file = /app/datasets/MINDsmall/utils/uid2index.pkl
model_type = nrms
title_size = 30
his_size = 50
npratio = 4
data_format = news
word_emb_dim = 300
loss = cross_entropy_loss
metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']
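A few of these values determine the shape of the data NRMS expects, which matters when thinking about what SUBER has to supply. The arithmetic below is a sketch based on my reading of the settings: each training sample is taken to hold one clicked article plus `npratio` sampled negatives, and the encoded news vector is taken to be `head_num * head_dim` wide. The shape names are mine, not identifiers from `recommenders`.

```python
import math

# Values copied from the hyper-parameter listing above.
batch_size = 64
title_size = 30   # tokens kept per news title
his_size = 50     # clicked articles kept per user history
npratio = 4       # negatives sampled per positive
head_num = 20
head_dim = 20

candidates_per_sample = npratio + 1  # one positive plus the negatives

# Assumed shapes of one training batch of word indices and labels.
history_shape = (batch_size, his_size, title_size)
candidate_shape = (batch_size, candidates_per_sample, title_size)
label_shape = (batch_size, candidates_per_sample)

# Assumed width of the multi-head self-attention output.
news_vector_dim = head_num * head_dim

# Cross-entropy of a uniform guess over the candidates, a floor for "no learning".
chance_loss = math.log(candidates_per_sample)

print(history_shape, candidate_shape, label_shape)
print(news_vector_dim)
print(round(chance_loss, 4))
```

The uniform-guess loss of about 1.61 is a useful yardstick for the training loss: a softmax over five candidates that has learned nothing should sit near that value.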


In [17]:
iterator = MINDIterator
model = NRMSModel(hparams, iterator, seed=seed)

In [18]:
# Run the model without any training -- can training do better than no training at all?  

In [19]:
print(model.run_eval(dev_news_file, dev_behaviors_file))

663it [00:03, 213.89it/s]
1143it [00:52, 21.91it/s]
73152it [00:10, 6754.55it/s] 


{'group_auc': 0.4631, 'mean_mrr': 0.2014, 'ndcg@5': 0.2035, 'ndcg@10': 0.2653}
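An untrained `group_auc` of 0.4631 is close to the 0.5 that random scoring would give. To make the metrics less of a black box, here is a small pure-Python sketch of how I understand `group_auc` and `mean_mrr`: each is computed per impression and then averaged over impressions. The function names and the toy labels and scores are made up, and this is not the `recommenders` implementation.

```python
from typing import Callable, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def mrr(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Sum of reciprocal ranks of clicked items, divided by the number of clicks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    reciprocal_sum = sum(labels[i] / rank for rank, i in enumerate(order, start=1))
    return reciprocal_sum / sum(labels)


def group_mean(
    metric: Callable[[Sequence[int], Sequence[float]], float],
    grouped_labels: List[Sequence[int]],
    grouped_scores: List[Sequence[float]],
) -> float:
    """Average a per-impression metric over all impressions."""
    values = [metric(l, s) for l, s in zip(grouped_labels, grouped_scores)]
    return sum(values) / len(values)


# Two toy impressions: click labels and model scores per candidate.
toy_labels = [[1, 0, 0], [0, 1, 0, 0]]
toy_scores = [[0.9, 0.2, 0.4], [0.8, 0.6, 0.1, 0.3]]

toy_group_auc = group_mean(auc, toy_labels, toy_scores)
toy_mean_mrr = group_mean(mrr, toy_labels, toy_scores)
print(round(toy_group_auc, 4), round(toy_mean_mrr, 4))
```

In the second toy impression the clicked article is ranked second out of four, so it beats two of three negatives (AUC 2/3) and has a reciprocal rank of 1/2.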


In [20]:
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

step 3690 , total_loss: 1.3552, data_loss: 1.2673: : 3693it [11:43,  5.25it/s]
663it [00:01, 376.01it/s]
1143it [00:48, 23.58it/s]
73152it [00:11, 6620.78it/s] 


at epoch 1
train info: logloss loss:1.3551201551482828
eval info: group_auc:0.6413, mean_mrr:0.2923, ndcg@10:0.3864, ndcg@5:0.32
at epoch 1 , train time: 703.2 eval time: 151.9


step 3690 , total_loss: 1.2674, data_loss: 1.2201: : 3693it [11:29,  5.36it/s]
663it [00:01, 400.28it/s]
1143it [00:46, 24.66it/s] 
73152it [00:11, 6174.09it/s]


at epoch 2
train info: logloss loss:1.2674262470550857
eval info: group_auc:0.6431, mean_mrr:0.2945, ndcg@10:0.3895, ndcg@5:0.322
at epoch 2 , train time: 689.2 eval time: 150.7


step 3690 , total_loss: 1.2237, data_loss: 1.0511: : 3693it [11:31,  5.34it/s]
663it [00:01, 390.14it/s]
1143it [00:46, 24.40it/s]
73152it [00:12, 5644.49it/s]


at epoch 3
train info: logloss loss:1.2236662934925897
eval info: group_auc:0.6408, mean_mrr:0.2942, ndcg@10:0.3908, ndcg@5:0.3241
at epoch 3 , train time: 691.2 eval time: 152.9


step 3690 , total_loss: 1.1854, data_loss: 1.2819: : 3693it [11:27,  5.37it/s]
663it [00:01, 402.88it/s]
1143it [00:46, 24.55it/s] 
73152it [00:16, 4499.24it/s]


at epoch 4
train info: logloss loss:1.1853428060144
eval info: group_auc:0.6375, mean_mrr:0.2934, ndcg@10:0.3871, ndcg@5:0.3219
at epoch 4 , train time: 687.6 eval time: 154.7


step 3690 , total_loss: 1.1478, data_loss: 1.0840: : 3693it [11:31,  5.34it/s]
663it [00:01, 420.95it/s]
1143it [00:47, 23.88it/s]
73152it [00:12, 6010.16it/s]


at epoch 5
train info: logloss loss:1.1478492611635354
eval info: group_auc:0.6346, mean_mrr:0.292, ndcg@10:0.3851, ndcg@5:0.3198
at epoch 5 , train time: 691.1 eval time: 152.7


<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7f90108101f0>
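Reading the log above, the training loss falls every epoch while `group_auc` peaks at epoch 2 (0.6431) and then drifts down, which may suggest that the later epochs overfit. The sketch below pulls the `eval info` lines into a dictionary and picks the best epoch by `group_auc`. The numbers are copied from the output above; `parse_eval_log` is a hypothetical helper and the regular expressions are tailored to this log format only.

```python
import re
from typing import Dict, Optional

# Evaluation lines copied from the training output above.
training_log = """\
at epoch 1
eval info: group_auc:0.6413, mean_mrr:0.2923, ndcg@10:0.3864, ndcg@5:0.32
at epoch 2
eval info: group_auc:0.6431, mean_mrr:0.2945, ndcg@10:0.3895, ndcg@5:0.322
at epoch 3
eval info: group_auc:0.6408, mean_mrr:0.2942, ndcg@10:0.3908, ndcg@5:0.3241
at epoch 4
eval info: group_auc:0.6375, mean_mrr:0.2934, ndcg@10:0.3871, ndcg@5:0.3219
at epoch 5
eval info: group_auc:0.6346, mean_mrr:0.292, ndcg@10:0.3851, ndcg@5:0.3198
"""


def parse_eval_log(log: str) -> Dict[int, Dict[str, float]]:
    """Collect the metrics reported after each epoch (hypothetical helper)."""
    results: Dict[int, Dict[str, float]] = {}
    epoch: Optional[int] = None
    for line in log.splitlines():
        epoch_match = re.match(r"at epoch (\d+)", line)
        if epoch_match:
            epoch = int(epoch_match.group(1))
        elif line.startswith("eval info:") and epoch is not None:
            pairs = re.findall(r"([\w@]+):(\d+(?:\.\d+)?)", line[len("eval info:"):])
            results[epoch] = {name: float(value) for name, value in pairs}
    return results


eval_results = parse_eval_log(training_log)
best_epoch = max(eval_results, key=lambda e: eval_results[e]["group_auc"])
print(best_epoch, eval_results[best_epoch])
```

With these numbers the best epoch by `group_auc` is 2, although `ndcg@10` and `ndcg@5` are slightly higher at epoch 3, so the choice depends on which metric matters most.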