# CICERO -- NRMS Small Dataset Adapted

This notebook is a run of the NRMS notebook adapted from the [Recommenders Team](https://github.com/recommenders-team/recommenders/tree/main). It has been modified because, over time, some parts stopped working as originally presented.

NRMS stands for `Neural News Recommendation with Multi-Head Self-Attention`. The reference to the paper is provided below. Please look over and study the [recommenders team GitHub repository](https://github.com/recommenders-team/recommenders/blob/main/README.md#Getting-Started) because this code is derived from it.

Read up on the [MIND: MIcrosoft News Dataset](https://msnews.github.io/) and download the data into the datasets folder of this project so that it is available at `/app/datasets`. Review the [Readme](/app/datasets/README.md).

## Do imports and check the environment

Are we using a GPU? If not, training will take a very long time.


In [1]:
import time

# Start the timer
start_time = time.time()

# Remove warnings
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
# Gets the SUBERX python modules included.
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

import numpy as np
import pandas as pd
import pickle
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

# This module was created to generate the necessary files if they don't already exist
from environment.data_utils import generate_uid2index, load_glove_embeddings, create_word_dict_and_embeddings, setup_nltk_resources

from nltk.tokenize import word_tokenize
import nltk
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.datasets.mind import download_and_extract_glove
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# List available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(device)

# Check if a GPU is detected
if tf.config.list_physical_devices('GPU'):
    print("GPU is available and TensorFlow is using it.")
else:
    print("GPU is NOT available. TensorFlow is using the CPU.")



System version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
Tensorflow version: 2.15.1
Available devices:
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
GPU is available and TensorFlow is using it.


## Prepare Parameters

Adjust these as needed. 



In [2]:
epochs = 5
seed = 42
batch_size = 64


# Modified for MINDsmall: I mount the data into the container. Also, the small dataset is no longer accessible.

# Options: MINDdemo, MINDsmall, MINDlarge

MIND_type = 'MINDsmall'

## Specify the dataset to use

There are three in the original notebook: `demo`, `small` and `large`.

According to the [original notebook](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb), the `demo` set is 5000 samples of the `small` dataset.

I was able to download the original demo, but it is not really needed. A sample of 5000 is enough to show that the algorithm works, but not enough to train it well.

In [3]:
## This is how it was done originally
#tmpdir = TemporaryDirectory()
#data_path = tmpdir.name

## I mount in the datasets folder to /app
data_path_base="/app/datasets/"
data_path = data_path_base + MIND_type

# Create the directory if it doesn't exist
os.makedirs(data_path, exist_ok=True)

print(f"Data Path is {data_path}")

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
dev_news_file = os.path.join(data_path,'dev','news.tsv')
dev_behaviors_file = os.path.join(data_path,'dev','behaviors.tsv')

wordEmb_file = os.path.join(data_path, 'utils',"embedding.npy")
userDict_file = os.path.join(data_path, 'utils',"uid2index.pkl")
wordDict_file = os.path.join(data_path, 'utils',"word_dict.pkl")
yaml_file = os.path.join(data_path, "utils",'nrms.yaml')


Data Path is /app/datasets/MINDsmall


## Download GloVe embeddings

The original NRMS used GloVe embeddings. The cell below downloads and extracts them.

In [4]:
## All models can use the same glove embeddings

# Download and extract GloVe embeddings
glove_dir = data_path_base +'glove_embeddings'
download_and_extract_glove(glove_dir)
print(f"GloVe embeddings extracted to: {glove_dir}")




GloVe embeddings extracted to: /app/datasets/glove_embeddings


## Get the NLTK Data

In [5]:
nltk_data_dir = data_path_base + 'nltk_data'
setup_nltk_resources(download_dir=nltk_data_dir)
nltk.data.path.append(nltk_data_dir)

[nltk_data] Downloading package punkt to /app/datasets/nltk_data...


NLTK resources downloaded to /app/datasets/nltk_data


[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /app/datasets/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Generate the user index, word dictionary, and embedding matrix

In [6]:
# Generate uid2index.pkl
uid2index = generate_uid2index(train_behaviors_file, output_file= userDict_file)

/app/datasets/MINDsmall/utils/uid2index.pkl already exists. Skipping creation.
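
The `generate_uid2index` helper comes from this project's own `environment.data_utils` module and is not shown in the notebook. As a rough sketch of what building such a mapping could look like, assuming the MIND `behaviors.tsv` layout (tab-separated impression ID, user ID, time, history, impressions), the following maps each user ID to an integer index. The function name `build_uid2index`, the sample rows, and the choice to start at 1 (leaving 0 free for unseen users) are assumptions for illustration, not the helper's actual implementation.

```python
import io

def build_uid2index(lines):
    """Map each user ID to a 1-based integer index, in order of first appearance."""
    uid2index = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        user_id = fields[1]
        if user_id not in uid2index:
            # Index 0 is left unused here (assumption: reserved for unknown users).
            uid2index[user_id] = len(uid2index) + 1
    return uid2index

# Made-up rows in the behaviors.tsv column order.
sample = io.StringIO(
    "1\tU100\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0\n"
    "2\tU200\t11/12/2019 6:11:30 PM\tN5\tN6-0 N7-1\n"
    "3\tU100\t11/13/2019 8:00:00 AM\tN1 N2 N3\tN8-0 N9-1\n"
)
uid2index_sample = build_uid2index(sample)
print(uid2index_sample)  # {'U100': 1, 'U200': 2}
```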


In [7]:
# Path to GloVe file (e.g., glove.6B.300d.txt)
glove_file = os.path.join(glove_dir, "glove/glove.6B.300d.txt")
embedding_dim = 300
glove_embeddings = load_glove_embeddings(glove_file, embedding_dim=embedding_dim)


glove_embeddings.pkl already exists. Loading embeddings from file.
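
For readers who have not worked with GloVe files before, here is a minimal sketch of how such a file can be parsed. It assumes the usual text format of one token per line followed by its space-separated vector components. The function name `parse_glove_lines` and the tiny three-dimensional sample are hypothetical, and the project's `load_glove_embeddings` helper (which also caches to a pickle, as the output above suggests) may work differently.

```python
import io
import numpy as np

def parse_glove_lines(lines, embedding_dim):
    """Parse GloVe-style text lines into a {token: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        # Skip lines whose vector length does not match the expected dimension.
        if len(parts) != embedding_dim + 1:
            continue
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Made-up sample with 3-dimensional vectors; the notebook uses 300 dimensions.
sample = io.StringIO("the 0.1 0.2 0.3\nnews 0.4 0.5 0.6\nbroken 0.7\n")
glove_sample = parse_glove_lines(sample, embedding_dim=3)
print(sorted(glove_sample))  # ['news', 'the']
```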


In [8]:
# Generate word_dict.pkl and embedding.npy
word_dict, embedding_matrix = create_word_dict_and_embeddings(
    news_file= train_news_file,
    glove_embeddings=glove_embeddings,
    embedding_dim=embedding_dim,
    output_dir= data_path + '/utils/'
)


Files already exist in /app/datasets/MINDsmall/utils/. Loading existing word_dict and embedding matrix.
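
The word dictionary and the embedding matrix have to agree: row `i` of the matrix must hold the vector for the word whose index is `i`. The sketch below illustrates that relationship under a few stated assumptions: tokens are split on whitespace (the notebook imports NLTK's `word_tokenize` instead), index 0 is reserved for padding, and words missing from the pretrained vectors keep an all-zero row. The function name `build_word_dict_and_matrix` is hypothetical and this is not the project's `create_word_dict_and_embeddings` implementation.

```python
import numpy as np

def build_word_dict_and_matrix(titles, pretrained, embedding_dim):
    """Build a word->index dict and a matching (vocab + 1, dim) embedding matrix."""
    word_dict = {}
    for title in titles:
        for token in title.lower().split():
            if token not in word_dict:
                word_dict[token] = len(word_dict) + 1  # 0 is kept for padding
    matrix = np.zeros((len(word_dict) + 1, embedding_dim), dtype=np.float32)
    for token, index in word_dict.items():
        vector = pretrained.get(token)
        if vector is not None:
            matrix[index] = vector  # out-of-vocabulary rows stay zero
    return word_dict, matrix

# Made-up pretrained vectors and titles.
pretrained_sample = {
    "news": np.array([1.0, 2.0], dtype=np.float32),
    "today": np.array([3.0, 4.0], dtype=np.float32),
}
word_dict_sample, matrix_sample = build_word_dict_and_matrix(
    ["News today", "Breaking news"], pretrained_sample, embedding_dim=2
)
print(word_dict_sample)     # {'news': 1, 'today': 2, 'breaking': 3}
print(matrix_sample.shape)  # (4, 2)
```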


## Define the hyper-parameters

In [9]:
hparams = prepare_hparams(None, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          model_type="nrms",
                          title_size=30,
                          his_size=50,
                          npratio=4,
                          data_format='news',
                          word_emb_dim=300,
                          head_num=20,
                          head_dim=20,
                          attention_hidden_dim=200,
                          loss='cross_entropy_loss',
                          dropout=0.2,
                          support_quick_scoring=True,
                          show_step=10, 
                          metrics=['group_auc', 'mean_mrr', 'ndcg@5;10'])
#print(hparams)

In [10]:
for key, value in hparams.values().items():
    print(f"{key} = {value}")

support_quick_scoring = True
dropout = 0.2
attention_hidden_dim = 200
head_num = 20
head_dim = 20
filter_num = 200
window_size = 3
vert_emb_dim = 100
subvert_emb_dim = 100
gru_unit = 400
type = ini
user_emb_dim = 50
learning_rate = 0.001
optimizer = adam
epochs = 5
batch_size = 64
show_step = 10
wordEmb_file = /app/datasets/MINDsmall/utils/embedding.npy
wordDict_file = /app/datasets/MINDsmall/utils/word_dict.pkl
userDict_file = /app/datasets/MINDsmall/utils/uid2index.pkl
model_type = nrms
title_size = 30
his_size = 50
npratio = 4
data_format = news
word_emb_dim = 300
loss = cross_entropy_loss
metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']


In [11]:
iterator = MINDIterator
model = NRMSModel(hparams, iterator, seed=seed)



In [12]:
# Run the model without any training -- can training do better than no training at all?  

In [13]:
print(model.run_eval(dev_news_file, dev_behaviors_file))

663it [00:03, 177.06it/s]
1143it [00:51, 22.38it/s]
73152it [00:12, 5637.77it/s]


{'group_auc': 0.4621, 'mean_mrr': 0.1991, 'ndcg@5': 0.2, 'ndcg@10': 0.2625}
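
To make the ranking metrics in this dictionary easier to interpret, here is a minimal sketch of MRR and nDCG@k for a single impression, using their common definitions (reciprocal rank averaged over clicked items; DCG with gains `2^label - 1` and a `log2` position discount). The function names and the sample labels and scores are made up, and this is not claimed to be the exact code the library runs. The sketch assumes each impression has at least one clicked item.

```python
import numpy as np

def mrr(labels, scores):
    """Mean reciprocal rank of the clicked items within one impression."""
    order = np.argsort(scores)[::-1]           # candidate indices, best score first
    ranked = np.asarray(labels)[order]
    reciprocal_ranks = ranked / (np.arange(len(ranked)) + 1)
    return reciprocal_ranks.sum() / ranked.sum()

def dcg_at_k(labels, scores, k):
    """Discounted cumulative gain over the top-k candidates by score."""
    order = np.argsort(scores)[::-1]
    ranked = np.asarray(labels)[order][:k]
    gains = 2.0 ** ranked - 1
    discounts = np.log2(np.arange(len(ranked)) + 2)
    return (gains / discounts).sum()

def ndcg_at_k(labels, scores, k):
    """DCG normalized by the DCG of a perfect ordering."""
    return dcg_at_k(labels, scores, k) / dcg_at_k(labels, labels, k)

# One made-up impression: 1 = clicked, 0 = not clicked.
labels = [0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.1, 0.3, 0.2]
mrr_value = mrr(labels, scores)
ndcg_value = ndcg_at_k(labels, scores, k=5)
print(mrr_value, ndcg_value)
```

In this sample the clicked items land at ranks 2 and 4, so the MRR is (1/2 + 1/4) / 2 = 0.375. The values reported by the notebook are averages of such per-impression scores.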


In [14]:
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

step 3690 , total_loss: 1.3564, data_loss: 1.2460: : 3693it [11:41,  5.27it/s]
663it [00:01, 438.94it/s]
1143it [00:46, 24.36it/s]
73152it [00:12, 5964.12it/s]


at epoch 1
train info: logloss loss:1.3563891566932733
eval info: group_auc:0.6434, mean_mrr:0.2919, ndcg@10:0.3877, ndcg@5:0.3193
at epoch 1 , train time: 701.4 eval time: 152.6


step 3690 , total_loss: 1.2646, data_loss: 1.2478: : 3693it [11:30,  5.35it/s]
663it [00:01, 387.77it/s]
1143it [00:48, 23.66it/s]
73152it [00:11, 6146.50it/s] 


at epoch 2
train info: logloss loss:1.2646360734985542
eval info: group_auc:0.6437, mean_mrr:0.2961, ndcg@10:0.3908, ndcg@5:0.3238
at epoch 2 , train time: 690.1 eval time: 154.2


step 3690 , total_loss: 1.2219, data_loss: 1.1531: : 3693it [11:31,  5.34it/s]
663it [00:01, 417.00it/s]
1143it [00:48, 23.81it/s]
73152it [00:12, 5674.08it/s] 


at epoch 3
train info: logloss loss:1.2218898550107684
eval info: group_auc:0.647, mean_mrr:0.2992, ndcg@10:0.3959, ndcg@5:0.3292
at epoch 3 , train time: 691.2 eval time: 154.2


step 3690 , total_loss: 1.1871, data_loss: 1.3355: : 3693it [11:33,  5.32it/s]
663it [00:01, 432.85it/s]
1143it [00:47, 23.93it/s]
73152it [00:14, 5140.87it/s]


at epoch 4
train info: logloss loss:1.1870706619655993
eval info: group_auc:0.6385, mean_mrr:0.2955, ndcg@10:0.3881, ndcg@5:0.323
at epoch 4 , train time: 693.8 eval time: 156.1


step 3690 , total_loss: 1.1485, data_loss: 1.0954: : 3693it [11:29,  5.35it/s]
663it [00:01, 425.09it/s]
1143it [00:47, 23.85it/s]
73152it [00:12, 5796.82it/s]


at epoch 5
train info: logloss loss:1.148466747255142
eval info: group_auc:0.6353, mean_mrr:0.2976, ndcg@10:0.3895, ndcg@5:0.3249
at epoch 5 , train time: 689.8 eval time: 154.7


<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7f0fc7942200>

In [15]:
res_syn = model.run_eval(dev_news_file, dev_behaviors_file)
print(res_syn)
 

663it [00:01, 440.28it/s] 
1143it [00:48, 23.60it/s]
73152it [00:12, 5802.64it/s]


{'group_auc': 0.6353, 'mean_mrr': 0.2976, 'ndcg@5': 0.3249, 'ndcg@10': 0.3895}


In [16]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])

## Save the model

In [17]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

## Output Prediction File
This code segment generates the `prediction.zip` file, which follows the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `MINDlarge` if you want to submit your prediction to the [MIND Competition](https://msnews.github.io/competition.html).

In [18]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

663it [00:01, 345.14it/s]
1143it [00:47, 24.30it/s]
73152it [00:11, 6226.21it/s]


In [19]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

73152it [00:01, 55368.57it/s]
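
The expression `np.argsort(np.argsort(preds)[::-1]) + 1` in the cell above converts scores into ranks: the inner `argsort` (reversed) lists candidate positions from best to worst, and the outer `argsort` inverts that ordering so each candidate gets its own 1-based rank. A small worked example with made-up scores and a hypothetical impression ID of 7:

```python
import numpy as np

preds = np.array([0.2, 0.9, 0.5, 0.1])   # made-up scores for four candidates
order = np.argsort(preds)[::-1]          # candidate indices, best first: [1, 2, 0, 3]
pred_rank = np.argsort(order) + 1        # rank of each candidate: [3, 1, 2, 4]

# One line of prediction.txt: impression ID, then the bracketed rank list.
line = "7 [" + ",".join(str(r) for r in pred_rank.tolist()) + "]"
print(line)  # 7 [3,1,2,4]
```

The candidate with score 0.9 gets rank 1 and the one with score 0.1 gets rank 4, in the candidates' original order.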


In [20]:
# Use a context manager so the archive is closed even if writing fails
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

## References

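- Wu et al., "Neural News Recommendation with Multi-Head Self-Attention" (EMNLP 2019): https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf
- Recommenders Team repository: https://github.com/recommenders-team/recommenders/tree/main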

In [21]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:MM:SS format (zero-padded minutes and seconds)
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")

Elapsed time: 1:17:50
