# NRMS -- Cicero

NRMS stands for `Neural News Recommendation with Multi-Head Self-Attention`. The reference to the paper is provided below. Please study the [recommenders team GitHub repository](https://github.com/recommenders-team/recommenders/blob/main/README.md#Getting-Started), because this code is derived from it.

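Read up on the [MIND: MIcrosoft News Dataset](https://msnews.github.io/) and download the data into the `datasets` folder of this project. Judging by the paths used later in this notebook, each dataset variant gets its own folder (for example `datasets/MINDdemo`) containing `train`, `valid`, and `utils` subfolders.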
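
A quick way to confirm the download landed where the notebook expects it is to check for the files the later cells read. This is a minimal sketch: the file list is an assumption taken from the paths built in the Data Paths section, and `missing_files` is a hypothetical helper, not part of the recommenders library.

```python
from pathlib import Path

# Files read by the later cells, relative to a dataset folder such as datasets/MINDdemo.
# Assumption: this list mirrors the paths built in the Data Paths section.
EXPECTED_FILES = [
    "train/news.tsv",
    "train/behaviors.tsv",
    "valid/news.tsv",
    "valid/behaviors.tsv",
    "utils/embedding.npy",
    "utils/uid2index.pkl",
    "utils/word_dict.pkl",
    "utils/nrms.yaml",
]


def missing_files(dataset_dir):
    """Return the expected files that are absent under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Report anything missing for the demo dataset (prints nothing if all files exist).
for rel in missing_files("datasets/MINDdemo"):
    print(f"missing: {rel}")
```

An empty result means every file the notebook opens is in place.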


## Getting Set Up

Just run the container.  The compose file is actually the easiest way to get things working.

```
insert instructions for running with docker compose and on its own with just docker.

```

## Notebook and GPU Check

Run the following cell to check the Python and TensorFlow versions and to confirm that a GPU is visible.


In [1]:
import time

# Start the timer
start_time = time.time()

# Suppress TensorFlow log noise; these must be set before tensorflow is imported
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
import numpy as np
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# List available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(device)

# Check if a GPU is detected
if tf.config.list_physical_devices('GPU'):
    print("GPU is available and TensorFlow is using it.")
else:
    print("GPU is NOT available. TensorFlow is using the CPU.")



System version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
Tensorflow version: 2.15.1
Available devices:
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
GPU is available and TensorFlow is using it.


## Prepare Parameters

Adjust these as needed; the values below are fine for a first run.

In [2]:
epochs = 5
seed = 42
batch_size = 64

# Options: MINDdemo, MINDsmall, MINDlarge
MIND_type = 'MINDdemo'
data_path_base = "/app/datasets/"
data_path = os.path.join(data_path_base, MIND_type)
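
Because a typo in `MIND_type` would only show up later as a missing-file error, the dataset name can be validated when the path is built. This is a minimal sketch; `resolve_data_path` and `VALID_MIND_TYPES` are hypothetical names introduced for illustration.

```python
import os

# The three dataset variants listed in the options comment of the parameters cell.
VALID_MIND_TYPES = ("MINDdemo", "MINDsmall", "MINDlarge")


def resolve_data_path(base_dir, mind_type):
    """Join base_dir and mind_type, rejecting unknown dataset names."""
    if mind_type not in VALID_MIND_TYPES:
        raise ValueError(
            f"MIND_type must be one of {VALID_MIND_TYPES}, got {mind_type!r}"
        )
    return os.path.join(base_dir, mind_type)


print(resolve_data_path("/app/datasets/", "MINDdemo"))  # /app/datasets/MINDdemo
```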

## Download and Load the Data

The [original NRMS MIND notebook](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb) was written back in 2021, and its data download links are no longer accessible. Download both the small and large datasets from the MIND dataset website instead.

With `MIND_type` set to `MINDdemo` we can test things out quickly. TODO: describe how the data was obtained.
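
One way to unpack a manually downloaded archive into the folder layout this notebook reads is sketched below. The archive names in the usage comment are assumptions about what was downloaded, as is the choice to place the development split in the `valid` folder; `extract_split` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def extract_split(zip_path, dataset_dir, split):
    """Unpack one archive into dataset_dir/<split> and return the extracted names."""
    target = Path(dataset_dir) / split
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())


# Hypothetical usage; the archive file names are assumptions:
# extract_split("MINDsmall_train.zip", "/app/datasets/MINDsmall", "train")
# extract_split("MINDsmall_dev.zip", "/app/datasets/MINDsmall", "valid")
```

Only extract archives you trust, since `extractall` writes whatever paths the archive contains.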

## Data Paths

The original NRMS MIND notebook did a lot of the work for you, and the `utils` folder contains the artifacts it produced. Remember that the data has been downloaded ahead of time for this notebook. In case you're wondering, I got the `.pkl` and `.npy` files by executing the original NRMS MIND notebook with the demo dataset.

In [3]:
# Create the directory if it doesn't exist (it should already, since the data was downloaded ahead of time)
os.makedirs(data_path, exist_ok=True)

print(f"Data Path is {data_path}")
train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'nrms.yaml')


Data Path is /app/datasets/MINDdemo
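
Before handing the `utils` files to the model, it can help to peek inside the pickled dictionaries. The sketch below assumes that `word_dict.pkl` and `uid2index.pkl` each hold a plain Python dict mapping a key to an integer index; that is an assumption about the files, and `summarize_index_dict` is a hypothetical helper.

```python
import pickle


def summarize_index_dict(pkl_path):
    """Load a pickled {key: integer index} dict and report its size and index range."""
    with open(pkl_path, "rb") as f:
        mapping = pickle.load(f)
    indices = list(mapping.values())
    return {
        "entries": len(mapping),
        "min_index": min(indices, default=None),
        "max_index": max(indices, default=None),
    }


# Hypothetical usage with the paths defined in the cell above:
# print(summarize_index_dict(wordDict_file))
# print(summarize_index_dict(userDict_file))
```

Only unpickle files you produced yourself or otherwise trust, because loading a pickle can execute arbitrary code.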


## Set Hyperparameters



In [4]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          show_step=10)
print(hparams)

HParams object with values {'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 20, 'head_dim': 20, 'filter_num': 200, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 64, 'show_step': 10, 'title_size': 30, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 300, 'model_type': 'nrms', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/app/datasets/MINDdemo/utils/embedding.npy', 'wordDict_file': '/app/datasets/MINDdemo/utils/word_dict.pkl', 'userDict_file': '/app/datasets/MINDdemo/utils/uid2index.pkl'}


## Train the NRMS Model

In [5]:
for name, value in hparams.values().items():
    print(f"{name} = {value}")

support_quick_scoring = True
dropout = 0.2
attention_hidden_dim = 200
head_num = 20
head_dim = 20
filter_num = 200
window_size = 3
vert_emb_dim = 100
subvert_emb_dim = 100
gru_unit = 400
type = ini
user_emb_dim = 50
learning_rate = 0.0001
optimizer = adam
epochs = 5
batch_size = 64
show_step = 10
title_size = 30
his_size = 50
data_format = news
npratio = 4
metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']
word_emb_dim = 300
model_type = nrms
loss = cross_entropy_loss
wordEmb_file = /app/datasets/MINDdemo/utils/embedding.npy
wordDict_file = /app/datasets/MINDdemo/utils/word_dict.pkl
userDict_file = /app/datasets/MINDdemo/utils/uid2index.pkl


In [6]:
iterator = MINDIterator

In [7]:
model = NRMSModel(hparams, iterator, seed=seed)



### Run the model without any training

In [8]:
print(model.run_eval(valid_news_file, valid_behaviors_file))

293it [00:00, -521.92it/s]
118it [00:05, 20.51it/s]
7538it [00:01, 6692.38it/s]


{'group_auc': 0.4794, 'mean_mrr': 0.2059, 'ndcg@5': 0.205, 'ndcg@10': 0.2701}
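
The values returned by `run_eval` are ranking metrics computed per impression and then averaged. The sketch below computes them for a single impression using the common textbook definitions in plain Python. It is an illustration, not the library's implementation, and the labels and scores are made up. It assumes the impression has at least one clicked and one non-clicked article.

```python
import math


def rank_by_score(labels, scores):
    """Return the labels reordered from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr(labels, scores):
    """Mean reciprocal rank of the clicked items in one impression."""
    ranked = rank_by_score(labels, scores)
    return sum(label / (pos + 1) for pos, label in enumerate(ranked)) / sum(labels)


def dcg_at_k(labels, scores, k):
    """Discounted cumulative gain over the top k items."""
    ranked = rank_by_score(labels, scores)[:k]
    return sum((2 ** label - 1) / math.log2(pos + 2) for pos, label in enumerate(ranked))


def ndcg_at_k(labels, scores, k):
    """DCG normalized by the DCG of a perfect ordering."""
    return dcg_at_k(labels, scores, k) / dcg_at_k(labels, labels, k)


def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 1, 0, 0]          # made-up impression: the second article was clicked
scores = [0.1, 0.4, 0.8, 0.2]  # made-up model scores
print(f"{mrr(labels, scores):.4f}")           # 0.5000
print(f"{ndcg_at_k(labels, scores, 5):.4f}")  # 0.6309
print(f"{auc(labels, scores):.4f}")           # 0.6667
```

An AUC close to 0.5 corresponds to a random ordering, which is roughly what the untrained model produces above.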


In [9]:
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

step 540 , total_loss: 1.5315, data_loss: 1.4524: : 543it [01:42,  5.28it/s]
293it [00:00, 393.55it/s]
118it [00:05, 22.37it/s]
7538it [00:01, 6708.75it/s]


at epoch 1
train info: logloss loss:1.5314006761512264
eval info: group_auc:0.5608, mean_mrr:0.234, ndcg@10:0.3176, ndcg@5:0.246
at epoch 1 , train time: 102.9 eval time: 17.3


step 540 , total_loss: 1.4438, data_loss: 1.3967: : 543it [01:40,  5.42it/s]
293it [00:00, 399.15it/s]
118it [00:02, 43.29it/s] 
7538it [00:01, 6633.15it/s]


at epoch 2
train info: logloss loss:1.4440718382103008
eval info: group_auc:0.5931, mean_mrr:0.2497, ndcg@10:0.3387, ndcg@5:0.266
at epoch 2 , train time: 100.1 eval time: 14.6


step 540 , total_loss: 1.4006, data_loss: 1.2802: : 543it [01:36,  5.61it/s]
293it [00:00, 446.93it/s]
118it [00:04, 26.16it/s]
7538it [00:01, 4129.84it/s]


at epoch 3
train info: logloss loss:1.400763091022358
eval info: group_auc:0.6024, mean_mrr:0.2593, ndcg@10:0.3506, ndcg@5:0.2799
at epoch 3 , train time: 96.8 eval time: 18.4


step 540 , total_loss: 1.3718, data_loss: 1.2602: : 543it [01:39,  5.44it/s]
293it [00:00, 426.90it/s]
118it [00:05, 23.22it/s]
7538it [00:01, 6720.16it/s]


at epoch 4
train info: logloss loss:1.3719985586064516
eval info: group_auc:0.6026, mean_mrr:0.2611, ndcg@10:0.3516, ndcg@5:0.2794
at epoch 4 , train time: 99.8 eval time: 15.7


step 540 , total_loss: 1.3529, data_loss: 1.4011: : 543it [01:39,  5.46it/s]
293it [00:00, 437.17it/s]
118it [00:04, 23.97it/s]
7538it [00:00, 7715.94it/s]


at epoch 5
train info: logloss loss:1.3530630095009426
eval info: group_auc:0.6027, mean_mrr:0.2625, ndcg@10:0.3535, ndcg@5:0.2795
at epoch 5 , train time: 99.5 eval time: 16.2


<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7fe1c266bd90>
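
To put the logged loss values in context, the sketch below works out the loss of a model that cannot tell candidates apart. It assumes that, with `npratio = 4`, each training sample holds one clicked article and four sampled non-clicked ones, scored with a softmax cross-entropy over the five candidates. That follows the negative-sampling setup described in the NRMS paper, but treat it as an assumption about this particular run.

```python
import math


def softmax_cross_entropy(scores, positive_index=0):
    """Negative log-probability of the positive candidate under a softmax."""
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[positive_index]


npratio = 4  # value shown in the hyperparameters above
uniform_loss = softmax_cross_entropy([0.0] * (npratio + 1))
print(f"loss with equal scores: {uniform_loss:.4f}")  # 1.6094
```

Under that assumption, equal scores give a loss of ln(5), so the per-epoch losses logged above (about 1.53 falling to 1.35) sit below the uninformed baseline.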

In [10]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:M:S format
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")

Elapsed time: 0:10:07
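
The same timing code appears again at the end of the notebook, so it could be wrapped in a small helper. This is a sketch; `format_elapsed` is a hypothetical name, and it zero-pads minutes and seconds so the output reads as a clock time.

```python
def format_elapsed(total_seconds):
    """Format a duration in seconds as H:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(format_elapsed(607.4))  # 0:10:07
```

In the notebook it would be called as `format_elapsed(time.time() - start_time)`.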


In [11]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)


293it [00:01, 268.43it/s]
118it [00:05, 22.64it/s]
7538it [00:01, 6211.94it/s]


{'group_auc': 0.6027, 'mean_mrr': 0.2625, 'ndcg@5': 0.2795, 'ndcg@10': 0.3535}
CPU times: user 19.2 s, sys: 26 s, total: 45.2 s
Wall time: 15.9 s


In [12]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])

## Save the model

In [13]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

## Output Prediction File
This code segment generates the `prediction.zip` file, which follows the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Change the `MIND_type` parameter to `MINDlarge` if you want to submit your predictions to the [MIND Competition](https://msnews.github.io/competition.html).

In [14]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

293it [00:01, 280.84it/s]
118it [00:05, 23.19it/s]
7538it [00:01, 7391.29it/s]


In [15]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

7538it [00:00, 47475.18it/s]
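
The double `argsort` in the cell above turns each impression's scores into ranks, where rank 1 goes to the highest score. The plain-Python sketch below shows the same idea on a tiny made-up example. The helper names are hypothetical, and tied scores may be ordered differently than in the NumPy expression.

```python
def scores_to_ranks(preds):
    """Return the 1-based rank of each score, where rank 1 is the highest score."""
    order = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    ranks = [0] * len(preds)
    for position, item in enumerate(order):
        ranks[item] = position + 1
    return ranks


def format_prediction_line(impr_index, preds):
    """Build one line of prediction.txt, adding 1 to the index as the cell above does."""
    ranks = ",".join(str(r) for r in scores_to_ranks(preds))
    return f"{impr_index + 1} [{ranks}]"


print(format_prediction_line(0, [0.1, 0.9, 0.5]))  # 1 [3,1,2]
```

Each line therefore holds the impression number followed by one rank per candidate article, in the candidates' original order.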


In [16]:
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')
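
Before uploading, the archive can be sanity-checked locally. The sketch below only verifies the line format that the cells above write (an impression number followed by a bracketed list of ranks forming a permutation of 1..n). It is not an official validator, and `check_prediction_zip` is a hypothetical helper.

```python
import re
import zipfile

# One line looks like: "<impression number> [r1,r2,...]"
LINE_PATTERN = re.compile(r"^(\d+) \[(\d+(?:,\d+)*)\]$")


def check_prediction_zip(zip_path, member="prediction.txt"):
    """Return the number of lines in the member; raise ValueError on a malformed line."""
    with zipfile.ZipFile(zip_path) as archive:
        text = archive.read(member).decode("utf-8")
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"line {number} is malformed: {line!r}")
        ranks = [int(r) for r in match.group(2).split(",")]
        if sorted(ranks) != list(range(1, len(ranks) + 1)):
            raise ValueError(f"line {number}: ranks are not a permutation of 1..{len(ranks)}")
    return len(lines)


# Hypothetical usage with the path from the cell above:
# print(check_prediction_zip(os.path.join(data_path, "prediction.zip")))
```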

## References

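[Neural News Recommendation with Multi-Head Self-Attention (NRMS paper, EMNLP 2019)](https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf)

[recommenders-team/recommenders on GitHub](https://github.com/recommenders-team/recommenders/tree/main)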

In [17]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:M:S format
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")

Elapsed time: 0:10:31
