# NRMS -- Cicero

NRMS stands for `Neural News Recommendation with Multi-Head Self-Attention`. The paper is linked in the References section below.

---

## Understand the MIND dataset
The MIND dataset consists of several key files:

- news.tsv: Contains news articles and their metadata (news ID, category, subcategory, title, abstract, etc.).
- behaviors.tsv: Contains user interaction data, including the history of news articles clicked and the impressions list (clicked or not clicked).

The NRMS model uses this data to learn user preferences based on click history.
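
To make the file layout concrete, here is a minimal sketch that parses one `behaviors.tsv` row. It assumes the usual MIND column order (impression ID, user ID, time, click history, impressions) and that each impression entry looks like `newsID-label`, with `1` for clicked and `0` for not clicked. The sample row and the helper name `parse_behavior_line` are made up for illustration.

```python
def parse_behavior_line(line):
    """Split one tab-separated behaviors.tsv row into its five fields."""
    impression_id, user_id, timestamp, history, impressions = line.rstrip("\n").split("\t")
    candidates = []
    for item in impressions.split():
        # Each entry is "<news id>-<label>", e.g. "N35729-0".
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": timestamp,
        "history": history.split(),   # previously clicked news IDs
        "impressions": candidates,    # (news ID, clicked?) pairs
    }


# Made-up sample row, not taken from the real dataset.
sample = "1\tU100\t11/11/2019 9:05:58 AM\tN1 N2 N3\tN4-1 N5-0 N6-0"
row = parse_behavior_line(sample)
print(row["history"])      # ['N1', 'N2', 'N3']
print(row["impressions"])  # [('N4', 1), ('N5', 0), ('N6', 0)]
```

The click history is what the user encoder consumes, and the labeled impressions are what the model is trained and evaluated on.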

---

## Getting set up
Create the virtual environment. I am using Python 3.10.12. Python 3.11 should work as well; Python 3.12 is unlikely to, because TensorFlow 2.15 wheels are published for Python 3.9 through 3.11 only.
```bash
python -m venv ~/nrms
```

Edit your `.bashrc` and add an alias:

```bash
alias nrms='source ~/nrms/bin/activate'
```

Source the `.bashrc` file and activate the `nrms` environment:

```bash
source ~/.bashrc
nrms
```

Install the Python modules needed. First, install `recommenders`:

```bash
pip install recommenders
```

Note that the original NRMS implementation was done with TensorFlow, which caused an awful lot of trouble. The trick is to keep TensorFlow at version 2.15.1, or at least below 2.16 (TensorFlow 2.16 made Keras 3 the default, which is a likely source of the breakage). Below is an example.

```bash
pip install "tensorflow[and-cuda]==2.15.1"
```
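
A quick way to confirm the pin took effect is to check the installed version against the `<2.16` constraint. The sketch below is only an illustration: the helper name `is_supported_tf` is hypothetical, and it reads the version from package metadata so it does not need to import TensorFlow itself.

```python
from importlib import metadata


def is_supported_tf(version: str) -> bool:
    """Return True when the (major, minor) version is below 2.16."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (2, 16)


try:
    installed = metadata.version("tensorflow")
    status = "OK" if is_supported_tf(installed) else "too new for this walkthrough"
    print(f"tensorflow {installed}: {status}")
except metadata.PackageNotFoundError:
    print("tensorflow is not installed in this environment")
```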

---

# NRMS Sequence of Steps


## Do the imports and ensure it works



```python
import time

# Start the timer
start_time = time.time()

# Remove warnings (set before importing tensorflow)
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
import numpy as np
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR')  # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))
```

```text
System version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
Tensorflow version: 2.15.1
```


## Prepare Parameters

```python
epochs = 5
seed = 42
batch_size = 32

# Options: demo, small, large
MIND_type = 'demo'
```

## Download and Load the Data

Note that `tmpdir` can be any directory you want, for example a folder next to the notebook, so that the downloaded data persists between runs.

```python
tmpdir = TemporaryDirectory()
data_path = tmpdir.name
print(f"Data Path is {data_path}")
train_news_file = os.path.join(data_path, 'train', 'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', 'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', 'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', 'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", 'nrms.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url,
                               os.path.join(data_path, 'valid'), mind_dev_dataset)
if not os.path.exists(yaml_file):
    download_deeprec_resources('https://recodatasets.z20.web.core.windows.net/newsrec/',
                               os.path.join(data_path, 'utils'), mind_utils)
```

```text
Data Path is /tmp/tmp5t9unlko

100%|██████████| 17.0k/17.0k [00:01<00:00, 15.8kKB/s]
100%|██████████| 9.84k/9.84k [00:00<00:00, 11.8kKB/s]
100%|██████████| 95.0k/95.0k [00:06<00:00, 15.8kKB/s]
```
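
If you would rather keep the download between runs, one option is to point `data_path` at a regular folder instead of a `TemporaryDirectory`. The sketch below assumes a hypothetical folder name `mind_data` in the current working directory and a hypothetical helper `make_data_paths`; it builds the same file paths the cell above uses.

```python
import os


def make_data_paths(data_path):
    """Build the train/valid/utils file paths under a given data folder."""
    return {
        "train_news_file": os.path.join(data_path, "train", "news.tsv"),
        "train_behaviors_file": os.path.join(data_path, "train", "behaviors.tsv"),
        "valid_news_file": os.path.join(data_path, "valid", "news.tsv"),
        "valid_behaviors_file": os.path.join(data_path, "valid", "behaviors.tsv"),
        "wordEmb_file": os.path.join(data_path, "utils", "embedding.npy"),
        "userDict_file": os.path.join(data_path, "utils", "uid2index.pkl"),
        "wordDict_file": os.path.join(data_path, "utils", "word_dict.pkl"),
        "yaml_file": os.path.join(data_path, "utils", "nrms.yaml"),
    }


# "mind_data" is an assumed folder name; any writable location works.
data_path = os.path.abspath("mind_data")
os.makedirs(data_path, exist_ok=True)
paths = make_data_paths(data_path)
print(paths["train_news_file"])
```

Because the download calls are guarded by `os.path.exists`, a second run against the same folder would skip the downloads.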


## Set Hyperparameters



```python
hparams = prepare_hparams(yaml_file,
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file,
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          show_step=10)
print(hparams)
```

```text
HParams object with values {'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 20, 'head_dim': 20, 'filter_num': 200, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 32, 'show_step': 10, 'title_size': 30, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 300, 'model_type': 'nrms', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/tmp/tmp5t9unlko/utils/embedding.npy', 'wordDict_file': '/tmp/tmp5t9unlko/utils/word_dict.pkl', 'userDict_file': '/tmp/tmp5t9unlko/utils/uid2index.pkl'}
```
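
The `head_num` and `head_dim` values printed above size the multi-head self-attention layers that give NRMS its name. The NumPy sketch below is a generic scaled dot-product version with random weights standing in for learned ones; it is not the `recommenders` implementation, and the function name is hypothetical. It only shows how a title of `title_size` words with `word_emb_dim`-dimensional embeddings turns into vectors of size `head_num * head_dim`.

```python
import numpy as np


def multi_head_self_attention(x, head_num, head_dim, rng):
    """x: (seq_len, emb_dim) -> (seq_len, head_num * head_dim), plus attention weights."""
    seq_len, emb_dim = x.shape
    out_dim = head_num * head_dim
    # Random projections stand in for the learned query/key/value weights.
    w_q, w_k, w_v = (rng.normal(scale=emb_dim ** -0.5, size=(emb_dim, out_dim))
                     for _ in range(3))

    def split_heads(m):
        # (seq_len, out_dim) -> (head_num, seq_len, head_dim)
        return m.reshape(seq_len, head_num, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (head_num, seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    heads = weights @ v                                      # (head_num, seq_len, head_dim)
    # Concatenate the heads back into one vector per word.
    return heads.transpose(1, 0, 2).reshape(seq_len, out_dim), weights


rng = np.random.default_rng(42)
title = rng.normal(size=(30, 300))  # title_size=30 words, word_emb_dim=300
out, weights = multi_head_self_attention(title, head_num=20, head_dim=20, rng=rng)
print(out.shape)      # (30, 400)
print(weights.shape)  # (20, 30, 30)
```

With the values above, every word in a title ends up as a 400-dimensional vector that mixes in information from the other words in the same title.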


## Train the NRMS Model

```python
iterator = MINDIterator
model = NRMSModel(hparams, iterator, seed=seed)
```

```text
  super().__init__(name, **kwargs)
```

```python
print(model.run_eval(valid_news_file, valid_behaviors_file))
```

```text
  updates=self.state_updates,
586it [00:01, 328.30it/s]
236it [00:05, 42.82it/s]
7538it [00:01, 6970.36it/s]

{'group_auc': 0.4794, 'mean_mrr': 0.2059, 'ndcg@5': 0.205, 'ndcg@10': 0.2701}
```
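
For readers unfamiliar with the reported metrics, here is a plain-Python sketch of how AUC, MRR, and nDCG@k can be computed for a single impression. It follows the common definitions (the "group" and "mean" variants then average over impressions); the library's exact implementation may differ in details such as tie handling. The labels and scores are made up.

```python
import math


def rank_labels(labels, scores):
    """Return the labels reordered from highest to lowest score."""
    return [label for _, label in sorted(zip(scores, labels), key=lambda t: -t[0])]


def auc(labels, scores):
    """Share of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels, scores):
    """Average reciprocal rank of the clicked items."""
    ranked = rank_labels(labels, scores)
    return sum(l / (i + 1) for i, l in enumerate(ranked)) / sum(ranked)


def dcg_at_k(ranked, k):
    return sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ranked[:k]))


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted order divided by DCG of the ideal order."""
    return dcg_at_k(rank_labels(labels, scores), k) / dcg_at_k(sorted(labels, reverse=True), k)


labels = [1, 0, 0, 1]          # made-up clicks for four candidates
scores = [0.2, 0.9, 0.5, 0.7]  # made-up model scores
print(auc(labels, scores))     # 0.25
print(mrr(labels, scores))     # 0.375
print(round(ndcg_at_k(labels, scores, 5), 4))
```

An AUC close to 0.5, like the 0.4794 reported for the untrained model, is what random scoring would give.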


```python
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)
```

```text
step 1080 , total_loss: 1.5162, data_loss: 1.4075: : 1086it [01:44, 10.42it/s]
586it [00:01, 566.64it/s]
236it [00:05, 44.79it/s]
7538it [00:02, 3561.71it/s]

at epoch 1
train info: logloss loss:1.5155886469844515
eval info: group_auc:0.5796, mean_mrr:0.2469, ndcg@10:0.334, ndcg@5:0.2637
at epoch 1 , train time: 104.2 eval time: 16.7

step 1080 , total_loss: 1.4199, data_loss: 1.3086: : 1086it [01:40, 10.81it/s]
586it [00:00, 596.59it/s]
236it [00:05, 44.12it/s]
7538it [00:00, 8173.04it/s]

at epoch 2
train info: logloss loss:1.4201494831845685
eval info: group_auc:0.5983, mean_mrr:0.2562, ndcg@10:0.3457, ndcg@5:0.2719
at epoch 2 , train time: 100.4 eval time: 15.5

step 1080 , total_loss: 1.3765, data_loss: 1.1727: : 1086it [01:41, 10.68it/s]
586it [00:00, -1901.00it/s]
236it [00:05, 46.28it/s]
7538it [00:01, 6366.39it/s]

at epoch 3
train info: logloss loss:1.3764338079517497
eval info: group_auc:0.6118, mean_mrr:0.2662, ndcg@10:0.3598, ndcg@5:0.2869
at epoch 3 , train time: 101.6 eval time: 14.3

step 1080 , total_loss: 1.3522, data_loss: 1.2181: : 1086it [01:42, 10.64it/s]
586it [00:00, 616.58it/s]
236it [00:05, 44.98it/s]
7538it [00:00, 8096.35it/s]

at epoch 4
train info: logloss loss:1.3522112568219502
eval info: group_auc:0.6123, mean_mrr:0.2689, ndcg@10:0.3618, ndcg@5:0.2908
at epoch 4 , train time: 102.1 eval time: 13.7

step 1080 , total_loss: 1.3313, data_loss: 1.3902: : 1086it [01:41, 10.65it/s]
586it [00:01, 571.32it/s]
236it [00:05, 45.42it/s]
7538it [00:01, 6528.88it/s]

at epoch 5
train info: logloss loss:1.3314230540290501
eval info: group_auc:0.611, mean_mrr:0.268, ndcg@10:0.3606, ndcg@5:0.2879
at epoch 5 , train time: 101.9 eval time: 14.0

<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7f058e2466b0>
```
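
To put the logged loss values in context: with `npratio` set to 4, each training sample is assumed here to pair one clicked article with four sampled non-clicked ones, and the loss is assumed to be a softmax cross-entropy over those five candidates, as the NRMS paper describes. Under that assumption, a model that scores all candidates equally has a loss of ln(5), about 1.609, so the drop from roughly 1.52 to 1.33 is measured against that starting point. The sketch below uses hypothetical helper names and made-up news IDs.

```python
import math
import random


def sample_training_candidates(clicked, not_clicked, npratio, rng):
    """Return one clicked item followed by `npratio` sampled non-clicked items."""
    # Assumes the impression has at least `npratio` non-clicked items.
    negatives = rng.sample(not_clicked, npratio)
    return [clicked] + negatives


def softmax_cross_entropy(scores, label_index):
    """Negative log of the softmax probability assigned to the correct candidate."""
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[label_index]


rng = random.Random(42)
candidates = sample_training_candidates("N4", ["N5", "N6", "N7", "N8", "N9"], 4, rng)
print(candidates[0], len(candidates))  # N4 5

uniform = softmax_cross_entropy([0.0] * 5, label_index=0)
print(round(uniform, 3))               # 1.609

# Scoring the clicked item higher than the rest lowers the loss.
print(softmax_cross_entropy([2.0, 0.0, 0.0, 0.0, 0.0], label_index=0) < uniform)  # True
```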

```python
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:MM:SS format (zero-padded minutes and seconds)
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")
```

```text
Elapsed time: 0:10:15
```


```python
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)
```

```text
586it [00:01, 519.65it/s]
236it [00:05, 44.95it/s]
7538it [00:00, 8105.95it/s]

{'group_auc': 0.611, 'mean_mrr': 0.268, 'ndcg@5': 0.2879, 'ndcg@10': 0.3606}
CPU times: user 17.8 s, sys: 19.3 s, total: 37.1 s
Wall time: 15.6 s
```


```python
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])
```


## Save the model

```python
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))
```

## Output Prediction File
This code segment generates the `prediction.zip` file in the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Change the `MIND_type` parameter to `large` if you want to submit your predictions to the [MIND Competition](https://msnews.github.io/competition.html).

```python
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)
```

```text
586it [00:01, 483.57it/s]
236it [00:05, 45.29it/s]
7538it [00:01, 7458.90it/s]
```


```python
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank]) + '\n')
```

```text
7538it [00:00, 54738.09it/s]
```
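
The double `argsort` in the cell above turns scores into ranks: each candidate gets its 1-based position in the score ordering, with rank 1 for the highest score. The plain-Python sketch below computes the same thing for scores without ties and formats one output line; the helper names and the scores are made up for illustration.

```python
def prediction_ranks(preds):
    """Return the 1-based rank of each candidate; rank 1 is the highest score."""
    # Candidate indexes ordered from highest to lowest score.
    order = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    ranks = [0] * len(preds)
    for position, candidate in enumerate(order, start=1):
        ranks[candidate] = position
    return ranks


def format_prediction_line(impr_index, preds):
    """Format one line of prediction.txt: '<impression id> [r1,r2,...]'."""
    ranks = prediction_ranks(preds)
    return f"{impr_index} [{','.join(str(r) for r in ranks)}]"


# Three candidates: the second has the highest score, the first the lowest.
print(format_prediction_line(1, [0.1, 0.9, 0.5]))  # 1 [3,1,2]
```

Note that the ranks stay in the original candidate order, so the file says "candidate 1 is ranked third", not "the third candidate comes first".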


```python
# The context manager closes the archive even if writing fails
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')
```

## References

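- Wu et al., "Neural News Recommendation with Multi-Head Self-Attention", EMNLP 2019: https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf
- Recommenders repository: https://github.com/recommenders-team/recommenders/tree/main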

```python
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:MM:SS format (zero-padded minutes and seconds)
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")
```

```text
Elapsed time: 0:10:39
```
