# CICERO -- NRMS Demo Dataset Adapted

This notebook is a run of the NRMS notebook, adapted from the [Recommenders Team](https://github.com/recommenders-team/recommenders/tree/main) repository. It has been modified because, over time, some things stopped working as originally presented. The original notebook worked with the `demo` dataset but not with the `small` or `large` datasets, which needed extra work.

NRMS stands for `Neural News Recommendation with Multi-head Self-Attention`. The reference to the paper is provided below. Please look over and study the [recommenders team GitHub repository](https://github.com/recommenders-team/recommenders/blob/main/README.md#Getting-Started), because this code is derived from it.

Read up on the [MIND: MIcrosoft News Dataset](https://msnews.github.io/), then download the data and place it in the datasets folder of this project, `/app/datasets`. Review the [Readme](/app/datasets/README.md).


## Notebook and GPU Check

Run the following cell to print the Python and TensorFlow versions and to check whether a GPU is detected.


In [None]:
import time

# Start the timer
start_time = time.time()

# Suppress TensorFlow log noise; set these before importing TensorFlow
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
import numpy as np
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata


print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# List available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(device)

# Check if a GPU is detected
if tf.config.list_physical_devices('GPU'):
    print("GPU is available and TensorFlow is using it.")
else:
    print("GPU is NOT available. TensorFlow is using the CPU.")



## Prepare Parameters

Adjust these as needed; the defaults below are fine for now.

In [None]:
epochs = 5
seed = 42
batch_size = 64

# Options: MINDdemo, MINDsmall, MINDlarge
MIND_type = 'MINDdemo'
data_path_base = "/app/datasets/"
data_path = os.path.join(data_path_base, MIND_type)

## Download and Load the Data

The [original NRMS MIND notebook](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb) was written in 2021, and its download links for the data are no longer accessible. Download both the small and large datasets from the MIND dataset website instead.

The `demo` dataset can still be used, but the `small` and `large` datasets are unavailable through the original links. This notebook was therefore modified to use a local copy of the datasets and supporting files. Note that several files in addition to the data files are necessary.


## Data Paths

The original NRMS MIND notebook did a lot of things for you; the `utils` folder contains the files it generated. Remember that the data has been downloaded ahead of time. In case you are wondering, the `.pkl` and `.npy` files were obtained by running the original `nrms_MIND` notebook with the `demo` dataset.

GloVe embeddings and NLTK data are not provided here because the `demo` dataset run downloads them itself.
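
Because everything is read from a local copy, it can help to confirm the expected files are in place before building the model. The following is a minimal sketch, assuming the layout implied by the path definitions in the next cell; `REQUIRED_FILES` and `find_missing_files` are hypothetical names introduced here for illustration.

```python
import os

# Relative paths this notebook reads, taken from the path definitions
# in the following cell. The list name is a hypothetical helper.
REQUIRED_FILES = [
    os.path.join("train", "news.tsv"),
    os.path.join("train", "behaviors.tsv"),
    os.path.join("valid", "news.tsv"),
    os.path.join("valid", "behaviors.tsv"),
    os.path.join("utils", "embedding.npy"),
    os.path.join("utils", "uid2index.pkl"),
    os.path.join("utils", "word_dict.pkl"),
    os.path.join("utils", "nrms.yaml"),
]


def find_missing_files(dataset_dir):
    """Return the required relative paths that are absent under dataset_dir."""
    return [
        rel
        for rel in REQUIRED_FILES
        if not os.path.isfile(os.path.join(dataset_dir, rel))
    ]


# Example: check the demo dataset directory used in this notebook.
missing = find_missing_files("/app/datasets/MINDdemo")
if missing:
    print("Missing files:", missing)
else:
    print("All required files are present.")
```

An empty result means every file referenced later in the notebook was found; anything listed needs to be copied into place first.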

In [None]:
# Create the directory if it doesn't exist (it should already be there)
os.makedirs(data_path, exist_ok=True)

print(f"Data Path is {data_path}")
train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'nrms.yaml')


## Set Hyperparameters
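
The cell below passes both a YAML file and keyword arguments such as `batch_size` and `epochs`. As a rough illustration only, the sketch here assumes the YAML settings are grouped into sections that get flattened into one dictionary, with keyword arguments taking precedence. The function names and the example values are hypothetical and are not the library's code or the contents of `nrms.yaml`.

```python
def flatten_sections(config):
    """Merge the settings of every top-level section into one flat dict."""
    flat = {}
    for section in config.values():
        flat.update(section)
    return flat


def build_settings(config, **overrides):
    """Flatten the sectioned config, then let keyword overrides win."""
    settings = flatten_sections(config)
    settings.update(overrides)
    return settings


# Hypothetical stand-in for a parsed YAML file; values are made up.
example_config = {
    "model": {"head_num": 20, "head_dim": 20},
    "train": {"batch_size": 32, "epochs": 10},
}

settings = build_settings(example_config, batch_size=64, epochs=5)
print(settings)
```

Under this assumption, values set in the "Prepare Parameters" cell replace whatever the YAML file specifies for the same keys, while untouched keys keep their file values.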



In [None]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          show_step=10)
print(hparams)

## Train the NRMS Model

In [None]:
# Print each hyperparameter on its own line
for key, value in hparams.values().items():
    print(f"{key} = {value}")

In [None]:
iterator = MINDIterator

In [None]:
model = NRMSModel(hparams, iterator, seed=seed)

### Run the model without any training.
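
The evaluation reports `group_auc`, `mean_mrr`, `ndcg@5`, and `ndcg@10`. To give a feel for two of these, the sketch below computes a reciprocal-rank score and nDCG@k for a single impression using common textbook definitions. The function names are hypothetical, the sketch assumes at least one clicked item per impression, and the library's implementation may differ in detail.

```python
import math


def reciprocal_rank_score(labels, scores):
    """Average of 1/rank over the clicked items, ranking by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    total = sum(label / (pos + 1) for pos, label in enumerate(ranked_labels))
    return total / sum(ranked_labels)


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted ordering divided by DCG of the ideal ordering."""

    def dcg(ordered_labels):
        return sum(
            (2 ** label - 1) / math.log2(pos + 2)
            for pos, label in enumerate(ordered_labels[:k])
        )

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_labels = sorted(labels, reverse=True)
    return dcg(ranked_labels) / dcg(ideal_labels)


# One impression with three candidates; only the first one was clicked,
# but the model scored it lowest.
labels = [1, 0, 0]
scores = [0.2, 0.9, 0.5]
print(reciprocal_rank_score(labels, scores))
print(ndcg_at_k(labels, scores, k=5))
```

An untrained model is expected to rank clicked items more or less at random, so the numbers from this first evaluation serve as a baseline for the trained model.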

In [None]:
print(model.run_eval(valid_news_file, valid_behaviors_file))

In [None]:
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

In [None]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:MM:SS format (zero-padded minutes and seconds)
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")

In [None]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)


In [None]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])

## Save the model

In [None]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

## Output Prediction File
This code segment generates the `prediction.zip` file, which follows the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `MINDlarge` if you want to submit your predictions to the [MIND Competition](https://msnews.github.io/competition.html).
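
Each line of `prediction.txt` holds an impression index followed by the rank of every candidate in that impression, where rank 1 goes to the highest score. The cells below compute this with a double `np.argsort`; the sketch here reproduces the same idea in plain Python for distinct scores. The helper names are hypothetical and exist only for illustration.

```python
def ranks_from_scores(scores):
    """Return the 1-based rank of each candidate; highest score gets rank 1."""
    # Candidate positions ordered from highest score to lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, candidate in enumerate(order, start=1):
        ranks[candidate] = rank
    return ranks


def format_prediction_line(impr_index, scores):
    """Build one line in the form '<impression index> [r1,r2,...]'."""
    ranks = ranks_from_scores(scores)
    return f"{impr_index} [{','.join(str(r) for r in ranks)}]"


# Three candidates: the second has the highest score, the first the lowest.
example_line = format_prediction_line(1, [0.2, 0.9, 0.5])
print(example_line)  # prints: 1 [3,1,2]
```

Note that the ranks stay in the original candidate order; only the numbers change, which is why the result is not simply a sorted list.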

In [None]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

In [None]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

In [None]:
# Use a context manager so the archive is closed even if writing fails
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

## References

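NRMS paper, Neural News Recommendation with Multi-head Self-Attention (EMNLP 2019): https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf

Recommenders Team repository: https://github.com/recommenders-team/recommenders/tree/main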

In [None]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:MM:SS format (zero-padded minutes and seconds)
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")