# CICERO -- NRMS Large Dataset Adapted

This notebook is a run of the NRMS notebook, adapted from the [Recommenders Team](https://github.com/recommenders-team/recommenders/tree/main). It has been modified because, over time, some parts of the original stopped working as presented.

NRMS stands for `Neural News Recommendation with Multi-head Self-Attention`. The reference to the paper is provided below. Please study the [recommenders team GitHub repository](https://github.com/recommenders-team/recommenders/blob/main/README.md#Getting-Started) because this code is derived from it.

Read up on the [MIND: MIcrosoft News Dataset](https://msnews.github.io/), download the data, and place it in this project's datasets folder, which is mounted at `/app/datasets`. Review the [Readme](/app/datasets/README.md).

## Do imports and check things out

Are we using a GPU? If not then things will take a very long time.


In [1]:
import time

# Start the timer
start_time = time.time()

# Remove warnings
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
# Gets the SUBERX python modules included.
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

import numpy as np
import pandas as pd
import pickle
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

# This module was created to generate the necessary files if they don't already exist
from environment.data_utils import generate_uid2index, load_glove_embeddings, create_word_dict_and_embeddings, setup_nltk_resources

from nltk.tokenize import word_tokenize
import nltk
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.datasets.mind import download_and_extract_glove
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# List available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(device)

# Check if a GPU is detected
if tf.config.list_physical_devices('GPU'):
    print("GPU is available and TensorFlow is using it.")
else:
    print("GPU is NOT available. TensorFlow is using the CPU.")




System version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
Tensorflow version: 2.15.1
Available devices:
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
GPU is available and TensorFlow is using it.


## Prepare Parameters

Adjust these as needed. 



In [2]:
epochs = 5
seed = 42
batch_size = 64


# My modification for MINDsmall, since I mount the data into the container. Also, the small dataset is no longer accessible.

# Options: MINDdemo, MINDsmall, MINDlarge

MIND_type = 'MINDlarge'

## Specify the dataset to use

There are three in the original notebook: `demo`, `small` and `large`.

According to the [original notebook](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb), the `demo` set is a 5000-sample subset of the `small` dataset.

I was able to download the original demo set, but it is not really needed. A sample of 5000 is enough to show that the algorithm works, but not enough to train it well.


In [3]:
## This is how it was done originally
#tmpdir = TemporaryDirectory()
#data_path = tmpdir.name

## I mount in the datasets folder to /app
data_path_base="/app/datasets/"
data_path = data_path_base + MIND_type

# Create the directory if it doesn't exist
os.makedirs(data_path, exist_ok=True)

print(f"Data Path is {data_path}")

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
dev_news_file = os.path.join(data_path,'dev','news.tsv')
dev_behaviors_file = os.path.join(data_path,'dev','behaviors.tsv')

wordEmb_file = os.path.join(data_path, 'utils',"embedding.npy")
userDict_file = os.path.join(data_path, 'utils',"uid2index.pkl")
wordDict_file = os.path.join(data_path, 'utils',"word_dict.pkl")
yaml_file = os.path.join(data_path, "utils",'nrms.yaml')


Data Path is /app/datasets/MINDlarge
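Training on MINDlarge runs for hours, so it may be worth confirming that the expected files are present before going further. The following is a minimal sketch and not part of the original notebook; `find_missing` is a hypothetical helper name.

```python
import os

def find_missing(paths):
    """Return the subset of `paths` that are not existing files."""
    return [p for p in paths if not os.path.isfile(p)]

# Hypothetical usage with the variables defined in the cell above:
# missing = find_missing([train_news_file, train_behaviors_file,
#                         valid_news_file, valid_behaviors_file,
#                         dev_news_file, dev_behaviors_file])
# if missing:
#     print("Missing files:", missing)
```

An empty list means every path resolved to a file.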


## Download Glove embeddings

The original NRMS notebook used GloVe embeddings. This cell downloads and extracts them.

In [4]:
## All models can use the same glove embeddings

# Download and extract GloVe embeddings
glove_dir = data_path_base +'glove_embeddings'
download_and_extract_glove(glove_dir)
print(f"GloVe embeddings extracted to: {glove_dir}")

GloVe embeddings extracted to: /app/datasets/glove_embeddings


## Get the NLTK Data

In [5]:
nltk_data_dir = data_path_base + 'nltk_data'
setup_nltk_resources(download_dir=nltk_data_dir)
nltk.data.path.append(nltk_data_dir)

NLTK resources downloaded to /app/datasets/nltk_data


[nltk_data] Downloading package punkt to /app/datasets/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /app/datasets/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Identify the news and behavior files

In [6]:
# Generate uid2index.pkl
uid2index = generate_uid2index(train_behaviors_file, output_file=userDict_file)

Created /app/datasets/MINDlarge/utils/uid2index.pkl with 711222 users.
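`generate_uid2index` comes from this project's `environment.data_utils` module, whose source is not shown here. As a rough sketch of what such a helper might do, assuming the MIND `behaviors.tsv` layout (tab-separated impression ID, user ID, time, click history, impressions) and assuming indices start at 1 so that 0 can stand for unknown users. The name `build_uid2index` and the sample rows are made up for illustration.

```python
def build_uid2index(lines):
    """Map each user ID to a 1-based index in order of first appearance."""
    uid2index = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        uid = fields[1]
        if uid not in uid2index:
            uid2index[uid] = len(uid2index) + 1  # assumption: 0 is reserved
    return uid2index

# Made-up rows in the behaviors.tsv column order
sample = [
    "1\tU100\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0\n",
    "2\tU200\t11/12/2019 6:11:30 PM\tN5\tN6-0 N7-1\n",
    "3\tU100\t11/13/2019 8:00:00 AM\tN1\tN8-1\n",
]
print(build_uid2index(sample))  # {'U100': 1, 'U200': 2}
```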


In [7]:
# Path to GloVe file (e.g., glove.6B.300d.txt)
glove_file = os.path.join(glove_dir, "glove/glove.6B.300d.txt")
embedding_dim = 300
glove_embeddings = load_glove_embeddings(glove_file, embedding_dim=embedding_dim)


glove_embeddings.pkl already exists. Loading embeddings from file.
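`load_glove_embeddings` is also a project helper that is not shown. The GloVe text files store one token per line followed by its vector components, separated by spaces, so a loader could look roughly like the sketch below. `parse_glove_lines` is a hypothetical name, and the caching to `glove_embeddings.pkl` seen in the output is left out.

```python
def parse_glove_lines(lines, embedding_dim):
    """Parse GloVe-style text lines into a {token: [float, ...]} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if len(values) != embedding_dim:
            continue  # skip rows whose vector length is unexpected
        embeddings[token] = [float(v) for v in values]
    return embeddings

# Toy 3-dimensional example (the file used above has 300 dimensions)
toy = ["the 0.1 0.2 0.3", "news -0.5 0.0 0.25", "broken 1.0"]
vectors = parse_glove_lines(toy, embedding_dim=3)
print(sorted(vectors))  # ['news', 'the']
```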


In [8]:
# Generate word_dict.pkl and embedding.npy
word_dict, embedding_matrix = create_word_dict_and_embeddings(
    news_file= train_news_file,
    glove_embeddings=glove_embeddings,
    embedding_dim=embedding_dim,
    output_dir= data_path + '/utils/'
)


Tokenizing text: 100% 101527/101527 [00:32<00:00, 3152.71it/s]
Creating embedding matrix: 100% 101242/101242 [00:00<00:00, 159118.75it/s]


Saved word_dict.pkl and embedding.npy to /app/datasets/MINDlarge/utils/.
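The source of `create_word_dict_and_embeddings` is not shown either. The sketch below illustrates one way a word dictionary and embedding matrix could be built. It assumes index 0 is reserved for padding, assumes words without a GloVe vector keep an all-zero row (the real helper may initialise them differently), and uses `str.split` in place of NLTK's `word_tokenize` to stay dependency-light. `build_word_dict_and_matrix` is a hypothetical name.

```python
import numpy as np

def build_word_dict_and_matrix(titles, glove, embedding_dim):
    """Return (word -> index dict, embedding matrix with row 0 as padding)."""
    word_dict = {}
    for title in titles:
        for token in title.lower().split():
            if token not in word_dict:
                word_dict[token] = len(word_dict) + 1  # 0 is kept for padding
    matrix = np.zeros((len(word_dict) + 1, embedding_dim), dtype=np.float32)
    for token, idx in word_dict.items():
        if token in glove:
            matrix[idx] = glove[token]  # words missing from GloVe stay zero
    return word_dict, matrix

# Toy 2-dimensional vectors, made up for illustration
glove = {"news": [0.5, 0.25], "today": [1.0, -1.0]}
word_dict, matrix = build_word_dict_and_matrix(["News today", "Local news"], glove, 2)
print(word_dict)     # {'news': 1, 'today': 2, 'local': 3}
print(matrix.shape)  # (4, 2)
```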


## Define the hyper-parameters

In [9]:
hparams = prepare_hparams(None, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          model_type="nrms",
                          title_size=30,
                          his_size=50,
                          npratio=4,
                          data_format='news',
                          word_emb_dim=300,
                          head_num=20,
                          head_dim=20,
                          attention_hidden_dim=200,
                          loss='cross_entropy_loss',
                          dropout=0.2,
                          support_quick_scoring=True,
                          show_step=10, 
                          metrics=['group_auc', 'mean_mrr', 'ndcg@5;10'])
#print(hparams)

In [10]:
# Print every hyper-parameter as "name = value"
for key, value in hparams.values().items():
    print(f"{key} = {value}")

support_quick_scoring = True
dropout = 0.2
attention_hidden_dim = 200
head_num = 20
head_dim = 20
filter_num = 200
window_size = 3
vert_emb_dim = 100
subvert_emb_dim = 100
gru_unit = 400
type = ini
user_emb_dim = 50
learning_rate = 0.001
optimizer = adam
epochs = 5
batch_size = 64
show_step = 10
wordEmb_file = /app/datasets/MINDlarge/utils/embedding.npy
wordDict_file = /app/datasets/MINDlarge/utils/word_dict.pkl
userDict_file = /app/datasets/MINDlarge/utils/uid2index.pkl
model_type = nrms
title_size = 30
his_size = 50
npratio = 4
data_format = news
word_emb_dim = 300
loss = cross_entropy_loss
metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']


In [11]:
iterator = MINDIterator
model = NRMSModel(hparams, iterator, seed=seed)



### Run the model without any training -- can training do better than no training at all?  

In [13]:
print(model.run_eval(dev_news_file, dev_behaviors_file))

1126it [00:04, 240.65it/s]
5883it [04:27, 21.97it/s]
376471it [01:16, 4945.64it/s] 


{'group_auc': 0.4628, 'mean_mrr': 0.2005, 'ndcg@5': 0.2019, 'ndcg@10': 0.2641}
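For readers unfamiliar with the metrics above: `mean_mrr` averages, over impressions, the reciprocal rank of the clicked items. The sketch below computes it for a single impression in plain Python. It is an illustration, not code taken from the `recommenders` source, and it assumes the impression contains at least one click.

```python
def mrr_for_impression(labels, scores):
    """Mean reciprocal rank of the clicked items in one impression."""
    # Candidate indices sorted from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    # Each clicked item contributes 1 / (its 1-based rank)
    reciprocal = sum(label / (rank + 1) for rank, label in enumerate(ranked_labels))
    return reciprocal / sum(labels)

# Clicked items land at ranks 1 and 4: (1/1 + 1/4) / 2
print(mrr_for_impression([0, 1, 0, 1], [0.2, 0.9, 0.4, 0.1]))  # 0.625
```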


In [14]:
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

step 52870 , total_loss: 1.2507, data_loss: 1.0388: : 52870it [2:50:47,  5.16it/s]
1126it [00:02, 421.61it/s]
5883it [04:06, 23.82it/s]
376471it [01:12, 5184.24it/s] 


at epoch 1
train info: logloss loss:1.2507364407173285
eval info: group_auc:0.6716, mean_mrr:0.317, ndcg@10:0.4158, ndcg@5:0.3521
at epoch 1 , train time: 10247.3 eval time: 805.4


step 52870 , total_loss: 1.2107, data_loss: 1.1084: : 52870it [2:47:04,  5.27it/s]
1126it [00:02, 423.83it/s]
5883it [04:05, 23.93it/s]
376471it [01:42, 3656.45it/s] 


at epoch 2
train info: logloss loss:1.210744759015963
eval info: group_auc:0.6602, mean_mrr:0.3094, ndcg@10:0.4069, ndcg@5:0.342
at epoch 2 , train time: 10024.7 eval time: 825.6


step 52870 , total_loss: 1.2359, data_loss: 1.2254: : 52870it [2:47:17,  5.27it/s]
1126it [00:02, 452.32it/s]
5883it [04:05, 23.92it/s]
376471it [00:57, 6580.91it/s]


at epoch 3
train info: logloss loss:1.2359015685287207
eval info: group_auc:0.6398, mean_mrr:0.2897, ndcg@10:0.3864, ndcg@5:0.3191
at epoch 3 , train time: 10037.8 eval time: 782.0


step 52870 , total_loss: 1.3391, data_loss: 1.6094: : 52870it [2:46:18,  5.30it/s] 
1126it [00:02, 449.76it/s]
5883it [04:03, 24.15it/s]
376471it [00:54, 6915.08it/s]


at epoch 4
train info: logloss loss:1.3390802287854746
eval info: group_auc:0.5367, mean_mrr:0.2237, ndcg@10:0.2989, ndcg@5:0.2383
at epoch 4 , train time: 9978.1 eval time: 774.7


step 52870 , total_loss: 1.6223, data_loss: 1.6094: : 52870it [2:45:54,  5.31it/s] 
1126it [00:02, 452.79it/s]
5883it [04:03, 24.16it/s]
376471it [00:55, 6826.03it/s]


at epoch 5
train info: logloss loss:1.6222897745112244
eval info: group_auc:0.5269, mean_mrr:0.2306, ndcg@10:0.3024, ndcg@5:0.2401
at epoch 5 , train time: 9954.1 eval time: 776.0


<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7fbc941e2560>
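One detail in the log above is worth a second look: the last-step `data_loss` in epochs 4 and 5 is 1.6094. With `npratio=4`, each training sample has 5 candidates, and the cross-entropy of a uniform prediction over 5 candidates is ln 5. The arithmetic below checks that value. Reading it as a sign that the model's scores collapsed toward a constant is an interpretation, not something the notebook verifies.

```python
import math

npratio = 4                   # negatives per positive, as set in hparams
num_candidates = npratio + 1  # one clicked item plus the sampled negatives

# Cross-entropy when every candidate gets probability 1 / num_candidates
uniform_loss = -math.log(1.0 / num_candidates)
print(f"{uniform_loss:.4f}")  # 1.6094
```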

In [15]:
res_syn = model.run_eval(dev_news_file, dev_behaviors_file)
print(res_syn)

1126it [00:02, 404.73it/s]
5883it [04:02, 24.23it/s]
376471it [00:55, 6766.99it/s]


{'group_auc': 0.5269, 'mean_mrr': 0.2306, 'ndcg@5': 0.2401, 'ndcg@10': 0.3024}
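This result matches the epoch 5 figures in the training log, where `group_auc` peaked at epoch 1 (0.6716) and fell in every later epoch. The sketch below tabulates the logged values to pick the best epoch. The numbers are copied from the output above; the selection logic is an illustration and is not something the notebook does.

```python
# Validation metrics copied from the training log above
epoch_metrics = {
    1: {"group_auc": 0.6716, "mean_mrr": 0.3170},
    2: {"group_auc": 0.6602, "mean_mrr": 0.3094},
    3: {"group_auc": 0.6398, "mean_mrr": 0.2897},
    4: {"group_auc": 0.5367, "mean_mrr": 0.2237},
    5: {"group_auc": 0.5269, "mean_mrr": 0.2306},
}

# Pick the epoch with the highest group_auc
best_epoch = max(epoch_metrics, key=lambda e: epoch_metrics[e]["group_auc"])
print(best_epoch, epoch_metrics[best_epoch])
```

On these numbers, training for a single epoch (or saving weights after each epoch and keeping the best) would likely have given a better final model than the five-epoch run.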


In [16]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])

## Save the model

In [17]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

## Output Prediction File
This code segment generates the prediction file in the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Set the `MIND_type` parameter to `MINDlarge` (as it is in this run) if you want to submit your prediction to the [MIND Competition](https://msnews.github.io/competition.html).
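Each line of the prediction file holds an impression index followed by the rank of every candidate, with rank 1 for the highest score. The cells below produce those ranks with a double `argsort`, which can be checked on a toy example. The scores here are made up.

```python
import numpy as np

preds = np.array([0.1, 0.9, 0.5])  # made-up scores for three candidates

# Indices of the candidates from highest to lowest score
descending_order = np.argsort(preds)[::-1]
# A second argsort turns that ordering into each candidate's 1-based rank
pred_rank = (np.argsort(descending_order) + 1).tolist()

line = " ".join(["1", "[" + ",".join(str(r) for r in pred_rank) + "]"])
print(line)  # 1 [3,1,2]
```

The candidate with score 0.9 gets rank 1, and the candidate with score 0.1 gets rank 3.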

In [18]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

1126it [00:02, 399.75it/s]
5883it [04:03, 24.20it/s]
376471it [00:55, 6812.36it/s]


In [19]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

376471it [00:06, 59852.85it/s]
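The text above refers to a zipped submission, but this cell only writes `prediction.txt`. Below is a sketch of the zipping step using the standard `zipfile` module, which the notebook already imports. `zip_prediction` is a hypothetical helper, and the archive layout (a single `prediction.txt` at the top level) is an assumption to check against the competition's submission guidelines.

```python
import os
import zipfile

def zip_prediction(txt_path, zip_path):
    """Write `txt_path` into a new zip archive under its base file name."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(txt_path, arcname=os.path.basename(txt_path))
    return zip_path

# Hypothetical usage with the notebook's data_path:
# zip_prediction(os.path.join(data_path, "prediction.txt"),
#                os.path.join(data_path, "prediction.zip"))
```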


## References

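- Wu et al., "Neural News Recommendation with Multi-Head Self-Attention", EMNLP 2019: https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf
- Recommenders Team repository: https://github.com/recommenders-team/recommenders/tree/main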

In [20]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:MM:SS format, zero-padding minutes and seconds
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")

Elapsed time: 15:36:51
