# CICERO -- NRMS Small Dataset Adapted to SUBER: First Steps

This notebook is a run of the NRMS notebook from the [Recommenders Team](https://github.com/recommenders-team/recommenders/tree/main), adapted for this project. It has been modified because, over time, some parts stopped working as originally presented.

NRMS stands for `Neural News Recommendation with Multi-Head Self-Attention`. The reference to the paper is provided below. Please look over and study the [recommenders team github repository](https://github.com/recommenders-team/recommenders/blob/main/README.md#Getting-Started) because this code is derived from it.

Read up on [MIND: MIcrosoft News Dataset](https://msnews.github.io/), then download the data and place it in the datasets folder of this project, which is mounted at `/app/datasets`. Review the [Readme](/app/datasets/README.md).

## Do imports and check things out

Are we using a GPU? If not then things will take a very long time.


In [1]:
import time

# Start the timer
start_time = time.time()

# Remove warnings
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
# Gets the SUBERX python modules included.
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

import numpy as np
import pandas as pd
import pickle
import zipfile
import random
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

# This module was created to generate the necessary files if they don't already exist
from environment.data_utils import generate_uid2index, load_glove_embeddings, create_word_dict_and_embeddings, setup_nltk_resources

from nltk.tokenize import word_tokenize
import nltk
from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.datasets.mind import download_and_extract_glove
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# List available devices
print("Available devices:")
for device in tf.config.list_physical_devices():
    print(device)

# Check if a GPU is detected
if tf.config.list_physical_devices('GPU'):
    print("GPU is available and TensorFlow is using it.")
else:
    print("GPU is NOT available. TensorFlow is using the CPU.")



System version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
Tensorflow version: 2.15.1
Available devices:
PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
GPU is available and TensorFlow is using it.


## Prepare Parameters

Adjust these as needed. 



In [2]:
epochs = 5
seed = 42
batch_size = 64


# My modification for MINDsmall: I mount the data into the container, because the small dataset is no longer accessible for download.

# Options: MINDdemo, MINDsmall, MINDlarge

MIND_type = 'MINDsmall'

## Specify the dataset to use

There are three in the original notebook: `demo`, `small` and `large`.

From the [original notebook](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb), the `demo` set is 5000 samples of the `small` dataset.

I was able to download the original demo but that's not really needed. A sample of 5000 is good for showing the algorithm works but won't train it well enough.  

In [3]:
## This is how it was done originally
#tmpdir = TemporaryDirectory()
#data_path = tmpdir.name

## I mount in the datasets folder to /app
data_path_base="/app/datasets/"
data_path = data_path_base + MIND_type

# Create the directory if it doesn't exist
os.makedirs(data_path, exist_ok=True)

print(f"Data Path is {data_path}")

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
dev_news_file = os.path.join(data_path,'dev','news.tsv')
dev_behaviors_file = os.path.join(data_path,'dev','behaviors.tsv')

wordEmb_file = os.path.join(data_path, 'utils',"embedding.npy")
userDict_file = os.path.join(data_path, 'utils',"uid2index.pkl")
wordDict_file = os.path.join(data_path, 'utils',"word_dict.pkl")
yaml_file = os.path.join(data_path, "utils",'nrms.yaml')


Data Path is /app/datasets/MINDsmall
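
Because the data is mounted into the container rather than downloaded by the notebook, it can help to confirm that the files referenced above actually exist before going further. The following is a minimal sketch; `expected_mind_files` and `find_missing_files` are hypothetical helpers (not part of `recommenders` or this project), and the default split names simply mirror the `train`, `valid` and `dev` folders used in the paths above.

```python
import os


def expected_mind_files(data_path, splits=("train", "valid", "dev")):
    """List the news and behaviors files expected under each split folder."""
    return [
        os.path.join(data_path, split, name)
        for split in splits
        for name in ("news.tsv", "behaviors.tsv")
    ]


def find_missing_files(paths):
    """Return the subset of `paths` that does not exist as a regular file."""
    return [path for path in paths if not os.path.isfile(path)]
```

Calling `find_missing_files(expected_mind_files(data_path))` returns an empty list when every split is in place; anything it returns is a path to fix before training.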


## Download Glove embeddings

The original NRMS notebook used GloVe embeddings; this cell downloads and extracts them.

In [4]:
## All models can use the same glove embeddings

# Download and extract GloVe embeddings
glove_dir = data_path_base +'glove_embeddings'
download_and_extract_glove(glove_dir)
print(f"GloVe embeddings extracted to: {glove_dir}")




GloVe embeddings extracted to: /app/datasets/glove_embeddings


## Get the NLTK Data

In [5]:
nltk_data_dir = data_path_base + 'nltk_data'
setup_nltk_resources(download_dir=nltk_data_dir)
nltk.data.path.append(nltk_data_dir)

[nltk_data] Downloading package punkt to /app/datasets/nltk_data...


NLTK resources downloaded to /app/datasets/nltk_data


[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /app/datasets/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Generate the user index, word dictionary and embedding matrix

In [6]:
# Generate uid2index.pkl
uid2index = generate_uid2index(train_behaviors_file, output_file= userDict_file)

/app/datasets/MINDsmall/utils/uid2index.pkl already exists. Skipping creation.
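
For readers who want to see what a user index could look like, here is a minimal sketch. It is not the code in `environment.data_utils`; `build_uid2index` is a hypothetical helper, the sample rows are made up in the tab-separated MIND behaviors layout (impression id, user id, time, history, impressions), and starting the indices at 1 so that 0 stays free for unknown users is an assumption.

```python
import io


def build_uid2index(behaviors_lines):
    """Map each user id to a positive integer, in order of first appearance."""
    uid2index = {}
    for line in behaviors_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Column 1 (0-based) of a behaviors row holds the user id.
        user_id = line.split("\t")[1]
        if user_id not in uid2index:
            # Assumption: index 0 is kept free for users not seen in training.
            uid2index[user_id] = len(uid2index) + 1
    return uid2index


# Made-up rows, only to illustrate the format.
sample_behaviors = io.StringIO(
    "1\tU13740\t11/11/2019 9:05:58 AM\tN55189 N42782\tN55689-1 N35729-0\n"
    "2\tU91836\t11/12/2019 6:11:30 PM\tN31739\tN20678-0 N39317-1\n"
    "3\tU13740\t11/13/2019 8:00:00 AM\tN55189\tN50014-0 N23877-1\n"
)
demo_uid2index = build_uid2index(sample_behaviors)
print(demo_uid2index)  # {'U13740': 1, 'U91836': 2}
```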


In [7]:
# Path to GloVe file (e.g., glove.6B.300d.txt)
glove_file = os.path.join(glove_dir, "glove/glove.6B.300d.txt")
embedding_dim = 300
glove_embeddings = load_glove_embeddings(glove_file, embedding_dim=embedding_dim)


glove_embeddings.pkl already exists. Loading embeddings from file.


In [8]:
# Generate word_dict.pkl and embedding.npy
word_dict, embedding_matrix = create_word_dict_and_embeddings(
    news_file= train_news_file,
    glove_embeddings=glove_embeddings,
    embedding_dim=embedding_dim,
    output_dir= data_path + '/utils/'
)


Files already exist in /app/datasets/MINDsmall/utils/. Loading existing word_dict and embedding matrix.
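
The two cells above rely on helpers from `environment.data_utils`, whose code is not shown here. As a rough illustration of the idea only, the sketch below parses GloVe-style text lines (a word followed by space-separated floats) and builds a word-to-index dictionary plus a matching embedding matrix. Both function names are hypothetical, whitespace splitting stands in for `word_tokenize`, and reserving row 0 for padding and skipping words without a GloVe vector are assumptions.

```python
def parse_glove_lines(lines, embedding_dim):
    """Parse 'word v1 v2 ... vd' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != embedding_dim + 1:
            continue  # skip lines that do not match the expected dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_word_dict_and_matrix(titles, vectors, embedding_dim):
    """Assign an index to each title word that has a vector and stack the vectors."""
    word_dict = {}
    matrix = [[0.0] * embedding_dim]  # assumption: row 0 is the padding vector
    for title in titles:
        for word in title.lower().split():
            if word in vectors and word not in word_dict:
                word_dict[word] = len(matrix)
                matrix.append(vectors[word])
    return word_dict, matrix


# Tiny made-up example with 3-dimensional vectors.
toy_vectors = parse_glove_lines(["news 0.1 0.2 0.3", "model 0.4 0.5 0.6"], embedding_dim=3)
toy_word_dict, toy_matrix = build_word_dict_and_matrix(["News model update"], toy_vectors, 3)
print(toy_word_dict)    # {'news': 1, 'model': 2}
print(len(toy_matrix))  # 3
```

A nested list like `toy_matrix` can be turned into an array with `np.array` and written with `np.save` to produce a file such as `embedding.npy`.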


## Define the hyper-parameters

In [9]:
hparams = prepare_hparams(None, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          model_type="nrms",
                          title_size=30,
                          his_size=50,
                          npratio=4,
                          data_format='news',
                          word_emb_dim=300,
                          head_num=20,
                          head_dim=20,
                          attention_hidden_dim=200,
                          loss='cross_entropy_loss',
                          dropout=0.2,
                          support_quick_scoring=True,
                          show_step=10, 
                          metrics=['group_auc', 'mean_mrr', 'ndcg@5;10'])
#print(hparams)

In [10]:
for key, value in hparams.values().items():
    print(f"{key} = {value}")

support_quick_scoring = True
dropout = 0.2
attention_hidden_dim = 200
head_num = 20
head_dim = 20
filter_num = 200
window_size = 3
vert_emb_dim = 100
subvert_emb_dim = 100
gru_unit = 400
type = ini
user_emb_dim = 50
learning_rate = 0.001
optimizer = adam
epochs = 5
batch_size = 64
show_step = 10
wordEmb_file = /app/datasets/MINDsmall/utils/embedding.npy
wordDict_file = /app/datasets/MINDsmall/utils/word_dict.pkl
userDict_file = /app/datasets/MINDsmall/utils/uid2index.pkl
model_type = nrms
title_size = 30
his_size = 50
npratio = 4
data_format = news
word_emb_dim = 300
loss = cross_entropy_loss
metrics = ['group_auc', 'mean_mrr', 'ndcg@5;10']


In [11]:
iterator = MINDIterator
model = NRMSModel(hparams, iterator, seed=seed)


## Figuring out the input for SUBER

1. `parsed_data` is a generator.
2. `data_d` is a dictionary holding one batch of 64 examples.
3. Printing the shape of each key shows the batch size of 64 in the first dimension.

In [12]:
parsed_data = model.train_iterator.load_data_from_file(train_news_file, train_behaviors_file)
print(type(parsed_data))
data_d = next(parsed_data)
print(type(data_d))

for key, value in data_d.items():
    print(f"Key: {key}, Shape: {value.shape if hasattr(value, 'shape') else type(value)}")



<class 'generator'>
<class 'dict'>
Key: impression_index_batch, Shape: (64, 1)
Key: user_index_batch, Shape: (64, 1)
Key: clicked_title_batch, Shape: (64, 50, 30)
Key: candidate_title_batch, Shape: (64, 5, 30)
Key: labels, Shape: (64, 5)


### Description of keys
*Note that the first dimension is 64 (the batch size) for all keys.*
- `impression_index_batch` (64, 1): comes from the behaviors TSV; one impression id per row in the batch.
- `user_index_batch` (64, 1): one user index per row.
- `clicked_title_batch` (64, 50, 30): each user has a click history of up to 50 articles (`his_size=50`), and each article title is represented by 30 word indices (`title_size=30`).
- `candidate_title_batch` (64, 5, 30): each row has 5 candidate articles (`npratio=4` negatives plus one clicked article), and each title is again 30 word indices.
- `labels` (64, 5): one label per candidate article for each row in the batch. It indicates whether the user clicked that candidate (1 for click, 0 for no click).
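
To make these shapes concrete for anything that has to produce NRMS input, here is a minimal sketch that fabricates one batch with the same keys and shapes. `make_dummy_batch` is a hypothetical helper; the contents are random word indices, and the dtypes, the vocabulary size and the choice to put the clicked candidate in column 0 are assumptions rather than something read from `MINDIterator`.

```python
import numpy as np


def make_dummy_batch(batch_size=64, his_size=50, title_size=30, npratio=4,
                     vocab_size=1000, seed=42):
    """Build a random batch with the same keys and shapes as the iterator output."""
    rng = np.random.default_rng(seed)
    n_candidates = npratio + 1  # one clicked article plus `npratio` negatives
    labels = np.zeros((batch_size, n_candidates), dtype=np.float32)
    labels[:, 0] = 1.0  # assumption: the clicked candidate sits in column 0
    return {
        "impression_index_batch": np.arange(batch_size, dtype=np.int32).reshape(-1, 1),
        "user_index_batch": rng.integers(1, 100, size=(batch_size, 1), dtype=np.int32),
        "clicked_title_batch": rng.integers(
            0, vocab_size, size=(batch_size, his_size, title_size), dtype=np.int64),
        "candidate_title_batch": rng.integers(
            0, vocab_size, size=(batch_size, n_candidates, title_size), dtype=np.int64),
        "labels": labels,
    }


dummy_batch = make_dummy_batch()
for key, value in dummy_batch.items():
    print(f"Key: {key}, Shape: {value.shape}")
```

A batch like this only demonstrates the layout; whether `model.train` accepts these exact dtypes has not been verified here.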

---

Next, we run a few batches for show and tell, just to see that the format of our data is correct for input. We do this by passing each batch to `model.train`, located at `https://github.com/recommenders-team/recommenders/blob/main/recommenders/models/newsrec/models/base_model.py#L150`. Take a look at `recommenders/models/newsrec/models/nrms.py`, particularly the method `_build_nrms`; this is what builds the model.

In [16]:
# An example that illustrates the shape of the input to NRMS.

losses = []
for i, batch in enumerate(parsed_data):
    loss = model.train(batch)
    print(f"Batch {i + 1} Loss: {loss}")
    losses.append(loss)
    if i == 10:
        break

Batch 1 Loss: 1.6028367280960083
Batch 2 Loss: 1.5509140491485596
Batch 3 Loss: 1.6738035678863525
Batch 4 Loss: 1.508312463760376
Batch 5 Loss: 1.5546257495880127
Batch 6 Loss: 1.6112298965454102
Batch 7 Loss: 1.6047354936599731
Batch 8 Loss: 1.5383238792419434
Batch 9 Loss: 1.6257178783416748
Batch 10 Loss: 1.663732647895813
Batch 11 Loss: 1.5934019088745117
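
A quick sanity check on these numbers: if the loss is a categorical cross-entropy over the 5 candidates per row (an assumption based on `loss='cross_entropy_loss'` and the `(64, 5)` label shape), a model that spreads its probability evenly would score ln(5). The sketch below compares that baseline with the mean of the losses printed above, rounded to four decimals; `uniform_cross_entropy` is a hypothetical helper.

```python
import math
from statistics import mean


def uniform_cross_entropy(n_classes):
    """Cross-entropy of a uniform prediction over `n_classes` with one true class."""
    return -math.log(1.0 / n_classes)


# Losses copied from the output above, rounded to four decimals.
logged_losses = [1.6028, 1.5509, 1.6738, 1.5083, 1.5546, 1.6112,
                 1.6047, 1.5383, 1.6257, 1.6637, 1.5934]

baseline = uniform_cross_entropy(5)
print(f"Uniform baseline: {baseline:.4f}")  # 1.6094
print(f"Mean logged loss: {mean(logged_losses):.4f}")
```

The two values are close, which is what one would expect from a model that has seen only a handful of batches.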


## Mid-Notebook checkpoint

OK, we know how the model takes data in and we know the structure of that data, so how can we create it?

In [14]:
from pydantic import BaseModel
from pydantic_ai import Agent


In [None]:
# Run the model without any training -- can training do better than no training at all?  

In [None]:
print(model.run_eval(dev_news_file, dev_behaviors_file))

In [None]:
dev_news_file

In [None]:
model.fit(train_news_file, train_behaviors_file, dev_news_file, dev_behaviors_file)

In [None]:
res_syn = model.run_eval(dev_news_file, dev_behaviors_file)
print(res_syn)
 

In [None]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])
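
For intuition about two of the metrics recorded above, here is a minimal plain-Python sketch of MRR and nDCG@k for a single impression, written from the standard definitions. It is not the `recommenders` implementation, `group_auc` is not covered, the labels and scores are made up, and both functions assume at least one clicked candidate.

```python
import math


def mrr_score(labels, scores):
    """Mean of 1/rank over the clicked items, ranking by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    reciprocal_ranks = sum(labels[i] / (pos + 1) for pos, i in enumerate(order))
    return reciprocal_ranks / sum(labels)


def dcg_at_k(labels, scores, k):
    """Discounted cumulative gain of the top-k items ranked by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum((2 ** labels[i] - 1) / math.log2(pos + 2) for pos, i in enumerate(order))


def ndcg_at_k(labels, scores, k):
    """DCG normalised by the DCG of a perfect ranking."""
    return dcg_at_k(labels, scores, k) / dcg_at_k(labels, labels, k)


# Made-up impression: candidates 1 and 4 were clicked.
toy_labels = [0, 1, 0, 0, 1]
toy_scores = [0.2, 0.9, 0.1, 0.4, 0.3]
print(f"MRR:    {mrr_score(toy_labels, toy_scores):.4f}")     # 0.6667
print(f"nDCG@5: {ndcg_at_k(toy_labels, toy_scores, 5):.4f}")  # 0.9197
```

The evaluation reports these values averaged over all impressions, so a single impression like this one is only a way to read the numbers.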

## Save the model

In [None]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "nrms_ckpt"))

## Output Prediction File
This code segment generates the prediction.zip file, in the same format as described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `MINDlarge` if you want to submit your prediction to the [MIND Competition](https://msnews.github.io/competition.html).
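
Each line of the prediction file pairs an impression id with the rank of every candidate, where rank 1 is the highest score. The sketch below shows that conversion in plain Python; `scores_to_ranks` and `format_prediction_line` are hypothetical helpers, and for tied scores the result may differ from the double `np.argsort` used in the notebook.

```python
def scores_to_ranks(scores):
    """Return the 1-based rank of each score, with rank 1 for the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def format_prediction_line(impression_id, scores):
    """Format one line as '<impression id> [r1,r2,...]'."""
    ranks = scores_to_ranks(scores)
    return f"{impression_id} [{','.join(str(r) for r in ranks)}]"


print(format_prediction_line(1, [0.1, 0.9, 0.5]))  # 1 [3,1,2]
```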

In [None]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

In [None]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

In [None]:
# The context manager closes the archive even if writing fails
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

## References

- NRMS paper: https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf
- Recommenders repository: https://github.com/recommenders-team/recommenders/tree/main

In [None]:
# Calculate elapsed time
elapsed_time = time.time() - start_time

# Convert the elapsed time into hours, minutes, and seconds
hours, remainder = divmod(elapsed_time, 3600)
minutes, seconds = divmod(remainder, 60)

# Print the result in H:M:S format
print(f"Elapsed time: {int(hours)}:{int(minutes):02d}:{int(seconds):02d}")