# NRMS Model

NRMS stands for `Neural News Recommendation with Multi-Head Self-Attention`. The paper is listed under References at the end.

---

## Understand the MIND dataset
The MIND dataset consists of several key files:

- news.tsv: Contains news articles and their metadata (news ID, category, subcategory, title, abstract, etc.).
- behaviors.tsv: Contains user interaction data, including the history of news articles clicked and the impressions list (clicked or not clicked).

The NRMS model uses this data to learn user preferences based on click history.
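
To make the `behaviors.tsv` layout concrete, here is a minimal sketch that parses one row using only the standard library. It assumes the five tab-separated columns used later in this notebook (impression ID, user ID, time, history, impressions) and that each impression is a news ID followed by `-1` (clicked) or `-0` (not clicked). The sample row and the helper name `parse_behavior_line` are made up for illustration.

```python
def parse_behavior_line(line):
    """Split one behaviors.tsv row into its fields (hypothetical helper)."""
    impression_id, user_id, time_str, history, impressions = line.rstrip("\n").split("\t")
    clicked, not_clicked = [], []
    for item in impressions.split():
        # Split on the last "-" so the label is never confused with digits in the ID.
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else not_clicked).append(news_id)
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time_str,
        "history": history.split(),  # an empty history becomes []
        "clicked": clicked,
        "not_clicked": not_clicked,
    }


# Invented sample row, not taken from the dataset.
sample = "1\tU100\t11/11/2019 9:05:58 AM\tN11 N21 N31\tN41-1 N51-0 N61-0"
row = parse_behavior_line(sample)
print(row["history"], row["clicked"], row["not_clicked"])
```

Reading the label from the text after the last `-` matters because news IDs themselves contain digits, so a simple substring check for `1` would mislabel many impressions.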

---

## Getting set up
Create the virtual environment in your home directory, where the alias below expects it:
```bash

python -m venv ~/nrms

```

Edit your .bashrc and add an alias:

```bash

alias nrms='source ~/nrms/bin/activate'

```

Source the .bashrc file and activate the nrms environment:

```bash

source ~/.bashrc

nrms


```

Install the required Python modules.

Note: the original NRMS was implemented with TensorFlow.
```bash


pip install tensorflow[and-cuda]

pip install jupyterlab

# Start Jupyter lab at this point if you want.

pip install recommenders # The Microsoft Python Module with all the recommender models in it.

# We will need word embeddings
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip

            
```

---

# NRMS Sequence of Steps


## Do the imports and ensure it works



In [1]:
import time



# Start the timer
start_time = time.time()

# Remove warnings
import os
os.environ['TF_TRT_ALLOW_ENGINE_NATIVE_SEGMENT_EXECUTION'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'


import tensorflow as tf
tf.compat.v1.disable_eager_execution()  # Disable eager execution to use TF1.x style


import pickle
import pandas as pd
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer

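# Load news dataset (expand "~" explicitly, since not every file reader does it for us)
dataset_path = os.path.expanduser('~/datasets/MINDsmall/train/')
# news.tsv has eight tab-separated columns; the last two hold the title and abstract entities
df_news = pd.read_csv(f"{dataset_path}news.tsv", sep="\t", names=["news_id", "category", "subcategory", "title", "abstract", "url", "title_entities", "abstract_entities"])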

# Drop rows where title or abstract is NaN
df_news.dropna(subset=['title', 'abstract'], inplace=True)

# Combine all titles and abstracts to create the vocabulary
all_texts = df_news['title'].tolist() + df_news['abstract'].tolist()

# Tokenize the texts
tokenizer = Tokenizer(num_words=50000)  # num_words only caps texts_to_sequences(); word_index below still holds every word
tokenizer.fit_on_texts(all_texts)

# Save the word index dictionary to a file
word_dict = tokenizer.word_index
with open("word_dict.pkl", "wb") as f:
    pickle.dump(word_dict, f)


# End the timer
end_time = time.time()

# Print the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")


Elapsed time: 5.06 seconds


In [2]:
# Start the timer
start_time = time.time()

# Load behaviors dataset
df_behaviors = pd.read_csv(f"{dataset_path}behaviors.tsv", sep="\t", names=["impression_id", "user_id", "time", "history", "impressions"])
# Replace NaN values in history with an empty string
df_behaviors['history'] = df_behaviors['history'].fillna("")

# Generate a dictionary of user IDs and their indices
user_ids = df_behaviors['user_id'].unique()
user_dict = {user_id: idx for idx, user_id in enumerate(user_ids)}

# Save the user index dictionary as a pickle file
with open("user_dict.pkl", "wb") as f:
    pickle.dump(user_dict, f)


# End the timer
end_time = time.time()

# Print the elapsed time
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 0.94 seconds


## Set Hyperparameters

Note: these settings can differ substantially from what you read in online articles, so be ready to dig deep to get them right. Some extra inputs are needed, such as `userDict_file` and `wordDict_file`. The recommenders module abstracts away many details, but some preprocessing is still required.
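
As one illustration of that preprocessing, the sketch below builds a word dictionary capped at a fixed size, so that it cannot hold more entries than `word_size`. It is a standard-library sketch, not the tokenizer used in this notebook: the regular expression, the helper name `build_word_dict`, and the sample titles are assumptions. Index 0 is left unused on the assumption that it serves as padding.

```python
import re
from collections import Counter


def build_word_dict(texts, max_words):
    """Map the most frequent words to indices 1..max_words (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    # most_common() orders by frequency, highest first.
    top_words = counts.most_common(max_words)
    return {word: idx for idx, (word, _) in enumerate(top_words, start=1)}


# Invented titles, used only to exercise the helper.
titles = ["Markets rally as rates fall", "Rates fall again", "Local team wins"]
small_word_dict = build_word_dict(titles, max_words=3)
print(small_word_dict)
```

Capping the dictionary itself, rather than relying on a tokenizer option, keeps the largest index in the dictionary equal to the vocabulary size that the embedding layer is sized for.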

In [3]:
from recommenders.models.newsrec.newsrec_utils import prepare_hparams

# Prepare hyperparameters for NRMS
hparams = prepare_hparams(
    yaml_file=None,
    model_type="nrms",                   # Specify model type
    data_format="news",                  # Data format for NRMS (should be 'news')
    title_size=30,                       # Maximum number of words in the title
    word_emb_dim=300,                    # Word embedding dimension (consistent with GloVe)
    word_size=50000,                     # Vocabulary size to use
    dropout=0.2,                         # Dropout rate for regularization
    epochs=10,                           # Number of epochs to train
    batch_size=64,                       # Batch size for training
    learning_rate=0.001,                 # Learning rate for optimization
    npratio=4,                           # Negative sampling ratio
    his_size=50,                         # Number of historical news articles to consider
    head_num=8,                          # Number of attention heads in the multi-head attention mechanism
    head_dim=64,                         # Dimension size of each attention head
    attention_hidden_dim=200,            # Hidden dimension size for the attention mechanism
    loss="cross_entropy",                # Loss function to be used during training
    userDict_file="user_dict.pkl",       # Path to the generated user dictionary file
    wordDict_file="word_dict.pkl",       # Path to the generated word dictionary file
    wordEmb_file="glove.6B.300d.txt"     # Pre-trained embeddings; the ValueError further down suggests this must be a NumPy .npy matrix, not raw GloVe text
)


## Initialize the data iterator and the NRMS model

The next step is to initialize the data iterator and then build and train the NRMS model.

In [4]:
from recommenders.models.newsrec.io.mind_iterator import MINDIterator

# Initialize MIND data iterator
iterator = MINDIterator(hparams)

# Create a generator of training batches (the news file comes first, then the behaviors file)
train_data = iterator.load_data_from_file(f"{dataset_path}news.tsv", f"{dataset_path}behaviors.tsv")


In [5]:
### NOTE
##
'''
Edit the imports at the top of layers.py in the installed recommenders package so that they look like the lines below.

vi ~/nrms_tf/lib/python3.10/site-packages/recommenders/models/newsrec/models/layers.py


#import tensorflow.compat.v1.keras as keras
from tensorflow import keras
from tensorflow.compat.v1.linalg import einsum
#from tensorflow.compat.v1.keras import layers
from tensorflow.keras import layers
#from tensorflow.compat.v1.keras import backend as K
from tensorflow.keras import backend as K


'''


from recommenders.models.newsrec.models.nrms import NRMSModel

# MINDIterator is passed as a class (assuming hparams is already defined); the model creates its own iterators from it
iterator_creator = MINDIterator

# Initialize the NRMS model with both hparams and iterator_creator
model = NRMSModel(hparams, iterator_creator, seed=42)

ValueError: Cannot load file containing pickled data when allow_pickle=False
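
One likely cause of this error: the message comes from `numpy.load`, which suggests that `wordEmb_file` is expected to be a saved NumPy array rather than a raw GloVe text file. Under that assumption, the sketch below converts GloVe-format text into a matrix whose row `i` holds the vector for the word with index `i` in the word dictionary. The helper names, the output file name `embedding.npy`, and the zero-vector fallback for words missing from GloVe are illustrative choices, not part of the recommenders API.

```python
def build_embedding_rows(glove_path, word_dict, dim):
    """Return a list of rows; row i is the vector for word index i (row 0 stays zero)."""
    rows = [[0.0] * dim for _ in range(max(word_dict.values()) + 1)]
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            # Skip words we do not index and lines with an unexpected width.
            if word in word_dict and len(values) == dim:
                rows[word_dict[word]] = [float(v) for v in values]
    return rows


def save_embedding_matrix(rows, out_path):
    """Write the rows as a float32 .npy file that numpy.load can read."""
    import numpy as np  # imported here so the parsing step needs only the standard library

    np.save(out_path, np.asarray(rows, dtype="float32"))
```

With the files from this notebook, usage might look like `rows = build_embedding_rows("glove.6B.300d.txt", word_dict, 300)` followed by `save_embedding_matrix(rows, "embedding.npy")`, after which `wordEmb_file` would point at `embedding.npy`. Plain Python lists are slow and memory-hungry for a large vocabulary, so this is meant to show the row alignment rather than to be fast.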

In [6]:
NRMSModel??

Init signature: NRMSModel(hparams, iterator_creator, seed=None)
Source:
class NRMSModel(BaseModel):
    """NRMS model(Neural News Recommendation with Multi-Head Self-Attention)

    Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang,and Xing Xie, "Neural News
    Recommendation with Multi-Head Self-Attention" in Proceedings of the 2019 Conference
    on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
    on Natural Language Processing (EMNLP-IJCNLP)

    Attributes:
        word2vec_embedding (numpy.ndarray): Pretrained word embedding matrix.
        hparam (object): Global hyper-parameters.

In [None]:
import pickle

file_path = hparams.wordDict_file
try:
    with open(file_path, 'rb') as f:
        data = pickle.load(f)
        print("Successfully loaded", len(data), "entries")  # print the size, not the whole dictionary
except pickle.UnpicklingError as e:
    print("UnpicklingError:", e)
except FileNotFoundError:
    print("File not found:", file_path)


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize Tokenizer with a fixed vocabulary size (50,000, matching word_size in the hyperparameters)
vocab_size = 50000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<UNK>")  # OOV token for out-of-vocabulary words

# Fit the tokenizer on the combined text data (news titles and user histories)
# Note: each history is a space-separated list of news IDs, so those IDs become vocabulary entries too
all_titles = df_news['title'].tolist()
all_histories = df_behaviors['history'].fillna('').tolist()  # Fill NA with empty strings
tokenizer.fit_on_texts(all_titles + all_histories)

# Convert news titles to sequences and pad them to a fixed length
max_title_length = 30  # Matches title_size in the hyperparameters
title_sequences = tokenizer.texts_to_sequences(df_news['title'])
title_sequences_padded = pad_sequences(title_sequences, maxlen=max_title_length, padding='post')

# Convert user histories to sequences and pad them to a fixed length
max_history_length = 50  # Matches his_size in the hyperparameters
history_sequences = tokenizer.texts_to_sequences(df_behaviors['history'].fillna(''))
history_sequences_padded = pad_sequences(history_sequences, maxlen=max_history_length, padding='post')
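
The padding step can be pictured with a few lines of plain Python. This sketch is not the Keras implementation; `pad_post` is a hypothetical helper that pads with zeros on the right and, as an explicit choice here, keeps either the first or the last `maxlen` tokens when a sequence is too long. The truncation side is worth checking in real code, since `pad_sequences` has a separate `truncating` argument that the cell above leaves at its default.

```python
def pad_post(seq, maxlen, truncating="post", value=0):
    """Pad on the right with `value`; cut from the chosen side if too long."""
    if len(seq) > maxlen:
        if truncating == "post":
            seq = seq[:maxlen]              # keep the first maxlen tokens
        else:
            seq = seq[len(seq) - maxlen:]   # keep the last maxlen tokens
    return list(seq) + [value] * (maxlen - len(seq))


print(pad_post([5, 8, 2], 5))                             # [5, 8, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))                    # [1, 2, 3, 4]
print(pad_post([1, 2, 3, 4, 5, 6], 4, truncating="pre"))  # [3, 4, 5, 6]
```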


In [None]:
# Step 2: Load MIND Dataset
# Now, we will load the MIND dataset, which contains user behaviors and news articles.
# We have the datasets already downloaded in ~/datasets/MINDlarge and ~/datasets/MINDsmall.
mind_large = '~/datasets/MINDlarge'
mind_large_train = mind_large + '/train/'
mind_large_dev = mind_large + '/dev/'  # Development -- help tune hyper-parameter 
mind_large_test = mind_large + '/test/'

mind_small = '~/datasets/MINDsmall'
mind_small_train = mind_small + '/train/'
mind_small_dev = mind_small + '/dev/'

dataset_path = mind_small_train

# Load training data
df_behaviors = pd.read_csv(f"{dataset_path}behaviors.tsv", sep="\t", names=["impression_id", "user_id", "time", "history", "impressions"])
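# news.tsv has eight tab-separated columns; the last two hold the title and abstract entities
df_news = pd.read_csv(f"{dataset_path}news.tsv", sep="\t", names=["news_id", "category", "subcategory", "title", "abstract", "url", "title_entities", "abstract_entities"])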

In [None]:
# Step 3: Data Preprocessing
# Check for missing values in the title column and remove them.
df_news = df_news[df_news['title'].notna()].copy()

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The next step is to preprocess the news dataset. We tokenize the news titles using BERT tokenizer.
def preprocess_news(news_df):
    # Tokenize the news title using the BERT tokenizer
    news_df.loc[:, 'title_tokens'] = news_df['title'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=50, truncation=True))
    return news_df

df_news = preprocess_news(df_news)

In [None]:
# Step 4: Create Vocabulary
# BERT tokenizer already provides the vocabulary, so no need to create a custom vocabulary.
vocab = tokenizer.get_vocab()

In [None]:
# Step 5: Dataset and DataLoader Classes
# We define a custom Dataset class to load and serve the data to the model. This class will convert news titles and user behaviors into tensors.
class NewsDataset(Dataset):
    def __init__(self, df_behaviors, df_news):
        self.behaviors = df_behaviors
        self.news = df_news

    def __len__(self):
        return len(self.behaviors)

    def __getitem__(self, idx):
        # Extract user history and impressions from the behaviors dataset.
        user_history = self.behaviors.iloc[idx]['history'].split()
        impressions = self.behaviors.iloc[idx]['impressions'].split()
        # Get the tokenized titles for each news article in the user's history.
        news_titles = []
        for news_id in user_history:
            matching_news = self.news[self.news['news_id'] == news_id]
            if not matching_news.empty:
                news_titles.append(matching_news['title_tokens'].values[0])
        if not news_titles:
            # If no valid news articles are found, return an empty tensor with padding
            news_titles = [[0]]
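        # Each impression looks like "N12345-1"; read the label from the suffix, because the ID itself contains digits.
        labels = [1 if imp.endswith('-1') else 0 for imp in impressions]
        return torch.tensor(news_titles, dtype=torch.long), torch.tensor(labels)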

# Create the dataset and dataloader for training.
train_dataset = NewsDataset(df_behaviors, df_news)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

In [None]:
# Step 6: Define the NRMS Model
# Here, we define the NRMS model. The model uses BERT embeddings and a multi-head attention mechanism to capture the relationships between words.
class NRMS(nn.Module):
    def __init__(self, embedding_dim, attention_heads):
        super(NRMS, self).__init__()
        # BERT model to get embeddings
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Multi-head attention layer to capture interactions between words.
        self.attention = nn.MultiheadAttention(embedding_dim, attention_heads)
        # Fully connected layer to produce the final output.
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, x):
        # Convert input sequences to embeddings using BERT.
        with torch.no_grad():
            x = self.bert(x)[0]  # Extract the last hidden state from BERT
        x = x.permute(1, 0, 2)  # Convert to (SeqLen, Batch, EmbeddingDim)
        # Apply multi-head attention.
        attn_output, _ = self.attention(x, x, x)
        # Average pooling over sequence length and pass through a fully connected layer.
        out = self.fc(attn_output.mean(dim=0))
        return torch.sigmoid(out)

In [None]:
# Step 7: Training Loop
# We define the training loop to train the NRMS model on the MIND dataset.
model = NRMS(embedding_dim, attention_heads)
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss for binary classification.
optimizer = optim.Adam(model.parameters(), lr=learning_rate)  # Adam optimizer.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for news_tokens, labels in tqdm(train_loader, desc=f"Epoch {epoch + 1}"):
        # Move data to the appropriate device (CPU or GPU).
        news_tokens, labels = news_tokens.to(device), labels.to(device, dtype=torch.float)
        optimizer.zero_grad()  # Clear previous gradients.
        outputs = model(news_tokens)  # Forward pass through the model.
        loss = criterion(outputs.view(-1), labels.view(-1))  # Calculate loss.
        loss.backward()  # Backpropagate the loss.
        optimizer.step()  # Update model parameters.
        epoch_loss += loss.item()
    # Print the average loss for the epoch.
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss / len(train_loader):.4f}")

# Step 8: Save the Model
# Finally, save the trained model so it can be used for inference or further training.
torch.save(model.state_dict(), 'nrms_model.pth')

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
from tqdm.notebook import tqdm
from transformers import BertTokenizer, BertModel

# Step 1: Define Parameters
# We start by defining some important parameters for our model such as embedding dimensions, number of attention heads, batch size, etc.
embedding_dim = 768  # Using BERT embedding dimensions
attention_heads = 8  # Number of attention heads in multi-head attention
batch_size = 16  # Number of samples in each batch
num_epochs = 10  # Number of epochs to train the model
learning_rate = 0.001  # Learning rate for the optimizer

# Step 2: Load MIND Dataset
# Now, we will load the MIND dataset, which contains user behaviors and news articles.
# We have the datasets already downloaded in ~/datasets/MINDlarge and ~/datasets/MINDsmall.
mind_large = '~/datasets/MINDlarge'
mind_large_train = mind_large + '/train/'
mind_large_dev = mind_large + '/dev/'  # Development -- help tune hyper-parameter 
mind_large_test = mind_large + '/test/'

mind_small = '~/datasets/MINDsmall'
mind_small_train = mind_small + '/train/'
mind_small_dev = mind_small + '/dev/'

dataset_path = mind_small_train

# Load training data
df_behaviors = pd.read_csv(f"{dataset_path}behaviors.tsv", sep="\t", names=["impression_id", "user_id", "time", "history", "impressions"])
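# news.tsv has eight tab-separated columns; the last two hold the title and abstract entities
df_news = pd.read_csv(f"{dataset_path}news.tsv", sep="\t", names=["news_id", "category", "subcategory", "title", "abstract", "url", "title_entities", "abstract_entities"])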

# Step 3: Data Preprocessing
# Check for missing values in the title column and remove them.
df_news = df_news[df_news['title'].notna()].copy()

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The next step is to preprocess the news dataset. We tokenize the news titles using BERT tokenizer.
def preprocess_news(news_df):
    # Tokenize the news title using the BERT tokenizer
    news_df.loc[:, 'title_tokens'] = news_df['title'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=50, truncation=True))
    return news_df

df_news = preprocess_news(df_news)

# Step 4: Create Vocabulary
# BERT tokenizer already provides the vocabulary, so no need to create a custom vocabulary.
vocab = tokenizer.get_vocab()

# Step 5: Dataset and DataLoader Classes
# We define a custom Dataset class to load and serve the data to the model. This class will convert news titles and user behaviors into tensors.
class NewsDataset(Dataset):
    def __init__(self, df_behaviors, df_news):
        self.behaviors = df_behaviors
        self.news = df_news

    def __len__(self):
        return len(self.behaviors)

    def __getitem__(self, idx):
        # Extract user history and impressions from the behaviors dataset.
        user_history = str(self.behaviors.iloc[idx]['history']).split()
        impressions = str(self.behaviors.iloc[idx]['impressions']).split()
        # Get the tokenized titles for each news article in the user's history.
        news_titles = []
        for news_id in user_history:
            matching_news = self.news[self.news['news_id'] == news_id]
            if not matching_news.empty:
                news_titles.append(matching_news['title_tokens'].values[0])
        if not news_titles:
            # If no valid news articles are found, return a tensor filled with padding
            news_titles = [[0]]
        # Convert list of token lists to tensors
        news_titles = [torch.tensor(tokens, dtype=torch.long) for tokens in news_titles]
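        # Each impression looks like "N12345-1"; read the label from the suffix, because the ID itself contains digits.
        labels = torch.tensor([1 if imp.endswith('-1') else 0 for imp in impressions], dtype=torch.float)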
        return news_titles, labels

# Custom collate function to handle batches with varying sequence lengths
def collate_fn(batch):
    news_titles_batch, labels_batch = zip(*batch)
    # Pad each list of news titles independently
    padded_news_titles = [pad_sequence(news, batch_first=True, padding_value=0) for news in news_titles_batch]
    # Stack all padded news titles into a batch
    news_titles_padded = pad_sequence(padded_news_titles, batch_first=True, padding_value=0)
    # Pad labels to ensure consistent batch size
    labels_padded = pad_sequence(labels_batch, batch_first=True, padding_value=0)
    # Create attention masks
    attention_mask = (news_titles_padded != 0).long()
    return news_titles_padded, attention_mask, labels_padded

# Create the dataset and dataloader for training.
train_dataset = NewsDataset(df_behaviors, df_news)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

# Step 6: Define the NRMS Model
# Here, we define the NRMS model. The model uses BERT embeddings and a multi-head attention mechanism to capture the relationships between words.
class NRMS(nn.Module):
    def __init__(self, embedding_dim, attention_heads):
        super(NRMS, self).__init__()
        # BERT model to get embeddings
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Multi-head attention layer to capture interactions between words.
        self.attention = nn.MultiheadAttention(embedding_dim, attention_heads)
        # Fully connected layer to produce the final output.
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, x, attention_mask):
        # Convert input sequences to embeddings using BERT.
        with torch.no_grad():
            x = self.bert(x, attention_mask=attention_mask)[0]  # Extract the last hidden state from BERT
        x = x.permute(1, 0, 2)  # Convert to (SeqLen, Batch, EmbeddingDim)
        # Apply multi-head attention.
        attn_output, _ = self.attention(x, x, x)
        # Average pooling over sequence length and pass through a fully connected layer.
        out = self.fc(attn_output.mean(dim=0))
        return torch.sigmoid(out).squeeze()

# Step 7: Training Loop
# We define the training loop to train the NRMS model on the MIND dataset.
model = NRMS(embedding_dim, attention_heads)
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss for binary classification.
optimizer = optim.Adam(model.parameters(), lr=learning_rate)  # Adam optimizer.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for news_tokens, attention_mask, labels in tqdm(train_loader, desc=f"Epoch {epoch + 1}"):
        # Move data to the appropriate device (CPU or GPU).
        news_tokens, attention_mask, labels = news_tokens.to(device), attention_mask.to(device), labels.to(device, dtype=torch.float)
        optimizer.zero_grad()  # Clear previous gradients.
        outputs = model(news_tokens, attention_mask)  # Forward pass through the model.
        loss = criterion(outputs.view(-1), labels.view(-1))  # Calculate loss.
        loss.backward()  # Backpropagate the loss.
        optimizer.step()  # Update model parameters.
        epoch_loss += loss.item()
    # Print the average loss for the epoch.
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss / len(train_loader):.4f}")

# Step 8: Save the Model
# Finally, save the trained model so it can be used for inference or further training.
torch.save(model.state_dict(), 'nrms_model.pth')

## References

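Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie. "Neural News Recommendation with Multi-Head Self-Attention." EMNLP-IJCNLP 2019. https://wuch15.github.io/paper/EMNLP2019-NRMS.pdf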


