## Install required packages

In [None]:
#!pip install torch pandas numpy nltk

## Prepare data

In this example, I use the short description text from a news dataset, because it consists of concise, formally styled sentences in modern English with little slang.<br>
Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset/versions/2) (collected from HuffPost) from Kaggle.

In [None]:
import os
os.chdir('./Teaching/s25-nlp/week7/')

In [1]:
import pandas as pd

df = pd.read_json("dataset/News_Category_Dataset_v2.json",lines=True)
train_data = df["short_description"]
train_data

0         She left her husband. He killed their children...
1                                  Of course it has a song.
2         The actor and his longtime girlfriend Anna Ebe...
3         The actor gives Dems an ass-kicking for not fi...
4         The "Dietland" actress said using the bags is ...
                                ...                        
200848    Verizon Wireless and AT&T are already promotin...
200849    Afterward, Azarenka, more effusive with the pr...
200850    Leading up to Super Bowl XLVI, the most talked...
200851    CORRECTION: An earlier version of this story i...
200852    The five-time all-star center tore into his te...
Name: short_description, Length: 200853, dtype: object

To improve performance (accuracy), we standardize the input text as follows.
- Convert all words to lowercase in order to reduce the vocabulary size
- Replace "-" (hyphen) with a space
- Remove all punctuation except " ' " (e.g., don't, isn't) and "&" (e.g., AT&T)

In [2]:
train_data = train_data.str.lower()
train_data = train_data.str.replace("-", " ", regex=True)
train_data = train_data.str.replace(r"[^'\&\w\s]", "", regex=True)
train_data = train_data.str.strip()
train_data

0         she left her husband he killed their children ...
1                                   of course it has a song
2         the actor and his longtime girlfriend anna ebe...
3         the actor gives dems an ass kicking for not fi...
4         the dietland actress said using the bags is a ...
                                ...                        
200848    verizon wireless and at&t are already promotin...
200849    afterward azarenka more effusive with the pres...
200850    leading up to super bowl xlvi the most talked ...
200851    correction an earlier version of this story in...
200852    the five time all star center tore into his te...
Name: short_description, Length: 200853, dtype: object
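
To see what each step does on a single sentence, here is a minimal sketch that applies the same three rules with Python's built-in `re` module instead of pandas. The helper name `standardize` and the sample sentence are made up for illustration.

```python
import re

def standardize(text: str) -> str:
    """Apply the three standardization steps to one plain string (illustrative helper)."""
    text = text.lower()                    # 1. lowercase
    text = text.replace("-", " ")          # 2. hyphen -> space
    text = re.sub(r"[^'&\w\s]", "", text)  # 3. drop punctuation except ' and &
    return text.strip()

print(standardize("AT&T isn't a five-time All-Star, is it?"))
# at&t isn't a five time all star is it
```

Note that the punctuation is deleted rather than replaced with a space, so an abbreviation written with periods collapses into a single token.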

Finally, we add ```<start>``` and ```<end>``` tokens to each sequence as follows, because the sequence boundaries are important information for learning an ordered sequence.

```this is a pen``` --> ```<start> this is a pen <end>```

In [3]:
train_data = [" ".join(["<start>", x, "<end>"]) for x in train_data]
# print first row
train_data[0]

'<start> she left her husband he killed their children just another day in america <end>'

## Generate sequence inputs

We will generate a sequence of word indices (i.e., tokenize) from the text.

![Index vectorize](images/index_vectorize2.png)

First, we create a vocabulary list (```vocab```).

In [4]:
from nltk.tokenize import SpaceTokenizer

###
# define Vocab
###
class Vocab:
    def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):
        # count vocab frequency
        vocab_freq = {}
        tokens = tokenization(list_of_sentence)
        for t in tokens:
            for vocab in t:
                if vocab not in vocab_freq:
                    vocab_freq[vocab] = 0 
                vocab_freq[vocab] += 1
        # sort by frequency
        vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}
        # create vocab list
        self.vocabs = [special_token] + list(vocab_freq.keys())
        if max_tokens:
            self.vocabs = self.vocabs[:max_tokens]
        self.stoi = {v: i for i, v in enumerate(self.vocabs)}

    def _get_tokens(self, list_of_sentence):
        for sentence in list_of_sentence:
            tokens = tokenizer.tokenize(sentence)
            yield tokens

    def get_itos(self):
        return self.vocabs

    def get_stoi(self):
        return self.stoi

    def append_token(self, token):
        self.vocabs.append(token)
        self.stoi = {v: i for i, v in enumerate(self.vocabs)}

    def __call__(self, list_of_tokens):
        def get_token_index(token):
            if token in self.stoi:
                return self.stoi[token]
            else:
                return 0
        return [get_token_index(t) for t in list_of_tokens]

    def __len__(self):
        return len(self.vocabs)

###
# generate Vocab
###
max_word = 50000

# create tokenizer
tokenizer = SpaceTokenizer()

# define tokenization function
def yield_tokens(data):
    for text in data:
        tokens = tokenizer.tokenize(text)
        yield tokens

# build vocabulary list
vocab = Vocab(
    train_data,
    tokenization=yield_tokens,
    special_token="<unk>",
    max_tokens=max_word,
)
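
The same idea can be sketched on a toy corpus with `collections.Counter`: count tokens, reserve index 0 for unknown words, and number the rest by descending frequency. The names `itos_demo`, `stoi_demo`, and `encode` are hypothetical and are not part of the `Vocab` class above.

```python
from collections import Counter

sentences = [
    "<start> the cat sat <end>",
    "<start> the dog sat <end>",
    "<start> the cat ran <end>",
]

# count space-separated tokens over the whole corpus
counts = Counter(tok for s in sentences for tok in s.split(" "))

# index 0 is reserved for unknown words; the rest follow in descending frequency
itos_demo = ["<unk>"] + [tok for tok, _ in counts.most_common()]
stoi_demo = {tok: i for i, tok in enumerate(itos_demo)}

def encode(tokens):
    # words missing from the vocabulary fall back to index 0 (<unk>)
    return [stoi_demo.get(t, 0) for t in tokens]

ids = encode("<start> the bird sat <end>".split(" "))
print(ids)  # "bird" was never seen, so its index is 0
```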

The generated token indices are ```0, 1, ... , vocab_size - 1```.<br>
Now I set ```vocab_size``` (here 50000) as the token id for padded positions.

In [5]:
pad_index = len(vocab)
vocab.append_token("<pad>")

Get the lists for both index-to-word and word-to-index lookups.

In [6]:
itos = vocab.get_itos()
stoi = vocab.get_stoi()

In [7]:
# test
print("The number of token index is {}.".format(len(vocab)))
print("The padded index is {}.".format(stoi["<pad>"]))

The number of token index is 50001.
The padded index is 50000.


Now we build a collator function, which is used for pre-processing in the data loader.

In this collator, we first create a list of word indices as follows.

```<start> this is pen <end>``` --> ```[2, 7, 5, 14, 1]```

Next, we separate it into features (x) and labels (y).<br>
In this task, we predict the next word in the sequence, so we create the following features (x) and labels (y) for each row.

<u>before</u> :

```[2, 7, 5, 14, 1]```

<u>after</u> :

```x : [2, 7, 5, 14, 1]```

```y : [7, 5, 14, 1, -100]```

> Note : Here I set -100 as the label id to be ignored, because the PyTorch cross-entropy function (```torch.nn.functional.cross_entropy()```) has an ```ignore_index``` parameter whose default value is -100.

Finally, we pad the inputs as follows.<br>
The padding index in the features is ```pad_index``` (shown as ```N``` below), and the padding index in the labels is -100. (See the note above.)

```x : [2, 7, 5, 14, 1, N, ... , N]```

```y : [7, 5, 14, 1, -100, -100, ... , -100]```
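
The shift-and-pad steps above can be sketched in plain Python for a single sequence. The helper `make_xy` and the pad id 99 are assumptions chosen for illustration; the collator in the next cell does the same thing with the real `pad_index`.

```python
IGNORE = -100  # label id skipped by the loss

def make_xy(token_ids, pad_id, max_len):
    """Build next-word features and labels for one sequence (illustrative helper)."""
    x = list(token_ids)
    # the label at position t is the token at position t+1; the last position has no target
    y = x[1:] + [IGNORE]
    # truncate, then pad both lists to a fixed length
    x, y = x[:max_len], y[:max_len]
    x = x + [pad_id] * (max_len - len(x))
    y = y + [IGNORE] * (max_len - len(y))
    return x, y

x, y = make_xy([2, 7, 5, 14, 1], pad_id=99, max_len=8)
print(x)  # [2, 7, 5, 14, 1, 99, 99, 99]
print(y)  # [7, 5, 14, 1, -100, -100, -100, -100]
```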

In [8]:
import torch
from torch.utils.data import DataLoader

max_seq_len = 256

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, feature_list = [], []
    for text in batch:
        # tokenize to a list of word's indices
        tokens = vocab(tokenizer.tokenize(text))
        # separate into features and labels
        y = tokens[1:]
        y.append(-100)
        x = tokens
        # limit length to max_seq_len
        y = y[:max_seq_len]
        x = x[:max_seq_len]
        # pad features and labels
        y += [-100] * (max_seq_len - len(y))
        x += [pad_index] * (max_seq_len - len(x))
        # add to list
        label_list.append(y)
        feature_list.append(x)
    # convert to tensor
    label_list = torch.tensor(label_list, dtype=torch.int64).to(device)
    feature_list = torch.tensor(feature_list, dtype=torch.int64).to(device)
    return label_list, feature_list

dataloader = DataLoader(
    train_data,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_batch
)

In [9]:
# test
for labels, features in dataloader:
    break

print("label shape in batch : {}".format(labels.size()))
print("feature shape in batch : {}".format(features.size()))
print("***** label sample *****")
print(labels[0])
print("***** features sample *****")
print(features[0])

label shape in batch : torch.Size([32, 256])
feature shape in batch : torch.Size([32, 256])
***** label sample *****
tensor([  29, 1548,    4, 5214, 1548,    2, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
   

## Build network

Now we build a model for next-word prediction using a simple RNN architecture.

![RNN network](images/rnn_network.png)

In PyTorch, you can use the ```torch.nn.RNN``` module to run a simple RNN, and we use this built-in module in this example.

In the following example, the RNN input is expected to have the shape ```(batch_size, sequence_length, input_dimension)```.<br>
However, to tell the RNN which time steps in each sequence should be processed (i.e., for RNN masking), we wrap this tensor as a packed sequence with ```torch.nn.utils.rnn.pack_padded_sequence()``` before passing it into the RNN module.<br>
For example, when the batch size is 4 and we generate a packed sequence with ```lengths=[5, 3, 3, 2]``` in ```torch.nn.utils.rnn.pack_padded_sequence()```, the sequence numbers processed at each time step will be :

```
time-step 1 : {1, 2, 3, 4}
time-step 2 : {1, 2, 3, 4}
time-step 3 : {1, 2, 3}
time-step 4 : {1}
time-step 5 : {1}
```

As a result, the time steps are processed with batch sizes ```[4, 4, 3, 1, 1]```. (See the picture below.)

![packed sequence](images/rnn_packed_sequence.png)

> Note : When the lengths are not sorted (and ```enforce_sorted=False``` is set), the sequences in the batch are first sorted by descending length, and the batch for each time step is then planned accordingly. (When the result is unpacked, the sequences are returned to their original order.)
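
The per-time-step batch sizes can be reproduced without PyTorch by counting how many sequences are still active at each step. This is only a sketch of the bookkeeping, not of how PyTorch stores a packed sequence internally; `packed_batch_sizes` is a hypothetical helper.

```python
def packed_batch_sizes(lengths):
    """Number of sequences that still have a token at each time step."""
    longest = max(lengths)
    return [sum(1 for n in lengths if n > t) for t in range(longest)]

print(packed_batch_sizes([5, 3, 3, 2]))  # [4, 4, 3, 1, 1]
print(packed_batch_sizes([2, 5, 3, 3]))  # [4, 4, 3, 1, 1] (the order of the sequences does not change the counts)
```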

In [10]:
import torch
import torch.nn as nn

embedding_dim = 64
rnn_units = 512

class SimpleRnnModel(nn.Module):
    def __init__(self, vocab_size, seq_len, embedding_dim, rnn_units, padding_idx):
        super().__init__()

        self.seq_len = seq_len
        self.padding_idx = padding_idx

        self.embedding = nn.Embedding(
            vocab_size,
            embedding_dim,
            padding_idx=padding_idx,
        )
        self.rnn = nn.RNN(
            input_size=embedding_dim,
            hidden_size=rnn_units,
            num_layers=1,
            batch_first=True,
        )
        self.classify = nn.Linear(rnn_units, vocab_size)

    def forward(self, inputs, states=None, return_final_state=False):
        # embedding
        #   --> (batch_size, seq_len, embedding_dim)
        outs = self.embedding(inputs)
        # build "lengths" property to pack inputs (see above)
        lengths = (inputs != self.padding_idx).int().sum(dim=1, keepdim=False)
        # pack inputs for RNN
        packed_inputs = torch.nn.utils.rnn.pack_padded_sequence(
            outs,
            lengths.cpu(),
            batch_first=True,
            enforce_sorted=False,
        )
        # apply RNN
        if states is None:
            packed_outs, final_state = self.rnn(packed_inputs)
        else:
            packed_outs, final_state = self.rnn(packed_inputs, states)
        # unpack results
        #   --> (batch_size, seq_len, rnn_units)
        outs, _ = torch.nn.utils.rnn.pad_packed_sequence(
            packed_outs,
            batch_first=True,
            padding_value=0.0,
            total_length=self.seq_len,
        )
        # apply feed-forward to classify
        #   --> (batch_size, seq_len, vocab_size)
        logits = self.classify(outs)
        # return results
        if return_final_state:
            return logits, final_state  # This is used in prediction
        else:
            return logits               # This is used in training

model = SimpleRnnModel(
    vocab_size=len(vocab),
    seq_len=max_seq_len,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    padding_idx=pad_index).to(device)

## Train

Now run training with the above model.

As mentioned above, the loss on label id -100 is ignored by the ```cross_entropy()``` function. The padded positions and the final position of each sequence (which has no next word) are therefore ignored in optimization.

> Note : This is because the default value of the ```ignore_index``` parameter in the ```cross_entropy()``` function is -100. (You can change this value.)
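
To make the effect of ignored labels concrete, here is a plain-Python sketch of a mean cross-entropy that skips positions labeled -100. It is written without PyTorch so the arithmetic is visible; the function name and the toy logits are assumptions, and it is meant to mirror the averaging over non-ignored targets rather than to replace ```cross_entropy()```.

```python
import math

def cross_entropy_ignore(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded or final positions contribute nothing
        # negative log-softmax of the target class, computed stably
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count

logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [9.0, 9.0, 9.0]]
with_pad = cross_entropy_ignore(logits, [0, 1, -100])
without_pad = cross_entropy_ignore(logits[:2], [0, 1])
print(with_pad == without_pad)  # True: the ignored row does not affect the loss
```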

In [11]:
from torch.nn import functional as F

num_epochs = 5

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(num_epochs):
    for labels, seqs in dataloader:
        # optimize
        optimizer.zero_grad()
        logits = model(seqs)
        loss = F.cross_entropy(logits.transpose(1,2), labels)
        loss.backward()
        optimizer.step()
        # calculate accuracy
        pred_labels = logits.argmax(dim=2)
        num_correct = (pred_labels == labels).float().sum()
        num_total = (labels != -100).float().sum()
        accuracy = num_correct / num_total
        print("Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}".format(epoch+1, loss.item(), accuracy), end="\r")
    print("")

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.53 GiB. GPU 0 has a total capacity of 3.69 GiB of which 375.12 MiB is free. Including non-PyTorch memory, this process has 3.30 GiB memory in use. Of the allocated memory 3.19 GiB is allocated by PyTorch, and 13.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
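
The 1.53 GiB in the error message above is consistent with the size of a single float32 tensor of shape ```(batch_size, seq_len, vocab_size)```, which is the shape of the logits. A rough check, assuming 4 bytes per element and the settings used in this notebook (the 64-step case is a hypothetical comparison, not something this notebook does):

```python
batch_size, seq_len, vocab_size = 32, 256, 50001
bytes_per_float32 = 4

# one logits tensor for a full batch padded to max_seq_len
logits_bytes = batch_size * seq_len * vocab_size * bytes_per_float32
print("{:.2f} GiB".format(logits_bytes / 2**30))  # 1.53 GiB

# hypothetical: the same batch padded to only 64 time steps
short_bytes = batch_size * 64 * vocab_size * bytes_per_float32
print("{:.2f} GiB".format(short_bytes / 2**30))   # 0.38 GiB
```

If this tensor is indeed what exhausts the GPU, a smaller batch size, a smaller ```max_seq_len```, or a smaller vocabulary would shrink it proportionally. These are possible mitigations rather than tested fixes, and shortening the padded length would also require adjusting ```total_length``` in the model.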

## Generate Text (Simple RNN)

Here I simply generate some text with the trained model.

Evaluating a text generation task is not easy, because simply checking for an exact match against a reference text is not adequate.<br>
Common metrics such as BLEU or ROUGE are used in these cases.
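
As a small illustration of why such metrics go beyond exact matching, here is a sketch of clipped unigram precision, one building block of BLEU. It is not a full BLEU implementation (no higher-order n-grams, no brevity penalty); the function name and the sample strings are made up.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # each candidate word is credited at most as often as it appears in the reference
    matched = sum(min(n, ref[w]) for w, n in cand.items())
    return matched / sum(cand.values())

print(clipped_unigram_precision("the the the cat", "the cat sat"))  # 0.5
```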

> Note : Here I use greedy search, which will sometimes lead to a poor sequence. For drawbacks and solutions, see the note in [this example](./05_language_model_basic.ipynb).
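
Greedy search means taking the single highest-scoring word at every step. A toy sketch, using a hand-written score table in place of the trained model (all words and scores here are hypothetical):

```python
# hypothetical next-word scores; a trained model would supply these
next_word_scores = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"president": 0.5, "actor": 0.3, "<end>": 0.2},
    "president": {"said": 0.7, "<end>": 0.3},
    "said": {"<end>": 0.9, "the": 0.1},
}

def greedy_decode(start, max_steps=10):
    words = [start]
    for _ in range(max_steps):
        candidates = next_word_scores[words[-1]]
        # greedy step: always take the single highest-scoring word
        best = max(candidates, key=candidates.get)
        words.append(best)
        if best == "<end>":
            break
    return " ".join(words)

print(greedy_decode("<start>"))  # <start> the president said <end>
```

Because only the best word is kept at each step, a continuation that starts with a lower-scoring word is never explored, even if it would score higher overall.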

In [None]:
end_index = stoi["<end>"]
max_output = 128

def pred_output(text):
    generated_text = "<start> " + text
    # encode the whole prompt and find the position of its last real token
    _, inputs = collate_batch([generated_text])
    mask = (inputs != pad_index).int()
    last_idx = mask[0].sum() - 1
    final_states = None
    # gradients are not needed for generation
    with torch.no_grad():
        outputs, final_states = model(inputs, final_states, return_final_state=True)
        pred_index = outputs[0][last_idx].argmax().item()
        for loop in range(max_output):
            next_word = itos[pred_index]
            generated_text += " " + next_word
            if pred_index == end_index:
                break
            # feed only the new word; the RNN state carries the history
            _, inputs = collate_batch([next_word])
            outputs, final_states = model(inputs, final_states, return_final_state=True)
            pred_index = outputs[0][0].argmax().item()
    return generated_text

print(pred_output("prime"))
print(pred_output("chairman"))
print(pred_output("he was expected"))

Reference: https://github.com/tsmatz/nlp-tutorials/blob/master/06_language_model_rnn.ipynb