# Working with Text

So far we have only been working with images as inputs for our Neural Networks.

When the inputs are text, such as documents, articles, or social media posts, our current networks fall short: they expect numerical input values, and raw text is not numerical.

Text data is inherently different from images; it's unstructured, variable in length, and lacks the same spatial relationships that images possess. This necessitates a different approach to processing and understanding text data.

One approach to this problem is to split the text into words or subwords, called **tokens**, and convert these tokens into numerical representations that our neural networks can process.

Mapping tokens to vectors in this way is known as **text embedding**, or vectorization, and it is what allows a neural network to work with textual information.

But how do we get meaningful vector representations of our tokens? Well, that's the neat part: we treat them as learnable parameters and let the model come up with fitting word embeddings.
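To make "learnable parameters" concrete, here is a minimal sketch using `nn.Embedding`, a plain lookup table whose rows are trained like any other weight. The vocabulary size, embedding dimension, and token indices below are made-up values for illustration only.

```python
import torch
from torch import nn

# Hypothetical toy sizes: 5 tokens in the vocabulary, 3 numbers per token.
toy_embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Each token index selects one row of the weight matrix.
token_ids = torch.tensor([0, 2, 2, 4])
vectors = toy_embedding(token_ids)

print(vectors.shape)                       # one 3-dimensional vector per token
print(toy_embedding.weight.requires_grad)  # the table is a trainable parameter
```

The same index always yields the same row, and because the table is a parameter, the optimizer updates these rows during training just like the weights of a linear layer.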

## AG's News Corpus

Let's see what a text classification network would look like using the example of the AG's News Corpus dataset.

The AG's News Corpus is a dataset commonly used for text classification tasks. It consists of news articles categorized into four classes: *World*, *Sports*, *Business*, and *Science/Technology*. It contains 30,000 training and 1,900 test samples per class.

In [None]:
!wget https://hyperion.bbirke.de/data/datasets/ag_news.zip
!mkdir -p datasets/ag_news
!unzip ag_news.zip -d datasets/ag_news/

We import our Python packages.

In [None]:
import torch
from torch import nn

import numpy as np
import pandas as pd
import seaborn as sn

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

The training and test splits are stored as `.csv` files. We load them as `pandas.DataFrame` objects for convenient data manipulation and visualization.

In [None]:
df_train = pd.read_csv("datasets/ag_news/train.csv")
df_test = pd.read_csv("datasets/ag_news/test.csv")

We print the first few rows and header of our DataFrame.

In [None]:
df_train.head()

Because the class indices in the dataset start at 1 instead of 0, we decrease our labels by one. The loss function we use later expects class indices that start at 0.

In [None]:
df_train["Class Index"] = df_train["Class Index"] - 1
df_test["Class Index"] = df_test["Class Index"] - 1

In [None]:
val2label = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech"
}

Now we can create our `NewsDataset`.

In [None]:
class NewsDataset(Dataset):
    def __init__(self, texts, labels, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.labels = labels
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        if self.transform:
            text = self.transform(text)
        if self.target_transform:
            label = self.target_transform(label)
        return text, label

We create a training and testing instance of our datasets.

In [None]:
train_data = NewsDataset(
        texts=df_train["Description"].to_list(),
        labels=df_train["Class Index"].to_list()
    )
test_data = NewsDataset(
        texts=df_test["Description"].to_list(),
        labels=df_test["Class Index"].to_list()
    )

Let's test our dataset and print some random texts and corresponding labels from our corpus.

In [None]:
examples = 10

for i in range(examples):
    sample_idx = torch.randint(len(train_data), size=(1,)).item()
    text, label = train_data[sample_idx]
    print('News Text:')
    print(text)
    print(f'Category: {val2label[label]}')
    print('-'*30 + '\n')

Until now, we've been treading familiar ground. However, it's time to dive into the realm of Natural Language Processing.

Let's start with the tokenization.

In [None]:
tokenizer = get_tokenizer('basic_english')

Here, we created a simple tokenizer that splits a sentence into a list of tokens. The sentence is split on whitespace and punctuation characters, and the resulting tokens are converted to lower case.

In [None]:
tokenizer("This is a simple Test.")
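To get a feeling for what such a tokenizer does, here is a rough approximation built with the standard library. It is not the torchtext implementation, which applies additional normalization rules; the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then split it into word and punctuation tokens."""
    # \w+ matches runs of letters/digits, [^\w\s] matches a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(simple_tokenize("This is a simple Test."))
```

For this simple sentence the sketch yields the lowercased words followed by the final period as a separate token.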

The next step is to construct a vocabulary from the raw training dataset. For this, we use the torchtext function `build_vocab_from_iterator`, which takes an iterator that yields a list or an iterator of tokens.

We also add the special token `<unk>` to our vocabulary and set it as the default index. This token stands for unknown words, that is, words we did not encounter in the training dataset. Its index is used as a fallback whenever we look up an unseen token.

In [None]:
def yield_tokens(dataset):
    for idx in range(len(dataset)):
        text, _ = dataset[idx]
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
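Conceptually, a vocabulary is just a mapping from tokens to integer indices with a fallback for unknown tokens. The following plain-Python sketch illustrates the idea; the helper names `build_toy_vocab` and `lookup` are hypothetical, and the ordering rule (most frequent first, ties broken alphabetically) is an assumption of this sketch rather than a statement about the library.

```python
from collections import Counter

def build_toy_vocab(token_lists, unk_token="<unk>"):
    """Map each token to an index; index 0 is reserved for unknown tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent tokens get the smallest indices (ties broken alphabetically).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate([unk_token] + ordered)}

def lookup(vocab, tokens, unk_token="<unk>"):
    """Convert tokens to indices, falling back to the <unk> index."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

toy_vocab = build_toy_vocab([["the", "cat", "sat"], ["the", "dog"]])
print(lookup(toy_vocab, ["the", "cat", "flew"]))
```

The token `"flew"` never appeared in the toy corpus, so it is mapped to the index of `<unk>`.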

Let's see if everything works.

In [None]:
tokens = tokenizer("This is a simple test.")
indices = vocab(tokens)
reversed_tokens = vocab.lookup_tokens(indices)

print(tokens)
print(indices)
print(reversed_tokens)

The `collate_fn` parameter is used in conjunction with the `DataLoader` class, specifically when dealing with datasets that contain samples of varying sizes or shapes.

When you have a dataset with samples that have different shapes or sizes, you often need to pad or resize them to make them uniform before feeding them into a neural network for training.

The `collate_fn` parameter allows you to define a custom function that specifies how to collate (combine) the individual samples into batches.

Given the variable lengths of our sentences, we concatenate the token indices of all texts in a minibatch into a single one-dimensional tensor. To keep track of where each text sequence starts, we also compute offsets, which are fed into our model alongside the token indices.

In [None]:
text_processor = lambda x: vocab(tokenizer(x))
label_processor = lambda x: int(x)

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _text, _label in batch:
        label_list.append(label_processor(_label))
        processed_text = torch.tensor(text_processor(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return text_list, label_list, offsets
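The relationship between the flat token list and the offsets is easiest to see on a tiny example. The sketch below uses plain Python lists and made-up token indices instead of tensors.

```python
from itertools import accumulate

# Three hypothetical tokenized samples of different lengths.
samples = [[4, 8, 15], [16, 23], [42, 7, 7, 1]]

# Concatenate all token indices into one flat list.
flat = [idx for sample in samples for idx in sample]

# Each offset is the position in `flat` where a sample starts.
lengths = [len(sample) for sample in samples]
offsets = [0] + list(accumulate(lengths))[:-1]

print(flat)
print(offsets)

# The original samples can be recovered by slicing between consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]
```

With sample lengths 3, 2, and 4, the samples start at positions 0, 3, and 5 of the flat list, and slicing between consecutive offsets recovers each sample.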

We define our hyperparameters.

In [None]:
batch_size = 32
epochs = 10
learning_rate = 1e-3

Now we can create an instance of our DataLoader with our custom collate function.

In [None]:
train_dataloader = DataLoader(
        train_data,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_batch,
    )

test_dataloader = DataLoader(
        test_data,
        batch_size=batch_size,
        shuffle=False,  # evaluation does not depend on the sample order
        collate_fn=collate_batch,
    )

The model is rather minimal. We use an `nn.EmbeddingBag` layer, which by default (`mode="mean"`) computes the mean of the token embeddings in each sequence. The resulting tensor is then fed forward to a simple fully connected stack, as we have already seen.

Note that the `forward` function now also requires offsets for the embedding layer.

In [None]:
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = nn.EmbeddingBag(
                num_embeddings=vocab_size,
                embedding_dim=embed_dim,
                sparse=False
            )
        self.linear_stack = nn.Sequential(
            nn.Linear(in_features=embed_dim, out_features=8),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(8, 4)
        )

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.linear_stack(embedded)
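To see what the embedding layer does with the offsets, the following sketch compares the output of a small `nn.EmbeddingBag` with a manual computation. It assumes mean pooling, which is set explicitly here; the sizes and token indices are made-up values.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy layer: 10 tokens in the vocabulary, 4-dimensional embeddings.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two sequences, [1, 2, 3] and [4, 5], stored as one flat tensor plus offsets.
flat = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)

pooled = bag(flat, offsets)  # one pooled vector per sequence

# Manual computation: average the embedding rows of each sequence.
manual = torch.stack([
    bag.weight[flat[0:3]].mean(dim=0),
    bag.weight[flat[3:5]].mean(dim=0),
])

print(pooled.shape)
print(torch.allclose(pooled, manual))
```

Regardless of how long each text is, the layer returns exactly one vector of size `embedding_dim` per sequence, which is why a fixed-size linear layer can follow it.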

We define our device and create an instance of our model.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
model = EmbeddingModel(vocab_size=len(vocab), embed_dim=16).to(device)

Let's check the forward pass.

In [None]:
texts, labels, offsets = next(iter(train_dataloader))

texts = texts.to(device)
offsets = offsets.to(device)
logits = model(texts, offsets)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)

print(f"Predicted class: {y_pred}")
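Softmax turns the logits into probabilities but does not change their order, so the predicted class is the same whether we take the argmax of the logits or of the probabilities. A small plain-Python sketch with made-up logits illustrates this; the helper functions are written here for illustration only.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

toy_logits = [1.2, -0.3, 2.5, 0.1]  # hypothetical scores for the four classes
probs = softmax(toy_logits)

print(argmax(toy_logits), argmax(probs))
```

The softmax is therefore only needed when we want to report a probability, not to pick the predicted class.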

We define our loss and optimizer.

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
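Note that `nn.CrossEntropyLoss` expects raw logits and applies the log-softmax internally, which is why our model has no softmax layer at the end. For a single sample, the loss is the negative log-probability of the correct class. The following plain-Python sketch computes that quantity for made-up logits; it is an illustration of the formula, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    # log-sum-exp with the maximum subtracted for numerical stability
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum_exp - logits[target]

toy_logits = [1.2, -0.3, 2.5, 0.1]  # hypothetical scores for the four classes

print(cross_entropy(toy_logits, target=2))  # lower loss: class 2 has the highest logit
print(cross_entropy(toy_logits, target=1))  # higher loss: class 1 has the lowest logit
```

An untrained model that scores all four classes equally would have a loss of $\ln 4 \approx 1.386$, which is a useful reference point when reading the first loss values printed during training.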

The training and test loops are almost identical to what we have seen so far. The only difference is that our dataloader now also returns the offset, which is then fed into our model.

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y, offset) in enumerate(dataloader):
        X = X.to(device)
        y = y.to(device)
        offset = offset.to(device)
        pred = model(X, offset)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            # X holds the concatenated tokens of the whole batch, so we count samples via y.
            loss, current = loss.item(), batch * batch_size + len(y)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn, best_result):
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y, offset in dataloader:
            X = X.to(device)
            y = y.to(device)
            offset = offset.to(device)
            pred = model(X, offset)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f}")
    if correct > best_result:
        print("New highscore! Saving model...\n")
        torch.save(model.state_dict(), 'best-model-parameters.pt')
        return correct
    print()
    return best_result

Let's start the training loop!

In [None]:
best_result = 0.0
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    best_result = test_loop(test_dataloader, model, loss_fn, best_result)
print("Done!")

Load the weights of our best model.

In [None]:
model.load_state_dict(torch.load("best-model-parameters.pt", map_location=device))

Let's print some example texts alongside predicted and actual labels.

In [None]:
model.eval()

examples = 10

for i in range(examples):
    sample_idx = torch.randint(len(test_data), size=(1,)).item()
    text, actual_label = test_data[sample_idx]
    text_tensor, actual_label_tensor, offset_tensor = collate_batch([(text, actual_label)])
    text_tensor = text_tensor.to(device)
    offset_tensor = offset_tensor.to(device)
    # No gradients are needed for inference.
    with torch.no_grad():
        logits = model(text_tensor, offset_tensor)
    pred_probab = nn.Softmax(dim=1)(logits)
    y_pred = pred_probab.argmax(1)
    y_prob = pred_probab.max()
    print('News Text:')
    print(text)
    print(f'Predicted Category: "{val2label[y_pred.item()]}" with p={y_prob.item():.3f}')
    print(f'Actual Category: "{val2label[actual_label]}"')
    print('-'*30 + '\n')

We can also check the confusion matrix.

In [None]:
model.eval()
y_pred = []
y_true = []

with torch.no_grad():
    # Iterate over the test data and collect predictions and true labels.
    for X, y, offset in test_dataloader:
        X = X.to(device)
        offset = offset.to(device)
        output = model(X, offset)

        # The predicted class is the index of the largest logit.
        y_pred.extend(output.argmax(1).cpu().numpy())
        y_true.extend(y.cpu().numpy())

cf_matrix = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predictions; normalize each row to sum to 1.
df_cm = pd.DataFrame(cf_matrix / np.sum(cf_matrix, axis=1)[:, None], index = [val2label[i] for i in range(4)],
                     columns = [val2label[i] for i in range(4)])
plt.figure(figsize = (12,8))
sn.heatmap(df_cm, annot=True)
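When reading a confusion matrix, it helps to remember that `sklearn.metrics.confusion_matrix` puts the true classes in the rows and the predicted classes in the columns. Dividing every row by its total turns the diagonal into the per-class recall. The following sketch shows this on a made-up three-class matrix with plain Python lists.

```python
# Hypothetical 3-class confusion matrix: rows are true classes, columns are predictions.
counts = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]

# Divide every row by its total so that each row sums to 1.
row_normalized = [[value / sum(row) for value in row] for row in counts]

# The diagonal now holds the per-class recall.
recall = [row_normalized[i][i] for i in range(len(counts))]
print(recall)
```

In this toy example, the third class is confused with the second one half of the time, which shows up as two equal entries in the last row.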