## Objective

PyTorch tutorial on text classification for the AG news data. 

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 test samples, for a total of 120,000 training samples and 7,600 test samples.

The classes are the following:

* World
* Sports
* Business
* Sci/Tech

### 1 - Load the data in raw form

We download the AG_NEWS data from `torchtext.datasets` and create a simple `DataLoader` to better understand the internal format of the data.

In [1]:
import torch
from torchtext.datasets import AG_NEWS
from torch.utils.data import DataLoader

# Load the raw data; a default DataLoader later yields each item as [tensor_label, (text,)]
raw_iterable_dataset = AG_NEWS(split='train')

# Show the type of the IterableDataset
print(type(raw_iterable_dataset))

<class 'torchtext.data.datasets_utils._RawTextIterableDataset'>


In [2]:
# Create a simple DataLoader and print the first two data instances to better understand the internal format
raw_dataloader = DataLoader(raw_iterable_dataset)
raw_data_iter = iter(raw_dataloader)  # create the iterator once, then advance it
for _ in range(2):
    print(next(raw_data_iter))

# Note: after running this example we re-load the data, because the iteration above
# consumed items from the dataset (torchtext.vocab gave issues with a previously
# iterated dataset); this raw iterable dataset appears to support only a single pass
raw_iterable_dataset = AG_NEWS(split='train')

[tensor([3]), ("Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",)]
[tensor([3]), ('Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',)]
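
The re-load in the note above is consistent with the raw dataset behaving like a one-shot iterator. The following is a minimal pure-Python sketch of that behaviour. The generator `toy_dataset` and its third headline are hypothetical stand-ins for `AG_NEWS`, introduced only for illustration; the sketch does not call torchtext and makes no claim about its internals.

```python
def toy_dataset():
    """Hypothetical stand-in for a raw iterable dataset of (label, text) pairs."""
    yield 3, "Wall St. Bears Claw Back Into the Black"
    yield 3, "Carlyle Looks Toward Commercial Aerospace"
    yield 2, "A made-up sports headline"

data_iter = toy_dataset()

# Peek at the first two items, as in the cell above.
peeked = [next(data_iter) for _ in range(2)]

# A later pass over the same iterator only sees what is left.
remaining = list(data_iter)

# Re-creating the iterator (the analogue of calling AG_NEWS(split='train') again)
# restores the full data.
fresh = list(toy_dataset())

print(len(peeked), len(remaining), len(fresh))
```

If the vocabulary were built from `data_iter` after the peek, it would silently miss the consumed items, which is why re-creating the dataset is the safer habit.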


### 2 - Preprocessing functions

Use some of the basic components of `torchtext` to preprocess the raw IterableDataset:
* tokenizer
* vocabulary builder

In [3]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# we use the basic english tokenizer
tokenizer = get_tokenizer('basic_english')

# method for applying the tokenizer on each data instance of the IterableDataset and returning the result as an iterator
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# It builds the vocabulary from the iterator returned by `yield_tokens()`
vocab = build_vocab_from_iterator(yield_tokens(raw_iterable_dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
len(vocab)

95811

**Example for understanding the result of tokenizer and vocabulary:**

In [4]:
test_string = "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."
test_string_tokenized = tokenizer(test_string) #  output in list format (makes sense because its contents are currently strings)
test_string_indexed = vocab(test_string_tokenized) #  output in list format (not tensor)

print(test_string)
print(test_string_tokenized)
print(test_string_indexed)

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short-sellers', ',', 'wall', 'street', "'", 's', 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']
[431, 425, 1, 1605, 14838, 113, 66, 2, 848, 13, 27, 14, 27, 15, 50725, 3, 431, 374, 16, 9, 67507, 6, 52258, 3, 42, 4009, 783, 325, 1]
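
To make the token-to-index lookup more concrete, here is a toy sketch of a frequency-ordered vocabulary with a fallback index for unknown tokens. The helpers `build_toy_vocab` and `lookup` are hypothetical and are not torchtext's implementation; the ordering and tie-breaking rules are assumptions chosen for illustration.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (descending), then alphabetically to break ties.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

corpus = [["wall", "st", "."], ["wall", "street", "."], ["green", "again", "."]]
toy_vocab = build_toy_vocab(corpus)
unk_index = toy_vocab["<unk>"]

def lookup(tokens):
    # Unknown tokens fall back to the "<unk>" index, mirroring the role of
    # vocab.set_default_index(vocab["<unk>"]) in the cell above.
    return [toy_vocab.get(tok, unk_index) for tok in tokens]

print(lookup(["wall", ".", "bears"]))  # "bears" is not in the toy corpus
```

Frequent tokens receive small indices, and any token missing from the corpus maps to index 0.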


Now, we can define the two lambda functions that will be used as preprocessing steps when iterating over the data. These functions will act as our "pipeline":

In [5]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

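The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. The label pipeline converts the label into a zero-based integer: the AG_NEWS labels run from 1 to 4, so subtracting 1 gives the class indices 0 to 3 that the loss function expects.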

**Example:**

In [6]:
print(text_pipeline('here is the an example'))
print(label_pipeline("10"))

[475, 21, 2, 30, 5297]
9


### 3 - DataLoader and iterator

To train our model, we need the dataset in `torch.utils.data.DataLoader` form. The format of this DataLoader depends on the learning task. In this example, we want to do text classification where the text inputs do not have the same length. Therefore, the text entries in each batch are packed into a list and concatenated into a single tensor that serves as the input of `nn.EmbeddingBag`. The **offsets** tensor holds the starting index of each individual sequence within that text tensor. The label tensor stores the labels of the individual text entries.

In [7]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # offsets starts with 0: the index where the first text of the batch begins
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # drop the last length and take the cumulative sum to get each text's start index
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
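
The offsets computation in `collate_batch` can be traced by hand on a tiny batch. The sketch below uses plain Python lists and made-up token indices instead of tensors, so it only illustrates the arithmetic, not the tensor handling.

```python
from itertools import accumulate

# Three already-indexed texts of different lengths (made-up token indices).
batch = [[431, 425, 1], [27, 15], [6, 42, 783, 325]]

# Concatenate all token indices into one flat list.
flat_text = [idx for text in batch for idx in text]

# Same recipe as collate_batch: start from 0, append each length,
# drop the last length, then take the cumulative sum.
lengths = [0] + [len(text) for text in batch]
offsets = list(accumulate(lengths[:-1]))

print(flat_text)
print(offsets)

# Each text can be recovered from the flat list using consecutive offsets.
second_text = flat_text[offsets[1]:offsets[2]]
print(second_text)
```

The last length is dropped because the end of the final text is implied by the end of the flat list.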

For a better understanding of the input difference between `nn.Embedding` and `nn.EmbeddingBag`, consider the following image (ignore the last part, it is simply showing common applications of these layers):

<img src="notebook_images/regular_embedding_vs_embedding_bag_diagram.jpg" width=800>

In [8]:
# I considered this simple example to see what happened inside the DataLoader. Interestingly,
# each time the DataLoader was iterated, it returned a different set of objects in the batch,
# even if the DataLoader was created with shuffle=False

# for X in dataloader:
#     print(X[1].shape)
#     break

### 4 - Model

The model is composed of an `nn.EmbeddingBag` layer plus an `nn.Linear` layer for classification. By default, `nn.EmbeddingBag` computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, the `nn.EmbeddingBag` module requires no padding, since the start of each text is stored in the offsets.

Additionally, since `nn.EmbeddingBag` accumulates the average across the embeddings on the fly, it can enhance the performance and memory efficiency to process a sequence of tensors.
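
As a small check of the "mean of a bag" behaviour, the sketch below compares `nn.EmbeddingBag` in `mode="mean"` with a manual lookup-and-average on the same weights. The sizes (10 tokens, 3 dimensions) and the token indices are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="mean")

# Two "texts" packed into one flat tensor: tokens [1, 2, 4, 5] and [4, 3, 2, 9].
text = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])

bag_output = bag(text, offsets)  # one averaged vector per text, shape (2, 3)

# Manual equivalent: look up each token's vector, then average per text.
manual_output = torch.stack([
    bag.weight[text[0:4]].mean(dim=0),
    bag.weight[text[4:8]].mean(dim=0),
])

print(bag_output.shape)
print(torch.allclose(bag_output, manual_output, atol=1e-6))
```

The manual version materialises every token embedding before averaging, whereas `nn.EmbeddingBag` returns only one vector per text.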

<img src="notebook_images/basic_model.png" width=600>

The output function for this example could be either a Softmax or a hierarchical Softmax. In this case, we consider the Softmax. In PyTorch, when doing classification, we can either apply the output function manually or let the loss function do it. It depends on the loss function we select (which can seem counterintuitive at first):

* If we use `nn.CrossEntropyLoss` as our loss function, it automatically applies the LogSoftmax function to the neural network output.
* If we use `nn.NLLLoss` as our loss function, we must place a LogSoftmax activation at the end of our model (the return value of the `forward()` method).

In this case, we choose the first option.
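
The two options can be compared numerically. The sketch below uses random logits and made-up labels (both are assumptions for illustration, not outputs of the model in this notebook) and checks that the two routes give the same loss value.

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(5, 4)              # raw model outputs: 5 samples, 4 classes
labels = torch.tensor([0, 3, 1, 2, 2])  # made-up target classes

# Option 1: CrossEntropyLoss applied directly to the raw logits.
loss_ce = nn.CrossEntropyLoss()(logits, labels)

# Option 2: explicit LogSoftmax followed by NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, labels)

print(torch.allclose(loss_ce, loss_nll, atol=1e-6))
```

With the first option, the model's `forward()` should return raw logits; applying LogSoftmax in the model and then using `nn.CrossEntropyLoss` would normalise twice.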

In [9]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

In [10]:
raw_iterable_dataset = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in raw_iterable_dataset]))
vocab_size = len(vocab)
num_hid = 64
model = TextClassificationModel(vocab_size, num_hid, num_class).to(device)

print("Number of classes in the data: " + str(num_class))
print("Number of hidden dimensions (both embedding and linear layers): " + str(num_hid))
print("Size of vocabulary: " + str(vocab_size))

Number of classes in the data: 4
Number of hidden dimensions (both embedding and linear layers): 64
Size of vocabulary: 95811


### 5 - Training

We first define the `train()` and `evaluate()` functions.

In [11]:
import time

def train(dataloader, criterion, optimizer, current_epoch):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(current_epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader, criterion):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

Then, we take the split data (train/test) and hold out 5% of the training data instances as the validation dataset.

In [12]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Hyperparameters
EPOCHS = 5 # number of epochs
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
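
The 95/5 split can be tried on a toy dataset to see what `random_split` returns. In this sketch, `toy_dataset` is a hypothetical stand-in for the map-style training set, and the seeded `generator` is an addition for repeatability that the cell above does not use.

```python
import torch
from torch.utils.data import random_split

toy_dataset = list(range(100))  # hypothetical stand-in for the training set
num_train = int(len(toy_dataset) * 0.95)

train_part, valid_part = random_split(
    toy_dataset,
    [num_train, len(toy_dataset) - num_train],
    generator=torch.Generator().manual_seed(0),  # fixed seed (assumption, optional)
)

print(len(train_part), len(valid_part))
# The two subsets never share an item.
print(set(train_part).isdisjoint(set(valid_part)))
```

Passing the second length as `len(dataset) - num_train` guarantees the two lengths add up to the dataset size even when 95% is not a whole number.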

Finally, we run the training process.

In [13]:
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader, criterion, optimizer, epoch)
    accu_val = evaluate(valid_dataloader, criterion)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1782 batches | accuracy    0.688
| epoch   1 |  1000/ 1782 batches | accuracy    0.855
| epoch   1 |  1500/ 1782 batches | accuracy    0.880
-----------------------------------------------------------
| end of epoch   1 | time: 13.27s | valid accuracy    0.880 
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.897
| epoch   2 |  1000/ 1782 batches | accuracy    0.902
| epoch   2 |  1500/ 1782 batches | accuracy    0.902
-----------------------------------------------------------
| end of epoch   2 | time: 13.92s | valid accuracy    0.896 
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.918
| epoch   3 |  1000/ 1782 batches | accuracy    0.912
| epoch   3 |  1500/ 1782 batches | accuracy    0.914
-----------------------------------------------------------
| end of epoch   3 | time: 13.78s | valid accuracy    0.899 
-------------------------------

### 6 - Evaluation

We check the accuracy of our model on the test dataset. Because SGD is stochastic and the DataLoaders use shuffle=True, the accuracy on both the validation and test datasets may vary between training runs. Repeated evaluation of the same trained model should give the same accuracy.

In [14]:
accu_test = evaluate(test_dataloader, criterion)
print('test accuracy {:8.3f}'.format(accu_test))

test accuracy    0.906
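
If repeatable runs are wanted, one option is to fix the random seeds. The sketch below covers only the shuffle order of a `DataLoader`, using a seeded `torch.Generator`; the helper `batch_order` and the toy data are hypothetical. Weight initialisation and the train/validation split would need their own seeding, which is not shown.

```python
import torch
from torch.utils.data import DataLoader

def batch_order(seed):
    """Return the batches produced by a shuffled DataLoader with a seeded generator."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(list(range(10)), batch_size=5, shuffle=True, generator=generator)
    return [batch.tolist() for batch in loader]

first_run = batch_order(seed=0)
second_run = batch_order(seed=0)

# The same seed gives the same shuffled order in both runs.
print(first_run == second_run)
```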


### 7 - Inference example

In [15]:
ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tech"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. - Four days ago, Jon Rahm was \
    enduring the season's worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday's first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he'd never played the \
    front nine at TPC Southwind."

model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news
