# Text Classification with the TorchText Library

A basic tutorial that uses the torchtext library to build a dataset for text classification.

Source: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

## Access to raw dataset iterators

The torchtext library provides a few raw dataset iterators, which yield the raw text strings. For AG_NEWS, each item is a (label, text) tuple.

In [1]:
import torch
from torchtext.datasets import AG_NEWS



In [2]:
train_iter = iter(AG_NEWS(split="train"))



In [3]:
next(train_iter)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [4]:
next(train_iter)

(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')

## Prepare Data Processing Pipelines

We have revisited the basic components of the torchtext library, including the vocab, word vectors, and tokenizer. These are the basic data processing building blocks for raw text strings.

Here is an example of typical NLP data processing with a tokenizer and a vocabulary.

The first step is to build a vocabulary from the raw training dataset. Here we use the built-in factory function `build_vocab_from_iterator`, which accepts an iterator that yields lists or iterators of tokens. Users can also pass any special symbols to be added to the vocabulary.

In [5]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator



In [6]:
tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split="train")

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

The vocabulary block converts a list of tokens into integers.

In [7]:
vocab(["here", "is", "an", "example"])

[475, 21, 30, 5297]
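As a rough mental model of what the vocabulary does, the sketch below builds a token-to-index table in plain Python. It is a simplified stand-in, not torchtext's implementation: the names `build_toy_vocab` and `lookup` are hypothetical, and it assumes tokens are ranked by descending frequency (ties broken alphabetically) with the special symbols placed first.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer index, special symbols first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Assumption: rank by descending frequency, ties broken alphabetically.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ranked)}


def lookup(vocab, tokens, default="<unk>"):
    """Convert tokens to indices; unseen tokens fall back to the default."""
    return [vocab.get(tok, vocab[default]) for tok in tokens]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
toy_vocab = build_toy_vocab(corpus)
ids = lookup(toy_vocab, ["the", "cat", "flew"])
print(toy_vocab)
print(ids)
```

The fallback in `lookup` plays the role that `set_default_index` plays for the real vocabulary: a token never seen during training maps to the index of `<unk>` instead of raising an error.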

Prepare the text processing pipeline with the tokenizer and vocabulary.

The text and label pipelines will be used to process the raw data strings from the dataset iterators.

In [8]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

The label pipeline subtracts one because the AG_NEWS labels run from 1 to 4, while `CrossEntropyLoss` expects class indices that start at 0 (here, 0 to 3). No extra label is added to the list; the shift only renumbers the existing classes.

The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 

The label pipeline converts the label into a zero-based integer.

In [9]:
text_pipeline('here is an example')

[475, 21, 30, 5297]

In [10]:
torch.tensor(text_pipeline('here is an example'), dtype=torch.int64)

tensor([ 475,   21,   30, 5297])

Check the length of the tensor, which `collate_batch` later uses to compute the offsets.

In [11]:
torch.tensor(text_pipeline('here is an example'), dtype=torch.int64).size(0)

4

This is simply the number of tokens returned by `text_pipeline`.

In [12]:
label_pipeline('10')

9
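The shift by one in `label_pipeline` can be checked against the loss function used later in this notebook. The following is a minimal sketch, assuming a CPU run and four classes; the all-zero logits are made up for illustration.

```python
import math

import torch
from torch import nn

num_class = 4
criterion = nn.CrossEntropyLoss()
logits = torch.zeros(1, num_class)  # one sample, equal score for every class

# A raw label of 4 becomes class index 3 after the shift, which is valid.
shifted = torch.tensor([4 - 1])
loss = criterion(logits, shifted)
print(loss.item(), math.log(num_class))  # uniform scores give a loss of ln(4)

# Using the raw label 4 directly is out of range for 4 classes.
try:
    criterion(logits, torch.tensor([4]))
    raw_label_ok = True
except (IndexError, RuntimeError) as err:
    raw_label_ok = False
    print("rejected:", err)
```

The exception type may differ between PyTorch versions and devices, which is why the sketch catches both `IndexError` and `RuntimeError`.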

## Generate data batch and iterator

`torch.utils.data.DataLoader` is recommended for PyTorch users. It works with a map-style dataset that implements the `__getitem__()` and `__len__()` protocols and represents a map from indices/keys to data samples. It also works with an iterable dataset when the shuffle argument is `False`.
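To illustrate the map-style protocol, here is a small sketch of a dataset that implements only `__len__()` and `__getitem__()`. The class name `ToyNewsDataset` and its four samples are hypothetical and stand in for the real AG_NEWS data.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyNewsDataset(Dataset):
    """Hypothetical map-style dataset holding (label, text) pairs in memory."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


samples = [
    (1, "world news"),
    (2, "sports news"),
    (3, "business news"),
    (4, "tech news"),
]
dataset = ToyNewsDataset(samples)

torch.manual_seed(0)
# Shuffling is possible because the loader can index the dataset at random.
loader = DataLoader(dataset, batch_size=2, shuffle=True)
batches = list(loader)
seen_labels = sorted(int(label) for labels, _ in batches for label in labels)
print(seen_labels)
```

Because the dataset supports random access by index, the loader can shuffle it, which an iterable dataset does not allow.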

Before being sent to the model, each batch of samples generated by the `DataLoader` is processed by the `collate_fn` function. The input to `collate_fn` is a batch of data with the batch size set in the `DataLoader`, and `collate_fn` processes it according to the data processing pipelines declared previously.

Note: make sure `collate_fn` is declared as a top-level def. This ensures that the function is available in each worker.

The text entries in the original data batch are packed into a list and concatenated into a single tensor as the input to `nn.EmbeddingBag`. The offsets tensor holds the delimiters that mark the beginning index of each individual sequence in the text tensor.

The label tensor stores the labels of the individual text entries.
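The offsets arithmetic can be worked through on a toy batch. The sketch below uses plain Python lists instead of tensors so the numbers are easy to follow; the token IDs are made up.

```python
from itertools import accumulate

# Three already-numericalized texts of different lengths (toy token IDs).
sequences = [[475, 21, 30], [12, 7], [5, 88, 3, 41]]

# Concatenate everything into one flat list, as torch.cat does for tensors.
flat = [token for seq in sequences for token in seq]

# Each offset is the running total of the lengths of all earlier sequences.
lengths = [len(seq) for seq in sequences]
offsets = [0] + list(accumulate(lengths))[:-1]

# Slicing between consecutive offsets recovers the original sequences.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]

print(flat)
print(offsets)
```

With lengths 3, 2, and 4, the sequences start at positions 0, 3, and 5 of the flat list, which is the same result as taking the cumulative sum over `[0, 3, 2]`.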

In [13]:
from torch.utils.data import DataLoader

In [14]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)

print(f"Using {device} device")

Using cpu device


In [15]:
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)


In [16]:
train_iter = AG_NEWS(split="train")
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

In [17]:
print(dataloader)

<torch.utils.data.dataloader.DataLoader object at 0x130fa0370>


## Define the Model

The model is composed of an `nn.EmbeddingBag` layer plus a linear layer for classification. `nn.EmbeddingBag` with the default mode of "mean" computes the mean value of a "bag" of embeddings. Although the text entries here have different lengths, the `nn.EmbeddingBag` module requires no padding because the text lengths are saved in the offsets.

Additionally, since `nn.EmbeddingBag` accumulates the average across the embeddings on the fly, it can improve the performance and memory efficiency of processing a sequence of tensors.
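A quick way to see what the "mean" mode computes is to compare it with a manual lookup and average. This is a minimal sketch with a made-up vocabulary of 10 tokens and an embedding dimension of 3, not the model trained in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="mean")

# Two "texts" packed into one flat tensor: [1, 2, 4] and [5, 9].
text = torch.tensor([1, 2, 4, 5, 9], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)

pooled = bag(text, offsets)  # shape (2, 3): one row per text

# Manual equivalent: look up the weight rows and average them per bag.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

print(pooled.shape)
print(torch.allclose(pooled, manual))
```

Both texts end up as a fixed-size vector regardless of their length, which is why the linear layer that follows can use a single input size.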

In [18]:
from torch import nn

In [19]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
    
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

### Initiate an instance

The AG_NEWS dataset has four labels, and therefore the number of classes is four.

We build a model with an embedding dimension of 64.
The vocab size is equal to the length of the vocabulary instance.
The number of classes is equal to the number of labels.

In [20]:
train_iter = AG_NEWS(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64 # need to review why 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

In [21]:
print(f"vocab size: {vocab_size} \n")

vocab size: 95811 
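To get a feel for how the embedding dimension affects model size, the parameter count can be worked out by hand. This sketch only does the arithmetic for the two layers defined in `TextClassificationModel`, using the vocab size printed above.

```python
vocab_size = 95811  # from the output above
embed_dim = 64
num_class = 4

embedding_params = vocab_size * embed_dim          # EmbeddingBag weight matrix
linear_params = embed_dim * num_class + num_class  # Linear weights plus biases
total_params = embedding_params + linear_params

print(embedding_params, linear_params, total_params)
```

Nearly all of the parameters sit in the embedding table, so doubling the embedding dimension roughly doubles the size of the model.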



### Define functions to train the model and evaluate results

In [22]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

In [23]:
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0 

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

## Split the dataset and run the model

Since the original AG_NEWS has no validation dataset, we split the training dataset into train/valid sets with a split ratio of 0.95 (train) and 0.05 (valid).

The `CrossEntropyLoss` criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in a single class. It is useful when training a classification problem with C classes. `SGD` implements the stochastic gradient descent method as the optimizer.
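The equivalence between the combined criterion and the two separate steps can be verified numerically. This is a small sketch on random logits for 5 samples and 4 classes; the values are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 4)               # 5 samples, 4 classes
targets = torch.tensor([0, 3, 1, 2, 3])  # zero-based class indices

# Single-step criterion applied directly to the raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# Two-step version: log-softmax over the class dimension, then NLL loss.
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), two_step.item())
```

This is also why the model's `forward` returns raw scores from the linear layer without applying a softmax itself.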

The initial learning rate is set to 5. `StepLR` is used to adjust the learning rate through the epochs.
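The effect of the scheduler can be seen in isolation with a dummy parameter. The sketch below assumes a step size of 1 and a gamma of 0.1, and calls `scheduler.step()` on every iteration.

```python
import torch

# A single dummy parameter is enough to construct an optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

learning_rates = []
for _ in range(3):
    learning_rates.append(scheduler.get_last_lr()[0])
    optimizer.step()   # step the optimizer first to avoid a scheduler warning
    scheduler.step()   # multiply the learning rate by gamma

print(learning_rates)
```

Each call to `scheduler.step()` divides the learning rate by ten. The training loop in this notebook calls it only in epochs where the validation accuracy drops, so the decay there is less regular than in this sketch.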

In [24]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

In [25]:
# hyperparameters

EPOCHS = 10
LR = 5
BATCH_SIZE = 64

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
total_acc = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

In [26]:
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_acc is not None and total_acc > accu_val:
        scheduler.step()
    else:
        total_acc = accu_val
    
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)


| epoch   1 |   500/ 1782 batches | accuracy    0.683
| epoch   1 |  1000/ 1782 batches | accuracy    0.852
| epoch   1 |  1500/ 1782 batches | accuracy    0.875
-----------------------------------------------------------
| end of epoch   1 | time: 12.62s | valid accuracy    0.889 
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.899
| epoch   2 |  1000/ 1782 batches | accuracy    0.897
| epoch   2 |  1500/ 1782 batches | accuracy    0.901
-----------------------------------------------------------
| end of epoch   2 | time: 12.58s | valid accuracy    0.904 
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.914
| epoch   3 |  1000/ 1782 batches | accuracy    0.916
| epoch   3 |  1500/ 1782 batches | accuracy    0.915
-----------------------------------------------------------
| end of epoch   3 | time: 12.59s | valid accuracy    0.902 
-------------------------------

## Evaluate the model with the test dataset

Check the results on the test dataset.

In [27]:
print("checking the results of test dataset")
accu_test = evaluate(test_dataloader)
print("test accuracy {:9.3f}".format(accu_test))

checking the results of test dataset
test accuracy     0.905


### Test on random news

Use the model to classify an arbitrary news excerpt.

In [28]:
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

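def predict(text, text_pipeline):
    with torch.no_grad():
        # Build the inputs on the same device as the model.
        text = torch.tensor(text_pipeline(text), dtype=torch.int64).to(device)
        offsets = torch.tensor([0]).to(device)
        output = model(text, offsets)
        # Shift back from the 0-based class index to the 1-based AG_NEWS label.
        return output.argmax(1).item() + 1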

In [29]:
ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to(device)

print("This is a %s news" % ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news
