## [TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#text-classification-with-the-torchtext-library)

In this tutorial, we show how to use the torchtext library to build a dataset for text classification. Users will have the flexibility to:
1. Access the raw data as an iterator
2. Build a data processing pipeline that converts the raw text strings into torch.Tensor objects that can be used to train the model
3. Shuffle and iterate over the data with torch.utils.data.DataLoader

In [1]:
import torch
from torchtext.datasets import AG_NEWS

### [Prepare data processing pipelines](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#prepare-data-processing-pipelines)

Here is an example of typical NLP data processing with a tokenizer and a vocabulary. The first step is to build a vocabulary from the raw training dataset. We use the built-in factory function build_vocab_from_iterator, which accepts an iterator that yields lists (or iterators) of tokens. Users can also pass any special symbols to be added to the vocabulary.
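To make the idea concrete, here is a minimal pure-Python sketch of what building a vocabulary from an iterator of token lists amounts to. It is a simplified stand-in, not torchtext's implementation: the helper names `build_toy_vocab` and `lookup`, the alphabetical tie-breaking, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (descending); ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default_token="<unk>"):
    """Convert tokens to indices, falling back to the default token's index."""
    default_index = stoi[default_token]
    return [stoi.get(tok, default_index) for tok in tokens]

# Hypothetical toy corpus, already tokenized.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
toy_stoi = build_toy_vocab(corpus)
toy_ids = lookup(toy_stoi, ["the", "cat", "zebra"])
print(toy_ids)
```

The out-of-vocabulary token "zebra" falls back to the index of `<unk>`, which is the role set_default_index plays for the real vocabulary.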

In [2]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [3]:
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

In [4]:
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

In [5]:
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])

In [6]:
vocab.set_default_index(vocab['<unk>'])

In [7]:
vocab(['man'])

[335]

In [8]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

In [9]:
text_pipeline('here is the an example')

[475, 21, 2, 30, 5297]

In [10]:
label_pipeline('10')

9

### [Generate data batch and iterator](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#generate-data-batch-and-iterator)

In this example, the text entries in each batch are packed into a list and concatenated into a single tensor that serves as the input to nn.EmbeddingBag. The offsets tensor holds the starting index of each individual sequence within that text tensor. The label tensor stores the labels of the individual text entries.
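The relationship between the concatenated text tensor and the offsets can be checked on a small example. This is a sketch with made-up token indices; only the length-then-cumulative-sum pattern is the point.

```python
import torch

# Three token-index sequences of different lengths (made-up values).
sequences = [
    torch.tensor([11, 12, 13]),
    torch.tensor([21, 22]),
    torch.tensor([31, 32, 33, 34]),
]

# Start with 0, append each length, drop the last one, then take a running sum.
lengths = [0] + [seq.size(0) for seq in sequences]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)
text = torch.cat(sequences)

# Each sequence can be recovered by slicing between consecutive offsets.
bounds = offsets.tolist() + [text.size(0)]
recovered = [text[start:end] for start, end in zip(bounds[:-1], bounds[1:])]

print(offsets.tolist())
```

Here the offsets come out as the start positions 0, 3 and 5, so no padding is needed to tell the sequences apart.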

In [11]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [12]:
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_txt = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_txt)
        offsets.append(processed_txt.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

In [13]:
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

### [Model](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#define-the-model)

The model is composed of an nn.EmbeddingBag layer followed by a linear layer for classification. With its default mode of “mean”, nn.EmbeddingBag computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, nn.EmbeddingBag requires no padding because the start of each entry is recorded in offsets. Additionally, since nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.

![](https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png)
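As a quick sanity check of the “mean of a bag” behaviour, the sketch below compares the output of nn.EmbeddingBag with a manual per-sequence average of the same embedding vectors. The sizes and token indices are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

text = torch.tensor([1, 2, 3, 4, 5])  # two sequences, concatenated
offsets = torch.tensor([0, 3])        # sequence 0 = text[0:3], sequence 1 = text[3:5]

with torch.no_grad():
    pooled = bag(text, offsets)       # one pooled vector per sequence, shape (2, 4)
    vectors = bag.weight[text]        # per-token embeddings, shape (5, 4)
    manual = torch.stack([vectors[0:3].mean(dim=0), vectors[3:5].mean(dim=0)])

print(torch.allclose(pooled, manual, atol=1e-6))
```

The two results should agree up to floating-point tolerance, which shows that the layer never needs the per-token embeddings to be padded to a common length.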

In [14]:
from torch import nn

In [15]:
class TxtClassModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class) -> None:
        super().__init__()

        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        init_range = 0.5
        self.embed.weight.data.uniform_(-init_range, init_range)
        self.fc.weight.data.uniform_(-init_range, init_range)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embed = self.embed(text, offsets)
        return self.fc(embed)

In [16]:
vocab_size = len(vocab)
num_classes = len(set([label for (label, text) in train_iter]))
embed_dim = 64

In [17]:
model = TxtClassModel(vocab_size, embed_dim, num_classes).to(device)

### [Define functions to train the model and evaluate results](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#define-functions-to-train-the-model-and-evaluate-results)

In [31]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, txt, offsets) in enumerate(dataloader):
        optim.zero_grad()
        pred_label = model(txt, offsets)
        loss = criterion(pred_label, label)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optim.step()
        total_acc += (pred_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0 and idx > 0:
            elapsed_time = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | accuracy {:8.3f}'.format(
                epoch,
                idx,
                len(dataloader),
                total_acc / total_count
            ))

            total_acc, total_count = 0, 0
            start_time = time.time()
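The training loop above clips gradients with torch.nn.utils.clip_grad_norm_ before each optimizer step. The following sketch, using a hypothetical single parameter with a hand-set gradient, illustrates what that call does to the gradient norm.

```python
import torch

# A single parameter whose gradient has L2 norm 5 (a 3-4-5 triangle).
param = torch.nn.Parameter(torch.zeros(2))
param.grad = torch.tensor([3.0, 4.0])

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_([param], max_norm=0.1)
norm_after = param.grad.norm()

print(float(norm_before), float(norm_after))
```

The gradient keeps its direction but is scaled down so that its total norm is approximately the max_norm value of 0.1.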

In [19]:
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for idx, (label, txt, offsets) in enumerate(dataloader):
            pred_label = model(txt, offsets)
            total_acc += (pred_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count


In [20]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

In [21]:
EPOCHS = 10
LR = 5
BATCH_SIZE = 64

In [22]:
criterion = torch.nn.CrossEntropyLoss()
optim = torch.optim.SGD(model.parameters(), lr=LR)

In [23]:
scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optim, step_size=1, gamma=0.1)
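With step_size=1 and gamma=0.1, every call to the scheduler's step() multiplies the learning rate by 0.1. The sketch below shows this on a throwaway optimizer; the names `weight`, `toy_optim` and `toy_scheduler` are hypothetical and separate from the objects used in this tutorial.

```python
import torch

weight = torch.nn.Parameter(torch.zeros(1))
toy_optim = torch.optim.SGD([weight], lr=5.0)
toy_scheduler = torch.optim.lr_scheduler.StepLR(toy_optim, step_size=1, gamma=0.1)

learning_rates = [toy_optim.param_groups[0]["lr"]]
for _ in range(2):
    toy_optim.step()       # step the optimizer first to avoid a scheduler-order warning
    toy_scheduler.step()   # each call multiplies the learning rate by gamma
    learning_rates.append(toy_optim.param_groups[0]["lr"])

print(learning_rates)
```

Starting from 5.0, the learning rate drops to roughly 0.5 and then 0.05, so a scheduler step is a fairly aggressive decay.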

In [29]:
total_accu = None

In [25]:
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset)-num_train])

In [26]:
train_dl = DataLoader(split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
test_dl = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

In [32]:
for epoch in range(1, EPOCHS+1):
    train(train_dl)
    accu_val = evaluate(valid_dl)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val

    print('-'*59)
    print('end of epoch {:3d} with validation accuracy {:8.3f}'.format(epoch, accu_val))
    print('-'*59)

| epoch   1 |   500/ 1782 batches | accuracy    0.989
| epoch   1 |  1000/ 1782 batches | accuracy    0.989
| epoch   1 |  1500/ 1782 batches | accuracy    0.989
-----------------------------------------------------------
end of epoch   1 with validation accuracy    0.902
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.989
| epoch   2 |  1000/ 1782 batches | accuracy    0.989
| epoch   2 |  1500/ 1782 batches | accuracy    0.989
-----------------------------------------------------------
end of epoch   2 with validation accuracy    0.902
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.989
| epoch   3 |  1000/ 1782 batches | accuracy    0.989
| epoch   3 |  1500/ 1782 batches | accuracy    0.989
-----------------------------------------------------------
end of epoch   3 with validation accuracy    0.902
-----------------------------------------------------------

### [Evaluate the model with test dataset](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#evaluate-the-model-with-test-dataset)

In [33]:
accu_test = evaluate(test_dl)

In [34]:
accu_test

0.900921052631579

In [35]:
ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was enduring the season’s worst weather conditions on Sunday at The Open on his way to a closing 75 at Royal Portrush, which considering the wind and the rain was a respectable showing. Thursday’s first round at the WGC-FedEx St. Jude Invitational was another story. With temperatures in the mid-80s and hardly any wind, the Spaniard was 13 strokes better in a flawless round. Thanks to his best putting performance on the PGA Tour, Rahm finished with an 8-under 62 for a three-stroke lead, which was even more impressive considering he’d never played the front nine at TPC Southwind."

In [38]:
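# Move the inputs to the same device as the model to avoid a device mismatch.
txt_tensor = torch.tensor(text_pipeline(ex_text_str), dtype=torch.int64).to(device)

In [44]:
with torch.no_grad():
    # A single offset of 0 marks the whole tensor as one text entry.
    output = model(txt_tensor, torch.tensor([0]).to(device))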

In [50]:
output.argmax(1).item()

1
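The predicted index can be turned into a readable class name. The mapping below is a sketch that assumes the AG_NEWS class order (1 World, 2 Sports, 3 Business, 4 Sci/Tech), shifted to zero-based indices to match label_pipeline; the dictionary name `ag_news_label` is chosen here for illustration.

```python
# AG_NEWS class names keyed by the zero-based index produced by label_pipeline.
ag_news_label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

predicted_index = 1  # the value returned by output.argmax(1).item() above
predicted_name = ag_news_label[predicted_index]
print(f"This is a {predicted_name} news item")
```

Under this mapping, index 1 corresponds to Sports, which matches the golf report used as the example text.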