# Text Classification with the TorchText Library

A basic tutorial that uses the torchtext library to build a dataset for text classification.

Source: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

## Access to raw dataset iterators

The torchtext library provides a few raw dataset iterators, which yield raw text strings.

In [1]:
import torch
from torchtext.datasets import AG_NEWS



In [2]:
train_iter = iter(AG_NEWS(split="train"))



In [5]:
next(train_iter)

(3,
 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.')

In [8]:
next(train_iter)

(3,
 'Fed minutes show dissent over inflation (USATODAY.com) USATODAY.com - Retail sales bounced back a bit in July, and new claims for jobless benefits fell last week, the government said Thursday, indicating the economy is improving from a midsummer slump.')

## Prepare Data Processing Pipelines

We have revisited the basic components of the torchtext library, including the vocab, word vectors, and tokenizer. These are the basic data processing building blocks for raw text strings.

Here is an example of typical NLP data processing with a tokenizer and a vocabulary.

The first step is to build a vocabulary from the raw training dataset. Here we use the built-in factory function build_vocab_from_iterator, which accepts an iterator that yields lists or iterators of tokens. Users can also pass any special symbols to be added to the vocabulary.

In [9]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [10]:
tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split="train")

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

The vocabulary block converts a list of tokens into integers.

In [11]:
vocab(["here", "is", "an", "example"])

[475, 21, 30, 5297]
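
To make the lookup less opaque, here is a minimal pure-Python sketch of the idea behind a vocabulary: count tokens, put the special symbols first, and fall back to `<unk>` for unseen words. The helpers `build_toy_vocab` and `lookup` are hypothetical names used only for illustration; this is not how torchtext implements build_vocab_from_iterator, and the toy corpus is made up.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Hypothetical helper: map each token to an integer index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Specials come first, then tokens by descending frequency (ties alphabetical).
    ordered = list(specials) + [
        tok
        for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if tok not in specials
    ]
    return {tok: idx for idx, tok in enumerate(ordered)}


def lookup(vocab, tokens, default="<unk>"):
    """Return indices, falling back to the default token for unknown words."""
    return [vocab.get(tok, vocab[default]) for tok in tokens]


corpus = [["here", "is", "an", "example"], ["here", "is", "another"]]
toy_vocab = build_toy_vocab(corpus)
ids = lookup(toy_vocab, ["here", "is", "missing"])
print(ids)  # [1, 2, 0]
```

The unknown word "missing" maps to index 0, which plays the same role as vocab.set_default_index(vocab["<unk>"]) above.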

Prepare the text processing pipeline with the tokenizer and vocabulary. 

The text and label pipelines will be used to process the raw data strings from the dataset iterators.

In [12]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

The label pipeline subtracts one because the AG_NEWS labels run from 1 to 4, while PyTorch classification losses such as nn.CrossEntropyLoss expect class indices that start at 0. No additional label is inserted into the list.

The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 

The label pipeline converts the label into a zero-based integer.

In [14]:
text_pipeline('here is an example')

[475, 21, 30, 5297]

In [24]:
torch.tensor(text_pipeline('here is an example'), dtype=torch.int64)

tensor([ 475,   21,   30, 5297])

Check the length of the tensor, which is the value that collate_batch later appends to its offsets list.

In [25]:
torch.tensor(text_pipeline('here is an example'), dtype=torch.int64).size(0)

4

This is simply the length of the list returned by text_pipeline.

In [15]:
label_pipeline('10')

9
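
As a quick illustration of the offset, the sketch below applies the same arithmetic as label_pipeline to the four AG_NEWS classes. The class names follow the dataset's standard labeling; `to_zero_based` is a hypothetical name introduced here, not part of torchtext.

```python
# AG_NEWS labels as shipped with the dataset (1-based).
AG_NEWS_CLASSES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}


def to_zero_based(label):
    """Same arithmetic as label_pipeline: shift a 1-based label to 0-based."""
    return int(label) - 1


zero_based = {to_zero_based(k): name for k, name in AG_NEWS_CLASSES.items()}
print(zero_based)  # {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tec'}
```

The two samples printed earlier carry label 3, so they become class index 2 ("Business") after the pipeline.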

## Generate data batch and iterator

DataLoader is recommended for PyTorch users. It works with a map-style dataset that implements the __getitem__() and __len__() protocols and represents a map from indices/keys to data samples. It also works with an iterable dataset when the shuffle argument is False.
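
For comparison with the iterable AG_NEWS dataset, here is a small sketch of a map-style dataset wrapped in a DataLoader. `ToyNewsDataset`, `keep_as_list`, and the three samples are made up for illustration; only Dataset and DataLoader come from PyTorch.

```python
from torch.utils.data import DataLoader, Dataset


class ToyNewsDataset(Dataset):
    """Hypothetical map-style dataset over an in-memory list of (label, text)."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


def keep_as_list(batch):
    """Top-level collate function that returns the batch unchanged."""
    return batch


samples = [(1, "world news"), (2, "sports news"), (3, "business news")]
toy_dataset = ToyNewsDataset(samples)

toy_loader = DataLoader(
    toy_dataset, batch_size=2, shuffle=False, collate_fn=keep_as_list
)
batches = list(toy_loader)
print(batches)
```

With three samples and a batch size of 2, the loader yields one full batch and one batch holding the remaining sample.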

Before being sent to the model, each batch of samples generated by the DataLoader is processed by the collate_fn function. The input to collate_fn is a batch of data with the batch size set in the DataLoader, and collate_fn processes it according to the data processing pipelines declared previously.

Note: make sure collate_fn is declared as a top-level def. This ensures that the function is available in each worker.

The text entries in the original data batch are packed into a list and concatenated into a single tensor as the input to nn.EmbeddingBag. The offsets tensor holds the delimiters that mark the beginning index of each individual sequence in the text tensor.

The label tensor stores the labels of the individual text entries.
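
The relationship between the flat text tensor and the offsets is easier to see on a toy batch. The sketch below uses made-up token ids for three texts of lengths 4, 2, and 3; it is an illustration under those assumptions, not output from the real dataset.

```python
import torch

# Three already-numericalized texts of different lengths (made-up token ids).
encoded = [[475, 21, 30, 5297], [12, 7], [3, 99, 4]]

tensors = [torch.tensor(ids, dtype=torch.int64) for ids in encoded]
lengths = [0] + [t.size(0) for t in tensors]

# Drop the last length, then take the running sum to get each start index.
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)
flat_text = torch.cat(tensors)

print(flat_text.tolist())  # [475, 21, 30, 5297, 12, 7, 3, 99, 4]
print(offsets.tolist())    # [0, 4, 6]
```

Each offset points at the first token of a text inside the flat tensor, so the second text starts at index 4 and the third at index 6.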

In [16]:
from torch.utils.data import DataLoader

In [17]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

print(f"Using {device} device")

Using mps device


In [18]:
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Record the length of each text; the running sum gives its start index.
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Build the tensor first, then take the cumulative sum (a list has no cumsum).
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all token ids into one flat tensor for nn.EmbeddingBag.
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

In [19]:
train_iter = AG_NEWS(split="train")
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

In [22]:
print(dataloader)

<torch.utils.data.dataloader.DataLoader object at 0x126a3c2b0>


## Define the Model

The model is composed of an nn.EmbeddingBag layer plus a linear layer for classification. nn.EmbeddingBag with the default mode of "mean" computes the mean value of a "bag" of embeddings. Although the text entries here have different lengths, the nn.EmbeddingBag module requires no padding since the text lengths are saved in offsets.

Additionally, since nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve the performance and memory efficiency of processing a sequence of tensors.
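
A minimal sketch of that behavior, assuming toy sizes (a vocabulary of 10, embeddings of width 4, and 4 output classes to match AG_NEWS): it feeds a flat tensor and offsets to nn.EmbeddingBag, checks the result against a manual per-bag average, and passes the pooled vectors through a linear layer. The variable names are illustrative and this is not the full model from the source tutorial.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
vocab_size, embed_dim, num_class = 10, 4, 4
bag = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
classifier = nn.Linear(embed_dim, num_class)

flat_text = torch.tensor([1, 2, 3, 4, 5, 6], dtype=torch.int64)
offsets = torch.tensor([0, 4], dtype=torch.int64)  # bags: [1, 2, 3, 4] and [5, 6]

pooled = bag(flat_text, offsets)

# Manual check: average the embedding rows of each bag.
manual = torch.stack(
    [
        bag.weight[flat_text[0:4]].mean(dim=0),
        bag.weight[flat_text[4:6]].mean(dim=0),
    ]
)

logits = classifier(pooled)
print(pooled.shape, logits.shape)
print(torch.allclose(pooled, manual, atol=1e-6))
```

The two bags have different lengths, yet no padding is needed: the offsets alone tell the layer where each text begins.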