## [TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#text-classification-with-the-torchtext-library)

In this tutorial, we show how to use the torchtext library to build the dataset for text classification analysis. Users will have the flexibility to:
1. Access the raw data as an iterator
2. Build a data processing pipeline that converts the raw text strings into torch.Tensor objects that can be used to train the model
3. Shuffle and iterate over the data with torch.utils.data.DataLoader

In [1]:
import torch
from torchtext.datasets import AG_NEWS

### [Prepare data processing pipelines](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#prepare-data-processing-pipelines)

Here is an example of typical NLP data processing with a tokenizer and a vocabulary. The first step is to build a vocabulary from the raw training dataset. Here we use the built-in factory function build_vocab_from_iterator, which accepts an iterator that yields lists or iterators of tokens. Users can also pass any special symbols to be added to the vocabulary.
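To make the vocabulary step concrete, here is a minimal pure-Python sketch of the idea: count tokens, place the special symbols first, and fall back to a default index for unseen tokens. The names build_toy_vocab and lookup are hypothetical, and the frequency-based ordering is an assumption made for illustration; this is not torchtext's implementation.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (descending), breaking ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default_token="<unk>"):
    """Convert tokens to ids, using the id of default_token for unseen tokens."""
    default_index = stoi[default_token]
    return [stoi.get(tok, default_index) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi = build_toy_vocab(corpus)
ids = lookup(stoi, ["the", "cat", "zebra"])
# ids -> [1, 3, 0]; "zebra" is not in the corpus, so it maps to the "<unk>" index 0.
```

The default-index fallback plays the same role as the set_default_index call in the cells that follow: an out-of-vocabulary token resolves to the id of the unknown symbol instead of raising an error.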

In [2]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [3]:
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

In [4]:
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

In [5]:
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])

In [6]:
vocab.set_default_index(vocab['<unk>'])

In [7]:
vocab(['man'])

[335]

In [8]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

In [9]:
text_pipeline('here is the an example')

[475, 21, 2, 30, 5297]

In [10]:
label_pipeline('10')

9

### [Generate data batch and iterator](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#generate-data-batch-and-iterator)

In this example, the text entries in the original data batch are packed into a list and concatenated into a single tensor that serves as the input to nn.EmbeddingBag. The offsets tensor holds the starting index of each individual sequence within the text tensor. The label tensor stores the labels of the individual text entries.
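The relationship between the concatenated text and its offsets can be sketched without any tensors. The helper flatten_with_offsets and the token ids below are hypothetical and serve only to illustrate the layout described above.

```python
from itertools import accumulate

def flatten_with_offsets(sequences):
    """Concatenate variable-length id sequences and record where each one starts."""
    flat = [token_id for seq in sequences for token_id in seq]
    lengths = [len(seq) for seq in sequences]
    # Each start index is the running total of the lengths of all preceding sequences.
    offsets = list(accumulate([0] + lengths[:-1]))
    return flat, offsets

batch = [[475, 21, 2], [30, 5297], [7, 8, 9, 10]]
flat, offsets = flatten_with_offsets(batch)
# flat    -> [475, 21, 2, 30, 5297, 7, 8, 9, 10]
# offsets -> [0, 3, 5]
```

Slicing flat between two consecutive offsets recovers an individual sequence, which is why no padding is needed to keep the entries apart.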

In [11]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [12]:
def collate_batch(batch):
    # offsets starts with 0 because the first sequence begins at index 0.
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_txt = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_txt)
        # Record each sequence length; the cumulative sum below turns lengths into start indices.
        offsets.append(processed_txt.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length: it would mark the start of a sequence that does not exist.
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

In [13]:
downloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

### [Model](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#define-the-model)

The model is composed of an nn.EmbeddingBag layer plus a linear layer for classification. nn.EmbeddingBag with the default mode of “mean” computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, the nn.EmbeddingBag module requires no padding, since the starting position of each text is stored in offsets. Additionally, because nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.

![](https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png)
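As a rough reference for what the “mean” mode computes, the following pure-Python sketch averages the embedding rows of each bag. The function embedding_bag_mean and the toy table are hypothetical, and the sketch assumes every bag is non-empty; it illustrates the arithmetic only and says nothing about how PyTorch implements it.

```python
def embedding_bag_mean(table, flat_ids, offsets):
    """Average the embedding rows of each bag; bag i spans flat_ids[offsets[i]:offsets[i+1]]."""
    ends = offsets[1:] + [len(flat_ids)]  # the last bag runs to the end of flat_ids
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [table[token_id] for token_id in flat_ids[start:end]]
        # Column-wise mean over the rows in this bag.
        pooled.append([sum(col) / len(rows) for col in zip(*rows)])
    return pooled

# Toy embedding table: 4 tokens, embedding dimension 2.
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_ids = [1, 2, 3, 1]   # two texts concatenated: [1, 2, 3] and [1]
offsets = [0, 3]
pooled = embedding_bag_mean(table, flat_ids, offsets)
# pooled -> [[3.0, 4.0], [1.0, 2.0]]
```

Each text yields one fixed-size vector regardless of its length, which is the shape the linear layer expects.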

In [14]:
from torch import nn

In [23]:
class TxtClassModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class) -> None:
        super().__init__()

        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        init_range = 0.5
        self.embed.weight.data.uniform_(-init_range, init_range)
        self.fc.weight.data.uniform_(-init_range, init_range)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embed = self.embed(text, offsets)
        return self.fc(embed)

In [28]:
vocab_size = len(vocab)
num_classes = len(set([label for (label, text) in train_iter]))
embed_dim = 64