## [TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#text-classification-with-the-torchtext-library)

In this tutorial, we show how to use the torchtext library to build the dataset for text classification. Users will have the flexibility to:

1. Access the raw data as an iterator
2. Build a data processing pipeline that converts raw text strings into torch.Tensor objects that can be used to train the model
3. Shuffle and iterate over the data with torch.utils.data.DataLoader

In [1]:
import torch
from torchtext.datasets import AG_NEWS

### [Prepare data processing pipelines](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#prepare-data-processing-pipelines)

Here is an example of typical NLP data processing with a tokenizer and a vocabulary. The first step is to build a vocabulary from the raw training dataset. Here we use the built-in factory function build_vocab_from_iterator, which accepts an iterator that yields lists or iterators of tokens. Users can also pass any special symbols to be added to the vocabulary.

In [2]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [3]:
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

In [4]:
def yield_tokens(data_iter):
    # Each item is a (label, text) pair; only the tokenized text is needed here.
    for _, text in data_iter:
        yield tokenizer(text)

In [5]:
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])

In [6]:
vocab.set_default_index(vocab['<unk>'])

In [9]:
vocab(['man'])

[335]
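To make the vocabulary's behaviour concrete, here is a minimal pure-Python sketch of the idea behind `build_vocab_from_iterator` and `set_default_index`. It is an illustration under simplifying assumptions, not torchtext's implementation: the `ToyVocab` class, the `build_toy_vocab` helper, the three-line corpus, and the whitespace tokenizer are all hypothetical stand-ins, and the tie-breaking order chosen here may differ from torchtext's.

```python
from collections import Counter


class ToyVocab:
    """Hypothetical stand-in for a torchtext vocabulary (illustration only)."""

    def __init__(self, itos):
        self.itos = list(itos)  # index -> token
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}  # token -> index
        self.default_index = None

    def set_default_index(self, index):
        # Index returned for any token that is not in the vocabulary.
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            raise KeyError(token)
        return self.default_index

    def __call__(self, tokens):
        # Map a list of tokens to a list of indices.
        return [self[tok] for tok in tokens]


def build_toy_vocab(token_lists, specials=()):
    # Count every token across all tokenized lines.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    for special in specials:
        counts.pop(special, None)
    # Most frequent tokens first; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Special symbols take the lowest indices.
    return ToyVocab(list(specials) + ordered)


corpus = ["the cat sat", "the dog sat", "the bird flew"]
toy_vocab = build_toy_vocab((line.split() for line in corpus), specials=["<unk>"])
toy_vocab.set_default_index(toy_vocab["<unk>"])

# "zebra" never appears in the corpus, so it falls back to the <unk> index.
print(toy_vocab(["the", "sat", "zebra"]))  # [1, 2, 0]
```

In this sketch, the special symbol gets index 0 and the remaining tokens are ordered by frequency. Setting the default index to that of `<unk>` is what lets the lookup return a value instead of raising an error on unseen tokens.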

In [10]:
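# Convert a raw text string into a list of vocabulary indices.
text_pipeline = lambda x: vocab(tokenizer(x))
# Convert a 1-based label into a 0-based class index.
label_pipeline = lambda x: int(x) - 1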

In [11]:
text_pipeline('here is the an example')

[475, 21, 2, 30, 5297]

In [12]:
label_pipeline('10')

9
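The subtraction in `label_pipeline` is easier to see on the dataset's real labels. AG_NEWS labels run from 1 to 4 (World, Sports, Business, Sci/Tech), while PyTorch classification losses such as `CrossEntropyLoss` expect class indices that start at 0. The short sketch below shows the resulting mapping; the `AG_NEWS_CLASSES` dictionary is a name introduced here for illustration only.

```python
# Hypothetical lookup table introduced for illustration; AG_NEWS labels are 1-based.
AG_NEWS_CLASSES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Same definition as in the cell above: shift labels so they start at 0.
label_pipeline = lambda x: int(x) - 1

for raw_label, name in AG_NEWS_CLASSES.items():
    print(raw_label, "->", label_pipeline(raw_label), name)

# The four raw labels map onto the contiguous class indices 0..3.
zero_based = [label_pipeline(raw) for raw in AG_NEWS_CLASSES]
print(zero_based)  # [0, 1, 2, 3]
```

Because `label_pipeline` calls `int`, it accepts the label either as an integer or as a numeric string.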

### [Generate data batch and iterator](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#generate-data-batch-and-iterator)

In [13]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")