# Section 1: Text Classification

## Text Classification With The Torchtext Library


### Access to the raw dataset iterators

In [1]:
import torch
from torchtext.datasets import AG_NEWS

In [3]:
AG_NEWS

<function torchtext.datasets.ag_news.AG_NEWS>

Each item yielded by the AG_NEWS iterator is a tuple of (label, text).

In [6]:
train_iter = AG_NEWS(split='train')
print(f'example of AG_NEWs : {next(train_iter)}')
print(f'example of AG_NEWs : {next(train_iter)}')
print(f'example of AG_NEWs : {next(train_iter)}')

example of AG_NEWs : (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
example of AG_NEWs : (3, 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')
example of AG_NEWs : (3, "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.")
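To make the `(label, text)` structure concrete, here is a minimal sketch that unpacks one sample and maps its label to a readable class name. It assumes the usual AG News label order (1 World, 2 Sports, 3 Business, 4 Sci/Tech); `AG_NEWS_CLASSES` and `describe` are hypothetical names introduced for illustration and are not part of torchtext.

```python
# Assumed class names for AG News labels 1-4 (hypothetical constant, not a torchtext object).
AG_NEWS_CLASSES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

def describe(sample):
    """Unpack a (label, text) tuple and return a short readable summary."""
    label, text = sample
    return f"{AG_NEWS_CLASSES[label]}: {text[:40]}"

# A shortened copy of the first sample printed above.
sample = (3, "Wall St. Bears Claw Back Into the Black (Reuters)")
summary = describe(sample)
print(summary)
```

All three samples printed above carry label 3, which under this assumed mapping is the Business class.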


### Prepare data processing pipeline

In [12]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter =  AG_NEWS(split='train')

def yield_tokens(data_iter):
  for _, text in data_iter:
    yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

In [17]:
# convert a list of tokens into a list of integer indices
vocab(['here', 'is', 'an', 'example'])

[475, 21, 30, 5297]
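The lookup above can be pictured with a small pure-Python stand-in. This is only a sketch of the idea (count tokens, give frequent tokens small indices, send unseen tokens to the `<unk>` index), not torchtext's implementation. `simple_tokenizer`, `build_toy_vocab`, and `lookup` are hypothetical helpers, and the whitespace tokenizer is a simplification of `basic_english`.

```python
from collections import Counter

def simple_tokenizer(text):
    # Stand-in for get_tokenizer('basic_english'): lowercase, split on whitespace only.
    return text.lower().split()

def build_toy_vocab(texts, specials=("<unk>",)):
    # Count tokens, then order them by descending frequency (ties broken alphabetically).
    counts = Counter(tok for text in texts for tok in simple_tokenizer(text))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Special tokens take the first indices, followed by the ordered tokens.
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def lookup(vocab, tokens, default_token="<unk>"):
    # Mirrors the effect of set_default_index: unseen tokens map to the <unk> index.
    default_index = vocab[default_token]
    return [vocab.get(tok, default_index) for tok in tokens]

toy_texts = ["here is an example", "here is another example", "here we go"]
toy_vocab = build_toy_vocab(toy_texts)
ids = lookup(toy_vocab, simple_tokenizer("here is an unseen word"))
print(ids)
```

The words `unseen` and `word` never occur in `toy_texts`, so both fall back to the index of `<unk>` instead of raising an error.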

Let's build functions to process the text and the label.

The text pipeline converts a text string into a list of integers using the lookup table defined by the vocabulary. The label pipeline converts the label into a zero-based integer (AG_NEWS labels start at 1).

In [18]:
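# tokenize a string, then map each token to its vocabulary index
text_pipeline = lambda x: vocab(tokenizer(x))
# shift the label down by one so that classes start at 0
label_pipeline = lambda x: int(x) - 1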

In [19]:
text_pipeline('here is an example')

[475, 21, 30, 5297]

In [20]:
label_pipeline('10')

9
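As a quick check of the label pipeline on realistic inputs, the sketch below applies the same rule to the four AG_NEWS labels. It restates `label_pipeline` as a named function so the snippet runs on its own. Zero-based class indices are what losses such as `torch.nn.CrossEntropyLoss` expect as targets.

```python
# Same rule as the notebook's label_pipeline, written as a named function.
def label_pipeline(x):
    return int(x) - 1

raw_labels = [1, 2, 3, 4]  # AG_NEWS labels as they come from the dataset
class_indices = [label_pipeline(label) for label in raw_labels]
print(class_indices)
```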

### Generate data batch and iterator

In [33]:
from torch.utils.data import DataLoader

def collate_batch(batch):
  # offsets starts with 0 and then collects the token count of each text
  label_list, text_list, offsets = [], [], [0]

  for (_label, _text) in batch:
    label_list.append(label_pipeline(_label))
    processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
    text_list.append(processed_text)
    offsets.append(processed_text.size(0))
  # labels become a 1-D int64 tensor of shape (batch_size,)
  label_list = torch.tensor(label_list, dtype=torch.int64)
  # concatenate every text in the batch into one flat 1-D tensor
  text_list = torch.cat(text_list)
  return label_list, text_list, offsets

In [34]:
train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

In [35]:
print(f'length of train_iter is : {len(train_iter)} and length of dataloader is : {len(dataloader)}')

length of train_iter is : 120000 and length of dataloader is : 15000


Let's take a look at a batch of the dataloader.

In [45]:
examples = iter(dataloader)
# unpack one batch
label, text, offset = next(examples)
print(f'label shape : {label.shape} ,text shape : {text.shape}, length of offset : {len(offset)}')
print(f'offset -> {offset}')
print(f'the sum of the length of texts in the batch : {sum(offset)}')
print(f'length of the first text in the batch : {offset[1]}')

label shape : torch.Size([8]) ,text shape : torch.Size([390]), length of offset : 9
offset -> [0, 45, 47, 37, 47, 36, 51, 64, 63]
the sum of the length of texts in the batch : 390
length of the first text in the batch : 45


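`offset` is a list of length 9 (batch size + 1): a leading 0 followed by the number of tokens in each of the 8 texts in the batch. The lengths sum to 390, which matches the size of the concatenated text tensor.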
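The per-text lengths can be turned into start positions with a running sum, which is one way to recover where each text sits inside the flat tensor. The sketch below is an illustration in plain Python using the values printed above. `lengths`, `starts`, and `ends` are names chosen here. Start positions of this kind are also the form that `torch.nn.EmbeddingBag` takes for its `offsets` argument, should the model use that layer.

```python
from itertools import accumulate

# Leading 0 followed by per-text token counts, copied from the batch printed above.
lengths = [0, 45, 47, 37, 47, 36, 51, 64, 63]

# A running sum over all but the last entry gives the start index of each text.
starts = list(accumulate(lengths[:-1]))
# Each text ends where it starts plus its own length.
ends = [start + n for start, n in zip(starts, lengths[1:])]

print(starts)
print(ends[-1])
```

The last end position equals the total number of tokens in the batch, so slicing the flat tensor as `text[starts[i]:ends[i]]` would return the i-th text.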

### Define the model