# Introduction to `torchtext`

**Attributions: This tutorial is taken from the MDS-CL materials, with minor adaptations.** 

[`torchtext`](https://pytorch.org/text/stable/index.html) is a Python library that helps you process text data for NLP tasks: reading text files from disk, tokenizing the text, converting the tokens to lists of integers, and padding the sequences in a batch.

You can install `torchtext` using `pip`:

`pip3 install torchtext`

## Overview

`torchtext` takes in raw data in the form of text files, such as CSV, TSV, or JSON files, and converts them to `torchtext.data.Dataset` objects.

`torchtext` then passes the `Dataset` to an `Iterator`. Iterators handle numericalizing, batching, packaging, and moving the data to the given device (CPU or GPU).
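
To make the steps in this overview concrete, here is a minimal plain-Python sketch of tokenizing, numericalizing, and padding a tiny batch. The helpers `tokenize`, `numericalize`, and `pad` and the toy vocabulary `stoi` are assumptions made for illustration; they are not `torchtext` functions, and the library's own implementation differs.

```python
# Illustrative helpers only; none of these are part of torchtext.

def tokenize(text):
    # Lowercase and split on whitespace (torchtext lets you plug in any tokenizer).
    return text.lower().split()

def numericalize(tokens, stoi, unk_index=0):
    # Map each token to its integer id; unseen tokens fall back to the unknown id.
    return [stoi.get(tok, unk_index) for tok in tokens]

def pad(batch, pad_index=1):
    # Right-pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in batch]

# Toy vocabulary (assumed): string -> integer id.
stoi = {"<unk>": 0, "<pad>": 1, "call": 2, "me": 3, "now": 4}
texts = ["Call me now", "Call later"]
batch = pad([numericalize(tokenize(t), stoi) for t in texts])
print(batch)  # [[2, 3, 4], [2, 0, 1]]
```

In the second row, `later` is not in the toy vocabulary and becomes `0`, and the trailing `1` is padding.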

<img src="img/torchtext.png" height="800" width="800"> 

<br><br>

## Dataset

For demonstration purposes, let's consider a sample of 100 messages from the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset). 

In [8]:
import pandas as pd

sms_df_all = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df_all = sms_df_all.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df_all = sms_df_all.rename(columns={"v1": "label", "v2": "sms"})
sms_df = sms_df_all.sample(100)
sms_df.head()

| | label | sms |
|---|---|---|
| 1837 | ham | And how's your husband. |
| 5287 | ham | Hey ! Don't forget ... You are MINE ... For ME... |
| 4153 | ham | Haf u eaten? Wat time u wan me 2 come? |
| 2000 | ham | But i'll b going 2 sch on mon. My sis need 2 t... |
| 2184 | ham | I know a few people I can hit up and fuck to t... |


`torchtext` expects the train, validation, and test data in three separate files. So let's use `sklearn` to split the dataset into train, validation, and test sets and write them to three CSV files. 

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train, validation_test = train_test_split(sms_df, test_size=0.2, random_state=123)
validation, test = train_test_split(validation_test, test_size=0.5)

In [11]:
print("Train set shape:", train.shape)
print("Validation set shape:", validation.shape)
print("Test set shape:", test.shape)

Train set shape: (80, 2)
Validation set shape: (10, 2)
Test set shape: (10, 2)
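
The two calls above first hold out 20% of the rows and then halve the held-out part, which gives the 80/10/10 split. A rough plain-Python sketch of the same arithmetic follows; the helper `two_stage_split` is hypothetical and does not reproduce `sklearn`'s shuffling, so the rows it selects will differ.

```python
import random

def two_stage_split(items, seed=123):
    # Shuffle a copy so the input is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10   # first stage: 80% train, 20% held out
    held_out = shuffled[n_train:]
    n_val = len(held_out) // 2          # second stage: halve the held-out part
    return shuffled[:n_train], held_out[:n_val], held_out[n_val:]

train_ids, val_ids, test_ids = two_stage_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```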


In [12]:
# create a new directory to save split datasets.
import os

data_path = "./data/split/"
os.makedirs(data_path, exist_ok=True)  # also creates missing parent directories

In [13]:
cols = ["sms", "label"]
train.to_csv(data_path + "train.csv", columns=cols, index=False)
validation.to_csv(data_path + "val.csv", columns=cols, index=False)
test.to_csv(data_path + "test.csv", columns=cols, index=False)

<br><br>

## Declaring the fields

`torchtext` takes a declarative approach to loading its data: you tell `torchtext` what you want the data to look like, and `torchtext` handles it for you.

In [16]:
# import related packages
import nltk
import torch
import torchtext
from nltk.tokenize import sent_tokenize, word_tokenize
from torchtext.legacy.data import (
    BucketIterator,
    Field,
    Iterator,
    LabelField,
    TabularDataset,
)

Let's define a `tokenize_nltk` function for tokenization. It will be passed to `torchtext`; it takes a sentence as a string and returns it as a list of tokens.

In [17]:
def tokenize_nltk(text):
    """
    Tokenize with NLTK's word_tokenize, which splits on whitespace and separates punctuation.
    """
    return word_tokenize(text)

In [18]:
tokenize_nltk("This is a test! ")

['This', 'is', 'a', 'test', '!']

`torchtext`'s `Field`s define how data should be processed. You can read about all of the possible arguments [here](https://torchtext.readthedocs.io/en/latest/data.html#field). These arguments load and process the input data appropriately.

- **For the sms text**, we pass in the preprocessing we want the field to do as keyword arguments. We give it the tokenizer we want the field to use, tell it to convert the input to lowercase, and also tell it the input is sequential.

- **For the label input**, the input is not sequential and doesn't need an unknown token for out-of-vocabulary words.

In [20]:
TEXT = Field(sequential=True, tokenize=tokenize_nltk, lower=True)
LABEL = Field(sequential=False, unk_token=None)
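
As a rough illustration of what these arguments mean, the sketch below mimics the preprocessing step in plain Python. The function `preprocess` is a hypothetical stand-in, not the `torchtext` API, and whitespace splitting is used in place of the NLTK tokenizer.

```python
# Hypothetical stand-in for a field's preprocessing step; not the torchtext API.
def preprocess(value, sequential=True, lower=False, tokenize=str.split):
    if lower:
        value = value.lower()
    # Sequential fields are tokenized; non-sequential ones are kept whole.
    return tokenize(value) if sequential else value

print(preprocess("Free entry NOW", sequential=True, lower=True))  # ['free', 'entry', 'now']
print(preprocess("spam", sequential=False))                       # spam
```

A text field ends up as a list of lowercase tokens, while a label stays a single value.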

<br><br>

## Constructing the `Dataset`

The fields know what to do when given raw data. Now we need to tell the fields what data they should work on. This is where we use datasets. To process the CSV files, we use the `TabularDataset` class.

In [21]:
train, val, test = TabularDataset.splits(
    path=data_path,  # the root directory where the data lies
    train="train.csv",
    validation="val.csv",
    test="test.csv", 
    format="csv",
    skip_header=True,
    fields=[("sms", TEXT), ("label", LABEL)],
)

For the TabularDataset, we pass in a list of `(name, field)` pairs as the fields argument. The fields we pass in must be in the same order as the columns.
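
To see why the order matters, here is a small sketch using only Python's `csv` module rather than `TabularDataset`. The CSV content is made up; the names in `field_names` are assigned to columns purely by position.

```python
import csv
import io

# Made-up CSV content with the same column order as the tutorial's files.
raw = io.StringIO("sms,label\nFree entry now,spam\nSee you at 5,ham\n")

field_names = ["sms", "label"]  # must follow the column order in the file
reader = csv.reader(raw)
next(reader)                    # skip the header row, like skip_header=True
examples = [dict(zip(field_names, row)) for row in reader]
print(examples[0])  # {'sms': 'Free entry now', 'label': 'spam'}
```

Swapping the two names in `field_names` would silently attach the message text to `label` and the label to `sms`.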

The `splits` method creates three datasets for the train, validation, and test data by applying the same processing to each. We give it the root directory of the datasets and the corresponding file names.

Each dataset is list-like: it can be indexed and iterated over like a normal list.

In [22]:
print(train[0])

<torchtext.legacy.data.example.Example object at 0x7f8ba13c91c0>


We get an `Example` object. Each `Example` object holds the attributes of a single data point. 

In [23]:
print(train[0].__dict__.keys())
print(train[0].sms)
print(train[0].label)

dict_keys(['sms', 'label'])
['x', 'course', 'it', '2yrs', '.', 'just', 'so', 'her', 'messages', 'on', 'messenger', 'lik', 'you', 'r', 'sending', 'me']
ham


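We now need to build our vocabulary to map words to integers. The vocabulary is built from the training set, keeping only tokens that occur at least twice (`min_freq=2`), and pretrained GloVe vectors are attached to it. When you run the following for the first time, it'll download the word embeddings.  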

In [24]:
TEXT.build_vocab(train, min_freq=2, vectors='glove.6B.100d')
LABEL.build_vocab(train)
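
The effect of `min_freq` can be sketched in plain Python. The helper `build_vocab` below is an illustrative assumption, not the `torchtext` implementation: it counts tokens, drops those seen fewer than `min_freq` times, and places the special tokens first.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2, specials=("<unk>", "<pad>")):
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Special tokens come first; the rest are ordered by descending frequency,
    # with ties broken alphabetically.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

corpus = [["call", "me", "now"], ["call", "me"], ["call", "later"]]
itos, stoi = build_vocab(corpus)
print(itos)                                # ['<unk>', '<pad>', 'call', 'me']
print(stoi.get("later", stoi["<unk>"]))    # 0
```

Tokens that fall below the threshold, such as `later` here, are later looked up as `<unk>`.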

Then, let us take a look at the vocabulary of sms texts. 

In [25]:
TEXT.vocab.freqs.most_common(20)

[('.', 66),
 ('i', 50),
 ('...', 34),
 ('you', 27),
 ('to', 27),
 (',', 19),
 ('a', 18),
 ('u', 17),
 ('?', 14),
 ('!', 14),
 (':', 14),
 ('&', 13),
 ('call', 13),
 ('my', 13),
 ('2', 13),
 (';', 12),
 ('*', 12),
 ("'m", 11),
 ('the', 11),
 ('me', 10)]

and the vocabulary of labels. 
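
A label vocabulary without an unknown token is just a mapping from each label seen in training to an integer. The following plain-Python sketch shows the idea; `build_label_vocab` is a hypothetical helper and the label list is made up.

```python
from collections import Counter

def build_label_vocab(labels):
    counts = Counter(labels)
    # The most frequent label gets index 0; ties are broken alphabetically.
    itos = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: i for i, lab in enumerate(itos)}

label_stoi = build_label_vocab(["ham", "ham", "spam", "ham"])
print(label_stoi)  # {'ham': 0, 'spam': 1}
```

Because there is no fallback entry, looking up a label that never appeared in training raises a `KeyError` in this sketch.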

Let's show an example to convert text strings to tensor:

In [28]:
sent1 = TEXT.preprocess("I like to watch the sunset.")
print('preprocessed sent1: ', sent1)
sent2 = TEXT.preprocess("Be calm.")
print('preprocessed sent2: ', sent2)
# convert tokens to tensor
tensor = TEXT.process([sent1, sent2])
print(tensor)
print(tensor.shape)

preprocessed sent1:  ['i', 'like', 'to', 'watch', 'the', 'sunset', '.']
preprocessed sent2:  ['be', 'calm', '.']
tensor([[ 3, 46],
        [ 0,  0],
        [ 5,  2],
        [ 0,  1],
        [20,  1],
        [ 0,  1],
        [ 2,  1]])
torch.Size([7, 2])


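In the output tensor, each column is one sentence. The second sentence is padded with the padding index `1` because it is shorter than the first one, and out-of-vocabulary tokens are mapped to index `0`.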
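
The layout of that tensor can be reproduced with plain lists. This is a sketch under the assumption that padding is added on the right and the result is transposed; `to_columns` is a hypothetical helper, and the ids are copied from the output printed above.

```python
def to_columns(sequences, pad_index=1):
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that row i holds the i-th token of every sequence.
    return [list(row) for row in zip(*padded)]

sent1_ids = [3, 0, 5, 0, 20, 0, 2]  # ids of the first sentence in the output above
sent2_ids = [46, 0, 2]              # ids of the second sentence in the output above
grid = to_columns([sent1_ids, sent2_ids])
for row in grid:
    print(row)
```

The resulting grid has 7 rows and 2 columns, matching the `[7, 2]` shape of the tensor.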

<br><br>

## Constructing the `Iterator`

The final step of preparing the data is to create the iterators. 
An iterator walks over a dataset and, at each step, generates a batch of data with one attribute per field: here, `sms` (a PyTorch tensor containing a batch of numericalized text) and `label` (a PyTorch tensor containing a batch of numericalized labels).

Below is the code to initialize the iterators for the train, validation, and test data, respectively.

In [30]:
from torchtext.legacy.data import BucketIterator, Iterator

train_iter, val_iter, test_iter = BucketIterator.splits(
    (
        train,
        val,
        test,
    ),  # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(4, 64, 64),  # batch sizes for train, validation, and test, respectively
    # A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding.
    sort_key=lambda x: len(x.sms),
    sort=True,
    sort_within_batch=True,
)

The `sort_within_batch` argument, when set to `True`, sorts the data within each minibatch in decreasing order according to the `sort_key`.

The [`BucketIterator`](https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.BucketIterator) automatically shuffles and buckets the input sequences into batches of similar length. Within each batch, the input sequences must be padded to the same length to enable batch processing. The amount of padding necessary is determined by the longest sequence in the batch. Therefore, padding is most efficient when the sequences are of similar lengths.
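
A quick back-of-the-envelope check of this claim, in plain Python with made-up sequence lengths: `padding_needed` is a hypothetical helper that counts how many padding tokens are required when consecutive sequences are grouped into batches.

```python
def padding_needed(lengths, batch_size):
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence is padded up to the longest one in its batch.
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [2, 30, 3, 28, 4, 29, 5, 31]  # made-up token counts
print(padding_needed(lengths, batch_size=4))          # 112 (original order)
print(padding_needed(sorted(lengths), batch_size=4))  # 12 (grouped by length)
```

Grouping sequences of similar length cuts the padding in this toy example from 112 tokens to 12.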

<br><br>

### Using the `Iterator`
We use a for loop to draw batches from an iterator. The iterator returns a custom data type called `torchtext.data.batch.Batch`. A `Batch` object exposes a batch of samples from each field in the dataset as attributes. 

In [31]:
for batch in train_iter:
    print(batch)
    messages = batch.sms
    labels = batch.label
    break  # we use first batch as an example.


[torchtext.legacy.data.batch.Batch of size 4]
	[.sms]:[torch.LongTensor of size 5x4]
	[.label]:[torch.LongTensor of size 4]


`messages` and `labels` are tensors, so you can use them to train any machine learning model with [PyTorch](https://pytorch.org/).

The `messages` tensor has shape `[maximum sequence length, batch_size]`.

The `labels` tensor has shape `[batch_size]`.

Let us see what the `messages` tensor looks like.

In [32]:
messages

tensor([[ 51, 176,   0,   0],
        [  0, 145,   0, 115],
        [ 31,   0,   2,   1],
        [ 24,   2,   1,   1],
        [  0,   1,   1,   1]])

and the `labels` tensor:

In [33]:
labels

tensor([0, 0, 0, 0])

**We also can convert indexes to text tokens.**

We convert these four samples back to text.

In [34]:
print("processed sentence: ")
for j in range(messages.shape[1]):  # sample loop
    tmp = []  # create a output container
    for i in range(messages.shape[0]):  # token loop
        tmp.append(TEXT.vocab.itos[messages[i, j]])
    print(j, " sample:", tmp)

processed sentence: 
0  sample: ['just', '<unk>', '..', 'and', '<unk>']
1  sample: ['yup', 'next', '<unk>', '.', '<pad>']
2  sample: ['<unk>', '<unk>', '.', '<pad>', '<pad>']
3  sample: ['<unk>', 'babe', '<pad>', '<pad>', '<pad>']


You can see that the last three samples are padded to a length of 5, the length of the longest sample in the batch. The out-of-vocabulary tokens are replaced with the `<unk>` token.

**What do the original samples look like?**

In [35]:
samples = []
for i in range(len(train)):
    samples.append(train.examples[i].sms)
samples.sort(key=len)

In [36]:
samples[0:4]

[['\\alright', 'babe'],
 ['say', 'thanks2', '.'],
 ['yup', 'next', 'stop', '.'],
 ['just', 'sleeping', '..', 'and', 'surfing']]
