# Introduction to Torchtext

Torchtext is a Python library that helps you process text data for NLP tasks: reading text files from disk, tokenizing the text, converting the tokens to lists of integers, padding the sequences in a batch, and so on.

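Make sure you have Python 3.5+ and PyTorch 1.4.0 or newer. This tutorial uses the `Field`, `TabularDataset`, and `BucketIterator` classes, which were moved to `torchtext.legacy` in torchtext 0.9 and removed in 0.12, so you need a torchtext release that still provides them. You can then install torchtext using pip: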

`pip3 install torchtext`

We will use the English tokenizer from [SpaCy](https://spacy.io/), so you need to install SpaCy and download its English model:

```
pip3 install spacy
python3 -m spacy download en_core_web_sm
```

## The Overview

Torchtext takes in raw data in the form of text files, e.g., csv/tsv or json files, and converts them to `torchtext.data.Dataset` objects.

Torchtext then passes the Dataset to an Iterator. Iterators handle numericalizing, batching, packaging, and moving the data to the given device (CPU or GPU).

![](./images/torchtext.png)

## Dataset

We create a small trial dataset that includes 100 samples from [EmoNet](https://www.aclweb.org/anthology/P17-1067.pdf). The first row is the header of this corpus. The first column is the sample id, the second column is the tweet text, and the third column is the corresponding label (emotion). We can use pandas to load this tsv file. You can find more information about `read_csv` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html); it provides many arguments to help you load files appropriately.

In [29]:
import pandas as pd
df = pd.read_csv("./data/small_trial.tsv", sep = '\t', header=0) # the separator of tsv file is `\t`
df.head()

| | id | tweet | label |
|---|---|---|---|
| 0 | 0 | Ann , Bruce and I went to Selkirk Council of C... | trust |
| 1 | 1 | At the movies .. seeing sex and the city .. by... | sadness |
| 2 | 2 | Do not your memory ; it is a net full of holes... | trust |
| 3 | 3 | I seriously hate it when older men act like th... | fear |
| 4 | 4 | The fact that my vote will be based on who I d... | sadness |


## Select Columns 
1. `tweet`: the tweet text, i.e., input x
2. `label`: label of given tweet, i.e., prediction goal y.

In [30]:
ddf = df[['tweet','label']]

In [31]:
ddf.head()

| | tweet | label |
|---|---|---|
| 0 | Ann , Bruce and I went to Selkirk Council of C... | trust |
| 1 | At the movies .. seeing sex and the city .. by... | sadness |
| 2 | Do not your memory ; it is a net full of holes... | trust |
| 3 | I seriously hate it when older men act like th... | fear |
| 4 | The fact that my vote will be based on who I d... | sadness |


Then, we use `sklearn` to split the dataset into train, validation, and test sets.

You can install `sklearn` with:
```
pip3 install scikit-learn
```

In [32]:
#load sklearn
from sklearn.model_selection import train_test_split

#### `train_test_split()` can only split data into two partitions

In [33]:
# First, randomize samples and split into train (80%) and validation_test (20%). 
# random_state is the seed used by the random number generator, you can use any integer number.
train, validation_test = train_test_split(ddf, test_size=0.2, random_state=200) 

In [34]:
# Then, split validation_test into validation (10%) and test (10%) sets.
validation, test = train_test_split(validation_test, test_size=0.5) 

In [35]:
print("Train set shape:", train.shape)
print("Validation set shape:", validation.shape)
print("Test set shape:", test.shape)

Train set shape: (80, 2)
Validation set shape: (10, 2)
Test set shape: (10, 2)
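
The two-stage arithmetic is easy to lose track of: 20% of the data is held out first, and half of that 20% becomes the test set, so each of the two small sets is 10% of the original. Below is a minimal sketch of the same idea in plain Python, written without `sklearn` so the arithmetic is visible. The helper name `two_stage_split` and the variable names are hypothetical, and it shuffles differently from `train_test_split`, so the rows that land in each set will not match the ones above.

```python
import random


def two_stage_split(items, holdout_frac=0.2, test_frac_of_holdout=0.5, seed=200):
    """Hypothetical helper: shuffle once, then cut into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible

    # Stage 1: hold out 20% of all samples as a combined validation+test pool.
    n_holdout = round(len(items) * holdout_frac)
    cut = len(items) - n_holdout
    train_part, holdout = items[:cut], items[cut:]

    # Stage 2: split the pool in half, so each half is 10% of the original data.
    n_test = round(len(holdout) * test_frac_of_holdout)
    cut = len(holdout) - n_test
    validation_part, test_part = holdout[:cut], holdout[cut:]
    return train_part, validation_part, test_part


train_ids, val_ids, test_ids = two_stage_split(range(100))
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```

Every sample ends up in exactly one of the three sets, which is the property the `sklearn` version above also guarantees.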


In [40]:
# create a new directory to save split datasets.
import os
save_path = "./split/"
os.makedirs(save_path, exist_ok=True)

Use `to_csv()` to save a pandas DataFrame to a tsv file. 
You can find more information about this function [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).

In [42]:
# save to tsv files
train.to_csv("./split/train.tsv", sep='\t', index=False)
validation.to_csv("./split/val.tsv", sep='\t', index=False)
test.to_csv("./split/test.tsv", sep='\t', index=False)

## Declaring the Fields

Torchtext takes a declarative approach to loading its data: you tell torchtext what you want the data to look like, and torchtext handles it for you.

In [16]:
# import related packages
import torch
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
import spacy

##### Use [SpaCy English model](https://spacy.io/models/en) to tokenize text

In [17]:
# load SpaCy English model to process 
spacy_en = spacy.load('en_core_web_sm')

If you cannot load the model, you may try this:

`pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz`

We use the [`tokenizer`](https://spacy.io/api/tokenizer) of the loaded spaCy model to split a string into a list of tokens. For example:

In [77]:
[tok.text for tok in spacy_en.tokenizer("Use SpaCy English model to tokenize text.")]

['Use', 'SpaCy', 'English', 'model', 'to', 'tokenize', 'text', '.']

We create the `tokenize_en` function. It will be passed to torchtext; it takes a sentence as a string and returns the sentence as a list of tokens.

In [18]:
def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]
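
Since `tokenize_en` is just a function from a string to a list of strings, any callable with that shape can play the same role. As a rough illustration, here is a regex-based stand-in that needs no spaCy model. The name `simple_tokenize` is hypothetical, and it is not equivalent to spaCy's tokenizer; it only happens to agree on simple sentences like the one above.

```python
import re

# One token is either a run of word characters or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Hypothetical stand-in for tokenize_en; NOT equivalent to spaCy."""
    return _TOKEN_RE.findall(text)


print(simple_tokenize("Use SpaCy English model to tokenize text."))
# ['Use', 'SpaCy', 'English', 'model', 'to', 'tokenize', 'text', '.']
```

The difference shows up on contractions: `simple_tokenize("don't")` returns `['don', "'", 't']`, which does not match the `'do'`, `"n't"` tokens that appear in this tutorial's vocabulary.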

Torchtext's `Field` objects determine how data should be processed. You can read about all of the possible arguments [here](https://torchtext.readthedocs.io/en/latest/data.html#field). These arguments control how the input data is loaded and processed.

* **For the tweet text**, we pass in the preprocessing we want the field to do as keyword arguments. We give it the tokenizer we want the field to use, tell it to convert the input to lowercase, and also tell it the input is sequential.

* **For the label input**, the data is not sequential and does not need an unknown token for out-of-vocabulary words.

In [44]:
TEXT = Field(sequential=True, tokenize=tokenize_en, lower=True)
LABEL = Field(sequential=False, unk_token=None)

## Constructing the Dataset

The fields know what to do when given raw data. Now, we need to tell the fields what data they should work on. This is where we use Datasets. To process the tsv file, we use the `TabularDataset` class.

In [45]:
train, val, test = TabularDataset.splits(
               path="./split/", # the root directory where the data lies
               train='train.tsv', validation="val.tsv", test="test.tsv", # file names
               format='tsv',
               skip_header=True, # if your tsv file has a header, make sure to pass this to ensure it doesn't get processed as data!
               fields=[('tweet', TEXT), ('label', LABEL)])

For the `TabularDataset`, we pass in a list of `(name, field)` pairs as the `fields` argument. The fields we pass in must be in the same order as the columns.

The `splits` method creates three datasets for the train, validation, and test data by applying the same processing to each. We give it the root directory of the datasets and the corresponding file names.

Each dataset is list-like: it can be indexed and iterated over like a normal list.

In [81]:
print(train[0])

<torchtext.data.example.Example object at 0x12d79bac8>


We get an `Example` object. Each `Example` object stores the attributes of a single data point.

In [82]:
print(train[0].__dict__.keys())
print(train[0].tweet)
print(train[0].label)

dict_keys(['tweet', 'label'])
['<', 'user', '>', 'me', 'too', 'i', 'fell', 'off', 'pretty', 'much', 'everything']
sadness


We now need to build our vocabulary to map words to integers. To build the vocabulary in an unbiased way, we need to do so on the training data only. We keep only the words that occur at least twice in the train set, so we set `min_freq=2`. You can also use `max_size=n` to cap the vocabulary size; only the top $n$ most frequent tokens are kept.

In [48]:
TEXT.build_vocab(train, min_freq=2)
LABEL.build_vocab(train)
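
To make the effect of `min_freq` and `max_size` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. It is an illustration under stated assumptions, not torchtext's implementation: the function name `build_vocab_sketch` and the toy sentences are hypothetical, special tokens are assumed to come first, and `max_size` is assumed not to count them.

```python
from collections import Counter


def build_vocab_sketch(tokenized_texts, min_freq=1, max_size=None,
                       specials=("<unk>", "<pad>")):
    """Hypothetical helper: map tokens to integer ids by descending frequency."""
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent tokens first; ties are broken alphabetically.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))

    itos = list(specials)  # index -> token, special tokens take the first ids
    for token, freq in ranked:
        if freq < min_freq:
            break  # every remaining token is rarer still
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # assumption: max_size does not count the special tokens
        itos.append(token)

    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return stoi, itos


toy_texts = [["i", "hate", "rain", "."],
             ["i", "love", "rain", "!"],
             ["i", "hate", "mondays", "."]]

toy_stoi, toy_itos = build_vocab_sketch(toy_texts, min_freq=2)
print(toy_itos)  # ['<unk>', '<pad>', 'i', '.', 'hate', 'rain']

# Words seen only once ("love", "mondays", "!") fall back to the <unk> index.
print(toy_stoi.get("love", toy_stoi["<unk>"]))  # 0
```

With `max_size=2` the same toy data would keep only `'i'` and `'.'` after the two special tokens.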

Then, let us take a look at the vocabulary of tweet texts. 

In [86]:
print(TEXT.vocab.stoi)

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x12d799eb8>>, {'<unk>': 0, '<pad>': 1, '#': 2, '.': 3, 'i': 4, ',': 5, '!': 6, 'the': 7, '<': 8, 'a': 9, 'to': 10, '>': 11, 'user': 12, 'of': 13, '..': 14, 'is': 15, 'in': 16, 'you': 17, 'my': 18, '?': 19, 'do': 20, "n't": 21, 'on': 22, 'and': 23, 'for': 24, 'just': 25, 'me': 26, 'that': 27, 'this': 28, 'be': 29, '"': 30, "'m": 31, "'s": 32, '-': 33, 'at': 34, 'but': 35, 'have': 36, 'like': 37, ':(': 38, 'as': 39, 'it': 40, 'no': 41, 'not': 42, 'one': 43, 'people': 44, 'want': 45, 'was': 46, 'we': 47, 'your': 48, 'all': 49, 'are': 50, 'god': 51, 'how': 52, 'never': 53, 'she': 54, 'so': 55, 'what': 56, 'when': 57, 'with': 58, '&': 59, "'": 60, '3': 61, 'an': 62, 'back': 63, 'before': 64, 'by': 65, 'day': 66, 'go': 67, 'got': 68, 'her': 69, 'ko': 70, 'morning': 71, 'much': 72, 'off': 73, 'oh': 74, 'or': 75, 'them': 76, 'there': 77, 'up': 78, 'vote': 79, 'wanna': 80, 'week': 81, 'were': 82, "'re": 83, 

and the vocabulary of labels. 

In [87]:
print(LABEL.vocab.stoi)

defaultdict(None, {'sadness': 0, 'joy': 1, 'disgust': 2, 'fear': 3, 'trust': 4, 'anger': 5, 'surprise': 6})


Let's look at an example that converts text strings to a tensor:

In [89]:
token_1 = TEXT.preprocess('a woman with a large purse is walking by a gate.')
print("token 1:", token_1)
token_2 = TEXT.preprocess('i am very good.')
print("token 2:", token_2)
# convert tokens to tensor
tensor = TEXT.process([token_1,token_2])
print(tensor)
print(tensor.shape)

token 1: ['a', 'woman', 'with', 'a', 'large', 'purse', 'is', 'walking', 'by', 'a', 'gate', '.']
token 2: ['i', 'am', 'very', 'good', '.']
tensor([[ 9,  4],
        [ 0,  0],
        [58,  0],
        [ 9,  0],
        [ 0,  3],
        [ 0,  1],
        [15,  1],
        [ 0,  1],
        [65,  1],
        [ 9,  1],
        [ 0,  1],
        [ 3,  1]])
torch.Size([12, 2])


In the output tensor, each column is one sentence, and the second sentence is padded to a length of 12 with the `<pad>` index (1).
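
What `TEXT.process` does here can be sketched in a few lines of plain Python: look up each token, fall back to the `<unk>` index, pad to the longest sentence, and transpose so that rows are token positions. The helper `numericalize_and_pad` is hypothetical, and `toy_stoi` copies only the handful of vocabulary entries needed from the `TEXT.vocab.stoi` printout above.

```python
def numericalize_and_pad(batch_tokens, stoi, unk_index=0, pad_index=1):
    """Hypothetical helper: tokens -> padded ids laid out as [seq_len, batch]."""
    max_len = max(len(tokens) for tokens in batch_tokens)
    columns = []
    for tokens in batch_tokens:
        ids = [stoi.get(tok, unk_index) for tok in tokens]  # unknown words -> <unk>
        ids += [pad_index] * (max_len - len(ids))            # right-pad with <pad>
        columns.append(ids)
    # Transpose: one row per token position, one column per sentence.
    return [list(row) for row in zip(*columns)]


# A small excerpt of the vocabulary printed earlier.
toy_stoi = {"<unk>": 0, "<pad>": 1, ".": 3, "i": 4, "a": 9,
            "is": 15, "with": 58, "by": 65}

sent_1 = ['a', 'woman', 'with', 'a', 'large', 'purse', 'is', 'walking', 'by', 'a', 'gate', '.']
sent_2 = ['i', 'am', 'very', 'good', '.']

rows = numericalize_and_pad([sent_1, sent_2], toy_stoi)
for row in rows:
    print(row)
print(len(rows), len(rows[0]))  # 12 2
```

The printed rows match the tensor above: `[9, 4]`, `[0, 0]`, `[58, 0]`, and so on down to `[3, 1]`.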

## Constructing the Iterator

The final step of preparing the data is to create the iterators. 
A dataset is traversed with an iterator. At each step, the iterator generates a batch of data with a `tweet` attribute (a PyTorch tensor containing a batch of numericalized text) and a `label` attribute (a PyTorch tensor containing a batch of numericalized labels).

Below is the code that initializes the iterators for the train, validation, and test data, respectively.

In [88]:
from torchtext.data import Iterator, BucketIterator

train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(4, 64, 64), # batch sizes for train, validation, and test, respectively
    # A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding.
    sort_key=lambda x: len(x.tweet),
    sort=True,
    sort_within_batch=False
)

The `sort_within_batch` argument, when set to True, sorts the data within each minibatch in decreasing order according to the `sort_key`.
 
The [`BucketIterator`](https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.BucketIterator) automatically shuffles and buckets the input sequences into sequences of similar length. In each batch,  we need to pad the input sequences to be of the same length to enable batch processing. The amount of padding necessary is determined by the longest sequence in the batch. Therefore, padding is most efficient when the sequences are of similar lengths.
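
To see why grouping by length matters, we can count the padding tokens needed when sequences are batched in arbitrary order versus after sorting by length. This is a small sketch on made-up sequence lengths, not a measurement on the tweet data; `count_padding` is a hypothetical helper.

```python
def count_padding(lengths, batch_size):
    """Hypothetical helper: total <pad> tokens when batching in the given order."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)  # every sequence is padded to the longest in its batch
        total += sum(longest - n for n in batch)
    return total


# Made-up sequence lengths: a mix of short and long examples.
lengths = [5, 30, 6, 28, 7, 31, 5, 29]

unsorted_pad = count_padding(lengths, batch_size=4)
sorted_pad = count_padding(sorted(lengths), batch_size=4)
print(unsorted_pad)  # 103
print(sorted_pad)    # 11
```

In this toy case, batching similar lengths together cuts the padding from 103 tokens to 11.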

### Use Iterator
We use a for-loop to draw batches from an iterator. The iterator returns a custom datatype called `torchtext.data.batch.Batch`. A `Batch` object exposes the batched data of each field in the dataset as an attribute.

In [94]:
for batch in train_iter:
    print(batch)
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.


[torchtext.data.batch.Batch of size 4]
	[.tweet]:[torch.LongTensor of size 7x4]
	[.label]:[torch.LongTensor of size 4]


`tweets` and `labels` are tensors, so you can use them to train any machine learning model with [PyTorch](https://pytorch.org/).

The `tweets` tensor has shape `[max sequence length in the batch, batch_size]`.

The label tensor size is `[batch_size]`.

**Let us see what the `tweets` tensor looks like.**

In [90]:
tweets

tensor([[ 20,   0,   4,   0],
        [ 21,   0,   0,  39],
        [ 86,   0, 105,   0],
        [  2,  53,   0,   4],
        [  0,   0,   0, 112],
        [  1,  38,   3,   0],
        [  1,   1,   1,   0]])

and the `labels` tensor:

In [91]:
labels

tensor([4, 0, 2, 0])
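
Since these are ordinary tensors, they can be fed straight into a PyTorch model. The sketch below runs one forward pass of a tiny classifier on the batch shown above, copied in as literals. Everything about the model is an assumption made for illustration: the class name `BagOfEmbeddings`, the embedding size of 16, and `VOCAB_SIZE = 200`, which in practice would be `len(TEXT.vocab)`.

```python
import torch
import torch.nn as nn

# The batch printed above, copied as literals: shape [seq_len=7, batch_size=4].
example_tweets = torch.tensor([[20,  0,   4,   0],
                               [21,  0,   0,  39],
                               [86,  0, 105,   0],
                               [ 2, 53,   0,   4],
                               [ 0,  0,   0, 112],
                               [ 1, 38,   3,   0],
                               [ 1,  1,   1,   0]])
example_labels = torch.tensor([4, 0, 2, 0])

VOCAB_SIZE = 200  # assumption: any value above the largest index; use len(TEXT.vocab) in practice
NUM_CLASSES = 7   # the seven emotions in LABEL.vocab
PAD_IDX = 1       # index of '<pad>' in TEXT.vocab


class BagOfEmbeddings(nn.Module):
    """Hypothetical minimal classifier: average the word embeddings, then classify."""

    def __init__(self, vocab_size, embed_dim, num_classes, pad_idx):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed at zero.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # [seq_len, batch, embed_dim]
        pooled = embedded.mean(dim=0)         # average over positions -> [batch, embed_dim]
        return self.classifier(pooled)        # [batch, num_classes]


model = BagOfEmbeddings(VOCAB_SIZE, 16, NUM_CLASSES, PAD_IDX)
logits = model(example_tweets)
loss = nn.CrossEntropyLoss()(logits, example_labels)
print(logits.shape)  # torch.Size([4, 7])
```

Note that the sequence dimension comes first, so the pooling is done over `dim=0`; a model that expects `[batch, seq_len]` input would need the tensor transposed first.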

**We can also convert indexes back to text tokens.**

We convert these four samples back to text.

In [27]:
print("processed sentence: ")
for j in range(tweets.shape[1]): # sample loop
    tmp = [] # create an output container
    for i in range(tweets.shape[0]): # token loop
        tmp.append(TEXT.vocab.itos[tweets[i,j]])
    print(j," sample:",tmp)

processed sentence: 
0  sample: ['do', "n't", 'anyone', '#', '<unk>', '<pad>', '<pad>']
1  sample: ['<unk>', '<unk>', '<unk>', 'never', '<unk>', ':(', '<pad>']
2  sample: ['i', '<unk>', 'hate', '<unk>', '<unk>', '.', '<pad>']
3  sample: ['<unk>', 'as', '<unk>', 'i', 'm', '<unk>', '<unk>']


You can see that the first three samples are padded to a length of 7, and the out-of-vocabulary tokens are replaced with the `<unk>` token.

**What do the original samples look like?**

In [92]:
# collect the tokenized tweets and sort them from shortest to longest
samples = sorted((example.tweet for example in train.examples), key=len)

In [93]:
samples[0:4]

[['do', "n't", 'anyone', '#', 'onefreeadvice'],
 ['long', 'distance', 'relationships', 'never', 'work', ':('],
 ['i', 'absolutely', 'hate', 'burger', 'king', '.'],
 ['feels', 'as', 'though', 'i', 'm', 'slowly', 'dyeing']]

### Exercise: 

I provide a corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0). The `happy_db_10k.csv` file includes all the labeled data of this task. Please use `happy_db_10k.csv` to answer the following questions.

1. Load this `csv` file, whose delimiter is `,`. 
2. What columns are in this file?
3. Split this file into train (80%), validation (10%), and test (10%).
4. Create a new directory called `happydb_split`, and save train, validation, and test as `train.tsv`, `val.tsv`, and `test.tsv` respectively. 
5. Define `Field(s)` to process the datasets. Use the `tokenize_en` function as the tokenizer. Note: there are two labels in this dataset (i.e., agency and social).
6. Use your defined Field(s) and `TabularDataset` to load the train, validation, and test sets of the `happydb` corpus.
7. What are the keys in each Example object?
8. Create the vocabulary for each defined `Field` with the train set. Only include the top 5,000 most frequent tokens.  
9. How many elements are in the vocabulary of each Field?
10. Use `BucketIterator` to create `Iterators` for the train, validation, and test sets with the same batch size of 32. Sort by the length of samples, and also sort samples within each batch. 
11. What elements are in the first batch of the train set? What shape are they?
12. How many batches are in the Iterator of the train set, validation set, and test set?
13. Can you convert the tensor of the first sample back to text?

## References:
* https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
* https://medium.com/@adam.wearne/lets-get-sentimental-with-pytorch-dcdd9e1ea4c9
* https://spacy.io/
* https://www.aclweb.org/anthology/P17-1067.pdf
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
* https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
* https://pytorch.org/tutorials/index.html