# Preprocessing with torchtext

By [Dominika Woszczyk](https://github.com/domiwk)

This notebook gives a quick explanation of how to use `torchtext.data.Field` to prepare your data.

Resources to explore it further are linked at the end.

## 1. Data.Field

In PyTorch, `torchtext.data` is a module that helps you preprocess your data and load popular datasets.

According to the [documentation](https://torchtext.readthedocs.io/en/latest/data.html), the data module provides the following:

- Ability to define a preprocessing pipeline
- Batching, padding, and numericalizing (including building a vocabulary object)
- Wrapper for dataset splits (train, validation, test)
- Loader for a custom NLP dataset


In the [second lab](https://colab.research.google.com/github/ImperialNLP/NLPLabs/blob/master/lab02/lab02.ipynb), we use the class [data.Field](https://torchtext.readthedocs.io/en/latest/data.html#fields) to speed up the preprocessing of our dataset before feeding it to our model.


When calling ``data.Field()``, you can set many parameters that define how your dataset is processed before it is turned into tensors. In the example below we use:
- **sequential**: If True, the data is treated as a sequence and tokenized.
- **lower**: If True, lowercases all text.
- **tokenize**: The tokenizer function to use. Can be set to ``"spacy"`` to use the spaCy tokenizer. The default is ``string.split``.

Other useful parameters are: 
- **eos_token**: A token appended to the end of every sample (end-of-sentence token).
- **stop_words**: A list of stop words to remove from the tokens.
- **preprocessing**: A preprocessing pipeline that is applied after tokenizing.
- **fix_length**: Pads all samples to the given length.
- **use_vocab**: If False, the data is assumed to be numerical already and no word2idx ``Vocab`` object is used.
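
To make the effect of these parameters more concrete, here is a minimal pure-Python sketch of roughly what a `Field` does to a single sample. It is not torchtext's implementation: the helper name `preprocess`, its argument defaults and the toy stop-word set are assumptions made for illustration.

```python
def preprocess(text, lower=True, stop_words=None, eos_token=None,
               fix_length=None, pad_token="<pad>"):
    """Hypothetical helper mimicking the main Field options on one sample."""
    tokens = text.split()                      # default tokenizer: str.split
    if lower:
        tokens = [t.lower() for t in tokens]
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    if fix_length is not None:
        # leave room for the end-of-sentence token, then truncate
        max_len = fix_length - (1 if eos_token is not None else 0)
        tokens = tokens[:max_len]
    if eos_token is not None:
        tokens = tokens + [eos_token]
    if fix_length is not None:
        # pad short samples up to the requested length
        tokens = tokens + [pad_token] * (fix_length - len(tokens))
    return tokens


sample = preprocess("The movie was great", stop_words={"the", "was"},
                    eos_token="<eos>", fix_length=5)
print(sample)  # ['movie', 'great', '<eos>', '<pad>', '<pad>']
```

In torchtext the last step (turning tokens into integer ids) is handled by the vocabulary, which is built later in this notebook.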



## 2. Sentiment analysis lab code
The following code follows the use of torchtext for processing from the second lab and adds some more explanation.

### 2.1 Import

In [None]:
import torch
import torch.nn as nn  # needed for the model defined later in this notebook
from torchtext import data, datasets
from torch.utils.data import DataLoader
import spacy
import random

SEED = 42

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


### 2.2 Load and process the dataset

In the code below we download the IMDb dataset and split it into the canonical train/test splits as ``torchtext.datasets`` objects. We process the data using the ``Field`` objects.

In [None]:

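# the 'en' shortcut is not available in recent spaCy versions, so we load the model by name
spacy_en = spacy.load('en_core_web_sm')

def tokenizer(text): # create a custom tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]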

text_field = data.Field(sequential=True, tokenize=tokenizer, lower=True)
# IMDb labels are the strings 'pos' and 'neg', so the label field needs a vocabulary
label_field = data.Field(sequential=False)

# get pre-defined split and apply Field transformations
train, test_init = datasets.IMDB.splits(text_field, label_field)

# define our own validation and test set (initial test set is too large)
# random.seed() returns None, so we seed first and pass the generator state explicitly
random.seed(SEED)
train, valid_test = train.split(split_ratio=0.9, random_state=random.getstate())
valid, test = valid_test.split(split_ratio=0.5, random_state=random.getstate())

print(f'Train size: {len(train)}')
print(f'Validation size: {len(valid)}')
print(f'Test size: {len(test)}')

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:09<00:00, 8.56MB/s]


Train size: 22500
Validation size: 1250
Test size: 1250


Our ``data.Field`` object has a ``vocab`` attribute that we build by calling ``build_vocab()`` with our dataset as input. This creates a lookup table that maps each word in our vocabulary to an integer id. Here we also supply the ``vectors`` parameter to attach pre-trained GloVe embeddings (numerical representations) to the ids of the words in our vocabulary.

In [None]:
# build vocabulary with maximum size (less frequent words are not considered)
# load the pre-trained word embeddings.
EMBEDDING_DIM = 50

text_field.build_vocab(train, max_size=25000, vectors=f"glove.6B.{EMBEDDING_DIM}d")
label_field.build_vocab(train)

.vector_cache/glove.6B.zip: 862MB [06:52, 2.09MB/s]                           
100%|█████████▉| 399494/400000 [00:12<00:00, 31214.80it/s]
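
The lookup table itself is simple to picture. Below is a minimal pure-Python sketch of the idea behind `build_vocab()` with a `max_size`. It is not torchtext's code: the helper `build_vocab` defined here, the toy corpus and the choice of special tokens are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokenized_samples, max_size=None, specials=("<unk>", "<pad>")):
    """Hypothetical helper: map the most frequent tokens to integer ids."""
    freqs = Counter(tok for sample in tokenized_samples for tok in sample)
    # most_common(None) returns every token, most frequent first
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos, freqs


corpus = [["a", "good", "film"], ["a", "bad", "film"], ["a"]]
stoi, itos, freqs = build_vocab(corpus, max_size=2)

# words outside the vocabulary fall back to the id of "<unk>"
unk = stoi["<unk>"]
ids = [stoi.get(tok, unk) for tok in ["a", "great", "film"]]
print(itos)  # ['<unk>', '<pad>', 'a', 'film']
print(ids)   # [2, 0, 3]
```

The `stoi` (string to index) and `itos` (index to string) names mirror the attributes of the torchtext `Vocab` object used in the cells below.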

We can check our vocabulary by printing the most common words.

In [None]:
print(text_field.vocab.freqs.most_common(20))

[('the', 295618), (',', 247478), ('.', 212955), ('and', 146607), ('a', 145345), ('of', 130731), ('to', 121602), ('is', 99075), ('it', 84058), ('in', 83452), ('i', 74098), ('this', 65970), ('that', 65542), ('"', 57220), ("'s", 55615), ('-', 47369), ('/><br', 45678), ('was', 45034), ('as', 41501), ('for', 39538)]


And check our labels.
 

In [None]:
print(label_field.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f9937224488>, {'<unk>': 0, 'neg': 1, 'pos': 2})


We can also access the vocabulary size and our embeddings, useful when training our models.

In [None]:
voc_size = len(text_field.vocab) 
pretrained_embeddings = text_field.vocab.vectors

Here is an example of how these can be used to initialise a model with our GloVe embeddings.

In [None]:

# Build an FFNN model with an Embedding layer.
class FFNN(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # hidden layer
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)   
        # activation
        self.relu1 = nn.ReLU()       
        # output layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  

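    def forward(self, x):
        # x is assumed to have shape (batch_size, max_sent_len), i.e. a Field with batch_first=True,
        # and index 0 is assumed to be the padding index (torchtext's default vocabulary
        # puts '<unk>' at 0 and '<pad>' at 1, so adapt the index if needed)
        embedded = self.embedding(x)
        # number of non-padding tokens per sentence, shape (batch_size, 1)
        sent_lens = x.ne(0).sum(1, keepdim=True)
        averaged = embedded.sum(1) / sent_lens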
        out = self.fc1(averaged)
        out = self.relu1(out)
        out = self.fc2(out)
        return out

# Get vocabulary size for the input dimension of the first layer
INPUT_DIM = len(text_field.vocab) 

EPOCHS = 10
LRATE = 0.5

# we define our embedding dimension (dimensionality of the output of the first layer)
EMBEDDING_DIM = 50
# dimensionality of the output of the hidden layer
HIDDEN_DIM = 50
# the output dimension: a single output unit is enough for binary classification
OUTPUT_DIM = 1

# Construct the model
model = FFNN(EMBEDDING_DIM, HIDDEN_DIM, INPUT_DIM, OUTPUT_DIM)

# Initialize the embedding layer with the GloVe embeddings from the
# vocabulary
pretrained_embeddings = text_field.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
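
The averaging step in `forward` can be written out as follows. This is only a sketch of what the code above computes, with $E$ the embedding matrix, $x_1, \dots, x_T$ the token ids of one padded sentence, and index 0 assumed to be the padding index:

```latex
$$
\bar{e} = \frac{1}{n} \sum_{t=1}^{T} E_{x_t},
\qquad
n = \bigl|\{\, t : x_t \neq 0 \,\}\bigr|
$$
```

The sum runs over all $T$ positions, so $\bar{e}$ is an exact mean over the real tokens only if the embedding of the padding index is a zero vector.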

### 2.3 Batch iterator

Finally, we build our iterator objects. An iterator splits a dataset into batches, which is needed for training and, when there is not enough memory to hold all samples at once, also for validation and testing. We then iterate over those batches in our training/validation loop.

In the field of computer vision, we often use `DataLoader` to iterate over batches, but for text we'll use a `BucketIterator`. It is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example. Torchtext will pad for us automatically (handled by the `Field` object).


We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this with `torch.device`; we pass this device to the iterator.



In [None]:
# get iterators over the data
# place iterators on the GPU if possible

# define our batch size
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
  (train, valid, test),
  batch_sizes=(BATCH_SIZE, BATCH_SIZE, BATCH_SIZE), device=DEVICE)
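
To see why grouping examples of similar length reduces padding, here is a minimal pure-Python sketch. It simply sorts toy samples by length before batching, which is a simplification of what `BucketIterator` does; the helper names `count_padding` and `make_batches` and the random toy data are assumptions made for illustration.

```python
import random


def count_padding(batches):
    """Total number of pad positions needed to make every batch rectangular."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)


def make_batches(samples, batch_size):
    """Cut a list of samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


random.seed(0)
# toy "sentences" with a random length between 5 and 200 tokens
samples = [["tok"] * random.randint(5, 200) for _ in range(640)]

naive = make_batches(samples, batch_size=64)
bucketed = make_batches(sorted(samples, key=len), batch_size=64)

print("padding without bucketing:", count_padding(naive))
print("padding with bucketing:   ", count_padding(bucketed))
```

Every batch is padded to the length of its longest example, so batches made of similar-length examples waste far fewer positions.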

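A `Batch` object is not used like the tuples returned by a PyTorch `DataLoader`. A single `Batch` object contains the data of one batch, and the text and labels are accessed as attributes named after the dataset's columns (`text` and `label` for the IMDb splits; `Text` and `Label` for the custom dataset of Section 3.3).

Here we check the first batch of the iterator.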

In [None]:
print(next(iter(train_iterator)))


[torchtext.data.batch.Batch of size 64]
	[.Text]:[torch.LongTensor of size 202x64]
	[.Label]:[torch.LongTensor of size 64]


We can also iterate over all batches.


In [None]:
# will output all elements
for batch in train_iterator:
    print(batch.text)   # the IMDb fields are named `text` and `label`
    print(batch.label)
    # Training/evaluation code

## 3. Some more examples

### 3.1 Using spacy tokenizer and stop words

In [None]:
spacy_nlp = spacy.load('en_core_web_sm')
spacy_stop_words = spacy.lang.en.stop_words.STOP_WORDS
print(spacy_stop_words)

text_field = data.Field(tokenize='spacy', lower=True, stop_words=spacy_stop_words)
label_field = data.Field(sequential=False, use_vocab=False) # we set sequential to false as we don't tokenise our labels


### 3.2 LabelField
We can use the normal Field() object for our labels or we can also use the specialised object LabelField(). Here we are forcing our labels to be of float type.

In [None]:
label_field = data.LabelField(dtype=torch.float)


### 3.3 Using our own tokenizer and dataset

In this example, we will import our own dataset and process it with torchtext.

In [None]:
# We create a new folder where we will put our downloaded dataset - in this case a text file
!mkdir dataset
!wget -O dataset/corpus.txt https://gist.githubusercontent.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235/raw/1d2261e2276cbb0257a2ed6e2f1f4320464c7c07/corpus

--2021-02-16 12:57:49--  https://gist.githubusercontent.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235/raw/1d2261e2276cbb0257a2ed6e2f1f4320464c7c07/corpus
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4507148 (4.3M) [text/plain]
Saving to: ‘dataset/corpus.txt’


2021-02-16 12:57:51 (22.8 MB/s) - ‘dataset/corpus.txt’ saved [4507148/4507148]




To use batch iterators such as ``BucketIterator`` over our dataset, we need to load our data into a ``Dataset`` class. With torchtext we commonly use ``TabularDataset``, which is a wrapper around the classical ``Dataset``. It is specifically designed to load csv, tsv or json files and process them using the ``Field`` objects.

Our dataset is a .txt file so we will load its content and put our data and labels into a dataframe. We then divide it into train, validation and test sets and save the results into csv files.



In [None]:
import os
import pandas as pd


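def load_data(filename):
  # read the whole file; each line has the form "<label> <text>"
  with open(filename, encoding='utf-8') as f:
    lines = f.read().split("\n")
  labels, texts = [], []

  for line in lines:
    if not line.strip():  # skip empty lines (e.g. after a trailing newline)
      continue
    content = line.split(' ', 1)
    labels.append(content[0])
    texts.append(content[1])

  return texts, labels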
  
dataset_dir = './dataset'
data_file = os.path.join(dataset_dir,'corpus.txt')

text_data, labels = load_data(data_file)


In [None]:
# building our dataframe

raw_data = {'Text' : text_data, 'Label': labels}
df = pd.DataFrame(raw_data, columns=["Text", "Label"])

df.head()

| | Text | Label |
|---|---|---|
| 0 | Stuning even for the non-gamer: This sound tra... | `__label__2` |
| 1 | The best soundtrack ever to anything.: I'm rea... | `__label__2` |
| 2 | Amazing!: This soundtrack is my favorite music... | `__label__2` |
| 3 | Excellent Soundtrack: I truly like this soundt... | `__label__2` |
| 4 | Remember, Pull Your Jaw Off The Floor After He... | `__label__2` |


In [None]:
from sklearn.model_selection import train_test_split

# splitting into train, val, test sets
# random.seed() returns None, so we pass the seed itself to get reproducible splits

train, test = train_test_split(df, test_size=0.33, random_state=SEED)
train, val = train_test_split(train, test_size=0.10, random_state=SEED)

print(f'Train size: {len(train)}')
print(f'Validation size: {len(val)}')
print(f'Test size: {len(test)}')


# save it to csv files 
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
val.to_csv("val.csv", index=False)

Train size: 6030
Validation size: 670
Test size: 3300


In [None]:
from torchtext.data import Field, BucketIterator, TabularDataset

# create a custom tokenizer function
def tokenizer(text): 
  doc = spacy_nlp(text)  # spaCy model loaded in Section 3.1
  # Remove stop words, punctuation symbols and non alphabetic characters
  tokens = [token.text.lower() for token in doc if not token.is_stop 
            and not token.is_punct
            and token.is_alpha] #keep only alphabetic characters
  return tokens

# here we use the built-in spaCy tokenizer; pass tokenize=tokenizer to use the custom function above
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()  # a LabelField is non-sequential by default

# order should match the columns order in our csv/tsv file
# a column that should be ignored is given as ('name', None)
data_fields = [('Text', TEXT), ('Label', LABEL)]

# We will load our csv files into Dataset objects 
train, val, test = data.TabularDataset.splits(
                                        path = './',
                                        train = 'train.csv',
                                        validation = 'val.csv',
                                        test = 'test.csv',
                                        format = 'csv',
                                        fields = data_fields,
                                        skip_header = True)

# dimensions available for the glove.6B embeddings
GLOVE_DIMS = [50, 100, 200, 300]

TEXT.build_vocab(train, max_size=25000, vectors=f"glove.6B.{GLOVE_DIMS[0]}d")
LABEL.build_vocab(train)
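
The `fields` list pairs each CSV column, in order, with the object that processes it. The pure-Python sketch below illustrates that idea with the standard `csv` module. It is not how `TabularDataset` is implemented: the in-memory CSV text, the `toy_fields` list and the use of plain functions in place of `Field` objects are assumptions made for illustration.

```python
import csv
import io

# a tiny in-memory stand-in for train.csv
csv_text = (
    "Text,Label\n"
    "Great soundtrack!,__label__2\n"
    "Too long and tedious.,__label__1\n"
)

# (column name, processing function) pairs, in the same order as the CSV columns;
# a None function means "ignore this column"
toy_fields = [("Text", str.split), ("Label", lambda s: s)]

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row, like skip_header=True

examples = [
    {name: fn(value) for (name, fn), value in zip(toy_fields, row) if fn is not None}
    for row in reader
]
print(examples[0])  # {'Text': ['Great', 'soundtrack!'], 'Label': '__label__2'}
```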


In [None]:
print(train[0].Text)
print(train[0].Label)

['I', 'was', 'extremely', 'disappointed', '!', ':', 'The', 'book', 'just', 'did', 'not', 'hold', 'my', 'interest', 'very', 'well', '.', 'It', 'was', 'long', 'and', 'tedious', ',', 'and', 'I', 'could', "n't", 'wait', 'for', 'it', 'to', 'be', 'over', '.', 'The', 'characters', 'were', 'not', 'believable', 'nor', 'were', 'they', 'erotic', '.', 'They', 'came', 'across', 'quite', 'clown', '-', 'ish', '.', 'Anne', 'Rice', 'is', 'an', 'excellent', 'writer', ',', 'but', 'from', 'this', 'novel', 'you', 'would', 'never', 'believe', 'it', '.']
__label__1


In [None]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f9937224488>, {'__label__1': 0, '__label__2': 1})


In [None]:
train_iter, val_iter, test_iter = data.BucketIterator.splits(
                                    (train, val, test), batch_sizes= (BATCH_SIZE, BATCH_SIZE, BATCH_SIZE),
                                    sort_key=lambda x: len(x.Text), device=DEVICE)

### 3.4 Using a Pipeline

We can define a pipeline that is applied after we have tokenised our documents. This is useful if we want to clearly separate tokenisation from the cleaning of our tokens.

In [None]:
# defining our pipelines
# a Pipeline applies its function to every element (token or label) separately

def clean_string(token):
  return token.replace(">","")

def convert_to_int(label, vocab=None):
  # postprocessing pipelines also receive the field's Vocab (None when use_vocab=False)
  return int(label)

preprocess_pipeline = data.Pipeline(clean_string)
preprocess_pipeline_label = data.Pipeline(convert_to_int)

text_field = data.Field(tokenize='spacy', lower=True, preprocessing=preprocess_pipeline)
label_field = data.Field(sequential=False, use_vocab=False, postprocessing = preprocess_pipeline_label)
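
The idea of chaining token-level cleaning steps can be illustrated without torchtext. The class below is a hypothetical stand-in written for this sketch (`ToyPipeline` is not part of any library); it applies each step to every token in turn.

```python
class ToyPipeline:
    """Hypothetical stand-in for a token-level pipeline."""

    def __init__(self, convert_token):
        self.steps = [convert_token]

    def add_after(self, convert_token):
        # append another step and return self so calls can be chained
        self.steps.append(convert_token)
        return self

    def __call__(self, tokens):
        # apply every step to every token, one step after the other
        for step in self.steps:
            tokens = [step(tok) for tok in tokens]
        return tokens


pipeline = ToyPipeline(lambda tok: tok.replace(">", "")).add_after(str.strip)
cleaned = pipeline(["great", "/><br", " film "])
print(cleaned)  # ['great', '/<br', 'film']
```

Keeping each step a small function of one token makes it easy to reorder or reuse the steps across fields.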

## 4. More Tutorials

* A tutorial for using torchtext for preprocessing can be found here: [Part 1](https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84) and [Part 2](https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496). You can find a deeper tutorial [here](http://anie.me/On-Torchtext/).

* Torchtext for machine translation [here](https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95)

* Pytorch example for using torchtext for BERT [here](https://github.com/pytorch/text/tree/master/examples/BERT)

* Other examples of using torchtext for Transformers: 
  * [Language model](https://ryanong.co.uk/2020/06/28/day-180-learning-pytorch-language-model-with-nn-transformer-and-torchtext-part-1/)
  * [Ben Trevett - Sentiment analysis](https://colab.research.google.com/github/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb)