# TorchText

In data science projects and applications, a large share of the time is often spent on data preprocessing. Natural Language Processing (NLP) is no different. Because text is a distinct data type, NLP has its own set of data cleaning and processing techniques that differ considerably from those used in other applications.

Torchtext is a package that helps with this. It consists of data processing utilities that make cleaning much simpler. The various steps are grouped into classes and functions, so all we have to do is find the right ones for our use case. However, it is still important to understand the reason behind each preprocessing step and its possible impact. Furthermore, this understanding is necessary when we select our hyperparameters. In addition, torchtext ships with popular NLP datasets, which provide an easy way to test our algorithms without having to search for a dataset or import one from a secondary source.

In this script, I will go through a number of functions that are expected to be commonly used, along with their hyperparameters and their outputs.

References:
torchtext - https://pytorch.org/text/

# Importing the data
In most cases, we will end up with data stored in a csv or txt file. We will begin with importing the file itself. I will be using the Toxic Comment Classification dataset to explore the different functions available. This dataset is saved as a csv file.

In [1]:
import pandas as pd

In [2]:
text = pd.read_csv('../../data/NLP/jigsaw-toxic-comment-classification-challenge/train.csv')

In [3]:
text.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
print(f"Columns: {', '.join(text.columns)}")
print(f"Number of Rows: {text.shape[0]}")

Columns: id, comment_text, toxic, severe_toxic, obscene, threat, insult, identity_hate
Number of Rows: 159571


# Fields - Centerpiece of torchtext

Before we continue importing our data, we first have to understand the Field classes. A Field can be seen as a preprocessing pipeline. It bundles most of the commonly used preprocessing steps, each controlled by an input argument. For example, eos_token sets the token that marks the end of a sentence, fix_length pads every example to a fixed length, lower converts the text to lower case, and so forth. With this, rather than writing long functions to transform the data, we can simply change a particular argument from False to True and torchtext does the work for us. Fields even convert our data into Tensors, which we can feed directly into our networks, assuming they are built in PyTorch. Fields are usually declared first, so it is ideal to have an idea of how we want to preprocess the text data beforehand. Otherwise, we can just change the settings along the way.

Torchtext provides several Field classes.
1. RawField - A general datatype (most customisable). This is the most primitive of the Field classes and does not assume any property of the data type. As such, it does not have any specific input arguments but instead takes a preprocessing and a postprocessing pipeline that you can define without any limitations. A likely use case is reading non-text data, such as a labels file.
2. Field - Used for common text processing datatypes. This is the standard preprocessing class and provides most of the common text processing steps.
3. ReversibleField - Field + the ability to map our word indices back to words.
4. NestedField - Takes an untokenised string or a list of string tokens and treats the group as one field. Basically, it is used for character embeddings, where we want to break words down into individual characters but still keep the information about which word each character belongs to.
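To make the NestedField idea more concrete, here is a minimal plain-Python sketch of a character-level view that still keeps word boundaries. It does not use torchtext, and the helper name `to_char_tokens` is hypothetical; it only illustrates the nested structure (a list of words, each a list of characters), not how torchtext implements it.

```python
def to_char_tokens(sentence):
    """Split a sentence into words, then split each word into characters."""
    words = sentence.lower().split()
    # one inner list per word, so the word boundaries are preserved
    return [list(word) for word in words]


chars = to_char_tokens("Sky is blue")
print(chars)
```

Each inner list corresponds to one word, so a character-level model can still tell where one word ends and the next begins.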

In [5]:
from torchtext.data import Field

In [6]:
tokeniser = lambda x : x.split() # typical split by spaces
prep_text = Field(sequential=True, tokenize = tokeniser, lower=True, batch_first=True, 
                  eos_token="<eos>", unk_token="<unk>")
prep_labels = Field(sequential=False, use_vocab=False, is_target=True)

In [41]:
# After preprocessing
prep_text.preprocess('Today, the sky is so blue.')

['today,', 'the', 'sky', 'is', 'so', 'blue.']

In [42]:
prep_labels.preprocess(1)

1
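Notice that the preprocessed sentence above does not contain `<eos>`. As far as I understand the legacy Field API, preprocess only tokenises and lowercases, while the end-of-sentence token and the padding are added later, when examples are grouped into a batch. The plain-Python sketch below imitates that later step. The helper `pad_batch` is hypothetical and is not torchtext's own code.

```python
def pad_batch(token_lists, eos_token="<eos>", pad_token="<pad>", fix_length=None):
    """Append eos to each sequence and pad all sequences to a common length."""
    if fix_length is None:
        # pad to the longest sequence in the batch
        body_len = max(len(tokens) for tokens in token_lists)
    else:
        # leave one slot for the eos token
        body_len = fix_length - 1
    padded = []
    for tokens in token_lists:
        row = tokens[:body_len] + [eos_token]
        row += [pad_token] * (body_len + 1 - len(row))
        padded.append(row)
    return padded


batch = pad_batch([["today,", "the", "sky"], ["hello"]])
fixed = pad_batch([["today,", "the", "sky"], ["hello"]], fix_length=3)
print(batch)
print(fixed)
```

With fix_length set, longer sentences are cut so that every row has exactly that many tokens including `<eos>`.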

# Datasets

In order to see our Fields in action, we need a dataset loaded in memory. The Dataset classes in torchtext do that for us: they load our files from a specified path.

1. TabularDataset - Used to import datasets stored in tabular formats such as csv, tsv or json.
2. LanguageModelingDataset - Used for datasets stored in .txt format. Mainly used for language models.
3. TranslationDataset - Used for datasets stored in files whose extension identifies the language, e.g. English (.en), French (.fr), etc.
4. SequenceTaggingDataset - Used for sequence tagging datasets whose columns are separated by tabs.

In [7]:
from torchtext.data import TabularDataset

In [8]:
# features in the dataset --> declare each column in a tuple such that ({name}, {preprocessing object})
toxic_fields = [
    ("id", None), # we do not need the id
    ("comment_text", prep_text), # we preprocess using the Field class defined earlier
    ("toxic", prep_labels), # remaining columns are part of our labels
    ("severe_toxic", prep_labels),
    ("obscene", prep_labels),
    ("threat", prep_labels),
    ("insult", prep_labels),
    ("identity_hate", prep_labels)
]

In [44]:
# create class and import dataset
toxic_dataset = TabularDataset(path="../../data/NLP/jigsaw-toxic-comment-classification-challenge/train.csv",
                               format="CSV",
                               fields=toxic_fields,
                               skip_header=True
                              )

Dataset classes in torchtext store the data under the 2 variables shown below.

In [10]:
# 2 variables
toxic_dataset.__dict__.keys()

dict_keys(['examples', 'fields'])

In [11]:
# fields variable stores our column names in a dictionary as defined earlier
toxic_dataset.fields

{'id': None,
 'comment_text': <torchtext.data.field.Field at 0x7f4121728210>,
 'toxic': <torchtext.data.field.Field at 0x7f4121728250>,
 'severe_toxic': <torchtext.data.field.Field at 0x7f4121728250>,
 'obscene': <torchtext.data.field.Field at 0x7f4121728250>,
 'threat': <torchtext.data.field.Field at 0x7f4121728250>,
 'insult': <torchtext.data.field.Field at 0x7f4121728250>,
 'identity_hate': <torchtext.data.field.Field at 0x7f4121728250>}

In [12]:
# examples stores each row in a list
toxic_dataset.examples[:5]

[<torchtext.data.example.Example at 0x7f4121728dd0>,
 <torchtext.data.example.Example at 0x7f4121728e50>,
 <torchtext.data.example.Example at 0x7f4121728e90>,
 <torchtext.data.example.Example at 0x7f4121728ed0>,
 <torchtext.data.example.Example at 0x7f4121728f50>]

In [36]:
# example object used to store our data
toxic_dataset.examples[0].__dict__

{'comment_text': ['explanation',
  'why',
  'the',
  'edits',
  'made',
  'under',
  'my',
  'username',
  'hardcore',
  'metallica',
  'fan',
  'were',
  'reverted?',
  'they',
  "weren't",
  'vandalisms,',
  'just',
  'closure',
  'on',
  'some',
  'gas',
  'after',
  'i',
  'voted',
  'at',
  'new',
  'york',
  'dolls',
  'fac.',
  'and',
  'please',
  "don't",
  'remove',
  'the',
  'template',
  'from',
  'the',
  'talk',
  'page',
  'since',
  "i'm",
  'retired',
  'now.89.205.38.27'],
 'toxic': '0',
 'severe_toxic': '0',
 'obscene': '0',
 'threat': '0',
 'insult': '0',
 'identity_hate': '0'}
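To illustrate what happens when a tabular file is loaded with a list of (name, field) pairs, here is a rough plain-Python sketch using the standard csv module. The function `load_tabular` and the inline sample row are hypothetical stand-ins and not torchtext code. The sketch only mirrors the behaviour seen above: columns paired with None are dropped, and each remaining column is passed through its own preprocessing.

```python
import csv
import io


def load_tabular(file_obj, fields, skip_header=True):
    """Read CSV rows and apply each column's preprocessing function.

    `fields` is a list of (name, function-or-None) pairs in column order.
    Columns paired with None are dropped.
    """
    reader = csv.reader(file_obj)
    if skip_header:
        next(reader)
    examples = []
    for row in reader:
        example = {}
        for (name, prep), value in zip(fields, row):
            if prep is not None:
                example[name] = prep(value)
        examples.append(example)
    return examples


# a tiny in-memory file standing in for train.csv
raw = io.StringIO('id,comment_text,toxic\nabc,"Hey man, Hello",0\n')
sample_fields = [
    ("id", None),                                   # dropped
    ("comment_text", lambda s: s.lower().split()),  # tokenise + lowercase
    ("toxic", lambda s: s),                         # label kept as read
]
examples = load_tabular(raw, sample_fields)
print(examples)
```

As in the Example object above, the label is still a string at this stage.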

## Splitting our data

Often, we will split our dataset into train, validation and test sets, depending on our approach. Torchtext lets us do so with the following function.

In [13]:
# train test split
train, val = toxic_dataset.split(split_ratio=0.7)

In [14]:
print(f"Total Dataset size: {len(toxic_dataset.examples)}")
print(f"Total Train size: {len(train.examples)}")
print(f"Total Validation size: {len(val.examples)}")

Total Dataset size: 159571
Total Train size: 111700
Total Validation size: 47871


In [15]:
# we can split 3 ways as well
# Note: the order of the split_ratio input differs from the order of the returned splits
# split returns the datasets in the following order --> train, validation, test
# split_ratio expects the ratios in the following order --> train, test, validation
train_size, val_size, test_size = 0.7, 0.1, 0.2
train, val, test = toxic_dataset.split(split_ratio=[train_size, test_size, val_size])

In [16]:
print(f"Total Dataset size: {len(toxic_dataset.examples)}")
print(f"Total Train size: {len(train.examples)}")
print(f"Total Validation size: {len(val.examples)}")
print(f"Total Test size: {len(test.examples)}")

Total Dataset size: 159571
Total Train size: 111700
Total Validation size: 15957
Total Test size: 31914
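The ordering quirk is easier to see in a small sketch. The plain-Python function below is a hypothetical helper, `split_indices`, and not torchtext's implementation. It takes the ratios in the order train, test, validation and returns the index lists in the order train, validation, test. I assume here that the train and test sizes are rounded and that the validation set receives the remainder.

```python
import random


def split_indices(n, split_ratio, seed=0):
    """split_ratio is [train, test, val]; returns (train, val, test) index lists."""
    train_ratio, test_ratio, _val_ratio = split_ratio
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_len = round(train_ratio * n)
    test_len = round(test_ratio * n)
    train = indices[:train_len]
    test = indices[train_len:train_len + test_len]
    val = indices[train_len + test_len:]  # whatever is left over
    return train, val, test


train_idx, val_idx, test_idx = split_indices(159571, [0.7, 0.2, 0.1])
print(len(train_idx), len(val_idx), len(test_idx))
```

With the dataset size used here, the three lengths come out as 111700, 15957 and 31914, matching the sizes printed above.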


Or, if you have the splits saved in different csv files, torchtext can load the data using the different file paths as well.

In [45]:
train, test = TabularDataset.splits(path="../../data/NLP/jigsaw-toxic-comment-classification-challenge",
                                    train="train.csv", test="test.csv", # file names of the splits; a validation argument is available as well
                                    format="CSV", fields=toxic_fields, skip_header=True)

In [46]:
print(type(train))
print(type(test))

<class 'torchtext.data.dataset.TabularDataset'>
<class 'torchtext.data.dataset.TabularDataset'>


In [47]:
print(f"Total Train size: {len(train.examples)}")
print(f"Total Test size: {len(test.examples)}")

Total Train size: 159571
Total Test size: 153164


# Fields Functions
Now that we have initialised our Datasets, we can look into some of the Fields' functions.

In [20]:
# build our vocab list
# one of the first things we want to do is to build our vocabulary list
# we will want to do this on our train set
prep_text.build_vocab(train.comment_text)

In [21]:
# the Field object stores the vocab list in a torchtext object called Vocab under a variable called vocab
prep_text.vocab

<torchtext.vocab.Vocab at 0x7f409b9cf210>

In [22]:
prep_text.vocab.stoi

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7f409b9cf210>>,
            {'<unk>': 0,
             '<pad>': 1,
             '<eos>': 2,
             'the': 3,
             'to': 4,
             'of': 5,
             'and': 6,
             'a': 7,
             'i': 8,
             'you': 9,
             'is': 10,
             'that': 11,
             'in': 12,
             'it': 13,
             'for': 14,
             'not': 15,
             'this': 16,
             'on': 17,
             'be': 18,
             '"': 19,
             'as': 20,
             'have': 21,
             'are': 22,
             'your': 23,
             'with': 24,
             'if': 25,
             'was': 26,
             'or': 27,
             'but': 28,
             'my': 29,
             'an': 30,
             'by': 31,
             'from': 32,
             'article': 33,
             'at': 34,
             'do': 35,
             'about': 36,
             'can': 
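The mapping above places the special tokens first and then the words, with the most frequent words receiving the smallest indices. A minimal plain-Python sketch of that idea is shown below. The helpers `build_vocab` and `numericalise` are hypothetical, and the alphabetical tie-break is my own assumption; this is not torchtext's Vocab class.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>", "<eos>")):
    """Return (stoi, itos): specials first, then words by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # most frequent first; ties are broken alphabetically (an assumption)
    ordered = sorted(
        (tok for tok in counts if tok not in specials),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalise(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices; unseen words fall back to the unk index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


corpus = [["the", "sky", "is", "blue"], ["the", "sea", "is", "the", "sea"]]
stoi, itos = build_vocab(corpus)
ids = numericalise(["the", "sky", "is", "green"], stoi)
print(itos)
print(ids)
```

The word "green" never appears in the toy corpus, so it is mapped to index 0, the `<unk>` token. This is also why the vocabulary should be built on the train set only: words that appear only in the validation or test data are then treated as unknown, just as they would be in practice.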

# Iterators

An Iterator loads batches of data from a Dataset. Iterators help us mini-batch our dataset using techniques specific to NLP, such as batching sentences of similar length together to minimise padding.

1. Iterator - Simply iterates over the entire dataset, either in order or as configured by its hyperparameters.
2. BucketIterator - Iterator + batches examples of similar lengths together to minimise padding.
3. BPTTIterator - An iterator for language modeling tasks that use backpropagation through time (BPTT). It provides examples with targets that are one timestep further forward.
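To see why batching similar lengths together reduces padding, here is a small plain-Python sketch. The helpers `make_batches` and `count_padding` are hypothetical, and the real BucketIterator is more elaborate. The sketch simply sorts the sequences by length before cutting them into batches, then compares the amount of padding with and without sorting. The padding index 1 matches the `<pad>` index in the vocabulary above.

```python
def make_batches(sequences, batch_size, pad_value=1, sort_by_length=True):
    """Cut sequences into batches and pad each batch to its longest member."""
    ordered = sorted(sequences, key=len) if sort_by_length else list(sequences)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_value] * (width - len(seq)) for seq in chunk])
    return batches


def count_padding(batches, pad_value=1):
    """Count how many padding cells were added across all batches."""
    return sum(row.count(pad_value) for batch in batches for row in batch)


# toy token-id sequences (none of them uses the padding index 1)
sequences = [[5, 6, 7, 8, 9, 10], [3], [4, 5], [6, 7, 8, 9, 10], [2, 3], [9]]
bucketed = make_batches(sequences, batch_size=2)
naive = make_batches(sequences, batch_size=2, sort_by_length=False)
print(count_padding(bucketed), count_padding(naive))
```

On this toy input, the length-sorted batches need 1 padding cell while the unsorted batches need 9.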

In [43]:
from torchtext.data import BucketIterator

In [49]:
# train dataset iterator
train_iter = BucketIterator(dataset=train, 
                            batch_size=64,
                            sort_key=lambda x: len(x.comment_text), # by the length of our comments
                            train=True,
                            sort=True,
                            device='cuda' # set whether to store the tensors on gpu or cpu
                           )

In [50]:
# test dataset iterator
test_iter = BucketIterator(dataset=test, 
                           batch_size=64,
                           sort_key=lambda x: len(x.comment_text), # by the length of our comments
                           train=False,
                           sort=True,
                           device='cuda' # set whether to store the tensors on gpu or cpu
                          )

In [58]:
# example of a batch
sample_batch = next(enumerate(train_iter))

In [59]:
sample_batch

(0, 
 [torchtext.data.batch.Batch of size 64]
 	[.comment_text]:[torch.cuda.LongTensor of size 64x3 (GPU 0)]
 	[.toxic]:[torch.cuda.LongTensor of size 64 (GPU 0)]
 	[.severe_toxic]:[torch.cuda.LongTensor of size 64 (GPU 0)]
 	[.obscene]:[torch.cuda.LongTensor of size 64 (GPU 0)]
 	[.threat]:[torch.cuda.LongTensor of size 64 (GPU 0)]
 	[.insult]:[torch.cuda.LongTensor of size 64 (GPU 0)]
 	[.identity_hate]:[torch.cuda.LongTensor of size 64 (GPU 0)])

In [64]:
# observe how all the sentences are of similar length
sample_batch[1].comment_text

tensor([[ 16443, 113053,      2],
        [ 13732,  37329,      2],
        [    77, 321641,      2],
        [   648, 278134,      2],
        [  8581, 140914,      2],
        [ 61261,  44571,      2],
        [    57,  66469,      2],
        [161173,  27287,      2],
        [ 11240, 406048,      2],
        [  5149,  14142,      2],
        [   457,  46067,      2],
        [304063,  57161,      2],
        [  2305, 359064,      2],
        [  5928,  57161,      2],
        [ 76051,    407,      2],
        [     9, 456340,      2],
        [ 47658,  40468,      2],
        [    69, 320830,      2],
        [   345, 271389,      2],
        [  2344,  73992,      2],
        [  4998,  57113,      2],
        [  7962, 112180,      2],
        [  4716, 309280,      2],
        [289276,  10401,      2],
        [  1971, 278381,      2],
        [  1123,   2304,      2],
        [  3899, 221009,      2],
        [  1101, 264264,      2],
        [104247,  32200,      2],
        [   64

In [67]:
# toxic field
sample_batch[1].toxic

tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')

In [70]:
# another way to access the variables
getattr(sample_batch[1], 'comment_text')

tensor([[ 16443, 113053,      2],
        [ 13732,  37329,      2],
        [    77, 321641,      2],
        [   648, 278134,      2],
        [  8581, 140914,      2],
        [ 61261,  44571,      2],
        [    57,  66469,      2],
        [161173,  27287,      2],
        [ 11240, 406048,      2],
        [  5149,  14142,      2],
        [   457,  46067,      2],
        [304063,  57161,      2],
        [  2305, 359064,      2],
        [  5928,  57161,      2],
        [ 76051,    407,      2],
        [     9, 456340,      2],
        [ 47658,  40468,      2],
        [    69, 320830,      2],
        [   345, 271389,      2],
        [  2344,  73992,      2],
        [  4998,  57113,      2],
        [  7962, 112180,      2],
        [  4716, 309280,      2],
        [289276,  10401,      2],
        [  1971, 278381,      2],
        [  1123,   2304,      2],
        [  3899, 221009,      2],
        [  1101, 264264,      2],
        [104247,  32200,      2],
        [   64