In [3]:
import pandas as pd
import pathlib

In [10]:
path  = pathlib.Path('../../data/')
comp  = pathlib.Path('competitions/jigsaw-toxic-comment-classification-challenge/')
TRAIN = pathlib.Path(path/comp/'train.csv')
TEST  = pathlib.Path(path/comp/'test.csv')

## 1. Overview

1. Read data from disk
2. Tokenize text
3. Create word-unique-integer mappings
4. Convert text to list of integers
5. Load data into the format required by the deep learning framework
6. Pad text so all sequences have the same length, which enables batch processing
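
To make the six steps concrete, here is a minimal pure-Python sketch of the same pipeline on a toy corpus. It does not use torchtext; the sample sentences, the `<pad>`/`<unk>` token names and the choice of indices 0 and 1 for them are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus standing in for the comment text (hypothetical data).
texts = ["You are great", "you are not", "Great"]

# Steps 1-2: "read" the data and tokenize it (lowercase + whitespace split).
tokenized = [t.lower().split() for t in texts]

# Step 3: map each unique word to an integer. Indices 0 and 1 are reserved
# for padding and unknown words (an assumption of this sketch).
counts = Counter(tok for sent in tokenized for tok in sent)
itos = ["<pad>", "<unk>"] + sorted(counts, key=lambda w: (-counts[w], w))
stoi = {w: i for i, w in enumerate(itos)}

# Step 4: convert each token list to a list of integers.
ids = [[stoi.get(tok, stoi["<unk>"]) for tok in sent] for sent in tokenized]

# Steps 5-6: pad every sequence to the longest length so the batch is
# rectangular and could be handed to a framework as a single tensor.
max_len = max(len(seq) for seq in ids)
batch = [seq + [stoi["<pad>"]] * (max_len - len(seq)) for seq in ids]

print(batch)  # [[4, 2, 3], [4, 2, 5], [3, 0, 0]]
```

The rest of this notebook lets torchtext do each of these steps for us.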

Torchtext follows the basic formula for transforming data into working input for your neural network:

<img src="https://i0.wp.com/mlexplained.com/wp-content/uploads/2018/02/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-02-07-10.32.59.png?w=1500"/>

## 2. Declaring Fields

Torchtext takes a declarative approach to loading its data: you tell torchtext how you want the data to look, and torchtext handles it for you.

The way you do this is by declaring a Field. The Field specifies how you want a certain (you guessed it) field to be processed. Let's look at an example:

In [26]:
from torchtext.data import Field

tokenize = lambda x : x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

LABEL = Field(sequential=False, use_vocab=False)

In the Toxic Comment Classification dataset there are 2 kinds of fields: the comment text and the labels (toxic, severe toxic, etc.).

In [11]:
pd.read_csv(TRAIN).head(2)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0


If you're passing a field that's already numericalized by default and not sequential, you should pass `use_vocab=False` and `sequential=False`.

For the comment text, we pass in the preprocessing we want the field to do as keyword arguments. We give it the tokenizer we want the field to use, tell it to convert the input to lowercase, and also tell it the input is sequential.

In addition to the keyword arguments mentioned above, the Field class also lets the user specify special tokens (the `unk_token` for out-of-vocabulary words, the `pad_token` for padding, the `eos_token` for end-of-sentence, and an optional `init_token` for the start of a sentence), choose whether the first dimension is the batch or the sequence (the first dimension is the sequence by default), and choose whether sequence lengths are decided at runtime or fixed in advance. Fortunately, [the docstrings](https://github.com/pytorch/text/blob/c839a7934930819be7e240ea972e4d600966afdc/torchtext/data/field.py#L61) for the **Field** class are relatively well written, so if you need some advanced preprocessing you should refer to them for more information.
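
As a rough picture of what these options amount to, the sketch below applies start, end and padding tokens to a token list in plain Python, then flips a batch between batch-first and sequence-first layout. The helper `apply_special_tokens` and the token strings are hypothetical; this is not torchtext's implementation.

```python
def apply_special_tokens(tokens, init_token=None, eos_token=None,
                         pad_token="<pad>", fix_length=None):
    """Add optional start/end markers, then pad or truncate to fix_length."""
    n_specials = (init_token is not None) + (eos_token is not None)
    if fix_length is not None:
        # Truncate the raw tokens first so the markers are never cut off.
        tokens = tokens[:max(0, fix_length - n_specials)]
    out = list(tokens)
    if init_token is not None:
        out = [init_token] + out
    if eos_token is not None:
        out = out + [eos_token]
    if fix_length is not None:
        out += [pad_token] * (fix_length - len(out))
    return out

tokens = "thanks for the edit".split()

# Length fixed in advance: short inputs are padded, long ones truncated.
padded = apply_special_tokens(tokens, "<sos>", "<eos>", fix_length=8)
truncated = apply_special_tokens(tokens, "<sos>", "<eos>", fix_length=4)
print(padded)
print(truncated)

# Batch-first layout is one row per example; sequence-first is its transpose.
batch_first = [padded, apply_special_tokens(["ok"], "<sos>", "<eos>", fix_length=8)]
seq_first = list(zip(*batch_first))
print(len(batch_first), len(seq_first))
```

When the length is decided at runtime instead, the padding target is simply the longest example in the current batch.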

The **Field** class is at the center of torchtext and is what makes preprocessing so easy. Aside from the standard Field class, here's a list of the fields that are currently available (along with their use cases):

|Name | Description | Use Case|
|-----|-------------|---------|
|Field|A regular field that defines preprocessing and post processing|Non-text fields and text fields where you don't need to map integers back to words.|
|ReversibleField|An extension of the field that allows reverse mapping of word ids to words|Text fields if you want to map the integers back to natural language (such as in the case of language modeling)|
|NestedField|A field that processes non-tokenized text into a set of smaller fields|Char-based models|
|LabelField (New!)|A regular field with `sequential=False` and no `<unk>` token. Newly added on the master branch.|Label fields in text classification|

## 3. Constructing the Dataset

The fields know what to do when given raw data. Now we need to tell the fields what data they should work on. This is where we use Datasets.

There are various built-in Datasets in torchtext that handle common data formats. For CSV/TSV files the **`TabularDataset`** class is convenient. Here's how we'd read data from a CSV file using `TabularDataset`:

In [29]:
from torchtext.data import TabularDataset

# Fields are matched to CSV columns by position, so this list follows the column order of train.csv.
tv_datafields = [("id", None), # we won't be needing the id, so we pass None as the field
                 ("comment_text", TEXT), ("toxic", LABEL),
                 ("severe_toxic", LABEL), ("obscene", LABEL),
                 ("threat", LABEL), ("insult", LABEL), ("identity_hate", LABEL)]
trn, vld = TabularDataset.splits(
                path=path/comp, # the root directory where the data lies
                train='train.csv', validation='train.csv', # NOTE: validation reuses train.csv here; point it at a held-out file if you have one
                format='csv',
                skip_header=True, # if your csv has a header, make sure to pass this to ensure it doesn't get processed as data!
                fields=tv_datafields)
tst_datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
                  ("comment_text", TEXT)]
tst = TabularDataset(
            path=TEST, # the file path of the test set
            format='csv',
            skip_header=True,
            fields=tst_datafields)

For the `TabularDataset`, we pass in a list of (name, field) pairs as the fields argument. The fields we pass in must be in the same order as the columns. For the columns we don't use, we pass in a tuple where the field element is None.
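
The positional matching described above can be sketched with the standard `csv` module. The two-row CSV below is made up, and plain strings stand in for the `Field` objects; the point is only that the i-th (name, field) pair is applied to the i-th column and the header is never consulted.

```python
import csv
import io

# Hypothetical two-row CSV with the same column layout as train.csv.
raw = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "a1,hello there,0,0,1,0,0,0\n"
    "b2,go away,1,0,0,1,1,0\n"
)

# (name, field) pairs; None means "skip this column".
# Strings stand in for Field objects in this sketch.
fields = [("id", None), ("comment_text", "TEXT"), ("toxic", "LABEL"),
          ("severe_toxic", "LABEL"), ("obscene", "LABEL"), ("threat", "LABEL"),
          ("insult", "LABEL"), ("identity_hate", "LABEL")]

reader = csv.reader(raw)
next(reader)  # skip the header row, which is what skip_header=True asks for

examples = []
for row in reader:
    # Pair the i-th column with the i-th (name, field) entry, by position only.
    example = {name: value
               for (name, field), value in zip(fields, row)
               if field is not None}
    examples.append(example)

print(examples[0])
```

Because names are attached by position, swapping two entries in the list would silently attach the wrong name to a column rather than raise an error.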

The `splits` method creates a dataset for the train and validation data by applying the same processing to both. It can also handle the test data, but since our test data has a different format from the train and validation data, we create a separate dataset for it.

Datasets can mostly be treated like lists: they can be indexed and iterated over. To see this, let's take a look at the first element of our Dataset:

In [30]:
trn[0]

<torchtext.data.example.Example at 0x7f76ab2ba630>

In [32]:
trn[1].__dict__.keys()

dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

In [35]:
trn[0].comment_text[:3]

['explanation']

In [34]:
trn[1].comment_text[:3]

['just', 'closure', 'on']

Torchtext handles mapping words to integers, but it has to be told the full range of words it should handle. In our case, we probably want to build the vocabulary on the training set only, so we run the following code: `TEXT.build_vocab(trn)`

This makes torchtext go through all the elements in the training set, check the contents corresponding to the `TEXT` field, and add the words it finds to its vocabulary.
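
To illustrate why the vocabulary is built on the training set only, here is a small stand-alone sketch. The `build_vocab` function, the `min_freq` parameter and the order of the special tokens are assumptions of this example, not torchtext's code. Words that appear only in the validation data fall back to the unknown-token index.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) with the most frequent words given the lowest indices."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = sorted((w for w, c in counts.items() if c >= min_freq),
                   key=lambda w: (-counts[w], w))
    itos = list(specials) + words
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

train_tokens = [["just", "closure", "on"], ["just", "a", "comment"]]
valid_tokens = [["just", "another", "comment"]]

# The vocabulary only ever sees the training tokens.
itos, stoi = build_vocab(train_tokens)

# "another" was never seen in training, so it maps to <unk> (index 0).
unk = stoi["<unk>"]
valid_ids = [[stoi.get(tok, unk) for tok in toks] for toks in valid_tokens]
print(valid_ids)  # [[2, 0, 5]]
```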

---

List of currently available datasets and the format of data they take:

|Name|Description|Use Case|
|-|-|-|
|`TabularDataset`|Takes the path to CSV/TSV and JSON files or Python dictionaries as inputs.|Any problem that involves a label (or labels) for each piece of text.|
|`LanguageModelingDataset`|Takes the path to a text file.|Language modeling|
|`TranslationDataset`|Takes a path and extensions to a file for each language, e.g. if the files are English: "hoge.en" and French: "hoge.fr", then path="hoge", exts=(".en",".fr")|Translation|
|`SequenceTaggingDataset`|Takes a path to a file with the input sequence and output sequence separated by tabs.|Sequence tagging.|

Now that we have our data formatted and read into memory, we turn to the next step: creating an iterator to pass the data to our model.

## 4. Constructing the Iterator

In torchvision and PyTorch, the processing and batching of data is handled by DataLoaders. For some reason torchtext has renamed the objects that do the exact same thing to Iterators. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP.

Below is the code for initializing the Iterators for the train, validation, and test data:

In [37]:
from torchtext.data import Iterator, BucketIterator

train_iter, val_iter = BucketIterator.splits(
    (trn, vld), # we pass in the datasets we want the iterator to draw data from
    batch_sizes=(64,64),
    device=0, # if you want to use the GPU, specify GPU number here
    sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False,
    repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)
test_iter = Iterator(tst, batch_size=64, device=0, sort=False, sort_within_batch=False, repeat=False)


***NOTE***: When set to True, the `sort_within_batch` argument sorts the data within each minibatch in decreasing order according to the `sort_key`. This is necessary when you want to use `pack_padded_sequence` on the padded sequence data to convert the padded sequence tensor to a `PackedSequence` object.

The `BucketIterator` is one of the most powerful features of torchtext. It automatically shuffles and buckets the input sequences into sequences of similar length.

The reason this is powerful is that we need to pad the input sequences to be of the same length to enable batch processing. For instance, the sequences:
```
[ [3, 15, 2, 7], 
  [4, 1], 
  [5, 5, 6, 8, 1] ]
```
would need to be padded to become:
```
[ [3, 15, 2, 7, 0],
  [4, 1, 0, 0, 0],
  [5, 5, 6, 8, 1] ]
```

The amount of padding necessary is determined by the longest sequence in the batch. Therefore, padding is most efficient when the sequences are of similar lengths. The BucketIterator does all this behind the scenes. As a word of caution, you need to tell the BucketIterator what attribute you want to bucket the data on. In our case, we want to bucket based on the lengths of the comment_text field, so we pass that in as a keyword argument.
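
The effect of grouping similar lengths can be checked with a few lines of plain Python. The sketch below counts the pad tokens needed when batches are cut from an arbitrary ordering versus a length-sorted one. The sequence lengths are randomly generated stand-ins, and a full sort is used as the idealized case; the real `BucketIterator` also shuffles, so it only approximates this.

```python
import random

def padding_needed(lengths, batch_size):
    """Total pad tokens if consecutive chunks of `lengths` form the batches."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        longest = max(chunk)
        total += sum(longest - n for n in chunk)
    return total

# The three-sequence example above (lengths 4, 2, 5) needs 4 pad tokens.
print(padding_needed([4, 2, 5], batch_size=3))  # 4

random.seed(0)
lengths = [random.randint(1, 200) for _ in range(1024)]  # made-up lengths

unsorted_pad = padding_needed(lengths, batch_size=64)
bucketed_pad = padding_needed(sorted(lengths), batch_size=64)
print(unsorted_pad, bucketed_pad)
```

With the sorted ordering every batch holds sequences of nearly equal length, so far fewer positions are wasted on padding.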

For the test data, we don't want to shuffle the data since we'll be outputting the predictions at the end of training. This is why we use a standard iterator.

---

List of iterators that torchtext currently implements:

|Name|Description|Use Case|
|-|-|-|
|`Iterator`|Iterates over the data in the order of the dataset.|Test data, or any other data where the order is important.|
|`BucketIterator`|Buckets sequences of similar lengths together.|Text classification, sequence tagging, etc. (use cases where the input is of variable length)|
|`BPTTIterator`|An iterator built especially for language modeling that also generates the input sequence delayed by one timestep. It also varies the BPTT length.|Language modeling|
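
The "delayed by one timestep" behaviour in the `BPTTIterator` row can be pictured as follows. This is a simplified sketch over a flat list of token ids with a fixed window length; the function `bptt_batches` is hypothetical and, unlike the real iterator, does not vary the BPTT length.

```python
def bptt_batches(token_ids, bptt_len):
    """Yield (input, target) windows where target is input shifted by one step."""
    last_start = len(token_ids) - 1  # the final token has no next token to predict
    for i in range(0, last_start, bptt_len):
        seq_len = min(bptt_len, last_start - i)
        yield token_ids[i:i + seq_len], token_ids[i + 1:i + 1 + seq_len]

stream = [10, 11, 12, 13, 14, 15, 16]
pairs = list(bptt_batches(stream, bptt_len=3))
for x, y in pairs:
    print(x, y)
# [10, 11, 12] [11, 12, 13]
# [13, 14, 15] [14, 15, 16]
```

Each target is the token that follows the corresponding input token, which is exactly what a language model is trained to predict.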

## 5. Wrapping the Iterator

Currently, the iterator returns a custom datatype called `torchtext.data.Batch`. The **`Batch`** class has a similar API to the `Example` type, with a batch of data from each field as attributes. Unfortunately, this custom datatype makes code reuse difficult (since each time the column names change, we need to modify the code), and makes torchtext hard to use with other libraries for some use cases (like torchsample and fastai).

In the meantime we'll hack together a simple wrapper to make the batches easy to use. Concretely, we'll convert the batch to a tuple of the form (x, y), where x is the independent variable (input) and y is the dependent variable (labels). Code:

In [None]:
import torch

class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        # dl: a torchtext iterator; x_var: name of the input field; y_vars: list of label field names (or None)
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars

    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper

            if self.y_vars is not None:
                # stack the label columns into one float tensor of shape (batch_size, n_labels)
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                # no labels (e.g. test data): yield a placeholder tensor
                y = torch.zeros((1))
            yield (x, y)

    def __len__(self):
        return len(self.dl)

train_dl = BatchWrapper(train_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)