In [27]:
import torch
from torchtext import data
from torchtext import datasets
import random


SEED = 1234

# Seed PyTorch's RNG and force deterministic cuDNN kernels for reproducibility
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Field
One of the main concepts of TorchText is the `Field`. A `Field` defines how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either "pos" or "neg".

The parameters of a `Field` specify how the data should be processed. We use the `TEXT` field to define how the review should be processed, and the `LABEL` field to process the sentiment. Our `TEXT` field has `tokenize='spacy'` as an argument. This means that tokenization (the act of splitting the string into discrete "tokens") is done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply to split the string on spaces.
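To make the difference concrete, the sketch below contrasts plain whitespace splitting (the default described above) with a tokenizer that separates punctuation from words. The function `simple_tokenize` is a hypothetical stand-in written for this illustration; it is not spaCy's algorithm, which applies far richer rules.

```python
import re

review = "This film is great, and I love it!"

# Default behaviour when no `tokenize` argument is given: split on whitespace.
# Punctuation stays glued to the neighbouring word.
whitespace_tokens = review.split()

def simple_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: split words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

regex_tokens = simple_tokenize(review)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `great,` and `great` would become two different vocabulary entries, which is one reason a proper tokenizer is usually preferred.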

`LABEL` is defined by a `LabelField`, a special subclass of the `Field` class specifically used for handling labels.

We pass `dtype=torch.float` to `LABEL` because TorchText creates `LongTensor`s by default, whereas our criterion expects both of its inputs to be `FloatTensor`s. Setting `dtype` to `torch.float` does the conversion for us. The alternative is to do the conversion inside the train function by passing `batch.label.float()` instead of `batch.label` to the criterion.

All fields, by default, expect a sequence of words to come in, and they expect to build a mapping from the words to integers. If the data for a field is already numerical and is not sequential, you should pass `use_vocab=False` and `sequential=False`.

In addition to the keyword arguments mentioned above, the `Field` class also allows the user to specify special tokens (the `unk_token` for out-of-vocabulary words, the `pad_token` for padding, the `eos_token` for the end of a sentence, and an optional `init_token` for the start of the sentence), to choose whether the first dimension is the batch or the sequence (the sequence by default), and to choose whether sequence lengths are decided at runtime or fixed in advance.
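The effect of these options is easiest to see on a tiny batch. The following is a minimal sketch in plain Python, not TorchText's own code: `pad_batch` is a hypothetical helper, and the token strings `<sos>` and `<eos>` are illustrative choices rather than library defaults.

```python
def pad_batch(batch, pad_token="<pad>", init_token=None, eos_token=None,
              fix_length=None, pad_first=False):
    """Pad tokenized examples to a common length (illustrative sketch)."""
    # Number of slots taken up by the optional start/end tokens
    extra = (init_token is not None) + (eos_token is not None)
    if fix_length is None:
        max_len = max(len(ex) for ex in batch)  # length decided at runtime
    else:
        max_len = fix_length - extra            # length fixed in advance
    padded = []
    for ex in batch:
        ex = list(ex[:max_len])                 # truncate overly long examples
        body = ([init_token] if init_token is not None else []) + ex \
            + ([eos_token] if eos_token is not None else [])
        padding = [pad_token] * (max_len - len(ex))
        padded.append(padding + body if pad_first else body + padding)
    return padded

batch = [["a", "great", "film"], ["awful"]]

# One row per example: the layout produced with batch_first=True
batch_first = pad_batch(batch, init_token="<sos>", eos_token="<eos>")

# Transposing gives one row per time step: the sequence-first default layout
seq_first = [list(step) for step in zip(*batch_first)]

for row in batch_first:
    print(row)
```

Every example ends up with the same length, so the batch can be stored as a single rectangular tensor once the tokens are replaced by integers.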

For more on `Fields`, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

The first cell also sets the random seed (`SEED`) and makes cuDNN deterministic, for reproducibility.

```
    Field Attributes:
        sequential: Whether the datatype represents sequential data. If False,
            no tokenization is applied. Default: True.
        use_vocab: Whether to use a Vocab object. If False, the data in this
            field should already be numerical. Default: True.
        init_token: A token that will be prepended to every example using this
            field, or None for no initial token. Default: None.
        eos_token: A token that will be appended to every example using this
            field, or None for no end-of-sentence token. Default: None.
        fix_length: A fixed length that all examples using this field will be
            padded to, or None for flexible sequence lengths. Default: None.
        dtype: The torch.dtype class that represents a batch of examples
            of this kind of data. Default: torch.long.
        preprocessing: The Pipeline that will be applied to examples
            using this field after tokenizing but before numericalizing. Many
            Datasets replace this attribute with a custom preprocessor.
            Default: None.
        postprocessing: A Pipeline that will be applied to examples using
            this field after numericalizing but before the numbers are turned
            into a Tensor. The pipeline function takes the batch as a list,
            the field's Vocab, and train (a bool).
            Default: None.
        lower: Whether to lowercase the text in this field. Default: False.
        tokenize: The function used to tokenize strings using this field into
            sequential examples. If "spacy", the SpaCy English tokenizer is
            used. Default: str.split.
        include_lengths: Whether to return a tuple of a padded minibatch and
            a list containing the lengths of each examples, or just a padded
            minibatch. Default: False.
        batch_first: Whether to produce tensors with the batch dimension first.
            Default: False.
        pad_token: The string token used as padding. Default: "<pad>".
        unk_token: The string token used to represent OOV words. Default: "<unk>".
        pad_first: Do the padding of the sequence at the beginning. Default: False.
```

In [28]:
TEXT = data.Field(tokenize = 'spacy', batch_first=True)
LABEL = data.LabelField(sequential=False, dtype = torch.float, batch_first=True)

### Dataset
Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP). 

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It processes the data using the `Fields` we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

The IMDb dataset only provides train and test splits, so we create a validation set from the training data with the `.split()` method. By default this splits 70/30; passing a `split_ratio` argument changes the ratio, e.g. a `split_ratio` of 0.8 means 80% of the examples make up the training set and 20% make up the validation set.

In [29]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [30]:
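# Seed Python's RNG, then hand its state to split().
# random.seed() returns None, so its return value is not a usable random_state.
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate(), split_ratio=0.8)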

In [31]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 20000
Number of validation examples: 5000
Number of testing examples: 25000
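
The 80/20 split above can be mimicked in plain Python to show where the 20,000 and 5,000 figures come from. This is a rough sketch under stated assumptions: `split_examples` is a hypothetical helper, integers stand in for the reviews, and TorchText's own shuffling will not produce the same ordering.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of the examples and cut it at split_ratio (sketch)."""
    rng = random.Random(seed)       # private RNG so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(round(split_ratio * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

examples = list(range(25000))       # stand-ins for the 25,000 training reviews
train_part, valid_part = split_examples(examples, split_ratio=0.8)

print(len(train_part), len(valid_part))
```

Using a fixed seed means the same examples land in the validation set on every run, which keeps validation scores comparable between experiments.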


### Numericalize
Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in your dataset has a corresponding index (an integer). We do this because our machine learning model cannot operate on strings, only numbers. What do we do with words that appear in examples but have been cut from the vocabulary? We replace them with a special unknown or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I <unk> it".

The following builds the vocabulary, keeping only the `max_size` most common tokens. The resulting vocabulary has `max_size + 2` entries, because the `<unk>` and `<pad>` tokens are added as well.

In [32]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [33]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


In [34]:
print(TEXT.vocab.freqs.most_common(20))
print(" ")
print(TEXT.vocab.itos[:10])
print(" ")
print(LABEL.vocab.stoi)

[('the', 232441), (',', 221500), ('.', 189857), ('and', 125689), ('a', 125534), ('of', 115489), ('to', 107557), ('is', 87581), ('in', 70216), ('I', 62190), ('it', 61399), ('that', 56406), ('"', 50836), ("'s", 49934), ('this', 48383), ('-', 42634), ('/><br', 40664), ('was', 39911), ('as', 34825), ('with', 34402)]
 
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']
 
defaultdict(None, {'neg': 0, 'pos': 1})
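
The vocabulary building and `<unk>` replacement described in this section can be sketched with a `Counter`. The helpers `build_vocab` and `numericalize` and the three-sentence corpus are hypothetical, chosen so that "love" falls outside the vocabulary; TorchText's `Vocab` class is more elaborate.

```python
from collections import Counter

def build_vocab(tokenized_examples, max_size=None, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) keeping the max_size most frequent tokens (sketch)."""
    freqs = Counter(tok for ex in tokenized_examples for tok in ex)
    # Most frequent first; ties broken alphabetically so the result is deterministic
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices, sending unknown tokens to the <unk> index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

corpus = [
    "This film is great".split(),
    "This film is great and I like it".split(),
    "I love this film".split(),
]
itos, stoi = build_vocab(corpus, max_size=7)

sentence = "This film is great and I love it".split()
ids = numericalize(sentence, stoi)
decoded = [itos[i] for i in ids]

print(len(itos))
print(ids)
print(decoded)
```

As with the real vocabulary above, the special tokens occupy indices 0 and 1, and the vocabulary size is `max_size` plus the two specials.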


### DataIterator
The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration. We'll use a `BucketIterator`, a special type of iterator that returns batches in which the examples are of similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`; we then pass this device to the iterator.

In torchvision and PyTorch, the processing and batching of data is handled by DataLoaders. torchtext calls the objects that do the same job Iterators. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP. The `BucketIterator` is one of the most powerful features of torchtext: it automatically shuffles the input sequences and buckets them into batches of similar length.


| Name | Description | Use Case |
| --- | --- | --- |
| Iterator | Iterates over the data in the order of the dataset. | Test data, or any other data where the order is important. |
| BucketIterator | Buckets sequences of similar lengths together. | Text classification, sequence tagging, etc. (use cases where the input is of variable length) |
| BPTTIterator | An iterator built especially for language modeling that also generates the input sequence delayed by one timestep. It also varies the BPTT (backpropagation through time) length. This iterator deserves its own post, so I'll omit the details here. | Language modeling |
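
To see why bucketing reduces padding, the sketch below compares batches formed in dataset order with batches formed after sorting by length. It is a simplification under stated assumptions: the sequence lengths are randomly generated stand-ins for review lengths, `make_batches` and `count_padding` are hypothetical helpers, and the real `BucketIterator` differs in its details.

```python
import random

def make_batches(lengths, batch_size):
    """Cut a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def count_padding(batches):
    """Total pad tokens needed if each batch is padded to its longest example."""
    return sum(max(batch) - n for batch in batches for n in batch)

rng = random.Random(1234)
lengths = [rng.randint(10, 1000) for _ in range(640)]  # made-up review lengths

# Batches in dataset order: short and long examples are mixed together
naive_batches = make_batches(lengths, batch_size=64)

# Bucketed batches: group examples of similar length, then shuffle batch order
bucketed_batches = make_batches(sorted(lengths), batch_size=64)
rng.shuffle(bucketed_batches)

naive_padding = count_padding(naive_batches)
bucketed_padding = count_padding(bucketed_batches)

print(f"padding without bucketing: {naive_padding}")
print(f"padding with bucketing:    {bucketed_padding}")
```

Shuffling the order of the batches keeps some randomness in training even though the examples inside each batch have similar lengths.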

In [35]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

In [36]:
next(iter(train_iterator))


[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 64x1084 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]

Because the fields use `batch_first=True`, the text tensor has size [batch size, sentence length].

In [37]:
for batch in train_iterator:
    print("batch.text:")
    print(batch.text)
    print("batch.text.size:")
    print(batch.text.size())
    print("batch.label:")
    print(batch.label, len(batch.label))

    break

batch.text:
tensor([[ 7223,  3035,     5,  ...,     1,     1,     1],
        [  146,   146, 13551,  ...,     1,     1,     1],
        [   11,    35,   233,  ...,     1,     1,     1],
        ...,
        [ 3131,  4152,  2568,  ...,     1,     1,     1],
        [   11,    57,    29,  ...,     1,     1,     1],
        [  323,    11,    19,  ...,     1,     1,     1]], device='cuda:0')
batch.text.size:
torch.Size([64, 1021])
batch.label:
tensor([0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 1.,
        0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 1.,
        0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0.,
        1., 0., 0., 1., 0., 0., 0., 0., 0., 0.], device='cuda:0') 64


The batches above can now be fed into our model.