In [1]:
import pandas as pd
import numpy as np
import torch
import tqdm

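# Field, TabularDataset and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed in 0.12).
from torchtext import data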

### Loading the Data

In [2]:
pd.read_csv("./data/imdb_train.csv").head(2)

Unnamed: 0,id,sentiment,review
0,4353_9,1,So fortunate were we to see this fantastic fil...
1,7127_10,1,Marvelous film again dealing with the trials a...


In [3]:
pd.read_csv("./data/imdb_test.csv").head(2)

Unnamed: 0,id,sentiment,review
0,8169_4,0,"The movie starts with a pair of campers, a man..."
1,7830_10,1,"In \Die Nibelungen: Siegfried\"", Siegfried was..."


In [4]:
pd.read_csv("./data/imdb_val.csv").head(2)

Unnamed: 0,id,sentiment,review
0,1565_1,0,The movie uses a cutting edge title for a lame...
1,4842_10,1,If the very thought of Arthur Askey twists you...


### Declaring Fields

The Field class determines how the data is preprocessed and converted into a numeric format. We want the `review` field to be tokenized with spaCy and converted to lowercase, so we pass those options to the Field constructor. Passing `batch_first=True` makes the resulting tensors have the batch as their first dimension.

In [5]:
TEXT = data.Field(sequential=True, tokenize='spacy', lower=True,batch_first=True)
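
As a rough illustration of what tokenizing and lowercasing do to a raw review, here is a minimal sketch that uses a regular expression as a stand-in for the spaCy tokenizer. The `simple_tokenize` and `preprocess` helpers are hypothetical and are not part of torchtext; the regex will not match spaCy's output on contractions and other edge cases.

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for the spaCy tokenizer: split into words and punctuation."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text):
    # Tokenize first, then lowercase every token (the effect of lower=True).
    return [token.lower() for token in simple_tokenize(text)]

print(preprocess("So fortunate were we to see this fantastic film."))
```

The numeric conversion happens later, once a vocabulary has been built from the training data.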

The preprocessing of the labels is even easier, since they are already binary-encoded.
All we need to do is tell the Field class that the labels are already numeric. We do this by passing the `use_vocab=False` keyword to the constructor.

In [6]:
LABEL = data.Field(sequential=False, use_vocab=False,batch_first=True)

In [7]:
#           id              sentiment        review
fields = [(None, None),('sentiment',LABEL),('review',TEXT)]

If your data has a header, which ours does, it must be skipped by passing `skip_header=True`. Otherwise, TorchText will treat the header row as an example. By default, `skip_header` is False.

In [8]:
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = './data',
                                        train = 'imdb_train.csv',
                                        validation = 'imdb_val.csv',
                                        test = 'imdb_test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)
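
The effect of skipping the header can be sketched with the standard `csv` module. This is only an illustration, not how TorchText is implemented: the `read_rows` helper is hypothetical, and the two data rows are made up to mimic the assumed `id, sentiment, review` column layout.

```python
import csv
import io

# Tiny in-memory CSV with made-up rows (assumed layout: id, sentiment, review).
raw = "id,sentiment,review\n1_1,1,great film\n2_2,0,dull plot\n"

def read_rows(text, skip_header):
    """Hypothetical helper: parse CSV text, optionally dropping the first row."""
    reader = csv.reader(io.StringIO(text))
    if skip_header:
        next(reader)  # discard the header row
    return list(reader)

with_header = read_rows(raw, skip_header=False)
without_header = read_rows(raw, skip_header=True)

print(with_header[0])     # the header shows up as if it were an example
print(without_header[0])  # the first real example
```

Without the skip, the header strings would be tokenized and counted like any other review.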

In [9]:
TEXT.build_vocab(train_data,max_size = 25000)

In [10]:
TEXT.vocab.freqs.most_common(10)

[('the', 194862),
 (',', 163792),
 ('.', 140517),
 ('and', 97684),
 ('a', 96528),
 ('of', 86952),
 ('to', 81110),
 ('is', 66171),
 ('it', 56081),
 ('in', 55610)]
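
Conceptually, building the vocabulary means counting tokens in the training set and assigning integer ids to the most frequent ones, with special tokens for unknown words and padding. The sketch below shows that idea with `collections.Counter`. The `build_simple_vocab` helper is hypothetical, the toy sentences are made up, and tie-breaking between equally frequent tokens may differ from TorchText.

```python
from collections import Counter

def build_simple_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Hypothetical helper: map the most frequent tokens to integer ids."""
    counts = Counter(token for text in tokenized_texts for token in text)
    # Special tokens come first; max_size limits only the regular tokens.
    itos = list(specials) + [token for token, _ in counts.most_common(max_size)]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

texts = [["the", "film", "was", "great"], ["the", "plot", "was", "dull"], ["the", "end"]]
itos, stoi = build_simple_vocab(texts, max_size=2)

unk_id = stoi["<unk>"]
# Tokens that did not make it into the vocabulary fall back to the <unk> id.
ids = [stoi.get(token, unk_id) for token in ["the", "film", "was"]]
print(itos)
print(ids)
```

Capping the vocabulary keeps the embedding table small, at the cost of mapping rare words to the unknown token.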

In [11]:
print(vars(train_data[0]))

{'sentiment': '1', 'review': ['so', 'fortunate', 'were', 'we', 'to', 'see', 'this', 'fantastic', 'film', 'at', 'the', 'palm', 'springs', 'international', 'film', 'festival', '.', 'upon', 'entering', 'the', 'theater', 'we', 'were', 'handed', 'a', 'small', 'opinion', 'card', 'that', 'would', 'be', 'used', 'for', 'our', 'personal', 'rating', 'of', 'the', 'film', '.', 'looking', 'at', 'the', 'card', 'i', 'turned', 'to', 'my', 'wife', 'and', 'said', ',', '\\how', 'many', 'movies', 'in', 'your', 'life', 'do', 'you', 'think', 'you', 'can', 'rate', 'as', 'superb', '?', 'only', 'about', '5', 'for', 'me.\\', '"', 'but', 'then', 'watching', 'the', 'interaction', 'between', 'peter', 'falk', 'and', 'paul', 'reiser', 'while', 'viewing', 'the', 'spectacular', 'scenery', 'in', 'the', 'film', "'s", 'setting', 'of', 'new', 'york', 'state', ',', 'i', 'slowly', 'starting', 'bumping', 'the', 'movie', 'up', 'a', 'category', 'at', 'a', 'time', '.', 'certainly', 'it', 'was', 'good', 'but', 'the', 'totally', '

### Creating the Iterator

During training, we'll be using a special kind of Iterator, called the **BucketIterator**.

When we pass data into a neural network, the sequences in a batch must be padded to the same length so that they can be processed together. For example:
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 
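
The padding step above can be sketched in a few lines of plain Python. The `pad_batch` helper is hypothetical, and the pad value of 0 simply follows the example; the id actually used for padding depends on where the pad token sits in the vocabulary.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```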

If the sequences differ greatly in length, the padding wastes a lot of memory and time. The BucketIterator groups sequences of similar lengths together in each batch to minimize padding.

By default, the training data is shuffled each epoch, while the validation/test data is sorted. However, TorchText doesn't know what to sort our data by, and it throws an error if we don't tell it.

There are two ways to handle this: you can either tell the iterator not to sort the validation/test data by passing `sort=False`, or you can tell it how to sort the data by passing a `sort_key`. A sort key is a function that returns the value on which to sort the examples. For example, `lambda x: len(x.review)` sorts the examples by the number of tokens in their `review` attribute. Ideally, you want to use a sort key, as the BucketIterator can then sort your examples and minimize the amount of padding within each batch.
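
To see why a sort key reduces padding, the sketch below counts the pad positions needed when examples are batched in their original order versus after sorting by length. It is a simplification: the data is fully sorted here, which is not exactly what BucketIterator does. The helper names and example lengths are made up for illustration.

```python
from types import SimpleNamespace

def review_length(example):
    """Sort key: the number of tokens in the review."""
    return len(example.review)

def make_batches(examples, batch_size):
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def count_padding(batches, sort_key):
    """Total number of pad positions needed to make every batch rectangular."""
    total = 0
    for batch in batches:
        longest = max(sort_key(ex) for ex in batch)
        total += sum(longest - sort_key(ex) for ex in batch)
    return total

# Made-up examples whose reviews have very different lengths.
lengths = [2, 50, 3, 48, 4, 51]
examples = [SimpleNamespace(review=["tok"] * n) for n in lengths]

unsorted_padding = count_padding(make_batches(examples, 3), review_length)
sorted_padding = count_padding(
    make_batches(sorted(examples, key=review_length), 3), review_length
)

print(unsorted_padding)
print(sorted_padding)
```

Grouping by length cuts the wasted positions sharply in this toy case, which is the effect the sort key is meant to enable.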

We can then iterate over the iterator to get batches of data. Note that by default TorchText puts the batch dimension second, but because we passed `batch_first=True` to the Field, our batches have shape [batch size, sequence length].

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 4


train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort_key=lambda x: len(x.review), # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=False,
    repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

In [13]:
# Use a name other than `data` so the imported torchtext `data` module is not shadowed.
batch = next(iter(train_iterator))
batch


[torchtext.data.batch.Batch of size 4]
	[.sentiment]:[torch.cuda.LongTensor of size 4 (GPU 0)]
	[.review]:[torch.cuda.LongTensor of size 4x287 (GPU 0)]

### Wrapping the Iterator


Currently, the iterator returns a custom datatype called torchtext.data.Batch. This makes code reuse difficult (each time the column names change, we need to modify the code) and makes torchtext hard to use with other libraries in some cases (such as torchsample and fastai).

Concretely, we'll convert the batch to a tuple in the form (x, y) where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data).

In [14]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)

We'll use this to wrap the BucketIterator

In [15]:
train_dl = BatchWrapper(train_iterator, "review", ["sentiment"])
valid_dl = BatchWrapper(valid_iterator, "review", ["sentiment"])
test_dl = BatchWrapper(test_iterator, "review", ["sentiment"])

In [16]:
result = next(iter(train_dl))

print(result[0].size())
print(result[1].size())

torch.Size([4, 635])
torch.Size([4, 1])


In [17]:
for x,y in tqdm.tqdm(train_dl):
    print(x.size())
    print(y.size())
    # Feed the batch into the model and train here
    break

  0%|          | 0/3750 [00:00<?, ?it/s]

torch.Size([4, 458])
torch.Size([4, 1])



