In [1]:
import pandas as pd
import numpy as np
import torch

In [2]:
train_path = 'IroSvA2019/train/irosva.mx.training.csv'

# Loading the data

First, let's examine what the data looks like.

(Note: This repo does not contain the full data. To get the full data, go to the [Kaggle competition page](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and download the data for yourself.)

In [3]:
pd.read_csv(train_path).head(10)

Unnamed: 0,ID,TOPIC,IS_IRONIC,MESSAGE
0,6424ee0864a0af40660686e135f5652b,asuntosConacyt,1,"Rica económicamente, pero muy pobre en objetiv..."
1,f59978451dd7fb228830fed2ae00c3ef,asuntosConacyt,1,"En algo tiene razón, mafias hay en todo, hasta..."
2,280963c5eb0d162858caf3480a7ea08c,asuntosConacyt,1,¿De cuándo acá tan preocupados por la ciencia ...
3,69af1d02743953e5a0e971cf6ac00c69,asuntosConacyt,1,"De una vez que paren las titulaciones, que tod..."
4,b0be044f4682eea74c6fe83a87fff22a,asuntosConacyt,1,@LopesDorigaa @Dolores_PL Es el que también t...
5,bb3895deabd4aba3e328abdfa53323b4,asuntosConacyt,1,Pero ahí viene el joven prodigio y digno repre...
6,07e712aec9b92319cfff48e07d00731c,asuntosConacyt,1,@lopezobrador_ Habla de la Mafia del poder y...
7,99baf109f3a25721ffdb30c529990b0a,asuntosConacyt,1,¿y la diseñadora de modas en el CONACYT?
8,5f70ec64d4cc6f4d628ce74008b815c5,asuntosConacyt,1,Esta 4T odia todo al que se tiene estudios. Na...
9,d8deeac521c1c8ad396f680f6b11db76,asuntosConacyt,1,Si es tan inteligente que lo manden a la Secre...


### Declaring Fields

The Field class determines how the data is preprocessed and converted into a numeric format.

In [4]:
from torchtext.data import Field

We want the message field to be converted to lowercase and tokenized on whitespace, so we pass those settings to the Field.

In [5]:
tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)

That was simple. Preprocessing the labels is even easier, since they are already encoded as binary values (0 or 1).
All we need to do is tell the Field class that the labels are already numeric. We do this by passing the use_vocab=False keyword to the constructor.

In [6]:
LABEL = Field(sequential=False, use_vocab=False)
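
To illustrate what the TEXT field's settings do to a single message, the plain-Python sketch below applies the same whitespace tokenizer and then lowercases each token. The helper name `preprocess` is hypothetical and is not part of torchtext; it only mimics the effect of `tokenize=tokenize, lower=True`.

```python
# Hypothetical stand-in for the Field's preprocessing; this is not torchtext code.
tokenize = lambda x: x.split()

def preprocess(text):
    """Split on whitespace, then lowercase every token."""
    return [tok.lower() for tok in tokenize(text)]

tokens = preprocess("Rica económicamente, pero muy pobre")
print(tokens)
```

Note that whitespace splitting leaves punctuation attached to the neighbouring word, so `económicamente,` and `económicamente` would count as different tokens.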

### Creating the Dataset

We'll use the TabularDataset class to read our data, since it is in csv format (TabularDataset handles csv, tsv, and json files as of now)

In [7]:
from torchtext.data import TabularDataset

For the train and validation data, we need to process the labels. The fields we pass in must be in the same order as the columns. For fields we don't use, we pass in a tuple where the second element is None

In [8]:
%%time
datafields = [("id", None), # we won't be needing the id, so we pass in None as the field
              ("topic", TEXT),
              ("is_ironic", LABEL),
              ("message", TEXT)]

trn = TabularDataset(path=train_path,
        format='CSV',
        skip_header=True, # if your csv file has a header, make sure to pass this to ensure it doesn't get processed as data!
        fields=datafields)

Wall time: 106 ms
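
Because fields are matched to columns by position rather than by name, a misplaced entry silently assigns the wrong column. The sketch below mimics that positional pairing with the standard `csv` module. `row_to_example` is a hypothetical helper, the strings `"TEXT"` and `"LABEL"` stand in for the real Field objects, and the sample row is made up for illustration.

```python
import csv
import io

# Made-up CSV with the same column order as the training file.
sample = io.StringIO(
    "ID,TOPIC,IS_IRONIC,MESSAGE\n"
    'abc123,asuntosConacyt,1,"Hola, mundo"\n'
)

# (name, field) pairs in column order; None means "skip this column".
datafields = [("id", None), ("topic", "TEXT"), ("is_ironic", "LABEL"), ("message", "TEXT")]

def row_to_example(row, fields):
    """Pair each column with the field at the same position and drop skipped ones."""
    return {name: value for (name, field), value in zip(fields, row) if field is not None}

reader = csv.reader(sample)
next(reader)  # skip the header row, as skip_header=True does
example = row_to_example(next(reader), datafields)
print(example)
```

The resulting keys match what the Example objects expose: only the fields that were not set to None survive.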


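For the test data, we don't have any labels. This notebook only loads the training set, so here we simply inspect the dataset object we just created.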

In [9]:
trn

<torchtext.data.dataset.TabularDataset at 0x1b935a0bc48>

For the TEXT field to convert words into integers, it needs to be told what the entire vocabulary is. To do this, we run TEXT.build_vocab, passing in the dataset to build the vocabulary on.

In [10]:
%%time
TEXT.build_vocab(trn)

Wall time: 113 ms
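
Conceptually, building the vocabulary amounts to counting tokens and assigning each distinct token an integer index. The sketch below does this with `collections.Counter`. The function `build_vocab` and the toy corpus are hypothetical, and reserving indices 0 and 1 for `<unk>` and `<pad>` is an assumption about torchtext's defaults rather than something this notebook states.

```python
from collections import Counter

def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Count tokens and map them to integer ids, most frequent first."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials)
    # Sort by descending frequency, breaking ties alphabetically.
    itos += [tok for tok, _ in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return freqs, itos, stoi

corpus = [["que", "no", "que"], ["de", "que"]]  # toy corpus, already tokenized
freqs, itos, stoi = build_vocab(corpus)

# Words that were never seen fall back to the <unk> index.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["que", "de", "nunca"]]
print(itos, ids)
```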


Let's take a look at what the vocab looks like.

vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.

In [11]:
TEXT.vocab.freqs.most_common(10)

[('que', 2165),
 ('de', 1965),
 ('la', 1600),
 ('y', 1282),
 ('a', 1145),
 ('no', 1065),
 ('el', 1008),
 ('es', 943),
 ('en', 761),
 ('los', 655)]

In [12]:
len(TEXT.vocab)

11809

It is also instructive to take a look inside the Dataset. Datasets can be indexed like normal lists, so we'll look at the first element.

In [13]:
trn[0]

<torchtext.data.example.Example at 0x1b935a0be88>

Each element of the dataset is an Example object that bundles the attributes of a single data point together.

In [14]:
trn[0].__dict__.keys()

dict_keys(['topic', 'is_ironic', 'message'])

We see that the message text is already tokenized for us.

In [15]:
trn[0].message[:3]

['rica', 'económicamente,', 'pero']

Looking good. Now, let's build the Iterator which will allow us to load the data into our model.

### Creating the Iterator

In [16]:
from torchtext.data import Iterator, BucketIterator

During training, we'll be using a special kind of Iterator, called the **BucketIterator**.

When we pass data into a neural network, we want the sequences to be padded to the same length so that we can process them as a batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding will waste a lot of memory and time.

The BucketIterator groups sequences of similar lengths together for each batch to minimize padding. Handy, right?
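
To make the saving concrete, the sketch below pads batches formed in the original order and batches formed after sorting by length, then counts the padding tokens needed in each case. It is a simplified illustration in plain Python, not the BucketIterator's actual algorithm, and `pad_batch` and `count_padding` are hypothetical helpers.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence in the batch to the longest length."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]

def count_padding(seqs, batch_size):
    """Total number of pad tokens needed when batching seqs in the given order."""
    total = 0
    for i in range(0, len(seqs), batch_size):
        batch = seqs[i:i + batch_size]
        width = max(len(seq) for seq in batch)
        total += sum(width - len(seq) for seq in batch)
    return total

seqs = [[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1], [9]]
padded = pad_batch(seqs[:3])  # reproduces the padding example above

unsorted_pad = count_padding(seqs, batch_size=2)
sorted_pad = count_padding(sorted(seqs, key=len), batch_size=2)
print(padded, unsorted_pad, sorted_pad)
```

With these four toy sequences and a batch size of 2, grouping by length reduces the padding from 6 tokens to 2.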

In [17]:
train_iter= BucketIterator(
        trn, # we pass in the datasets we want the iterator to draw data from
        batch_size=64,
        device=0, # deprecated form: to choose a device, pass a torch.device or a string such as 'cuda' instead of a GPU number
        sort_key=lambda x: len(x.message), # the BucketIterator needs to be told what function it should use to group the data.
        sort_within_batch=False,
        repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


Let's take a look at what the output of the BucketIterator looks like

In [18]:
batch = next(iter(train_iter)); batch


[torchtext.data.batch.Batch of size 64]
	[.topic]:[torch.LongTensor of size 1x64]
	[.is_ironic]:[torch.LongTensor of size 64]
	[.message]:[torch.LongTensor of size 45x64]

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [19]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'topic', 'is_ironic', 'message'])

For the test set, we don't want the data to be shuffled, so a standard Iterator would be the right choice there.

### Wrapping the Iterator

Currently, the iterator returns a custom datatype called torchtext.data.Batch. This makes code reuse difficult (since each time the column names change, we need to modify the code), and makes torchtext hard to use with other libraries for some use cases (like torchsample and fastai). 

I hope this will be dealt with in the future (I'm considering filing a PR if I can decide what the API should look like), but in the meantime, we'll hack on a simple wrapper to make the batches easy to use. 

Concretely, we'll convert the batch to a tuple in the form (x, y) where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data).

In [20]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)

We'll use this to wrap the BucketIterator

In [21]:
train_dl = BatchWrapper(train_iter, "message", ['is_ironic'])

In [22]:
batch = next(iter(train_dl)); batch

(tensor([[   8,  859,   10,  ...,   52, 7766,    4],
         [  95, 1705, 7446,  ...,  151,   39,  266],
         [8833,  220,   11,  ...,   82,   19,    7],
         ...,
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1]]), tensor([[0.],
         [0.],
         [1.],
         ...

Now we're ready to start training a model!

# Training a Text Classifier

We'll use a simple LSTM as a baseline example.

In [23]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [39]:
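class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__() # don't forget to call this!
        # note: spatial_dropout is accepted but not used anywhere in this model
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        # note: despite the class name, this LSTM is unidirectional (bidirectional=True is not passed);
        # with num_layers=1 the dropout argument has no effect
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 1)
    
    def forward(self, seq):
        # seq has shape (seq_len, batch); hdn has shape (seq_len, batch, hidden_dim)
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :] # output at the last time step (a padding position for shorter sequences)
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature) # raw logits of shape (batch, 1)
        return preds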

In [40]:
em_sz = 100
nh = 500
nl = 3 # defined but not passed to the model, so num_linear keeps its default of 1
model = SimpleBiLSTMBaseline(nh, emb_dim=em_sz); model

SimpleBiLSTMBaseline(
  (embedding): Embedding(11809, 100)
  (encoder): LSTM(100, 500, dropout=0.1)
  (linear_layers): ModuleList()
  (predictor): Linear(in_features=500, out_features=1, bias=True)
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [33]:
# model.cuda()

### The training loop

In [34]:
import tqdm

In [41]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()

In [42]:
epochs = 2

In [47]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        # x has shape (seq_len, batch), so the batch size is taken from y
        running_loss += loss.item() * y.size(0)
        print(loss.item())
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the validation loss for this epoch
    # note: valid_dl and vld are never created in this notebook; define them from a held-out split first
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    with torch.no_grad(): # gradients are not needed for evaluation
        for x, y in valid_dl:
            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item() * y.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

100%|███████████████████████████████████████████████████████████████████████████████| 38/38 [00:53<00:00,  1.41s/it]


NameError: name 'valid_dl' is not defined
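
The NameError arises because `valid_dl` and `vld` are never created in this notebook. One way to obtain them is to hold out part of the training examples. The sketch below shows the idea on a plain list. `holdout_split` is a hypothetical helper, and the 80/20 ratio and the seed are arbitrary choices. The legacy torchtext dataset class also provides a `split` method for this purpose, though its exact arguments should be checked against the installed version.

```python
import random

def holdout_split(examples, valid_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a validation portion."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

examples = list(range(10))  # stand-in for the dataset's examples
train_part, valid_part = holdout_split(examples)
print(len(train_part), len(valid_part))
```

The validation portion would then be wrapped in an iterator and a BatchWrapper in the same way as the training data.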

# Writing Predictions

Finally, we output the data in the format required by the competition

In [31]:
test_dl

NameError: name 'test_dl' is not defined

In [None]:
test_preds = []
model.eval() # make sure the model is in evaluation mode
with torch.no_grad(): # gradients are not needed for inference
    for x, y in tqdm.tqdm(test_dl):
        preds = model(x)
        # if your data is on the GPU, .cpu() moves it back to the CPU first
        preds = preds.cpu().numpy()
        # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
        preds = 1 / (1 + np.exp(-preds))
        test_preds.append(preds)
# each batch has shape (batch_size, 1), so we stack the batches along the first axis
test_preds = np.vstack(test_preds)
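
The sketch below shows, with made-up logits, how per-batch outputs of shape (batch_size, 1) can be turned into probabilities and combined into one array. The numbers are illustrative only, and the 0.5 threshold for deciding the binary label is an assumption, not something this notebook specifies.

```python
import numpy as np

def sigmoid(logits):
    """Map logits to probabilities in (0, 1)."""
    return 1 / (1 + np.exp(-logits))

# Two made-up batches of logits, each of shape (batch_size, 1).
batches = [np.array([[0.0], [2.0]]), np.array([[-2.0]])]

# np.vstack stacks along the first axis, so batches of different sizes combine cleanly.
probs = np.vstack([sigmoid(b) for b in batches])  # shape (3, 1)

# Threshold the single probability column to get hard 0/1 labels.
labels = (probs[:, 0] >= 0.5).astype(int)
print(probs.shape, labels)
```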

In [None]:
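df = pd.read_csv("data/test.csv") # adjust this path to wherever your test file lives
# the model has a single output, so there is only one prediction column
# (the column name IS_IRONIC mirrors the training file and is an assumption about the required format)
df["IS_IRONIC"] = test_preds[:, 0]

# if you want to write the submission file to disk, uncomment and run the below code
# (assumes the test file stores the text in a MESSAGE column, like the training file)
# df.drop("MESSAGE", axis=1).to_csv("submission.csv", index=False)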

In [None]:
df.head(3)