In [62]:
import pandas as pd
import numpy as np
import torch

# Loading the data

First, let's examine what the data looks like.

(Note: This repo does not contain the full data. To get the full data, go to the [Kaggle competition page](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) and download the data for yourself.)

In [63]:
pd.read_csv("data/train_2.csv").head(2)

Unnamed: 0,comment_text,gender
0,Workout today was 95% body weight stuff & ther...,1
1,Holy Props! Thank you! and thanks for following!,1


In [64]:
pd.read_csv("data/valid_2.csv").head(2)

Unnamed: 0,comment_text,gender
0,"Thanks for following (I'm a bit late, I know!!...",1
1,"Thanks for following, following right back. G...",1


In this data there is a single binary label to predict: gender.

In [65]:
pd.read_csv("data/test_2.csv").head(2)

Unnamed: 0,comment_text,gender
0,HOW COME I CANT SEE YOUR WORKOUTS KHEST,0
1,Thanks for the follow my fellow burned runner!!,1


### Declaring Fields

The Field class determines how the data is preprocessed and converted into a numeric format

In [66]:
from torchtext.data import Field

We want the comment_text field to be lowercased and tokenized on whitespace, so we pass those options to the Field.

In [67]:
tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
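
To see what these two options amount to for a single comment, here is a minimal plain-Python sketch. The `preprocess` helper is a hypothetical name for illustration, not part of torchtext, and it only approximates what the Field does internally.

```python
def preprocess(text):
    """Lowercase the text and split it on whitespace.

    Punctuation stays attached to the neighbouring word, which is the
    same behaviour we get from the lambda tokenizer above.
    """
    return text.lower().split()


tokens = preprocess("Holy Props! Thank you!")
print(tokens)  # ['holy', 'props!', 'thank', 'you!']
```

Note that whitespace splitting keeps tokens like 'props!' and 'you!' intact, so a smarter tokenizer may be worth trying later.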

That was simple. The preprocessing of the labels is even easier, since they are already converted into a binary encoding.
All we need to do is to tell the Field class that the labels are already processed. We do this by passing the use_vocab=False keyword to the constructor

In [68]:
LABEL = Field(sequential=False, use_vocab=False)

### Creating the Dataset

We'll use the TabularDataset class to read our data, since it is in csv format (TabularDataset handles csv, tsv, and json files as of now)

In [69]:
from torchtext.data import TabularDataset

For the train and validation data, we need to process the labels. The fields we pass in must be in the same order as the columns. For fields we don't use, we pass in a tuple where the second element is None

In [70]:
%%time
tv_datafields = [("comment_text", TEXT),
                 ("gender", LABEL)]

trn, vld = TabularDataset.splits(
        path="data", # the root directory where the data lies
        train='train_2.csv', validation="valid_2.csv",
        format='csv',
        skip_header=True, # if your csv file has a header, make sure to pass this to ensure it doesn't get processed as data!
        fields=tv_datafields)

CPU times: user 1.52 s, sys: 179 ms, total: 1.7 s
Wall time: 1.98 s
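
The rule that fields must follow column order, with None marking a column to skip, can be sketched in plain Python. This is only an illustration under stated assumptions: `read_rows` is a hypothetical helper, not torchtext's implementation, and the sample CSV (including its `id` column) is made up to show the None case.

```python
import csv
import io


def read_rows(csv_text, fields, skip_header=True):
    """Pair each column with a (name, processor) field, in column order.

    A field of the form (name, None) means "ignore this column".
    """
    reader = csv.reader(io.StringIO(csv_text))
    if skip_header:
        next(reader)  # drop the header row so it isn't treated as data
    examples = []
    for row in reader:
        example = {}
        for (name, process), value in zip(fields, row):
            if process is not None:
                example[name] = process(value)
        examples.append(example)
    return examples


# made-up sample with an extra column we don't want to use
sample = 'id,comment_text,gender\n7,"Thanks for following!",1\n'
fields = [("id", None),
          ("comment_text", lambda s: s.lower().split()),
          ("gender", int)]
rows = read_rows(sample, fields)
print(rows)  # [{'comment_text': ['thanks', 'for', 'following!'], 'gender': 1}]
```

Because the pairing is purely positional, listing the fields in the wrong order would silently feed the wrong column to each processor.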


The test data here also includes the gender column, so we pass the same fields.

In [71]:
%%time
tst_datafields = [("comment_text", TEXT),
                 ("gender", LABEL)]

tst = TabularDataset(
        path="data/test_2.csv", # the file path
        format='csv',
        skip_header=True, # if your csv file has a header, make sure to pass this to ensure it doesn't get processed as data!
        fields=tst_datafields)

CPU times: user 184 ms, sys: 13.1 ms, total: 197 ms
Wall time: 210 ms


For the TEXT field to convert words into integers, it needs to be told what the entire vocabulary is. To do this, we run TEXT.build_vocab, passing in the dataset to build the vocabulary on.

In [72]:
%%time
TEXT.build_vocab(trn, vectors="glove.6B.100d")

CPU times: user 1.29 s, sys: 255 ms, total: 1.55 s
Wall time: 1.72 s


Let's take a look at what the vocab looks like.

The vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.

In [73]:
vocab = TEXT.vocab
vocab.freqs.most_common(10)

[('the', 40312),
 ('i', 31926),
 ('to', 28161),
 ('for', 22815),
 ('a', 20101),
 ('and', 18637),
 ('you', 15419),
 ('my', 13691),
 ('of', 11935),
 ('thanks', 11830)]
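
For intuition, building a vocabulary boils down to counting tokens and assigning each one an integer. The sketch below is a simplified stand-in, not torchtext's code: `build_vocab` and `numericalize` are hypothetical helpers, and placing `<unk>` at index 0 and `<pad>` at index 1 is an assumption made for this example.

```python
from collections import Counter


def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Count tokens and assign indices: specials first, then by frequency."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    # most frequent words get the smallest indices; ties are broken alphabetically
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [word for word, _ in ordered]
    stoi = {word: i for i, word in enumerate(itos)}
    return freqs, itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to integers, sending unseen words to the <unk> index."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


texts = [["thanks", "for", "the", "follow"], ["thanks", "for", "following"]]
freqs, itos, stoi = build_vocab(texts)
print(itos)
print(numericalize(["thanks", "for", "running"], stoi))  # [3, 2, 0]
```

The word 'running' never appeared when the vocab was built, so it maps to the unknown-word index.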

It is also instructive to take a look inside the Dataset. Datasets can be indexed like normal lists, so we'll look at the first element.

In [74]:
trn[0]

<torchtext.data.example.Example at 0x11b53e940>

Each element of the dataset is an Example object that bundles the attributes of a single data point together.

In [75]:
trn[0].__dict__.keys()

dict_keys(['comment_text', 'gender'])

We see that the comment text is already tokenized for us.

In [76]:
trn[0].comment_text

['workout',
 'today',
 'was',
 '95%',
 'body',
 'weight',
 'stuff',
 '&',
 'there',
 "isn't",
 'a',
 'real',
 'way',
 'to',
 'track',
 'the',
 'exercises',
 'in',
 'here',
 'so',
 'big',
 'sad',
 'face',
 ':(',
 'oh',
 'well',
 'at',
 'least',
 'the',
 'work',
 'out',
 'was',
 'good!!']

Looking good. Now, let's build the Iterator which will allow us to load the data into our model.

### Creating the Iterator

In [77]:
from torchtext.data import Iterator, BucketIterator

During training, we'll be using a special kind of Iterator, called the **BucketIterator**.

When we pass data into a neural network, we want the sequences to be padded to the same length so that we can process them in batches:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding will waste a lot of memory and time.

The BucketIterator groups sequences of similar lengths together for each batch to minimize padding. Handy, right?
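
To get a feel for how much padding bucketing can save, here is a rough plain-Python sketch. It simply sorts by length and chunks into batches, which is a simplification of what the BucketIterator does; `pad_batch`, `count_padding`, and `make_batches` are hypothetical helpers, and the sequence lengths are made up.

```python
def pad_batch(seqs, pad=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad] * (longest - len(s)) for s in seqs]


def count_padding(batches):
    """Total number of pad slots needed across all batches."""
    return sum(max(len(s) for s in b) * len(b) - sum(len(s) for s in b)
               for b in batches)


def make_batches(seqs, batch_size):
    """Chunk the sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


print(pad_batch([[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]]))
# [[3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1]]

# six made-up sequences with very different lengths
seqs = [[1] * n for n in (2, 9, 3, 10, 1, 8)]
unsorted_batches = make_batches(seqs, 2)
bucketed_batches = make_batches(sorted(seqs, key=len), 2)
print(count_padding(unsorted_batches), count_padding(bucketed_batches))  # 21 7
```

In this toy case, grouping similar lengths cuts the padding from 21 slots to 7.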

In [78]:
train_iter, val_iter = BucketIterator.splits(
        (trn, vld), # we pass in the datasets we want the iterator to draw data from
        batch_sizes=(64, 64),
        device=-1, # if you want to use the GPU, specify the GPU number here
        sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
        sort_within_batch=False,
        repeat=False # we pass repeat=False because we want to wrap this Iterator layer.
)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


Let's take a look at what the output of the BucketIterator looks like

In [79]:
batch = next(iter(train_iter)); batch


[torchtext.data.batch.Batch of size 64]
	[.comment_text]:[torch.LongTensor of size 73x64]
	[.gender]:[torch.LongTensor of size 64]

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [80]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'gender'])

For the test set, we don't want the data to be shuffled. This is why we'll be using a standard Iterator.

In [81]:
test_iter = Iterator(tst, batch_size=64, device=-1, sort=False, sort_within_batch=False, repeat=False)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


### Wrapping the Iterator

Currently, the iterator returns a custom datatype called torchtext.data.Batch. This makes code reuse difficult (since each time the column names change, we need to modify the code), and makes torchtext hard to use with other libraries for some use cases (like torchsample and fastai). 

I hope this will be dealt with in the future (I'm considering filing a PR if I can decide what the API should look like), but in the meantime, we'll hack on a simple wrapper to make the batches easy to use. 

Concretely, we'll convert the batch to a tuple in the form (x, y) where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data).

In [82]:
class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars # we pass in the list of attributes for x and y
    
    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_vars is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)
    
    def __len__(self):
        return len(self.dl)

We'll use this to wrap the BucketIterator

In [83]:
train_dl = BatchWrapper(train_iter, "comment_text", ["gender"])
valid_dl = BatchWrapper(val_iter, "comment_text", ["gender"])
test_dl = BatchWrapper(test_iter, "comment_text", ["gender"])

In [84]:
batch_ = next(iter(train_dl))

In [85]:
batch_

(tensor([[ 26432,   1916,     47,  ...,  72709,    111,    569],
         [  2467,   3201,     31,  ...,  69009,     33,   1749],
         [     8,      1,   1748,  ...,    179,     12,   1041],
         ...,
         [     1,      1,      1,  ...,      1,      1,      1],
         [     1,      1,      1,  ...,      1,      1,      1],
         [     1,      1,      1,  ...,      1,      1,      1]]),
 tensor([[ 1.],
         [ 0.],
         [ 1.],
         [ 1.],
         [ 0.],
         [ 1.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 1.],
         [ 0.],
         [ 1.],
         [ 1.],
         [ 1.],
         [ 1.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 1.],
         [ 1.],
         [ 0.],
         [ 0.],
         [ 1.],
         [ 1.],
         [ 1.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 0.],
         [ 1.],
         [ 0.],
         [ 0.],
         [ 1.],
         [ 0.],
  

Now we're ready to start training a model!

# Training a Text Classifier

We'll use a simple LSTM as a baseline example.

In [86]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [87]:
class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=100,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__() # don't forget to call this!
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.embedding.weight.data.copy_(vocab.vectors)
        
        print (self.embedding)
        
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 1)
    
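    def forward(self, seq):
        # seq: (seq_len, batch) -> hdn: (seq_len, batch, hidden_dim)
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :] # output at the last time step (a padding position for shorter sequences)
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature) # raw logits; the sigmoid is applied inside the loss
        return preds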

In [88]:
em_sz = 100
nh = 500
nl = 3
model = SimpleBiLSTMBaseline(nh, emb_dim=em_sz); model

Embedding(75110, 100)


SimpleBiLSTMBaseline(
  (embedding): Embedding(75110, 100)
  (encoder): LSTM(100, 500, dropout=0.1)
  (linear_layers): ModuleList()
  (predictor): Linear(in_features=500, out_features=1, bias=True)
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

### The training loop

In [89]:
import tqdm

In [90]:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()
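
BCEWithLogitsLoss combines a sigmoid with binary cross-entropy, which is why the model above outputs raw logits. As a sketch of the idea for a single example (plain Python, not PyTorch's actual implementation; the function names are hypothetical), the loss can be written in a numerically stable form and checked against the naive formula:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit z with label y, in the stable form
    max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Same quantity via an explicit sigmoid; can break down for extreme logits."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


print(bce_with_logits(2.0, 1.0), naive_bce(2.0, 1.0))
print(bce_with_logits(0.0, 1.0))  # log(2): the model is maximally unsure
```

The stable form never takes the log of a value that has been rounded to 0, so it stays finite even for very large logits.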

In [93]:
epochs = 4

In [94]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()
        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        
        # loss.item() replaces the deprecated loss.data[0]
        # note: x is (seq_len, batch_size), so x.size(0) is the sequence length;
        # use x.size(1) to weight by batch size instead
        running_loss += loss.item() * x.size(0)
        
    epoch_loss = running_loss / len(trn)
    
    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    with torch.no_grad(): # no gradients are needed for validation
        for x, y in valid_dl:
            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item() * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))



100%|██████████| 909/909 [20:29<00:00,  1.35s/it]
  0%|          | 0/909 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.9654, Validation Loss: 0.2391


100%|██████████| 909/909 [21:08<00:00,  1.40s/it]
  0%|          | 0/909 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.9581, Validation Loss: 0.2330


100%|██████████| 909/909 [29:11<00:00,  1.93s/it]
  0%|          | 0/909 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.9394, Validation Loss: 0.2610


100%|██████████| 909/909 [1:38:08<00:00,  6.48s/it]


Epoch: 4, Training Loss: 0.8806, Validation Loss: 0.2852
CPU times: user 4h 56min 53s, sys: 17min 4s, total: 5h 13min 58s
Wall time: 2h 50min 18s


# Writing Predictions

Finally, we run the model on the test set and collect each predicted probability alongside its true label, so we can compute the accuracy.

In [95]:
test_dl

<__main__.BatchWrapper at 0x116733f98>

In [96]:
pairs = []
for x, y in tqdm.tqdm(test_dl):
    preds = model(x)
    # if your data is on the GPU, you need to move the data back to the cpu
    # preds = preds.data.cpu().numpy()
    preds = preds.data.numpy()
    # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
    preds = 1 / (1 + np.exp(-preds))
    
    pairs.extend([(pred[0], v.item()) for pred, v in zip(preds, y)])

100%|██████████| 171/171 [01:19<00:00,  2.14it/s]


In [97]:
# accuracy: fraction of predictions (thresholded at 0.5) that match the label
sum(int((p > 0.5) == y) for p, y in pairs) / len(pairs)

0.5906224293940224
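
The accuracy computation above can be spelled out on a handful of made-up logits. This is a small plain-Python sketch with hypothetical helper names and invented numbers, just to show the sigmoid-then-threshold step:

```python
import math


def sigmoid(z):
    """Squash a raw logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(logits, labels, threshold=0.5):
    """Fraction of examples where the thresholded probability matches the label."""
    hits = sum(int((sigmoid(z) > threshold) == bool(y))
               for z, y in zip(logits, labels))
    return hits / len(labels)


# made-up logits and labels, purely for illustration
logits = [2.0, -1.0, 0.3, -0.2]
labels = [1.0, 0.0, 0.0, 0.0]
print(accuracy(logits, labels))  # 0.75
```

Since sigmoid(0) is exactly 0.5, thresholding the probability at 0.5 is the same as checking whether the logit is positive.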

In [98]:
pairs = []
for x, y in tqdm.tqdm(train_dl):
    preds = model(x)
    # if your data is on the GPU, you need to move the data back to the cpu
    # preds = preds.data.cpu().numpy()
    preds = preds.data.numpy()
    # the actual outputs of the model are logits, so we need to pass these values to the sigmoid function
    preds = 1 / (1 + np.exp(-preds))
    
    pairs.extend([(pred[0], v.item()) for pred, v in zip(preds, y)])

100%|██████████| 909/909 [06:50<00:00,  2.22it/s]


In [99]:
# accuracy: fraction of predictions (thresholded at 0.5) that match the label
sum(int((p > 0.5) == y) for p, y in pairs) / len(pairs)

0.6989108553140969

In [61]:
df = pd.read_csv("data/test.csv")

In [None]:
df.head(3)