<a href="https://colab.research.google.com/github/byrondennis1/NLP_Projects/blob/master/Toxic_Comment_Classification_Challenge_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Near 1st Place Results in the Kaggle Toxic Comment Classification Challenge

**Using the HuggingFace library, I fine-tuned BERT on the competition data and obtained a score close to the 1st place results.**  

- The results are a testament to the power of BERT and to the improvements made in NLP since the competition ended 2 years ago: the process was fairly straightforward, and I did not spend any time tuning parameters.  

- I fine-tuned the classification layer first and then trained the entire network for an additional epoch. Other methods could still improve performance, such as more strategic layer unfreezing and discriminative layer training.  

- I also did not attempt to optimize the learning rate, which would be worth doing even to speed up training time.
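
Discriminative layer training, mentioned in the list above, gives lower layers a smaller learning rate than layers near the classifier. The following is a minimal sketch of one way to compute such rates, not what this notebook does. The function name `layerwise_learning_rates` and the decay factor of 0.9 are assumptions chosen for illustration.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.9):
    """Return one learning rate per encoder layer, lowest layer first.

    The top layer (closest to the classifier) gets base_lr; each layer
    below it is scaled by one extra factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# bert-base has 12 encoder layers; 1e-5 is the full-network rate used in this notebook.
rates = layerwise_learning_rates(base_lr=1e-5, num_layers=12, decay=0.9)
for i, lr in enumerate(rates):
    print(f"encoder layer {i:2d}: lr = {lr:.2e}")
```

Each rate could then be passed to the optimizer as a parameter group of the form `{"params": layer.parameters(), "lr": lr}`, one per entry of `model.bert.encoder.layer`. Whether this helps on this dataset is untested here.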

**About the competition.**

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview 


## Import Libraries

In [1]:
pip install transformers

Collecting transformers
  Downloading transformers-2.3.0-py3-none-any.whl (447kB)

In [2]:
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

from transformers import (
    WEIGHTS_NAME,
    AdamW,
    BertConfig,
    BertForSequenceClassification,
    BertTokenizer,
    get_linear_schedule_with_warmup
)

## Import Dataset and Convert to Tokens

In [0]:
train = pd.read_csv('train.csv.zip')
test = pd.read_csv('test.csv.zip')

In [73]:
print("train shape: ", train.shape)
print("test shape:", test.shape)

train shape:  (159571, 8)
test shape: (153164, 2)


In [74]:
train.head(2)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0


**Create a single label array from the six label columns and remove newline characters (\n)**

In [75]:
# create labels
labels = train.iloc[:,2:8].values

# clean text
train['text'] = train.comment_text.replace('\n', ' ', regex=True)
test['text'] = test.comment_text.replace('\n', ' ', regex=True)

train.head(2)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,Explanation Why the edits made under my userna...
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,D'aww! He matches this background colour I'm s...


**Get Tokens using BertTokenizer**

In [0]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [0]:
train['encoded'] = train.text.apply((lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=100, pad_to_max_length=True)))
test['encoded'] = test.text.apply((lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=100, pad_to_max_length=True)))

In [0]:
# create attention masks that identify padding

def attn_msk(df, col):

  attention_masks = []

  for sent in df[col]:
    att_mask = [int(token_id > 0) for token_id in sent]
    attention_masks.append(att_mask)

  return attention_masks
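
To make the masking rule above concrete, here is a small worked example on a single padded sequence. It assumes the padding id is 0, which holds for `bert-base-uncased`; the token ids are illustrative, and `attention_mask` is a hypothetical single-sequence helper rather than part of the notebook.

```python
PAD_ID = 0  # assumption: the tokenizer pads with id 0, as bert-base-uncased does


def attention_mask(token_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(token_id != pad_id) for token_id in token_ids]


# Illustrative ids: [CLS]=101, two word pieces, [SEP]=102, then padding.
example_ids = [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask(example_ids))  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask tells the model to ignore the padded positions when computing attention.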

In [0]:
trn_msks = attn_msk(train, 'encoded')
tst_msks = attn_msk(test, 'encoded')

In [0]:
# convert data lists to tensors
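# .tolist() turns each pandas Series of token-id lists into a plain nested list first
trn_inputs, trn_masks, trn_targets = torch.tensor(train.encoded.tolist()), torch.tensor(trn_msks), torch.tensor(labels)
tst_inputs, tst_masks = torch.tensor(test.encoded.tolist()), torch.tensor(tst_msks)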

# convert targets to float / BCELoss expects float
trn_targets = trn_targets.float() 

## Create DataLoaders

In [0]:
from torch.utils.data import TensorDataset, DataLoader, random_split

batch_size = 16

# Create the train/valid dataloaders
trn_dataset = TensorDataset(trn_inputs, trn_masks, trn_targets)
trn, vld = random_split(trn_dataset, [140000, 19571])

train_dataloader = DataLoader(trn, shuffle=True, batch_size=batch_size)
valid_dataloader = DataLoader(vld, shuffle=True, batch_size=batch_size)

# prepare test data
tst_dataset = TensorDataset(tst_inputs, tst_masks)
test_dataloader = DataLoader(tst_dataset, shuffle=False, batch_size=1)


## Finetune Model

In [0]:
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [0]:
# instantiate model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', 
                                                      output_hidden_states = False,
                                                      output_attentions = False, 
                                                      num_labels=6)

# freeze all layers except final classification layer
for param in model.bert.parameters():
    param.requires_grad = False

# move model to gpu
model.to(device)

criterion = nn.BCELoss()
optimizer = AdamW(model.parameters(), lr=0.001)

In [0]:
# training loop 

def training_loop(epochs):

  for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    
    for i, data in enumerate(train_dataloader, 0):
      # get the inputs; data is a list of [inputs, attention masks, labels]
      inputs, attention_masks, labels = data[0].to(device), data[1].to(device), data[2].to(device)

      # zero the parameter gradients
      optimizer.zero_grad()

      # forward + backward + optimize
      outputs = model(inputs, attention_mask=attention_masks)
      loss = criterion(torch.sigmoid(outputs[0]), labels)  # sigmoid maps the logits to probabilities in (0, 1), as BCELoss expects
      loss.backward()
      optimizer.step()

      # print statistics
      running_loss += loss.item()
      if i % 500 == 499:    # print every 500 mini-batches
          print('[%d, %5d] loss: %.3f' %
                (epoch + 1, i + 1, running_loss / 500))
          running_loss = 0.0

  print('Finished Training')  
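
The loop above applies a sigmoid and then `BCELoss`. The same loss can be computed directly from the logits in a form that avoids taking the log of a probability that has rounded to 0 or 1; PyTorch offers this as `nn.BCEWithLogitsLoss`. The sketch below checks the equivalence in plain Python for single values. The helper names are illustrative and not part of the notebook.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(logit, target):
    """Binary cross-entropy computed the way the loop does: sigmoid first."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_from_logit(logit, target):
    """The same loss computed from the logit in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0), (4.0, 0.0)]:
    a = bce_from_probability(logit, target)
    b = bce_from_logit(logit, target)
    print(f"logit={logit:+.1f} target={target:.0f}  via sigmoid={a:.6f}  via logit={b:.6f}")
```

For moderate logits the two agree; the logit form stays finite for very large logits, where the sigmoid version would hit `log(0)`.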


In [85]:
# train classification layer of model

training_loop(1)

[1,   500] loss: 0.136
[1,  1000] loss: 0.108
[1,  1500] loss: 0.104
[1,  2000] loss: 0.100
[1,  2500] loss: 0.098
[1,  3000] loss: 0.091
[1,  3500] loss: 0.097
[1,  4000] loss: 0.088
[1,  4500] loss: 0.097
[1,  5000] loss: 0.088
[1,  5500] loss: 0.083
[1,  6000] loss: 0.088
[1,  6500] loss: 0.091
[1,  7000] loss: 0.083
[1,  7500] loss: 0.087
[1,  8000] loss: 0.091
[1,  8500] loss: 0.079
Finished Training


In [0]:
# unfreeze all BERT layers and train another epoch

for param in model.bert.parameters():
   param.requires_grad = True

optimizer = AdamW(model.parameters(), lr=0.00001)
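
The import cell brings in `get_linear_schedule_with_warmup`, but the training loop keeps the learning rate fixed. As a hedged sketch of what a linear warmup and decay schedule would do to the rate set above, the plain-Python function below computes the multiplier at a given step. The 10% warmup fraction is an assumption, not a tuned value.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 140000 // 16         # 8750 batches in one epoch of the training split
warmup_steps = total_steps // 10   # assumption: warm up over the first 10% of steps
base_lr = 1e-5

for step in (0, 437, 875, 4812, 8750):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:5d}: lr = {lr:.2e}")
```

With the library scheduler, the equivalent would be to build it with `get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)` and call `scheduler.step()` after each `optimizer.step()`.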

In [87]:
training_loop(1)

[1,   500] loss: 0.064
[1,  1000] loss: 0.050
[1,  1500] loss: 0.050
[1,  2000] loss: 0.046
[1,  2500] loss: 0.048
[1,  3000] loss: 0.043
[1,  3500] loss: 0.044
[1,  4000] loss: 0.044
[1,  4500] loss: 0.042
[1,  5000] loss: 0.043
[1,  5500] loss: 0.040
[1,  6000] loss: 0.043
[1,  6500] loss: 0.041
[1,  7000] loss: 0.039
[1,  7500] loss: 0.044
[1,  8000] loss: 0.042
[1,  8500] loss: 0.040
Finished Training


In [0]:
# Save Model
# torch.save(model.state_dict(), 'toxic_comment_model')

## Evaluate Model Using Validation Set

In [90]:
model.eval()

with torch.no_grad():
  running_loss = 0.0
  for i, data in enumerate(valid_dataloader, 0):
    # get the inputs; data is a list of [inputs, attention masks, labels]
    inputs, attention_masks, labels = data[0].to(device), data[1].to(device), data[2].to(device)
    outputs = model(inputs, attention_mask=attention_masks)
    loss = criterion(torch.sigmoid(outputs[0]), labels)
    
    running_loss += loss.item()

print(running_loss)

54.30625113845417


In [92]:
# average loss per batch: total loss divided by the number of batches, roughly len(valid) / batch_size

running_loss / 1223

0.044404130121385256
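
A note on the divisor: 19571 validation rows at a batch size of 16 do not divide evenly, and `DataLoader` keeps the final partial batch by default (`drop_last=False`). The short check below computes the batch count from the numbers shown in this notebook.

```python
import math

valid_size = 19571               # size of the validation split from random_split
batch_size = 16
total_loss = 54.30625113845417   # summed per-batch loss printed above

num_batches = math.ceil(valid_size / batch_size)  # the last batch holds the 3 leftover rows
print(num_batches)               # 1224
print(total_loss / num_batches)  # mean loss per batch
```

The count is 1224 rather than 1223, so the mean comes out marginally lower than the figure above; the difference is negligible. Using `len(valid_dataloader)` as the divisor avoids hard-coding the number.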

## Predict on test data

In [0]:
predictions=[]

for i, data in enumerate(test_dataloader, 0):
  # get the inputs; data is a list of [inputs, attention masks]
  inputs, attention_masks = data[0].to(device), data[1].to(device)
  # get predictions
  with torch.no_grad():
    outputs = model(inputs, attention_mask=attention_masks)
    outputs = torch.sigmoid(outputs[0])
  # keep every row of the batch so this also works for batch sizes above 1
  predictions.append(outputs.cpu().numpy())

In [0]:
# add predictions to test file

import numpy as np

columns=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

preds = np.vstack(predictions)
submission = pd.DataFrame(preds, columns=columns, index=test.id)
submission.reset_index(inplace=True)

In [0]:
submission.to_csv('submission.csv', index=False)

## The submission resulted in a score of **0.98334**! 

This is close to, but slightly below, the 1st place result on the private leaderboard (0.98856).  The top public leaderboard score was 0.98901.

The BERT model makes it easy to get great results, but the predictions took a long time to run.  Perhaps there is a more efficient way to load the data or calculate predictions.  I could also reduce the maximum token length and see whether I can maintain good accuracy while speeding up processing.
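
One likely cause of the slow predictions is that the test `DataLoader` uses `batch_size=1`, so the model runs once per comment. The sketch below shows batched inference that keeps every row of each batch. To stay runnable without BERT it uses stand-in data and a linear layer, both of which are assumptions; with the real model the forward call would be `model(ids, attention_mask=mask)[0]`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins so the sketch runs without BERT (assumptions):
# 10 "comments" of 100 token ids each, and a linear layer with 6 outputs.
token_ids = torch.randint(1, 1000, (10, 100))
masks = torch.ones_like(token_ids)
stand_in_model = torch.nn.Linear(100, 6)
stand_in_model.eval()

loader = DataLoader(TensorDataset(token_ids, masks), shuffle=False, batch_size=4)

batches = []
with torch.no_grad():
    for ids, mask in loader:
        # stand-in forward pass; replace with the fine-tuned model's logits
        logits = stand_in_model((ids * mask).float())
        batches.append(torch.sigmoid(logits).cpu())  # keep all rows, not just the first

preds = torch.cat(batches).numpy()
print(preds.shape)  # (10, 6)
```

The largest batch size that fits in GPU memory would have to be found by trial; this sketch makes no claim about the resulting speedup.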