# Jigsaw Benchmark 
> Benchmark model for Jigsaw Multilingual Toxic Comment Classification.
- toc: true
- badges: true
- comments: true
- author: Aman Arora

In this post we will create a benchmark model for the [Jigsaw Multilingual Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification) Kaggle competition.

We will be using the wonderful [transformers](https://huggingface.co/transformers/) library from Hugging Face and, as mentioned in my [previous post](https://amaarora.github.io/jigsaw/2020/04/10/JigsawIntro.html), a translated version of the test dataset to create the benchmark model.

In [1]:
from transformers import BertForSequenceClassification, BertConfig, BertTokenizer
from transformers import AdamW, get_linear_schedule_with_warmup, get_constant_schedule
from tqdm.notebook import tqdm
import torch.nn as nn
import torch
import pandas as pd


In [2]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info("test")

INFO:root:test


## Tokenizer

We create a very basic `Tokenizer` class that builds a `BertTokenizer` from a pretrained multilingual model; for this benchmark we use `bert-base-multilingual-cased`. We also write a `__call__` method that calls `encode_plus` on the `BertTokenizer`, which returns a dictionary of `input_ids`, `token_type_ids` and `attention_mask`, all of which are fed to the BERT model.

In [3]:
class Tokenizer:
    def __init__(self, model_name):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        
    def __call__(self, input, **kwargs):
        return self.tokenizer.encode_plus(input, **kwargs)

In [4]:
tok = Tokenizer('bert-base-multilingual-cased')
tok("Hello, how are you?", add_special_tokens=True)

INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/ubuntu/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729


{'input_ids': [101, 31178, 117, 14796, 10301, 13028, 136, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

## Bert Model

Next, we create a `BertModel` class that is a thin wrapper around the `BertForSequenceClassification` class from Hugging Face, and we replace the final classifier layer so that it has a single output, because this is a binary classification problem. We also define `__getattr__` so that any lookup for an attribute that is not present on `BertModel` is passed on to `self.model`, i.e., to `BertForSequenceClassification`. This way we keep all the functionality that is already available.

In [5]:
class BertModel:
    def __init__(self, model_name, no=1):
        self.model = BertForSequenceClassification.from_pretrained(model_name)
        # replace the classification head so that it produces `no` outputs
        if self.model.classifier.out_features != no: self.model.classifier = nn.Linear(self.model.config.hidden_size, no, bias=True)
    
    def __call__(self, *args, **kwargs):
        if 'target' in kwargs.keys(): del kwargs['target']
        return self.model(*args, **kwargs)
    
    def __getattr__(self, attr):
        return getattr(self.model, attr)
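
The attribute forwarding used above can be seen in isolation with a minimal sketch. The `Inner` and `Wrapper` classes below are hypothetical stand-ins (they are not part of the notebook or of the transformers library); the point is only that Python calls `__getattr__` when the normal lookup on the wrapper fails.

```python
class Inner:
    """Hypothetical stand-in for the wrapped model, for illustration only."""
    def __init__(self):
        self.hidden_size = 768

    def describe(self):
        return "inner model"


class Wrapper:
    """Thin wrapper that forwards unknown attribute lookups to `self.model`."""
    def __init__(self):
        self.model = Inner()
        self.extra = "defined on the wrapper"

    def __getattr__(self, attr):
        # Python only calls __getattr__ when normal lookup on the wrapper fails
        return getattr(self.model, attr)


w = Wrapper()
print(w.extra)        # found on the wrapper itself
print(w.hidden_size)  # not on the wrapper, so forwarded to Inner
print(w.describe())   # methods are forwarded the same way
```

One consequence of this design is that a forwarded method returns whatever the inner object returns, not the wrapper.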

In [6]:
model = BertModel('bert-base-multilingual-cased')

INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json from cache at /home/ubuntu/.cache/torch/transformers/45629519f3117b89d89fd9c740073d8e4c1f0a70f9842476185100a8afe715d1.893eae5c77904d1e9175faf98909639d3eb20cc7e13e2be395de9a0d8a0dad52
INFO:transformers.configuration_utils:Model config BertConfig {
  "_num_labels": 2,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bad_words_ids": null,
  "bos_token_id": null,
  "decoder_start_token_id": null,
  "directionality": "bidi",
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0"

Let's do a quick test to make sure that our model works. The classifier head has not been fine-tuned for toxic comment classification yet, so the outputs are not meaningful until we train the network on the target task. Still, it is good to know that we can call the model and get an output of the expected shape. Tiny tests like this confirm that what we have built so far works.

In [7]:
input_ids = torch.tensor(tok("Hello, my dog is cute", add_special_tokens=True)['input_ids'], dtype=torch.long).unsqueeze(0)  # Batch size 1
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
outputs

(tensor([[-0.1491]], grad_fn=<AddmmBackward>),)

## Dataset

Next we have to create a Dataset, which will then be used to create the `DataLoader` for training. We use the `valid_translated` file for training rather than `train.csv` because the test set and the validation set were both translated using the Yandex API. My assumption is that the wording in `train.csv` will differ from that of the translated text, so training directly on the translated validation file should work better. At least, this is my understanding.

In [8]:
df = pd.read_csv("./jigsaw_miltilingual_valid_translated.csv", usecols=['translated', 'toxic'])
df.head()

| | translated | toxic |
|---|---|---|
| 0 | This user does not even make it to the rank of... | 0 |
| 1 | The text of this entry appears to be like I di... | 0 |
| 2 | It is worth it. Only expose my past. All time ... | 1 |
| 3 | Of this article as a sub-heading with maintain... | 0 |
| 4 | I guess while they're At of the city, district... | 0 |


We create a simple `BertDataset` class that returns a dictionary containing the `target` when the dataset is used for training and omits it otherwise. We add this behavior so that the same class can be used to create both a train and a test dataset. The rest of the methods are pretty simple: we initialize our tokenizer, and in `__getitem__` we tokenize the `i`th row of the dataframe and return the `input_ids` along with `attention_mask`, `token_type_ids` and, for training, `target`.

We also pad the sequences so that we can collate them and run the model on a batch. If all sequences have the same length, they can be stacked into a single tensor and passed to the model in one call.
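
To make the padding step concrete, here is a minimal pure-Python sketch. The helper name `pad_to_max_len` is hypothetical, and the pad id of `0` is taken from the dataset code in this post; the sketch assumes the sequence has already been truncated to at most `max_len` tokens.

```python
def pad_to_max_len(input_ids, max_len, pad_id=0):
    """Right-pad a list of token ids and build the matching attention mask."""
    if len(input_ids) > max_len:
        raise ValueError("sequence is longer than max_len; truncate it first")
    padding_length = max_len - len(input_ids)
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(input_ids) + [0] * padding_length
    padded_ids = input_ids + [pad_id] * padding_length
    return padded_ids, attention_mask


# token ids from the tokenizer output shown earlier in this post
ids = [101, 31178, 117, 14796, 10301, 13028, 136, 102]
padded, mask = pad_to_max_len(ids, max_len=12)
print(padded)
print(mask)
```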

In [9]:
class BertDataset():
    def __init__(self, csv_path, usecols, model_name='bert-base-multilingual-cased', max_len=512, textcol='translated', train=True):
        self.df = pd.read_csv(csv_path, usecols=usecols)
        self.tok = Tokenizer(model_name)
        self.max_len = max_len
        self.textcol = textcol
        self.train = train

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        comment = self.df.iloc[idx][self.textcol]
        if self.train: target  = self.df.iloc[idx].toxic
    
        # encode comment
        t_out = self.tok(comment, add_special_tokens=True, max_length=self.max_len)
        input_ids = t_out['input_ids']
        token_type_ids = t_out['token_type_ids']
        attention_mask = t_out['attention_mask']

        # pad sequences 
        padding_length = self.max_len - len(input_ids)
        input_ids = input_ids + ([0] * padding_length)
        attention_mask = attention_mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)

        return {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'target': torch.tensor(target, dtype=torch.float)
        } if self.train else {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
        }

In [10]:
dataset = BertDataset(csv_path="./jigsaw_miltilingual_valid_translated.csv", usecols=['translated', 'toxic'])


## Dataloader

Next we create a `dataloader`. This works because we padded the sequences inside our dataset. If we had not padded the sequences, the `dataloader` would not have been able to collate them because of their different lengths, and it would have thrown an error.
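
The sketch below illustrates why equal lengths matter when samples are combined into a batch. It is a simplified, hypothetical `collate` function written in plain Python with lists; it is not PyTorch's actual collate implementation, which stacks tensors, but it fails for the same underlying reason.

```python
def collate(samples):
    """Group a list of per-sample dicts into one dict of per-key batches."""
    batch = {}
    for key in samples[0]:
        rows = [sample[key] for sample in samples]
        # a rectangular batch needs every row to have the same length
        if len({len(row) for row in rows}) != 1:
            raise ValueError(f"cannot batch '{key}': rows have different lengths")
        batch[key] = rows
    return batch


padded_samples = [{"input_ids": [101, 7, 102, 0]}, {"input_ids": [101, 8, 9, 102]}]
print(collate(padded_samples))

ragged_samples = [{"input_ids": [101, 7, 102]}, {"input_ids": [101, 8, 9, 102]}]
try:
    collate(ragged_samples)
except ValueError as err:
    print(err)
```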

In [11]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)

Let's do a quick test to make sure that the dataloader works. 

In [12]:
# make sure dataloader works
next(iter(dataloader))

{'input_ids': tensor([[  101, 10747, 29115,  ...,     0,     0,     0],
         [  101, 10117, 15541,  ...,     0,     0,     0],
         [  101, 10377, 10124,  ...,     0,     0,     0],
         [  101, 12610, 10531,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'target': tensor([0., 0., 1., 0.])}

In [13]:
model(**next(iter(dataloader)))

(tensor([[-0.0938],
         [-0.1023],
         [-0.0882],
         [-0.1240]], grad_fn=<AddmmBackward>),)

## Train

Next, we create a `loss_fn` that computes the loss between the model outputs (logits) and the targets.

In [14]:
def loss_fn(outputs, targets):
    return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))
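
For intuition, binary cross-entropy on a logit can be written out by hand. The sketch below compares the textbook formula (sigmoid followed by log loss) with a commonly used numerically stable form for a single example. The function names are hypothetical, and this is not a claim about how PyTorch implements `BCEWithLogitsLoss` internally.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(logit, target):
    """Sigmoid followed by the log loss -[y*log(p) + (1-y)*log(1-p)]."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Same quantity, rearranged as max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# the first logit is the untrained model output shown earlier in this post
for logit, target in [(-0.1491, 0.0), (-0.1491, 1.0), (2.0, 1.0)]:
    print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))
```

The rearranged form never takes the logarithm of a value that has been rounded to zero, which is why working directly on logits is preferred over applying a sigmoid first.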

We also split the model parameters into two groups, one with `weight_decay` and one without it (biases and `LayerNorm` parameters), and pass both param groups to the optimizer.

In [15]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]
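
The grouping above works by substring matching on parameter names. Here is a small sketch of the same filter applied to a handful of names; the names are hypothetical examples written in the style of a BERT model, not a dump of the real parameter list.

```python
# Hypothetical parameter names, for illustration only
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.output.LayerNorm.weight",
    "bert.encoder.layer.0.output.LayerNorm.bias",
    "classifier.weight",
    "classifier.bias",
]
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

# a name goes to the no-decay group if it contains any of the substrings
decay_group = [n for n in names if not any(nd in n for nd in no_decay)]
no_decay_group = [n for n in names if any(nd in n for nd in no_decay)]

print("weight decay:", decay_group)
print("no weight decay:", no_decay_group)
```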

In [16]:
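# total scheduler steps for 5 epochs, computed here for a batch size of 64;
# the DataLoader above uses batch_size=4 and the scheduler steps once per batch,
# so the linear schedule reaches zero before training ends
# (len(dataloader) * 5 would span the full run)
num_train_steps = int(len(dataset) / 64 * 5)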

Next we initialize our optimizer and scheduler, and set `device` to `cuda`.

In [17]:
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_train_steps
)
device = 'cuda'
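
As a rough sketch of what a linear schedule with warmup does, the function below computes the factor by which the base learning rate is scaled at a given step: it rises linearly during warmup and then decays linearly to zero. The function name and the step counts are hypothetical and chosen only for illustration.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps, clamped at 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
total_steps = 1000  # hypothetical value, for illustration only
for step in (0, 250, 500, 1000, 1200):
    print(step, base_lr * linear_schedule_multiplier(step, 0, total_steps))
```

Once `step` passes `num_training_steps`, the multiplier stays at zero, so any further optimizer steps no longer update the weights.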

In [18]:
model = model.to(device)

Below is a very basic training loop since this is a benchmark model. 

In [19]:
model.train()

for epoch in tqdm(range(5)):
    for bi, d in tqdm(enumerate(dataloader), total=len(dataloader)):
        input_ids = d["input_ids"]
        attention_mask = d["attention_mask"]
        token_type_ids = d["token_type_ids"]
        targets = d["target"]

        input_ids = input_ids.to(device, dtype=torch.long)
        attention_mask = attention_mask.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        targets = targets.to(device, dtype=torch.float)

        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )[0]

        loss = loss_fn(outputs, targets)

        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()


## Predictions

Once the model is trained, we are ready to make predictions. We create a test dataset from the `jigsaw_miltilingual_test_translated.csv` file and make predictions on it.

In [20]:
test_dataset = BertDataset(csv_path="./jigsaw_miltilingual_test_translated.csv", usecols=['translated'], textcol='translated', train=False)


In [21]:
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=8)

Below is a very basic prediction loop. It is a copy of the training loop without `loss.backward()` and the optimizer steps, wrapped in `torch.no_grad()` to save GPU memory, because we do not need gradients during inference.

In [22]:
model.eval()
preds = []
with torch.no_grad():    
    for bi, d in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):
        input_ids = d["input_ids"]
        attention_mask = d["attention_mask"]
        token_type_ids = d["token_type_ids"]

        input_ids = input_ids.to(device, dtype=torch.long)
        attention_mask = attention_mask.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )[0]
        outputs_np = outputs.cpu().detach().numpy().tolist()
        preds.extend(outputs_np)


We take a sigmoid to convert the predictions to range (0,1).

In [34]:
_preds = torch.sigmoid(torch.tensor(preds))
_preds.shape

torch.Size([63812, 1])

In [35]:
_preds.squeeze()

tensor([0.0161, 0.0169, 0.4494,  ..., 0.2997, 0.0194, 0.0184])
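
The sigmoid-and-squeeze step above can be sketched in plain Python as follows. The logits are made-up values, not model outputs, and `stable_sigmoid` is a hypothetical helper that avoids overflow for large negative inputs.

```python
import math


def stable_sigmoid(x):
    """Map a logit to a probability in (0, 1) without overflowing exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# made-up logits with shape [N, 1], mirroring the nested lists collected in `preds`
example_logits = [[-4.1], [0.0], [1.2], [-800.0]]

# apply the sigmoid and flatten [N, 1] to [N], like `_preds.squeeze()`
probs = [stable_sigmoid(row[0]) for row in example_logits]
print(probs)
```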

In [37]:
submission = pd.read_csv("./sample_submission.csv")
submission['toxic'] = _preds.squeeze()
submission.to_csv("submission.csv", index=False)

Now we can simply upload this model to a Kaggle Kernel and make predictions for the competition.

In [None]:
torch.save(model.state_dict(), './first.bin')

I have created a basic benchmark kernel [here](https://www.kaggle.com/aroraaman/first-jigsaw-inference), and this benchmark model gets a score of `0.8808` even though it was trained on very little data and in very little time. This is pretty encouraging.

More to come in the next blogposts!