# ARINC Fingerprinting BERT Multi-Label Classifier

Hugging Face's `BertForSequenceClassification` is set up for single-label classification out of the box (its built-in loss is `CrossEntropyLoss`), so we make a small modification: we ignore the built-in loss and apply our own loss function, `BCEWithLogitsLoss`, to the raw logits.

Also, `sigmoid` is used instead of `softmax` at the final layer because it scores each label independently, so a single comment can carry several labels at once.
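
To make the difference concrete, here is a minimal sketch in plain Python (no PyTorch required) that applies both functions to the same logits. The logit values are made up for illustration and are not model output.

```python
import math

def softmax(logits):
    """Normalize logits into one distribution that sums to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each logit to (0, 1) independently of the others."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

logits = [2.0, 1.0, -1.0]                # hypothetical scores for three labels
soft = softmax(logits)
sig = sigmoid(logits)

print("softmax:", [round(p, 3) for p in soft], "sum =", round(sum(soft), 3))
print("sigmoid:", [round(p, 3) for p in sig], "sum =", round(sum(sig), 3))
```

With softmax the labels compete for a single unit of probability, whereas with sigmoid the first two labels can both end up above 0.5 at the same time.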

For more details, see [Transformer for Multi-Label](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1).


Import related libraries:

In [1]:
'''Train with PyTorch.'''
# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
import torch.utils.data as data

# BERT Related Libraries
from transformers import BertTokenizer, BertForSequenceClassification

# Python
import pandas as pd
import numpy as np
import os
import time


Declaring machine learning parameters:

In [2]:
# ML Parameters
lr = 2e-5        # BERT fine-tuning usually needs a small rate (around 2e-5 to 5e-5); 1e-2 tends to diverge
epoch = 5
batch_size = 16


Data Source:

In [3]:
# Load the Kaggle Toxic Comments
# https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
toxic_comments = pd.read_csv("./toxic_comments.csv")
toxic_comments.head()


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Create a `Dataset` wrapper so that PyTorch can read the data above easily:

In [4]:
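class SentenceDataset(data.Dataset):
    """Wraps the toxic-comments DataFrame as (sentence, label vector) pairs."""

    LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

    def __init__(self, database):
        self.database = database

    def __len__(self):
        # Use only the first 128 rows to keep this demo fast;
        # return self.database.shape[0] to use the full dataset.
        return 128

    def __getitem__(self, idx):

        # the sentence
        sentence = self.database["comment_text"][idx]

        # the six labels as a float32 vector
        # (BCEWithLogitsLoss expects float targets matching the logits' dtype)
        label = self.database.loc[idx, self.LABELS]
        label = np.array(label, dtype=np.float32)

        return sentence, label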


Prepare the training and validation sets:

In [5]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load training dataset
dataset = SentenceDataset(toxic_comments)
print("Total: %i" % len(dataset))

# Split training and validation set
train_len = int(0.7*len(dataset))
valid_len = len(dataset) - train_len
TrainData1, ValidationData1 = random_split(dataset,[train_len, valid_len])
print("Training: %i / Testing: %i" %(len(TrainData1), len(ValidationData1)))

# Load into Iterator (each time get one batch)
train_loader = data.DataLoader(TrainData1, batch_size=batch_size, shuffle=True,drop_last=False, num_workers=0)
test_loader = data.DataLoader(ValidationData1, batch_size=batch_size, shuffle=True,drop_last=False, num_workers=0)


Total: 128
Training: 89 / Testing: 39
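
The 89 / 39 figures follow directly from the 70% split applied to 128 samples. A minimal sketch of that arithmetic in plain Python (the helper name `split_lengths` is hypothetical and not part of the notebook):

```python
def split_lengths(n_samples, train_fraction=0.7):
    """Return (train_len, valid_len) the same way the cell above computes them."""
    train_len = int(train_fraction * n_samples)   # int() truncates 89.6 down to 89
    valid_len = n_samples - train_len             # the remainder goes to validation
    return train_len, valid_len

train_len, valid_len = split_lengths(128)
print("Training: %i / Testing: %i" % (train_len, valid_len))
```

Computing the validation length as the remainder guarantees that the two lengths always add up to the dataset size, which `random_split` requires when it is given absolute lengths.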


Create model instance:

In [6]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# hard code the label dimension to be 6 (because the data has 6 classes)
num_labels = 6

# Define model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
model.to(device)

# Define tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define optimizer
#optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
optimizer = optim.AdamW(model.parameters(), lr=lr)

# Define Loss function
criterion = nn.BCEWithLogitsLoss()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
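
For intuition about the loss chosen above, the following is a minimal sketch in plain Python of what a binary cross-entropy on logits computes: a sigmoid followed by binary cross-entropy, averaged over every (sample, label) element. It is an illustration under that assumption, not PyTorch's actual implementation, and the logits and targets are made up.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable rewrite of
        #   -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def naive_bce(logits, targets):
    """Same quantity via an explicit sigmoid; fine for moderate logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Hypothetical logits for the six labels of one comment, and its true labels.
logits = [3.0, -2.0, 0.5, -4.0, 1.0, -3.0]
targets = [1, 0, 0, 0, 1, 0]

loss = bce_with_logits(logits, targets)
flipped_loss = bce_with_logits(logits, [1 - t for t in targets])
print("loss: %.4f / loss with flipped targets: %.4f" % (loss, flipped_loss))
```

Because the sigmoid is folded into the loss, the model's raw logits are passed to the criterion during training, and `sigmoid` is applied explicitly only when probabilities are needed for prediction.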

Training and testing functions:

In [7]:
###########################
# Train with training set #
###########################
def train(model, iterator, optimizer, criterion, device):

    model.train()     # Enter Train Mode
    train_loss = 0.0

    for batch_idx, (sentences, labels) in enumerate(iterator):

        print(sentences)  # debug: dump the raw batch of comments (noisy)

        # tokenize the sentences (the DataLoader yields them as a sequence of str)
        encoding = tokenizer(list(sentences), return_tensors='pt', padding=True, truncation=True)

        # move every tensor to the model's device (GPU if available)
        input_ids = encoding['input_ids'].to(device)
        attention_mask = encoding['attention_mask'].to(device)
        labels = labels.float().to(device)   # BCEWithLogitsLoss needs float targets

        # generate prediction
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)  # no labels passed, so the built-in loss is NOT used

        # compute the loss, back-propagate, and update the weights
        loss = criterion(outputs.logits, labels) # BCEWithLogitsLoss applies sigmoid internally
        loss.backward()
        optimizer.step()

        # accumulate train loss as a plain float so the graph is not kept alive
        train_loss += loss.item()

    # print completed result
    print('train_loss: %f' % (train_loss))
    return train_loss


#############################
# Validate with testing set #
#############################
def test(model, iterator, optimizer, criterion, device):

    model.eval()     # Enter Evaluation Mode
    correct = 0
    total = 0
    THRESHOLD = 0.7  # a label counts as predicted when its probability exceeds this

    with torch.no_grad():
        for batch_idx, (sentences, labels) in enumerate(iterator):

            # tokenize the sentences
            encoding = tokenizer(list(sentences), return_tensors='pt', padding=True, truncation=True)

            # move every tensor to the model's device (GPU if available)
            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)
            labels = labels.float().to(device)

            # generate prediction
            outputs = model(input_ids, attention_mask=attention_mask)  # no labels passed, so the built-in loss is NOT used
            prob = outputs.logits.sigmoid()   # convert logits to per-label probabilities

            # record processed data count: every (sample, label) pair
            total += labels.numel()

            # threshold each label independently (no argmax: several labels may be active)
            prediction = (prob > THRESHOLD).float()
            correct += prediction.eq(labels).sum().item()

    # print completed result
    acc = 100.*correct/total
    print('correct: %i  / total: %i / test_acc: %f' % (correct, total, acc))
    return acc
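
The accuracy reported by `test()` counts every (sample, label) pair separately. The sketch below reproduces that computation in plain Python on two made-up comments and compares it with a stricter exact-match score. The probabilities, labels, and helper names are hypothetical; only the 0.7 threshold comes from the function above.

```python
THRESHOLD = 0.7

def binarize(probs, threshold=THRESHOLD):
    """Turn per-label probabilities into 0/1 predictions, label by label."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]

def elementwise_accuracy(preds, labels):
    """Fraction of (sample, label) pairs predicted correctly, as in test()."""
    pairs = [(p, y) for pr, lr in zip(preds, labels) for p, y in zip(pr, lr)]
    return sum(p == y for p, y in pairs) / len(pairs)

def exact_match_accuracy(preds, labels):
    """Fraction of samples whose whole label vector is predicted correctly."""
    return sum(pr == lr for pr, lr in zip(preds, labels)) / len(preds)

# Made-up sigmoid outputs and true labels for two comments, six labels each.
probs = [
    [0.90, 0.20, 0.75, 0.10, 0.60, 0.05],
    [0.30, 0.10, 0.20, 0.05, 0.10, 0.02],
]
labels = [
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

preds = binarize(probs)
elem_acc = elementwise_accuracy(preds, labels)
exact_acc = exact_match_accuracy(preds, labels)
print("element-wise: %.3f / exact match: %.3f" % (elem_acc, exact_acc))
```

When most comments carry no label at all, the element-wise figure can look high even for a model that predicts all zeros, so a stricter metric such as exact match (or a per-label F1 score) may be worth tracking alongside it.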


Actual execution:

- Run `train()` and `test()` once per epoch, for `epoch` epochs


In [8]:
for e in range(epoch):
    
    print("===== Epoch %i =====" % e)
    
    # training
    print("Training started ...")
    train(model, train_loader, optimizer, criterion, device)

    # validation testing
    print("Testing started ...")
    test(model, test_loader, optimizer, criterion, device)



===== Epoch 0 =====
Training started ...
('FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!', '"\n\nWell, after I asked you to provide the diffs within one hour of your next edit here, you made an edit to your talk page here and then did not provide the diffs I requested within one hour of that edit. I then sanctioned you for failing to provide the requested diffs in a timely manner (which, after more than a week, you have still not done). Consequently, your request to lift the sanction is denied.  "', 'Locking this page would also violate WP:NEWBIES.  Whether you like it or not, conservatives are Wikipedians too.', "i can't believe no one has already put up this page Dilbert's Desktop Games so I did", "Oh, it's me vandalising?xD See here. Greetings,", "I'm Sorry \n\nI'm sorry I screwed around with someones talk page.  It was very bad to do.  I know how having the templates on their talk page helps you assert your dominance over them.  I know I should bow down to the almighty administrators.  

KeyboardInterrupt: 