# DistilBERT model
This notebook fine-tunes the DistilBERT model as a binary hate-speech classifier on Twitter data and exports a set of predictions for a test dataset.

First, install the required packages.

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install torch



In [3]:
!pip install datasets



In [4]:
!pip install tweet-preprocessor



In [5]:
# Standard library
import html
import pickle
import sys

# Third-party packages
import numpy as np
import preprocessor as p
import torch
from datasets import load_metric, load_dataset, Dataset
from google.colab import drive
from sklearn.model_selection import train_test_split
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, DistilBertTokenizer

# Mount drive for loading the datasets
drive.mount('/content/drive')
sys.path.insert(0, '/content/drive/MyDrive/Colab Notebooks/')

FILENAME = "drive/MyDrive/Colab Notebooks/data/twitter_data.pkl"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Split and tokenize the datasets

In [6]:
class HateDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

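def preprocess(data):
    # Only URLs and mentions are handled; other tweet features are left untouched
    p.set_options(p.OPT.URL, p.OPT.MENTION)

    # Decode HTML entities first, then replace URLs and mentions with placeholder tokens
    return [p.tokenize(html.unescape(text)) for text in data]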


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

def convert_waseem_data(data):
    # Collapse the 'racism' and 'sexism' labels into one positive class (1); everything else is 0
    X = []
    y = []
    for i in range(len(data)):
        X.append(data[i]['text'])
        label = 1 if data[i]['label'] in ['racism', 'sexism'] else 0
        y.append(label)

    return X, y
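
To make the effect of `preprocess` concrete, here is a minimal sketch that approximates it with the standard library only. The regular expressions are simplified stand-ins, not the rules used by tweet-preprocessor, and the placeholder strings `$URL$` and `$MENTION$` are assumed to match what `p.tokenize` emits. The name `approx_preprocess` and the sample tweet are hypothetical.

```python
import html
import re

# Simplified stand-in patterns; tweet-preprocessor's own rules are more thorough.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def approx_preprocess(texts):
    """Unescape HTML entities, then mask URLs and @-mentions with placeholders."""
    cleaned = []
    for text in texts:
        text = html.unescape(text)                # "&amp;" becomes "&"
        text = URL_RE.sub("$URL$", text)          # assumed placeholder token
        text = MENTION_RE.sub("$MENTION$", text)  # assumed placeholder token
        cleaned.append(text)
    return cleaned

sample = ["@user check this &amp; that https://t.co/abc123"]
print(approx_preprocess(sample))
# ['$MENTION$ check this & that $URL$']
```

Masking mentions and links this way keeps the sentence structure while removing tokens that carry little signal for the classifier.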

In [7]:
# Open the file in a context manager so the handle is closed after loading
with open(FILENAME, 'rb') as f:
    data = pickle.load(f)
X, y = convert_waseem_data(data)
print(X[:10])
# X = preprocess(X)
print(X[:10])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.10)
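# Seed and stratify the validation split as well, so it is reproducible and keeps the class balance
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42, stratify=y_train, test_size=0.2)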
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

train_encodings = tokenizer(X_train, truncation=True, padding=True)
val_encodings = tokenizer(X_val, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)

train_dataset = HateDataset(train_encodings, y_train)
val_dataset = HateDataset(val_encodings, y_val)
test_dataset = HateDataset(test_encodings, y_test)

['rt @colonelkickhead: another bloody instant restaurant week ?  !  ?  !  seriously !  they just jumped the shark riding two other sharks powered by sh…', '@azzamalirhabi @jihadia8 this video of the peshmerga decimating isis is far more interesting. https://t.co/d36g1z12np', "oh really ?  no more instant restaurants ?  that's shocking. #mkr #mkr2015", 'rt @benfrancisallen: it has not been a good few weeks for #isis. a new front has opened up in #sinjar and they are about to lose the battle f…', 'rt @notofeminism: i don’t need femisnsn because men carry heavy things that i cannot !  !  !  like shopping ,  boxes ,  and a huge sense of superiori…', '@mariachimacabre 19% is not the vast majority', '@dianh4 @exposefalsehood and it is muslims who were the first crusaders ,  attacking the christian world for centuries before it attacked back', '@truaemusic @mattybboi83 @number10gov capital hill is a great example of how seldom the world attacks islam given the daily provocations.', 'rt @fruit
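
The two calls to `train_test_split` leave about 72% of the tweets for training, 18% for validation, and 10% for testing. The sketch below reproduces the size of the second split, assuming scikit-learn rounds a fractional test share up to the next whole sample. The pool size 14481 is the sum of the training and validation counts reported in the training log (11584 + 2897); `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

# Samples remaining after the 10% test split, taken from the training log counts.
pool = 11584 + 2897
n_train, n_val = split_sizes(pool, 0.2)
print(n_train, n_val)
# 11584 2897
```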

## Load accuracy metric for the model's evaluation

In [8]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
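
Newer releases of `datasets` deprecate `load_metric` in favor of the separate `evaluate` library, so the cell above may not run on a current install. As a fallback, the same accuracy can be computed with NumPy alone. This is a minimal sketch: `compute_accuracy` is a hypothetical name, and the logits and labels are made-up values for illustration.

```python
import numpy as np

def compute_accuracy(eval_pred):
    """Accuracy from a (logits, labels) pair, returned as a dict like the metric's output."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # index of the highest-scoring class per row
    return {"accuracy": float(np.mean(predictions == np.asarray(labels)))}

# Four toy examples with two classes each.
logits = np.array([[2.0, 0.1], [0.3, 1.2], [0.9, 0.8], [0.2, 0.4]])
labels = np.array([0, 1, 1, 1])
print(compute_accuracy((logits, labels)))
# {'accuracy': 0.75}
```

A function of this shape can be passed as `compute_metrics` to the `Trainer`, since it takes the same `(logits, labels)` pair and returns a dictionary of named scores.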

## Setup DistilBERT model

In [9]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", per_device_train_batch_size=256)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

## Training

In [10]:
trainer.train()

***** Running training *****
  Num examples = 11584
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 138


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.364312 | 0.838108 |
| 2 | No log | 0.350254 | 0.848464 |
| 3 | No log | 0.341716 | 0.856748 |


***** Running Evaluation *****
  Num examples = 2897
  Batch size = 8
***** Running Evaluation *****
  Num examples = 2897
  Batch size = 8
***** Running Evaluation *****
  Num examples = 2897
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=138, training_loss=0.34750443610592163, metrics={'train_runtime': 241.1121, 'train_samples_per_second': 144.132, 'train_steps_per_second': 0.572, 'total_flos': 800218996856064.0, 'train_loss': 0.34750443610592163, 'epoch': 3.0})
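
The step count in the log follows directly from the dataset size and batch size, as the short calculation below shows. It also suggests why the Training Loss column reads "No log": assuming the `Trainer` logs training loss every 500 steps by default, a 138-step run finishes before the first logging step. The value 500 is an assumption about the default `logging_steps`; the other numbers come from the log above.

```python
import math

num_examples = 11584         # "Num examples" in the training log
batch_size = 256             # per_device_train_batch_size
num_epochs = 3               # "Num Epochs" in the training log
default_logging_steps = 500  # assumed default for TrainingArguments.logging_steps

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
# 46 138

# True means training ends before the first logging step is reached.
print(total_steps < default_logging_steps)
# True
```

Under that assumption, passing a smaller `logging_steps` to `TrainingArguments` would make the training loss appear in the per-epoch table.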

## Export

In [13]:
torch.save(model, 'distilbert.pth')
from google.colab import files

files.download('distilbert.pth')


## Load existing model

In [12]:
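# Option 1: the file was written with torch.save(model, ...), so it holds the whole model object.
# Recent PyTorch releases may need weights_only=False to unpickle a full model.
# model = torch.load("drive/MyDrive/Colab Notebooks/models/distilbert.pth")
#
# Option 2: only valid if the file holds a state_dict, i.e. it was saved with torch.save(model.state_dict(), ...).
# load_state_dict expects the loaded dictionary, not a file path.
# model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# model.load_state_dict(torch.load("drive/MyDrive/Colab Notebooks/models/distilbert.pth"))
# model.eval()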