# DistilBERT model
This notebook fine-tunes a DistilBERT model and exports a set of predictions for a test dataset.

**Trains on:** Waseem and Hovy (2016)

First we need to install the required packages.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.3


In [None]:
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3, xxhash, responses, multiprocess, datasets
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling u

In [None]:
!pip install tweet-preprocessor

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tweet-preprocessor
  Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [None]:
!pip install "ray[tune]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ray[tune]
  Downloading ray-2.0.0-cp37-cp37m-manylinux2014_x86_64.whl (59.4 MB)
Collecting virtualenv
  Downloading virtualenv-20.16.5-py3-none-any.whl (8.8 MB)
Collecting grpcio<=1.43.0,>=1.28.1
  Downloading grpcio-1.43.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.1 MB)
Collecting tensorboardX>=1.9
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
Collecting distlib<1,>=0.3.5
  Downloading distlib-0.3.6-py2.py3-none-any.whl (468 kB)
Collecting platformdirs<3,>=2.4
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Installing co

In [None]:
!pip install wordsegment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wordsegment
  Downloading wordsegment-1.3.1-py2.py3-none-any.whl (4.8 MB)
Installing collected packages: wordsegment
Successfully installed wordsegment-1.3.1


In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, DistilBertTokenizer
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from sklearn.model_selection import train_test_split
import pickle
import torch
from google.colab import drive
import preprocessor as p
import html
from torch.utils.data import DataLoader
import torch.nn.functional as nn
from ray.tune.schedulers import PopulationBasedTraining

# Mount drive for loading the datasets
drive.mount('/content/drive')
import sys
sys.path.insert(0, '/content/drive/MyDrive/Colab Notebooks/')

from reader import Reader

FILENAME = "drive/MyDrive/Colab Notebooks/data/twitter_data.pkl"
NUM_LABELS = 2

Mounted at /content/drive


## Split and tokenize the datasets

In [None]:
class HateDataset(torch.utils.data.Dataset):
    """Dataset class used for combining the data encodings and labels."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
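
To make the indexing in `__getitem__` concrete, the following dependency-free sketch mimics it with plain lists instead of tensors. The `encodings` dictionary and `labels` list are made-up toy values standing in for tokenizer output, and `get_item` is a hypothetical helper that is not part of this notebook.

```python
# Toy stand-in for tokenizer output: one list per field, one row per example.
# The token ids below are illustrative values, not real tokenizer output.
encodings = {
    "input_ids": [[101, 7592, 102, 0], [101, 2088, 999, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
labels = [0, 1]


def get_item(encodings, labels, idx):
    """Pick row `idx` from every field and attach its label (no tensors)."""
    item = {key: val[idx] for key, val in encodings.items()}
    item["labels"] = labels[idx]
    return item


first = get_item(encodings, labels, 0)
print(first)
```

Each item is therefore one dictionary per example, which is the shape the `Trainer` batches together during training.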

In [None]:
X, y = Reader.load(FILENAME)
X = Reader.preprocess(X)

mapping = {'racism': 1,'sexism': 1, 'none': 0}
y = [mapping[b] for b in y]

X_train, X_val, X_test, y_train, y_val, y_test = Reader.split_with_validation(X, y)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize all datasets
train_encodings = tokenizer(X_train, truncation=True, padding=True)
val_encodings = tokenizer(X_val, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)

# Combine the encodings with the labels to Torch datasets
train_dataset = HateDataset(train_encodings, y_train)
val_dataset = HateDataset(val_encodings, y_val)
test_dataset = HateDataset(test_encodings, y_test)
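
The label mapping in the cell above collapses the three original classes into a binary task. Since accuracy is the only metric tracked later, it can be useful to check how balanced the resulting labels are. The sketch below uses the same mapping on a short list of made-up raw labels; `raw_labels` is a hypothetical sample, not the real dataset.

```python
from collections import Counter

# Hypothetical raw labels in the three-class scheme used by the dataset.
raw_labels = ["none", "sexism", "none", "racism", "none", "sexism"]

# Same mapping as in the notebook: both abusive classes become the positive class.
mapping = {"racism": 1, "sexism": 1, "none": 0}
binary_labels = [mapping[label] for label in raw_labels]

# Count examples per class to see how balanced the binary task is.
balance = Counter(binary_labels)
positive_share = balance[1] / len(binary_labels)
print(binary_labels, dict(balance), positive_share)
```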

## Load accuracy metric for the model's evaluation

In [None]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
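
As a rough illustration of what `compute_metrics` calculates, the sketch below reproduces the argmax-then-compare logic in plain Python. The logits and labels are made-up values, and `argmax` and `accuracy` are hypothetical helpers written for this example rather than the `datasets` implementation.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four examples and two classes.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
result = accuracy(logits, labels)
print(result)
```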

## Setup DistilBERT model

In [None]:
def model_init():
    return DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=NUM_LABELS)
    
training_args = TrainingArguments(output_dir="train", evaluation_strategy="epoch")

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

scheduler = PopulationBasedTraining(
        metric='objective',
        mode='max',
        perturbation_interval=600.0,
        hyperparam_mutations={
            "per_device_train_batch_size": [16, 32],
            "learning_rate": [2e-5, 3e-5, 5e-5],
            "num_train_epochs": [2, 3, 4]
        })

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.21.3",
  "vocab_size": 30522
}

https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpg0ptt02y


storing https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
creating metadata file for /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight

## Training

In [None]:
best_trial = trainer.hyperparameter_search(
    direction="maximize", 
    backend="ray", 
    n_trials=10,
    scheduler=scheduler
)

In [None]:
print(best_trial)

BestRun(run_id='4e5f2_00005', objective=0.8586078309509012, hyperparameters={'learning_rate': 1.1207606211860595e-05, 'num_train_epochs': 4, 'seed': 1.8994345766152145, 'per_device_train_batch_size': 16})


In [None]:
learning_rate = best_trial.hyperparameters['learning_rate']
num_train_epochs = best_trial.hyperparameters['num_train_epochs']
per_device_train_batch_size = best_trial.hyperparameters['per_device_train_batch_size']
seed = best_trial.hyperparameters['seed']

In [None]:
# Apply the best hyperparameters found by the search.
trainer.args.learning_rate = learning_rate
trainer.args.num_train_epochs = num_train_epochs
trainer.args.per_device_train_batch_size = per_device_train_batch_size
# Use a fixed integer seed for the final run rather than the searched value.
trainer.args.seed = 42

trainer.train()

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9

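| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.3879 | 0.387145 | 0.848353 |
| 2 | 0.3197 | 0.400404 | 0.857365 |
| 3 | 0.2648 | 0.477614 | 0.852393 |
| 4 | 0.1943 | 0.534339 | 0.854568 |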


Saving model checkpoint to train/checkpoint-500
Configuration saved in train/checkpoint-500/config.json
Model weights saved in train/checkpoint-500/pytorch_model.bin
Saving model checkpoint to train/checkpoint-1000
Configuration saved in train/checkpoint-1000/config.json
Model weights saved in train/checkpoint-1000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3218
  Batch size = 8
Saving model checkpoint to train/checkpoint-1500
Configuration saved in train/checkpoint-1500/config.json
Model weights saved in train/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to train/checkpoint-2000
Configuration saved in train/checkpoint-2000/config.json
Model weights saved in train/checkpoint-2000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3218
  Batch size = 8
Saving model checkpoint to train/checkpoint-2500
Configuration saved in train/checkpoint-2500/config.json
Model weights saved in train/checkpoint-2500/pytorch_model.bin
Saving model check

TrainOutput(global_step=4828, training_loss=0.2968224997153041, metrics={'train_runtime': 324.3431, 'train_samples_per_second': 119.059, 'train_steps_per_second': 14.885, 'total_flos': 779293287474624.0, 'train_loss': 0.2968224997153041, 'epoch': 4.0})

In [None]:
path = "drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy.pth"
trainer.save_model(path)

Saving model checkpoint to drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy.pth
Configuration saved in drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy.pth/config.json
Model weights saved in drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy.pth/pytorch_model.bin


In [None]:
model = DistilBertForSequenceClassification.from_pretrained(path)

loading configuration file drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy.pth/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.3",
  "vocab_size": 30522
}

loading weights file drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy.pth/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassificatio

## Model calibration
We calibrate the model with temperature scaling: the logits are divided by a single scalar temperature T before the softmax, and the optimal T is found on the validation set.
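
The idea can be sketched without any deep learning library. The example below divides logits by a temperature before the softmax and picks the temperature with the lowest negative log-likelihood (NLL) on a toy validation set. The grid search is an assumption made for simplicity; it is a stand-in for whatever optimizer `ModelWithTemperature.set_temperature` uses internally. All logits, labels, and function names here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(all_logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(logits, temperature)[label])
        for logits, label in zip(all_logits, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(all_logits, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(all_logits, labels, t))


# Made-up, overconfident validation logits: the last example is confidently wrong.
val_logits = [[4.0, -4.0], [-3.5, 3.5], [5.0, -5.0], [4.5, -4.5]]
val_labels = [0, 1, 0, 1]

# Candidate temperatures from 0.5 to 10.0 in steps of 0.1 (assumed search range).
candidates = [0.5 + 0.1 * i for i in range(96)]
best_t = fit_temperature(val_logits, val_labels, candidates)
print(best_t, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, best_t))
```

A temperature above 1 softens the predicted probabilities and a temperature below 1 sharpens them; the predicted class does not change, because dividing by a positive constant preserves the ordering of the logits.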

In [None]:
import sys
sys.path.append("drive/MyDrive/Colab Notebooks")
from temperature_scaling import ModelWithTemperature


In [None]:
calibrated_model = ModelWithTemperature(model)
val_loader = DataLoader(val_dataset)

# Find optimal T value to calibrate the model
calibrated_model.set_temperature(val_loader)


Before temperature - NLL: 0.448, ECE: 0.073
Optimal temperature: 1.468
After temperature - NLL: 0.476, ECE: 0.123
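
The log above reports NLL and expected calibration error (ECE) before and after scaling, which is a way to check whether the fitted temperature actually improved calibration. As a minimal sketch of the second metric, the function below computes ECE with equal-width confidence bins. The choice of 10 bins and the binning scheme are assumptions for illustration and may differ from what `temperature_scaling` uses; the confidences and outcomes are made-up values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        low, high = b / n_bins, (b + 1) / n_bins
        # Bins are (low, high]; a confidence of exactly 0.0 falls in no bin.
        members = [i for i, c in enumerate(confidences) if low < c <= high]
        if not members:
            continue
        bin_acc = sum(correct[i] for i in members) / len(members)
        bin_conf = sum(confidences[i] for i in members) / len(members)
        ece += (len(members) / total) * abs(bin_acc - bin_conf)
    return ece


# Made-up predictions: top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

A perfectly calibrated model has an ECE of 0, meaning that within every bin the average confidence matches the observed accuracy.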


ModelWithTemperature(
  (model): DistilBertForSequenceClassification(
    (distilbert): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (

## Export model

In [None]:
path = "drive/MyDrive/Colab Notebooks/output/distilbert-waseem-hovy-calibrated.pth"
torch.save(calibrated_model, path)