# SQuAD Q&A

This notebook contains training scripts for question answering models on the SQuAD v1.1 dataset. The task consists of selecting the answer to a given question as a span of words in the given context paragraph. The newest version of the dataset (v2.0) also contains unanswerable questions, but the one we worked on (v1.1) does not.
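
Since every answer is a span of the context, a model for this task typically produces one score per context token for the span start and one for the span end. The following is a minimal sketch of how such scores could be decoded into a span. The function name `best_span`, the `max_answer_len` cap and the toy scores are illustrative assumptions, not code taken from this repository.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float], end_scores: List[float], max_answer_len: int = 30
) -> Tuple[int, int]:
    """Return the (start, end) token indices (end inclusive) that maximize
    start_scores[start] + end_scores[end], subject to start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy example: hypothetical scores for a seven-token context.
context_tokens = ["the", "academy", "was", "founded", "in", "1735", "."]
start_scores = [0.1, 0.2, 0.0, 0.3, 0.5, 2.0, 0.1]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.1, 1.8, 0.4]

start, end = best_span(start_scores, end_scores)
print(context_tokens[start : end + 1])
```

The `start <= end` constraint matters: taking the two argmaxes independently could return an end that precedes the start.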

## Colab requirements

Before restarting the runtime (remember to select a GPU runtime)$\dots$

In [None]:
!git clone https://github.com/Wadaboa/squad-question-answering.git
!pip install -r squad-question-answering/init/base_requirements.txt

After restarting the runtime$\dots$

In [None]:
import os, sys

sys.path.insert(0, "squad-question-answering")
os.chdir("squad-question-answering")

## Imports

In order to import source files, we have to add the `src` folder to the Python path$\dots$ 

In [3]:
import sys

sys.path.insert(0, "src")

Then, we can import packages as usual$\dots$

In [4]:
import os
from functools import partial

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import wandb
import transformers
from transformers.trainer_utils import set_seed

import dataset
import model
import training
import tokenizer
import utils

%load_ext autoreload
%autoreload 2

## Initialization

In this section we perform the initialization required by the libraries used throughout the notebook$\dots$

### Weights & biases

Training and evaluation metrics, along with model checkpoints and results, are logged directly to a [W&B](https://wandb.ai/) project, which is openly accessible [here](https://wandb.ai/wadaboa/squad-qa). Logging permissions are granted only to members of the team, so if you want to launch your own training run you have to disable wandb by setting the environment variable `WANDB_DISABLED` to an empty value in the following block (`%env WANDB_DISABLED=`).

In [3]:
%env WANDB_PROJECT=squad-qa
%env WANDB_ENTITY=wadaboa
%env WANDB_MODE=online
%env WANDB_RESUME=never
%env WANDB_WATCH=false
%env WANDB_SILENT=true

env: WANDB_PROJECT=squad-qa
env: WANDB_ENTITY=wadaboa
env: WANDB_MODE=online
env: WANDB_RESUME=never
env: WANDB_WATCH=false
env: WANDB_SILENT=true


Make sure you are logged in: if the system prompts you for a key, head over to [W&B](https://wandb.ai/authorize), log in, and the key should appear on the web page$\dots$

In [4]:
!wandb login

Make sure `wandb` is enabled system-wide$\dots$

In [5]:
!wandb enabled

W&B enabled.


### PyTorch and numpy

Set the random seed to a fixed number for reproducible results$\dots$

In [6]:
RANDOM_SEED = 42
set_seed(RANDOM_SEED)

Get the fastest device (GPU if available, else CPU as a fallback) to be used for training neural models in `PyTorch`$\dots$

In [7]:
DEVICE = utils.get_device()
DEVICE

device(type='cpu')

If a GPU device is available, print related info such as the GPU type and its current usage$\dots$

In [8]:
if DEVICE.type != "cpu":
    !nvidia-smi

## Preliminaries

In this section we perform some preliminary steps, like data loading and the definition of common variables$\dots$

### Raw data loading

The `SquadDataset` class holds a "raw" copy of the training set and the test set (if given). By "raw", we mean that questions and contexts are not pre-processed at this stage: they are simply loaded from the given `JSON` files into `Pandas` `DataFrame`s.
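
To make the loading step concrete, here is a minimal sketch that flattens a document in the public SQuAD v1.1 `JSON` layout into one row per answer, using the column names that appear in the frames displayed in this notebook. The helper name `flatten_squad`, the toy document, and the convention that `answer_end` is `answer_start` plus the answer length are assumptions (the last one is consistent with the rows displayed here). The actual `SquadDataset` implementation may differ.

```python
import json

# A tiny document in the SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
raw = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example_title",
      "paragraphs": [
        {
          "context": "The academy was founded in 1735.",
          "qas": [
            {
              "id": "q1",
              "question": "What year was the academy founded?",
              "answers": [{"text": "1735", "answer_start": 27}]
            }
          ]
        }
      ]
    }
  ]
}
""")


def flatten_squad(squad: dict) -> list:
    """Return one dict per (question, answer) pair."""
    rows = []
    context_id = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    rows.append(
                        {
                            "answer_start": start,
                            "answer": answer["text"],
                            "title": article["title"],
                            "context": paragraph["context"],
                            "question_id": qa["id"],
                            "question": qa["question"],
                            "context_id": context_id,
                            # Assumption: exclusive end offset in characters.
                            "answer_end": start + len(answer["text"]),
                        }
                    )
            context_id += 1
    return rows


rows = flatten_squad(raw)
row = rows[0]
print(row["context"][row["answer_start"] : row["answer_end"]])
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame` to obtain a frame with these columns.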

In [9]:
DATA_FOLDER = os.path.join(os.getcwd(), "data")
TRAIN_DATA_FOLDER = os.path.join(DATA_FOLDER, "training")
TRAIN_SET_PATH = os.path.join(TRAIN_DATA_FOLDER, "training_set.json")
TEST_DATA_FOLDER = os.path.join(DATA_FOLDER, "testing")
TEST_SET_PATH = os.path.join(TEST_DATA_FOLDER, "test_set.json")

The `subset` argument loads a random subset of both the training set and the test set. It is meant for debugging only, so set `subset` to $1.0$ for real training runs.

In [10]:
squad_dataset = dataset.SquadDataset(
    train_set_path=TRAIN_SET_PATH, test_set_path=TEST_SET_PATH, subset=0.00005
)
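
As an illustration of what a fractional `subset` could mean, the sketch below samples a fixed fraction of rows with a seeded random generator. The helper `sample_subset`, its rounding rule and the "at least one row" floor are assumptions made for this example; `SquadDataset` may sample differently.

```python
import random


def sample_subset(rows: list, subset: float, seed: int = 42) -> list:
    """Return a random fraction `subset` of `rows` (at least one row)."""
    if not 0.0 < subset <= 1.0:
        raise ValueError("subset must be in (0, 1]")
    if subset == 1.0:
        # Full dataset: nothing to sample.
        return list(rows)
    k = max(1, round(len(rows) * subset))
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(rows, k)


population = list(range(100_000))
debug_rows = sample_subset(population, 0.00005)
print(len(debug_rows))
```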

Let's visualize the "raw" training set$\dots$

In [11]:
squad_dataset.raw_train_df

Unnamed: 0,answer_start,answer,title,context,question_id,question,context_id,answer_end
0,167,1735,Institute_of_technology,The world's first institution of technology or...,56de4d9ecffd8e1900b4b7e2,What year was the Banská Akadémia founded?,1860,171
1,793,SOS-based speed,Film_speed,The standard specifies how speed ratings shoul...,572674a05951b619008f7319,What is another speed that can also be reporte...,9354,808
2,421,Sumerian temples and palaces,Sumer,The most impressive and famous of Sumerian bui...,5730bb058ab72b1400f9c72c,Where were the use of advanced materials and t...,17505,449
3,192,mayor,"Ann_Arbor,_Michigan",Ann Arbor has a council-manager form of govern...,572781a5f1498d1400e8fa1f,Who is elected every even numbered year?,10585,197


Let's visualize the "raw" test set$\dots$

In [12]:
squad_dataset.raw_test_df

Unnamed: 0,answer_start,answer,title,context,question_id,question,context_id,answer_end


### Utils

This section contains common variables and functions to be used when training all the subsequent models$\dots$

The following cell loads the default training parameters, such as the batch size, the logging frequency, the location where model checkpoints are saved and more$\dots$

In [22]:
TRAINER_ARGS = utils.get_default_trainer_args()
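
Later cells call `partial(TRAINER_ARGS, ...)`, which suggests that `TRAINER_ARGS` is a callable with the default values already bound. The sketch below shows that layering pattern with a hypothetical stand-in class, `TrainerArgs`. The class, its fields and the values used are assumptions for illustration only, not the contents of `utils.get_default_trainer_args`.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class TrainerArgs:
    """Hypothetical stand-in for a training-arguments class."""

    output_dir: str
    run_name: str = "run"
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 8
    logging_steps: int = 1


# Project-wide defaults, bound once.
default_args = partial(TrainerArgs, logging_steps=5)

# Per-model overrides layered on top of the defaults.
model_args = partial(
    default_args, output_dir="./checkpoints/demo", num_train_epochs=30
)

# The remaining fields are supplied only when the object is finally built.
args = model_args(run_name="demo-run")
print(args)
```

Keyword arguments passed at call time take precedence over those stored in the `partial`, so a single run can still override a bound default.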

## Recurrent models

This section contains blocks of code that are used to train question answering models, which are based on recurrent modules (`LSTM`s in our case). The models that we implemented are the following:
- Baseline
- BiDAF (Bi-Directional Attention Flow)

### Embeddings

In this section we are going to load an embedding matrix using the `Gensim` API and use the corresponding matrix as the weight block of an `nn.Embedding` `PyTorch` module$\dots$

First of all, we define one token for padding values. OOV (out-of-vocabulary) words are then handled by a single unknown token, whose embedding is estimated as the mean of all the embedding vectors (if this mean vector is already present in the model, a random embedding with suitable ranges is used instead).

In [12]:
UNK_TOKEN = "[UNK]"
PAD_TOKEN = "[PAD]"
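
The unknown-token rule described above can be sketched in plain Python as follows. The helper names, the toy vocabulary, the reading of "suitable ranges" as the per-dimension minimum and maximum, and the all-zero padding vector are assumptions for this example. The real logic lives in `utils.load_embedding_model` and would operate on `numpy` arrays rather than lists.

```python
import random
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_unk_vector(embeddings: Dict[str, Vector], seed: int = 42) -> Vector:
    """Mean of all vectors, or a random in-range vector if the mean is taken."""
    vectors = list(embeddings.values())
    unk = mean_vector(vectors)
    rng = random.Random(seed)
    while unk in vectors:
        # Assumption: "suitable ranges" means the per-dimension min and max.
        unk = [rng.uniform(min(col), max(col)) for col in zip(*vectors)]
    return unk


toy = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "bird": [2.0, 2.0]}
unk = build_unk_vector(toy)
pad = [0.0] * len(unk)  # assumption: padding maps to the zero vector
print(unk, pad)
```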

List of available embedding models (see [here](https://github.com/RaRe-Technologies/gensim-data)):
- FastText: 
    - _fasttext-wiki-news-subwords_ (dimensions: 300)
- GloVe:
    - _glove-twitter_ (dimensions: 25, 50, 100, 200)
    - _glove-wiki-gigaword_ (dimensions: 50, 100, 200, 300)
- Word2Vec:
    - _word2vec-google-news_ (dimensions: 300)
    - _word2vec-ruscorpora_ (dimensions: 300)

**Note**: The following cell could take a while (depending on the embedding dimension), since embedding models are pretty large$\dots$

In [13]:
EMBEDDING_DIMENSION = 25
EMBEDDING_MODEL_NAME = "glove-twitter"
embedding_model, vocab = utils.load_embedding_model(
    EMBEDDING_MODEL_NAME,
    embedding_dimension=EMBEDDING_DIMENSION,
    unk_token=UNK_TOKEN,
    pad_token=PAD_TOKEN,
)

The following cell loads the embedding model into a `PyTorch` `nn.Embedding` layer with frozen weights.

In [14]:
embedding_layer = model.get_embedding_module(embedding_model, pad_id=vocab[PAD_TOKEN])

### Data loading

The `SquadDataManager` class acts as both a data collator (i.e. bringing together multiple examples in the dataset with the help of `PyTorch`'s `DataLoader`s) and a tokenizer. In particular, tokenization happens on the fly at the batch level, thus enabling us to perform dynamic padding (based on the longest sequence in a batch) and avoiding the pre-tokenization overhead.
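
Dynamic padding can be illustrated with a small collate function that pads each batch only up to its own longest sequence. This is a plain-Python sketch: the function name `pad_batch` and the output keys are assumptions, and the actual collator in this repository works on tensors.

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": mask}


batch_a = pad_batch([[5, 8, 2], [7]])  # padded to length 3
batch_b = pad_batch([[4, 4, 4, 4, 4], [9, 1]])  # padded to length 5
print(batch_a["input_ids"], batch_b["input_ids"])
```

With a fixed global length every batch would pay for the longest sequence in the whole dataset. Here a batch of short sequences stays short.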

The tokenizer that we use for recurrent modules splits words on whitespace and punctuation, removes accents and lowercases all the tokens. Moreover, questions are padded (not truncated), while contexts are truncated to a maximum number of tokens and padded.

In [39]:
MAX_CONTEXT_TOKENS = 300

In [40]:
recurrent_tokenizer = tokenizer.get_recurrent_tokenizer(
    vocab, MAX_CONTEXT_TOKENS, unk_token=UNK_TOKEN, pad_token=PAD_TOKEN, device=DEVICE,
)
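
The normalization steps performed by the recurrent tokenizer can be approximated with the standard library alone. The sketch below is an assumption about the behaviour described above (split on whitespace and punctuation, strip accents, lowercase, optionally truncate) and is not the implementation behind `tokenizer.get_recurrent_tokenizer`.

```python
import re
import unicodedata
from typing import List, Optional

# A token is either a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text: str, max_tokens: Optional[int] = None) -> List[str]:
    """Lowercase, strip accents, split, and optionally truncate."""
    tokens = TOKEN_PATTERN.findall(strip_accents(text).lower())
    return tokens if max_tokens is None else tokens[:max_tokens]


question = "What year was the Banská Akadémia founded?"
print(tokenize(question))
print(tokenize(question, max_tokens=3))
```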

The `SquadDataManager` class also acts as a pre-processor, with the following steps:
- Removes rows that contain wrong answers (e.g. answers that do not start and end at word boundaries)
- Removes rows that contain answers that would be lost due to tokenization (truncation in particular)
- Groups answers to the same question and context pair into a single row (thus producing lists in the `answer`, `answer_start` and `answer_end` columns)
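
Two of the steps above, the word-boundary check and the grouping of answers, can be sketched as follows. The function names and the toy rows are assumptions, offsets are treated as character positions with an exclusive end, and the truncation-based filter is left out because it depends on the tokenizer. `SquadDataManager` itself works on `DataFrame`s.

```python
from typing import Dict, List


def at_word_boundaries(context: str, start: int, end: int) -> bool:
    """True if context[start:end] neither starts nor ends inside a word."""
    left_ok = start == 0 or not context[start - 1].isalnum()
    right_ok = end >= len(context) or not context[end].isalnum()
    return left_ok and right_ok


def group_answers(rows: List[dict]) -> List[dict]:
    """Drop misaligned answers, then merge answers per (question, context)."""
    grouped: Dict[tuple, dict] = {}
    for row in rows:
        start, end = row["answer_start"], row["answer_end"]
        if not at_word_boundaries(row["context"], start, end):
            continue
        key = (row["question_id"], row["context_id"])
        entry = grouped.setdefault(
            key,
            {
                "question_id": row["question_id"],
                "context_id": row["context_id"],
                "context": row["context"],
                "answer": [],
                "answer_start": [],
                "answer_end": [],
            },
        )
        entry["answer"].append(row["answer"])
        entry["answer_start"].append(start)
        entry["answer_end"].append(end)
    return list(grouped.values())


context = "Ann Arbor elects a mayor every even year."
base = {"question_id": "q1", "context_id": 0, "context": context}
rows = [
    {**base, "answer": "mayor", "answer_start": 19, "answer_end": 24},
    {**base, "answer": "a mayor", "answer_start": 17, "answer_end": 24},
    {**base, "answer": "ayor", "answer_start": 20, "answer_end": 24},  # misaligned
]
grouped_rows = group_answers(rows)
print(grouped_rows[0]["answer"])
```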

In [42]:
recurrent_dm = dataset.SquadDataManager(
    squad_dataset, recurrent_tokenizer, device=DEVICE
)

The last task assigned to the `SquadDataManager` class is that of train/validation splitting, with the given ratio ($80\%$ for the training set by default).
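
A seeded shuffle followed by a cut at the given ratio is one simple way to obtain such a split. The helper below is a sketch under that assumption and may not match how `SquadDataManager` actually splits the data.

```python
import random
from typing import List, Tuple


def train_val_split(
    rows: List[dict], train_ratio: float = 0.8, seed: int = 42
) -> Tuple[List[dict], List[dict]]:
    """Shuffle row indices with a fixed seed and cut at `train_ratio`."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_ratio)
    train = [rows[i] for i in indices[:cut]]
    val = [rows[i] for i in indices[cut:]]
    return train, val


rows = [{"id": i} for i in range(4)]
train, val = train_val_split(rows)
print(len(train), len(val))
```

Splitting after answers have been grouped keeps all the answers to one question on the same side of the split.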

Let's have a look at the final training dataset$\dots$

In [43]:
recurrent_dm.train_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,56de4d9ecffd8e1900b4b7e2,What year was the Banská Akadémia founded?,Institute_of_technology,1860,The world's first institution of technology or...,[1735],[167],[171]
1,572781a5f1498d1400e8fa1f,Who is elected every even numbered year?,"Ann_Arbor,_Michigan",10585,Ann Arbor has a council-manager form of govern...,[mayor],[192],[197]
2,5730bb058ab72b1400f9c72c,Where were the use of advanced materials and t...,Sumer,17505,The most impressive and famous of Sumerian bui...,[Sumerian temples and palaces],[421],[449]


Let's have a look at the final validation dataset$\dots$

In [44]:
recurrent_dm.val_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,572674a05951b619008f7319,What is another speed that can also be reporte...,Film_speed,9354,The standard specifies how speed ratings shoul...,[SOS-based speed],[793],[808]


And let's do the same for the testing dataset$\dots$

In [45]:
recurrent_dm.test_df

Unnamed: 0,index,answer,answer_start,answer_end


### Baseline model

In [57]:
%env WANDB_RUN_GROUP=baseline
baseline_run_name = utils.get_run_name()
baseline_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{baseline_run_name}",
    num_train_epochs=30,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
)

env: WANDB_RUN_GROUP=baseline


#### Training and validation

In [87]:
baseline_model = model.QABaselineModel(
    embedding_layer, MAX_CONTEXT_TOKENS, device=DEVICE
)
print(f"The baseline model has {baseline_model.count_parameters()} parameters")

The baseline model has 245202 parameters


In [88]:
baseline_optimizer = optim.Adam(baseline_model.parameters(), lr=1e-3)
baseline_lr_scheduler = transformers.get_constant_schedule(baseline_optimizer)

In [89]:
baseline_trainer = training.SquadTrainer(
    model=baseline_model,
    args=baseline_args(run_name=baseline_run_name),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.train_dataset,
    eval_dataset=recurrent_dm.val_dataset,
    optimizers=(baseline_optimizer, baseline_lr_scheduler),
)

In [90]:
baseline_trainer.train()

torch.Size([3, 172, 100]) torch.Size([3, 172, 100]) torch.Size([3, 100])


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Em,Runtime,Samples Per Second
1,6.0152,7.586694,0.0,0.0,0.0,0.0,0.0,0.1184,8.444
2,6.0152,7.584815,0.0,0.0,0.0,0.0,0.0,0.119,8.4
3,6.0152,7.58181,0.0,0.0,0.0,0.0,0.0,0.1092,9.161
4,6.0152,7.57692,0.0,0.0,0.0,0.0,0.0,0.1089,9.184
5,7.3691,7.570308,0.0,0.0,0.0,0.0,0.0,0.1084,9.223
6,7.3691,7.560383,0.0,0.0,0.0,0.0,0.0,0.1112,8.989
7,7.3691,7.5468,0.0,0.0,0.0,0.0,0.0,0.1118,8.941


torch.Size([1, 194, 100]) torch.Size([1, 194, 100]) torch.Size([1, 100])

KeyboardInterrupt: 

#### Training only

In [62]:
baseline_model = model.QABaselineModel(
    embedding_layer, MAX_CONTEXT_TOKENS, device=DEVICE
)
print(f"The baseline model has {baseline_model.count_parameters()} parameters")

The baseline model has 41000 parameters


In [None]:
baseline_optimizer = optim.Adam(baseline_model.parameters(), lr=1e-3)
baseline_lr_scheduler = transformers.get_constant_schedule(baseline_optimizer)

In [63]:
baseline_trainer = training.SquadTrainer(
    model=baseline_model,
    args=baseline_args(run_name=f"{baseline_run_name}-whole", evaluation_strategy="no"),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.whole_dataset,
    optimizers=(baseline_optimizer, baseline_lr_scheduler),
)

In [64]:
baseline_trainer.train()

Step,Training Loss
1,5.9843
5,6.3311
10,6.1952
15,6.2592
20,6.2147
25,6.1515
30,6.2805
35,6.2771


TrainOutput(global_step=35, training_loss=6.234272003173828, metrics={'train_runtime': 79.2375, 'train_samples_per_second': 0.442, 'total_flos': 0, 'epoch': 5.0})

#### Testing

In [65]:
baseline_test_output = baseline_trainer.predict(recurrent_dm.test_dataset)
baseline_test_output.metrics

{'test_loss': 6.0533928871154785,
 'test_accuracy': 0.048774934275635486,
 'test_precision': 0.04997091627749208,
 'test_recall': 0.539648141845944,
 'test_f1': 0.08496900937695989,
 'test_em': 0.0,
 'test_runtime': 1.8771,
 'test_samples_per_second': 96.959}

In [66]:
baseline_answers_path = "results/answers/baseline.json"
utils.save_answers(baseline_answers_path, baseline_test_output.predictions[-1])
wandb.save(baseline_answers_path);
wandb.finish()

### BiDAF

> The Bi-Directional Attention Flow (BIDAF) network is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BIDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Our attention mechanism offers following improvements to the previously popular attention paradigms. First, the attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization. Second, we use a memory-less attention mechanism. That is, while we iteratively compute attention through time, the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.
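
To make the attention flow more tangible, here is a deliberately simplified, dependency-free sketch of the bi-directional attention step for a single example. It is an assumption-laden illustration and not `model.QABiDAFModel`: the similarity is a plain dot product instead of the trainable similarity function used in the paper, and the inputs are toy lists rather than contextual `LSTM` outputs.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights: Vector, vectors: Matrix) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def bidaf_attention(context: Matrix, query: Matrix) -> Matrix:
    """Return one query-aware vector of size 4 * dim per context time step."""
    # Similarity between every context token and every query token.
    # Assumption: dot product in place of the paper's trainable similarity.
    sim = [[dot(h, u) for u in query] for h in context]

    # Context-to-query: each context token attends over the query tokens.
    c2q = [weighted_sum(softmax(row), query) for row in sim]

    # Query-to-context: context tokens weighted by their best query match,
    # giving a single vector that is shared by all time steps.
    q2c = weighted_sum(softmax([max(row) for row in sim]), context)

    # Nothing is summarized away: every time step keeps its own output.
    merged = []
    for h, u_att in zip(context, c2q):
        merged.append(
            h
            + u_att
            + [a * b for a, b in zip(h, u_att)]
            + [a * b for a, b in zip(h, q2c)]
        )
    return merged


toy_context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, dim 2
toy_query = [[1.0, 0.0], [0.5, 0.5]]  # 2 tokens, dim 2
output = bidaf_attention(toy_context, toy_query)
print(len(output), len(output[0]))
```

The output has one row per context token, which reflects the first point of the quote: attention is computed at every time step and passed on, not collapsed into one fixed-size vector.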

In [46]:
%env WANDB_RUN_GROUP=bidaf
bidaf_run_name = utils.get_run_name()
bidaf_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{bidaf_run_name}",
    num_train_epochs=18,
    per_device_train_batch_size=60,
    per_device_eval_batch_size=60,
)

env: WANDB_RUN_GROUP=bidaf


#### Training and validation

In [47]:
bidaf_model = model.QABiDAFModel(embedding_layer, device=DEVICE)
print(f"The BiDAF model has {bidaf_model.count_parameters()} parameters")

The BiDAF model has 2052700 parameters


In [48]:
bidaf_optimizer = optim.Adadelta(bidaf_model.parameters(), lr=0.5)
bidaf_lr_scheduler = transformers.get_constant_schedule(bidaf_optimizer)

In [49]:
bidaf_trainer = training.SquadTrainer(
    model=bidaf_model,
    args=bidaf_args(run_name=bidaf_run_name),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.train_dataset,
    eval_dataset=recurrent_dm.val_dataset,
    optimizers=(bidaf_optimizer, bidaf_lr_scheduler),
)

In [50]:
bidaf_trainer.train()

torch.Size([3, 172, 100])
torch.Size([3, 172, 200])
torch.Size([3, 172, 14]) torch.Size([3, 14])
torch.Size([3, 172, 200]) torch.Size([3, 172, 1]) torch.Size([3, 172])


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Em,Runtime,Samples Per Second
1,6.1415,7.783813,0.0,0.0,0.0,0.0,0.0,0.2816,3.551


torch.Size([1, 194, 100])
torch.Size([1, 194, 200])
torch.Size([1, 194, 13]) torch.Size([1, 13])
torch.Size([1, 194, 200]) torch.Size([1, 194, 1]) torch.Size([1, 194])


KeyboardInterrupt: 

#### Training only

In [None]:
bidaf_model = model.QABiDAFModel(embedding_layer, device=DEVICE)
print(f"The BiDAF model has {bidaf_model.count_parameters()} parameters")

In [None]:
bidaf_optimizer = optim.Adadelta(bidaf_model.parameters(), lr=0.5)
bidaf_lr_scheduler = transformers.get_constant_schedule(bidaf_optimizer)

In [None]:
bidaf_trainer = training.SquadTrainer(
    model=bidaf_model,
    args=bidaf_args(run_name=f"{bidaf_run_name}-whole", evaluation_strategy="no"),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.whole_dataset,
    optimizers=(bidaf_optimizer, bidaf_lr_scheduler),
)

In [None]:
bidaf_trainer.train()

#### Testing

In [45]:
bidaf_test_output = bidaf_trainer.predict(recurrent_dm.test_dataset)
bidaf_test_output.metrics

{'test_loss': 8.894977569580078,
 'test_f1': 0.0,
 'test_accuracy': 0.0,
 'test_em': 0.0,
 'test_runtime': 0.2739,
 'test_samples_per_second': 3.651}

In [46]:
bidaf_answers_path = "results/answers/bidaf.json"
utils.save_answers(bidaf_answers_path, bidaf_test_output.predictions[-1])
wandb.save(bidaf_answers_path);
wandb.finish()

## Transformer networks

### Data loading

In [13]:
MAX_BERT_TOKENS = 512

In [14]:
transformer_tokenizer = tokenizer.get_transformer_tokenizer(
    max_tokens=MAX_BERT_TOKENS, device=DEVICE
)

In [15]:
transformer_dm = dataset.SquadDataManager(
    squad_dataset, transformer_tokenizer, device=DEVICE
)

In [16]:
transformer_dm.train_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,56de4d9ecffd8e1900b4b7e2,What year was the Banská Akadémia founded?,Institute_of_technology,1860,The world's first institution of technology or...,[1735],[167],[171]
1,572781a5f1498d1400e8fa1f,Who is elected every even numbered year?,"Ann_Arbor,_Michigan",10585,Ann Arbor has a council-manager form of govern...,[mayor],[192],[197]
2,5730bb058ab72b1400f9c72c,Where were the use of advanced materials and t...,Sumer,17505,The most impressive and famous of Sumerian bui...,[Sumerian temples and palaces],[421],[449]


In [17]:
transformer_dm.val_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,572674a05951b619008f7319,What is another speed that can also be reporte...,Film_speed,9354,The standard specifies how speed ratings shoul...,[SOS-based speed],[793],[808]


In [18]:
transformer_dm.test_df

Unnamed: 0,index,answer,answer_start,answer_end


### BERT

The BERT model is a bidirectional transformer pretrained using a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

The abstract from the paper is the following:

> We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

In [23]:
%env WANDB_RUN_GROUP=bert
bert_run_name = utils.get_run_name()
bert_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{bert_run_name}",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

env: WANDB_RUN_GROUP=bert


#### Training and validation

In [32]:
bert_model = model.QABertModel(device=DEVICE)
print(f"The BERT model has {bert_model.count_parameters()} parameters")

The BERT model has 114208512 parameters


In [33]:
bert_optimizer = optim.Adam(bert_model.parameters(), lr=5e-5)
bert_lr_scheduler = transformers.get_constant_schedule(bert_optimizer)

In [34]:
bert_trainer = training.SquadTrainer(
    model=bert_model,
    args=bert_args(run_name=bert_run_name),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.train_dataset,
    eval_dataset=transformer_dm.val_dataset,
    optimizers=(bert_optimizer, bert_lr_scheduler),
)

In [36]:
bert_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Em,Runtime,Samples Per Second
1,4.4509,5.148665,0.013514,0.013699,0.5,0.026667,0.0,0.6375,1.569
2,4.4509,5.117357,0.05,0.052632,0.5,0.095238,0.0,0.5971,1.675
3,4.4509,5.088964,0.047619,0.05,0.5,0.090909,0.0,0.5654,1.769


TrainOutput(global_step=3, training_loss=4.086720943450928, metrics={'train_runtime': 39.8523, 'train_samples_per_second': 0.075, 'total_flos': 0, 'epoch': 3.0})

#### Training only

In [None]:
bert_model = model.QABertModel(device=DEVICE)
print(f"The BERT model has {bert_model.count_parameters()} parameters")

In [None]:
bert_optimizer = optim.Adam(bert_model.parameters(), lr=5e-5)
bert_lr_scheduler = transformers.get_constant_schedule(bert_optimizer)

In [None]:
bert_trainer = training.SquadTrainer(
    model=bert_model,
    args=bert_args(run_name=f"{bert_run_name}-whole", evaluation_strategy="no"),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.whole_dataset,
    optimizers=(bert_optimizer, bert_lr_scheduler),
)

In [None]:
bert_trainer.train()

#### Testing

In [24]:
bert_test_output = bert_trainer.predict(transformer_dm.test_dataset)
bert_test_output.metrics

{'test_loss': 9.000775337219238,
 'test_f1': 0.0,
 'test_accuracy': 0.0,
 'test_em': 0.0,
 'test_runtime': 0.6044,
 'test_samples_per_second': 1.655}

In [25]:
bert_answers_path = "results/answers/bert.json"
utils.save_answers(bert_answers_path, bert_test_output.predictions[-1])
wandb.save(bert_answers_path);
wandb.finish()

### DistilBERT

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than _bert-base-uncased_ and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

The abstract from the paper is the following:

> As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

In [22]:
%env WANDB_RUN_GROUP=distilbert
distilbert_run_name = utils.get_run_name()
distilbert_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{distilbert_run_name}",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

env: WANDB_RUN_GROUP=distilbert


#### Training and validation

In [20]:
distilbert_model = model.QADistilBertModel(device=DEVICE)
print(f"The DistilBERT model has {distilbert_model.count_parameters()} parameters")

In [None]:
distilbert_optimizer = optim.Adam(distilbert_model.parameters(), lr=5e-5)
distilbert_lr_scheduler = transformers.get_constant_schedule(distilbert_optimizer)

In [None]:
distilbert_trainer = training.SquadTrainer(
    model=distilbert_model,
    args=distilbert_args(run_name=distilbert_run_name),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.train_dataset,
    eval_dataset=transformer_dm.val_dataset,
    optimizers=(distilbert_optimizer, distilbert_lr_scheduler),
)

In [23]:
distilbert_trainer.train()

Epoch,Training Loss,Validation Loss,F1,Accuracy,Em,Runtime,Samples Per Second
1,6.5272,6.653205,0.166667,0.25,0.0,1.0424,1.919
2,6.5272,7.183872,0.166667,0.25,0.0,1.0671,1.874
3,6.5272,7.202362,0.166667,0.25,0.0,0.9527,2.099


TrainOutput(global_step=3, training_loss=5.523826281229655, metrics={'train_runtime': 45.0591, 'train_samples_per_second': 0.067, 'total_flos': 0, 'epoch': 3.0})

#### Training only

In [None]:
distilbert_model = model.QADistilBertModel(device=DEVICE)
print(f"The DistilBERT model has {distilbert_model.count_parameters()} parameters")

In [None]:
distilbert_optimizer = optim.Adam(distilbert_model.parameters(), lr=5e-5)
distilbert_lr_scheduler = transformers.get_constant_schedule(distilbert_optimizer)

In [None]:
distilbert_trainer = training.SquadTrainer(
    model=distilbert_model,
    args=distilbert_args(run_name=f"{distilbert_run_name}-whole", evaluation_strategy="no"),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.whole_dataset,
    optimizers=(distilbert_optimizer, distilbert_lr_scheduler),
)

In [None]:
distilbert_trainer.train()

#### Testing

In [24]:
distilbert_test_output = distilbert_trainer.predict(transformer_dm.test_dataset)
distilbert_test_output.metrics

{'test_loss': 9.000775337219238,
 'test_f1': 0.0,
 'test_accuracy': 0.0,
 'test_em': 0.0,
 'test_runtime': 0.6044,
 'test_samples_per_second': 1.655}

In [25]:
distilbert_answers_path = "results/answers/distilbert.json"
utils.save_answers(distilbert_answers_path, distilbert_test_output.predictions[-1])
wandb.save(distilbert_answers_path);
wandb.finish()

### ELECTRA

ELECTRA is a new pretraining approach which trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

> Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

In [22]:
%env WANDB_RUN_GROUP=electra
electra_run_name = utils.get_run_name()
electra_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{electra_run_name}",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

env: WANDB_RUN_GROUP=electra


#### Training and validation

In [20]:
electra_model = model.QAElectraModel(device=DEVICE)
print(f"The ELECTRA model has {electra_model.count_parameters()} parameters")

In [None]:
electra_optimizer = optim.Adam(electra_model.parameters(), lr=5e-5)
electra_lr_scheduler = transformers.get_constant_schedule(electra_optimizer)

In [None]:
electra_trainer = training.SquadTrainer(
    model=electra_model,
    args=electra_args(run_name=electra_run_name),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.train_dataset,
    eval_dataset=transformer_dm.val_dataset,
    optimizers=(electra_optimizer, electra_lr_scheduler),
)

In [23]:
electra_trainer.train()

Epoch,Training Loss,Validation Loss,F1,Accuracy,Em,Runtime,Samples Per Second
1,6.5272,6.653205,0.166667,0.25,0.0,1.0424,1.919
2,6.5272,7.183872,0.166667,0.25,0.0,1.0671,1.874
3,6.5272,7.202362,0.166667,0.25,0.0,0.9527,2.099


TrainOutput(global_step=3, training_loss=5.523826281229655, metrics={'train_runtime': 45.0591, 'train_samples_per_second': 0.067, 'total_flos': 0, 'epoch': 3.0})

#### Training only

In [None]:
electra_model = model.QAElectraModel(device=DEVICE)
print(f"The ELECTRA model has {electra_model.count_parameters()} parameters")

In [None]:
electra_optimizer = optim.Adam(electra_model.parameters(), lr=5e-5)
electra_lr_scheduler = transformers.get_constant_schedule(electra_optimizer)

In [None]:
electra_trainer = training.SquadTrainer(
    model=electra_model,
    args=electra_args(run_name=f"{electra_run_name}-whole", evaluation_strategy="no"),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.whole_dataset,
    optimizers=(electra_optimizer, electra_lr_scheduler),
)

In [None]:
electra_trainer.train()

#### Testing

In [24]:
electra_test_output = electra_trainer.predict(transformer_dm.test_dataset)
electra_test_output.metrics

{'test_loss': 9.000775337219238,
 'test_f1': 0.0,
 'test_accuracy': 0.0,
 'test_em': 0.0,
 'test_runtime': 0.6044,
 'test_samples_per_second': 1.655}

In [25]:
electra_answers_path = "results/answers/electra.json"
utils.save_answers(electra_answers_path, electra_test_output.predictions[-1])
wandb.save(electra_answers_path);
wandb.finish()