# SQuAD Q&A

This notebook contains training scripts for question answering models on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) v1.1 dataset. The task consists of selecting the answer to a given question as a span of words in a given context paragraph. The newest version of the dataset (v2.0) also contains unanswerable questions, but the version we worked on (v1.1) does not.

## Colab requirements

Before restarting runtime (remember to select GPU runtime)$\dots$

In [None]:
!git clone https://github.com/Wadaboa/squad-question-answering.git
!pip install -r squad-question-answering/init/base_requirements.txt

After restarting runtime$\dots$

In [None]:
import os, sys

sys.path.insert(0, "squad-question-answering")
os.chdir("squad-question-answering")

## Imports

In order to import source files, we have to add the `src` folder to Python's module search path (`sys.path`)$\dots$

In [35]:
import sys

sys.path.insert(0, "src")

Then, we can import packages as usual$\dots$

In [68]:
import os
from functools import partial

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import wandb
import transformers
from transformers.trainer_utils import set_seed

import dataset
import model
import training
import tokenizer
import utils
import layer_utils
import config

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Check the current configuration variables$\dots$

In [None]:
[(item, getattr(config, item)) for item in dir(config) if not item.startswith("__")]

## Initialization

In this section we initialize all the libraries used throughout the notebook$\dots$

### Weights & biases

Training and evaluation metrics, along with model checkpoints and results, are logged directly to a [W&B](https://wandb.ai/) project, which is openly accessible [here](https://wandb.ai/wadaboa/squad-qa). Logging permissions are granted only to members of the team, so if you want to launch your own training run you have to disable wandb by setting the environment variable `WANDB_DISABLED` to an empty value in the following block (`%env WANDB_DISABLED=`).

In [3]:
%env WANDB_PROJECT=squad-qa
%env WANDB_ENTITY=wadaboa
%env WANDB_MODE=online
%env WANDB_RESUME=never
%env WANDB_WATCH=false
%env WANDB_SILENT=true

env: WANDB_PROJECT=squad-qa
env: WANDB_ENTITY=wadaboa
env: WANDB_MODE=online
env: WANDB_RESUME=never
env: WANDB_WATCH=false
env: WANDB_SILENT=true


Make sure you are logged in: if the system prompts you for a key, head over to [W&B](https://wandb.ai/authorize) and log in, and the key should appear on the web page$\dots$

In [4]:
!wandb login

Make sure `wandb` is enabled system-wide$\dots$

In [5]:
!wandb enabled

W&B enabled.


### PyTorch and numpy

Set the random seed to a fixed number for reproducible results$\dots$

In [66]:
set_seed(config.RANDOM_SEED)

Get the fastest device (GPU if available, else CPU as a fallback) to be used for training neural models in `PyTorch`$\dots$

In [38]:
DEVICE = utils.get_device()
DEVICE

device(type='cpu')

If a GPU device is available, print related information, such as the GPU type and its current usage$\dots$

In [39]:
if DEVICE.type != "cpu":
    !nvidia-smi

## Preliminaries

In this section we are going to perform some preliminary steps, like data loading and common variables definition$\dots$

### Raw data loading

The `SquadDataset` class holds a "raw" copy of the training set and the test set (if given). By "raw", we simply mean that questions and contexts are not pre-processed in this stage, but they are simply loaded from the given `JSON` files into appropriate `Pandas` `DataFrame`s.

For the training set we used the official `SQuAD` v1.1 one, while for the test set we opted for the `SQuAD` v1.1 dev set.

In [69]:
DATA_FOLDER = os.path.join(os.getcwd(), "data")
TRAIN_DATA_FOLDER = os.path.join(DATA_FOLDER, "training")
TRAIN_SET_PATH = os.path.join(TRAIN_DATA_FOLDER, "training_set.json")
TEST_DATA_FOLDER = os.path.join(DATA_FOLDER, "testing")
TEST_SET_PATH = os.path.join(TEST_DATA_FOLDER, "test_set.json")

Remember that the `subset` variable is used to load a random subset of both the training and testing datasets. It is meant for debugging only, so `subset` should be set to $1.0$ when performing real training runs.

In [70]:
squad_dataset = dataset.SquadDataset(
    train_set_path=TRAIN_SET_PATH,
    test_set_path=TEST_SET_PATH,
    subset=config.DATA_SUBSET,
)
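As a rough illustration of what loading a random subset amounts to, the sketch below samples a fixed fraction of examples with a seeded generator. The `take_subset` helper and its parameters are hypothetical: they are not part of `dataset.SquadDataset`, whose internals are not shown in this notebook.

```python
import random


def take_subset(rows, fraction=1.0, seed=0):
    """Return a random subset holding `fraction` of `rows` (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if fraction == 1.0:
        return list(rows)  # Real training runs keep every example
    rng = random.Random(seed)  # Seeded so that the same subset is drawn each time
    size = max(1, int(len(rows) * fraction))
    return rng.sample(list(rows), size)


examples = [f"question-{i}" for i in range(100)]
debug_examples = take_subset(examples, fraction=0.1, seed=42)
print(len(debug_examples))
```

Seeding the sampler keeps debugging runs comparable with each other, since every run sees the same examples.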

Let's visualize the "raw" training set$\dots$

In [71]:
squad_dataset.raw_train_df

Unnamed: 0,answer_start,answer,title,context,question_id,question,context_id,answer_end
0,167,1735,Institute_of_technology,The world's first institution of technology or...,56de4d9ecffd8e1900b4b7e2,What year was the Banská Akadémia founded?,1860,171
1,793,SOS-based speed,Film_speed,The standard specifies how speed ratings shoul...,572674a05951b619008f7319,What is another speed that can also be reporte...,9354,808
2,421,Sumerian temples and palaces,Sumer,The most impressive and famous of Sumerian bui...,5730bb058ab72b1400f9c72c,Where were the use of advanced materials and t...,17505,449
3,192,mayor,"Ann_Arbor,_Michigan",Ann Arbor has a council-manager form of govern...,572781a5f1498d1400e8fa1f,Who is elected every even numbered year?,10585,197
4,194,decide on the feasibility of building an ICBM ...,John_von_Neumann,"Shortly before his death, when he was already ...",572843ce4b864d190016485c,What was the purpose of top secret ICBM commit...,11497,284
5,347,National Bishop Conferences,Pope_Paul_VI,Some critiqued Paul VI's decision; the newly c...,5726ef98708984140094d66e,What conferences became a requirement after Va...,10862,374
6,105,C,Spectre_(2015_film),Bond and Swann return to London where they mee...,56cdd28562d2951400fa68bd,Who does M fight with?,470,106
7,6,1150,Antarctica,About 1150 species of fungi have been recorded...,570e1a2a0dc6ce1900204dbf,How many species of fungi have been found on A...,6902,10
8,668,Virginia coastline,North_Carolina,"In the Battle of Cowan's Ford, Cornwallis met ...",57278aa6f1498d1400e8fb65,After losing the battle of Guilford Courthouse...,12034,686
9,128,aluminum.,2008_Summer_Olympics_torch_relay,The Olympic Torch is based on traditional scro...,56d8d4e7bfea0914004b7728,What is the Olympic Torch made from?,1456,137


Let's visualize the "raw" test set$\dots$

In [72]:
squad_dataset.raw_test_df

Unnamed: 0,answer_start,answer,title,context,question_id,question,context_id,answer_end
0,250,instantaneously in action-reaction pairs,Force,Tension forces can be modeled using ideal stri...,57379ed81c456719005744d7,In what way do idea strings transmit tesion fo...,2058,290
1,187,Egg of Columbus,Nikola_Tesla,Tesla also explained the principles of the rot...,56e0ed557aa994140058e7dd,What was Tesla's device called?,181,202
2,539,NP-complete Boolean satisfiability problem,Computational_complexity_theory,What intractability means in practice is open ...,56e1febfe3433e140042323a,What is the example of another problem charact...,282,581
3,1332,Treaty provisions,European_Union_law,Although it is generally accepted that EU law ...,57269bb8708984140094cb98,What are EU Regulations essentially the same a...,759,1349
4,181,Gerhard,Martin_Luther,The Lutheran theologian Franz Pieper observed ...,56f884cba6d7ea1400e17708,What theologian differed in views about the so...,412,188
5,1441,self-consistent unification,Force,The development of fundamental theories for fo...,5737821cc3c5551400e51f1c,What type of physics model did Einstein fail t...,2045,1468
6,503,not a unit and cannot be written as a product ...,Prime_number,Prime numbers give rise to two more general co...,57299c2c6aef051400155024,Under what condition is an element irreducible?,1758,589
7,221,the filling of molecular orbitals formed from ...,Oxygen,"In this dioxygen, the two oxygen atoms are che...",571c83f3dd7acb1400e4c0dc,Of what does the covalent double bond result f...,630,317
8,11,pupils are free to choose a private school,Private_school,"In Sweden, pupils are free to choose a private...",572754dd708984140094dc3f,What school model is Sweden notable for?,1343,53


### Utils

This section contains common variables and functions to be used when training all the subsequent models$\dots$

The following cell loads the default training parameters, such as the batch size, the logging frequency, the location where model checkpoints are saved and more$\dots$

In [73]:
TRAINER_ARGS = utils.get_default_trainer_args()

## Recurrent models

This section contains blocks of code that are used to train question answering models which are based on recurrent networks (`LSTM`s in our case). The models that we implemented are the following:
- Baseline (a recurrent encoder with a naive version of attention)
- BiDAF (Bi-Directional Attention Flow)

Check the corresponding section to have a high-level view of each model$\dots$

### Embeddings

In this section we are going to load an embedding matrix using the `Gensim` API and use the corresponding matrix as the weight block of an `nn.Embedding` `PyTorch` module$\dots$

First of all, we have to define one token for padding values. Then, OOV words are handled by a single unknown token, which is estimated as the mean of all the embedding vectors (if this mean vector is already present in the model, then a random embedding with suitable ranges is computed).
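The handling of the unknown token described above can be sketched with `numpy` as follows. This is only an illustration under stated assumptions, not the code behind `utils.load_embedding_model`: the `build_unk_vector` name is hypothetical, the "suitable ranges" are assumed to be the minimum and maximum values of the embedding matrix, and the padding vector is assumed to be all zeros.

```python
import numpy as np


def build_unk_vector(vectors, rng=None):
    """Return a vector for unknown words (illustrative, not the repository code)."""
    rng = rng or np.random.default_rng(0)
    unk = vectors.mean(axis=0)
    # If the mean coincides with an existing embedding, fall back to a random
    # vector drawn within the value range spanned by the embedding matrix
    if np.any(np.all(np.isclose(vectors, unk), axis=1)):
        unk = rng.uniform(vectors.min(), vectors.max(), size=vectors.shape[1])
    return unk


vectors = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
unk_vector = build_unk_vector(vectors)
pad_vector = np.zeros(vectors.shape[1])  # Assumption: padding maps to zeros
embedding_matrix = np.vstack([pad_vector, unk_vector, vectors])
print(embedding_matrix.shape)
```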

List of available embedding models (see [here](https://github.com/RaRe-Technologies/gensim-data)):
- FastText: 
    - _fasttext-wiki-news-subwords_ (dimensions: 300)
- GloVe:
    - _glove-twitter_ (dimensions: 25, 50, 100, 200)
    - _glove-wiki-gigaword_ (dimensions: 50, 100, 200, 300)
- Word2Vec:
    - _word2vec-google-news_ (dimensions: 300)
    - _word2vec-ruscorpora_ (dimensions: 300)

**Note**: The following cell could take a while (depending on the embedding dimension), since embedding models are pretty large$\dots$

In [74]:
embedding_model, vocab = utils.load_embedding_model(
    config.EMBEDDING_MODEL_NAME,
    embedding_dimension=config.EMBEDDING_DIMENSION,
    unk_token=config.UNK_TOKEN,
    pad_token=config.PAD_TOKEN,
)

The following cell loads the embedding model into a `PyTorch` `nn.Embedding` layer with frozen weights.

In [75]:
embedding_layer = layer_utils.get_embedding_module(
    embedding_model, pad_id=vocab[config.PAD_TOKEN]
)

### Data loading

The `SquadDataManager` class acts as both a data collator (i.e. it brings together multiple examples in the dataset with the help of `PyTorch`'s `DataLoader`s) and a tokenizer. In particular, tokenization happens on the fly at the batch level, thus enabling us to perform dynamic padding (based on the longest sequence in a batch) and avoiding the pre-tokenization overhead.
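Dynamic padding, as mentioned above, pads each batch only up to its own longest sequence. A minimal sketch of the idea follows; the `pad_batch` helper and the choice of `0` as padding id are hypothetical and are not taken from `SquadDataManager`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token and 0 marks padding
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)
print(padded)
print(mask)
```

Compared with padding everything to a global maximum length, this keeps batches of short sequences small, at the cost of tensor shapes that change from batch to batch.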

The tokenizer that we use for recurrent models splits words on whitespace and punctuation, removes accents and lowercases all the tokens. Moreover, questions are padded (not truncated), while contexts are truncated to a maximum number of tokens and padded.

In [76]:
recurrent_tokenizer = tokenizer.get_recurrent_tokenizer(
    vocab,
    config.MAX_CONTEXT_TOKENS,
    config.UNK_TOKEN,
    config.PAD_TOKEN,
    device=DEVICE,
)
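The tokenization rules described above can be approximated with the standard library alone. The sketch below is not the tokenizer returned by `tokenizer.get_recurrent_tokenizer`: the function names, the regular expression and the `[UNK]` token string are assumptions made for illustration. Padding is left out, since it happens at batch level.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after Unicode decomposition (e.g. 'á' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def simple_tokenize(text):
    """Lowercase, strip accents, then split on whitespace and punctuation."""
    text = strip_accents(text.lower())
    # \w+ matches runs of word characters, [^\w\s] isolates each punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)


def to_ids(tokens, vocab, max_tokens=None, unk_token="[UNK]"):
    """Map tokens to ids, truncating to `max_tokens` when it is given."""
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


tokens = simple_tokenize("What year was the Banská Akadémia founded?")
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "what": 2, "year": 3}
print(tokens)
print(to_ids(tokens, toy_vocab, max_tokens=3))
```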

The `SquadDataManager` class also acts as a pre-processor, with the following steps:
- Removes rows that contain wrong answers (e.g. answers that do not start and end at word boundaries)
- Removes rows that contain answers that would be lost due to tokenization (truncation in particular)
- Groups answers to the same question and context pair into a single row (thus producing lists in the `answer`, `answer_start` and `answer_end` columns)
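Two of the steps above, the word-boundary check and the grouping of answers, can be sketched in plain Python as shown below. The helper names and the dictionary-based rows are hypothetical (the notebook works on `Pandas` `DataFrame`s), treating any alphanumeric neighbour as a boundary violation is an assumption, and the check for answers lost to truncation is omitted.

```python
def is_on_word_boundaries(context, start, end):
    """Check that context[start:end] is not cut out of the middle of a word."""
    before_ok = start == 0 or not context[start - 1].isalnum()
    after_ok = end == len(context) or not context[end].isalnum()
    return before_ok and after_ok


def group_answers(rows):
    """Merge rows that share the same question and context into one row."""
    grouped = {}
    for row in rows:
        key = (row["question_id"], row["context_id"])
        if key not in grouped:
            grouped[key] = {
                "question_id": row["question_id"],
                "context_id": row["context_id"],
                "answer": [],
                "answer_start": [],
                "answer_end": [],
            }
        for column in ("answer", "answer_start", "answer_end"):
            grouped[key][column].append(row[column])
    return list(grouped.values())


context = "The mayor is elected every even-numbered year."
rows = [
    {"question_id": "q1", "context_id": 7, "answer": "mayor", "answer_start": 4, "answer_end": 9},
    {"question_id": "q1", "context_id": 7, "answer": "The mayor", "answer_start": 0, "answer_end": 9},
    {"question_id": "q2", "context_id": 7, "answer": "ayor", "answer_start": 5, "answer_end": 9},
]
clean_rows = [
    row for row in rows
    if is_on_word_boundaries(context, row["answer_start"], row["answer_end"])
]
grouped_rows = group_answers(clean_rows)
print(grouped_rows)
```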

In [78]:
recurrent_dm = dataset.SquadDataManager(
    squad_dataset, recurrent_tokenizer, val_split=config.VAL_SPLIT, device=DEVICE
)

The last task assigned to the `SquadDataManager` class is that of train/validation splitting, with the given ratio ($80\%$ for the training set by default).
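A simple way to picture such a split is a seeded shuffle followed by a cut, as in the sketch below. This is an assumption about the general idea only: how `SquadDataManager` actually assigns rows to each side is not shown in this notebook, and `train_val_split` is a hypothetical helper whose `val_split` argument is taken to be the validation fraction.

```python
import random


def train_val_split(rows, val_split=0.2, seed=0):
    """Shuffle `rows` and hold out a `val_split` fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_split))
    return shuffled[n_val:], shuffled[:n_val]


train_rows, val_rows = train_val_split(range(10), val_split=0.2, seed=42)
print(len(train_rows), len(val_rows))
```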

Let's have a look at the final training dataset$\dots$

In [79]:
recurrent_dm.train_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,56cdd28562d2951400fa68bd,Who does M fight with?,Spectre_(2015_film),470,Bond and Swann return to London where they mee...,[C],[105],[106]
1,56cee5a1aab44d1400b88c24,"In 1860, approximately how many people of Iris...",New_York_City,605,The Great Irish Famine brought a large influx ...,"[200,000]",[72],[79]
2,56cf83a7234ae51400d9bde3,Where was Donda West's funeral?,Kanye_West,1083,The funeral and burial for Donda West was held...,[Oklahoma City],[50],[63]
3,56d8d4e7bfea0914004b7728,What is the Olympic Torch made from?,2008_Summer_Olympics_torch_relay,1456,The Olympic Torch is based on traditional scro...,[aluminum.],[128],[137]
4,56de4d9ecffd8e1900b4b7e2,What year was the Banská Akadémia founded?,Institute_of_technology,1860,The world's first institution of technology or...,[1735],[167],[171]
5,56e161bfcd28a01900c67849,What was the name of Bostons first baseball team?,Boston,3325,"The Boston Red Sox, a founding member of the A...",[Red Stockings],[798],[811]
6,56e1c12fe3433e140042312a,What TV station is the main public service bro...,Communications_in_Somalia,3418,The Mogadishu-based Somali National Television...,[Somali National Television],[20],[46]
7,56e79e2e00c9c71400d773d5,Who convinced Sun Quan to make Nanjing his cap...,Nanjing,3759,"Surrounded by the Yangtze River and mountains,...",[Liu Bei],[389],[396]
8,5709b2a2200fba1400368279,When did Houston begin to regain its dependenc...,Houston,5914,One wave of the population boom ended abruptly...,[2000s],[571],[576]
9,570d2a80b3d812140066d4d3,Which city was the home of GE's first headquar...,General_Electric,7281,"At about the same time, Charles Coffin, leadin...",[Schenectady],[490],[501]


Let's have a look at the final validation dataset$\dots$

In [80]:
recurrent_dm.val_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,56ccf53362d2951400fa64fd,When did Ayurbarwada Buyantu Khan reign?,Sino-Tibetan_relations_during_the_Ming_dynasty,319,"Nevertheless, the ethno-geographic caste hiera...",[1311–1320],[447],[456]
1,56cd480b62d2951400fa6510,How many households were the offices of Qianhu...,Sino-Tibetan_relations_during_the_Ming_dynasty,321,"Chen Qingying, Professor of History and Direct...","[1,000 households]",[690],[706]
2,5726dde0f1498d1400e8edf7,Why was Professor David Graeber retired during...,Yale_University,10500,Yale has a history of difficult and prolonged ...,[he came to the defense of a student who was i...,[568],[644]
3,572755d8708984140094dc5a,How big is the Matthaei botanical garden?,"Ann_Arbor,_Michigan",10568,"Ann Arbor's ""Tree Town"" nickname stems from th...",[300 acres],[967],[976]
4,572781a5f1498d1400e8fa1f,Who is elected every even numbered year?,"Ann_Arbor,_Michigan",10585,Ann Arbor has a council-manager form of govern...,[mayor],[192],[197]
5,57283caa3acd2414000df780,How many civilians died in the attack?,Dissolution_of_the_Soviet_Union,12205,"On January 13, 1991, Soviet troops, along with...",[Fourteen],[158],[166]
6,572a301c3f37b31900478791,What did Europeans refer to the Ottoman empire...,Ottoman_Empire,14726,The Serbian revolution (1804–1815) marked the ...,"[the ""sick man""]",[588],[602]
7,572f785e04bcaa1900d769c0,Why did the Luftwaffe bomb the RAF Fighter Com...,The_Blitz,15807,Although not specifically prepared to conduct ...,[to gain air superiority],[228],[251]
8,572fabd004bcaa1900d76bac,What year did the government start giving out ...,The_Blitz,15820,Communal shelters never housed more than one s...,[1941],[457],[461]


And let's do the same for the testing dataset$\dots$

In [81]:
recurrent_dm.test_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,56e0ed557aa994140058e7dd,What was Tesla's device called?,Nikola_Tesla,181,Tesla also explained the principles of the rot...,[Egg of Columbus],[187],[202]
1,56e1febfe3433e140042323a,What is the example of another problem charact...,Computational_complexity_theory,282,What intractability means in practice is open ...,[NP-complete Boolean satisfiability problem],[539],[581]
2,56f884cba6d7ea1400e17708,What theologian differed in views about the so...,Martin_Luther,412,The Lutheran theologian Franz Pieper observed ...,[Gerhard],[181],[188]
3,571c83f3dd7acb1400e4c0dc,Of what does the covalent double bond result f...,Oxygen,630,"In this dioxygen, the two oxygen atoms are che...",[the filling of molecular orbitals formed from...,[221],[317]
4,57269bb8708984140094cb98,What are EU Regulations essentially the same a...,European_Union_law,759,Although it is generally accepted that EU law ...,[Treaty provisions],[1332],[1349]
5,572754dd708984140094dc3f,What school model is Sweden notable for?,Private_school,1343,"In Sweden, pupils are free to choose a private...",[pupils are free to choose a private school],[11],[53]
6,57299c2c6aef051400155024,Under what condition is an element irreducible?,Prime_number,1758,Prime numbers give rise to two more general co...,[not a unit and cannot be written as a product...,[503],[589]
7,5737821cc3c5551400e51f1c,What type of physics model did Einstein fail t...,Force,2045,The development of fundamental theories for fo...,[self-consistent unification],[1441],[1468]
8,57379ed81c456719005744d7,In what way do idea strings transmit tesion fo...,Force,2058,Tension forces can be modeled using ideal stri...,[instantaneously in action-reaction pairs],[250],[290]


### Baseline model

The baseline model is composed of a single recurrent encoder, which is given both questions and contexts as two separate inputs. Then, all the hidden states of a single question are averaged together (over the embedding dimension) so as to obtain a single vector, which should encode the semantic information of the question at the sentence level. This aggregated question vector is then multiplied element-wise with the latent representation of each context token, so as to obtain a query-aware context encoding. Finally, the query-aware context vectors are used directly as input for the start token classifier, while the end token classifier receives them after they have passed through another recurrent module.
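The query-aware encoding step can be illustrated with a few lines of `numpy`. The sketch uses random arrays in place of the encoder's hidden states and assumes that the average runs across the question's tokens, which is what leaves one vector of the hidden size. It is not the code of `model.QABaselineModel`, and the recurrent encoder and the two classifiers are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 4
question_states = rng.normal(size=(3, hidden_size))  # 3 question tokens
context_states = rng.normal(size=(5, hidden_size))  # 5 context tokens

# Average the question hidden states across its tokens: one vector per question
question_vector = question_states.mean(axis=0)

# Broadcasting multiplies every context token state by the question vector
query_aware_context = context_states * question_vector

print(question_vector.shape, query_aware_context.shape)
```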

Hyperparameters are chosen by hand, considering that the baseline model is pretty lightweight (in terms of FLOPs and parameters), so that we can afford larger batch sizes. Moreover, the number of training epochs is set to a high value to understand if and when overfitting occurs.

In [82]:
%env WANDB_RUN_GROUP=baseline
baseline_run_name = utils.get_run_name()
baseline_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{baseline_run_name}",
    num_train_epochs=30,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
)

env: WANDB_RUN_GROUP=baseline


#### Training and validation

In this section we use the training split to perform training and the validation split to observe the evolution of metrics during training, at the end of each epoch. To be clear, the validation set is not used to tune hyperparameters, but just as a reference to report results over unseen data.

In [98]:
baseline_model = model.QABaselineModel(embedding_layer, device=DEVICE)
print(f"The baseline model has {baseline_model.count_parameters()} parameters")

The baseline model has 245202 parameters


In [99]:
baseline_optimizer = optim.Adam(baseline_model.parameters(), lr=1e-3)
baseline_lr_scheduler = transformers.get_constant_schedule(baseline_optimizer)

In [100]:
baseline_trainer = training.SquadTrainer(
    model=baseline_model,
    args=baseline_args(run_name=baseline_run_name),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.train_dataset,
    eval_dataset=recurrent_dm.val_dataset,
    optimizers=(baseline_optimizer, baseline_lr_scheduler),
)

In [None]:
baseline_trainer.train()

#### Training only

In this section we are going to use the whole dataset for training, so only the training loss will be used as a metric. This step is used to boost generalization ability and to exploit all the training data.

In [62]:
baseline_model = model.QABaselineModel(embedding_layer, device=DEVICE)
print(f"The baseline model has {baseline_model.count_parameters()} parameters")

The baseline model has 41000 parameters


In [None]:
baseline_optimizer = optim.Adam(baseline_model.parameters(), lr=1e-3)
baseline_lr_scheduler = transformers.get_constant_schedule(baseline_optimizer)

In [63]:
baseline_trainer = training.SquadTrainer(
    model=baseline_model,
    args=baseline_args(run_name=f"{baseline_run_name}-whole", evaluation_strategy="no"),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.whole_dataset,
    optimizers=(baseline_optimizer, baseline_lr_scheduler),
)

In [None]:
baseline_trainer.train()

#### Testing

In this section we will use the test set to assess the generalization abilities of our model and observe final metrics$\dots$

In [None]:
baseline_test_output = baseline_trainer.predict(recurrent_dm.test_dataset)
baseline_test_output.metrics
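For reference, the sketch below shows how exact match and F1 can be computed for a single prediction, following the normalization used by the official `SQuAD` v1.1 evaluation script (lowercasing, removal of punctuation and articles, whitespace collapsing). Whether `SquadTrainer` reports exactly these metrics is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_truths(metric, prediction, truths):
    """A question may have several reference answers: keep the best score."""
    return max(metric(prediction, truth) for truth in truths)


references = ["Egg of Columbus"]
print(best_over_truths(exact_match, "the Egg of Columbus", references))
print(best_over_truths(f1_score, "Columbus", references))
```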

And we are going to save a `JSON` file with the following schema:
```json
{
    "question_id": "textual answer"
    ...
}
```
The `JSON` file contains textual answers to each question in the given test dataset. The output file can also be used with the official `SQuAD` evaluation script$\dots$

In [66]:
baseline_answers_path = "results/answers/baseline.json"
utils.save_answers(baseline_answers_path, baseline_test_output.predictions[-1])
wandb.save(baseline_answers_path);
wandb.finish()
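The answers file described above is a flat mapping from question identifiers to answer strings, so writing it needs little more than the `json` module. The sketch below is not the implementation of `utils.save_answers`: the `save_answers_json` helper, the placeholder identifiers and the temporary output folder are all hypothetical.

```python
import json
import os
import tempfile


def save_answers_json(path, answers):
    """Write a {question_id: answer_text} mapping to `path` as JSON."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(answers, handle, ensure_ascii=False, indent=4)


# Placeholder entries, not real model predictions
answers = {"question-id-1": "first answer", "question-id-2": "second answer"}
answers_path = os.path.join(tempfile.mkdtemp(), "answers", "example.json")
save_answers_json(answers_path, answers)

with open(answers_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(reloaded)
```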

### BiDAF

The Bi-Directional Attention Flow (BiDAF) network is a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation. Its attention mechanism offers the following improvements over previously popular attention paradigms. First, the attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization. Second, the model uses a memory-less attention mechanism. That is, while attention is iteratively computed through time, the attention at each time step is a function of only the query and the context paragraph at the current time step and does not directly depend on the attention at the previous time step.
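The attention flow layer can be sketched in `numpy` as follows, using the similarity function and the two attention directions described in the BiDAF paper. Random arrays stand in for the contextual embeddings, `d` denotes the full size of each state, and the weight vector `w` would be learned in practice. This is an illustration and not the code of `model.QABiDAFModel`.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # Subtract max for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bidaf_attention(H, U, w):
    """H: (T, d) context states, U: (J, d) query states, w: (3d,) weights."""
    T, J = H.shape[0], U.shape[0]
    # Similarity S[t, j] = w . [h_t; u_j; h_t * u_j]
    h = np.repeat(H[:, None, :], J, axis=1)  # (T, J, d)
    u = np.repeat(U[None, :, :], T, axis=0)  # (T, J, d)
    S = np.concatenate([h, u, h * u], axis=-1) @ w  # (T, J)

    # Context-to-query: each context step attends over the query words
    c2q_weights = softmax(S, axis=1)  # (T, J)
    U_tilde = c2q_weights @ U  # (T, d)

    # Query-to-context: weight context steps by their best query similarity
    q2c_weights = softmax(S.max(axis=1), axis=0)  # (T,)
    h_tilde = q2c_weights @ H  # (d,)
    H_tilde = np.tile(h_tilde, (T, 1))  # (T, d)

    # Query-aware representation kept for every context step (no summarization)
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)  # (T, 4d)
    return G, c2q_weights, q2c_weights


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))  # 5 context steps
U = rng.normal(size=(3, 4))  # 3 query steps
w = rng.normal(size=(12,))
G, c2q_weights, q2c_weights = bidaf_attention(H, U, w)
print(G.shape)
```

Each row of `G` depends only on the similarity scores of its own time step (plus the shared query-to-context vector), which is the memory-less property described above.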

The abstract from the paper is the following:
> Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.

The main difference between our BiDAF model and the full model reported in the paper is that we decided not to use character embeddings, since ablation studies showed that they give only marginal improvements$\dots$

Hyperparameters are taken directly from the BiDAF paper:
- Epochs: $12$ (we also tried a higher number of epochs)
- Batch size: $60$
- Optimizer: Adadelta
- Learning rate: $0.5$

In [46]:
%env WANDB_RUN_GROUP=bidaf
bidaf_run_name = utils.get_run_name()
bidaf_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{bidaf_run_name}",
    num_train_epochs=18,
    per_device_train_batch_size=60,
    per_device_eval_batch_size=60,
)

env: WANDB_RUN_GROUP=bidaf


#### Training and validation

As with the baseline model, we are going to perform both training and validation (to observe the evolution of metrics over unseen data)$\dots$

In [None]:
bidaf_model = model.QABiDAFModel(embedding_layer, device=DEVICE)
print(f"The BiDAF model has {bidaf_model.count_parameters()} parameters")

In [48]:
bidaf_optimizer = optim.Adadelta(bidaf_model.parameters(), lr=0.5)
bidaf_lr_scheduler = transformers.get_constant_schedule(bidaf_optimizer)

In [49]:
bidaf_trainer = training.SquadTrainer(
    model=bidaf_model,
    args=bidaf_args(run_name=bidaf_run_name),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.train_dataset,
    eval_dataset=recurrent_dm.val_dataset,
    optimizers=(bidaf_optimizer, bidaf_lr_scheduler),
)

In [None]:
bidaf_trainer.train()

#### Training only

As with the baseline model, we will use the entire dataset to perform training (no validation)$\dots$

In [None]:
bidaf_model = model.QABiDAFModel(embedding_layer, device=DEVICE)
print(f"The BiDAF model has {bidaf_model.count_parameters()} parameters")

In [None]:
bidaf_optimizer = optim.Adadelta(bidaf_model.parameters(), lr=0.5)
bidaf_lr_scheduler = transformers.get_constant_schedule(bidaf_optimizer)

In [None]:
bidaf_trainer = training.SquadTrainer(
    model=bidaf_model,
    args=bidaf_args(run_name=f"{bidaf_run_name}-whole", evaluation_strategy="no"),
    data_collator=recurrent_dm.tokenizer,
    train_dataset=recurrent_dm.whole_dataset,
    optimizers=(bidaf_optimizer, bidaf_lr_scheduler),
)

In [None]:
bidaf_trainer.train()

#### Testing

As with the baseline model, we will predict answers for each question in the given test dataset and save them to a `JSON` file$\dots$

In [None]:
bidaf_test_output = bidaf_trainer.predict(recurrent_dm.test_dataset)
bidaf_test_output.metrics

In [46]:
bidaf_answers_path = "results/answers/bidaf.json"
utils.save_answers(bidaf_answers_path, bidaf_test_output.predictions[-1])
wandb.save(bidaf_answers_path);
wandb.finish()

## Transformer networks

This section contains blocks of code that are used to train question answering models which are based on Transformer networks (`BERT`-related in our case). The models that we exploited are the following:
- BERT (Bidirectional Encoder Representations from Transformers)
- DistilBERT (a distilled version of BERT)
- ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)

Since pre-training such models is very expensive, we relied on pre-trained versions of them (where the pre-training tasks are different from question answering), publicly available through the [HuggingFace](https://huggingface.co/) [Transformers](https://huggingface.co/transformers/) library. Pre-trained models are wrapped into an ad-hoc `PyTorch` module, which attaches to them the output layer to be used for question answering.
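A common form for such an output layer, used in the BERT paper for `SQuAD` fine-tuning, is a linear projection that gives every token a start logit and an end logit, followed by a search for the best valid span. The `numpy` sketch below illustrates that idea under the assumption that the wrapper in this repository works along the same lines. The function names and the `max_answer_len` limit are hypothetical.

```python
import numpy as np


def qa_head(token_states, weight, bias):
    """Project each token state to a (start, end) pair of logits."""
    logits = token_states @ weight + bias  # (n_tokens, 2)
    return logits[:, 0], logits[:, 1]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, start <= end."""
    best, best_score = (0, 0), -np.inf
    for start in range(len(start_logits)):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


rng = np.random.default_rng(0)
token_states = rng.normal(size=(6, 8))  # 6 tokens with hidden size 8
weight, bias = rng.normal(size=(8, 2)), np.zeros(2)
start_logits, end_logits = qa_head(token_states, weight, bias)
print(best_span(start_logits, end_logits))

# Taking each argmax independently here would give start=1 and end=0,
# which is not a valid span; the constrained search returns (1, 2) instead
toy_start = np.array([0.1, 2.0, 0.3, 0.0])
toy_end = np.array([3.0, 0.2, 1.5, 0.1])
toy_span = best_span(toy_start, toy_end)
print(toy_span)
```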

Check the corresponding section to have a high-level view of each model$\dots$

### Data loading

The tokenizer that we use for Transformer-based models splits words using the `WordPiece` algorithm, removes accents and lowercases all the tokens, while also merging questions and contexts as $\text{[CLS]}\ q_1\ q_2 \dots q_n\ \text{[SEP]}\ c_1\ c_2 \dots c_m\ \text{[SEP]}$ (it leverages the special tokens `[CLS]` and `[SEP]`). Moreover, the combined question/context sequence is truncated to a maximum number of tokens ($512$) and padded to the right.

In [14]:
transformer_tokenizer = tokenizer.get_transformer_tokenizer(
    config.BERT_VOCAB_PATH, config.MAX_BERT_TOKENS, device=DEVICE
)
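The way a question and a context are packed into a single input sequence can be sketched as follows. The sketch starts from already tokenized text (no `WordPiece` step) and assumes that only the context is truncated when the pair is too long. The `pack_pair` helper is hypothetical and is not the tokenizer returned by `tokenizer.get_transformer_tokenizer`.

```python
def pack_pair(question_tokens, context_tokens, max_tokens=512, pad_token="[PAD]"):
    """Build `[CLS] question [SEP] context [SEP]`, truncated and right-padded."""
    # Three slots are reserved for [CLS] and the two [SEP] tokens; only the
    # context is truncated (an assumption about the truncation strategy)
    room = max_tokens - len(question_tokens) - 3
    if room < 0:
        raise ValueError("question alone exceeds max_tokens")
    context_tokens = context_tokens[:room]
    tokens = ["[CLS]", *question_tokens, "[SEP]", *context_tokens, "[SEP]"]
    # Segment ids: 0 for the question part, 1 for the context part
    segments = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    attention = [1] * len(tokens)
    n_pad = max_tokens - len(tokens)
    return (
        tokens + [pad_token] * n_pad,
        segments + [0] * n_pad,
        attention + [0] * n_pad,
    )


question = ["who", "is", "elected", "?"]
context = ["the", "mayor", "is", "elected", "every", "even", "year"]
truncated_tokens, _, _ = pack_pair(question, context, max_tokens=12)
padded_tokens, padded_segments, padded_attention = pack_pair(
    question, context, max_tokens=16
)
print(truncated_tokens)
print(padded_tokens)
```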

As with the recurrent-related part of the notebook, the `SquadDataManager` class pre-processes inputs by throwing away "dirty" and "lost" answers and then groups answers related to the same question and context pair into a single row.

In [15]:
transformer_dm = dataset.SquadDataManager(
    squad_dataset, transformer_tokenizer, val_split=config.VAL_SPLIT, device=DEVICE
)

Let's see the training split of the whole dataset$\dots$

In [16]:
transformer_dm.train_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,56de4d9ecffd8e1900b4b7e2,What year was the Banská Akadémia founded?,Institute_of_technology,1860,The world's first institution of technology or...,[1735],[167],[171]
1,572781a5f1498d1400e8fa1f,Who is elected every even numbered year?,"Ann_Arbor,_Michigan",10585,Ann Arbor has a council-manager form of govern...,[mayor],[192],[197]
2,5730bb058ab72b1400f9c72c,Where were the use of advanced materials and t...,Sumer,17505,The most impressive and famous of Sumerian bui...,[Sumerian temples and palaces],[421],[449]


Let's see the validation split of the whole dataset$\dots$

In [17]:
transformer_dm.val_df

Unnamed: 0,question_id,question,title,context_id,context,answer,answer_start,answer_end
0,572674a05951b619008f7319,What is another speed that can also be reporte...,Film_speed,9354,The standard specifies how speed ratings shoul...,[SOS-based speed],[793],[808]


And the same for the test dataset$\dots$

In [18]:
transformer_dm.test_df

Unnamed: 0,index,answer,answer_start,answer_end


### BERT

The BERT model is a bidirectional Transformer pre-trained using a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

The abstract from the paper is the following:

> We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

Hyperparameters are taken directly from the BERT paper (in the section related to fine-tuning for the `SQuAD` dataset):
- Epochs: $3$
- Batch size: $32$ (we went for smaller sizes because of resource limitations)
- Learning rate: $5 \times 10^{-5}$

In [23]:
%env WANDB_RUN_GROUP=bert
bert_run_name = utils.get_run_name()
bert_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{bert_run_name}",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

env: WANDB_RUN_GROUP=bert


#### Training and validation

As with recurrent-based models, we are going to perform training and one validation run after each epoch, to observe metrics$\dots$

In [62]:
bert_model = model.QABertModel(device=DEVICE)
print(f"The BERT model has {bert_model.count_parameters()} parameters")

The BERT model has 114208512 parameters


In [33]:
bert_optimizer = optim.Adam(bert_model.parameters(), lr=5e-5)
bert_lr_scheduler = transformers.get_constant_schedule(bert_optimizer)

In [34]:
bert_trainer = training.SquadTrainer(
    model=bert_model,
    args=bert_args(run_name=bert_run_name),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.train_dataset,
    eval_dataset=transformer_dm.val_dataset,
    optimizers=(bert_optimizer, bert_lr_scheduler),
)

In [None]:
bert_trainer.train()

#### Training only

As with the recurrent-based models, we train on the whole dataset (with no validation at all)$\dots$

In [None]:
bert_model = model.QABertModel(device=DEVICE)
print(f"The BERT model has {bert_model.count_parameters()} parameters")

In [None]:
bert_optimizer = optim.Adam(bert_model.parameters(), lr=5e-5)
bert_lr_scheduler = transformers.get_constant_schedule(bert_optimizer)

In [None]:
bert_trainer = training.SquadTrainer(
    model=bert_model,
    args=bert_args(run_name=f"{bert_run_name}-whole", evaluation_strategy="no"),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.whole_dataset,
    optimizers=(bert_optimizer, bert_lr_scheduler),
)

In [None]:
bert_trainer.train()

#### Testing

As with the recurrent-based models, we predict one answer for each question in the test dataset and save the results in a `JSON` file$\dots$

In [None]:
bert_test_output = bert_trainer.predict(transformer_dm.test_dataset)
bert_test_output.metrics

In [25]:
bert_answers_path = "results/answers/bert.json"
utils.save_answers(bert_answers_path, bert_test_output.predictions[-1])
wandb.save(bert_answers_path);
wandb.finish()
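
The official SQuAD v1.1 evaluation script reads predictions as a single `JSON` object that maps each question id to its answer string. The sketch below writes such a file; `save_answers_sketch` is a hypothetical helper and only an assumption about what `utils.save_answers` does, and the question ids and answers are made up.

```python
import json
import os
import tempfile


def save_answers_sketch(path, answers):
    """Write a {question_id: answer_text} mapping to `path` as JSON."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(answers, f, ensure_ascii=False, indent=2)


# Made-up question ids and answers, for illustration only
toy_answers = {"q-0001": "Denver Broncos", "q-0002": "1066"}
toy_path = os.path.join(tempfile.mkdtemp(), "answers", "toy.json")
save_answers_sketch(toy_path, toy_answers)

# Read the file back to check that it round-trips
with open(toy_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```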

### DistilBERT

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than _bert-base-uncased_ and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

The abstract from the paper is the following:

> As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
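
To make the "triple loss" from the abstract more concrete, here is a minimal pure-Python sketch of its distillation and cosine-distance terms and of how the three terms could be combined. The temperature, the weights and the toy logits are assumptions chosen for illustration; this is not the authors' training code.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over temperature-scaled logits (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def cosine_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(mlm_loss, distil_loss, cos_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    w_mlm, w_distil, w_cos = weights
    return w_mlm * mlm_loss + w_distil * distil_loss + w_cos * cos_loss


teacher_logits = [2.0, 1.0, 0.1]  # toy values
matching = distillation_loss(teacher_logits, teacher_logits)
mismatched = distillation_loss([0.1, 1.0, 2.0], teacher_logits)
```

The distillation term is smallest when the student reproduces the teacher's softened distribution, so `mismatched` is larger than `matching`.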

Hyperparameters are the same as the ones used for fine-tuning BERT$\dots$

In [22]:
%env WANDB_RUN_GROUP=distilbert
distilbert_run_name = utils.get_run_name()
distilbert_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{distilbert_run_name}",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

env: WANDB_RUN_GROUP=distilbert


#### Training and validation

As with BERT, we train the model and run one validation pass after each epoch$\dots$

In [20]:
distilbert_model = model.QADistilBertModel(device=DEVICE)
print(f"The DistilBERT model has {distilbert_model.count_parameters()} parameters")

In [None]:
distilbert_optimizer = optim.Adam(distilbert_model.parameters(), lr=5e-5)
distilbert_lr_scheduler = transformers.get_constant_schedule(distilbert_optimizer)

In [None]:
distilbert_trainer = training.SquadTrainer(
    model=distilbert_model,
    args=distilbert_args(run_name=distilbert_run_name),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.train_dataset,
    eval_dataset=transformer_dm.val_dataset,
    optimizers=(distilbert_optimizer, distilbert_lr_scheduler),
)

In [None]:
distilbert_trainer.train()

#### Training only

As with BERT, we train on the whole dataset, with no validation$\dots$

In [None]:
distilbert_model = model.QADistilBertModel(device=DEVICE)
print(f"The DistilBERT model has {distilbert_model.count_parameters()} parameters")

In [None]:
distilbert_optimizer = optim.Adam(distilbert_model.parameters(), lr=5e-5)
distilbert_lr_scheduler = transformers.get_constant_schedule(distilbert_optimizer)

In [None]:
distilbert_trainer = training.SquadTrainer(
    model=distilbert_model,
    args=distilbert_args(run_name=f"{distilbert_run_name}-whole", evaluation_strategy="no"),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.whole_dataset,
    optimizers=(distilbert_optimizer, distilbert_lr_scheduler),
)

In [None]:
distilbert_trainer.train()

#### Testing

As with BERT, we predict one answer for each question in the test dataset and save the results in a `JSON` file$\dots$

In [None]:
distilbert_test_output = distilbert_trainer.predict(transformer_dm.test_dataset)
distilbert_test_output.metrics

In [25]:
distilbert_answers_path = "results/answers/distilbert.json"
utils.save_answers(distilbert_answers_path, distilbert_test_output.predictions[-1])
wandb.save(distilbert_answers_path);
wandb.finish()

### ELECTRA

ELECTRA is a new pretraining approach which trains two transformer models: the generator and the discriminator. The generator’s role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator, which is the model we’re interested in, tries to identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

> Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
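
The replaced token detection task described in the abstract can be sketched by building the discriminator's inputs and labels. In this toy version, uniform random sampling from a small vocabulary stands in for the generator network (an assumption made purely for illustration), and a sampled token that happens to equal the original one is labeled as not replaced.

```python
import random


def corrupt_and_label(tokens, vocabulary, replace_prob=0.15, seed=0):
    """Return (corrupted tokens, labels), where label 1 marks a replaced token."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < replace_prob:
            # Stand-in for sampling a plausible alternative from the generator
            candidate = rng.choice(vocabulary)
        else:
            candidate = token
        corrupted.append(candidate)
        # A sample identical to the original token counts as "original"
        labels.append(int(candidate != token))
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocabulary = ["the", "chef", "cooked", "meal", "ate", "a"]  # toy vocabulary
corrupted, labels = corrupt_and_label(tokens, vocabulary, replace_prob=0.5)
```

Every position receives a label, which reflects the abstract's point that the task is defined over all input tokens rather than only the masked subset.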

Hyperparameters are the same as the ones used for fine-tuning BERT and DistilBERT$\dots$

In [22]:
%env WANDB_RUN_GROUP=electra
electra_run_name = utils.get_run_name()
electra_args = partial(
    TRAINER_ARGS,
    output_dir=f"./checkpoints/{os.getenv('WANDB_RUN_GROUP')}/{electra_run_name}",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

env: WANDB_RUN_GROUP=electra


#### Training and validation

As with BERT and DistilBERT, we train the model and run one validation pass after each epoch$\dots$

In [20]:
electra_model = model.QAElectraModel(device=DEVICE)
print(f"The ELECTRA model has {electra_model.count_parameters()} parameters")

In [None]:
electra_optimizer = optim.Adam(electra_model.parameters(), lr=5e-5)
electra_lr_scheduler = transformers.get_constant_schedule(electra_optimizer)

In [None]:
electra_trainer = training.SquadTrainer(
    model=electra_model,
    args=electra_args(run_name=electra_run_name),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.train_dataset,
    eval_dataset=transformer_dm.val_dataset,
    optimizers=(electra_optimizer, electra_lr_scheduler),
)

In [None]:
electra_trainer.train()

#### Training only

As with BERT and DistilBERT, we train on the whole dataset, with no validation$\dots$

In [None]:
electra_model = model.QAElectraModel(device=DEVICE)
print(f"The ELECTRA model has {electra_model.count_parameters()} parameters")

In [None]:
electra_optimizer = optim.Adam(electra_model.parameters(), lr=5e-5)
electra_lr_scheduler = transformers.get_constant_schedule(electra_optimizer)

In [None]:
electra_trainer = training.SquadTrainer(
    model=electra_model,
    args=electra_args(run_name=f"{electra_run_name}-whole", evaluation_strategy="no"),
    data_collator=transformer_dm.tokenizer,
    train_dataset=transformer_dm.whole_dataset,
    optimizers=(electra_optimizer, electra_lr_scheduler),
)

In [None]:
electra_trainer.train()

#### Testing

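As with BERT and DistilBERT, we predict one answer for each question in the test dataset and save the results in a `JSON` file$\dots$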

In [None]:
electra_test_output = electra_trainer.predict(transformer_dm.test_dataset)
electra_test_output.metrics

In [25]:
electra_answers_path = "results/answers/electra.json"
utils.save_answers(electra_answers_path, electra_test_output.predictions[-1])
wandb.save(electra_answers_path);
wandb.finish()