## Training the ALBERT Transformer for QA on SQuAD 2.0

<img src="https://camo.githubusercontent.com/b9d050a07e52c7930206d37d72a229ab484cae1ace09bf0fe1c6cf9c7f5d4bc0/68747470733a2f2f68756767696e67666163652e636f2f66726f6e742f6173736574732f68756767696e67666163655f6c6f676f2e737667">

This notebook walks through training the [ALBERT transformer (large variant)](https://huggingface.co/transformers/model_doc/albert.html) on the closed-domain SQuAD 2.0 dataset, which is available [here](https://rajpurkar.github.io/SQuAD-explorer/). The same steps can be used to train any BERT-based transformer on SQuAD 2.0:

- Clone the transformers repository
- Activate a GPU runtime, since training on SQuAD takes a long time otherwise
- Download the SQuAD dataset from its repository into a local directory
- Account for the fact that SQuAD 2.0 contains questions with null answers
- Split the JSON file into separate lists of questions, answers, and contexts
- Adjust the answer/context pairs so that the answer indices are marked correctly (in some cases they are off by one or two characters)
- Tokenize the question and context pairs using a Hugging Face fast tokenizer
- Initialize a Torch Dataset class to wrap the tokenized input as a torch dataset mapping
- Load the model from Hugging Face
- Use a Torch DataLoader to load the dataset, and move the model weights to the CUDA device
- Train with the AdamW optimizer, resetting the gradients to zero before each step
- For each batch, pass the input_ids, attention_mask, start_positions, and end_positions to the model
- Run for enough epochs to get good results

We use the PyTorch framework to train the model.

In [None]:
!pip install transformers



In [None]:
!git clone https://github.com/huggingface/transformers.git
!pip install -U ./transformers
!pip install git+https://github.com/huggingface/nlp.git

In [None]:
!nvidia-smi


Wed Aug 25 08:05:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install nlp

## Load the Transformer to be trained 

We load the [AlbertTokenizerFast](https://huggingface.co/transformers/model_doc/albert.html#alberttokenizerfast), a Hugging Face tokenizer that uses Unigram-based tokenization and is built on the PreTrainedTokenizerFast class. [Unigram](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models) is a subword tokenization algorithm that works by trying to identify the set of subword tokens that maximizes the probability of a given sentence. It differs from BPE in that it is not a deterministic set of rules applied sequentially. Instead, Unigram can compute multiple ways of tokenizing the same text and chooses the most probable one.
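To make "choosing the most probable tokenization" concrete, here is a minimal sketch of a Viterbi search over a toy unigram vocabulary. The vocabulary, its probabilities, and the helper name `best_segmentation` are made up for illustration; this is not ALBERT's vocabulary or the tokenizers library's implementation.

```python
import math

# Hypothetical toy vocabulary with made-up unigram probabilities
# (not taken from the real ALBERT vocabulary).
TOY_VOCAB = {
    "un": 0.10, "igram": 0.05, "uni": 0.08, "gram": 0.12,
    "u": 0.02, "n": 0.02, "i": 0.02, "g": 0.02, "r": 0.02, "a": 0.02, "m": 0.02,
}

def best_segmentation(word, vocab):
    """Return the most probable split of `word` into vocabulary pieces."""
    n = len(word)
    # best[i] holds (log-probability, pieces) for the best split of word[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]

pieces = best_segmentation("unigram", TOY_VOCAB)
print(pieces)  # ['uni', 'gram']
```

With these toy numbers, "uni" + "gram" (0.08 × 0.12) beats "un" + "igram" (0.10 × 0.05), so the search returns the first split even though both are valid.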

In [None]:
import torch
import transformers
from transformers import AlbertTokenizerFast
#import nlp
tokenizer=AlbertTokenizerFast.from_pretrained('albert-large-v1')
#train_set=nlp.load_dataset('squad_v2',split='train')

In [None]:
!mkdir squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json!

mkdir: cannot create directory ‘squad’: File exists
--2021-08-25 08:05:32--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.108.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘squad/train-v2.0.json’


2021-08-25 08:05:33 (206 MB/s) - ‘squad/train-v2.0.json’ saved [42123633/42123633]

--2021-08-25 08:05:33--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.108.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘squad/dev-v2.0.json!’


2021-08-25 08:05:33 (92.9 MB/

## Creating the Split and Dataset

Each split is a structured JSON file with a number of questions and answers for each passage (or context). We take this apart into parallel lists of contexts, questions, and answers (note that the contexts are repeated, since there are multiple questions per context). This is done in the 'prepare_qa' function.
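As an illustration of that layout, the sketch below flattens a tiny hand-written record with the same nested loops that 'prepare_qa' uses. The record text is invented for the example and is not taken from the dataset.

```python
# A tiny hand-written record in the SQuAD 2.0 layout (illustrative text only).
toy_squad = {
    "data": [{
        "title": "Toy",
        "paragraphs": [{
            "context": "ALBERT shares parameters across layers.",
            "qas": [
                {"question": "What does ALBERT share?",
                 "is_impossible": False,
                 "answers": [{"text": "parameters", "answer_start": 14}]},
                {"question": "Who wrote ALBERT?",
                 "is_impossible": True,
                 "answers": []},  # unanswerable questions carry no answers
            ],
        }],
    }]
}

contexts, questions, answers = [], [], []
for group in toy_squad["data"]:
    for passage in group["paragraphs"]:
        for qa in passage["qas"]:
            # One output row per answer; an empty answer list adds nothing.
            for answer in qa["answers"]:
                contexts.append(passage["context"])
                questions.append(qa["question"])
                answers.append(answer)

print(len(questions), questions, answers)
```

Because the innermost loop iterates over the answers, a question whose answer list is empty contributes no rows, so only answerable questions end up in the three lists.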

The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer, as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating the token positions at which the answer begins and ends. We are given only the starting position, so we have to compute the character position at which the answer ends in the passage. Sometimes SQuAD answers are off by one or two characters, so we also adjust for that. This is done in the 'modify_answer_context' function.
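The end-position computation and the off-by-one correction can be sketched on a single hand-made example. The helper name `align_answer` and the sample sentence are hypothetical. Unlike 'modify_answer_context', this version returns a new dict instead of modifying the answer in place.

```python
def align_answer(answer, context):
    """Return a copy of `answer` with a corrected start and an added end index."""
    text = answer["text"]
    start = answer["answer_start"]
    # Try the labelled start first, then one and two characters earlier.
    for shift in (0, 1, 2):
        s = start - shift
        if s >= 0 and context[s:s + len(text)] == text:
            return {**answer, "answer_start": s, "answer_end": s + len(text)}
    raise ValueError("answer text not found near the labelled start")

context = "The Eiffel Tower is in Paris."
labelled = {"text": "Paris", "answer_start": 24}  # deliberately off by one

fixed = align_answer(labelled, context)
print(fixed)
```

Here the labelled start of 24 points at "aris.", so the first comparison fails and the second one (start 23) matches. The end index is exclusive, which is why `context[23:28]` gives back the answer text.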


In [None]:
import json
from pathlib import Path

def prepare_qa(path):
  path = Path(path)
  with open(path, 'rb') as f:
      squad_dict = json.load(f)

  contexts = []
  questions = []
  answers = []
  for group in squad_dict['data']:
      for passage in group['paragraphs']:
          context = passage['context']
          for qa in passage['qas']:
              question = qa['question']
              for answer in qa['answers']:
                  contexts.append(context)
                  questions.append(question)
                  answers.append(answer)

  return contexts, questions, answers

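def modify_answer_context(answers,contexts):
  # Add 'answer_end' to every answer and correct 'answer_start'
  # when the labelled position is off by one or two characters
  for ans,ctx in zip(answers,contexts):

    result_text=ans['text']
    start_idx=ans['answer_start']

    end_idx=start_idx+len(result_text)
    if (ctx[start_idx:end_idx]==result_text):
      ans['answer_end']=end_idx
    elif (ctx[start_idx-1:end_idx-1]==result_text):
      # 'answer_start' is a plain integer in the SQuAD JSON, so assign to it directly
      ans['answer_start']=start_idx-1
      ans['answer_end']=end_idx-1
    elif (ctx[start_idx-2:end_idx-2]==result_text):
      ans['answer_start']=start_idx-2
      ans['answer_end']=end_idx-2
    else:
      raise ValueError('Answer text not found within two characters of answer_start')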
train_contexts,train_questions,train_answers=prepare_qa('squad/train-v2.0.json')
val_contexts,val_questions,val_answers=prepare_qa('squad/dev-v2.0.json!')
modify_answer_context(train_answers,train_contexts)
modify_answer_context(val_answers,val_contexts)


## Encode the Question and Context pairs

Here we encode the contexts together with the questions. Truncation is set to True so that no sequence exceeds the maximum length, and padding is also set to True so that all sequences have a uniform size. Tokenizers can accept parallel lists of sequences and encode them together as sequence pairs.
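The effect of truncation and padding on a batch can be illustrated without a tokenizer. The sketch below is a simplified stand-in, not the tokenizer's implementation: it works on already-made lists of token ids, truncates from the end, and ignores special tokens. The helper name `pad_and_truncate` and the sample ids are hypothetical. The pad id of 0 matches the `pad_token_id` shown in the ALBERT config later in this notebook.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut sequences to `max_length`, then pad all to the longest remaining one."""
    longest = min(max(len(seq) for seq in sequences), max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:longest]
        n_pad = longest - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 0], [8, 9, 10, 11]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor during training.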

In [None]:
train_encode=tokenizer(train_contexts,train_questions,truncation=True,padding=True)
val_encode=tokenizer(val_contexts,val_questions,truncation=True,padding=True)

## Converting tokens to position ids

Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers, we can use the built-in char_to_token() method. This is done in the 'add_token_positions' function.
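The idea behind char_to_token() can be shown with a toy whitespace tokenizer that records the character span of every token. The helpers below are hypothetical stand-ins written for this illustration; the real method works on the fast tokenizer's own offsets and subword tokens.

```python
import re

def whitespace_offsets(text):
    """Return one (start, end) character span per whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_to_token(offsets, char_index):
    """Return the index of the token covering `char_index`, or None if none does."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

context = "The tower was built in 1889 by engineers"
answer = {"text": "1889", "answer_start": 23, "answer_end": 27}

offsets = whitespace_offsets(context)
start_token = char_to_token(offsets, answer["answer_start"])
end_token = char_to_token(offsets, answer["answer_end"] - 1)
print(start_token, end_token)  # 5 5
print(char_to_token(offsets, answer["answer_end"]))  # None
```

The last line shows why the end position is looked up at `answer_end - 1`: the end index is exclusive, so the character at `answer_end` itself may be whitespace that belongs to no token.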

In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encode, train_answers)
add_token_positions(val_encode, val_answers)

## Training in PyTorch and HF


<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT0chcDI56FF_5P9DFF2KP_xxiSC9mFwwHN8_Pk-tSCntFmcGjjEtQhd3cnjLBFnFkduB8&usqp=CAU">
First we have to convert the dataset to a [Torch Dataset format](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). The SquadDataset class is built for this: it maps the key/value pairs of the encoded question/context pairs to tensors. Then the model is loaded and the DataLoader is used to load the dataset entries for training. We also use CUDA for GPU-based training when it is available. For each batch, we reset the gradients to zero and then pass the input_ids, attention_mask, start_positions, and end_positions to the model.
Alternatively, we can use the [HF Trainer](https://huggingface.co/transformers/custom_datasets.html#ft-trainer) class to train the model.
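When start_positions and end_positions are passed in, the model returns a loss as its first output. As a rough sketch of what such a span-extraction loss looks like, the snippet below averages the cross-entropy of the start and end predictions for one example. It assumes the usual formulation of this objective and uses plain Python lists in place of tensors; the helper names `cross_entropy` and `span_loss` are hypothetical and not part of the library.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability assigned to index `target`."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position cross-entropies."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2

# An untrained model that scores every position equally
uniform = span_loss([0.0] * 4, [0.0] * 4, start_position=1, end_position=2)

# A model that puts most of its score on the correct positions
confident = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)

print(uniform, confident)
```

Under this sketch, a uniform guess over n positions gives a loss of ln(n), so ln(4) ≈ 1.39 in the toy case and ln(512) ≈ 6.24 for sequences of 512 tokens. The loss falls as the correct start and end positions receive higher scores.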


In [None]:
import torch
from torch.utils.data import Dataset
from transformers import AlbertForQuestionAnswering
from torch.utils.data import DataLoader
from transformers import AdamW

class SquadDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

model=AlbertForQuestionAnswering.from_pretrained("albert-large-v1")

train_dataset = SquadDataset(train_encode)
val_dataset = SquadDataset(val_encode)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=5, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)
epochs=1
for epoch in range(epochs):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()




## Training through Trainer

The [Trainer class](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) can also be used for training.

In [None]:
import torch
from torch.utils.data import Dataset
from transformers import AlbertForQuestionAnswering, Trainer, TrainingArguments

class SquadDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

model=AlbertForQuestionAnswering.from_pretrained("albert-large-v1")

train_dataset = SquadDataset(train_encode)
val_dataset = SquadDataset(val_encode)
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=0.25,              # total number of training epochs
    per_device_train_batch_size=1,  # batch size per device during training
    per_device_eval_batch_size=2,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=2,
)

#model=AlbertForQuestionAnswering.from_pretrained("albert-large-v1")
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Step,Training Loss
2,6.4378
4,5.9931
6,6.5129
8,6.2221
10,6.866
12,6.2606
14,5.9494
16,6.0689
18,6.0829
20,5.8833


Saving model checkpoint to ./results/checkpoint-11000
Configuration saved in ./results/checkpoint-11000/config.json
Model weights saved in ./results/checkpoint-11000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-11500
Configuration saved in ./results/checkpoint-11500/config.json
Model weights saved in ./results/checkpoint-11500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-12000
Configuration saved in ./results/checkpoint-12000/config.json
Model weights saved in ./results/checkpoint-12000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-12500
Configuration saved in ./results/checkpoint-12500/config.json
Model weights saved in ./results/checkpoint-12500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-13000
Configuration saved in ./results/checkpoint-13000/config.json
Model weights saved in ./results/checkpoint-13000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-13500
Configuration saved in ./r

KeyboardInterrupt: ignored

In [None]:
tokenizer.save_pretrained('./results/checkpoint-19500')

tokenizer config file saved in ./results/checkpoint-19500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-19500/special_tokens_map.json


('./results/checkpoint-19500/tokenizer_config.json',
 './results/checkpoint-19500/special_tokens_map.json',
 './results/checkpoint-19500/tokenizer.json')

In [None]:
from transformers import AutoTokenizer,AutoModelForQuestionAnswering
from transformers import pipeline
model=AutoModelForQuestionAnswering.from_pretrained('./results/checkpoint-19500/')
tokenizer=AutoTokenizer.from_pretrained('./results/checkpoint-19500/')
nlp_QA=pipeline('question-answering',model=model,tokenizer=tokenizer)
QA_inp={
    'question': 'How many parameters does Bert large have?',
    'context': 'Bert large is really big... it has 24 layers, for a total of 340M parameters.Altogether it is 1.34 GB so expect it to take a couple minutes to download to your Colab instance.'
}
result=nlp_QA(QA_inp)
result

loading configuration file ./results/checkpoint-19500/config.json
Model config AlbertConfig {
  "_name_or_path": "albert-large-v1",
  "architectures": [
    "AlbertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 16,
  "num_hidden_groups": 1,
  "num_hidden_layers": 24,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.9.2",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file ./results/checkpoint-19500/pytorch_model.bin
All m

{'answer': '340M', 'end': 65, 'score': 0.14847151935100555, 'start': 61}

## Testing the model after training

Once training is complete, we can test the model. For this, the tokenizer must also be saved in the same directory as the model checkpoint. The trained model can be found [here](https://huggingface.co/abhilash1910/albert-squad-v2) and can be used for inference.

In [None]:
from transformers import AutoTokenizer,AutoModelForQuestionAnswering
from transformers import pipeline
model=AutoModelForQuestionAnswering.from_pretrained('abhilash1910/albert-squad-v2')
tokenizer=AutoTokenizer.from_pretrained('abhilash1910/albert-squad-v2')
nlp_QA=pipeline('question-answering',model=model,tokenizer=tokenizer)
QA_inp={
    'question': 'How many parameters does Bert large have?',
    'context': 'Bert large is really big... it has 24 layers, for a total of 340M parameters.Altogether it is 1.34 GB so expect it to take a couple minutes to download to your Colab instance.'
}
result=nlp_QA(QA_inp)
result

https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpnfn3clps


Downloading:   0%|          | 0.00/836 [00:00<?, ?B/s]

storing https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/533e0f9e240b3d367152f5f45ba92b42c804beb0bc405ad84db636139305f855.afb9d9722bb153d2ea93235f95689e33a7eb85de7dbe37ec17e51dfa3f42ebbf
creating metadata file for /root/.cache/huggingface/transformers/533e0f9e240b3d367152f5f45ba92b42c804beb0bc405ad84db636139305f855.afb9d9722bb153d2ea93235f95689e33a7eb85de7dbe37ec17e51dfa3f42ebbf
loading configuration file https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/533e0f9e240b3d367152f5f45ba92b42c804beb0bc405ad84db636139305f855.afb9d9722bb153d2ea93235f95689e33a7eb85de7dbe37ec17e51dfa3f42ebbf
Model config AlbertConfig {
  "_name_or_path": "albert-large-v1",
  "architectures": [
    "AlbertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedd

Downloading:   0%|          | 0.00/66.6M [00:00<?, ?B/s]

storing https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/deb3aca97497075348da9db756f93e54989dd4c888dd3c53ae746354c874ef2a.7fb4f17a6e8fc4f0b154526f67db816418b37456b1edbf52ea01949b81ec6a29
creating metadata file for /root/.cache/huggingface/transformers/deb3aca97497075348da9db756f93e54989dd4c888dd3c53ae746354c874ef2a.7fb4f17a6e8fc4f0b154526f67db816418b37456b1edbf52ea01949b81ec6a29
loading weights file https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/deb3aca97497075348da9db756f93e54989dd4c888dd3c53ae746354c874ef2a.7fb4f17a6e8fc4f0b154526f67db816418b37456b1edbf52ea01949b81ec6a29
All model checkpoint weights were used when initializing AlbertForQuestionAnswering.

All the weights of AlbertForQuestionAnswering were initialized from the model checkpoint at abhilash1910/albert-squad-v2.
If your task is similar to the task the 

Downloading:   0%|          | 0.00/479 [00:00<?, ?B/s]

storing https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/c2a2d7cb94c1489b2e89a716d3ef67ddff9c01a0d06be97d5d8318bf5fea8a89.a6ee620a52156eef7141e7e836f41040ca587dfa06685b0d71c3ab8f6e80842c
creating metadata file for /root/.cache/huggingface/transformers/c2a2d7cb94c1489b2e89a716d3ef67ddff9c01a0d06be97d5d8318bf5fea8a89.a6ee620a52156eef7141e7e836f41040ca587dfa06685b0d71c3ab8f6e80842c
https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmph26hpxl6


Downloading:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

storing https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/c9690b4d4af928a54f950649cb6720d7dbe1ade8c407e5fecd98e95c3796d64d.8d78aca3da666753f634bc7eddbb0752746b506d8d4d8ee4e6340f36a2254adc
creating metadata file for /root/.cache/huggingface/transformers/c9690b4d4af928a54f950649cb6720d7dbe1ade8c407e5fecd98e95c3796d64d.8d78aca3da666753f634bc7eddbb0752746b506d8d4d8ee4e6340f36a2254adc
https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpea9erb08


Downloading:   0%|          | 0.00/245 [00:00<?, ?B/s]

storing https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/538d805f111b12a3ecfa07cbe838a6a3a5dd283a1257625c7048e80b6e60e264.4b56d965eb7203eb7c5b4175f6637c2acb092a769633430d5c34e7bac43f4418
creating metadata file for /root/.cache/huggingface/transformers/538d805f111b12a3ecfa07cbe838a6a3a5dd283a1257625c7048e80b6e60e264.4b56d965eb7203eb7c5b4175f6637c2acb092a769633430d5c34e7bac43f4418
loading file https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/spiece.model from cache at None
loading file https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/c9690b4d4af928a54f950649cb6720d7dbe1ade8c407e5fecd98e95c3796d64d.8d78aca3da666753f634bc7eddbb0752746b506d8d4d8ee4e6340f36a2254adc
loading file https://huggingface.co/abhilash1910/albert-squad-v2/resolve/main/added_tokens.json from cache at None
loading file https://hugging

{'answer': '340M', 'end': 65, 'score': 0.14847151935100555, 'start': 61}
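The 'start' and 'end' values in the result are character offsets into the context string. As a quick sanity check, the sketch below slices the context with those offsets. The `result` dictionary is copied from the output above rather than recomputed, so no model is needed to run it.

```python
context = (
    'Bert large is really big... it has 24 layers, for a total of 340M parameters.'
    'Altogether it is 1.34 GB so expect it to take a couple minutes to download '
    'to your Colab instance.'
)

# Copied from the pipeline output shown above
result = {'answer': '340M', 'end': 65, 'score': 0.14847151935100555, 'start': 61}

# 'start' is inclusive and 'end' is exclusive, like a Python slice
span = context[result['start']:result['end']]
print(span)  # 340M
```

The slice reproduces the 'answer' field, which confirms that the offsets refer to characters in the original context rather than to token positions.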