# Question Answering with BERT

In this notebook we’ll look at a particular type of extractive question answering (QA): answering a question about a passage by highlighting the segment of the passage that contains the answer. This involves fine-tuning a model that predicts a start position and an end position in the passage. We will use the [Stanford Question Answering Dataset (SQuAD) 2.0](https://rajpurkar.github.io/SQuAD-explorer/).

For this assignment, since [BERT](https://huggingface.co/transformers/model_doc/bert.html) is available as a pre-trained model, we will fine-tune it on the SQuAD dataset. We will use the **BERT base** model, which consists of 12 layers (transformer blocks), 12 attention heads, and 110 million parameters, and has an output size of 768 dimensions.

Another option is to use [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html), a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://drive.google.com/uc?id=1CrQGxVis6cPgNDazA0xBWFQWpIK3VVSv' alt="alternate text" width="width" height="height"/></div>

BERT was pre-trained on a large corpus of unlabelled text, including the entire English Wikipedia (2,500 million words) and the BooksCorpus (800 million words).
<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://drive.google.com/uc?id=1WZsEjjzBITh0I2YMUIKEeoPS8SdcMjHX' alt="alternate text" width="500" height="auto"/></div>

---

Let's install transformers

In [None]:
!pip install transformers



## Download the SQuAD Dataset

We will download the SQuAD v2.0 dataset using the following code. Fortunately, it is already split into a training set (train-v2.0.json) and a validation set (dev-v2.0.json). We will, however, need to prepare it for the model in the next steps.

In [None]:
import os
import requests
from pathlib import Path


squad_folder = "squad_data"
os.makedirs(squad_folder, exist_ok=True)

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'

for file in ['train-v2.0.json', 'dev-v2.0.json']:
    jfile = Path(squad_folder) / file
    # skip files that were already downloaded
    if not jfile.exists():
        # make the request to download data over HTTP
        res = requests.get(f'{url}{file}', stream=True)
        res.raise_for_status()
        # write to file in reasonably sized chunks
        with open(jfile, 'wb') as f:
            for chunk in res.iter_content(chunk_size=8192):
                f.write(chunk)

Let's look at the data, shall we!

In [None]:
import json

file = 'train-v2.0.json'
with open(f'{squad_folder}/{file}', 'r') as j:
    parsed = json.load(j)

In [None]:
parsed.keys()

dict_keys(['version', 'data'])

In [None]:
# print(json.dumps(parsed['data'][:1][0], indent=2, sort_keys=True))

<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://drive.google.com/uc?id=1BKdJEvlbLU5BoB2C54UV58avO5-xHl2o' alt="alternate text" width="800" height="auto"/></div>



The important pieces here are:
- Contexts — Paragraphs that contain the answers to the questions
- Questions — strings containing the question
- Answers — strings which are ‘extracts’ of the given contexts that provide an answer to the questions
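
The nesting behind these three pieces can be sketched on a hand-written toy record. The field names follow the SQuAD 2.0 layout (`data`, `paragraphs`, `context`, `qas`, `answers`, `plausible_answers`, `is_impossible`), but the article, questions, and answers below are invented for illustration.

```python
# A hand-written record in the SQuAD 2.0 layout (contents are invented).
toy = {
    "version": "v2.0",
    "data": [{
        "title": "Toy article",
        "paragraphs": [{
            "context": "BERT base has 12 layers.",
            "qas": [
                {"id": "q1", "question": "How many layers does BERT base have?",
                 "is_impossible": False,
                 "answers": [{"text": "12", "answer_start": 14}]},
                # unanswerable questions carry 'plausible_answers' instead
                {"id": "q2", "question": "Who trained it?",
                 "is_impossible": True,
                 "answers": [],
                 "plausible_answers": [{"text": "BERT base", "answer_start": 0}]},
            ],
        }],
    }],
}

rows = []
for article in toy["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            key = "plausible_answers" if qa["is_impossible"] else "answers"
            for answer in qa[key]:
                start = answer["answer_start"]
                # every answer is a literal character slice of its context
                assert context[start:start + len(answer["text"])] == answer["text"]
                rows.append((qa["id"], key, answer["text"]))

print(rows)
```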

While training, the model will read both the question and the context, and return the token positions of the predicted answer within the context.
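
As a rough illustration of the training signal, the sketch below computes a span loss as the average of two cross-entropy terms, one over start positions and one over end positions. To my understanding this matches what the Hugging Face question-answering heads return as `loss` when `start_positions` and `end_positions` are passed, but treat it as a plain-Python approximation with made-up logits, not the library's code. `cross_entropy` and `span_loss` are hypothetical helper names.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Average of the start-position and end-position cross-entropies."""
    return 0.5 * (cross_entropy(start_logits, start_pos)
                  + cross_entropy(end_logits, end_pos))


# made-up logits over a 4-token input: the model favours start=1, end=2
start_logits = [0.0, 4.0, 0.0, 0.0]
end_logits = [0.0, 0.0, 4.0, 0.0]

good = span_loss(start_logits, end_logits, start_pos=1, end_pos=2)
bad = span_loss(start_logits, end_logits, start_pos=3, end_pos=0)
print(good, bad)  # the loss is lower when the labels agree with the logits
```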

## Prepare the Data

For this step, we will use the following function to extract the contexts, questions and answers from each data file. 

In [None]:
def read_squad(path):
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    # iterate through all data in squad data
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                # check if we need to be extracting from 'answers' or 'plausible_answers'
                if 'plausible_answers' in qa.keys():
                    access = 'plausible_answers'
                else:
                    access = 'answers'
                for answer in qa[access]:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

In [None]:
# execute our read SQuAD function for training and validation sets
train_contexts, train_questions, train_answers = read_squad(f'{squad_folder}/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad(f'{squad_folder}/dev-v2.0.json')

In [None]:
# making sure our function above worked
assert len(train_contexts) == len(train_questions) == len(train_answers)
assert len(val_contexts) == len(val_questions) == len(val_answers)

In [None]:
print("paragraph:\n", train_contexts[0])
print("question:\n", train_questions[0])
print("answer:\n", train_answers[0])

paragraph:
 Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
question:
 When did Beyonce start becoming popular?
answer:
 {'text': 'in the late 1990s', 'answer_start': 269}


As we can see, the answers are in a different format from the paragraphs and the questions. Each item in the answers list is a dictionary where the answer string is stored under `text` and the starting character position of this answer within the context is provided in `answer_start`.

We need to train our model to find the start and end of our answer within the context, so we need to add an `answer_end` value as well. We will write the code to do so in the following cell.
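
The end-index arithmetic can be tried out on a toy string first. The sketch below is a simplified stand-in, not the notebook's function: `locate_answer` and `max_shift` are hypothetical names, and it returns `None` when the answer cannot be located so that the failure is visible.

```python
def locate_answer(context, text, start, max_shift=2):
    """Return (start, end) with context[start:end] == text, or None.

    Tries the annotated start first, then up to max_shift characters to the left.
    """
    for shift in range(max_shift + 1):
        s = start - shift
        if s >= 0 and context[s:s + len(text)] == text:
            return s, s + len(text)  # end index is exclusive
    return None


context = "She rose to fame in the late 1990s as lead singer."
text = "in the late 1990s"
true_start = context.find(text)

exact = locate_answer(context, text, true_start)
shifted = locate_answer(context, text, true_start + 1)  # annotation one char too far right
missing = locate_answer(context, text, true_start + 5)  # too far off to recover

print(exact, shifted, missing)
```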

In [None]:
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        answer_text = answer['text']
        start_idx = answer['answer_start']
        # then the end index is:
        end_idx = start_idx + len(answer_text)

        # SQuAD annotations are occasionally off by a character or two,
        # so the checks below are hard-coded around that known quirk

        # if answer is captured correctly within the start and end indices
        if context[start_idx:end_idx] == answer_text:
            answer['answer_end'] = end_idx
        # otherwise:
        else:
            # this means the answer is off by 1-2 characters
            for n in [1, 2]:
                if context[start_idx-n:end_idx-n] == answer_text:
                    answer['answer_start'] = start_idx - n
                    answer['answer_end'] = end_idx - n

In [None]:
# apply the function to our two answer lists
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

Let's check if that worked as well:

In [None]:
train_answers[0]

{'answer_end': 286, 'answer_start': 269, 'text': 'in the late 1990s'}

Expected output:

`{'answer_end': 286, 'answer_start': 269, 'text': 'in the late 1990s'}`



## Encoding

We are almost done preparing the data. We just need to convert our strings into tokens and then translate our `answer_start` and `answer_end` indices from character-position to token-position.

Tokenization is easily done using a built-in HuggingFace tokenizer like so:

In [None]:
# from transformers import BertTokenizerFast
from transformers import DistilBertTokenizerFast

# tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

Our context-question pairs are now represented as Encoding objects. These objects merge each corresponding context and question strings to create the Q&A format expected by BERT, which is simply both context and question concatenated but separated with a `[SEP]` token:

In [None]:
tokenizer.decode(train_encodings['input_ids'][0])

'[CLS] beyonce giselle knowles - carter ( / biːˈjɒnseɪ / bee - yon - say ) ( born september 4, 1981 ) is an american singer, songwriter, record producer and actress. born and raised in houston, texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of r & b girl - group destiny\'s child. managed by her father, mathew knowles, the group became one of the world\'s best - selling girl groups of all time. their hiatus saw the release of beyonce\'s debut album, dangerously in love ( 2003 ), which established her as a solo artist worldwide, earned five grammy awards and featured the billboard hot 100 number - one singles " crazy in love " and " baby boy ". [SEP] when did beyonce start becoming popular? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

This concatenated version is stored within the `input_ids` attribute of our encoding object. But rather than human-readable text, the data is stored as BERT-readable token IDs.

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])
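
To make these two keys concrete, here is a toy sketch of how a context-question pair could be turned into padded `input_ids` and a matching `attention_mask`. It uses whitespace splitting and a vocabulary built on the fly, so it is only an illustration of the layout, not of WordPiece or of the Hugging Face tokenizer. `encode_pair` is a hypothetical helper, and its truncation is a crude slice.

```python
PAD, CLS, SEP = "[PAD]", "[CLS]", "[SEP]"


def encode_pair(context, question, vocab, max_len):
    """Build [CLS] context [SEP] question [SEP], then pad to max_len."""
    tokens = [CLS] + context.lower().split() + [SEP] + question.lower().split() + [SEP]
    tokens = tokens[:max_len]  # crude truncation, for illustration only
    n_pad = max_len - len(tokens)
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens = tokens + [PAD] * n_pad
    # assign a fresh id to each unseen token
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


vocab = {PAD: 0, CLS: 1, SEP: 2}
enc = encode_pair("bert has layers", "what has layers", vocab, max_len=10)
print(enc["input_ids"])
print(enc["attention_mask"])
```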

The tokenizer is great, but it doesn’t produce the answer start and end token positions (**start_positions** and **end_positions**), which must be present in the encoding object for the model to train on. For that, we define a custom `add_token_positions` function:

In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # append start/end token position using char_to_token method
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        # if the end position is None, answer_end points at a character that maps
        # to no token (e.g. a space), so step back one character at a time until a token is found
        shift = 1
        while end_positions[-1] is None:
            end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - shift)
            shift += 1
    # update our encodings object with the new token-based start/end positions
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

In [None]:
# apply function to our data
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)

Our encoding objects now have two extra attributes: `start_positions` and `end_positions`. Each is simply a list containing the start/end token positions of the answer for the corresponding question-context pair.

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

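Expected output (when using the BERT tokenizer; the DistilBERT tokenizer used above does not return `token_type_ids`):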

`dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])`
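
The character-to-token translation can be illustrated without the real tokenizer. The sketch below uses a regex-based toy tokenizer that records character offsets, plus a hypothetical `char_to_token` helper that returns `None` when a character (such as a space) belongs to no token. It is a stand-in for the idea only: a real tokenizer also splits words into sub-word pieces and adds special tokens.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: words and punctuation with their (start, end) character spans."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_to_token(offsets, char_idx):
    """Index of the token covering char_idx, or None (e.g. for a space)."""
    for i, (_, start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return i
    return None


context = "She rose to fame in the late 1990s as lead singer."
answer = "in the late 1990s"
answer_start = context.find(answer)
answer_end = answer_start + len(answer)  # exclusive end index

offsets = tokenize_with_offsets(context)
start_token = char_to_token(offsets, answer_start)
end_token = char_to_token(offsets, answer_end)  # lands on the space after "1990s"

# step back one character at a time until a token is found
shift = 1
while end_token is None:
    end_token = char_to_token(offsets, answer_end - shift)
    shift += 1

print(start_token, end_token)
print([tok for tok, _, _ in offsets[start_token:end_token + 1]])
```

Note that stepping back only triggers when the exclusive end index falls on a character outside any token; if the answer is immediately followed by punctuation, the end index maps to that punctuation token instead.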



With that our data is ready!

## Initializing the Dataset

We are now ready to transform our data into the correct format for training with PyTorch. For this, we need to build a dataset object so we can feed the encodings into our model during training and validation.
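
Conceptually, a dataset object of this kind turns a dictionary of columns into per-example dictionaries, which are then grouped into batches. The plain-Python sketch below shows only that indexing pattern. `ColumnDataset` and `iter_batches` are hypothetical names, the numbers are invented, and no tensors or PyTorch classes are involved.

```python
class ColumnDataset:
    """Plain-Python stand-in for a map-style dataset over a dict of columns."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, idx):
        # pick row idx out of every column
        return {key: values[idx] for key, values in self.columns.items()}

    def __len__(self):
        return len(self.columns["input_ids"])


def iter_batches(dataset, batch_size):
    """Yield consecutive lists of examples, the last one possibly shorter."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


columns = {
    "input_ids": [[1, 5, 2], [1, 6, 2], [1, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "start_positions": [1, 1, 1],
    "end_positions": [1, 1, 1],
}
ds = ColumnDataset(columns)
batch_sizes = [len(batch) for batch in iter_batches(ds, batch_size=2)]

print(ds[1])
print(batch_sizes)
```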

In [None]:
import torch
from torch.utils.data import Dataset

class SquadDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

Build datasets for both our training and validation sets

In [None]:
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

(Optional) Build partial datasets for both our training and validation sets

In [None]:
import torch.utils.data as data_utils

train_dataset = data_utils.Subset(train_dataset, torch.arange(10000))
val_dataset = data_utils.Subset(val_dataset, torch.arange(1000))

In [None]:
print("length train:", len(train_dataset))
print("length val:", len(val_dataset))

length train: 10000
length val: 1000


## Fine-Tune

As usual, we first initialize a `DataLoader` for each of the training and validation sets, which we'll use to load batches while fine-tuning and evaluating our model.

In [None]:
from torch.utils.data import DataLoader

batch_size = 8

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

In [None]:
# setup GPU/CPU, for this assignment it is recommended to use GPU (cuda)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cuda


In [None]:
!pip install seqeval



In [None]:
from transformers import BertForQuestionAnswering
from transformers import DistilBertForQuestionAnswering
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm, trange
from seqeval.metrics import f1_score, accuracy_score
import numpy as np

In [None]:
# initialize model for QA
# model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

In [None]:
# move model over to detected device
model.to(device)

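# initialize the AdamW optimizer (Adam with decoupled weight decay); note that the decay
# only takes effect if weight_decay is passed, since transformers' AdamW defaults it to 0.0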
optimizer = AdamW(model.parameters(), lr=5e-5, eps=1e-8)

# set epochs
epochs = 3

num_samples = len(train_dataset)
batches_per_epoch = int(num_samples/batch_size)
gradient_accumulation_steps = 1
total_steps = int(epochs*batches_per_epoch/gradient_accumulation_steps)
warmup_steps = int(0.1*total_steps)

scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

## Store the average loss after each epoch so we can plot them.
loss_values, validation_loss_values = [], []

for _ in trange(epochs, desc="Epoch"):

    # Reset the total loss for this epoch.
    train_losses = 0
    for step, batch in enumerate(train_loader):
        # Put the model into training mode
        model.train()

        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)

        # train model
        outputs = model(input_ids, 
                        attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # track loss
        loss = outputs.loss/gradient_accumulation_steps
        loss.backward()
        train_losses += loss.item()

        # update parameters
        optimizer.step()
        model.zero_grad()
        scheduler.step()

    # Calculate the average loss over the training data
    avg_train_loss = train_losses / len(train_loader)
    print("train loss: {}".format(avg_train_loss))

    # Store the loss value for plotting the learning curve
    loss_values.append(avg_train_loss)

    print('Evaluating...')
    val_losses = 0
    predictions , true_labels = [], []
    acc = []
    for batch in val_loader:
        # Put the model into evaluation mode
        model.eval()

        # Telling the model not to compute or store gradients,
        # saving memory and speeding up validation
        with torch.no_grad():
            # pull batched items from loader
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            start_true = batch['start_positions'].to(device)
            end_true = batch['end_positions'].to(device)

            # make predictions
            outputs = model(input_ids, 
                            attention_mask=attention_mask,
                            start_positions=start_true,
                            end_positions=end_true)
            
            # track loss
            loss = outputs.loss
            val_losses += loss.item()

            # pull prediction tensors out and argmax to get predicted tokens
            start_pred = torch.argmax(outputs['start_logits'], dim=1)
            end_pred = torch.argmax(outputs['end_logits'], dim=1)
            # calculate accuracy for both and append to accuracy list
            acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
            acc.append(((end_pred == end_true).sum()/len(end_pred)).item())

    # Calculate the average loss over the validation data
    avg_val_loss = val_losses / len(val_loader)
    print("val_loss: {}".format(avg_val_loss))

    # calculate accuracy
    acc = sum(acc)/len(acc)
    print("Validation Accuracy: {}".format(acc))

    # Store the val loss value for plotting the learning curve
    validation_loss_values.append(avg_val_loss)
    print()

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

train loss: 2.478915125322342
Evaluating...


Epoch:  33%|███▎      | 1/3 [17:26<34:53, 1046.59s/it]

val_loss: 1.8885453548431397
Validation Accuracy: 0.4885

train loss: 0.9362878883957862
Evaluating...


Epoch:  67%|██████▋   | 2/3 [34:54<17:27, 1047.48s/it]

val_loss: 1.852135550737381
Validation Accuracy: 0.522

train loss: 0.4579455568432808
Evaluating...


Epoch: 100%|██████████| 3/3 [52:21<00:00, 1047.30s/it]

val_loss: 2.1260898740291596
Validation Accuracy: 0.533
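
The learning-rate schedule used in the training cell can be sketched in a few lines. The function below reproduces, to my understanding, the shape of a linear schedule with warmup: the rate climbs linearly from zero to its peak over the warmup steps, then decays linearly back to zero. `linear_warmup_factor` is a hypothetical name, and the step counts are derived from the settings used in this notebook (10,000 training samples, batch size 8, 3 epochs, 10% warmup, peak rate 5e-5).

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


epochs, num_samples, batch_size, base_lr = 3, 10000, 8, 5e-5
total_steps = epochs * (num_samples // batch_size)  # optimizer steps over all epochs
warmup_steps = total_steps // 10                    # 10% of the steps for warmup

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```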






Let's save our model so we can use it in the next session!

In [None]:
# model_path = 'models/bert-custom'
model_path = 'models/distilbert-custom'

model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('models/distilbert-custom/tokenizer_config.json',
 'models/distilbert-custom/special_tokens_map.json',
 'models/distilbert-custom/vocab.txt',
 'models/distilbert-custom/added_tokens.json',
 'models/distilbert-custom/tokenizer.json')

## Inference

Load the saved model and tokenizer

In [None]:
# model = BertForQuestionAnswering.from_pretrained(model_path)
# tokenizer = BertTokenizerFast.from_pretrained(model_path)

model = DistilBertForQuestionAnswering.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

In [None]:
def question_answer(question, text):
    
    # tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)
    
    # string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    
    # segment IDs
    # first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)    
    # number of tokens in segment A (question)
    num_seg_a = sep_idx+1    
    # number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a
    
    # list of 0s and 1s for segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b 
    assert len(segment_ids) == len(input_ids)
    
    # model output using input_ids
    ## for bert (+ using token_type_ids)
    if isinstance(model, BertForQuestionAnswering):
        output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    elif isinstance(model, DistilBertForQuestionAnswering):
        output = model(torch.tensor([input_ids]))
    else:
        raise TypeError("Inference code does not support this model!")
    
    # reconstructing the answer
    answer_start = torch.argmax(output.start_logits, dim=1)
    answer_end = torch.argmax(output.end_logits, dim=1)

    # again, because BERT uses wordpiece tokenization...
    answer = ''  
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]
                
    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."
    
    print("\nPredicted answer:\n{}".format(answer.capitalize()))

In [None]:
text = """
my name is Hasan, I like machine learning.
"""

In [None]:
question = "What does hasan like?"

In [None]:
question_answer(question, text)


Predicted answer:
Machine learning .
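
The space before the final period comes from how the answer is rebuilt: sub-word pieces prefixed with `##` are glued onto the previous token, and every other token, punctuation included, is joined with a space. A minimal sketch of that reconstruction, using a hypothetical `join_wordpieces` helper and hand-written token lists:

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into the preceding token, then join with spaces."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: append without a space
        else:
            words.append(tok)
    return " ".join(words)


plain = join_wordpieces(["machine", "learning", "."])
merged = join_wordpieces(["token", "##ization", "works"])

print(plain)
print(merged)
```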


In [None]:
# This might be able to download the model folder 
from google.colab import files
files.download(model_path) 