# Question Answering with BERT

In this notebook we'll look at extractive QA: answering a question about a passage by highlighting the segment of the passage that contains the answer. This involves fine-tuning a model that predicts a start position and an end position in the passage. We will use the [Stanford Question Answering Dataset (SQuAD) 2.0](https://rajpurkar.github.io/SQuAD-explorer/).

For this assignment, and since [BERT](https://huggingface.co/transformers/model_doc/bert.html) is available as a pre-trained model, we will be fine-tuning it on the SQuAD dataset. We will use the **BERT base** model, which consists of 12 layers (transformer blocks), 12 attention heads, and 110 million parameters, and has a hidden (output) size of 768 dimensions.
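
The 110 million figure can be roughly reproduced by hand. The sketch below is plain arithmetic, not a call into the library; the vocabulary size (30522), maximum position count (512), and feed-forward size (3072) are assumptions taken from the published `bert-base-uncased` configuration, and only the encoder plus pooler are counted (no task-specific head).

```python
# Rough parameter count for a BERT-base encoder (weights + biases).
# Assumed sizes, taken from the published bert-base-uncased configuration.
VOCAB_SIZE = 30522      # WordPiece vocabulary
MAX_POSITIONS = 512     # learned position embeddings
TYPE_VOCAB = 2          # segment A / segment B
HIDDEN = 768            # hidden (output) size
FFN = 3072              # feed-forward inner size (4 * HIDDEN)
LAYERS = 12             # transformer blocks


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm  # Q, K, V, output projection
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

Note that the number of attention heads does not appear in the count: the heads split the 768-dimensional projections rather than adding parameters.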

Another option is to use [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html), which is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://drive.google.com/uc?id=1CrQGxVis6cPgNDazA0xBWFQWpIK3VVSv' alt="alternate text" width="width" height="height"/></div>

BERT was pre-trained on a large corpus of unlabelled text, including the entire English Wikipedia (2,500 million words) and the BooksCorpus (800 million words).
<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://drive.google.com/uc?id=1WZsEjjzBITh0I2YMUIKEeoPS8SdcMjHX' alt="alternate text" width="500" height="auto"/></div>

---

Let's install transformers

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninsta

## Download the SQuAD Dataset

We will download the SQuAD v2.0 dataset using the following code. Fortunately, it is already split into a training file (train-v2.0.json) and a validation file (dev-v2.0.json). We will, however, need to prepare it for the model in the next steps.

In [None]:
import os
import requests

squad_folder = "squad_data"
os.makedirs(squad_folder, exist_ok=True)

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'

for file in ['train-v2.0.json', 'dev-v2.0.json']:
    # make the request to download data over HTTP
    res = requests.get(f'{url}{file}')
    # stop early if the download failed (e.g. a 404 response)
    res.raise_for_status()
    # write to file in reasonably sized chunks
    with open(f'{squad_folder}/{file}', 'wb') as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)

Let's look at the data, shall we!

In [None]:
import json

file = 'train-v2.0.json'
with open(f'{squad_folder}/{file}', 'r') as j:
    parsed = json.load(j)

In [None]:
parsed.keys()

dict_keys(['version', 'data'])

In [None]:
print(json.dumps(parsed['data'][0], indent=2, sort_keys=True))

{
  "paragraphs": [
    {
      "context": "Beyonc\u00e9 Giselle Knowles-Carter (/bi\u02d0\u02c8j\u0252nse\u026a/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyonc\u00e9's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\".",
      "qas": [
        {
          "answers": [
            {
              "answer_start": 269,
              "text": "in the late 1990s"
            }
          ],
          "id": "56be85543aeaaa14008c9063",
          "is_i

The important pieces here are:
- Contexts — Paragraphs that contain the answers to the questions
- Questions — strings containing the question
- Answers — strings which are ‘extracts’ of the given contexts that provide an answer to the questions

During training, the model reads both the question and the context, and predicts the token positions of the answer's start and end within the context.
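
To make "predicting a start and an end position" concrete, here is a minimal sketch of how two per-token score lists could be turned into one answer span at inference time. The function name `best_span`, the `max_answer_len` cap, and the scores are illustrative assumptions, not part of this notebook or of the `transformers` API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) token indices maximising start + end score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # only consider ends at or after the start, within max_answer_len tokens
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# toy scores for a 6-token input (made-up numbers)
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.4, 2.5, 4.0, 0.2]
print(best_span(start_logits, end_logits))  # (2, 4)
```

Requiring `end >= start` matters because taking the two argmaxes independently can produce an end that precedes the start.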

## Prepare the Data

For this step, we will use the following function to extract the contexts, questions and answers from each data file. 

In [None]:
def read_squad(path):
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    # iterate through all data in squad data
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                # check if we need to be extracting from 'answers' or 'plausible_answers'
                if 'plausible_answers' in qa.keys():
                    access = 'plausible_answers'
                else:
                    access = 'answers'
                for answer in qa[access]:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers
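
One consequence of reading `plausible_answers` is that unanswerable questions are paired with a span just like answerable ones. The sketch below counts both kinds on a made-up, two-question dictionary laid out like a SQuAD v2.0 file; the helper name `count_question_types` and the toy data are assumptions for illustration, and it relies on the `is_impossible` flag that SQuAD v2.0 attaches to each question.

```python
from collections import Counter


def count_question_types(squad_dict):
    """Count answerable vs. unanswerable questions in a SQuAD v2.0-style dict."""
    counts = Counter()
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                key = "unanswerable" if qa.get("is_impossible", False) else "answerable"
                counts[key] += 1
    return counts


# made-up example in the same nested layout as train-v2.0.json
toy = {"data": [{"title": "Toy", "paragraphs": [{
    "context": "BERT was released in 2018 by Google.",
    "qas": [
        {"question": "When was BERT released?", "is_impossible": False,
         "answers": [{"text": "2018", "answer_start": 21}]},
        {"question": "Who released GPT?", "is_impossible": True, "answers": [],
         "plausible_answers": [{"text": "Google", "answer_start": 29}]},
    ],
}]}]}

print(count_question_types(toy))
```

Running the same helper on the parsed training file would show how many of the extracted spans come from unanswerable questions.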

In [None]:
# execute our read SQuAD function for training and validation sets
train_contexts, train_questions, train_answers = read_squad(f'{squad_folder}/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad(f'{squad_folder}/dev-v2.0.json')

In [None]:
print("paragraph:\n", train_contexts[0])
print("question:\n", train_questions[0])
print("answer:\n", train_answers[0])

paragraph:
 Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
question:
 When did Beyonce start becoming popular?
answer:
 {'text': 'in the late 1990s', 'answer_start': 269}


As we can see, the answers are in a different format from the paragraphs and the questions. Each item in the answers list is a dictionary: the answer string is stored under `text`, and the character position where it starts within the context is stored under `answer_start`.

We need to train our model to find the start and end of our answer within the context, so we need to add an `answer_end` value as well. We will write the code to do so in the following cell.

In [None]:
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        answer_text = answer['text']
        start_idx = answer['answer_start']
        # then the end index is:
        end_idx = start_idx + len(answer_text)

        # SQuAD answer offsets are sometimes off by a character or two,
        # so we check the slice and correct the indices where needed

        # if answer is captured correctly within the start and end indices
        if context[start_idx:end_idx] == answer_text:
            answer['answer_end'] = end_idx
        # otherwise:
        else:
            # this means the answer is off by 1-2 characters
            for n in [1, 2]:
                if context[start_idx-n:end_idx-n] == answer_text:
                    answer['answer_start'] = start_idx - n
                    answer['answer_end'] = end_idx - n

In [None]:
# apply the function to our two answer lists
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

Let's check if that worked:

In [None]:
train_answers[0]

{'answer_end': 286, 'answer_start': 269, 'text': 'in the late 1990s'}
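
A single example is a weak check, so it may be worth counting how many answers still fail to line up with their context. The helper below is a hypothetical sketch (the name `count_misaligned` and the toy data are not from the notebook); it treats a missing `answer_end` as a mismatch, which is what happens when neither the original offset nor the 1-2 character shift matched.

```python
def count_misaligned(answers, contexts):
    """Count answers whose [answer_start, answer_end) slice differs from their text."""
    bad = 0
    for answer, context in zip(answers, contexts):
        # a missing answer_end yields an empty slice and is counted as misaligned
        span = context[answer["answer_start"]:answer.get("answer_end", 0)]
        if span != answer["text"]:
            bad += 1
    return bad


context = "She rose to fame in the late 1990s as lead singer."

# correctly aligned answer
good = {"text": "in the late 1990s", "answer_start": context.index("in the late 1990s")}
good["answer_end"] = good["answer_start"] + len(good["text"])

# deliberately wrong offset and no answer_end
shifted = {"text": "lead singer", "answer_start": context.index("lead singer") + 3}

print(count_misaligned([good, shifted], [context, context]))  # 1
```

On the real data this would be called as `count_misaligned(train_answers, train_contexts)`.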

## Encoding

We are almost done preparing the data. We just need to convert our strings into tokens and then translate our `answer_start` and `answer_end` indices from character-position to token-position.

Tokenization is easily done using a built-in HuggingFace tokenizer like so:

In [None]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

Our context-question pairs are now stored in an encoding object (a `BatchEncoding`). It merges each context with its corresponding question to create the Q&A format expected by BERT: the context and the question concatenated, with a `[SEP]` token between them:

In [None]:
tokenizer.decode(train_encodings['input_ids'][0])

'[CLS] beyonce giselle knowles - carter ( / biːˈjɒnseɪ / bee - yon - say ) ( born september 4, 1981 ) is an american singer, songwriter, record producer and actress. born and raised in houston, texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of r & b girl - group destiny\'s child. managed by her father, mathew knowles, the group became one of the world\'s best - selling girl groups of all time. their hiatus saw the release of beyonce\'s debut album, dangerously in love ( 2003 ), which established her as a solo artist worldwide, earned five grammy awards and featured the billboard hot 100 number - one singles " crazy in love " and " baby boy ". [SEP] when did beyonce start becoming popular? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

This concatenated version is stored under the `input_ids` key of our encoding object. Rather than human-readable text, however, the data is stored as BERT-readable token IDs.

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
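
The three keys can be illustrated without the real tokenizer. The sketch below assembles a context-question pair by hand from already-split toy tokens; `build_pair` is a hypothetical helper, it does no truncation, and it works on token strings rather than the integer IDs that `input_ids` actually holds.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def build_pair(context_tokens, question_tokens, max_len):
    """Assemble [CLS] context [SEP] question [SEP] and pad up to max_len."""
    tokens = [CLS] + context_tokens + [SEP] + question_tokens + [SEP]
    # segment 0 covers [CLS] + context + first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(context_tokens) + 2) + [1] * (len(question_tokens) + 1)
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    return {
        "tokens": tokens + [PAD] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


pair = build_pair(["rose", "to", "fame", "in", "1990s"], ["when", "?"], max_len=12)
for key, value in pair.items():
    print(key, value)
```

`token_type_ids` tells BERT which segment a token belongs to, while `attention_mask` separates real tokens from the `[PAD]` tokens visible in the decoded example.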

The tokenizer is great, but it doesn't produce the answer start and end token positions, which we need in our encoding object. For that, we define a custom `add_token_positions` function:

In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # map character offsets to token positions with char_to_token;
        # answer_end is exclusive, so the last answer character is answer_end - 1
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if a position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    # update our encodings object with the new token-based start/end positions
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

In [None]:
# apply function to our data
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
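
The character-to-token mapping is easier to follow on a toy case. The sketch below uses a simple regex tokenizer in place of WordPiece, so `tokenize_with_offsets` and this stand-alone `char_to_token` are illustrative stand-ins rather than the library's implementation. It shows why a lookup at the exclusive `answer_end` offset can come back as `None`: that offset often points at the space after the answer, which belongs to no token.

```python
import re


def tokenize_with_offsets(text):
    """Split into words and punctuation, returning (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_to_token(offsets, char_idx):
    """Index of the token covering char_idx, or None (e.g. for a space)."""
    for i, (_, start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return i
    return None


context = "rose to fame in the late 1990s as lead singer"
offsets = tokenize_with_offsets(context)

answer = "in the late 1990s"
start_char = context.index(answer)
end_char = start_char + len(answer)  # exclusive: points at the space after "1990s"

print(char_to_token(offsets, start_char))    # 3
print(char_to_token(offsets, end_char))      # None
print(char_to_token(offsets, end_char - 1))  # 6
```

Tokens 3 through 6 are `in`, `the`, `late`, `1990s`, so the token-level span covers exactly the answer.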

Now our encoding objects have two extra keys: `start_positions` and `end_positions`. Each is simply a list containing the start or end token position of the answer for the corresponding question-context pair.

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

With that our data is ready!

## Initializing the Dataset

We are now ready to transform our data into the correct format for training with PyTorch. For this, we need to build a dataset object so we can feed the data into our model during training and validation.

In [None]:
import torch
from torch.utils.data import Dataset

class SquadDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

In [None]:
# build datasets for both our training and validation sets
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [None]:
len(train_dataset)

130319

In [None]:
len(val_dataset)

26232

## Fine-Tune

As usual, we first initialize a `DataLoader` for the training and validation sets, which we'll use to load batches while fine-tuning and evaluating our model.

In [None]:
from torch.utils.data import DataLoader

# set batch size
BATCH_SIZE = 32

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
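
As a quick sanity check on the cost of one epoch, the number of batches follows directly from the dataset lengths printed above. This is plain arithmetic and assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

BATCH_SIZE = 32
train_examples, val_examples = 130319, 26232  # dataset lengths printed above

# with drop_last=False the final partial batch is kept, hence the ceiling
train_batches = math.ceil(train_examples / BATCH_SIZE)
val_batches = math.ceil(val_examples / BATCH_SIZE)

print(train_batches, val_batches)  # 4073 820
```

These values should match `len(train_loader)` and `len(val_loader)`.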

In [None]:
# setup GPU/CPU, for this assignment it is recommended to use GPU (cuda)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cpu


In [None]:
from transformers import BertForQuestionAnswering
from transformers import AdamW
from tqdm import tqdm, trange