<a href="https://colab.research.google.com/github/aksiitbhu/Question-Answering/blob/main/Question_Answering_Squad_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Question Answering❓**
with fine-tuned BERT on SQuAD 2.0.  

Question answering comes in many forms. We’ll look at extractive QA: answering a question about a passage by highlighting the segment of the passage that contains the answer. This involves fine-tuning a model that predicts a start position and an end position in the passage. More specifically, we will fine-tune the [bert-base-uncased](https://huggingface.co/bert-base-uncased) model on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
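To make the "start position and end position" idea concrete, here is a minimal sketch of how a span can be chosen from per-token scores. The scores are made-up numbers, and `best_span` and `max_answer_len` are hypothetical names, not part of any library. The notebook itself takes an independent `argmax` over the start and end logits; the joint search here is a common alternative that rules out spans whose end comes before their start.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
  """Return (start, end) maximising start_scores[s] + end_scores[e]
  subject to s <= e < s + max_answer_len."""
  best, best_score = (0, 0), float("-inf")
  for s, s_score in enumerate(start_scores):
    for e in range(s, min(s + max_answer_len, len(end_scores))):
      score = s_score + end_scores[e]
      if score > best_score:
        best, best_score = (s, e), score
  return best

# Toy passage tokens with illustrative (made-up) scores
tokens = ["greece", "consists", "of", "nine", "geographic", "regions"]
start_scores = [0.1, 0.0, 0.2, 3.5, 0.3, 0.1]
end_scores = [0.0, 0.1, 0.1, 2.9, 0.4, 1.0]

s, e = best_span(start_scores, end_scores)
print(tokens[s:e + 1])  # ['nine']
```

The predicted answer is then simply the slice of passage tokens between the two indices.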

I followed [this tutorial](https://huggingface.co/transformers/v3.2.0/custom_datasets.html#question-answering-with-squad-2-0) from the Hugging Face documentation on fine-tuning BERT on custom datasets, which in our case is SQuAD 2.0.

**Some first imports**

In [1]:
import requests
import json
import torch
import os
from tqdm import tqdm

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Connecting Google Drive in order to save the model**

In [2]:
if not os.path.exists('/content/drive/MyDrive/BERT-SQuAD'):
  os.mkdir('/content/drive/MyDrive/BERT-SQuAD')

In [3]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Coll

### **Download SQuAD 2.0 ⬇️**

SQuAD consists of two JSON files.

* train dataset
* validation dataset

In [4]:
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-10-13 12:08:07--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2023-10-13 12:08:10 (436 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2023-10-13 12:08:10--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2023-10-13 12:08:11 (326 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



## **Data preprocessing 💽**

In this section of data preprocessing, our goal is to get our data in the following form:

<div>
<img src="http://www.mccormickml.com/assets/BERT/SQuAD/input_formatting.png" width="650"/>
</div>


In short, we have to do the following:

1. Extract the data from the JSON files
2. Tokenize the data
3. Define the datasets

In [5]:
# Load the training dataset and take a look at it
with open('train-v2.0.json', 'rb') as f:
  squad = json.load(f)

In [6]:
# Each 'data' dict has two keys (title and paragraphs)
squad['data'][0].keys()

dict_keys(['title', 'paragraphs'])

In [7]:
print(squad['data'][0])

{'title': 'Beyoncé', 'paragraphs': [{'qas': [{'question': 'When did Beyonce start becoming popular?', 'id': '56be85543aeaaa14008c9063', 'answers': [{'text': 'in the late 1990s', 'answer_start': 269}], 'is_impossible': False}, {'question': 'What areas did Beyonce compete in when she was growing up?', 'id': '56be85543aeaaa14008c9065', 'answers': [{'text': 'singing and dancing', 'answer_start': 207}], 'is_impossible': False}, {'question': "When did Beyonce leave Destiny's Child and become a solo singer?", 'id': '56be85543aeaaa14008c9066', 'answers': [{'text': '2003', 'answer_start': 526}], 'is_impossible': False}, {'question': 'In what city and state did Beyonce  grow up? ', 'id': '56bf6b0f3aeaaa14008c9601', 'answers': [{'text': 'Houston, Texas', 'answer_start': 166}], 'is_impossible': False}, {'question': 'In which decade did Beyonce become famous?', 'id': '56bf6b0f3aeaaa14008c9602', 'answers': [{'text': 'late 1990s', 'answer_start': 276}], 'is_impossible': False}, {'question': 'In what 

In [8]:
# Find the group about Greece
gr = -1
for idx, group in enumerate(squad['data']):
  print(group['title'])
  if group['title'] == 'Greece':
    gr = idx
    print(gr)
    break

Beyoncé
Frédéric_Chopin
Sino-Tibetan_relations_during_the_Ming_dynasty
IPod
The_Legend_of_Zelda:_Twilight_Princess
Spectre_(2015_film)
2008_Sichuan_earthquake
New_York_City
To_Kill_a_Mockingbird
Solar_energy
Kanye_West
Buddhism
American_Idol
Dog
2008_Summer_Olympics_torch_relay
Genome
Comprehensive_school
Republic_of_the_Congo
Prime_minister
Institute_of_technology
Wayback_Machine
Dutch_Republic
Symbiosis
Canadian_Armed_Forces
Cardinal_(Catholicism)
Iranian_languages
Lighting
Separation_of_powers_under_the_United_States_Constitution
Architecture
Human_Development_Index
Southern_Europe
BBC_Television
Arnold_Schwarzenegger
Plymouth
Heresy
Warsaw_Pact
Materialism
Christian
Sony_Music_Entertainment
Oklahoma_City
Hunter-gatherer
United_Nations_Population_Fund
Russian_Soviet_Federative_Socialist_Republic
Alexander_Graham_Bell
Pub
Internet_service_provider
Comics
Saint_Helena
Aspirated_consonant
Hydrogen
Space_Race
Web_browser
BeiDou_Navigation_Satellite_System
Canon_law
Communications_in_Som

In [9]:
# let's check on Greece, which is at index 186 (0-based indexing)
# we can see that we have a context and many questions and answers following
squad['data'][186]

{'title': 'Greece',
 'paragraphs': [{'qas': [{'question': 'On what peninsula is Greece located?',
     'id': '572611d489a1e219009ac1e0',
     'answers': [{'text': 'Balkan peninsula', 'answer_start': 115}],
     'is_impossible': False},
    {'question': 'How many geographic regions make up Greece?',
     'id': '572611d489a1e219009ac1e1',
     'answers': [{'text': 'nine', 'answer_start': 287}],
     'is_impossible': False},
    {'question': 'How long is the coastline of Greece?',
     'id': '572611d489a1e219009ac1e2',
     'answers': [{'text': '8,498 mi', 'answer_start': 708}],
     'is_impossible': False},
    {'question': "How many of Greece's islands are inhabited?",
     'id': '572611d489a1e219009ac1e3',
     'answers': [{'text': '227', 'answer_start': 774}],
     'is_impossible': False},
    {'question': 'What is the tallest mountain in Greece?',
     'id': '572611d489a1e219009ac1e4',
     'answers': [{'text': 'Mount Olympus', 'answer_start': 839}],
     'is_impossible': False}],
  

In [10]:
# and this is the context given for Greece
squad['data'][186]['paragraphs'][0]['context']

'Greece is strategically located at the crossroads of Europe, Asia, and Africa. Situated on the southern tip of the Balkan peninsula, it shares land borders with Albania to the northwest, the Republic of Macedonia and Bulgaria to the north and Turkey to the northeast. Greece consists of nine geographic regions: Macedonia, Central Greece, the Peloponnese, Thessaly, Epirus, the Aegean Islands (including the Dodecanese and Cyclades), Thrace, Crete, and the Ionian Islands. The Aegean Sea lies to the east of the mainland, the Ionian Sea to the west, and the Mediterranean Sea to the south. Greece has the longest coastline on the Mediterranean Basin and the 11th longest coastline in the world at 13,676 km (8,498 mi) in length, featuring a vast number of islands, of which 227 are inhabited. Eighty percent of Greece is mountainous, with Mount Olympus being the highest peak at 2,918 metres (9,573 ft).'

### **Get data 📁**

Now that we have had a taste of the JSON data format, let's extract our data and store it in some data structures.

In [11]:
def read_data(path):
  # load the json file
  with open(path, 'rb') as f:
    squad = json.load(f)

  contexts = []
  questions = []
  answers = []

  for group in squad['data']:
    for passage in group['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
        question = qa['question']
        for answer in qa['answers']:
          contexts.append(context)
          questions.append(question)
          answers.append(answer)

  return contexts, questions, answers
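As a small illustration of what this flattening does, the sketch below applies the same loops to a toy in-memory dictionary. `flatten_squad` and the toy data are hypothetical and only mirror `read_data`. Because the innermost loop runs over `qa['answers']`, a question whose answer list is empty (the unanswerable questions in SQuAD 2.0) contributes no rows, and a question with several reference answers contributes one row per answer.

```python
def flatten_squad(squad):
  """Hypothetical in-memory twin of read_data: same loops, but takes a dict."""
  contexts, questions, answers = [], [], []
  for group in squad['data']:
    for passage in group['paragraphs']:
      for qa in passage['qas']:
        for answer in qa['answers']:
          contexts.append(passage['context'])
          questions.append(qa['question'])
          answers.append(answer)
  return contexts, questions, answers

# Toy data in the SQuAD 2.0 layout: one answerable and one unanswerable question
toy = {'data': [{'title': 'Toy', 'paragraphs': [{
  'context': 'Athens is the capital of Greece.',
  'qas': [
    {'question': 'What is the capital of Greece?', 'is_impossible': False,
     'answers': [{'text': 'Athens', 'answer_start': 0}]},
    {'question': 'What is the capital of Atlantis?', 'is_impossible': True,
     'answers': []},
  ]}]}]}

contexts, questions, answers = flatten_squad(toy)
print(len(questions))  # 1
print(questions)       # ['What is the capital of Greece?']
```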

Put the contexts, questions and answers for training and validation into the appropriate lists.

In [12]:
train_contexts, train_questions, train_answers = read_data('train-v2.0.json')
valid_contexts, valid_questions, valid_answers = read_data('dev-v2.0.json')

In [13]:
# print a random question and answer
print(f'There are {len(train_questions)} questions')
print(train_questions[-10000])
print(train_answers[-10000])

There are 86821 questions
What is a modern common occurence with antibiotics?
{'text': 'resistance of bacteria', 'answer_start': 17}


As you can see above, each answer is a dictionary with the answer text and an integer giving the character index at which the answer starts in the context. SQuAD does not give us the end index of the answer, so we have to compute it ourselves as the start index plus the length of the answer text. Note that SQuAD answers are sometimes off by one or two characters, so we also adjust for that.

In [14]:
def add_end_idx(answers, contexts):
  for answer, context in zip(answers, contexts):
    gold_text = answer['text']
    start_idx = answer['answer_start']
    end_idx = start_idx + len(gold_text)

    # sometimes squad answers are off by a character or two so we fix this
    if context[start_idx:end_idx] == gold_text:
      answer['answer_end'] = end_idx
    elif context[start_idx-1:end_idx-1] == gold_text:
      answer['answer_start'] = start_idx - 1
      answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
      answer['answer_start'] = start_idx - 2
      answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers, train_contexts)
add_end_idx(valid_answers, valid_contexts)

In [15]:
# You can see that now we get the answer_end also
print(train_questions[-10000])
print(train_answers[-10000])

What is a modern common occurence with antibiotics?
{'text': 'resistance of bacteria', 'answer_start': 17, 'answer_end': 39}
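The off-by-one correction is easier to see on a toy example. The sketch below is a hypothetical single-example variant of `add_end_idx` (the name `align_answer` and the toy data are assumptions). It also adds two small guards that the function above does not have: it never tries a negative start index, and it returns `None` when no alignment is found.

```python
def align_answer(answer, context):
  """Hypothetical single-example version of add_end_idx."""
  text, start = answer['text'], answer['answer_start']
  # Try the labelled start, then one and two characters earlier
  for shift in (0, 1, 2):
    s = start - shift
    if s >= 0 and context[s:s + len(text)] == text:
      return {'text': text, 'answer_start': s, 'answer_end': s + len(text)}
  return None  # no alignment found within two characters

context = 'Greece consists of nine geographic regions.'
# 'nine' really starts at character 19; this label is deliberately off by one
noisy = {'text': 'nine', 'answer_start': 20}

print(align_answer(noisy, context))
# {'text': 'nine', 'answer_start': 19, 'answer_end': 23}
```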


### **Tokenization 🔢**

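We have to tokenize our data into a form that the BERT model accepts. We are going to use `BertTokenizerFast` instead of `BertTokenizer`, as the former is much faster and also provides the `char_to_token` mapping that we rely on later. Since we are going to train our model in batches, we set `padding=True` so that all sequences are padded to the same length, and `truncation=True` so that pairs longer than the model's maximum input length are cut off.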

In [16]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
valid_encodings = tokenizer(valid_contexts, valid_questions, truncation=True, padding=True)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Let's see what we got after tokenizing our data.

In [17]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
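To show what `padding=True` adds, here is a rough pure-Python sketch of padding a batch to its longest sequence and building the matching attention mask. `pad_batch` is a hypothetical helper, not the tokenizer's implementation, and the two ID sequences are toy inputs rather than real context-question pairs.

```python
def pad_batch(sequences, pad_id=0):
  """Pad every sequence to the longest one and build the attention mask."""
  longest = max(len(seq) for seq in sequences)
  input_ids, attention_mask = [], []
  for seq in sequences:
    n_pad = longest - len(seq)
    input_ids.append(seq + [pad_id] * n_pad)              # [PAD] has id 0
    attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
  return input_ids, attention_mask

# 101 = [CLS], 102 = [SEP], 20773 = 'beyonce'
batch = [[101, 20773, 102], [101, 20773, 1005, 1055, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 20773, 102, 0, 0], [101, 20773, 1005, 1055, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model ignore the `[PAD]` positions during training.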

In [18]:
no_of_encodings = len(train_encodings['input_ids'])
print(f'We have {no_of_encodings} context-question pairs')

We have 86821 context-question pairs


In [19]:
train_encodings['input_ids'][0]

[101,
 20773,
 21025,
 19358,
 22815,
 1011,
 5708,
 1006,
 1013,
 12170,
 23432,
 29715,
 3501,
 29678,
 12325,
 29685,
 1013,
 10506,
 1011,
 10930,
 2078,
 1011,
 2360,
 1007,
 1006,
 2141,
 2244,
 1018,
 1010,
 3261,
 1007,
 2003,
 2019,
 2137,
 3220,
 1010,
 6009,
 1010,
 2501,
 3135,
 1998,
 3883,
 1012,
 2141,
 1998,
 2992,
 1999,
 5395,
 1010,
 3146,
 1010,
 2016,
 2864,
 1999,
 2536,
 4823,
 1998,
 5613,
 6479,
 2004,
 1037,
 2775,
 1010,
 1998,
 3123,
 2000,
 4476,
 1999,
 1996,
 2397,
 4134,
 2004,
 2599,
 3220,
 1997,
 1054,
 1004,
 1038,
 2611,
 1011,
 2177,
 10461,
 1005,
 1055,
 2775,
 1012,
 3266,
 2011,
 2014,
 2269,
 1010,
 25436,
 22815,
 1010,
 1996,
 2177,
 2150,
 2028,
 1997,
 1996,
 2088,
 1005,
 1055,
 2190,
 1011,
 4855,
 2611,
 2967,
 1997,
 2035,
 2051,
 1012,
 2037,
 14221,
 2387,
 1996,
 2713,
 1997,
 20773,
 1005,
 1055,
 2834,
 2201,
 1010,
 20754,
 1999,
 2293,
 1006,
 2494,
 1007,
 1010,
 2029,
 2511,
 2014,
 2004,
 1037,
 3948,
 3063,
 4969,
 1010,
 36

Let's decode the first encoded context-question pair and look into it.

In [20]:
tokenizer.decode(train_encodings['input_ids'][0])

'[CLS] beyonce giselle knowles - carter ( / biːˈjɒnseɪ / bee - yon - say ) ( born september 4, 1981 ) is an american singer, songwriter, record producer and actress. born and raised in houston, texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of r & b girl - group destiny\'s child. managed by her father, mathew knowles, the group became one of the world\'s best - selling girl groups of all time. their hiatus saw the release of beyonce\'s debut album, dangerously in love ( 2003 ), which established her as a solo artist worldwide, earned five grammy awards and featured the billboard hot 100 number - one singles " crazy in love " and " baby boy ". [SEP] when did beyonce start becoming popular? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

We can see that each token is assigned an integer ID.

For example,

beyonce $\rightarrow$ 20773  
[CLS] $\rightarrow$ 101  
[SEP] $\rightarrow$ 102   
[PAD] $\rightarrow$ 0  

We see that the above form matches the one in the image we saw in the Data preprocessing section before.

Next we need to convert our character start/end positions to token start/end positions. Why? Because the text has been converted into tokens, so the answer start/end must be the indices of the first and last tokens that contain the answer, not character offsets in the context.

In [21]:
def add_token_positions(encodings, answers):
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
    start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
    end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

    # if start position is None, the answer passage has been truncated
    if start_positions[-1] is None:
      start_positions[-1] = tokenizer.model_max_length
    if end_positions[-1] is None:
      end_positions[-1] = tokenizer.model_max_length

  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(valid_encodings, valid_answers)

In [22]:
train_encodings['start_positions'][:10]

[67, 55, 128, 47, 69, 81, 124, 91, 69, 72]
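The character-to-token conversion can be illustrated without the real tokenizer. The sketch below is a simplification: it splits on whitespace instead of using WordPiece and ignores special tokens such as `[CLS]`, and `toy_offsets` and `toy_char_to_token` are hypothetical helpers that only imitate the idea behind `char_to_token`.

```python
import re

def toy_offsets(text):
  """(start, end) character span of each whitespace-separated token."""
  return [m.span() for m in re.finditer(r'\S+', text)]

def toy_char_to_token(offsets, char_idx):
  """Index of the token covering char_idx, or None if no token covers it."""
  for token_idx, (start, end) in enumerate(offsets):
    if start <= char_idx < end:
      return token_idx
  return None

context = 'Greece consists of nine geographic regions.'
offsets = toy_offsets(context)
answer = {'text': 'nine geographic regions', 'answer_start': 19, 'answer_end': 42}

start_token = toy_char_to_token(offsets, answer['answer_start'])
# answer_end is exclusive, so the last answer character is answer_end - 1
end_token = toy_char_to_token(offsets, answer['answer_end'] - 1)
print(start_token, end_token)  # 3 5
```

A character that belongs to no token (here, a space) maps to `None`, which is the case the notebook handles by falling back to `tokenizer.model_max_length`.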

### **Dataset definition 🗄️**

We have to define our dataset using the PyTorch `Dataset` class from `torch.utils.data` in order to create our dataloaders afterwards.

In [23]:
class SQuAD_Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

In [24]:
train_dataset = SQuAD_Dataset(train_encodings)
valid_dataset = SQuAD_Dataset(valid_encodings)

### **Dataloaders 🔁**

In [25]:
from torch.utils.data import DataLoader

# Define the dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=16)

## **Fine-Tuning ⚙️**

### **Model definition 🤖**

We are going to use `bert-base-uncased` from the Hugging Face `transformers` library.

In [26]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **Training 🏋️‍♂️**

My choices for some parameters:

* Use of `AdamW`, a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. This can help reduce overfitting, which matters here because the model is very large.

* Set `lr=5e-5`, one of the commonly recommended learning rates for fine-tuning BERT.
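For intuition about "decoupled" weight decay, here is a minimal sketch of one AdamW update for a single scalar parameter. It follows the usual textbook form of the algorithm. `adamw_step` is a hypothetical helper, and the hyperparameter values are illustrative rather than the ones used by the optimizer in this notebook.

```python
import math

def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
  """One AdamW update for a scalar parameter (t is the 1-based step count)."""
  # Decoupled weight decay: shrink the parameter directly, not via the gradient
  param = param * (1 - lr * weight_decay)
  # Standard Adam moment estimates with bias correction
  m = beta1 * m + (1 - beta1) * grad
  v = beta2 * v + (1 - beta2) * grad * grad
  m_hat = m / (1 - beta1 ** t)
  v_hat = v / (1 - beta2 ** t)
  param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
  return param, m, v

# With a zero gradient only the decay term acts on the parameter
decayed, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1, lr=0.1)
# With no decay, the first step moves the parameter by about lr against the gradient
stepped, _, _ = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.0)
print(decayed, stepped)
```

In plain Adam with L2 regularisation the decay term would instead be added to `grad` and then rescaled by the adaptive denominator; keeping it separate is what the word "decoupled" refers to.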

In [27]:
# Check on the available device - use GPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Working on {device}')

Working on cuda


In [30]:
from transformers import AdamW

N_EPOCHS = 5
optim = AdamW(model.parameters(), lr=5e-5)

model.to(device)
model.train()

for epoch in range(N_EPOCHS):
  loop = tqdm(train_loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()

    loop.set_description(f'Epoch {epoch+1}')
    loop.set_postfix(loss=loss.item())

Epoch 1: 100%|██████████| 5427/5427 [2:18:47<00:00,  1.53s/it, loss=1.63]
Epoch 2: 100%|██████████| 5427/5427 [2:17:35<00:00,  1.52s/it, loss=0.912]
Epoch 3: 100%|██████████| 5427/5427 [2:17:15<00:00,  1.52s/it, loss=0.134]
Epoch 4: 100%|██████████| 5427/5427 [2:17:25<00:00,  1.52s/it, loss=0.254]
Epoch 5: 100%|██████████| 5427/5427 [2:17:52<00:00,  1.52s/it, loss=0.445]


**Save the model to my Drive so that training does not have to be rerun each time**

In [31]:
model_path = '/content/drive/MyDrive/BERT-SQuAD'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/BERT-SQuAD/tokenizer_config.json',
 '/content/drive/MyDrive/BERT-SQuAD/special_tokens_map.json',
 '/content/drive/MyDrive/BERT-SQuAD/vocab.txt',
 '/content/drive/MyDrive/BERT-SQuAD/added_tokens.json',
 '/content/drive/MyDrive/BERT-SQuAD/tokenizer.json')

**Likewise, load the saved model**

In [32]:
#from transformers import BertForQuestionAnswering, BertTokenizerFast

#model_path = '/content/drive/MyDrive/BERT-SQuAD'
#model = BertForQuestionAnswering.from_pretrained(model_path)
#tokenizer = BertTokenizerFast.from_pretrained(model_path)

#device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
#print(f'Working on {device}')

#model = model.to(device)

### **Testing ✅**

We evaluate the model on the validation set by comparing its predicted start and end token indices for each answer with the true ones. The table printed after the loop shows the true and predicted positions for the last batch only.

In [33]:
model.eval()

acc = []

for batch in tqdm(valid_loader):
  with torch.no_grad():
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_true = batch['start_positions'].to(device)
    end_true = batch['end_positions'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask)

    start_pred = torch.argmax(outputs['start_logits'], dim=1)
    end_pred = torch.argmax(outputs['end_logits'], dim=1)

    acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
    acc.append(((end_pred == end_true).sum()/len(end_pred)).item())

acc = sum(acc)/len(acc)

print("\n\nT/P\tanswer_start\tanswer_end\n")
for i in range(len(start_true)):
  print(f"true\t{start_true[i]}\t{end_true[i]}\n"
        f"pred\t{start_pred[i]}\t{end_pred[i]}\n")

100%|██████████| 1269/1269 [11:23<00:00,  1.86it/s]



T/P	answer_start	answer_end

true	67	68
pred	67	68

true	67	68
pred	67	68

true	67	68
pred	67	68

true	66	68
pred	67	68

true	171	172
pred	64	68

true	171	172
pred	64	68

true	171	172
pred	64	68

true	171	172
pred	64	68

true	171	172
pred	64	68

true	158	160
pred	67	68

true	158	160
pred	67	68

true	158	160
pred	67	68

true	158	160
pred	67	68

true	158	160
pred	67	68






### **Ask questions 🙋**

We are going to use some functions from the [*official Evaluation Script v2.0*](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) of SQuAD in order to test the fine-tuned model by asking some questions given a context. I have also looked at this [notebook](https://colab.research.google.com/github/fastforwardlabs/ff14_blog/blob/master/_notebooks/2020-06-09-Evaluating_BERT_on_SQuAD.ipynb#scrollTo=MzPlHgWEBQ8D) which evaluates BERT on SQuAD.

In [34]:
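def get_prediction(context, question):
  # Encode the pair as a single sequence. Note that the order here is
  # (question, context), whereas the training encodings above were built
  # as (context, question).
  inputs = tokenizer.encode_plus(question, context, return_tensors='pt').to(device)
  outputs = model(**inputs)

  # outputs[0] holds the start logits and outputs[1] the end logits
  answer_start = torch.argmax(outputs[0])
  answer_end = torch.argmax(outputs[1]) + 1  # +1 so that the slice includes the end token

  # Convert the selected token ids back to a readable string
  answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))

  return answer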

def normalize_text(s):
  """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
  import string, re
  def remove_articles(text):
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    return re.sub(regex, " ", text)
  def white_space_fix(text):
    return " ".join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return "".join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()

  return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match(prediction, truth):
    return bool(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
  pred_tokens = normalize_text(prediction).split()
  truth_tokens = normalize_text(truth).split()

  # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
  if len(pred_tokens) == 0 or len(truth_tokens) == 0:
    return int(pred_tokens == truth_tokens)

  common_tokens = set(pred_tokens) & set(truth_tokens)

  # if there are no common tokens then f1 = 0
  if len(common_tokens) == 0:
    return 0

  prec = len(common_tokens) / len(pred_tokens)
  rec = len(common_tokens) / len(truth_tokens)

  return round(2 * (prec * rec) / (prec + rec), 2)

def question_answer(context, question,answer):
  prediction = get_prediction(context,question)
  em_score = exact_match(prediction, answer)
  f1_score = compute_f1(prediction, answer)

  print(f'Question: {question}')
  print(f'Prediction: {prediction}')
  print(f'True Answer: {answer}')
  print(f'Exact match: {em_score}')
  print(f'F1 score: {f1_score}\n')

**Beyoncé**

In [35]:
context = """Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer,
          songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing
          and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child.
          Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time.
          Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide,
          earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy"."""


questions = ["For whom the passage is talking about?",
             "When did Beyonce born?",
             "Where did Beyonce born?",
             "What is Beyonce's nationality?",
             "Who was the Destiny's group manager?",
             "What name has the Beyoncé's debut album?",
             "How many Grammy Awards did Beyonce earn?",
             "When did the Beyoncé's debut album release?",
             "Who was the lead singer of R&B girl-group Destiny's Child?"]

answers = ["Beyonce Giselle Knowles - Carter", "September 4, 1981", "Houston, Texas",
           "American", "Mathew Knowles", "Dangerously in Love", "five", "2003",
           "Beyonce Giselle Knowles - Carter"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: For whom the passage is talking about?
Prediction: mathew knowles
True Answer: Beyonce Giselle Knowles - Carter
Exact match: False
F1 score: 0.33

Question: When did Beyonce born?
Prediction: september 4, 1981
True Answer: September 4, 1981
Exact match: True
F1 score: 1.0

Question: Where did Beyonce born?
Prediction: houston, texas
True Answer: Houston, Texas
Exact match: True
F1 score: 1.0

Question: What is Beyonce's nationality?
Prediction: american
True Answer: American
Exact match: True
F1 score: 1.0

Question: Who was the Destiny's group manager?
Prediction: mathew knowles
True Answer: Mathew Knowles
Exact match: True
F1 score: 1.0

Question: What name has the Beyoncé's debut album?
Prediction: dangerously in love
True Answer: Dangerously in Love
Exact match: True
F1 score: 1.0

Question: How many Grammy Awards did Beyonce earn?
Prediction: five
True Answer: five
Exact match: True
F1 score: 1.0

Question: When did the Beyoncé's debut album release?
Prediction: 2003
Tru

**Athens**

In [36]:
context = """Athens is the capital and largest city of Greece. Athens dominates the Attica region and is one of the world's oldest cities,
             with its recorded history spanning over 3,400 years and its earliest human presence starting somewhere between the 11th and 7th millennium BC.
             Classical Athens was a powerful city-state. It was a center for the arts, learning and philosophy, and the home of Plato's Academy and Aristotle's Lyceum.
             It is widely referred to as the cradle of Western civilization and the birthplace of democracy, largely because of its cultural and political impact on the European continent—particularly Ancient Rome.
             In modern times, Athens is a large cosmopolitan metropolis and central to economic, financial, industrial, maritime, political and cultural life in Greece.
             In 2021, Athens' urban area hosted more than three and a half million people, which is around 35% of the entire population of Greece.
             Athens is a Beta global city according to the Globalization and World Cities Research Network, and is one of the biggest economic centers in Southeastern Europe.
             It also has a large financial sector, and its port Piraeus is both the largest passenger port in Europe, and the second largest in the world."""

questions = ["Which is the largest city in Greece?",
             "For what was the Athens center?",
             "Which city was the home of Plato's Academy?"]

answers = ["Athens", "center for the arts, learning and philosophy", "Athens"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: Which is the largest city in Greece?
Prediction: athens
True Answer: Athens
Exact match: True
F1 score: 1.0

Question: For what was the Athens center?
Prediction: plato ' s academy and aristotle ' s lyceum
True Answer: center for the arts, learning and philosophy
Exact match: False
F1 score: 0.15

Question: Which city was the home of Plato's Academy?
Prediction: athens
True Answer: Athens
Exact match: True
F1 score: 1.0



**Angelos**

In [37]:
context = """Angelos Poulis was born on 8 April 2001 in Nicosia, Cyprus. He is half Cypriot and half Greek.
            He is currently studying at the Department of Informatics and Telecommunications of the University of Athens in Greece.
            His scientific interests are in the broad field of Artificial Intelligence and he loves to train neural networks!
            Okay, I'm Angelos and I'll stop talking about me right now."""

questions = ["When did Angelos born?",
             "In what university is Angelos studying now?",
             "What is Angelos' nationality?",
             "What are his scientific interests?",
             "What I will do right now?"]

answers = ["8 April 2001", "University of Athens",
           "half Cypriot and half Greek", "Artificial Intelligence",
           "stop talking about me"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: When did Angelos born?
Prediction: 8 april 2001
True Answer: 8 April 2001
Exact match: True
F1 score: 1.0

Question: In what university is Angelos studying now?
Prediction: university of athens
True Answer: University of Athens
Exact match: True
F1 score: 1.0

Question: What is Angelos' nationality?
Prediction: half cypriot and half greek
True Answer: half Cypriot and half Greek
Exact match: True
F1 score: 0.8

Question: What are his scientific interests?
Prediction: artificial intelligence
True Answer: Artificial Intelligence
Exact match: True
F1 score: 1.0

Question: What I will do right now?
Prediction: train neural networks! okay, i ' m angelos and i ' ll stop talking about me
True Answer: stop talking about me
Exact match: False
F1 score: 0.44



## **Summary (and some Questions & Answers) 🧐**

**Technical details:**
* **Model used:** `bert-base-uncased`
* **Dataset:** The Stanford Question Answering Dataset (SQuAD 2.0)  
* **Run time:** ~4 hours on a Tesla P100 GPU for `N_EPOCHS = 3`, at about 1 hour and 15 minutes per epoch. Note that the training cell shown above was run with `N_EPOCHS = 5` and logged about 2 hours and 18 minutes per epoch, so its outputs differ in places from the figures and examples in this summary. I think running the model for at least `N_EPOCHS = 5` can give even better results, but what we got for 3 epochs is already very good!

**Conclusion:** Training the model for just 3 epochs, which took about 4 hours on the Tesla P100 GPU, gives pretty good results. The model also answers questions about content it has not seen before quite well, and I can say this because I gave it a passage about myself!

Some *example questions and answers* we get are the following:

**About Athens:**

> **Question:** Which is the largest city in Greece?  
  **Prediction:** athens  
  **True Answer:** Athens  
  **Exact match:** True  
  **F1 score:** 1.0  

> **Question:** For what was the Athens center?  
  **Prediction:** center for the arts, learning and philosophy  
  **True Answer:** center for the arts, learning and philosophy  
  **Exact match:** True  
  **F1 score:** 1.0  

**About Beyoncé:**

> **Question:** When did Beyonce born?  
  **Prediction:** september 4, 1981  
  **True Answer:** September 4, 1981  
  **Exact Match:** True  
  **F1 score:** 1.0

> **Question:** What name has the Beyoncé's debut album?  
  **Prediction:** dangerously in love  
  **True Answer:** Dangerously in Love   
  **Exact Match:** True  
  **F1 score:** 1.0

> **Question:** How many Grammy Awards did Beyonce earn?  
  **Prediction:** five  
  **True Answer:** five  
  **Exact Match:** True  
  **F1 score:** 1.0


> **Question:** When did the Beyoncé's debut album release?  
  **Prediction:** 2003  
  **True Answer:** 2003  
  **Exact Match:** True  
  **F1 score:** 1.0


> **Question:** Who was the lead singer of R&B girl-group Destiny's Child?  
  **Prediction:** beyonce giselle knowles - carter  
  **True Answer:** Beyonce Giselle Knowles - Carter  
  **Exact Match:** True  
  **F1 score:** 1.0


**About Angelos:**

> **Question:** When did Angelos born?  
  **Prediction:** 8 april 2001  
  **True Answer:** 8 April 2001  
  **Exact match:** True  
  **F1 score:** 1.0

> **Question:** In what university is Angelos studying now?  
  **Prediction:** university of athens  
  **True Answer:** University of Athens  
  **Exact match:** True    
  **F1 score:** 1.0

> **Question:** What is Angelos' nationality?  
  **Prediction:** half cypriot and half greek.  
  **True Answer:** half Cypriot and half Greek   
  **Exact match:** True  
  **F1 score:** 0.8

> **Question:** What are his scientific interests?  
  **Prediction:** artificial intelligence  
  **True Answer:** Artificial Intelligence    
  **Exact match:** True  
  **F1 score:** 1.0

> **Question:** What I will do right now?  
  **Prediction:** stop talking about me  
  **True Answer:** stop talking about me  
  **Exact match:** True  
  **F1 score:** 1.0
