# **Question Answering❓**
with fine-tuned BERT on SQuAD 2.0.  

Question answering comes in many forms. We’ll look at extractive QA: answering a question about a passage by highlighting the segment of the passage that contains the answer. This involves fine-tuning a model that predicts a start position and an end position in the passage. More specifically, we will fine-tune the [bert-base-uncased](https://huggingface.co/bert-base-uncased) model on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.

I followed [this tutorial](https://huggingface.co/transformers/v3.2.0/custom_datasets.html#question-answering-with-squad-2-0) from the Hugging Face documentation on fine-tuning BERT on custom datasets, which in our case is SQuAD 2.0.
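
As a rough illustration of how start and end predictions turn into an answer span, the sketch below scores every span whose start does not come after its end and keeps the best one. The helper name `best_span`, the `max_answer_len` cap and the toy scores are assumptions made for this example; they are not part of the tutorial or of this notebook's code, which takes the two argmaxes independently.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximising start_scores[s] + end_scores[e] with s <= e."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy per-token scores (hypothetical values, one per token).
tokens = ["athens", "is", "the", "capital", "of", "greece"]
start_scores = [2.5, 0.1, 0.0, 1.0, -1.0, 3.0]
end_scores = [3.5, 0.0, -0.5, 2.0, 0.0, 1.0]

s, e, score = best_span(start_scores, end_scores)
print(tokens[s:e + 1], score)
```

With these toy scores the independent argmaxes would be start 5 and end 0, which is an empty span; the joint search instead returns the single token `athens`.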

**Some first imports**

In [None]:
import requests
import json
import torch
import os
from tqdm import tqdm
import random

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.27.1


### **Download SQuAD 2.0 ⬇️**

SQuAD 2.0 is distributed as two JSON files:

* train dataset 
* validation dataset

In [None]:
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-03-16 01:03:05--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2023-03-16 01:03:05 (497 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2023-03-16 01:03:05--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2023-03-16 01:03:05 (304 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



## **Data preprocessing 💽**

In short, we have to do the following:

1. Extract the data from the JSON files
2. Tokenize the data
3. Define the datasets

### **Get data 📁** 

The JSON files nest paragraphs, questions and answers inside each article, so let's extract the data and store it in flat lists.

In [None]:
def read_data(path):  
  # load the json file
  with open(path, 'rb') as f:
    squad = json.load(f)

  contexts = []
  questions = []
  answers = []

  for group in squad['data']:
    for passage in group['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
        question = qa['question']
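        # SQuAD 2.0 marks unanswerable questions with an empty 'answers' list,
        # so they add no rows here; a question with several reference answers
        # adds one row per answer
        for answer in qa['answers']:
          contexts.append(context)
          questions.append(question)
          answers.append(answer)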

  return contexts, questions, answers

random.seed(42)
def split_dataset(contexts, questions, answers):
  num_elements = int(len(contexts) * 0.1)
  sample_indices = random.sample(range(len(contexts)), num_elements)

  contexts_sample = [contexts[i] for i in sample_indices]
  questions_sample = [questions[i] for i in sample_indices]
  answers_sample = [answers[i] for i in sample_indices]
  
  return contexts_sample, questions_sample, answers_sample

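def get_prediction(context, question):
  # Encode as (context, question), the same order used to build the training encodings
  inputs = tokenizer.encode_plus(context, question, return_tensors='pt', truncation=True, max_length=384).to(device)
  with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
  
  answer_start = torch.argmax(outputs.start_logits)  
  answer_end = torch.argmax(outputs.end_logits) + 1 
  
  # Decode the predicted token span back into text
  answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))
  
  return answer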

def normalize_text(s):
  """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
  import string, re
  def remove_articles(text):
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    return re.sub(regex, " ", text)
  def white_space_fix(text):
    return " ".join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return "".join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()

  return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match(prediction, truth):
    return bool(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
  pred_tokens = normalize_text(prediction).split()
  truth_tokens = normalize_text(truth).split()
  
  # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
  if len(pred_tokens) == 0 or len(truth_tokens) == 0:
    return int(pred_tokens == truth_tokens)
  
  common_tokens = set(pred_tokens) & set(truth_tokens)
  
  # if there are no common tokens then f1 = 0
  if len(common_tokens) == 0:
    return 0
  
  prec = len(common_tokens) / len(pred_tokens)
  rec = len(common_tokens) / len(truth_tokens)
  F1 = round(2 * (prec * rec) / (prec + rec), 2)
  # print(prec, rec, F1)
  return F1
  
def question_answer(context, question, answer):
  prediction = get_prediction(context,question)
  em_score = exact_match(prediction, answer)
  f1_score = compute_f1(prediction, answer)

  print(f'Question: {question}')
  print(f'Prediction: {prediction}')
  print(f'True Answer: {answer}')
  print(f'Exact match: {em_score}')
  print(f'F1 score: {f1_score}\n')

import numpy as np
def LEEP(pseudo_source_label: np.ndarray, target_label: np.ndarray):
    """
    :param pseudo_source_label: shape [N, C_s]
    :param target_label: shape [N], elements in [0, C_t)
    :return: leep score
    """
    N, C_s = pseudo_source_label.shape
    target_label = target_label.reshape(-1)
    C_t = int(np.max(target_label) + 1)   # the number of target classes
    normalized_prob = pseudo_source_label / float(N)  # sum(normalized_prob) = 1
    joint = np.zeros((C_t, C_s), dtype=float)  # placeholder for joint distribution over (y, z)
    for i in range(C_t):
        this_class = normalized_prob[target_label == i]
        row = np.sum(this_class, axis=0)
        joint[i] = row
    p_target_given_source = (joint / joint.sum(axis=0, keepdims=True)).T  # P(y | z)

    empirical_prediction = pseudo_source_label @ p_target_given_source
    empirical_prob = np.array([predict[label] for predict, label in zip(empirical_prediction, target_label)])
    leep_score = np.mean(np.log(empirical_prob))
    return leep_score
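
The `LEEP` helper above appears to implement the Log Expected Empirical Prediction score. As a reading aid, and assuming $\theta(x_i)_z$ denotes row $i$ of `pseudo_source_label` and $y_i$ the matching entry of `target_label`, the quantities it computes can be written as:

$$
\hat{P}(y, z) = \frac{1}{N} \sum_{i:\, y_i = y} \theta(x_i)_z, \qquad
\hat{P}(y \mid z) = \frac{\hat{P}(y, z)}{\sum_{y'} \hat{P}(y', z)}
$$

$$
\mathrm{LEEP} = \frac{1}{N} \sum_{i=1}^{N} \log \Big( \sum_{z} \hat{P}(y_i \mid z)\, \theta(x_i)_z \Big)
$$

Here `joint` holds $\hat{P}(y, z)$ and `p_target_given_source` holds $\hat{P}(y \mid z)$ (transposed). The score is at most 0; values closer to 0 mean the source predictions are more informative about the target labels.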

Put the contexts, questions and answers for training and validation into the appropriate lists.

In [None]:
train_contextl, train_questionl, train_answerl = read_data('train-v2.0.json')
train_contexts, train_questions, train_answers = split_dataset(train_contextl, train_questionl, train_answerl)
valid_contexts, valid_questions, valid_answers = read_data('dev-v2.0.json')
print(len(train_contexts), len(valid_contexts))

8682 20302


Each answer is a dictionary containing the answer text and an integer giving the character index at which the answer starts in the context. SQuAD does not provide the end index of the answer, so we have to compute it ourselves as the start index plus the length of the answer text. Note that SQuAD answers are sometimes off by one or two characters, so we also adjust for that.

In [None]:
def add_end_idx(answers, contexts):
  for answer, context in zip(answers, contexts):
    gold_text = answer['text']
    start_idx = answer['answer_start']
    end_idx = start_idx + len(gold_text)

    # sometimes squad answers are off by a character or two so we fix this
    if context[start_idx:end_idx] == gold_text:
      answer['answer_end'] = end_idx
    elif context[start_idx-1:end_idx-1] == gold_text:
      answer['answer_start'] = start_idx - 1
      answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
    elif context[start_idx-2:end_idx-2] == gold_text:
      answer['answer_start'] = start_idx - 2
      answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers, train_contexts)
add_end_idx(valid_answers, valid_contexts)
# You can see that now we get the answer_end also
print(train_questions[-1000])
print(train_answers[-1000])

The often unclear division between a self-executing treaty and a non-self-executing treaty can lead to a treaty being what if disagreements exist within a party?
{'text': 'politicized', 'answer_start': 61, 'answer_end': 72}
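
Note that `add_end_idx` leaves `answer_end` unset when none of its three candidate offsets match. The sketch below is a non-mutating variant that returns the aligned offsets or `None`, so the caller can decide what to do with answers that cannot be aligned. The helper name `locate_answer`, its `max_shift` parameter and the toy context are hypothetical and only meant as an illustration.

```python
def locate_answer(context, text, start, max_shift=2):
    """Return (start, end) character offsets of `text` in `context`.

    Tries the annotated start first, then shifts left by up to `max_shift`
    characters. Returns None if the answer text cannot be aligned.
    """
    for shift in range(max_shift + 1):
        s = start - shift
        if s >= 0 and context[s:s + len(text)] == text:
            return s, s + len(text)
    return None


toy_context = "Athens is the capital and largest city of Greece."
print(locate_answer(toy_context, "capital", 14))  # annotation is exact
print(locate_answer(toy_context, "capital", 15))  # annotation is off by one
print(locate_answer(toy_context, "capital", 30))  # cannot be aligned
```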


### **Tokenization 🔢**

We have to tokenize our data into the format the BERT model expects. `AutoTokenizer` returns the fast tokenizer (`BertTokenizerFast`) by default rather than `BertTokenizer`; it is much faster and provides the `char_to_token` mapping used further down. Since we are going to train our model in batches, we need to set `padding=True`.

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
model_name = "bert-base-cased"

# Load the model with a question-answering head
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Load the weights of a previously fine-tuned checkpoint (the path suggests an NLI run)
checkpoint_path = '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/checkpoint-61360'
cp_number = checkpoint_path
pre_trained_weight = torch.load(os.path.join(checkpoint_path, "pytorch_model.bin"), map_location='cpu')

# The question-answering model has no pooler layer, so drop the pooler weights
pre_trained_weight.pop('bert.pooler.dense.weight')
pre_trained_weight.pop('bert.pooler.dense.bias')

# Remove the classification head and reuse its first two rows
# to initialise the two-output (start/end) QA head
cweight = pre_trained_weight.pop('classifier.weight')
cbias = pre_trained_weight.pop('classifier.bias')
pre_trained_weight['qa_outputs.weight'] = cweight[:2]
pre_trained_weight['qa_outputs.bias'] = cbias[:2]

# Load the remapped weights into the model
model.load_state_dict(pre_trained_weight)

tokenizer = AutoTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True, max_length=384)
valid_encodings = tokenizer(valid_contexts, valid_questions, truncation=True, padding=True, max_length=384)

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]
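
To make the effect of `truncation=True`, `padding=True` and `max_length=384` in the tokenizer call above more concrete, here is a simplified sketch of the idea: cut each sequence to the maximum length, pad the batch to its longest member, and mark real tokens in an attention mask. The helper `pad_batch` is hypothetical and ignores details of the real tokenizer, such as how a (context, question) pair is truncated and how special tokens are preserved.

```python
def pad_batch(sequences, pad_id=0, max_length=384):
    """Truncate each sequence to max_length, then right-pad all to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    # Padding tokens are appended on the right ...
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # ... and masked out so the model does not attend to them.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Two toy token-id sequences of different lengths (ids are illustrative).
ids, mask = pad_batch([[101, 7, 8, 102], [101, 102]])
print(ids)
print(mask)
```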

In [None]:
def add_token_positions(encodings, answers):
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
    start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
    end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

    # if start position is None, the answer passage has been truncated
    if start_positions[-1] is None:
      start_positions[-1] = tokenizer.model_max_length

    if end_positions[-1] is None:
      end_positions[-1] = tokenizer.model_max_length

  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(valid_encodings, valid_answers)

class SQuAD_Dataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __getitem__(self, idx):
    return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  def __len__(self):
    return len(self.encodings.input_ids)

train_dataset = SQuAD_Dataset(train_encodings)
valid_dataset = SQuAD_Dataset(valid_encodings)

from torch.utils.data import DataLoader

batch_size = 32

# Define the dataloaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)  # keep validation order fixed

## **Fine-Tuning ⚙️**

### **Training 🏋️‍♂️**

My choices for some parameters:

* `AdamW`, a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. This helps to avoid overfitting, which matters here because the model is very complex.
* The BERT backbone is frozen, so only the question-answering head is trained, for `N_EPOCHS = 5` with a learning rate of `2e-3`.
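
To show what "decoupling weight decay from the gradient update" means, here is a minimal scalar sketch of one Adam step with either classic L2 regularisation or decoupled decay. The function `adam_step` and its default values are assumptions chosen for illustration; this is not the library's implementation.

```python
import math


def adam_step(p, g, m, v, t, lr=0.1, wd=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; decoupled=True gives AdamW-style weight decay."""
    if not decoupled:
        g = g + wd * p                       # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * wd * p if decoupled else 0.0  # decay applied directly to the weight
    return p - step - decay, m, v


# With a zero loss gradient, only the decay term can move the weight.
p_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
p_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
print(p_adamw, p_l2)
```

In this toy step the decoupled variant shrinks the weight by the factor `1 - lr * wd` (to about 0.99), whereas the L2 variant's decay passes through Adam's normalisation and moves the weight by roughly `lr` (to about 0.9), regardless of how small `wd` is.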


In [None]:
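# transformers.AdamW is deprecated in favour of the PyTorch implementation
# (note that torch's default weight_decay is 0.01 rather than 0.0)
from torch.optim import AdamW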

# Check on the available device - use GPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Working on {device}')

# Freeze the backbone parameters
for param in model.bert.parameters():
    param.requires_grad = False

N_EPOCHS = 5
print(f'checkpoint: {cp_number}, epoch: {N_EPOCHS}')
optim = AdamW(model.parameters(), lr=2e-3)

model.to(device)

model.train()
train_losses = []


for epoch in range(N_EPOCHS):
  loop = tqdm(train_loader, leave=True)
  loss_of_epoch = 0

  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    loss.backward()
    optim.step()
    loss_of_epoch += loss.item()
    loop.set_description(f'Epoch {epoch+1}')
    loop.set_postfix(loss=loss.item())

  loss_of_epoch /= len(train_loader)
  train_losses.append(loss_of_epoch)

print(f'checkpoint: {cp_number}, epoch: {N_EPOCHS}')

Working on cuda
checkpoint: /content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/checkpoint-61360, epoch: 5


Epoch 1: 100%|██████████| 272/272 [00:47<00:00,  5.67it/s, loss=3.87]
Epoch 2: 100%|██████████| 272/272 [00:45<00:00,  6.03it/s, loss=4.67]
Epoch 3: 100%|██████████| 272/272 [00:45<00:00,  6.03it/s, loss=3.78]
Epoch 4: 100%|██████████| 272/272 [00:45<00:00,  6.04it/s, loss=4.54]
Epoch 5: 100%|██████████| 272/272 [00:45<00:00,  6.04it/s, loss=3.87]

checkpoint: /content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/checkpoint-61360, epoch: 5





**Save the model to my Drive so that training does not have to be rerun each time**

In [None]:
model_path = '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/MRC_61360_5epoch'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/MRC_61360_5epoch/tokenizer_config.json',
 '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/MRC_61360_5epoch/special_tokens_map.json',
 '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/MRC_61360_5epoch/vocab.txt',
 '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/MRC_61360_5epoch/added_tokens.json',
 '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/MRC_61360_5epoch/tokenizer.json')

**Likewise, load the saved model**

In [None]:
# from transformers import AutoTokenizer, AutoModelForQuestionAnswering
# cp_number = 'NER2NLI/MRC_49088_2epoch'
# model_path = '/content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NER2NLI/MRC_49088_2epoch'
# model = AutoModelForQuestionAnswering.from_pretrained(model_path)
# tokenizer = AutoTokenizer.from_pretrained(model_path)

# device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# print(f'Working on {device}')

# model = model.to(device)

Working on cuda


### **Testing ✅**

We evaluate the model on the validation set by comparing its predicted start and end token indices with the true ones.

In [None]:
model.eval()

valid_predictions = []
val_answers = []
acc = []
start_prob_list = []
end_prob_list = []


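for i in tqdm(range(len(valid_contexts))):
# for i in tqdm(range(3000)):   
    # Encode in the same (context, question) order used for valid_encodings,
    # so the gold token positions stored there line up with this input
    encoded_dict = tokenizer.encode_plus(valid_contexts[i], valid_questions[i], return_tensors='pt', max_length=384, truncation=True)
    input_ids = encoded_dict['input_ids'].to(device)
    token_type_ids = encoded_dict['token_type_ids'].to(device)
    attention_mask = encoded_dict['attention_mask'].to(device)
    # Gold token positions for validation example i
    start_true = torch.tensor([valid_encodings['start_positions'][i]]).to(device)
    end_true = torch.tensor([valid_encodings['end_positions'][i]]).to(device)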

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        start_logits, end_logits = outputs.start_logits, outputs.end_logits

    start_logits_padded = torch.nn.functional.pad(start_logits, (0, 384-start_logits.shape[1]), 'constant', value=0)
    end_logits_padded = torch.nn.functional.pad(end_logits, (0, 384-end_logits.shape[1]), 'constant', value=0)

    start_prob_list.append(start_logits_padded)
    end_prob_list.append(end_logits_padded)
        
    start_index = torch.argmax(start_logits, dim=1)[0]
    end_index = torch.argmax(end_logits, dim=1)[0]

    start_pred = torch.argmax(outputs['start_logits'], dim=1)
    end_pred = torch.argmax(outputs['end_logits'], dim=1)
    acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
    acc.append(((end_pred == end_true).sum()/len(end_pred)).item())
    valid_predictions.append(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(encoded_dict['input_ids'][0][start_index:end_index+1])))
    val_answers.append(valid_answers[i]['text'])

start_logits = torch.cat(start_prob_list, dim=0)
end_logits = torch.cat(end_prob_list, dim=0)
print(start_logits.shape, end_logits.shape)

accuracy = sum(acc)/len(acc)
print("Validation Accuracy is {}".format(accuracy))

100%|██████████| 20302/20302 [04:09<00:00, 81.34it/s]

torch.Size([20302, 384]) torch.Size([20302, 384])
Validation Accuracy is 0.06075756083144518





In [None]:
EM = []
F1 = []
Prec = []
Rec = []
for pre, val in zip(valid_predictions, val_answers):
  em_score = exact_match(pre, val)
  f1_score = compute_f1(pre, val)
  EM.append(em_score)
  F1.append(f1_score)
  # Prec.append(prec)
  # Rec.append(rec)

start_leep = LEEP(torch.softmax(torch.Tensor(start_logits.cpu()), dim=1).numpy(), np.array(valid_encodings['start_positions']))
end_leep = LEEP(torch.softmax(torch.Tensor(end_logits.cpu()), dim=1).numpy(), np.array(valid_encodings['end_positions']))

print(f'checkpoint: {cp_number}, epoch: {N_EPOCHS}')
print(f'train_losses:{train_losses}')
print(f'Validate number:      {len(EM)}')
print("Validation Accuracy is {}".format(accuracy))
print(f'Exact match number:   {sum(EM)}')
print(f'Exact match:  {sum(EM)/len(EM)}')
print(f'F1 score:     {sum(F1)/len(F1)}')
print(f'start_leep:   {start_leep}, end_leep: {end_leep}')
print(f'average_leep: {(start_leep+end_leep)/2}')

checkpoint: /content/drive/MyDrive/BERT-SQuAD/Model_weights_HW1/NLI/checkpoint-61360, epoch: 5
train_losses:[4.29837233003448, 3.9964589003254387, 3.95673064624562, 3.9504449770731083, 3.9429063840824017]
Validate number:      20302
Validation Accuracy is 0.06075756083144518
Exact match number:   500
Exact match:  0.02462811545660526
F1 score:     0.06699487735198256
average_leep: -4.961575026982134


### **Ask questions 🙋**

We use some functions from the [*official Evaluation Script v2.0*](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/) of SQuAD to test the fine-tuned model by asking it questions about a given context. I also consulted this [notebook](https://colab.research.google.com/github/fastforwardlabs/ff14_blog/blob/master/_notebooks/2020-06-09-Evaluating_BERT_on_SQuAD.ipynb#scrollTo=MzPlHgWEBQ8D), which evaluates BERT on SQuAD.
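
One detail worth knowing: the official script counts token overlap as a multiset (via `collections.Counter`), so a repeated token counts as often as it occurs. A `set`-based overlap undercounts repeats. For the prediction "half cypriot and half greek." against "half Cypriot and half Greek" it yields precision and recall of 4/5, hence F1 0.8 despite an exact match. Below is a minimal sketch of the multiset variant; the name `f1_multiset` is hypothetical, and the normalisation is re-implemented so the block is self-contained.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_text(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1_multiset(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    # If either side is empty (no answer), F1 is 1 when both are empty, else 0.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    # Multiset intersection: each token counts min(pred count, truth count) times.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(f1_multiset("half cypriot and half greek.", "half Cypriot and half Greek"))
```

With the repeated token "half" counted twice on both sides, the overlap is 5 of 5 tokens and the score is 1.0.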

**Beyoncé**

In [None]:
context = """Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, 
          songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing 
          and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. 
          Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. 
          Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, 
          earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy"."""


questions = ["For whom the passage is talking about?",
             "When did Beyonce born?",
             "Where did Beyonce born?",
             "What is Beyonce's nationality?",
             "Who was the Destiny's group manager?",
             "What name has the Beyoncé's debut album?",
             "How many Grammy Awards did Beyonce earn?",
             "When did the Beyoncé's debut album release?",
             "Who was the lead singer of R&B girl-group Destiny's Child?"]

answers = ["Beyonce Giselle Knowles - Carter", "September 4, 1981", "Houston, Texas", 
           "American", "Mathew Knowles", "Dangerously in Love", "five", "2003", 
           "Beyonce Giselle Knowles - Carter"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: For whom the passage is talking about?
Prediction: 
True Answer: Beyonce Giselle Knowles - Carter
Exact match: False
F1 score: 0

Question: When did Beyonce born?
Prediction: 
True Answer: September 4, 1981
Exact match: False
F1 score: 0

Question: Where did Beyonce born?
Prediction: 
True Answer: Houston, Texas
Exact match: False
F1 score: 0

Question: What is Beyonce's nationality?
Prediction: 
True Answer: American
Exact match: False
F1 score: 0

Question: Who was the Destiny's group manager?
Prediction: 
True Answer: Mathew Knowles
Exact match: False
F1 score: 0

Question: What name has the Beyoncé's debut album?
Prediction: 
True Answer: Dangerously in Love
Exact match: False
F1 score: 0

Question: How many Grammy Awards did Beyonce earn?
Prediction: 
True Answer: five
Exact match: False
F1 score: 0

Question: When did the Beyoncé's debut album release?
Prediction: 
True Answer: 2003
Exact match: False
F1 score: 0

Question: Who was the lead singer of R&B girl-group Dest

**Athens**

In [None]:
context = """Athens is the capital and largest city of Greece. Athens dominates the Attica region and is one of the world's oldest cities, 
             with its recorded history spanning over 3,400 years and its earliest human presence starting somewhere between the 11th and 7th millennium BC.
             Classical Athens was a powerful city-state. It was a center for the arts, learning and philosophy, and the home of Plato's Academy and Aristotle's Lyceum.
             It is widely referred to as the cradle of Western civilization and the birthplace of democracy, largely because of its cultural and political impact on the European continent—particularly Ancient Rome.
             In modern times, Athens is a large cosmopolitan metropolis and central to economic, financial, industrial, maritime, political and cultural life in Greece. 
             In 2021, Athens' urban area hosted more than three and a half million people, which is around 35% of the entire population of Greece.
             Athens is a Beta global city according to the Globalization and World Cities Research Network, and is one of the biggest economic centers in Southeastern Europe. 
             It also has a large financial sector, and its port Piraeus is both the largest passenger port in Europe, and the second largest in the world."""

questions = ["Which is the largest city in Greece?",
             "For what was the Athens center?",
             "Which city was the home of Plato's Academy?"]

answers = ["Athens", "center for the arts, learning and philosophy", "Athens"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: Which is the largest city in Greece?
Prediction: 
True Answer: Athens
Exact match: False
F1 score: 0

Question: For what was the Athens center?
Prediction: 
True Answer: center for the arts, learning and philosophy
Exact match: False
F1 score: 0

Question: Which city was the home of Plato's Academy?
Prediction: 
True Answer: Athens
Exact match: False
F1 score: 0



**Angelos**

In [None]:
context = """Angelos Poulis was born on 8 April 2001 in Nicosia, Cyprus. He is half Cypriot and half Greek. 
            He is currently studying at the Department of Informatics and Telecommunications of the University of Athens in Greece. 
            His scientific interests are in the broad field of Artificial Intelligence and he loves to train neural networks! 
            Okay, I'm Angelos and I'll stop talking about me right now."""

questions = ["When did Angelos born?",
             "In what university is Angelos studying now?",
             "What is Angelos' nationality?",
             "What are his scientific interests?",
             "What I will do right now?"]

answers = ["8 April 2001", "University of Athens", 
           "half Cypriot and half Greek", "Artificial Intelligence", 
           "stop talking about me"]

for question, answer in zip(questions, answers):
  question_answer(context, question, answer)

Question: When did Angelos born?
Prediction: Angelos Pouli
True Answer: 8 April 2001
Exact match: False
F1 score: 0

Question: In what university is Angelos studying now?
Prediction: Angelos Pouli
True Answer: University of Athens
Exact match: False
F1 score: 0

Question: What is Angelos' nationality?
Prediction: Angelos Pouli
True Answer: half Cypriot and half Greek
Exact match: False
F1 score: 0

Question: What are his scientific interests?
Prediction: Angelos Pouli
True Answer: Artificial Intelligence
Exact match: False
F1 score: 0

Question: What I will do right now?
Prediction: Angelos Pouli
True Answer: stop talking about me
Exact match: False
F1 score: 0



## **Summary (and some Questions & Answers) 🧐**

**Technical details:**
* **Model used:** `bert-base-uncased`
* **Dataset:** The Stanford Question Answering Dataset (SQuAD) 2.0  
* **Run time:** about 4 hours on a Tesla P100 GPU for `N_EPOCHS = 3`, roughly 1 hour and 15 minutes per epoch. Training for `N_EPOCHS = 5` or more might improve the results further, but the results after 3 epochs are already very good!

**Conclusion:** Training the model for just 3 epochs, about 4 hours on a Tesla P100 GPU, gives pretty good results. The model also answers questions about content it has not seen before quite well; I can say this because I gave it a passage about myself! Note that these timings and the examples below are from that 3-epoch `bert-base-uncased` run; the frozen-backbone `bert-base-cased` run logged above (5 epochs on a 10% sample of the training set) reached only about 0.025 exact match and 0.067 F1.

Some *example questions and answers* we get are the following:

**About Athens:**

> **Question:** Which is the largest city in Greece?  
  **Prediction:** athens  
  **True Answer:** Athens  
  **Exact match:** True  
  **F1 score:** 1.0  

> **Question:** For what was the Athens center?  
  **Prediction:** center for the arts, learning and philosophy  
  **True Answer:** center for the arts, learning and philosophy  
  **Exact match:** True  
  **F1 score:** 1.0  

**About Beyoncé:**

> **Question:** When did Beyonce born?  
  **Prediction:** september 4, 1981  
  **True Answer:** September 4, 1981  
  **Exact Match:** True  
  **F1 score:** 1.0

> **Question:** What name has the Beyoncé's debut album?  
  **Prediction:** dangerously in love  
  **True Answer:** Dangerously in Love   
  **Exact Match:** True  
  **F1 score:** 1.0

> **Question:** How many Grammy Awards did Beyonce earn?  
  **Prediction:** five  
  **True Answer:** five  
  **Exact Match:** True  
  **F1 score:** 1.0


> **Question:** When did the Beyoncé's debut album release?  
  **Prediction:** 2003  
  **True Answer:** 2003  
  **Exact Match:** True  
  **F1 score:** 1.0


> **Question:** Who was the lead singer of R&B girl-group Destiny's Child?  
  **Prediction:** beyonce giselle knowles - carter  
  **True Answer:** Beyonce Giselle Knowles - Carter  
  **Exact Match:** True  
  **F1 score:** 1.0


**About Angelos:**

> **Question:** When did Angelos born?  
  **Prediction:** 8 april 2001  
  **True Answer:** 8 April 2001  
  **Exact match:** True  
  **F1 score:** 1.0

> **Question:** In what university is Angelos studying now?  
  **Prediction:** university of athens  
  **True Answer:** University of Athens  
  **Exact match:** True    
  **F1 score:** 1.0

> **Question:** What is Angelos' nationality?  
  **Prediction:** half cypriot and half greek.  
  **True Answer:** half Cypriot and half Greek   
  **Exact match:** True  
  **F1 score:** 0.8

> **Question:** What are his scientific interests?  
  **Prediction:** artificial intelligence  
  **True Answer:** Artificial Intelligence    
  **Exact match:** True  
  **F1 score:** 1.0

> **Question:** What I will do right now?  
  **Prediction:** stop talking about me  
  **True Answer:** stop talking about me  
  **Exact match:** True  
  **F1 score:** 1.0
