# Contextual question answering

In [3]:
from transformers import TrainingArguments, AutoTokenizer, AutoModelForQuestionAnswering, Trainer, AutoModel
from datasets import load_dataset
from tqdm.auto import tqdm
import numpy as np
import collections
import evaluate

1. Get acquainted with the Simple legal questions dataset.
2. Select one open issue in the dataset, provide the answers for the questions in the package and open a pull request with the answers.
3. The subset of the answers that you have provided in point 2 is your test dataset. If in the dataset there are questions that are the same as the questions in your test set, make the questions and the answers part of your test dataset (i.e. remove them from the training set).
4. The remaing questions and answers are your training set. Divide that set into training and validation subsets. The validation part should be selected as 20% of the original training set. Make sure that there are no questions in the validation set that are present in the training subset. If there are such questions, make them part of the validation set.

In [3]:
# Here get data

5. If the training set is small (less than 1 thousand question+answer pairs) use one of the available QA dataset, e.g. PoQUAD or SQUAD. Using the second dataset is sensible, if you are training a multilingual model, like mT5.

In [4]:
squad_dataset = load_dataset("squad")

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
print("Context: ", squad_dataset["train"][0]["context"])
print("Question: ", squad_dataset["train"][0]["question"])
print("Answer: ", squad_dataset["train"][0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


In [6]:
max_length = 384
stride = 128

In [7]:
def preprocess_training_examples(examples, tokenizer):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [8]:
def preprocess_validation_examples(examples, tokenizer):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

6. Train a neural model able to answer the legal questions. Fine-tune at least two pre-trained models. Make sure you are using a machine with a GPU, since training the model on CPU will be very long. The training should include at least 10 epochs (depending on the size of the training set you are using). The pre-trained models you can use include:
    * plT5-base
    * plT5-large
    * mT5-base
    * mT5-large

In [9]:
def get_trainer(model_name, model, train_dataset, validation_dataset, tokenizer, epochs=3):
    args = TrainingArguments(
        model_name,
        evaluation_strategy="no",
        save_strategy="epoch",
        learning_rate=2e-5,
        num_train_epochs=epochs,
        weight_decay=0.01,
        fp16=True,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
        tokenizer=tokenizer,
    )
    return trainer

bert-base-case

In [10]:
bert_base_cased = "deepset/bert-base-cased-squad2"

In [11]:
bert_base_cased_tokenizer = AutoTokenizer.from_pretrained(bert_base_cased)

Downloading:   0%|          | 0.00/152 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/508 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [12]:
bert_base_cased_train_dataset = squad_dataset["train"].map(
    lambda x: preprocess_training_examples(x, bert_base_cased_tokenizer),
    batched=True,
    remove_columns=squad_dataset["train"].column_names,
)
len(squad_dataset["train"]), len(bert_base_cased_train_dataset)

  0%|          | 0/88 [00:00<?, ?ba/s]

(87599, 88729)

In [13]:
bert_base_cased_validation_dataset = squad_dataset["validation"].map(
    lambda x: preprocess_validation_examples(x, bert_base_cased_tokenizer),
    batched=True,
    remove_columns=squad_dataset["validation"].column_names,
)
len(squad_dataset["validation"]), len(bert_base_cased_validation_dataset)

  0%|          | 0/11 [00:00<?, ?ba/s]

(10570, 10822)

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(bert_base_cased).to("cuda")

Downloading:   0%|          | 0.00/433M [00:00<?, ?B/s]

In [None]:
bert_base_cased_trainer = get_trainer(bert_base_cased, model, bert_base_cased_train_dataset, bert_base_cased_validation_dataset, bert_base_cased_tokenizer, 5)
bert_base_cased_trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using cuda_amp half precision backend
***** Running training *****
  Num examples = 88729
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 55460
  Number of trainable parameters = 107721218


Step,Training Loss
500,0.1593
1000,0.1685
1500,0.1734
2000,0.1655
2500,0.27
3000,0.7256
3500,0.7342
4000,0.7177
4500,0.7132
5000,0.6867


Saving model checkpoint to deepset/bert-base-cased-squad2/checkpoint-11092
Configuration saved in deepset/bert-base-cased-squad2/checkpoint-11092/config.json
Model weights saved in deepset/bert-base-cased-squad2/checkpoint-11092/pytorch_model.bin
tokenizer config file saved in deepset/bert-base-cased-squad2/checkpoint-11092/tokenizer_config.json
Special tokens file saved in deepset/bert-base-cased-squad2/checkpoint-11092/special_tokens_map.json
Saving model checkpoint to deepset/bert-base-cased-squad2/checkpoint-22184
Configuration saved in deepset/bert-base-cased-squad2/checkpoint-22184/config.json
Model weights saved in deepset/bert-base-cased-squad2/checkpoint-22184/pytorch_model.bin
tokenizer config file saved in deepset/bert-base-cased-squad2/checkpoint-22184/tokenizer_config.json
Special tokens file saved in deepset/bert-base-cased-squad2/checkpoint-22184/special_tokens_map.json


Step,Training Loss
500,0.1593
1000,0.1685
1500,0.1734
2000,0.1655
2500,0.27
3000,0.7256
3500,0.7342
4000,0.7177
4500,0.7132
5000,0.6867


Saving model checkpoint to deepset/bert-base-cased-squad2/checkpoint-33276
Configuration saved in deepset/bert-base-cased-squad2/checkpoint-33276/config.json
Model weights saved in deepset/bert-base-cased-squad2/checkpoint-33276/pytorch_model.bin
tokenizer config file saved in deepset/bert-base-cased-squad2/checkpoint-33276/tokenizer_config.json
Special tokens file saved in deepset/bert-base-cased-squad2/checkpoint-33276/special_tokens_map.json
Saving model checkpoint to deepset/bert-base-cased-squad2/checkpoint-44368
Configuration saved in deepset/bert-base-cased-squad2/checkpoint-44368/config.json
Model weights saved in deepset/bert-base-cased-squad2/checkpoint-44368/pytorch_model.bin
tokenizer config file saved in deepset/bert-base-cased-squad2/checkpoint-44368/tokenizer_config.json
Special tokens file saved in deepset/bert-base-cased-squad2/checkpoint-44368/special_tokens_map.json
Saving model checkpoint to deepset/bert-base-cased-squad2/checkpoint-55460
Configuration saved in deep

TrainOutput(global_step=55460, training_loss=0.3297708093840595, metrics={'train_runtime': 13293.1073, 'train_samples_per_second': 33.374, 'train_steps_per_second': 4.172, 'total_flos': 8.694224973160704e+16, 'train_loss': 0.3297708093840595, 'epoch': 5.0})

deBERTa-v2

In [9]:
deberta = "hf-internal-testing/tiny-random-deberta-v2"

In [10]:
deberta_tokenizer = AutoTokenizer.from_pretrained(deberta)

Downloading:   0%|          | 0.00/419 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.45M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/18.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/156 [00:00<?, ?B/s]

In [11]:
deberta_train_dataset = squad_dataset["train"].map(
    lambda x: preprocess_training_examples(x, deberta_tokenizer),
    batched=True,
    remove_columns=squad_dataset["train"].column_names,
)
len(squad_dataset["train"]), len(deberta_train_dataset)

  0%|          | 0/88 [00:00<?, ?ba/s]

(87599, 88230)

In [12]:
deberta_validation_dataset = squad_dataset["validation"].map(
    lambda x: preprocess_validation_examples(x, deberta_tokenizer),
    batched=True,
    remove_columns=squad_dataset["validation"].column_names,
)
len(squad_dataset["validation"]), len(deberta_validation_dataset)

  0%|          | 0/11 [00:00<?, ?ba/s]

(10570, 10744)

In [13]:
model = AutoModelForQuestionAnswering.from_pretrained(deberta).to("cuda")

Downloading:   0%|          | 0.00/643 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/50.1M [00:00<?, ?B/s]

Some weights of the model checkpoint at hf-internal-testing/tiny-random-deberta-v2 were not used when initializing DebertaV2ForQuestionAnswering: ['encoder.layer.2.attention.output.dense.bias', 'encoder.layer.2.attention.self.value_proj.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.3.attention.output.dense.weight', 'encoder.layer.4.attention.output.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.1.output.dense.bias', 'encoder.layer.4.attention.output.dense.bias', 'encoder.layer.2.attention.self.query_proj.bias', 'encoder.layer.0.attention.self.query_proj.bias', 'encoder.layer.3.attention.output.LayerNorm.weight', 'encoder.layer.3.attention.output.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'encoder.layer.2.intermediate.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'encoder.layer.3.attention.self.value_proj.weight', 'encoder.layer.3.attention.self.key_proj.bias', 'encoder.layer.0.attention.se

In [15]:
deberta_trainer = get_trainer(deberta, model, deberta_train_dataset, deberta_validation_dataset, deberta_tokenizer, 10)
deberta_trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using cuda_amp half precision backend
***** Running training *****
  Num examples = 88230
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 110290
  Number of trainable parameters = 4147003


Step,Training Loss
500,5.849
1000,5.3332
1500,4.9389
2000,4.7892
2500,4.7121
3000,4.6499
3500,4.6014
4000,4.5774
4500,4.5371
5000,4.4905


Saving model checkpoint to hf-internal-testing/tiny-random-deberta-v2/checkpoint-11029
Configuration saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-11029/config.json
Model weights saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-11029/pytorch_model.bin
tokenizer config file saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-11029/tokenizer_config.json
Special tokens file saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-11029/special_tokens_map.json
Saving model checkpoint to hf-internal-testing/tiny-random-deberta-v2/checkpoint-22058
Configuration saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-22058/config.json
Model weights saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-22058/pytorch_model.bin
tokenizer config file saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-22058/tokenizer_config.json
Special tokens file saved in hf-internal-testing/tiny-random-deberta-v2/checkpoint-22058/specia

TrainOutput(global_step=110290, training_loss=3.988194063086168, metrics={'train_runtime': 4164.5726, 'train_samples_per_second': 211.858, 'train_steps_per_second': 26.483, 'total_flos': 69268314240000.0, 'train_loss': 3.988194063086168, 'epoch': 10.0})

7. Report the obtained performance of the models (in the form of a table). The report should include exact match and F1 score for the tokens appearing both in the reference and the predicted answer.

In [22]:
def compute_metrics(start_logits, end_logits, features, examples, metric):
    n_best = 20
    max_answer_length = 30
    example_to_features = collections.defaultdict(list)
    results = []
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
            results.append({"question_id": example_id, "question": example["question"],"prediction_text": best_answer["text"], "answers": example["answers"], "logit_score": best_answer["logit_score"]})
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
            results.append({"question_id": example_id, "question": example["question"],"prediction_text": "", "answers": example["answers"], "logit_score": -9999})


    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers), results

bert_base_cased metrics

In [None]:
bert_base_cased_metric = evaluate.load("squad")

In [None]:
bert_base_cased_predictions, _, _ = bert_base_cased_trainer.predict(bert_base_cased_validation_dataset)
bert_base_cased_start_logits, bert_base_cased_end_logits = bert_base_cased_predictions
bert_base_cased_metric, bert_base_cased_answers = compute_metrics(bert_base_cased_start_logits, bert_base_cased_end_logits, bert_base_cased_validation_dataset, squad_dataset["validation"], bert_base_cased_metric)

The following columns in the test set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 10822
  Batch size = 8


  0%|          | 0/10570 [00:00<?, ?it/s]

In [None]:
bert_base_cased_metric

{'exact_match': 78.99716177861873, 'f1': 87.45072512449907}

deBERTa-v2 metrics

In [17]:
deberta_metric = evaluate.load("squad")

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

In [18]:
deberta_predictions, _, _ = deberta_trainer.predict(deberta_validation_dataset)
deberta_start_logits, deberta_end_logits = deberta_predictions
deberta_metric, deberta_answers = compute_metrics(deberta_start_logits, deberta_end_logits, deberta_validation_dataset, squad_dataset["validation"], deberta_metric)

The following columns in the test set don't have a corresponding argument in `DebertaV2ForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `DebertaV2ForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 10744
  Batch size = 8


  0%|          | 0/10570 [00:00<?, ?it/s]

In [19]:
deberta_metric

{'exact_match': 2.7341532639545885, 'f1': 8.781228187024613}

8. Report the best results obtained on the validation dataset and the corresponding results on your test dataset. The results on the test set have to be obtained for the model that yield the best result on the validation dataset.
9. Generate, report and analyze the answers provided by the best model on you test dataset.

bert_base_cased metrics

In [None]:
sorted(bert_base_cased_answers, key=lambda d: d['logit_score'])[-20:]


[{'question_id': '57273455f1498d1400e8f48c',
  'question': 'What is the Mongolian name for the original place of the Genghis Khan mausoleum?',
  'prediction_text': 'Edsen Khoroo',
  'answers': {'text': ['Edsen Khoroo', 'Edsen Khoroo', 'Edsen Khoroo'],
   'answer_start': [112, 112, 112]},
  'logit_score': 34.94},
 {'question_id': '5726e313f1498d1400e8eeb3',
  'question': 'In what form are most hospital medications?',
  'prediction_text': 'unit-dose, or a single dose of medicine',
  'answers': {'text': ['unit-dose, or a single dose of medicine',
    'unit-dose',
    'unit-dose, or a single dose of medicine'],
   'answer_start': [260, 260, 260]},
  'logit_score': 34.94},
 {'question_id': '57265e455951b619008f70bd',
  'question': 'What were the years two Regulations that conflicted with an Italian law originate in the Simmenthal SpA case? ',
  'prediction_text': '1964 and 1968',
  'answers': {'text': ['1964 and 1968',
    '1964 and 1968',
    '1964 and 1968',
    '1964 and 1968'],
   'answ

deBERTa-v2 metrics

In [20]:
sorted(deberta_answers, key=lambda d: d['logit_score'])[-20:]


[{'question_id': '5726356938643c19005ad301',
  'question': 'In cases with shared medium how is it delivered ',
  'prediction_text': 'Packet mode communication may be implemented with or without intermediate forwarding nodes (packet switches or routers). Packets are normally forwarded by intermediate network nodes asynchronously',
  'answers': {'text': ['the packets may be delivered according to a multiple access scheme',
    'according to a multiple access scheme',
    'multiple access scheme'],
   'answer_start': [497, 526, 541]},
  'logit_score': 11.15},
 {'question_id': '572970c11d04691400779466',
  'question': 'What theorem states that the probability that a number n is prime is inversely proportional to its logarithm?',
  'prediction_text': ' 300 BC',
  'answers': {'text': ['the prime number theorem',
    'prime number theorem',
    'prime number',
    'prime number theorem',
    'prime number theorem'],
   'answer_start': [319, 323, 323, 323, 323]},
  'logit_score': 11.15},
 {'qu

10. optional: perform hyperparameter tuning for the models to obtain better results. Take into account some of the following parameters: learning rate, gradient accumulation steps, batch size, gradient clipping, learning rate schedule

In [14]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

bert_base_cased_train_dataset.set_format("torch")
validation_set = bert_base_cased_validation_dataset.remove_columns(["example_id", "offset_mapping"])
validation_set.set_format("torch")

train_dataloader = DataLoader(
    bert_base_cased_train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)

In [15]:
model = AutoModelForQuestionAnswering.from_pretrained(bert_base_cased).to("cuda")

Downloading:   0%|          | 0.00/433M [00:00<?, ?B/s]

In [16]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [19]:
from accelerate import Accelerator

accelerator = Accelerator(fp16=True)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)



In [20]:
from transformers import get_scheduler

num_train_epochs = 4
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [24]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    print(f"Epoch: {epoch}")
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(bert_base_cased_validation_dataset)]
    end_logits = end_logits[: len(bert_base_cased_validation_dataset)]
    metric = evaluate.load("squad")
    bert_base_cased_metric = compute_metrics(
        start_logits, end_logits, bert_base_cased_validation_dataset, squad_dataset["validation"], metric
    )
    print(bert_base_cased_metric)

  0%|          | 0/44368 [00:00<?, ?it/s]

Epoch: 0
Evaluation!


  0%|          | 0/1353 [00:00<?, ?it/s]

  0%|          | 0/10570 [00:00<?, ?it/s]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Evaluation!


  0%|          | 0/1353 [00:00<?, ?it/s]

  0%|          | 0/10570 [00:00<?, ?it/s]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Evaluation!


  0%|          | 0/1353 [00:00<?, ?it/s]

  0%|          | 0/10570 [00:00<?, ?it/s]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Evaluation!


  0%|          | 0/1353 [00:00<?, ?it/s]

  0%|          | 0/10570 [00:00<?, ?it/s]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [26]:
bert_base_cased_metric[0]

{'exact_match': 79.56480605487228, 'f1': 87.94732093979019}

f1 score improved 0.5% and exact match around 0.6% for bert

11. Answer the following questions:
    * Which pre-trained model performs better on that task?
    
   "deepset/bert-base-cased-squad2" performed superb on this task achieving 79% exact match and almost 88% f1-score.
    
    * Does the performance on the validation dataset reflects the performance on your test set?
    
    * What are the outcomes of the model on your own questions? Are they satisfying? If not, what might be the reason for that?
    
    * Why extractive question answering is not well suited for inflectional languages?
    
    Because very often form in the context is different than the proper form that fit for given question.
    
    * Why you have to remove the duplicated questions from the training and the validation subsets?
    
    Different context