<a href="https://colab.research.google.com/github/Yanhan-ss/NLP/blob/main/Exercise_6_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 6: Question Answering

This session is based on lecture 11 (Question Answering).

Relevant SLP chapters:
* Chapter 14 (except 14.1 and 14.2)

## Extractive QA : SQuAD

In this exercise, we will learn how to finetune a transformer model on question answering, specifically extractive question answering, which is to predict / select the answer to a question given some context.
The benchmark dataset for extractive question answering is [SQuAD](https://huggingface.co/datasets/rajpurkar/squad).

We will finetune BERT on this dataset. (You are also free to explore other models).

Connecting to a GPU runtime in Google Colab is highly recommended for this exercise.

This is a fill-in-the-blank exercise: you are expected to fill in the parts of the code marked `# TODO: your code here`.

### Part 1: The dataset

In [None]:
!pip install -q "transformers[torch]" datasets evaluate "accelerate" pandas

In [None]:
import collections

import evaluate
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from datasets import (
    load_dataset,
)
from IPython.display import HTML, display
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    DefaultDataCollator,
)

In [None]:
# we will only use a subset of the train split for squad for training because of limited compute and time
# you can load more if you want and have the compute/time
squad = load_dataset("squad", split="train[:2500]")

In [None]:
squad.column_names

In [None]:
# We'll turn the dataset into a dataframe for easier inspection
squad_df = squad.to_pandas()

In [None]:
# Re-run this cell to see some other random examples
display(HTML(squad_df.sample(10).to_html()))

If you look at the examples from the dataset above, each one has an example ID, the
title, the context from which the answer is extracted, the question, and the answer, which includes the index of the starting character of the answer in the context.
The model will be trained to predict these spans.
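
As a small illustration of how `answer_start` relates to the context, the sketch below uses a hand-written record that mimics the SQuAD columns. The record itself is hypothetical and not taken from the dataset; only the field names follow the schema shown above.

```python
# Hypothetical record written by hand to mimic the SQuAD schema (not from the dataset).
context = "The Eiffel Tower was completed in 1889 in Paris."
record = {
    "id": "toy-0",
    "title": "Eiffel_Tower",
    "context": context,
    "question": "When was the Eiffel Tower completed?",
    "answers": {"text": ["1889"], "answer_start": [context.index("1889")]},
}

# The answer span is given in characters: start index plus the answer length.
start_char = record["answers"]["answer_start"][0]
end_char = start_char + len(record["answers"]["text"][0])

# Slicing the context with the character span recovers the answer text.
print(record["context"][start_char:end_char])
```

The same start/end character arithmetic is what the preprocessing in Part 2 starts from.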

### Part 2: Preprocessing

The SQuAD finetuning pipeline requires more preprocessing than the average text classification dataset. We will use the transformers tokenizers library that we briefly introduced in Exercise 1 to tokenize our text and preprocess the text.

In [None]:
model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# these variables will become clear later
doc_stride = 128
max_token_length = 512
squad_v2 = False
pad_on_right = tokenizer.padding_side == "right"

Now one specific thing for the preprocessing in question answering is how to deal with very long documents.



Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, we can truncate (shorten) only the `context` by setting `truncation="only_second"`.
Next, we can map the start and end positions of the answer to the original `context` by setting `return_offsets_mapping=True`.
With the mapping in hand, now we can find the start and end tokens of the answer. We will use the `sequence_ids` method to find which part of the offset corresponds to the `question` and which corresponds to the `context`.


In [None]:
example = None

for ex in squad:
    if len(tokenizer(ex["question"], ex["context"])["input_ids"]) > max_token_length:
        example = ex
        break

example

Without any truncation, we get the following length for the input IDs:

In [None]:
len(tokenizer(example["question"], example["context"])["input_ids"])

Now, if we just truncate, we will lose information (and possibly the answer to our question):

In [None]:
len(
    tokenizer(
        example["question"],
        example["context"],
        max_length=max_token_length,
        truncation="only_second",
    )["input_ids"]
)

Removing part of the `context` might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in our dataset to give several input features, each of length shorter than the maximum length of the model (or the one we set as a hyper-parameter). We can enable this functionality by setting `return_overflowing_tokens=True`. Also, just in case the answer lies at the point we split a long context, we allow some overlap between the features we generate controlled by the hyper-parameter `doc_stride`.

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_token_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride,
)

Now we don't have one list of `input_ids`, but several:

In [None]:
[len(x) for x in tokenized_example["input_ids"]]

And if we decode them, we can see the overlap:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))
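
The windowing idea can also be imitated on a plain list, without a tokenizer. This is only a sketch under simplifying assumptions: `sliding_windows` is a hypothetical helper, it ignores the question and special tokens, and `overlap` plays the role that `stride` plays above (the number of tokens shared by consecutive windows).

```python
def sliding_windows(tokens, window, overlap):
    """Split `tokens` into windows of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start : start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += step
    return windows


toy_tokens = list(range(10))  # stand-in for the context token ids
toy_windows = sliding_windows(toy_tokens, window=4, overlap=2)
for w in toy_windows:
    print(w)
```

Every token ends up in at least one window, and an answer shorter than the overlap cannot be cut in half by every window that touches it.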

We need to find which of those features actually contains the answer, and where exactly in that feature it is.
The models we will use require the start and end positions of these answers in the tokens, so we will also need to map parts of the original context to some tokens.
The tokenizer we're using can help us with that by returning an `offset_mapping`:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_token_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)
print(tokenized_example["offset_mapping"][0][:100])

This gives the corresponding start and end character in the original text for each token in our input IDs. The very first token (`[CLS]`) has (0, 0) because it doesn't correspond to any part of the question/answer, then the second token is the same as the characters 0 to 3 of the question:

In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(
    tokenizer.convert_ids_to_tokens([first_token_id])[0],
    example["question"][offsets[0] : offsets[1]],
)
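
To make the idea of an offset mapping concrete without the BERT tokenizer, here is a minimal sketch that uses a whitespace "tokenizer". The function `whitespace_offsets` is a hypothetical stand-in: real subword tokenizers split text differently, but the returned offsets have the same meaning (start and end character of each token in the original string).

```python
import re


def whitespace_offsets(text):
    """Return (token, start_char, end_char) for every whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


toy_question = "Who wrote Hamlet?"
toy_offsets = whitespace_offsets(toy_question)
for token, start, end in toy_offsets:
    # Slicing the original text with the offsets gives back the token.
    print(token, (start, end), toy_question[start:end])
```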

So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which parts correspond to the context. This is where the `sequence_ids` method of our `tokenized_example` is useful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

It returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sequence passed (the question) or the second (the context). With all of this, we can find the first and last token of the answer in one of our input features:

In [None]:
answers = example["answers"]  # testing on one example
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# TODO: your code here:
# Initially, we select the entire span, your task is to find:
# - token_start_index
# - token_end_index
# And correct them using the offset, which should result in:
# - start_position
# - end_position
# In this part, we're only working with one example!

# This is the full span, you have to do some more processing!
token_start_index = 0
token_end_index = len(tokenized_example["input_ids"][0]) - 1

offsets = tokenized_example["offset_mapping"][0]
start_position = ...
end_position = ...
print(start_position, end_position)
# This should print: 13 14

In [None]:
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position : end_position + 1]))
print(answers["text"][0])
# This sanity check should give 1565 both times!

Now let's put everything together in one function we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the cls index for both the start and end position.

In [None]:
def preprocess(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_token_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # TODO: your code here
        # get the start and end character of the answer
        # get the start and end token indices, just as you've done above

        token_start_index = ...
        token_end_index = ...

        # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
        if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # TODO: your code here
            # move the token_start_index and token_end_index to the two ends of the answer.
            # note: we could go after the last offset if the answer is the last word (edge case).
            # keep in mind to account for - 1 and + 1


            tokenized_examples["start_positions"].append(token_start_index)
            tokenized_examples["end_positions"].append(token_end_index)

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = preprocess(squad[:2])
features

In [None]:
# the datasets library has a .map() function that allows you to apply any function to each example in the dataset
# it also allows for batching, so we can set batched=True for faster processing

tokenized_squad = squad.map(preprocess, batched=True, remove_columns=squad.column_names)

### Part 3: Finetuning

In [None]:
# this will give some warnings you can ignore
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

In [None]:
data_collator = DefaultDataCollator()
train_loader = DataLoader(tokenized_squad, batch_size=16, shuffle=True, collate_fn=data_collator)

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# move model over to detected device
model.to(device)
# initialize our optimizer
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

In [None]:
# set model to train mode
model.train()

n_epochs = 3

# This will take about 15 minutes on a GPU; keep this in mind when doing the exercise!
for epoch in range(n_epochs):
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # reset the gradients accumulated in the previous step
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        start_positions = batch["start_positions"].to(device)
        end_positions = batch["end_positions"].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(
            input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions
        )
        # extract loss
        loss = outputs[0]
        # backpropagate: compute the gradient of the loss for every parameter that needs a grad update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

We will now perform inference on our trained model.
Let's define our question and context:

In [None]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 natural languages and 13 programming languages."

In [None]:
inputs = tokenizer(question, context, return_tensors="pt").to(device)

In [None]:
# we won't compute gradients since we only want to do inference
with torch.no_grad():
    outputs = model(**inputs)  # do a forward pass in the model
    print(outputs)

In [None]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

In [None]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]  # get the tokenized answer
tokenizer.decode(predict_answer_tokens)  # convert those tokens to readable text

#### Visualising the scores

We can visualize tokens in the sequence and the scores the model assigns to them.
With this, we have some idea of what the model is looking for in answering these questions.

In [None]:
# Use plot styling from seaborn and increase size
sns.set_style("darkgrid")
plt.rcParams["figure.figsize"] = (20, 10)

Retrieve all of the start and end scores, and use all of the tokens as y-axis labels.

In [None]:
# TODO your code here:
# create a plot with two subplots:
#   1. tokens + token ids on the y-axis and the start scores on the x-axis
#   2. tokens + token ids on the y-axis and the end scores on the x-axis
# note: use `input_ids` and `outputs` you've used above
# the tokenizer can convert them back to strings
# make sure the tokens are unique...
start_scores = ...
end_scores = ...
token_labels = ...

Create bar plots showing, for every input token, its score for being the "start" token and the "end" token of the answer.

In [None]:
# Create side-by-side bar plots of the start and end scores for all of the tokens.
fig, ax = plt.subplots(1, 2)
ax1 = sns.barplot(x=start_scores, y=token_labels, ax=ax[0])
ax2 = sns.barplot(x=end_scores, y=token_labels, ax=ax[1], color="red")

ax1.grid(True)
ax2.grid(True)

ax1.set_title("Start Scores")
ax2.set_title("End Scores")

plt.show()

Try out some other examples.
See how the scores change, try to come up with some hard examples.
You can take inspiration from what we've seen in the course:
* ambiguity: structural, lexical, etc.
* spelling errors
* non-standard text
* ...

Explore what's easy and what's hard for the model.

### Part 4: Evaluation

Extractive QA is harder to evaluate than just comparing labels; more processing is needed.
We need to make sure we can map model predictions back to the context and compare this with the intended answer if we want to evaluate the model.

In [None]:
squad_val = load_dataset("squad", split="validation[200:204]")  # Again a smaller sample
squad_val_tokenized = squad_val.map(preprocess, batched=True, remove_columns=squad_val.column_names)
val_loader = DataLoader(squad_val_tokenized, batch_size=4, shuffle=False, collate_fn=DefaultDataCollator())

In [None]:
batch = next(iter(val_loader))
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    output = model(**batch)
output.keys()

In [None]:
batch

In [None]:
print((output.start_logits.shape, output.end_logits.shape))

print(output.start_logits.argmax(-1))
print(output.end_logits.argmax(-1))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum of the start logits as the start position and the index of the maximum of the end logits as the end position.

This will work great in a lot of cases, but what if this prediction gives us something impossible? The start position could be greater than the end position, or point to a span of text in the question instead of the context. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.

To rank our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers; instead, we limit ourselves with a hyper-parameter we call `n_best_size`. We'll pick the best indices in the start and end logits and gather all the answers this predicts. After checking if each one is valid, we will sort them by their score and keep the best one.
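
The ranking idea can be tried on toy numbers first. The sketch below is only an illustration under stated assumptions: the logits are made up, `best_spans` is a hypothetical helper, and it works purely on token indices, so it does not check whether a span lies in the context or map it back to text.

```python
def best_spans(start_logits, end_logits, n_best_size=3, max_answer_length=4):
    """Score (start, end) pairs built from the top-n start and end indices."""
    top_starts = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    top_ends = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:n_best_size]
    candidates = []
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({"start": s, "end": e, "score": start_logits[s] + end_logits[e]})
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


# Made-up logits: the best start (index 4) lies after the best end (index 3),
# so taking the two argmaxes independently would give an impossible span.
toy_start_logits = [0.1, 0.2, 2.5, 0.5, 3.0, 0.0]
toy_end_logits = [0.0, 0.1, 0.3, 2.8, 0.2, 1.0]

ranked = best_spans(toy_start_logits, toy_end_logits)
print(ranked[0])
```

Here the highest-scoring valid span pairs the second best start with the best end, which is exactly the kind of case the independent argmax approach misses.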

The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:

* the ID of the example that generated the feature (since each example can generate several features, as seen before);
* the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `preprocess`:


In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_token_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to [] the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else [])
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [None]:
validation_features = squad_val.map(
    prepare_validation_features,
    batched=True,
    remove_columns=squad_val.column_names,
)
val_loader = DataLoader(validation_features, batch_size=4, shuffle=False, collate_fn=DefaultDataCollator())

In [None]:
test_df = validation_features.to_pandas()
test_df

In [None]:
raw_predictions = {
    "start_logits": [],
    "end_logits": [],
}

with torch.no_grad():
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        batch.pop("offset_mapping")
        outputs = model(**batch)
        # we unpack the batch here so it's easier to work with the examples later on
        raw_predictions["start_logits"].extend([ex for ex in outputs["start_logits"].cpu()])
        raw_predictions["end_logits"].extend([ex for ex in outputs["end_logits"].cpu()])
        break

In [None]:
raw_predictions["start_logits"][0].shape  # one sample instead of the batch size

We can now refine the test we had before: since we set `[]` in the offset mappings for tokens that are not part of the context, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from consideration (with a hyper-parameter we can tune).

This part is again on a single example, the full implementation is after this.

In [None]:
max_answer_length = 30
n_best_size = 20

start_logits = output.start_logits[0].cpu()
end_logits = output.end_logits[0].cpu()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index
context = squad_val[0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = torch.argsort(start_logits)[-n_best_size - 1 :].tolist()
end_indexes = torch.argsort(end_logits)[-n_best_size - 1 :].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or len(offset_mapping[start_index]) == 0
            or len(offset_mapping[end_index]) == 0
        ):
            continue

        # TODO: your code here
        # Ignore answers with a length that is either < 0 or > max_answer_length.


        # TODO: your code here
        # Check that the answer is inside the context and fill the variables.
        if ...:
            start_char = ...
            end_char = ...
            score = ...
            text = ...
            valid_answers.append({"score": score, "text": text})

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers = "\n".join(str(item) for item in valid_answers)

print(f'Question: {squad_val[0]["question"]}')
print(f'Partial context: {squad_val[0]["context"][:200]}')
print(f'Gold answers: {squad_val[0]["answers"]}')
print(f"Predictions:\n{valid_answers}")

Your code should give:

```
Question: Who had the best record in the NFC?
Partial context: Despite waiving longtime running back DeAngelo Williams and losing top wide receiver Kelvin Benjamin to a torn ACL in the preseason, the Carolina Panthers had their best regular season in franchise hi
Gold answers: {'text': ['Carolina Panthers', 'the Panthers', 'Carolina'], 'answer_start': [137, 695, 330]}
Predictions:
{'score': tensor(4.2731), 'text': 'Carolina Panthers'}
{'score': tensor(1.9384), 'text': 'Panthers'}
{'score': tensor(1.8279), 'text': 'the Carolina Panthers'}
{'score': tensor(0.4810), 'text': 'Carolina'}
{'score': tensor(-1.9641), 'text': 'the Carolina'}
{'score': tensor(-2.3274), 'text': 'New Orleans Saints and the 2011 Green Bay Packers'}
{'score': tensor(-2.3703), 'text': 'Panthers'}
{'score': tensor(-2.7467), 'text': 'New Orleans Saints and the 2011 Green Bay Packers. With their NFC-best 15–1 regular season record, the Panthers'}
{'score': tensor(-3.2434), 'text': 'the Panthers'}
{'score': tensor(-3.2682), 'text': 'DeAngelo Williams and losing top wide receiver Kelvin Benjamin to a torn ACL in the preseason, the Carolina Panthers'}
{'score': tensor(-3.5486), 'text': '2009 New Orleans Saints and the 2011 Green Bay Packers'}
{'score': tensor(-3.5672), 'text': 'Green Bay Packers'}
{'score': tensor(-3.6009), 'text': 'Carolina Panthers had their best regular season in franchise history'}
{'score': tensor(-3.6009), 'text': 'New Orleans Saints'}
{'score': tensor(-3.9680), 'text': '2009 New Orleans Saints and the 2011 Green Bay Packers. With their NFC-best 15–1 regular season record, the Panthers'}
{'score': tensor(-3.9865), 'text': 'Green Bay Packers. With their NFC-best 15–1 regular season record, the Panthers'}
{'score': tensor(-4.8222), 'text': '2009 New Orleans Saints'}
{'score': tensor(-4.8484), 'text': 'New Orleans Saints and the 2011 Green Bay Packers. With their NFC-best 15–1'}
{'score': tensor(-4.9472), 'text': '2011 Green Bay Packers'}
{'score': tensor(-5.0479), 'text': 'New Orleans Saints and the 2011 Green Bay Packers. With their NFC-best 15–1 regular season record'}
```


As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one. The following code builds a map from each example index to its corresponding feature indices:

In [None]:
examples = squad_val
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

The code above only keeps answers that are inside the context. We also need to grab the score for the impossible answer (whose start and end indices both correspond to the index of the CLS token). When one example gives several features, we should predict the impossible answer only when all the features give it a high score, since a single feature could predict the impossible answer just because the answer isn't in the part of the context it has access to. That is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. All combined together, this gives us this post-processing function:

In [None]:
def postprocess_qa_predictions(
    examples, features, start_logits_predictions, end_logits_predictions, n_best_size=20, max_answer_length=30
):
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None  # Only used if squad_v2 is True (see Part 5)
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = start_logits_predictions[feature_index]
            end_logits = end_logits_predictions[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = torch.argsort(start_logits)[-n_best_size - 1 :].tolist()
            end_indexes = torch.argsort(end_logits)[-n_best_size - 1 :].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or len(offset_mapping[start_index]) == 0
                        or len(offset_mapping[end_index]) == 0
                    ):
                        continue
                    # TODO: your code here
                    # include what you've done above on a single example on everything here
                    score = ...
                    text = ...
                    valid_answers.append({"score": score, "text": text})

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where there is not a single non-null prediction, we create a fake prediction
            # to avoid failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if squad_v2:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer
        else:
            predictions[example["id"]] = best_answer["text"]

    return predictions
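
The rule for the impossible answer described before the function (take the minimum null score over an example's features, then compare it with the best span score) can be isolated on toy numbers. This is a sketch only: `pick_answer` is a hypothetical helper and the scores are made up.

```python
def pick_answer(feature_null_scores, best_span):
    """Return the best span text unless every feature prefers the null answer."""
    # The example-level null score is the minimum over its features:
    # one feature that contains the answer is enough to lower it.
    null_score = min(feature_null_scores)
    return best_span["text"] if best_span["score"] > null_score else ""


toy_best_span = {"text": "Carolina Panthers", "score": 4.2}

# One feature does not see the answer (high null score), the other does (low null score).
answerable = pick_answer([6.0, -1.5], toy_best_span)
# Both features give the null answer a higher score than the best span.
unanswerable = pick_answer([6.0, 5.5], toy_best_span)
print(repr(answerable), repr(unanswerable))
```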

In [None]:
final_predictions = postprocess_qa_predictions(
    squad_val,
    validation_features,
    raw_predictions["start_logits"],
    raw_predictions["end_logits"],
)

In [None]:
# huggingface provides a way to compute pre-defined metrics
# for convenience, we're using this here as well
metric = evaluate.load("squad_v2" if squad_v2 else "squad")
if squad_v2:
    formatted_predictions = [
        {"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()
    ]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad_val]
metric.compute(predictions=formatted_predictions, references=references)
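
For intuition about what the metric reports, here is a simplified sketch of exact match and token-level F1 as commonly defined for SQuAD: answers are lowercased, stripped of punctuation and English articles, and each prediction is scored against its best-matching gold answer. The helper names are hypothetical, and this is not the `evaluate` implementation, which also averages over examples and reports scores on a 0 to 100 scale.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold_answers = ["Carolina Panthers", "the Panthers", "Carolina"]
toy_prediction = "Carolina Panthers team"

# Score against every gold answer and keep the best value.
best_em = max(exact_match(toy_prediction, g) for g in gold_answers)
best_f1 = max(f1_score(toy_prediction, g) for g in gold_answers)
print(best_em, round(best_f1, 2))
```

A prediction with one extra word gets no exact-match credit but still earns partial F1, which is why both numbers are usually reported together.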

### Part 5: SQuAD v2

After the release of SQuAD, SQuAD v2 was released.
This version of the dataset combines the 100k questions from SQuAD v1 with an additional 50k unanswerable questions.
This increases the complexity of the task: systems must not only answer questions, but also determine when no answer is supported by the paragraph and abstain from answering.

Here, we're looking at two problems we have to fix before we can use SQuAD v2:

- What additional pre-processing has to be done?
- How do we evaluate this definition of the task?

In [None]:
squadv2 = load_dataset("squad_v2")

In [None]:
# an example where the answer is not in the context
squadv2["train"][130318]

In [None]:
# this will error because of empty answer spans
features_v2 = preprocess(squadv2["train"][130316:130318])

# TODO: adapt the functions above for preprocessing so they can deal with empty answers