# **HuggingFace - Question Answering**

In this notebook, we will explore the use of transformers in question answering. **Question Answering** is the task of answering a question based on a context supplied by the user. One application is giving the model a long document and extracting answers from it.

## **Pre-requisite Steps**
First, we have to install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
!apt install git-lfs


You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your **write token** and check the Git option checkbox.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## **Exploring the dataset**
Here we load in the dataset named `squad` from HuggingFace. This dataset is mostly used as an academic benchmark for extractive question answering, as per HuggingFace.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("squad")


To see what the dataset contains, we print it. Features include an ID, Title, Context, a Question, and an array of answers.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Here is what each of those features contains:
*   **ID** - Unique Identifier for the row.
*   **Title** - Title describing the question and context.
*   **Context** - The content where the answer will be taken from.
*   **Question** - The question to be asked about the context.
*   **Answer** - The answers taken from the context.

However, we'll only really be focusing on the last three. Let's see what they contain below.

In [None]:
print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


In the training dataset, we have only 1 answer for each of the questions. However, we will see that it is different for the validation dataset.

In [None]:
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In contrast to the training dataset, examples in the validation dataset can have several reference answers, which may be identical or differ from one another. A prediction is then scored against all of them. To show this, we can check some of the answers in the dataset.

In [None]:
print(raw_datasets["validation"][0]["answers"])
print(raw_datasets["validation"][2]["answers"])

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
{'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}


Let's also look at the context and question for the second answer array we printed, so we can check for ourselves that the answers are accurate.

In [None]:
print(raw_datasets["validation"][2]["context"])
print(raw_datasets["validation"][2]["question"])

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Where did Super Bowl 50 take place?


## **Processing the Training Data**
We'll be using a BERT model for this and fine-tuning it to fit our needs. First though, we'll have to tokenize our dataset.

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


The tokenizer we're using is a fast tokenizer, which is much faster at tokenizing batches of data. More importantly for this task, fast tokenizers provide the offset mappings and sequence IDs that we rely on below. Any other transformer model can be used in place of BERT, as long as it has a fast tokenizer.

In [None]:
tokenizer.is_fast

True

We then pass in the question and context into the tokenizer and it will then format it like so:

`[CLS] {question} [SEP] {context} [SEP]`

In [None]:
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building \' s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

One case we need to be prepared for is a context that is too long for the model. We will cap each input at 384 tokens (the `max_length` we set later), but some of our contexts exceed that. To get around this, we split a long context into several overlapping chunks using a **sliding window**. We can see this in action in the cell below, which uses a smaller `max_length` of 100 for illustration.

In [None]:
inputs = tokenizer(
    question,
    context,
    max_length=100, # Max Length of the tokens
    truncation="only_second", # Truncate the context, which is the second parameter
    stride=50, # No. of tokens that overlap between successive slides
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building ' s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the

As we can see, the example has been split into 4 inputs, where consecutive inputs share 50 tokens of context, as set by the `stride` parameter. You can also see that the answer to the question appears only in the third and last inputs. The remaining inputs give the model training cases where the answer is not in the context.
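
To make the sliding window concrete, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: `sliding_windows`, `window`, and `overlap` are hypothetical names (with `overlap` playing the role of `stride`), and the question and special tokens are ignored for simplicity.

```python
def sliding_windows(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `overlap` items, mimicking the role of `stride`.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        # Stop once the current chunk reaches the end of the sequence
        if start + window >= len(tokens):
            break
        start += step
    return chunks


toy_tokens = list(range(10))  # stand-in for 10 context tokens
toy_chunks = sliding_windows(toy_tokens, window=4, overlap=2)
for chunk in toy_chunks:
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Each chunk repeats the last two tokens of the previous one, so an answer that straddles a chunk boundary still appears whole in at least one chunk, provided it is no longer than the overlap.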

Right now, the dataset only gives us the character position where the answer starts in the context (for example, 515 for "Saint Bernadette Soubirous"), while the model needs token positions. To bridge the two, we also ask the tokenizer to return the offset mapping, which gives the start and end character of every token in the original text. With it, we can find the tokens that cover the whole answer span rather than just its first character.
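
As a toy illustration of what an offset mapping provides, the sketch below uses a simple regex splitter instead of the real BERT tokenizer. The helper `tokenize_with_offsets` and the sample sentence are assumptions made for this example only.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, (start_char, end_char)) pairs for words and punctuation."""
    return [
        (m.group(), (m.start(), m.end()))
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


toy_text = "It appeared to Saint Bernadette Soubirous in 1858."
toy_pairs = tokenize_with_offsets(toy_text)

toy_answer = "Saint Bernadette Soubirous"
start_char = toy_text.index(toy_answer)
end_char = start_char + len(toy_answer)

# Keep the tokens whose character span lies inside the answer span
answer_token_ids = [
    i for i, (_, (s, e)) in enumerate(toy_pairs)
    if s >= start_char and e <= end_char
]
print(answer_token_ids)  # [3, 4, 5]
```

The character offsets are what let us go from "the answer starts at character N" to "the answer covers tokens 3 through 5".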

In [None]:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

We can see `offset_mapping` among the keys, but there is also `overflow_to_sample_mapping`. This key tells us which sample (row) each feature came from. In our case, since we only passed one row (i.e. 1 question, 1 context), we'll see four 0's.

In [None]:
inputs["overflow_to_sample_mapping"]

[0, 0, 0, 0]

However, if we put in more data, we can see that we are given a list of integers representing which features came from which sample.

In [None]:
inputs = tokenizer(
    raw_datasets["train"][2:6]["question"],
    raw_datasets["train"][2:6]["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

The 4 examples gave 19 features.
Here is where each comes from: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3].
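
A quick way to read this mapping is to count how many features each sample produced. The sketch below copies the list printed above into a plain Python list (`toy_mapping` is just an illustrative name) so it runs without the tokenizer.

```python
from collections import Counter

# The overflow_to_sample_mapping printed above, copied by hand
toy_mapping = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

features_per_sample = Counter(toy_mapping)
for sample_idx, n_features in sorted(features_per_sample.items()):
    print(f"sample {sample_idx}: {n_features} features")
# sample 0: 4 features
# sample 1: 4 features
# sample 2: 4 features
# sample 3: 7 features
```

The last sample has a longer context, so it is split into more features than the others.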


Here we go through every feature and find the position of the answer as token indices. The indices count from the start of the full tokenized input (including `[CLS]` and the question), so an answer spanning three tokens that begins at token 83 is labeled (83, 85). We also check whether the feature's context fully contains the answer. If it does not, we label the feature (0, 0), the same label used when the answer is not in the context at all.

In [None]:
answers = raw_datasets["train"][2:6]["answers"]
start_positions = []
end_positions = []

for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the context, label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise it's the start and end token positions
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

start_positions, end_positions

([83, 51, 19, 0, 0, 64, 27, 0, 34, 0, 0, 0, 67, 34, 0, 0, 0, 0, 0],
 [85, 53, 21, 0, 0, 70, 33, 0, 40, 0, 0, 0, 68, 35, 0, 0, 0, 0, 0])
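
To see the labeling logic on something small enough to trace by hand, here is the same search wrapped in a function and applied to hand-written offsets. The function name `find_token_span` and the toy offsets are assumptions for this sketch; the loop bodies mirror the cell above.

```python
def find_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully inside."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward until a token starts after the answer's first character
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward until a token ends before the answer's last character
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# [CLS] Which one ? [SEP] The Main Building is old [SEP]
toy_offsets = [
    (0, 0), (0, 5), (6, 9), (9, 10), (0, 0),
    (0, 3), (4, 8), (9, 17), (18, 20), (21, 24), (0, 0),
]
# "Main Building" covers characters 4..17 of the toy context
print(find_token_span(toy_offsets, 5, 9, start_char=4, end_char=17))  # (6, 7)
# If the window ends at "Main", the answer is cut off and gets the (0, 0) label
print(find_token_span(toy_offsets, 5, 6, start_char=4, end_char=17))  # (0, 0)
```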

Let's check whether the label we got is correct for the first feature.

In [None]:
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

Theoretical answer: the Main Building, labels give: the Main Building


We got the correct answer for the first one! Now let's try a feature that has no answer, labeled (0, 0), such as the one at index 4.

In [None]:
idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")

Theoretical answer: a Marian place of prayer and reflection, decoded example: [CLS] What is the Grotto at Notre Dame? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building ' s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grot [SEP]


We can validate that the answer is nowhere to be found in the snippet of the context given. Now that we understand the process, we can group the functions together and apply it to the entire dataset.

In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

We utilize the `Dataset.map()` method with the `batched=True` option to process the entire training dataset using the `preprocess_training_examples` function. This approach is essential because the function can generate multiple training features from a single example, effectively altering the dataset's overall length.

In [None]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)


(87599, 88729)

From this, we can see that we gained 1,130 features (88,729 features from 87,599 examples), which is why `batched=True` is important: it lets the function return more rows than it received. The training set is now ready to be used, so let's move on to pre-processing the validation set.

## **Processing the Validation Data**
For the validation data, we don't need to generate labels since we won't be computing a validation loss (that number doesn't really tell us how good the model is). We do two things differently here. First, we store the ID of the example each feature came from in an `example_id` column. Second, we set the offset mappings of the question part in each tokenized input to `None`. This is because the `sequence_ids` method we used earlier won't be available in the post-processing, so we can't use it to differentiate between the question and the context.

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
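
The offset masking inside the loop can be illustrated with hand-written values. In this sketch, `toy_sequence_ids` and `toy_offsets` are made-up stand-ins for what the tokenizer would return for one feature.

```python
# None = special token, 0 = question token, 1 = context token
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 5), (6, 9), (0, 0), (0, 3), (4, 8), (9, 17), (0, 0)]

# Keep offsets only for context tokens; everything else becomes None
masked_offsets = [
    offset if seq_id == 1 else None
    for seq_id, offset in zip(toy_sequence_ids, toy_offsets)
]
print(masked_offsets)
# [None, None, None, None, (0, 3), (4, 8), (9, 17), None]
```

During post-processing, a `None` offset is then enough to tell that a token is not part of the context.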

We then apply it to the entire dataset as we did before.

In [None]:
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)


(10570, 10822)

This time around, the number of features increased by only 252, so just a small share of the validation contexts are long enough to be split.

## **Fine-tuning the model with the Trainer API**

The model predicts the start and end positions of the answer within the tokenized input. To refine these predictions, a post-processing step is applied, which involves:

- **Masking irrelevant logits**: Logits corresponding to tokens outside the context are ignored.
- **Skip Softmax**: Instead of converting logits to probabilities using softmax, we'll work directly with the logits, saving computation time.
- **Scoring answer spans**: Answer spans are scored using the sum of their corresponding start and end logits (applying the logarithmic property: log(a*b) = log(a) + log(b)).
- **Selecting the best answer**:  The answer span with the highest logit score, and which forms a valid answer (start before end), is chosen as the final prediction.

Essentially, the post-processing refines the raw model output by focusing on relevant tokens, scoring potential answer spans using logits directly, and selecting the most likely and valid answer.
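
The steps above can be sketched on toy logits. Within one feature, softmax probabilities are proportional to the exponential of the logits, so ranking spans by the product of start and end probabilities gives the same order as ranking them by the sum of the logits. All values and names below (`toy_start_logits`, `toy_in_context`, and so on) are made up for illustration.

```python
toy_start_logits = [0.1, 0.2, 2.5, 0.3, 1.0]
toy_end_logits = [0.0, 0.1, 0.4, 3.0, 2.0]
# False marks tokens outside the context (special tokens and the question)
toy_in_context = [False, False, True, True, True]
toy_max_answer_length = 2

candidates = []
for start, start_logit in enumerate(toy_start_logits):
    for end, end_logit in enumerate(toy_end_logits):
        # Mask tokens outside the context
        if not (toy_in_context[start] and toy_in_context[end]):
            continue
        # Reject spans that end before they start or are too long
        if end < start or end - start + 1 > toy_max_answer_length:
            continue
        candidates.append((start_logit + end_logit, start, end))

best_score, best_start, best_end = max(candidates)
print(best_score, best_start, best_end)  # 5.5 2 3
```

No softmax is computed at any point, and the best span is simply the valid pair with the largest summed logit.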

To demonstrate this, we will be using another trained model for now, as our model is currently untrained.

In [None]:
small_eval_set = raw_datasets["validation"].select(range(100))
trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)


We then return to the tokenizer we had for our original model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Next, we prepare the `eval_set` for the model by removing unnecessary columns, creating a batch containing the entire small validation set, and feeding it to the model for processing. To expedite this process, we utilize a GPU if one is available.

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering

eval_set_for_model = eval_set.remove_columns(["example_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(
    device
)

with torch.no_grad():
    outputs = trained_model(**batch)


Since the `Trainer` will give us predictions as NumPy arrays, we convert the start and end logits to that format as well.

In [None]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

To obtain the predicted answer for each example in `small_eval_set`, we need to account for the fact that some examples may have been split into multiple features within `eval_set`. So, we have to map each example in `small_eval_set` to its corresponding features in `eval_set`.

In [None]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)
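
For a sense of what this mapping looks like, here is the same loop run on a tiny hand-made feature list. The IDs `"a"` and `"b"` and the `toy_` names are placeholders for this sketch.

```python
import collections

# Example "a" was split into two features, example "b" fits in one
toy_features = [{"example_id": "a"}, {"example_id": "a"}, {"example_id": "b"}]

toy_example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(toy_features):
    toy_example_to_features[feature["example_id"]].append(idx)

print(dict(toy_example_to_features))  # {'a': [0, 1], 'b': [2]}
```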

With this mapping in place, we can go through each example and its related features. For each example, we look at the scores of the top `n_best` start and end positions and discard any candidate answers that don't make sense:

1. Answers that fall outside the context.
2. Answers with a negative length (end position before start position).
3. Answers exceeding a maximum length (defined by `max_answer_length`, e.g., 30).

Once we've checked out all the possible answers for an example, we'll simply choose the one with the best score as our prediction!

In [None]:
import numpy as np

n_best = 20
max_answer_length = 30
predicted_answers = []

for example in small_eval_set:
    example_id = example["id"]
    context = example["context"]
    answers = []

    for feature_index in example_to_features[example_id]:
        start_logit = start_logits[feature_index]
        end_logit = end_logits[feature_index]
        offsets = eval_set["offset_mapping"][feature_index]

        start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
        end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Skip answers that are not fully in the context
                if offsets[start_index] is None or offsets[end_index] is None:
                    continue
                # Skip answers with a length that is either < 0 or > max_answer_length.
                if (
                    end_index < start_index
                    or end_index - start_index + 1 > max_answer_length
                ):
                    continue

                answers.append(
                    {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                )

    best_answer = max(answers, key=lambda x: x["logit_score"])
    predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
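
The slice `np.argsort(logits)[-1 : -n_best - 1 : -1]` used above picks the indices of the `n_best` largest logits, from highest to lowest. A pure-Python equivalent for illustration (the helper `top_n_indices` is a hypothetical name, and the order of tied values may differ from NumPy's):

```python
def top_n_indices(scores, n):
    """Indices of the n largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


toy_logits = [0.3, 2.1, -1.0, 5.4, 0.9]
print(top_n_indices(toy_logits, 3))  # [3, 1, 4]
```

Limiting the search to the top `n_best` start and end positions keeps the nested loop at no more than `n_best * n_best` candidate spans per feature.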

To evaluate the answers, we use HuggingFace's Evaluate library.

In [None]:
import evaluate

metric = evaluate.load("squad")


We need to put the theoretical answers in the format the evaluator expects: a list of dictionaries, each containing the ID and the answers.

In [None]:
theoretical_answers = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in small_eval_set
]

In [None]:
print(predicted_answers[0])
print(theoretical_answers[0])

{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}
{'id': '56be4db0acb8001400a502ec', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}}


We can then compute the metrics for this eval set.

In [None]:
metric.compute(predictions=predicted_answers, references=theoretical_answers)

{'exact_match': 83.0, 'f1': 88.25000000000004}
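
For intuition about these two numbers, the sketch below scores a single prediction against several reference answers and keeps the best match. It is a simplification, not the `squad` metric itself: the real metric also normalizes the text (for example, punctuation and articles) before comparing, and the helpers `token_f1` and `score` are hypothetical names.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction, references):
    """Best exact-match and F1 over all reference answers."""
    exact = max(float(prediction.lower() == ref.lower()) for ref in references)
    f1 = max(token_f1(prediction, ref) for ref in references)
    return exact, f1


toy_references = ["Santa Clara, California", "Levi's Stadium"]
print(score("Levi's Stadium", toy_references))  # (1.0, 1.0)
print(score("Stadium", toy_references))         # exact match 0.0, F1 about 0.67
```

A partially correct prediction gets no exact-match credit but still earns partial F1, which is why the F1 score is the higher of the two.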

We'll now combine our post-processing logic within a `compute_metrics()` function, designed for the Hugging Face Trainer.

In a standard setup, `compute_metrics()` receives predictions and labels, facilitating direct performance evaluation. However, our specific use case necessitates access to:

* The **feature dataset** to retrieve offset mappings.
* The **example dataset** to access original contexts.

This dependency on external data sources prevents us from using `compute_metrics()` for continuous evaluation during the training process. Consequently, we'll reserve its usage for the **final evaluation** stage, once training is complete, to assess the overall performance of our model.

The `compute_metrics()` function below groups the steps we did before, with one added check: if no valid answer was predicted, it returns an empty string.

In [None]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

Let's use it to see if it works.

In [None]:
compute_metrics(start_logits, end_logits, eval_set, small_eval_set)


{'exact_match': 83.0, 'f1': 88.25000000000004}

## **Fine-tuning the Model**

We are now ready to train our model! First, let's create it like we did before using the `AutoModelForQuestionAnswering` class.


In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We then set our training arguments. Unfortunately, we can't pass our `compute_metrics()` function here since its signature differs from what the `Trainer` expects, so we will evaluate the model separately after training. You may also notice that `push_to_hub` is set to `True`. Combined with `save_strategy="epoch"`, this uploads the model to the Hub every epoch, giving us checkpoints.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True,
)



We can now put in all of the required parameters inside the Trainer and start the training! You'll have to use your wandb.ai API key for this.

**WARNING**: This is a long process, roughly 30 minutes on an A100 GPU. You can adjust the training arguments if you want to reduce the time it takes.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,  # recent transformers releases prefer `processing_class=tokenizer`
)
trainer.train()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: skyblaze24 (skyblaze24-predictive-systems-inc-) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


| Step | Training Loss |
|------|---------------|
| 500  | 2.5889 |
| 1000 | 1.699  |
| 1500 | 1.5047 |
| 2000 | 1.3897 |
| 2500 | 1.3461 |
| 3000 | 1.304  |
| 3500 | 1.2362 |
| 4000 | 1.2083 |
| 4500 | 1.1385 |
| 5000 | 1.1678 |


TrainOutput(global_step=33276, training_loss=0.8376497414725691, metrics={'train_runtime': 1882.8125, 'train_samples_per_second': 141.377, 'train_steps_per_second': 17.674, 'total_flos': 5.216534983896422e+16, 'train_loss': 0.8376497414725691, 'epoch': 3.0})
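
As a sanity check on the training log, the figures in `TrainOutput` can be related to each other with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation, and relies on the fact that the `TrainingArguments` above leave `per_device_train_batch_size` at its default of 8. The variable names are only for illustration.

```python
import math

# Values copied from the TrainOutput printed above.
global_step = 33276
num_train_epochs = 3
train_runtime_s = 1882.8125
train_samples_per_second = 141.377

# Assumption: one GPU, default per-device batch size of 8, no gradient accumulation.
batch_size = 8

# Optimizer steps taken in each epoch.
steps_per_epoch = global_step // num_train_epochs
print(steps_per_epoch)  # 11092

# Throughput times runtime gives the total number of samples processed;
# dividing by the number of epochs estimates the size of the training set.
approx_features_per_epoch = train_samples_per_second * train_runtime_s / num_train_epochs
print(round(approx_features_per_epoch))  # 88729

# The estimated training-set size should reproduce the step count.
print(math.ceil(approx_features_per_epoch / batch_size) == steps_per_epoch)  # True

# Wall-clock training time in minutes.
print(round(train_runtime_s / 60, 1))  # 31.4
```

The runtime of roughly 31 minutes matches the warning above, and the step count scales linearly with `num_train_epochs`, so training for one epoch would take about a third of the time under the same assumptions.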

We can finally evaluate our model! To do this, we'll use the `predict` method of the trainer. It returns a tuple whose first element holds the model's predictions, which here are the `start_logits` and the `end_logits`. We can then pass these to the `compute_metrics()` function to measure the performance of our model.

In [None]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets["validation"])

  0%|          | 0/10570 [00:00<?, ?it/s]

{'exact_match': 81.13528855250709, 'f1': 88.56974068156724}

We can see that the model performs relatively well, with an `exact_match` score of about 81.1 and an F1 score of about 88.6! We can now push the final version of the model, just in case it wasn't pushed yet.

In [None]:
trainer.push_to_hub(commit_message="Training complete")

events.out.tfevents.1738954811.31bd2e0b3a00.2554.0:   0%|          | 0.00/19.5k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/CyberE/bert-finetuned-squad/commit/891bddfdcc774c37e8899a4178748d4510ef4b2d', commit_message='Training complete', commit_description='', oid='891bddfdcc774c37e8899a4178748d4510ef4b2d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/CyberE/bert-finetuned-squad', endpoint='https://huggingface.co', repo_type='model', repo_id='CyberE/bert-finetuned-squad'), pr_revision=None, pr_num=None)

## **Using the Fine-Tuned Model**

We can now use the fine-tuned model! To do this, we just provide some context and a question of our own. Don't forget to change the model checkpoint to your own, i.e. replace `{username}` in `{username}/bert-finetuned-squad` with your HF username.

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "CyberE/bert-finetuned-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

Device set to use cuda:0


{'score': 0.993828535079956,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

Here we see that the model extracts the answer from the context, so the fine-tuning was a success! However, this model's exact-match score is only about 81%. Try experimenting with the fine-tuning setup, and it may work even better.
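
The `start` and `end` values in the pipeline output are character offsets into `context`, so slicing the context string recovers the answer. A minimal check, reusing the context and the offsets printed above rather than calling the model again:

```python
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

# Values copied from the pipeline output above.
result = {"start": 78, "end": 105, "answer": "Jax, PyTorch and TensorFlow"}

# `end` is exclusive, so a plain Python slice gives the answer span.
span = context[result["start"]:result["end"]]
print(span)                      # Jax, PyTorch and TensorFlow
print(span == result["answer"])  # True
```

These offsets are handy when you want to highlight the answer inside the original document instead of only displaying the extracted text.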