# Fine-Tuning BERT for Question Answering with Hugging Face 🤗

This notebook provides a complete walkthrough for fine-tuning a BERT-style model (DistilBERT by default) on a question-answering task using the Hugging Face `transformers`, `datasets`, and `evaluate` libraries. We'll use the popular SQuAD (Stanford Question Answering Dataset) for this task.

The process involves these key steps:
1.  **Setup**: Install and import necessary libraries.
2.  **Load Data**: Load the SQuAD dataset.
3.  **Preprocessing**: Tokenize and prepare the text data for the model.
4.  **Fine-Tuning**: Train the model using the `Trainer` API.
5.  **Evaluation**: Evaluate the model's performance on the validation set.
6.  **Inference**: Use the fine-tuned model for prediction.

## 1. Setup and Installations

First, let's install the required libraries from Hugging Face. We need `datasets` to load our data, `transformers` for the model and tokenizer, and `evaluate` for the metrics.

In [None]:
!pip install transformers[torch] datasets evaluate -q

## 2. Load the Dataset

We'll use the SQuAD dataset, which is a standard benchmark for question answering. It consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.
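
To make the span idea concrete, here is a hand-written record in the same shape as a SQuAD entry (`id`, `title`, `context`, `question`, `answers`). The values are invented for illustration and are not taken from the dataset, and `answer_span` is a hypothetical helper. The sketch only shows that `answer_start` is a character offset into `context`.

```python
# A made-up record that mirrors the field layout of a SQuAD example.
context = "The Eiffel Tower is named after the engineer Gustave Eiffel."
answer_text = "Gustave Eiffel"

example = {
    "id": "toy-0001",  # hypothetical ID, not a real SQuAD ID
    "title": "Eiffel_Tower",
    "context": context,
    "question": "Who is the Eiffel Tower named after?",
    # Both fields are lists because an example may carry several reference answers.
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}


def answer_span(record):
    """Return (start_char, end_char) of the first answer inside the context."""
    start = record["answers"]["answer_start"][0]
    end = start + len(record["answers"]["text"][0])
    return start, end


start_char, end_char = answer_span(example)
extracted = example["context"][start_char:end_char]
print(start_char, end_char, repr(extracted))
```

Slicing the context with these two offsets recovers the answer text, which is the property the preprocessing step relies on later.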

In [None]:
from datasets import load_dataset

# Load the SQuAD dataset
squad = load_dataset("squad")

print(squad)

Let's look at a single example from the training set to understand its structure.

In [None]:
print(squad["train"][0])

## 3. Preprocessing the Data

Preprocessing for question answering is more involved than for simple text classification. We need to:
1.  Tokenize the **context** and the **question** together.
2.  Handle long contexts that exceed the model's maximum sequence length (512 tokens for BERT).
3.  Map the start and end positions of the answer in the original text to the tokenized input.
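
Step 3 can be sketched without any model. The example below uses a plain whitespace tokenizer as a stand-in for a real subword tokenizer (an assumption made to keep the sketch self-contained), and both function names are hypothetical. It shows how character offsets recorded for each token let us convert an answer's character span into token indices.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, (start_char, end_char)) pairs."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Return the indices of the first and last tokens overlapping [start_char, end_char)."""
    token_start = next(i for i, (_, end) in enumerate(offsets) if end > start_char)
    token_end = max(i for i, (start, _) in enumerate(offsets) if start < end_char)
    return token_start, token_end


context = "The tower was built by Gustave Eiffel in 1889"
answer = "Gustave Eiffel"
start_char = context.index(answer)
end_char = start_char + len(answer)

tokens = tokenize_with_offsets(context)
offsets = [span for _, span in tokens]
token_start, token_end = char_span_to_token_span(offsets, start_char, end_char)
answer_tokens = [tok for tok, _ in tokens[token_start:token_end + 1]]
print(token_start, token_end, answer_tokens)
```

A real tokenizer returns the same kind of offsets when called with `return_offsets_mapping=True`, which is what the preprocessing function in this section uses.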

In [None]:
from transformers import AutoTokenizer

# DistilBERT, a distilled version of BERT, trains faster; switch to "bert-base-uncased" for higher accuracy.
model_checkpoint = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

### Handling Long Contexts

When a context is too long to fit, the tokenizer splits it into several overlapping chunks, and each chunk becomes a separate feature for the model. The `stride` parameter sets how many tokens consecutive chunks share, so an answer that straddles a chunk boundary is still likely to appear whole in at least one chunk.
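
The chunking can be illustrated with a small sliding-window sketch. `split_with_stride` is a hypothetical helper, not the tokenizer's implementation; it assumes `stride` means the number of tokens shared by consecutive chunks and ignores the question and special tokens that also take up room in a real feature.

```python
def split_with_stride(tokens, window, stride):
    """Split tokens into chunks of at most `window` items; consecutive chunks share `stride` items."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")
    step = window - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current chunk already reaches the end of the sequence
        start += step
    return chunks


context_tokens = list(range(10))  # stand-in for ten context tokens
chunks = split_with_stride(context_tokens, window=6, stride=2)
for chunk in chunks:
    print(chunk)
```

With a window of 6 and a stride of 2, the ten tokens yield two chunks that share tokens 4 and 5, so a two-token answer sitting on the boundary is fully contained in at least one of them.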

In [None]:
max_length = 384  # Maximum number of tokens in a feature (question + context chunk)
doc_stride = 128  # Number of tokens shared by two consecutive chunks of a long context

def preprocess_function(examples):
    # Tokenize the questions and contexts, truncating only the context if the total length is too long.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

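    # `overflow_to_sample_mapping` maps each feature back to the example it came from.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Post-processing reads `example_id` and `offset_mapping` from each feature, so store both.
    # Offsets of non-context tokens are set to None so they can never be picked as an answer.
    tokenized_examples["example_id"] = [examples["id"][s] for s in sample_mapping]
    tokenized_examples["offset_mapping"] = [
        [o if seq_id == 1 else None for o, seq_id in zip(offsets, tokenized_examples.sequence_ids(i))]
        for i, offsets in enumerate(offset_mapping)
    ]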

    # Now we need to label our data with the start and end token positions.
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the original example corresponding to this feature.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If no answers are given, set the cls_index as the answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start and end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find the first and last tokens that belong to the context (sequence id 1).
            sequence_ids = tokenized_examples.sequence_ids(i)
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # If the answer is not fully inside the current span, label the feature with the CLS index.
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise, find the exact start and end token positions.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)

                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Now, we apply this function to our entire dataset using `map`. This might take a few minutes.

In [None]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
print("Preprocessing complete!")

## 4. Fine-Tuning the Model

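We're now ready to train our model. The `AutoModelForQuestionAnswering` class loads the pretrained checkpoint (DistilBERT here) and adds a question-answering head on top. The head's weights are newly initialized, so a warning about uninitialized weights is expected at this point.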

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Next, we define the `TrainingArguments`. This class holds all the hyperparameters for training, such as the learning rate, number of epochs, batch size, and where to save the model.
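
To get a feel for how these hyperparameters interact, the sketch below estimates the number of optimizer steps and the learning rate at a given step. It assumes a single device, no gradient accumulation, and a linear decay to zero without warmup (an assumption; check which scheduler your `transformers` version uses). Both helpers and the feature count are hypothetical.

```python
import math


def total_training_steps(num_features, batch_size, num_epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_features / batch_size) * num_epochs


def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate after `step` updates when decaying linearly from base_lr to 0."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


num_features = 1000  # hypothetical number of training features, for illustration only
steps = total_training_steps(num_features, batch_size=16, num_epochs=3)
print(steps)
print(linear_decay_lr(0, steps, 2e-5))
print(linear_decay_lr(steps, steps, 2e-5))
```

Doubling the batch size roughly halves the number of steps, which is worth keeping in mind when comparing runs with different settings.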

In [None]:
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    eval_strategy="epoch",  # older transformers releases call this argument `evaluation_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False, # Set to True if you want to upload the model to the Hub
)

Finally, we instantiate the `Trainer` and start the fine-tuning process. This will take a while, especially if you are not using a GPU.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,  # newer transformers releases name this argument `processing_class`
)

print("Starting training...")
trainer.train()
print("Training finished!")

## 5. Evaluation

After training, we need to evaluate our model's performance. The standard metrics for SQuAD are **Exact Match (EM)** and **F1-score**. This requires some post-processing to map the model's output (logits for start and end tokens) back to text spans in the original context.
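
Both metrics are simple to state in code. The sketch below is a simplified, single-reference version written for illustration; the function names are hypothetical, and the official metric additionally takes the best score over all reference answers and averages over the dataset. Answers are normalized (lowercased, punctuation and English articles removed) before comparison.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower.")
f1 = f1_score("Gustave Eiffel and company", "Gustave Eiffel")
print(em, round(f1, 4))
```

Exact Match is all-or-nothing, while F1 gives partial credit when the prediction overlaps the reference, as in the second call above.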

In [None]:
import collections

import evaluate
import numpy as np
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions
    
    # Map features to their corresponding examples
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionary to store our predictions
    predictions = collections.OrderedDict()

    # Loop over all the examples
    for example_index, example in enumerate(tqdm(examples)):
        # These are the indices of the features associated with the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None 
        valid_answers = []
        
        context = example["context"]
        # Loop through all the features associated with the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map back to the original context.
            offset_mapping = features[feature_index]["offset_mapping"]

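            # Track the lowest null (no-answer) score across features. It is not used for SQuAD v1.1,
            # where every question has an answer, but matters for datasets with unanswerable questions.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score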

            # Go through all possibilities for start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if (start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char:end_char],
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            best_answer = {"text": "", "score": 0.0}
        
        predictions[example["id"]] = best_answer["text"]

    return predictions
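
The core of the search above is picking the best valid (start, end) pair from two logit vectors. Here is a stripped-down sketch of that step using plain Python lists. `best_span` is a hypothetical helper, not part of any library, and the logits are made up.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (score, start_index, end_index) of the highest-scoring valid span, or None."""

    def top_indices(logits):
        # Indices of the n_best_size largest logits, best first.
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best_size]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


start_logits = [0.1, 0.2, 5.0, 0.3, 4.0]
end_logits = [0.0, 6.0, 0.5, 4.5, 1.0]
result = best_span(start_logits, end_logits, n_best_size=3, max_answer_length=2)
print(result)
```

Note that the single largest end logit (index 1) is rejected because it comes before the best start, which is why the start and end positions have to be scored jointly instead of taking each argmax independently.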

In [None]:
# First, get the raw predictions from the model on the validation set.
raw_predictions = trainer.predict(tokenized_squad["validation"])

# Rebuild the validation features so the columns needed for post-processing (`example_id`, `offset_mapping`) are available.
validation_features = squad["validation"].map(
    preprocess_function,
    batched=True,
    remove_columns=squad["validation"].column_names
)

# Now, clean up the predictions.
final_predictions = postprocess_qa_predictions(squad["validation"], validation_features, raw_predictions.predictions)

# Load the SQuAD metric from the `evaluate` library.
metric = evaluate.load("squad")

# Format the predictions and labels for the metric.
formatted_predictions = [{
    "id": k,
    "prediction_text": v
} for k, v in final_predictions.items()]

references = [{
    "id": ex["id"],
    "answers": ex["answers"]
} for ex in squad["validation"]]

# Compute the metrics.
results = metric.compute(predictions=formatted_predictions, references=references)
print(results)

## 6. Inference

Now that we have a fine-tuned model, let's see how to use it to answer a new question. The easiest way is to use a `pipeline`.

In [None]:
from transformers import pipeline

# To reload the model from disk later, pass a saved checkpoint directory to `pipeline` as `model=model_path`.
# Checkpoints are written inside the output directory given to TrainingArguments.
model_path = f"{model_name}-finetuned-squad/checkpoint-XXXX" # <-- IMPORTANT: Replace XXXX with the last checkpoint number

# For this example, we use the model object that is already in memory.
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

context = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. 
It is named after the engineer Gustave Eiffel, whose company designed and built the tower. 
Constructed from 1887 to 1889 as the entrance to the 1889 World's Fair, it was initially criticized by some of France's leading artists and intellectuals for its design, 
but it has become a global cultural icon of France and one of the most recognizable structures in the world.
"""

question = "Who designed the Eiffel Tower?"

result = qa_pipeline(question=question, context=context)

print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")