# Fine-tune DistilBERT on the SQuAD dataset for extractive question answering

## Setup

In [None]:
!pip install datasets transformers evaluate gradio

In [2]:
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()


## Preparing the data

### Load the SQuAD Dataset
SQuAD (the Stanford Question Answering Dataset) is one of the most widely used academic benchmarks for extractive question answering.

In [3]:
from datasets import load_dataset

squad = load_dataset("squad")
squad

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

There are several important fields here:

- **answers:** the answer text and the character offset in the context at which that answer starts (`answer_start`).
- **context:** background information from which the model needs to extract the answer.
- **question:** the question a model should answer.

In [4]:
# let’s print the first element of our training set
print("Context:", squad["train"][0]["context"])
print("Question:", squad["train"][0]["question"])
print("Answer:", squad["train"][0]["answers"])

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer: {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


In [5]:
# During training, there is only one possible answer
squad["train"].filter(lambda x: len(x["answers"]["text"]) !=1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [6]:
# For evaluation, there are several possible answers
print(squad["validation"][0]["answers"])
print(squad["validation"][2]["answers"])

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
{'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}


In [7]:
# Some of the questions have several possible answers
print(squad["validation"][2]["question"])
print(squad["validation"][2]["context"])

Where did Super Bowl 50 take place?
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
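
Because a validation question can carry several reference answers, SQuAD-style scoring usually compares a prediction against every reference and keeps the best score. The block below is a simplified sketch modelled on the usual SQuAD v1.1 scoring (lowercasing, stripping punctuation and articles, then exact match and token-level F1). It is not the code of the `evaluate` metric, and the function names are hypothetical.

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-level F1 between one prediction and one reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction, references):
    """Best exact match and best F1 over all reference answers."""
    em = max(float(normalize(prediction) == normalize(r)) for r in references)
    f1 = max(token_f1(prediction, r) for r in references)
    return em, f1


references = [
    "Santa Clara, California",
    "Levi's Stadium",
    "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
]
em_full, f1_full = score("Levi's Stadium", references)
em_partial, f1_partial = score("Santa Clara", references)
print(em_full, f1_full, em_partial, round(f1_partial, 3))
```

A prediction that matches any one reference gets full credit, while a partial answer such as "Santa Clara" gets no exact-match credit but a non-zero F1 from its best-matching reference.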


### Processing the training data

In [8]:
# load a DistilBERT tokenizer to process the question and context fields
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [9]:
# Define the maximum length for the input sequences
max_length = 384

# Number of tokens that consecutive chunks overlap by when a long context is split
stride = 128

def preprocess_training_examples(examples):
    # Strip any leading/trailing whitespace from the questions
    questions = [q.strip() for q in examples["question"]]

    # Tokenize the inputs with the provided tokenizer
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,                # Set maximum length for the input sequence
        truncation="only_second",             # Truncate only the context part of the sequence if needed
        stride=stride,                        # Number of tokens shared by consecutive chunks (overlap)
        return_overflowing_tokens=True,       # Return overflow tokens for contexts that are too long
        return_offsets_mapping=True,          # Return offsets for each token, useful for mapping tokens to characters
        padding="max_length",                 # Pad sequences to the maximum length
    )

    # Pop the offset mapping and sample map to process them separately
    offset_mapping = inputs.pop("offset_mapping")        # Offset mappings for tokens
    sample_map = inputs.pop("overflow_to_sample_mapping") # Mapping to keep track of sample indices
    answers = examples["answers"]                        # List of answers

    # Initialize lists to store start and end positions of the answers
    start_positions = []
    end_positions = []

    # Iterate over each feature (a long context can yield several features)
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]            # Get the corresponding sample index
        answer = answers[sample_idx]          # Get the answer for this sample
        start_char = answer["answer_start"][0] # Start character index of the answer
        end_char = start_char + len(answer["text"][0]) # End character index of the answer

        # Get sequence IDs to differentiate between question (0) and context (1) tokens
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context in the tokenized input
        idx = 0
        while sequence_ids[idx] != 1:  # Move to the start of the context (skip question tokens)
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:  # Find the end of the context
            idx += 1
        context_end = idx - 1

        # Check if the answer is fully inside the context
        # If not, set start and end positions to (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, find the token indices for start and end positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # Add the start and end positions to the input dictionary
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs
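
To make the labelling logic above easier to follow, here is a minimal sketch of the same character-to-token mapping on hand-written offsets. The helper name `char_span_to_token_span` and the toy offsets are assumptions for illustration; no tokenizer is involved.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully in this chunk."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy feature: [CLS] who [SEP] the cat sat [SEP], with context "the cat sat".
offsets = [(0, 0), (0, 3), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]
context_start, context_end = 3, 5

inside = char_span_to_token_span(offsets, context_start, context_end, 4, 7)     # answer "cat"
outside = char_span_to_token_span(offsets, context_start, context_end, 20, 25)  # not in this chunk
print(inside, outside)
```

The `(0, 0)` label points at the first token of the sequence, so chunks that do not contain the answer teach the model to predict that position instead of a span.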


In [10]:
# Map the preprocessing function over the entire dataset
train_dataset = squad["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=squad["train"].column_names,
)

len(squad["train"]), len(train_dataset)

(87599, 88524)
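
There are more features than examples because a context that does not fit in `max_length` is split into several overlapping chunks. The sketch below imitates that windowing on a plain list of token ids. It is a simplified model under stated assumptions (a fixed `window` of context tokens and an `overlap` that plays the role of `stride`), not the tokenizer's actual implementation, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(tokens, window, overlap):
    """Split tokens into chunks of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks


short_context = list(range(3))   # fits in one window, so one feature
long_context = list(range(10))   # needs several overlapping windows

short_chunks = sliding_windows(short_context, window=4, overlap=2)
long_chunks = sliding_windows(long_context, window=4, overlap=2)
print(len(short_chunks), long_chunks)
```

With the overlap, an answer that straddles the boundary of one chunk is likely to appear whole in a neighbouring chunk.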

### Processing the validation data

In [11]:
def preprocess_validation_examples(examples):
    # Strip any leading/trailing whitespace from the questions
    questions = [q.strip() for q in examples["question"]]

    # Tokenize the questions and contexts, with specific settings:
    # - max_length: Maximum sequence length for the tokenizer output
    # - truncation="only_second": Truncate the context (second part of the input pair) if it exceeds max_length
    # - stride: Number of tokens that consecutive chunks overlap by when a long context is split
    # - return_overflowing_tokens: Return tokens that spill over max_length, creating multiple input sequences
    # - return_offsets_mapping: Return offset mappings for each token to track where tokens map in the original text
    # - padding="max_length": Pad sequences to the maximum length specified
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Extract the mapping of new tokens to their original example indices
    sample_map = inputs.pop("overflow_to_sample_mapping")

    # Initialize a list to store example IDs for each input sequence
    example_ids = []

    # Iterate over each input sequence to adjust the offset mappings
    for i in range(len(inputs["input_ids"])):
        # Find the index of the original example that corresponds to the current input sequence
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        # Get the sequence IDs and offsets for the current sequence
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]

        # Update the offset mapping to keep offsets only for tokens related to the context (second part of the input)
        # Set offsets to None for tokens corresponding to the question (first part) or special tokens
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    # Add the example IDs to the inputs dictionary
    inputs["example_id"] = example_ids

    # Return the modified inputs, ready for use in validation
    return inputs
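
The offset masking step can be seen in isolation on toy data. This sketch assumes the convention used above, where `sequence_ids` is `None` for special tokens, `0` for question tokens and `1` for context tokens. The helper name `mask_non_context_offsets` is hypothetical.

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep offsets for context tokens only; everything else becomes None."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


# Toy feature: [CLS] who [SEP] the cat sat [SEP]
sequence_ids = [None, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]

masked = mask_non_context_offsets(offsets, sequence_ids)
print(masked)
```

During post-processing, a `None` offset is then enough to tell that a predicted position falls outside the context.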


In [12]:
# Map the preprocessing function over the entire dataset
validation_dataset = squad["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=squad["validation"].column_names,
)

len(squad["validation"]), len(validation_dataset)

(10570, 10784)

## Fine-tuning the model

### Post-processing

In [13]:
import numpy as np
import collections

import evaluate

metric = evaluate.load("squad")

In [14]:
# Define post-processing function
from tqdm.auto import tqdm

n_best = 20             # Number of top start/end candidates to consider per feature
max_answer_length = 30  # Maximum answer length, in tokens

def compute_metrics(start_logits, end_logits, features, examples):
    # Create a mapping from example IDs to feature indices
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    # Loop over each example in the dataset
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Get all features related to the current example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            # Get the indices of the n_best highest start and end logits, best first
            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()

            # Check all combinations of start and end indexes
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip if the start or end is outside the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip if the answer length is invalid
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    # Extract the answer text from the context
                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Choose the best answer based on logit score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    # Prepare the true answers for comparison
    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]

    # Compute and return evaluation metrics
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
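
The span search inside `compute_metrics` can be tried on a single toy feature. The sketch below is a pure-Python approximation: `top_k_indices` plays the role of the `np.argsort(...)[-1 : -n_best - 1 : -1]` slice, and the logits, offsets and helper names are invented for illustration.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def best_span(start_logits, end_logits, offsets, context, n_best=20, max_answer_length=30):
    """Return the highest-scoring valid answer span for one feature."""
    best = {"text": "", "logit_score": float("-inf")}
    for s in top_k_indices(start_logits, n_best):
        for e in top_k_indices(end_logits, n_best):
            # Skip positions outside the context (their offsets were set to None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best["logit_score"]:
                best = {"text": context[offsets[s][0]:offsets[e][1]], "logit_score": score}
    return best


context = "the cat sat"
offsets = [None, None, None, (0, 3), (4, 7), (8, 11), None]
start_logits = [5.0, 0.1, 0.0, 0.2, 3.0, 1.0, 0.0]
end_logits = [4.0, 0.0, 0.0, 0.1, 2.5, 2.0, 0.0]

answer = best_span(start_logits, end_logits, offsets, context, n_best=3)
print(answer)
```

In this toy case the first position has the highest logits, but it is skipped because its offset is `None`, so the best remaining span is selected.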


### Fine-tuning the model

In [15]:
import torch
from transformers import AutoModelForQuestionAnswering

# Load the pre-trained DistilBERT model for question answering
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
# Define our hyperparameters in TrainingArguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "distilbert-finetuned-squad",
    evaluation_strategy = "no",
    save_strategy = "epoch",
    learning_rate = 2e-5,
    num_train_epochs = 1,
    weight_decay = 0.01,
    fp16 = True,
    push_to_hub = True,
)



In [17]:
# Define the Trainer
from transformers import Trainer

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = validation_dataset,
    tokenizer = tokenizer,
)

In [18]:
# Train the model
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 3.2451 |
| 1000 | 2.0268 |
| 1500 | 1.7458 |
| 2000 | 1.61   |
| 2500 | 1.5566 |
| 3000 | 1.5033 |
| 3500 | 1.4905 |
| 4000 | 1.4161 |
| 4500 | 1.4309 |
| 5000 | 1.3773 |


TrainOutput(global_step=11066, training_loss=1.4694233888660226, metrics={'train_runtime': 1403.0938, 'train_samples_per_second': 63.092, 'train_steps_per_second': 7.887, 'total_flos': 8674451270424576.0, 'train_loss': 1.4694233888660226, 'epoch': 1.0})
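
The reported `global_step=11066` can be sanity-checked with a little arithmetic. The sketch assumes values the notebook does not set explicitly: a per-device batch size of 8 (the `TrainingArguments` default), a single device, and no gradient accumulation.

```python
import math

num_features = 88524          # len(train_dataset) reported earlier in the notebook
per_device_batch_size = 8     # assumed: TrainingArguments default, not set in the notebook
num_devices = 1               # assumed: a single GPU
gradient_accumulation = 1     # assumed: default value
num_epochs = 1                # num_train_epochs from the notebook

effective_batch = per_device_batch_size * num_devices * gradient_accumulation
steps_per_epoch = math.ceil(num_features / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(total_steps)
```

Under these assumptions the result matches the step count in the training output, which suggests the run used the default batch size.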

### Evaluate the model

In [19]:
# Get predictions for the validation dataset
predictions, _, _ = trainer.predict(validation_dataset)

# Separate the start and end logits from the predictions
start_logits, end_logits = predictions

# Calculate and return the evaluation metrics
compute_metrics(start_logits, end_logits, validation_dataset, squad["validation"])


{'exact_match': 74.720908230842, 'f1': 83.58559900485677}

In [20]:
# Upload the latest version of the model to the Hugging Face Hub
trainer.push_to_hub(commit_message="Training completed !!!")

CommitInfo(commit_url='https://huggingface.co/Ashaduzzaman/distilbert-finetuned-squad/commit/e605e550fdaa034e7c06928e079da19291c967cf', commit_message='Training completed !!!', commit_description='', oid='e605e550fdaa034e7c06928e079da19291c967cf', pr_url=None, pr_revision=None, pr_num=None)

## Inference



### Inference with pipeline()

In [25]:
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"

In [26]:
from transformers import pipeline

# Load the fine-tuned model for question answering
model_checkpoint = "Ashaduzzaman/distilbert-finetuned-squad"

question_answerer = pipeline(
    "question-answering",
    model=model_checkpoint,
)

# Perform question answering on the provided question and context
question_answerer(question=question, context=context)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.8826050162315369,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}
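
The `start` and `end` values in the pipeline output are character positions in `context`, so slicing the context with them should give back the answer. The check below copies the first line of the context and the output shown above; no model is loaded.

```python
# First line of the notebook's context string. The emoji and the dashes are written as
# escapes so that each one is clearly a single character.
context = (
    "\n\U0001F917 Transformers is backed by the three most popular deep learning "
    "libraries \u2014 Jax, PyTorch and TensorFlow \u2014 with a seamless integration\n"
)

# Values copied from the pipeline output above.
result = {"start": 78, "end": 105, "answer": "Jax, PyTorch and TensorFlow"}

span = context[result["start"]:result["end"]]
print(span == result["answer"])
```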

### Inference with Gradio

In [30]:
from transformers import pipeline
import gradio as gr

# Load a pre-trained question-answering model from Hugging Face
model_checkpoint = "distilbert-base-uncased-distilled-squad"  # You can replace this with your fine-tuned model
question_answerer = pipeline("question-answering", model=model_checkpoint)

# Define the function to handle question answering
def answer_question(question, context):
    result = question_answerer(question=question, context=context)
    return result['answer']

# Set up the Gradio interface
interface = gr.Interface(
    fn=answer_question,                # The function that handles inference
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter your question here..."),  # Input for the question
        gr.Textbox(lines=10, placeholder="Enter the context here...")    # Input for the context
    ],
    outputs="text",                    # The output will be text (the answer)
    title="Question Answering with DistilBERT",
    description="Enter a question and a context. The model will find the answer to the question within the context."
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
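
A user can submit the form with an empty question or context, so it can help to validate the inputs before calling the model. The sketch below wraps any question-answering callable in such a check. `make_safe_handler` and `fake_question_answerer` are hypothetical names, and the stub stands in for the real pipeline so the example runs without downloading a model.

```python
def make_safe_handler(qa_callable):
    """Build a Gradio-style handler that validates inputs before calling the model."""
    def handler(question, context):
        question = (question or "").strip()
        context = (context or "").strip()
        if not question or not context:
            return "Please provide both a question and a context."
        return qa_callable(question=question, context=context)["answer"]
    return handler


def fake_question_answerer(question, context):
    """Stand-in for the pipeline: returns the first word of the context."""
    return {"answer": context.split()[0], "score": 1.0}


safe_answer = make_safe_handler(fake_question_answerer)
print(safe_answer("", "Alice went home."))
print(safe_answer("Who went home?", "Alice went home."))
```

In the notebook, the same wrapper could be applied to `question_answerer` and the resulting function passed as `fn` to `gr.Interface`.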


In [31]:
# Launch the Gradio interface
interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://e987c839863738f8f2.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


