# Exploring Extractive Question Answering in Web Science by Fine-tuning DistilBERT

**Pablo Apausa, Anxo Muñiz, Carlos Honrado**  
Web Science

Universidad Politécnica de Madrid  
December, 2025

## 1. Introduction

### 1.1. Extractive QA

This notebook demonstrates the fine-tuning of DistilBERT for extractive question answering using the Stanford Question Answering Dataset (SQuAD), with sliding-window preprocessing. Unlike generative approaches, which produce new text, extractive QA selects the answer as a span of the given context.

Extractive QA is impacting the field of Web Science by changing how users interact with web content. Browser engines are starting to incorporate QA capabilities that extract relevant answers directly from web pages, eliminating the need to read long documents manually and reducing cognitive load while searching for information.

### 1.2. DistilBERT

DistilBERT was selected over larger models such as BERT for computational efficiency: when developing from scratch, the ability to iterate quickly and run many tests matters. It is a distilled version of BERT that retains 97% of BERT's performance while being 40% smaller and 60% faster **(Sanh et al., 2019)**. This allows the model to be trained more quickly with only a small loss in prediction quality.

Encoder-only architectures like BERT excel at answering direct questions but are limited with open-ended ones. For questions that require elaboration, encoder-decoder architectures are more appropriate, since they can generate longer responses.

### 1.3. Fine-tuning

Fine-tuning consists of adapting a pretrained model to a specific task to improve its performance. In this case, while pretraining provides the model with a general understanding of language, fine-tuning specializes it to identify the start and end positions of answers within a given context, **achieving an exact match (EM) score of 77.22% and an F1 score of 85.34% on the SQuAD validation set**.

QA systems enable users to ask specific questions about any webpage and receive precise answers extracted from its content. This is especially useful for documents where information must be found quickly. The task also benefits web accessibility by helping users with disabilities navigate complex content.

### 1.4. Index

The notebook is structured in the following phases:
- **3. Data Preparation**: SQuAD, preprocessing with sliding windows, maintaining character-to-token alignment and generating token-level labels.
- **4. Implementation**: Fine-tuning DistilBERT to predict answer spans on 90,968 features in around 80 minutes.
- **5. Results and Discussion**: Post-processing predictions, feature aggregation, calculating SQuAD metrics, and a comparison with the base model on specific questions about class slides.
- **6. Conclusion and future steps**: Summary of the results and possible directions for future work.

----

## 2. Initial Setup

### 2.1 GPU Check

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### 2.2 Library Installation

- transformers: provides pretrained transformer models like DistilBERT and tokenizers.
- datasets: enables efficient loading and processing of SQuAD.
- evaluate: contains evaluation metrics that measure the quality of predictions.
- protobuf==3.20.3: ensures compatibility with other libraries.

In [None]:
pip install -q transformers datasets evaluate protobuf==3.20.3

## 3. Data Preparation

### 3.1 SQuAD Dataset

The Stanford Question Answering Dataset (SQuAD) is one of the most widely used benchmarks for evaluating extractive question-answering systems. SQuAD v1.1 consists of more than 100,000 questions posed by crowdworkers about a set of Wikipedia articles, where the answer to each question is a segment of the corresponding passage **(Rajpurkar et al., 2016)**.

Each example in the dataset consists of three fields: a `context` paragraph, a `question` about the context, and `answers`, a dictionary with `text` and `answer_start` subfields (the latter is the character index where the answer begins in the context). This character position is crucial for converting to token positions during training.

Training examples contain exactly one answer per question, while validation examples may have multiple acceptable answers from different annotators. During evaluation, predictions are compared against all valid answers and the best score is taken.

In [None]:
from datasets import load_dataset

# Load SQuAD dataset: contains context, question, and answer (with character position)
# Returns DatasetDict with 'train' and 'validation' splits
squad = load_dataset("squad")

### 3.2 Tokenizer Declaration

Transformers like DistilBERT require text to be converted into numerical tokens. The tokenizer handles this conversion while preserving semantic relationships between words and subwords.

For question answering, the tokenizer combines the question and context into a single sequence using special tokens: `[CLS]` at the beginning, `[SEP]` separating the question from the context, and another `[SEP]` at the end. This structure allows DistilBERT to distinguish the question from the context.

The `distilbert-base-cased` tokenizer is case-sensitive, which can be important for distinguishing proper nouns and acronyms. The tokenizer also handles padding so that all sequences have the same length for efficient batch processing, and it provides offset mappings that link each token to its character position in the original text. These mappings make it possible to convert answer positions from characters to the token indices the model learns from.

In [None]:
from transformers import AutoTokenizer

# load the DistilBERT tokenizer
model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

### 3.3 Preprocessing the Training Dataset

Preprocessing the training data is the most complex part of the process, as long contexts must be managed and labels generated for answer positions. The SQuAD dataset provides answer positions as characters, but the model needs token indices to learn where answers begin and end.

The challenge is that some contexts in SQuAD, together with the question, exceed the maximum input length of 320 tokens chosen here. Instead of truncating and losing information, a sliding window approach is employed: long contexts are divided into overlapping fragments, with consecutive fragments sharing 128 tokens (the `stride` argument). As long as an answer is shorter than this overlap, an answer that falls near a truncation point still appears completely within at least one fragment.
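
The windowing idea can be sketched in plain Python on a list of token IDs, without the real tokenizer. This is only an illustration: the function name `sliding_windows` and its parameters are hypothetical, and `overlap` stands in for the `stride` argument under the assumption that it counts the tokens shared by consecutive windows. The real tokenizer additionally reserves room in every window for the question and the special tokens, which this sketch ignores.

```python
def sliding_windows(tokens, window_size, overlap):
    """Split `tokens` into windows of at most `window_size` items,
    where consecutive windows share `overlap` items."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")

    step = window_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # stop once the current window reaches the end of the sequence
        if start + window_size >= len(tokens):
            break
        start += step
    return windows


# toy example: 10 tokens, windows of 4 tokens, 2 tokens of overlap
for window in sliding_windows(list(range(10)), window_size=4, overlap=2):
    print(window)
```

In this toy run every pair of neighbouring windows shares two tokens, so any span of at most two tokens is fully contained in at least one window.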

This preprocessing function must map answer positions to indices within each fragment. This is achieved through offset mappings: a list of tuples `(start_char, end_char)` that shows what character range in the original text each token represents.

The labeling process tokenizes question-context pairs, assigns the label `(0, 0)` (the position of the `[CLS]` token) when the answer is not fully inside a fragment, and otherwise finds the first and last tokens of the answer within the fragment to create the `start_positions` and `end_positions` labels. Additionally, all sequences are padded to the maximum length for efficient batch processing.
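
To make the character-to-token conversion concrete, the following is a minimal sketch that uses whitespace-separated words as a stand-in for real subword tokens. The helpers `whitespace_offsets` and `char_span_to_token_span` are hypothetical names introduced for illustration; they mimic the role of the tokenizer's offset mapping but are not part of any library.

```python
def whitespace_offsets(text):
    """Toy offset mapping: one (start_char, end_char) pair per word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token) indices.
    Returns (0, 0) when the span is not fully covered by `offsets`."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0

    # last token that starts at or before the answer start
    start_token = 0
    while start_token + 1 < len(offsets) and offsets[start_token + 1][0] <= start_char:
        start_token += 1

    # first token that ends at or after the answer end
    end_token = len(offsets) - 1
    while end_token - 1 >= 0 and offsets[end_token - 1][1] >= end_char:
        end_token -= 1

    return start_token, end_token


context = "The cat sat on the mat"
answer = "sat on"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
print(char_span_to_token_span(offsets, start_char, end_char))      # (2, 3)
print(char_span_to_token_span(offsets[:3], start_char, end_char))  # (0, 0)
```

In the second call the fragment ends before the answer does, so the null label is returned. In the notebook, index 0 is the `[CLS]` token, so `(0, 0)` is unambiguous there, unlike in this toy setting where token 0 is a regular word.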


In [None]:
max_length = 320 # maximum sequence length
stride = 128 # number of tokens shared by consecutive windows (overlap)

def preprocess_training(examples):
    questions = [q.strip() for q in examples["question"]] # strip extra whitespace from questions
    
    # tokenize question-context pairs
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second", # keep questions intact, split contexts
        stride=stride,
        return_overflowing_tokens=True, # create multiple features from long contexts
        return_offsets_mapping=True, # get character-to-token alignment
        padding="max_length",
    )

    # extract offset mappings and sample mappings for label generation
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # generate start and end position labels for each feature
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # if answer is not fully inside the context chunk, label as (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # find token indices for answer start and end using offset mappings
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

This preprocessing is applied using `Dataset.map()` with `batched=True` and with `remove_columns=squad["train"].column_names`, which replaces the original structure with tokenized inputs and position labels. The `batched=True` flag is essential here, not just efficient, because the dataset length changes: one original example can produce multiple training features due to the sliding window approach.

In [None]:
# Apply preprocessing to training set
train_dataset = squad["train"].map(
    preprocess_training,
    batched=True, # process multiple examples at once for efficiency
    remove_columns=squad["train"].column_names, # replace original structure with tokenized inputs and labels
)

len(squad["train"]), len(train_dataset)

Preprocessing increases the dataset from 87,599 examples to 90,968 features; that is, 3,369 additional samples are created.

### 3.4 Preprocessing the Validation Dataset

Preprocessing the validation set is simpler: no answer labels are generated, since the model's predictions are compared against the reference answers during evaluation. However, enough information must be preserved to map predictions back to the original examples and extract the answer text.

The main challenge during evaluation is that the model generates predictions at the token level, but the answer must be reported as a span of text from the original context. Additionally, when long contexts are split into multiple features, each feature might produce a different prediction, and these must be aggregated to determine the best answer for the original example.

To solve this, offset mappings are kept just as in training, but with one modification: the offsets of all tokens that do not belong to the context (question, special, and padding tokens) are set to `None`. This is crucial because the `sequence_ids()` method is not available during post-processing; with the `None` markers, only context tokens are considered when predicting answer spans.

Example IDs from the original dataset are preserved, because sliding windows create multiple features for each example. The `overflow_to_sample_mapping` tracks which original example each feature came from, allowing example IDs to group features together during evaluation. 

In [None]:
def preprocess_validation(examples):
    questions = [q.strip() for q in examples["question"]]  # strip extra whitespace from questions

    # same as training, but no labels needed
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        ids.append(examples["id"][sample_idx]) # store example ID to group features from same example during evaluation

        # set question token offsets to None, keeping only context token offsets
        # this helps identify valid answer positions during post-processing
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = ids
    return inputs

This preprocessing is applied to the validation set, replacing the original columns with processed features that contain tokenized inputs, offset mappings, and example IDs for tracking. 

In [None]:
# apply preprocessing to validation set
validation_dataset = squad["validation"].map(
    preprocess_validation,
    batched=True,
    remove_columns=squad["validation"].column_names,
)

len(squad["validation"]), len(validation_dataset)

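The validation preprocessing adds 547 features. Relative to the 10,570 validation examples this is about 5.2%, compared with about 3.8% for the training set, so validation contexts require splitting slightly more often than training contexts, even though the absolute number of extra features is smaller.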



## 4. Implementation

### 4.1 Model Initialization

The base DistilBERT model provides contextualized representations for each token in the input, while the question-answering head adds a linear layer on top that outputs two scores (logits) per token: one for the token being the start of the answer and one for it being the end.

The warning about newly initialized weights is expected: the pre-trained layers retain language understanding, while the QA layers are randomly initialized (as they're specific to this task). Fine-tuning trains both components for the task.
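
As a rough illustration of what the randomly initialized head is trained to do, the sketch below computes a span loss in plain Python. It assumes the common formulation in which the loss is the average of two cross-entropy terms, one over the start logits and one over the end logits; the function names `cross_entropy` and `qa_loss` are hypothetical and the numbers are toy values, not model outputs.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy terms."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# an uninformative head (all logits equal) versus a head that favours the right tokens
uniform = qa_loss([0.0] * 4, [0.0] * 4, start_position=1, end_position=2)
confident = qa_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], start_position=1, end_position=2)
print(uniform, confident)
```

The uninformative head pays a loss of `log(4)` on a four-token sequence, while the head that concentrates its scores on the labeled positions pays far less, which is the direction fine-tuning pushes the weights.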

In [None]:
from transformers import AutoModelForQuestionAnswering

# initialize distilBERT with qa head
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)

### 4.3 Hyperparameters Definition

`TrainingArguments` encapsulates all hyperparameters that control the training process. Each parameter has been chosen to balance training efficiency, model performance, and computational constraints for this specific task.

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad", # output directory for checkpoints
    eval_strategy="epoch", # evaluate at end of each epoch
    save_strategy="epoch", # save checkpoint at end of each epoch
    learning_rate=2e-5, # standard learning rate for BERT fine-tuning
    num_train_epochs=1, # one pass through the dataset
    weight_decay=0.01, # regularization to prevent overfitting
    fp16=torch.cuda.is_available(), # mixed precision training for speed (only when a CUDA GPU is available)
    report_to="none", # disable logging
)

### 4.4 Trainer Declaration

`Trainer` manages the training loop, where each **step** represents one forward and backward pass through a single batch of examples: processing the batch, computing predictions, calculating the loss, computing gradients via backpropagation, and updating the model's weights through the optimizer. Over many steps, these incremental updates gradually adapt the pre-trained DistilBERT to accurately identify answer spans in question-answering contexts.

In [None]:
from transformers import Trainer

# Create trainer for fine-tuned model
trainer = Trainer(
    model=model, # distilBERT with qa head
    args=args, # training configuration
    train_dataset=train_dataset, # preprocessed training features with labels
    eval_dataset=validation_dataset, # preprocessed validation features
    processing_class=tokenizer, # tokenizer for batch collation
)

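# an additional trainer is declared for the base model, to compare its predictions against the fine-tuned one.
# it wraps a freshly loaded copy: reusing `model` would evaluate the fine-tuned weights once `trainer.train()` has run
base_model = AutoModelForQuestionAnswering.from_pretrained(model_name)
base_trainer = Trainer(
    model=base_model, # untrained QA head on top of pretrained DistilBERT
    args=args,
    eval_dataset=validation_dataset,
    processing_class=tokenizer,
)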

### 4.5 *Fine-tuning* Execution

Training is initiated with `trainer.train()`, processing 90,968 features in 11,371 steps per epoch. Training takes around 80 minutes.
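
The step count follows from the number of features and the batch size. The sketch below reproduces the arithmetic, assuming the batch size was left at the `TrainingArguments` default of 8 per device, a single device, and no gradient accumulation; the helper `steps_per_epoch` is a hypothetical name used only for this illustration.

```python
import math


def steps_per_epoch(num_features, per_device_batch_size=8, num_devices=1, grad_accum_steps=1):
    """Number of optimizer steps needed to see every feature once."""
    batches = math.ceil(num_features / (per_device_batch_size * num_devices))
    return math.ceil(batches / grad_accum_steps)


print(steps_per_epoch(90968))  # 11371
```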

In [None]:
trainer.train()


## 5. Results and Discussion

### 5.1. Evaluation of fine-tuned model

Evaluation measures how well the fine-tuned model performs on the validation set using the SQuAD metrics. This process is more complex than in classification because numerical predictions are converted into text fragments, considering that multiple features might represent the same original example.

This strategy considers the top 20 start and end positions for each feature, evaluates all their valid combinations, and ranks the candidates by the sum of their start and end logits (skipping the softmax normalization, because calibrated probabilities are not needed for ranking).

Candidate answers are filtered out if they fall outside the context, have negative length, or are longer than `max_answer_length`. When a long context is split into several features, the candidate with the highest score across all of them is selected, which leverages multiple views of the same context while choosing a single best prediction. If no valid candidate is found for an example, an empty string is returned as the prediction.
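
The ranking and filtering rules can be seen on a toy example with four tokens. This is a simplified sketch for a single feature: the function `best_span` is a hypothetical helper, the logits are made-up values, and the check that excludes non-context tokens is omitted.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (start, end, score) of the best valid span, or None if there is none."""

    def top_indices(scores):
        # indices of the n_best highest scores, best first
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_best]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # discard spans with negative length or more than max_answer_length tokens
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [0.2, 0.1, 4.0, 0.5]

print(best_span(start_logits, end_logits))                       # (1, 2, 6.0)
print(best_span(start_logits, end_logits, max_answer_length=1))  # (3, 3, 5.5)
```

The highest start logit (token 3) and the highest end logit (token 2) cannot be combined because the span would have negative length, so the best valid pair is tokens 1 to 2. Restricting answers to a single token changes the winner to token 3.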

Evaluation uses two complementary metrics. The first is exact match (EM), which measures whether the prediction matches a reference answer exactly after text normalization: a score of 1 for an exact match and 0 otherwise. The second is the F1 score, the harmonic mean of precision and recall computed over the tokens shared by the prediction and the ground truth. It is more flexible than EM, granting partial credit when the predicted answer overlaps with the correct one.
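
For intuition, both metrics can be sketched in a few lines of plain Python. This is an illustrative re-implementation that follows the usual SQuAD normalization (lowercasing, removing punctuation and the articles "a", "an", "the", collapsing whitespace); the notebook itself relies on `evaluate.load("squad")`, and the function names below are hypothetical.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # if either side is empty, F1 is 1 only when both are empty
        return float(pred_tokens == ref_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric_fn, prediction, references):
    """Validation questions may have several references: keep the best score."""
    return max(metric_fn(prediction, ref) for ref in references)


print(exact_match("The BooksCorpus", "bookscorpus"))                                 # 1.0
print(f1_score("the Byte Pair Encoding algorithm", "Byte Pair Encoding (BPE)"))      # 0.75
print(best_over_references(f1_score, "Byte Pair Encoding", ["BPE", "Byte Pair Encoding (BPE)"]))
```

In the second example three of the four tokens overlap on each side, so precision and recall are both 0.75 and EM would be 0, which shows how F1 rewards a near miss that EM ignores.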


In [None]:
from tqdm.auto import tqdm
import collections
import numpy as np
import evaluate

n_best = 20  # consider top 20 start/end positions
max_answer_length = 30  # maximum answer span length in tokens
metric = evaluate.load("squad") # load SQuAD metric for evaluation (exact match and f1)

def check_score(start_logits, end_logits, features, examples):
    # map each example to its features
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    
    # process each example to find best answer across all its features
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # collect candidate answers from all features for this example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            # get top 20 start and end positions by logit score
            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            
            # evaluate all valid (start, end) combinations
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # skip if answer is not fully in context (offset is None for question tokens)
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # skip invalid answers: negative length or too long
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    # extract answer text from context using offsets
                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # select answer with highest logit score (or empty string if no valid answers)
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    # format ground truth answers and compute SQuAD metrics (em and f1)
    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)

Once training has finished, `trainer.predict()` runs inference on the entire validation dataset and returns the model's predictions: start and end logits for every token of every feature.

These logits are passed to the `check_score()` function along with the validation features and original examples. This function identifies valid answer spans, scores the candidates, aggregates them across all features of each example, and compares the best one against the ground-truth answers.

In [None]:
# run inference on validation set and get start/end logits
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions

# post-process predictions to extract answer spans and compute metrics
check_score(start_logits, end_logits, validation_dataset, squad["validation"])

**On the validation set, the fine-tuned model achieves 77.22% EM and 85.34% F1**.

In [None]:
# run inference on validation set with untrained base model
base_predictions, _, _ = base_trainer.predict(validation_dataset)
base_start_logits, base_end_logits = base_predictions

# compute metrics for base model
base_results = check_score(base_start_logits, base_end_logits, validation_dataset, squad["validation"])
print("Base model results (without fine-tuning):")
print(base_results)

**Whereas on the base model, X% EM and X% F1 are achieved**. 

### 5.2. Exemplified comparison

To exemplify this improvement, the base model is also compared with its fine-tuned version on test cases covering various topics. This tests whether the model can accurately locate factual information within technical contexts. The base model has only seen generic pre-training and lacks specific question-answering training, while the fine-tuned model has learned to extract answer spans from the SQuAD dataset.

In [None]:
from transformers import pipeline

# create qa pipeline for base model (no fine-tuning)
base_qa_pipeline = pipeline(
    task="question-answering",
    model="distilbert-base-cased"
)

# create qa pipeline for fine-tuned model
tuned_qa_pipeline = pipeline(
    task="question-answering",
    model=trainer.model,
    tokenizer=tokenizer
)

# test cases
test_cases = [
    {
        "context": """
           A search engine is defined as a system that retrieves relevant information from a large data collection in response to a query.
           Common examples of this technology include web platforms like Google and Bing, as well as internal company and e-commerce search engines.
           To function, these systems utilize a general architecture comprised of four key stages: crawling, indexing, query processing, and ranking.
        """,
        "question": "What is defined as a system that retrieves relevant information from a large data collection?"
    }, {
        "context": """
            Text similarity can be evaluated using specific metrics, such as the Jaccard Index.
            It measures the similarity between finite sample sets.
            Alternatively, Cosine Similarity assesses similarity by measuring the cosine of the angle between two vectors.
        """,
        "question": "Which index measures similarity by dividing the size of the intersection by the size of the union of sample sets?"
    }, {
        "context": """
            To address vocabulary limitations, the tokenizer is trained with the training corpus up to a maximum number.
            It uses the Byte Pair Encoding (BPE) algorithm, a method originally intended to compress text.
            This allows tokens to be generated by combinations of subwords (subtokens), which helps the model effectively avoid "Out Of Vocabulary" errors. 
        """,
        "question": "Which algorithm is used to generate combinations of subwords to avoid Out Of Vocabulary issues?"
    }, {
        "context": """
            The GPT language model utilizes a transformer decoder architecture consisting of 12 decoders.
            It was trained using the BooksCorpus(en) to predict the next word.
            To achieve this, the model employs masked attention, which ensures that it does not look at future tokens during the process.
        """,
        "question": "What specific corpus was used to train the GPT-1 language model?"
    }
]

# compare base and fine-tuned model on each test case
for i, test_case in enumerate(test_cases, 1):
    context = test_case["context"].strip()
    question = test_case["question"]

    base_result = base_qa_pipeline(question=question, context=context)
    base_answer = base_result["answer"]
    base_score = base_result["score"]

    tuned_result = tuned_qa_pipeline(question=question, context=context)
    tuned_answer = tuned_result["answer"]
    tuned_score = tuned_result["score"]

    print(f"Test case #{i}")
    print(f"Context: {context}")
    print(f"Question: {question}")
    print()
    
    print("Base model:")
    print(f"- Answer: {base_answer}")
    print(f"- Confidence: {base_score:.4f}")
    print()
    
    print("Fine-tuned model:")
    print(f"- Answer: {tuned_answer}")
    print(f"- Confidence: {tuned_score:.4f}")
    print()
    print()

## 6. Conclusion and future steps

In the evaluation results, the fine-tuned model achieves an Exact Match score of 77.22% and an F1 score of 85.34% on the validation set after training for one epoch (the `num_train_epochs=1` setting used above), which represents good performance. The model identifies the precise answer more than three-quarters of the time, and when it doesn't match exactly, there is still substantial overlap with the correct answer (reflected in the high F1 score).

On the other hand, the comparison with the base model in the previous section demonstrates the effectiveness of fine-tuning. The base model behaves almost randomly, with very low confidence levels and incoherent answers that extract meaningless fragments, whereas the fine-tuned model demonstrates:
1. Much greater confidence: reflecting more decisive predictions.
2. Semantic coherence: the answers are appropriate and complete, such as "175 billion" for the number of parameters in GPT-3.
3. And generalization capability: the model answers questions about class concepts not present in the dataset, which shows that it has learned general reading comprehension patterns rather than memorizing examples.

This journey from random behavior to useful and reliable predictions in the same model shows the value of fine-tuning, even when performed on a small model like DistilBERT. The results validate that a short training run is sufficient to adapt pretrained linguistic knowledge to the extractive question-answering task.

**Precise answers and efficient inference make this fine-tuned model well suited for web science applications**, such as retrieval systems in browser engines. Its accuracy would support reliable information extraction from web documents, while its efficiency could allow it to run client side.

QA systems integrated in the browser could handle diverse web content across different domains, from technical documentation and academic papers to news articles and e-commerce pages. Accurate answer extraction could improve web navigation and accessibility, benefiting users with visual impairments or cognitive disabilities who could avoid navigating complex documents.

Future work could explore two directions: experimenting with larger models such as `bert-base-cased` to evaluate whether they achieve better results, or applying fine-tuning to specific domains, such as technical manuals or legal documents, to optimize the model for particular content types. Either could improve performance in web applications.

----

## References

- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). *DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter*. arXiv preprint arXiv:1910.01108.

- Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). *SQuAD: 100,000+ questions for machine comprehension of text*. arXiv preprint arXiv:1606.05250.