<span style="font-size:25px;">J040 Nathan Dsouza</span>

# SQuAD Question Answering Task with DistilBERT
This notebook demonstrates an end-to-end workflow for training a question answering model using the Stanford Question Answering Dataset (SQuAD) and the DistilBERT transformer model. The steps include:
- Loading and flattening the SQuAD data
- Preprocessing for token-level answer mapping
- Model training and evaluation with a custom token-level IoU metric
- Building an inference pipeline to test the trained model

In [1]:
import torch
import transformers
import pandas as pd
import numpy as np
import json

from sklearn import model_selection, metrics
from datasets import Dataset
from tqdm.notebook import tqdm

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## 1. Import Required Libraries
This cell imports all necessary libraries for data processing, model training, and evaluation.

In [2]:
config = {
    "max_length": 512,
    "model_path": "distilbert-base-uncased",
    "output_dir": "./my-qa-model",
    "train_batch_size": 8,
    "valid_batch_size": 8,
    "learning_rate": 3e-5,
    "epochs": 2,
    "debug": False,
}

## 2. Configuration
This cell sets up the configuration parameters for the model, including maximum sequence length, model path, batch sizes, learning rate, and number of epochs.

In [3]:
def preprocess_function(sample):
    inputs = tokenizer(
        sample["question"],
        sample["context"],
        max_length=config["max_length"],
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sequence_ids = inputs.sequence_ids()

    # Find the start and end of the context in the tokenized input
    context_start = None
    context_end = None
    for i, seq_id in enumerate(sequence_ids):
        if seq_id == 1 and context_start is None:
            context_start = i
        if seq_id == 1:
            context_end = i + 1

    context_offsets = offset_mapping[context_start:context_end]

    answer_start_char = sample["answer_start"]
    answer_end_char = sample["answer_end"]

    start_pos = end_pos = None
    for idx, (start, end) in enumerate(context_offsets):
        if start <= answer_start_char < end:
            start_pos = idx + context_start
        if start < answer_end_char <= end:
            end_pos = idx + context_start
    # If no token contains the answer end (for example, it was truncated away),
    # fall back to the start token
    if end_pos is None:
        end_pos = start_pos

    # If answer not found, set to first context token
    if start_pos is None or end_pos is None:
        start_pos = context_start
        end_pos = context_start

    inputs["start_positions"] = start_pos
    inputs["end_positions"] = end_pos

    return inputs

## 3. Preprocessing Function
This cell defines the preprocessing function that maps character-level answer positions to token-level positions using the tokenizer's offset mapping. This is essential for training the model to predict the correct answer span.
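The mapping described above can be illustrated without loading a tokenizer. The following is a minimal sketch, assuming hand-written offsets for a short context; the `offsets` list and the helper name `char_span_to_token_span` are hypothetical and stand in for the tokenizer's real offset mapping.

```python
# Hypothetical (start, end) character offsets, one pair per context token.
# A real tokenizer would produce these via return_offsets_mapping=True.
context = "a copper statue of Christ stands here"
offsets = [(0, 1), (2, 8), (9, 15), (16, 18), (19, 25), (26, 32), (33, 37)]


def char_span_to_token_span(offsets, answer_start, answer_end):
    """Return (first_token, last_token) covering characters [answer_start, answer_end)."""
    start_tok = end_tok = None
    for idx, (start, end) in enumerate(offsets):
        # The token that contains the first answer character
        if start <= answer_start < end:
            start_tok = idx
        # The token that contains the last answer character
        if start < answer_end <= end:
            end_tok = idx
    return start_tok, end_tok


answer = "statue of Christ"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

span = char_span_to_token_span(offsets, answer_start, answer_end)
print(span)  # (2, 4)
```

An answer that lies outside every offset pair returns `(None, None)`, which corresponds to the fallback branch in the preprocessing function.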

In [4]:
with open("/kaggle/input/stanford-question-answering-dataset/train-v1.1.json") as f:
    data = json.load(f)


flattened_data = []

for sample in data["data"]:
    for para in sample["paragraphs"]:
        for qas in para["qas"]:
            flattened_data.append({
                "context": para["context"],
                "question": qas["question"],
                "answer": qas["answers"][0]["text"],
                "answer_start": qas["answers"][0]["answer_start"],  
            })
            
            
df = pd.DataFrame(flattened_data)
df["answer_end"] = df["answer_start"] + df["answer"].apply(len)
print(df.shape)
df.head(10)


tokenizer = transformers.AutoTokenizer.from_pretrained(config["model_path"])

(87599, 5)



## 4. Data Loading and Postprocessing
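This cell loads the SQuAD v1.1 training set from JSON, flattens it into a DataFrame with one row per question (keeping only the first listed answer), and computes each answer's end position as `answer_start + len(answer)`. It also initializes the tokenizer.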
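The flattening step can be tried on a tiny in-memory example. This is a sketch only: the `toy` dictionary is made-up data that mimics the nesting of the SQuAD JSON, and `flatten_squad` is a hypothetical helper name.

```python
# Made-up data with the same nesting as the SQuAD JSON file
toy = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(raw):
    """Produce one flat record per question, keeping only the first answer."""
    rows = []
    for article in raw["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                first = qa["answers"][0]
                start = first["answer_start"]
                rows.append({
                    "context": para["context"],
                    "question": qa["question"],
                    "answer": first["text"],
                    "answer_start": start,
                    # Exclusive end index, so context[start:end] is the answer
                    "answer_end": start + len(first["text"]),
                })
    return rows


rows = flatten_squad(toy)
row = rows[0]
print(row["context"][row["answer_start"]:row["answer_end"]])  # Paris
```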

In [5]:
sample = df.iloc[1]

print("Question:\n{}".format(sample["question"]))
print()
print("Context:\n{}".format(sample["context"]))
print()
print("Answer:", sample["answer"])
print()
print(sample["context"][sample["answer_start"] : sample["answer_end"]])

Question:
What is in front of the Notre Dame Main Building?

Context:
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Answer: a copper statue of Christ

a copper statue of Christ


## 5. Data Sample Display
This cell displays a sample question, context, and answer from the processed DataFrame to help visualize the data format.

In [6]:
enc = tokenizer(
    sample["question"],
    sample["context"],
    
    return_offsets_mapping=True
)


enc

{'input_ids': [101, 2054, 2003, 1999, 2392, 1997, 1996, 10289, 8214, 2364, 2311, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 8539, 2083, 1017, 11342, 1998, 1996, 2751, 8514, 1007, 1010, 200

## 6. Tokenization Example
This cell tokenizes a sample question and context together and requests the offset mapping, which records the character span each token covers. The offsets are what allow character-level answer positions to be mapped to tokens.
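To make the idea of an offset mapping concrete, here is a small sketch that uses a plain whitespace split instead of the DistilBERT tokenizer. It is an illustration only: `whitespace_offsets` is a hypothetical helper, and real WordPiece tokens will split differently.

```python
import re


def whitespace_offsets(text):
    """Split on whitespace and record each token's (start, end) character span."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


text = "a copper statue of Christ"
for token, (start, end) in whitespace_offsets(text):
    # Slicing the original text with an offset pair recovers the token
    assert text[start:end] == token
    print(token, (start, end))
```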

In [7]:
print(enc.sequence_ids())

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


## 7. Sequence IDs Display
This cell prints the sequence IDs for the tokenized sample: `0` marks question tokens, `1` marks context tokens, and `None` marks special tokens such as `[CLS]` and `[SEP]`.
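Given a list of sequence IDs like the one printed above, the context span can be located as follows. This is a minimal sketch with a shortened, hand-written list; `context_token_range` is a hypothetical helper name.

```python
def context_token_range(sequence_ids):
    """Return (start, end) token indices of the context, with end exclusive."""
    idxs = [i for i, seq_id in enumerate(sequence_ids) if seq_id == 1]
    if not idxs:
        return None  # no context tokens present
    return idxs[0], idxs[-1] + 1


# Shortened stand-in for the real output: [CLS] question [SEP] context [SEP]
seq_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None]
print(context_token_range(seq_ids))  # (5, 9)
```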

In [8]:
if config["debug"]:
    print("DEBUG MODE!!!")
    df = df.sample(10_000)



train, valid = model_selection.train_test_split(
    df,
    test_size=0.25,
    random_state=1123,
    shuffle=True,
)

## 8. Train/Validation Split
This cell shuffles the processed DataFrame and splits it into a 75% training set and a 25% validation set, using a fixed random seed so the split is reproducible.
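The resulting set sizes can be checked with a little arithmetic. The sketch below assumes that the validation share is rounded up and the remainder goes to training, which is consistent with the row counts reported later when the datasets are mapped.

```python
import math

n_samples = 87599   # rows in the flattened DataFrame
test_size = 0.25    # validation fraction passed to train_test_split

# Assumption: the validation count is rounded up, the rest is used for training
n_valid = math.ceil(test_size * n_samples)
n_train = n_samples - n_valid

print(n_train, n_valid)  # 65699 21900
```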

In [9]:
train_ds = Dataset.from_pandas(train)
valid_ds = Dataset.from_pandas(valid)

## 9. Dataset Creation
This cell converts the train and validation DataFrames into HuggingFace Dataset objects for efficient processing and training.

In [10]:
%%time

train_ds = train_ds.map(preprocess_function)
valid_ds = valid_ds.map(preprocess_function)

Map:   0%|          | 0/65699 [00:00<?, ? examples/s]

Map:   0%|          | 0/21900 [00:00<?, ? examples/s]

CPU times: user 1min 31s, sys: 645 ms, total: 1min 32s
Wall time: 1min 31s


## 10. Preprocessing Datasets
This cell applies the preprocessing function to the train and validation datasets, preparing them for model training.

In [11]:
# train_ds[0]

In [12]:
model = transformers.AutoModelForQuestionAnswering.from_pretrained(
    config["model_path"]
)


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 11. Model Initialization
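This cell loads DistilBERT with a question answering head using HuggingFace Transformers. As the warning above notes, the head's `qa_outputs` weights are newly initialized, so the model must be fine-tuned before its predictions are meaningful.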

In [13]:
model

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
     

In [14]:
with torch.no_grad():
    out = model(**tokenizer("hello", "world", return_tensors="pt"))
    
out

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-0.1381, -0.0645,  0.3512,  0.2330,  0.3512]]), end_logits=tensor([[-0.2287,  0.0357, -0.0914,  0.0028, -0.0913]]), hidden_states=None, attentions=None)

In [15]:
import numpy as np

def compute_metrics(eval_data):
    # logits is (start_logits, end_logits); labels is (start_positions, end_positions)
    logits, labels = eval_data
    pred_start_pos = np.argmax(logits[0], axis=-1)
    pred_end_pos = np.argmax(logits[1], axis=-1)

    scores = []
    for pred_start, pred_end, (label_start, label_end) in zip(pred_start_pos, pred_end_pos, zip(*labels)):
        # Token spans are inclusive on both ends. An inverted prediction
        # (end before start) gives an empty set and therefore an IoU of 0.
        pred_tokens = set(range(pred_start, pred_end + 1))
        label_tokens = set(range(label_start, label_end + 1))
        intersection = len(pred_tokens & label_tokens)
        union = len(pred_tokens | label_tokens)
        iou = intersection / union if union > 0 else 0.0
        scores.append(iou)
    return {
        "iou": np.mean(scores)
    }


training_args = transformers.TrainingArguments(
    output_dir = config["output_dir"],

    eval_strategy="epoch",
    per_device_train_batch_size=config["train_batch_size"],
    per_device_eval_batch_size=config["valid_batch_size"],
    learning_rate=config["learning_rate"],
    num_train_epochs=config["epochs"],
    dataloader_num_workers=4,

    save_strategy="epoch",
    save_total_limit=2,
    report_to="none",

    load_best_model_at_end=True,
    fp16=True,
)

## 12. Custom Metric and Training Arguments
This cell defines the token-level IoU (intersection over union) metric, which compares the predicted token span with the labeled span, and sets up the training arguments for the Trainer.
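A single-example version of the metric makes the calculation easier to follow. This is a sketch of the same span arithmetic in plain Python; `token_iou` is a hypothetical helper name and the token indices are made up.

```python
def token_iou(pred_start, pred_end, label_start, label_end):
    """Intersection over union of two inclusive token spans."""
    pred = set(range(pred_start, pred_end + 1))
    label = set(range(label_start, label_end + 1))
    union = pred | label
    return len(pred & label) / len(union) if union else 0.0


# Prediction covers tokens 10..14, label covers 12..15:
# 3 shared tokens out of 6 distinct tokens
print(token_iou(10, 14, 12, 15))  # 0.5
```

An exact match scores 1.0, while a prediction whose end precedes its start scores 0.0 because its token set is empty.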

In [16]:
trainer = transformers.Trainer(
    model,
    training_args,
    
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)




## 13. Trainer Initialization
This cell initializes the HuggingFace Trainer with the model, training arguments, datasets, tokenizer, and custom metric.

In [17]:
trainer.train()
trainer.save_state()
trainer.save_model()



| Epoch | Training Loss | Validation Loss | IoU |
|---|---|---|---|
| 1 | 1.2222 | 1.114534 | 0.692941 |
| 2 | 0.9047 | 1.082789 | 0.71263 |




## 14. Model Training and Saving
This cell trains the model for two epochs, saves the trainer state, and writes the trained model to `./my-qa-model` for inference.

In [18]:
# Inference

from transformers import pipeline

qa_pipeline = pipeline(
    task="question-answering",
    model=config["output_dir"],
    device="cuda"
)

Device set to use cuda


## 15. Inference Pipeline
This cell sets up the HuggingFace pipeline for question answering using the trained model, enabling predictions on new data.

In [19]:
preds = []

for idx, row in valid.reset_index(drop=True).iterrows():
    context = row["context"]
    question = row["question"]
    
    pred = qa_pipeline(
        question=question,
        context=context
    )
    
    
    preds.append(
        pred
    )
    
    if idx == 10:
        break

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


## 16. Run Predictions on Validation Set
This cell uses the inference pipeline to generate predictions for the first 11 samples in the validation set.

In [20]:
pred_df = pd.DataFrame(preds)
pred_df["gold_answer"] = valid["answer"].tolist()[: 11]

pred_df

| | score | start | end | answer | gold_answer |
|---|---|---|---|---|---|
| 0 | 0.996172 | 267 | 271 | 1861 | 1861 |
| 1 | 0.022319 | 514 | 597 | Some ethnic groups are concerned about the pot... | federal assistance |
| 2 | 0.178762 | 307 | 320 | Anglo-Burmese | Anglo-Burmese |
| 3 | 0.152809 | 236 | 303 | such that the "off" output is limited to leaka... | the "off" output is limited to leakage current... |
| 4 | 0.421363 | 207 | 257 | Singer songwriter Cathal Coughlan and Sean O'H... | Cathal Coughlan and Sean O'Hagan |
| 5 | 0.551454 | 698 | 718 | Federal Reserve Note | Federal Reserve Note |
| 6 | 0.725555 | 7 | 31 | U.S. News & World Report | U.S. News & World Report |
| 7 | 0.296175 | 118 | 122 | bats | islands |
| 8 | 0.355743 | 0 | 8 | Pius XII | Second World Congress of Lay Apostolate |
| 9 | 0.917606 | 607 | 616 | one third | one third |
