# **Brief Overview:**  
This notebook demonstrates the workflow for fine-tuning a BERT-based model for question-answering tasks. It covers the following steps:  
1. Loading and exploring the dataset.  
2. Initializing the BERT model and tokenizer.  
3. Preprocessing the dataset to prepare inputs for the model.  
4. Configuring lightweight training parameters for efficient fine-tuning.  
5. Training the model on the dataset and saving the fine-tuned version.  
6. Loading the trained model and predicting answers for given questions and contexts.  
7. Providing a practical example of using the fine-tuned model for real-world question-answering.

In [1]:
pip install -q transformers datasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HF_TOKEN")

In [3]:
from huggingface_hub import login
login(secret_value_0)

# **Import Necessary Libraries**

In [19]:
from transformers import (
    DistilBertTokenizerFast,  # This is a tokenizer specifically designed for DistilBERT. 
                              # It converts input text into token IDs (numerical representations) 
                              # that the model can understand. The "Fast" version is optimized for speed 
                              # and efficiency using the Hugging Face `tokenizers` library.

    DistilBertForQuestionAnswering,  # This is a pre-trained DistilBERT model architecture fine-tuned 
                                      # specifically for Question Answering tasks. It is capable of 
                                      # taking a context paragraph and a question as input and predicting 
                                      # the start and end positions of the answer within the context.

    TrainingArguments,  # This is a helper class provided by Hugging Face Transformers. 
                        # It allows you to define hyperparameters and settings for model training, 
                        # such as batch size, learning rate, number of epochs, and more.

    Trainer,  # This is a high-level API for training and evaluating models. 
              # It abstracts away many lower-level details of the training loop, 
              # such as managing batches, gradients, and evaluation metrics.

    DataCollatorWithPadding  # This is a utility class used during training to dynamically pad input sequences 
                             # in a batch to the same length. Padding ensures that sequences of different lengths 
                             # can be processed together in batches, without wasting computation on excessive padding.
)
from datasets import load_dataset  # This function is part of the Hugging Face Datasets library.
                                   # It allows you to load pre-built or custom datasets from Hugging Face's 
                                   # dataset hub or your local files. It supports efficient dataset loading, 
                                   # preprocessing, and manipulation for training and evaluation.

import torch

# **Load [Dataset](https://huggingface.co/datasets/rajpurkar/squad)**

In [5]:
# Load a tiny dataset subset
dataset = load_dataset("squad", split="train[:200]") # Only 200 Examples
eval_dataset = load_dataset("squad", split="validation[:40]") # 40 validation examples

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

# **Load [Model](https://huggingface.co/distilbert/distilbert-base-uncased) and Tokenizer**

In [6]:
model_name = "distilbert/distilbert-base-uncased" #Smaller version of the BERT base model

tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# **Preprocessing**

In [7]:
def preprocess_function(examples):
    # Strip leading/trailing whitespace from questions and contexts
    questions = [q.strip() for q in examples["question"]]  # Clean each question in the dataset
    contexts = [c.strip() for c in examples["context"]]  # Clean each context in the dataset

    # Tokenize the questions and contexts
    inputs = tokenizer(
        questions,  # List of questions to tokenize
        contexts,   # Corresponding list of contexts
        max_length=256,  # Maximum token length for input sequences
        truncation="only_second",  # Truncate the context (second input) if it exceeds max_length
        stride=128,  # Overlap between truncated segments for better context handling
        return_offsets_mapping=True,  # Include the mapping of tokens to original character positions
        padding="max_length"  # Pad sequences to max_length for uniformity
    )

    # Lists to store the start and end positions of the answers in tokenized inputs
    start_positions = []
    end_positions = []

    # Iterate over each example in the dataset
    for i, offset in enumerate(inputs["offset_mapping"]):  # 'offset_mapping' links tokens to their original text spans
        answer = examples["answers"][i]  # Get the answer for the current example
        start_char = answer["answer_start"][0]  # Character index of the answer's start
        end_char = start_char + len(answer["text"][0])  # Character index of the answer's end

        # Initialize start and end token indices
        start_token = 0
        end_token = 0

        # Iterate over the token offsets to find the start and end tokens
        for idx, (start, end) in enumerate(offset):
            if start <= start_char and end >= start_char:  # Check if the token contains the start_char
                start_token = idx
            if start <= end_char and end >= end_char:  # Check if the token contains the end_char
                end_token = idx
                break  # End token found; exit loop

        # Append the identified token indices to the respective lists
        start_positions.append(start_token)
        end_positions.append(end_token)

    # Add the calculated start and end positions to the tokenized inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    # Return the preprocessed inputs
    return inputs

In [8]:
# Process Datasets

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=32,
    remove_columns=dataset.column_names
)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [9]:
tokenized_dataset = eval_dataset.map(
    preprocess_function,
    batched=True,
    batch_size=32,
    remove_columns=dataset.column_names
)

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

In [10]:
pip -q install transformers[torch]

Note: you may need to restart the kernel to use updated packages.


# **Configuring Training Parameters for Fast and Lightweight Model Training**

In [13]:
training_args = TrainingArguments(
    output_dir="./qa-results",  # Directory where model checkpoints and outputs will be saved
    num_train_epochs=1,  # Number of times the entire training dataset will be passed through the model
    per_device_train_batch_size=4,  # Batch size for training per device (e.g., per GPU or CPU)
    per_device_eval_batch_size=4,  # Batch size for evaluation per device
    learning_rate=5e-5,  # Learning rate used by the optimizer for adjusting model weights
    weight_decay=0.01,  # Regularization technique to prevent overfitting by penalizing large weights
    logging_steps=10,  # Log training progress every 10 steps
    eval_strategy="no",  # No evaluation will be performed during training
    save_strategy="no",  # No model checkpoints will be saved during training
    use_cpu=True,  # Forces the training to use the CPU instead of a GPU (even if one is available)
    report_to="none"  # Disables logging to external platforms like WandB or TensorBoard
)

# **Initalize and Train**

In [15]:
trainer = Trainer(
    model= model,
    args = training_args,
    train_dataset=tokenized_dataset,
    data_collator= DataCollatorWithPadding(tokenizer)
)

# **Train and Save**

In [16]:
trainer.train()
model.save_pretrained("./qa-model")
tokenizer.save_pretrained("./qa-model")

Step,Training Loss
10,4.9609


('./qa-model/tokenizer_config.json',
 './qa-model/special_tokens_map.json',
 './qa-model/vocab.txt',
 './qa-model/added_tokens.json',
 './qa-model/tokenizer.json')

# **Loading a Pretrained DistilBERT Question-Answering Model and Tokenizer**

In [17]:
def load_qa_model(model_path="/kaggle/working/qa-model"):
    #Load model and tokenizer from the saved directory
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
    model = DistilBertForQuestionAnswering.from_pretrained(model_path)
    return model, tokenizer

# **Function to Predict Answers from a Given Question and Context Using DistilBERT**

In [18]:
def answer_question(question, context, model, tokenizer):
    # Tokenize input question and context
    inputs = tokenizer(
        question,  # The question to be answered
        context,  # The context or passage containing the answer
        return_tensors="pt",  # Return PyTorch tensors for model compatibility
        max_length=256,  # Limit the tokenized input length to 256 tokens
        truncation="only_second",  # Truncate the context (second input) if too long
        padding=True  # Add padding to make inputs uniform in length
    )
    
    # Get model predictions without computing gradients
    with torch.no_grad():  # Disable gradient computation for faster inference
        outputs = model(**inputs)  # Pass tokenized inputs through the model
    
    # Find the most likely start and end positions for the answer
    answer_start = torch.argmax(outputs.start_logits)  # Index of the highest start score
    answer_end = torch.argmax(outputs.end_logits)  # Index of the highest end score
    
    # Convert token positions to string (reconstruct the answer text)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # Map token IDs back to tokens
    answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end + 1])  # Convert token sequence to a string
    
    return answer  # Return the predicted answer text


# **Example Usage of Question-Answering Model to Extract Answers from Context**

In [20]:
model, tokenizer = load_qa_model()

# Example context and question
context = """
Python is a high-level programming language created by Guido van Rossum and released in 1991. 
Python's design emphasizes code readability with its notable use of significant whitespace. 
Its language constructs and object-oriented approach aim to help programmers write clear, logical code.
"""

question = "Who created Python?"

# Get answer
answer = answer_question(question, context, model, tokenizer)
print(f"\nQuestion: {question}")
print(f"Answer: {answer}")


Question: Who created Python?
Answer: ? [SEP] python is a high - level programming language created by guido van rossum and released in 1991
