In [1]:
!pip install --upgrade transformers
!pip install -U datasets fsspec

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [2]:
from datasets import load_dataset


In [3]:
dataset = load_dataset('squad')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
dataset['train']['context'][1120]

"As white settlers began populating Montana from the 1850s through the 1870s, disputes with Native Americans ensued, primarily over land ownership and control. In 1855, Washington Territorial Governor Isaac Stevens negotiated the Hellgate treaty between the United States Government and the Salish, Pend d'Oreille, and the Kootenai people of western Montana, which established boundaries for the tribal nations. The treaty was ratified in 1859. While the treaty established what later became the Flathead Indian Reservation, trouble with interpreters and confusion over the terms of the treaty led whites to believe that the Bitterroot Valley was opened to settlement, but the tribal nations disputed those provisions. The Salish remained in the Bitterroot Valley until 1891."

In [5]:
dataset['train']['question'][1120]

'What year was the Hellgate treaty formed?'

In [6]:
dataset['train']['answers'][1120]['text']

['1855']

In [7]:
dataset['train']['answers'][1120]['answer_start']

[162]

In [8]:
dataset['train']['answers'][1120]

{'text': ['1855'], 'answer_start': [162]}
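
The fields shown above fit together in a simple way: `answer_start` is a character offset into `context`, so slicing the context at that offset for the length of the answer text should recover the answer. A minimal sketch, using a shortened hand-written record rather than the real SQuAD row (the `record` dict and the `answer_span` helper are illustrative names, not part of the `datasets` API):

```python
# Hypothetical SQuAD-style record; the context is shortened for illustration.
record = {
    "context": "In 1855, Isaac Stevens negotiated the Hellgate treaty.",
    "question": "What year was the Hellgate treaty formed?",
    "answers": {"text": ["1855"], "answer_start": [3]},
}


def answer_span(record, k=0):
    """Return the (start, end) character span of the k-th answer."""
    start = record["answers"]["answer_start"][k]
    end = start + len(record["answers"]["text"][k])
    return start, end


start, end = answer_span(record)
# Slicing the context with the span reproduces the answer text.
assert record["context"][start:end] == record["answers"]["text"][0]
print(start, end, record["context"][start:end])
```

The preprocessing step later in the notebook relies on this same character span when it converts answers into token positions.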

In [9]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the base checkpoint with a (randomly initialized) QA head, plus its tokenizer
model_checkpoint = 'distilbert-base-uncased'
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:

# We need to set a maximum length and a stride for handling long contexts.
# When a context is longer than max_length, it will be split into multiple
# chunks, with an overlap defined by doc_stride.
max_length = 384
doc_stride = 128
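
To make the chunking idea concrete, here is a simplified model of splitting a long token sequence into overlapping windows. It is a sketch under stated assumptions, not the tokenizer's actual implementation: `sliding_windows` is a hypothetical helper, it ignores the question and special tokens that share the `max_length` budget, and it treats the stride as the number of tokens shared by consecutive chunks.

```python
def sliding_windows(tokens, window, overlap):
    """Split tokens into chunks of at most `window` items,
    where consecutive chunks share `overlap` items."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


# Toy example: 10 "tokens", window of 4, overlap of 2.
chunks = sliding_windows(list(range(10)), window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Because of the overlap, an answer that straddles a chunk boundary can still appear whole in the neighboring chunk.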

In [11]:
def preprocess_func(examples):
    # Tokenize the questions and contexts. The question is the first part of the pair.
    # The context is the second part.
    # We allow truncation and set the max_length and stride defined earlier.
    tokenized_examples = tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length'
    )

    # The 'return_overflowing_tokens' creates a mapping from each new feature
    # to its original example index. We need this to link answers.
    sample_mapping = tokenized_examples.pop('overflow_to_sample_mapping')
    offset_mapping = tokenized_examples.pop('offset_mapping')

    # Now we label each feature with the start and end token positions of its answer.
    tokenized_examples['start_positions'] = []
    tokenized_examples['end_positions'] = []

    for i, offsets in enumerate(offset_mapping):
        # We need to find the start and end token positions for the answer.
        # We'll use the CLS token index as a default if the answer isn't in the span.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the original example corresponding to this feature
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If there are no answers, set the cls_index as the answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Get the start and end character positions of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find the start and end of the context within the tokenized input.
            # sequence_ids is 0 for question tokens, 1 for context tokens,
            # and None for special and padding tokens. Compute it once per feature.
            sequence_ids = tokenized_examples.sequence_ids(i)

            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Check if the answer is within the boundaries of our tokenized span.
            # If not, the label is (CLS, CLS).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise, find the token indices that correspond to the answer's start and end.
                # Move the token_start_index to the first token of the answer.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)

                # Move the token_end_index to the last token of the answer.
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
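
The character-to-token step inside the loop can be isolated and checked on a toy input. The sketch below uses hand-written offsets instead of a real tokenizer; `char_span_to_token_span` is a hypothetical helper that mirrors the two `while` loops above and returns `None` where the notebook falls back to the CLS index.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    `offsets` holds one (char_start, char_end) pair per context token.
    Returns None when the answer is not fully inside the tokens.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Advance past every token that starts at or before the answer start.
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1

    # Step back over every token that ends at or after the answer end.
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1

    return start_token - 1, end_token + 1


# Toy context "signed in 1855 there", split on whitespace.
offsets = [(0, 6), (7, 9), (10, 14), (15, 20)]
inside = char_span_to_token_span(offsets, 10, 14)   # the answer "1855"
outside = char_span_to_token_span(offsets, 25, 29)  # beyond this chunk
print(inside, outside)
```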

In [12]:
tokenized_datasets = dataset.map(
    preprocess_func,
    batched=True,
    remove_columns=dataset['train'].column_names,
)

In [13]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"], # Target query and value layers
    task_type=TaskType.QUESTION_ANS,
)

# Wrap the base model with PEFT model to apply LoRA
model = get_peft_model(model, lora_config)
# Print the number of trainable parameters to see the effect of LoRA
model.print_trainable_parameters()

trainable params: 296,450 || all params: 66,660,868 || trainable%: 0.4447
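
The reported count can be reproduced by hand. The arithmetic below assumes the standard DistilBERT-base dimensions (6 transformer layers, hidden size 768), which the notebook does not print, and assumes that the QA head stays trainable alongside the LoRA matrices; under those assumptions the numbers match the output above.

```python
# Assumed DistilBERT-base dimensions (not printed in the notebook).
hidden_size = 768
num_layers = 6

r = 16                 # LoRA rank from lora_config
targets_per_layer = 2  # q_lin and v_lin

# Each adapted Linear(d_in, d_out) gains A (r x d_in) and B (d_out x r).
lora_per_module = r * (hidden_size + hidden_size)
lora_params = lora_per_module * targets_per_layer * num_layers

# QA head: Linear(hidden_size, 2) with bias, assumed to be kept trainable.
qa_head_params = hidden_size * 2 + 2

trainable = lora_params + qa_head_params
all_params = 66_660_868  # total reported by print_trainable_parameters()
pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || trainable%: {pct:.4f}")
```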


In [14]:
from transformers import TrainingArguments, Trainer, default_data_collator

training_args = TrainingArguments(
    output_dir=f"{model_checkpoint}-lora-qa",
    learning_rate= 1e-3,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False, # Set to True if you want to upload to Hub
)

In [15]:
# Instantiate the Trainer.
# The Trainer class handles the entire training and evaluation loop.
# Trainer and default_data_collator were imported in the previous cell.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument in recent transformers releases
    data_collator=default_data_collator,  # Use default collator for QA
)

No label_names provided for model class `PeftModelForQuestionAnswering`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
print("--> Step 7: Starting the fine-tuning process...")

trainer.train()

print("Fine-tuning complete!")

--> Step 7: Starting the fine-tuning process...


[34m[1mwandb[0m: Currently logged in as: [33masif-cs-ai[0m ([33masif-cs-ai-north-south-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch  Training Loss  Validation Loss
1      1.3573         1.272103


In [None]:
# Step 8: Inference with the Fine-Tuned Model
# ===========================================
import torch

print("\n--> Step 8: Running inference with the new model...")

# Define a sample context and question
context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower."
question = "Who designed the Eiffel Tower?"

# Move model to GPU if available and switch to evaluation mode (disables dropout)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Tokenize the input
inputs = tokenizer(question, context, return_tensors="pt").to(device)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)

# Get the start and end logits (the model's confidence scores for each token)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Find the tokens with the highest 'start' and 'end' scores
start_index = torch.argmax(start_logits, dim=1).item()
end_index = torch.argmax(end_logits, dim=1).item()

# Get the input IDs to convert back to text
input_ids = inputs["input_ids"].cpu().numpy()[0]

# Decode the tokens between the start and end indices
predicted_answer_tokens = input_ids[start_index : end_index + 1]
predicted_answer = tokenizer.decode(predicted_answer_tokens)

print(f"\nContext: {context}")
print(f"Question: {question}")
print(f"Predicted Answer: {predicted_answer}")
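
Taking the argmax of the start and end logits independently can yield an end index that comes before the start index, in which case the decoded slice is empty. One common alternative is to score every valid (start, end) pair and keep the best one. The sketch below works on plain Python lists; `best_span` and `max_answer_len` are illustrative names, and the helper does not exclude question or special tokens, which a fuller post-processing step would also handle.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximizing start_logit + end_logit
    subject to start <= end and a span of at most max_answer_len tokens."""
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best, best_score


# Toy logits where independent argmax would pick start=1, end=0 (invalid).
toy_start = [0.1, 2.0, 0.3, 0.2]
toy_end = [3.0, 0.1, 1.5, 0.2]
span, score = best_span(toy_start, toy_end)
print(span, score)
```

With the tensors from the cell above, the helper could be called as `best_span(start_logits[0].tolist(), end_logits[0].tolist())`.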