DS Lab Course Week 5

Session 2:

HF Transformers - Hands-on Training GPT-2
This session focuses on the "how-to" of fine-tuning. The goal is to demystify the process and show students they can get a language model to generate text in a specific style with just a few key components.


What is GPT-2? It's a "decoder-only" transformer trained to predict the next token (roughly a word or word piece) given the text so far.
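
To make "predict the next token" concrete, here is a minimal plain-Python sketch of how one sentence turns into training pairs. The whitespace tokenizer and the helper name `next_token_pairs` are illustrative assumptions; GPT-2 itself works on subword tokens and learns from all positions of a sequence in parallel.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat down".split()  # toy whitespace tokenizer (assumption)
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Each pair is one prediction problem: given everything on the left, guess the token on the right.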

Setup on Google Colab (5 mins)

Guide students to create a new Colab notebook and enable the GPU runtime (Runtime -> Change runtime type -> T4 GPU).

Install the necessary libraries with the two commands below:

In [22]:
!pip install --upgrade transformers



In [23]:
!pip install datasets accelerate evaluate



Load a Dataset: Use the datasets library to load a simple text dataset. This session uses wikitext-2-raw-v1, a small Wikipedia-based corpus, so the generated text should pick up an encyclopedic style.

In [24]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    AutoModelForCausalLM
)


What this does:

datasets is the Hugging Face library to load datasets like Wikitext easily.

transformers gives us:

Tokenizer — turns text into token IDs the model can understand.

Model — GPT-2 in our case.

DataCollator — handles batch formatting and padding.

TrainingArguments and Trainer — simplify training loops.

Why important: Without these, you’d have to write your own data loader, optimizer, evaluation loop — which is a lot of boilerplate.

1. Load and prepare the dataset

In [25]:
# Load 5000 examples from the training split
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:5000]")

# Remove empty lines (common in Wikitext)
dataset = dataset.filter(lambda ex: len(ex["text"]) > 0)

# Create train/validation split (90% train, 10% validation)
split = dataset.train_test_split(test_size=0.1, seed=42)
train_raw = split["train"]
val_raw = split["test"]


Explanation:

wikitext-2-raw-v1 is a Wikipedia-based dataset for language modeling.

Filtering removes empty strings — these waste computation.

Splitting creates a validation set to measure performance during training.

If skipped:

Without a validation set, you can’t monitor overfitting.

Without filtering, you train on garbage samples.

2. Load tokenizer and model

Tokenization: Explain that models work with numbers, not text. A tokenizer converts text into a format the model understands (input IDs).

In [26]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no pad token; we add one for batching
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # match new vocab size

Embedding(50258, 768)

Explanation:

Tokenizer maps text → integer token IDs using GPT-2’s byte-level BPE vocab (tokens are subword pieces, not always whole words).

GPT-2 doesn’t have a padding token by default, but batching needs it, so we add one.

Model is GPT-2 with a language modeling head.

resize_token_embeddings ensures the model knows about our new pad token.

If skipped:

Without padding, batches of different lengths will crash.

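Without resizing embeddings, the new pad token's ID (50257) falls outside the original 50257-row embedding table, so the first padded batch fails with an index-out-of-range error.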
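
The reason for the resize can be shown with a toy lookup table. This is a plain-Python sketch, not the transformers implementation: the list-of-lists "embedding table" and the `lookup` helper are assumptions for illustration, while the sizes come from the `Embedding(50258, 768)` output above.

```python
ORIGINAL_VOCAB_SIZE = 50257          # GPT-2's vocabulary before adding [PAD]
PAD_TOKEN_ID = ORIGINAL_VOCAB_SIZE   # a newly added token gets the next free ID

def lookup(embedding_table, token_id):
    """Toy embedding lookup: one row per token ID."""
    return embedding_table[token_id]

# Toy table with one tiny vector per token; the real rows hold 768 values.
table = [[0.0] for _ in range(ORIGINAL_VOCAB_SIZE)]

try:
    lookup(table, PAD_TOKEN_ID)
    lookup_failed = False
except IndexError:
    lookup_failed = True  # ID 50257 does not exist in a 50257-row table

# "Resizing" appends a row for the new token, so the lookup now succeeds.
table.append([0.0])
resized_rows = len(table)
pad_vector = lookup(table, PAD_TOKEN_ID)
print(lookup_failed, resized_rows)
```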

3. Tokenize the text

In [27]:
max_length = 512
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, max_length=max_length)

train_tok = train_raw.map(tokenize_fn, batched=True, remove_columns=["text"])
val_tok = val_raw.map(tokenize_fn, batched=True, remove_columns=["text"])

# Make datasets return PyTorch tensors
train_tok.set_format(type="torch", columns=["input_ids", "attention_mask"])
val_tok.set_format(type="torch", columns=["input_ids", "attention_mask"])

Explanation:

truncation=True: cuts long texts at max_length tokens (GPT-2 limit is 1024, but we pick 512 for speed).

map applies our tokenizer to the whole dataset.

remove_columns drops the raw text after tokenization to save memory.

set_format ensures Trainer gets PyTorch tensors directly.

If skipped:

The model can’t understand raw strings.

Without truncation, you’ll get “sequence too long” errors.

4. Create the data collator

In [28]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


Explanation:

The collator batches tokenized data and pads sequences in the batch to the same length.

mlm=False means causal LM (predict next token), not masked LM like BERT.

If wrong:

Setting mlm=True would train GPT-2 in BERT-style — totally different objective.
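
What the collator does with mlm=False can be sketched in plain Python. This is a simplified illustration under stated assumptions (lists instead of tensors, right-side padding, a hypothetical helper name `collate_causal_lm`), not the library's code. Note that the labels are a copy of the inputs; the shift by one position happens inside the model when it computes the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(batch, pad_token_id):
    """Pad every sequence to the longest one and build labels for causal LM."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_causal_lm([[11, 12, 13], [21]], pad_token_id=50257)
print(batch)
```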

Fine-Tuning with the Trainer API: This is the core of the session. The Trainer abstracts away the complex training loop.

5. Define training arguments

Training Arguments: Define the training parameters. Keep them simple for the session.

Common and Useful Training Arguments

Here are some of the most important arguments you might want to add, grouped by function:

For Model Performance

learning_rate: The speed at which the model updates its weights. A smaller value like $5e-5$ (which is 0.00005) is a common starting point.

weight_decay: A regularization technique to prevent the model from becoming too complex and overfitting to the training data. A common value is 0.01.

warmup_steps: The number of initial training steps where the learning rate gradually increases from 0 to its full value. This helps stabilize training. 500 is a reasonable number.

For Logging, Saving & Evaluation

eval_strategy: When to perform evaluation. Set to "steps" or "epoch". (Older transformers releases call this argument evaluation_strategy.)

eval_steps: If using eval_strategy="steps", this sets how often to run evaluation (e.g., every 500 steps).

save_strategy: Same as evaluation, but for saving model checkpoints. Set to "steps" or "epoch".

save_total_limit: Limits the total number of checkpoints saved to avoid filling up your disk.

load_best_model_at_end: A very useful argument. If set to True, the Trainer will load the best-performing model (based on the evaluation metric, which is the evaluation loss by default) at the end of training. It requires save_strategy to match eval_strategy.

For Speed and Efficiency

fp16: Set to True to enable mixed-precision training. This can significantly speed up training on modern GPUs (like those in Colab) and reduce memory usage.

Finding All Possible Arguments

The TrainingArguments class has many more options. To see a complete list with detailed explanations, you can always check the official Hugging Face documentation. It's the best resource for exploring everything you can control.

https://huggingface.co/docs/transformers/en/main_classes/trainer

In [29]:
training_args = TrainingArguments(
    output_dir="./gpt2-wikitext-finetuned",  # save model checkpoints
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    weight_decay=0.01,       # helps prevent overfitting
    warmup_steps=500,        # gradual LR increase
    eval_strategy="steps",
    eval_steps=500,          # evaluate every 500 steps
    save_strategy="steps",
    save_steps=500,          # save every 500 steps
    load_best_model_at_end=True,
    save_total_limit=3,      # only keep last 3 checkpoints
    fp16=True,               # mixed precision for speed
    report_to="none"  # disable W&B, TensorBoard, etc.
)


Teaching moment:

Warmup: starts with small learning rate → more stable.

Weight decay: L2 regularization to keep weights small.

Mixed precision (fp16): speeds up training, uses less GPU memory.
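
The warmup idea can be written out as a small function. This is a sketch of a linear schedule (linear warmup, then linear decay to zero), which is the Trainer's default scheduler type; the function name and the total step count are assumptions for illustration.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR = 5e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 7000  # hypothetical total; the real value depends on dataset size and epochs

for step in (0, 250, 500, 3750, 7000):
    print(step, linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```

The learning rate climbs to its peak at step 500 and then shrinks steadily, so the largest updates happen only after the optimizer has seen some data.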

6. Create the Trainer

In [30]:
import transformers
print(transformers.__version__)
from transformers import TrainingArguments
print(TrainingArguments.__module__)

4.55.0
transformers.training_args


Instantiate Trainer: Combine everything into the Trainer.

In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,        # required because eval_strategy="steps"
    data_collator=data_collator,
    tokenizer=tokenizer,         # still works, but warning suggests 'processing_class' in future
)



Explanation:

Trainer automates the training loop, evaluation, saving, logging.

eval_dataset is essential because we set eval_strategy="steps".

If eval_dataset missing:

Trainer raises an error, because evaluation was requested but there is no dataset to evaluate on.

7. Train the model

Launch Training: Start the fine-tuning process. Explain that the model's weights (W) are being updated via backpropagation to minimize a loss function.

In [32]:
import time

# Start timer
start_time = time.perf_counter()

# Train
trainer.train()

# End timer
end_time = time.perf_counter()

# Calculate and format
elapsed_time = end_time - start_time
minutes, seconds = divmod(elapsed_time, 60)
print(f"Total training time: {int(minutes)} min {seconds:.2f} sec")

Step,Training Loss,Validation Loss
500,3.6647,3.271539
1000,3.2269,3.176355
1500,3.0215,3.144944
2000,2.7327,3.12207
2500,2.5613,3.162137
3000,2.4589,3.203
3500,2.2954,3.195988
4000,2.152,3.250006
4500,2.0998,3.317948
5000,2.0019,3.312302


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Total training time: 29 min 36.13 sec
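
The log is easier to interpret after two small computations: finding the evaluation step with the lowest validation loss, and converting loss to perplexity (the exponential of the loss). The sketch below uses only the numbers printed above; the variable names are illustrative.

```python
import math

# (step, training loss, validation loss) copied from the log above
log = [
    (500, 3.6647, 3.271539),
    (1000, 3.2269, 3.176355),
    (1500, 3.0215, 3.144944),
    (2000, 2.7327, 3.12207),
    (2500, 2.5613, 3.162137),
    (3000, 2.4589, 3.203),
    (3500, 2.2954, 3.195988),
    (4000, 2.152, 3.250006),
    (4500, 2.0998, 3.317948),
    (5000, 2.0019, 3.312302),
]

best_step, _, best_val_loss = min(log, key=lambda row: row[2])
best_perplexity = math.exp(best_val_loss)
last_perplexity = math.exp(log[-1][2])

print(f"best step: {best_step}, validation perplexity: {best_perplexity:.2f}")
print(f"last logged step: {log[-1][0]}, validation perplexity: {last_perplexity:.2f}")
```

Training loss keeps falling while validation loss is lowest at step 2000 and rises afterwards, which is the classic overfitting pattern. With load_best_model_at_end=True, the Trainer reloads the checkpoint with the lowest evaluation loss instead of the last one.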


Generate Text! (The Fun Part): Use the fine-tuned model to generate text. Show how it has adopted the "style" of the training data.

In [38]:
prompt = "Himalaya mountains are "

# Tokenize with attention mask, send to GPU
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

# Set pad token explicitly to avoid confusion
model.config.pad_token_id = tokenizer.pad_token_id

# Generate
outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],  # key fix
    max_length=100,
    num_return_sequences=1
)

# Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Himalaya mountains are 

the most popular of the Himalayan mountains . The mountains are the most popular of the Himalayan mountains , and are the most popular of the Himalayan mountains . The Himalayan mountains are the most popular of the Himalayan mountains , and are the most popular of the Himalayan mountains . The Himalayan mountains are the most popular of the Himalayan mountains , and are the most popular of the Himalayan mountains . The Himalayan mountains are the most popular of the


In [39]:
prompt = "photosynthesis is a function"

# Tell tokenizer to use EOS as pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Tokenize with attention mask, send to GPU
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

# Generate
outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_length=100,
    num_return_sequences=1,
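    temperature=0.7,         # ignored here: sampling is off unless do_sample=True (see the warning below)
    top_k=50,                # likewise only used when sampling
    repetition_penalty=1.2   # reduce loops (this one also applies to greedy decoding)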
)

# Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


photosynthesis is a function of phosphorus and nitrogen that occurs in the same chemical form as water . The two elements are formed by oxidation , which produces oxygen ( carbon ) or hydrogen ( hydrogen ) ; while the other element is converted to air when it reacts with oxygen atoms on its surface . 
 = Synthesis of organic compounds : A. hygienic acid + SbF2 → HCl −H3O 3 -hydroxyl group synthesis requires an electron transfer reaction involving both ox


In [40]:
# Make sure pad token is set (do this once)
tokenizer.pad_token = tokenizer.eos_token          # safe: reuse eos as pad
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "Mahatma Gandhi is "
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=120,         # generate up to 120 new tokens
    do_sample=True,             # enable sampling so temperature/top_k/top_p take effect
    temperature=0.7,            # softness of sampling (0.7 is often good)
    top_k=50,                   # sample from top 50 tokens
    top_p=0.9,                  # or use nucleus sampling
    repetition_penalty=1.15,    # discourage immediate repetition
    no_repeat_ngram_size=3,     # avoid repeating the same 3-gram
    pad_token_id=tokenizer.pad_token_id,  # explicit
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Mahatma Gandhi is 
a popular deity in India , who was worshipped at various temples . In modern times , the god of Shiva and other gods may be divided into three classes : deities @-@ based , divinities @-# based and a monotheistic religion ( such as Hinduism ) , which emphasizes tolerance and inclusion of all human beings ; and non @- @ -based religions ( like Indian religious belief ) , often called " polytheism " , that focus on freedom from constraints rather than fear and guilt . According to Mahatmasand 's writings , he believed that people should not judge


In [41]:
# Make sure pad token is set (do this once)
tokenizer.pad_token = tokenizer.eos_token          # safe: reuse eos as pad
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "The Amazon River is"
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=120,         # generate up to 120 new tokens
    do_sample=False,            # greedy decoding: temperature/top_k/top_p are ignored (see the warning below)
    temperature=0.7,            # softness of sampling (0.7 is often good)
    top_k=50,                   # sample from top 50 tokens
    top_p=0.9,                  # or use nucleus sampling
    repetition_penalty=1.15,    # discourage immediate repetition
    no_repeat_ngram_size=3,     # avoid repeating the same 3-gram
    pad_token_id=tokenizer.pad_token_id,  # explicit
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


The Amazon River is a major river in the state of New Mexico . It flows through several states and includes many communities , including Galveston , Texas , and San Luis Obispo County . The city of Fort Collins is home to two museums , one located on the west end ( north of downtown ) and another on its east side ( south of downtown ). 
Sarasota National Park has been designated as an international park for conservation purposes by UNESCO ; it was established after World Heritage Site 's decision to close down the Oro @-@ La Salle area due largelyto environmental concerns over the water quality


In [42]:
# Make sure pad token is set (do this once)
tokenizer.pad_token = tokenizer.eos_token          # safe: reuse eos as pad
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "In physics, quantum mechanics is"
encoding = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    max_new_tokens=120,         # generate up to 120 new tokens
    do_sample=True,             # enable sampling so temperature/top_k/top_p take effect
    temperature=0.1,            # very low temperature: output is close to greedy decoding
    top_k=50,                   # sample from top 50 tokens
    top_p=0.9,                  # or use nucleus sampling
    repetition_penalty=1.5,    # discourage immediate repetition
    no_repeat_ngram_size=3,     # avoid repeating the same 3-gram
    pad_token_id=tokenizer.pad_token_id,  # explicit
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In physics, quantum mechanics is a branch of geometry , which describes the interaction between two particles . The fundamental properties of this system include : 
 = - π + → – 1 @-@ n ( Sb ) ; and  2 − 0 sigma /Sd2 as an index for its stability in space or time on Earth ! This property has been described by Wheeler 's Theory of Relativity with respect to all other theories that deal specifically with gravity at room temperature conditions such as those involving magnetism from inside out. In his book Beyond All Other Knowledge he states that " there are no more


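The generated text rambles and gets facts wrong because the model:

Is fine-tuned on a tiny corpus (about 5,000 Wikitext lines, far too little to teach it new facts). It was also trained for 10 epochs, and the log above shows validation loss rising after step 2000, a sign of overfitting.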

Doesn’t have domain-specific grounding (it’s just a generic GPT-2 fine-tune).

Uses a generation strategy that still leaves room for randomness (top-k, temperature).

To increase accuracy (more factually correct & coherent answers to prompts like "How does photosynthesis work?"), you can work at three levels:

1  Training Stage – Make the model smarter
More data: Train on larger, cleaner datasets about biology/science instead of generic Wikitext. E.g., SQuAD, Wikipedia Science subset, Khan Academy transcripts.

Sensible number of epochs: 3–5 passes over the data, with early stopping if eval loss stops improving (the 10-epoch run above kept training long after validation loss bottomed out).

Smaller learning rate: E.g., 2e-5 instead of 5e-5 to avoid overwriting pre-trained weights too aggressively.

Use evaluation dataset: So you can monitor overfitting and pick the best checkpoint.

Domain-specific fine-tuning: If your goal is biology Q&A, curate a biology text corpus.
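
Early stopping is simple enough to sketch by hand. The helper below is a plain-Python illustration (its name and the patience value of 3 are assumptions); it replays the validation losses logged during this session's run. In transformers, the same idea is available as EarlyStoppingCallback.

```python
def early_stopping_step(eval_log, patience):
    """Return the step at which training would stop, or None if it never triggers."""
    best_loss = float("inf")
    evals_without_improvement = 0
    for step, loss in eval_log:
        if loss < best_loss:
            best_loss = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return step
    return None

# (step, validation loss) from the training log of this session
eval_log = [
    (500, 3.271539), (1000, 3.176355), (1500, 3.144944), (2000, 3.12207),
    (2500, 3.162137), (3000, 3.203), (3500, 3.195988), (4000, 3.250006),
    (4500, 3.317948), (5000, 3.312302),
]

stop_step = early_stopping_step(eval_log, patience=3)
print("training would stop at step", stop_step)
```

With a patience of 3 evaluations, the run would have ended well before step 5000, saving time without losing the best checkpoint.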

2   Generation Stage – Guide the answer
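Lower randomness (these only take effect with do_sample=True):
temperature=0.3,  # Less creative, more precise
top_k=20,
top_p=0.9
Increase repetition_penalty to avoid loops:
repetition_penalty=1.5
Set max_new_tokens instead of max_length: max_length counts the prompt tokens as well, so a long prompt leaves less room for the generated answer.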
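
How temperature, top_k and top_p narrow the choice of the next token can be sketched on a toy distribution. This is a simplified plain-Python illustration, not the library's implementation: the function names and the made-up logits are assumptions, and the real generate() applies these as successive filters on tensors.

```python
import math
import random

def _renormalize(pairs):
    total = sum(prob for _, prob in pairs)
    return [(tok, prob / total) for tok, prob in pairs]

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Token probabilities left after temperature scaling, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax (values below 1 sharpen it).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    ranked = sorted(
        ((tok, math.exp(value - peak)) for tok, value in scaled.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    ranked = _renormalize(ranked)
    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        ranked = _renormalize(ranked[:top_k])
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return dict(_renormalize(kept))

toy_logits = {"plants": 2.0, "light": 1.0, "banana": 0.0, "Texas": -1.0}  # made-up scores
wide = sampling_distribution(toy_logits)                      # nothing filtered
top2 = sampling_distribution(toy_logits, top_k=2)             # two candidates survive
narrow = sampling_distribution(toy_logits, temperature=0.3, top_k=2, top_p=0.9)

rng = random.Random(0)
choice = rng.choices(list(narrow), weights=list(narrow.values()), k=1)[0]
print(wide, top2, narrow, choice, sep="\n")
```

Lower temperature and tighter top_k/top_p leave fewer unlikely tokens to sample from, which reduces rambling but does not make the model more knowledgeable.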

3   Prompt Engineering – Ask better
Instead of:
How does photosynthesis work?
Try:
Explain photosynthesis step-by-step as a science teacher for 10th grade students.
Or:
Explain photosynthesis in 5 clear bullet points.
The more context and instruction you give, the more structured and accurate the output.

Recap of what each step teaches:

Load dataset → teaches reproducibility and data cleaning.

Tokenizer → explains how models process text as numbers.

Train/val split → introduces concept of evaluation and avoiding overfitting.

Collator → shows how batch padding works.

Training arguments → gives intuition for hyperparameters and resource management.

Trainer → shows benefits of using high-level APIs.

Training loop → connects all components together.

Assignment

Objective

In this assignment, you will fine-tune a pre-trained Transformer model (from Hugging Face) for a question-answering task using a publicly available dataset.
You will follow the same steps we practiced for text generation but adapt them for extractive QA.

Tasks
1. Dataset Selection
Use squad (the Stanford Question Answering Dataset) from Hugging Face Datasets.

2. Environment Setup
Install required libraries: transformers, datasets, evaluate, accelerate.

3. Load & Explore the Dataset

Load the dataset using datasets.load_dataset().

Display sample questions, contexts, and answers.

Explain the dataset’s structure.

4. Preprocessing

Use a tokenizer suitable for QA tasks, e.g., BertTokenizerFast or DistilBertTokenizerFast.

Tokenize question-context pairs with:

truncation="only_second"

max_length=384

stride=128

return_overflowing_tokens=True

return_offsets_mapping=True

Create train and validation datasets.
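
The effect of max_length, stride and return_overflowing_tokens is that a long context is cut into overlapping windows, so an answer near a cut still appears whole in at least one window. The plain-Python sketch below illustrates the windowing only; the helper name and toy sizes are assumptions, and the real tokenizer also reserves room in each window for the question and special tokens.

```python
def sliding_windows(context_tokens, window, stride):
    """Split a long token list into windows of `window` tokens that overlap by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must be smaller than the window size")
    chunks, start = [], 0
    while True:
        chunks.append(context_tokens[start:start + window])
        if start + window >= len(context_tokens):
            break
        start += window - stride  # step forward, keeping `stride` tokens of overlap
    return chunks

tokens = list(range(10))  # stand-in for 10 context token IDs
chunks = sliding_windows(tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Because one example can produce several windows, the tokenizer also returns overflow_to_sample_mapping, which records the original example each window came from.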

5. Model Selection

Use the pre-trained "distilbert-base-uncased" checkpoint for QA:

Load the model with AutoModelForQuestionAnswering.

6. Define Training Arguments

Use TrainingArguments with:

eval_strategy="epoch"

learning_rate=3e-5

per_device_train_batch_size=8

per_device_eval_batch_size=8

num_train_epochs=2

weight_decay=0.01

7. Define Trainer

Pass:

model

args

train_dataset

eval_dataset

tokenizer

compute_metrics (use evaluate library with F1 and EM scores).
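
For intuition about the two scores, here is a plain-Python sketch of exact match (EM) and token-level F1 on answer strings, written in the style of SQuAD evaluation (lowercasing, stripping punctuation and articles). The function names are illustrative; in the assignment, the squad metric from the evaluate library reports both scores.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

EM is all-or-nothing, while F1 gives partial credit when the predicted span overlaps the reference answer.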

8. Train the Model

Measure training time with time.time().

Save the model and tokenizer.

9. Evaluate & Test

Pick a custom question and context.

Tokenize and pass through the model.

Convert start and end logits to the answer span.

Print predicted answer.
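
Converting logits to an answer span means picking the start and end positions with the highest combined score, subject to start <= end. The sketch below is a simplified plain-Python illustration with made-up logits and a hypothetical helper name; a full solution would also restrict the search to context tokens and map token positions back to characters with the offset mapping.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = None
    for start, s_logit in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

tokens = ["[CLS]", "where", "?", "[SEP]", "in", "paris", ",", "france", "[SEP]"]
# Made-up logits: the highest end logit (index 4) comes BEFORE the highest start logit (index 5).
start_logits = [0.1, 0.0, 0.0, 0.0, 1.0, 4.0, 0.2, 2.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 3.6, 1.5, 0.1, 3.5, 0.0]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

Taking the argmax of each logit vector independently would give start 5 and end 4 here, which is not a valid span; searching over valid pairs avoids that.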

10. Submission Requirements

Jupyter Notebook (.ipynb) with:

All steps clearly marked.

Plots or printed logs for training.

At least 2 custom test examples with outputs.

A short report (1 page) describing:

Dataset used

Model used

Accuracy/F1 score achieved

Any challenges faced

Example Deliverable

from datasets import load_dataset

dataset = load_dataset("squad")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

context = dataset["train"][0]["context"]

question = dataset["train"][0]["question"]

inputs = tokenizer(question, context, return_tensors="pt")

(Students will build from here)

In [None]:
# %% [markdown]
# Notebook 1 — Full Solution: Fine-tune DistilBERT on SQuAD (extractive QA)
# This notebook is a runnable, end-to-end example that: loads SQuAD, preprocesses,
# fine-tunes a DistilBERT model for question answering, evaluates it, and shows
# inference on custom QA examples.

# %% [markdown]
# Requirements
# ```bash
# pip install transformers datasets evaluate accelerate
# ```

# %%
import os
import time
from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer
)
import numpy as np
import torch

# %% [markdown]
# 1. Config
MODEL_NAME = "distilbert-base-uncased"
MAX_LENGTH = 384
DOC_STRIDE = 128
BATCH_SIZE = 8
NUM_EPOCHS = 2
OUTPUT_DIR = "./qa_distilbert_squad"
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', DEVICE)

# %% [markdown]
# 2. Load dataset
raw_datasets = load_dataset("squad")
print(raw_datasets)

# %% [markdown]
# 3. Tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(DEVICE)

# %% [markdown]
# 4. Preprocessing helpers

# We will follow the standard Hugging Face approach for QA tokenization:
# - pair: question (first) + context (second)
# - truncation='only_second' to truncate long contexts
# - return_overflowing_tokens to handle long contexts producing multiple features
# - return_offsets_mapping to map token positions back to character positions


def prepare_train_features(examples):
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features, we need a map from feature -> example
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized_examples.sequence_ids(i)

        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        if len(answers["answer_start"]) == 0:
            # For unanswerable questions (SQuAD v2) — here using SQuAD v1 so not hit
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find token start and end that contain the answer
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # If the answer is outside the span, label with CLS
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                # Otherwise find the exact token indices
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)

                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions
    return tokenized_examples

# %% [markdown]
# 5. Tokenize dataset
train_dataset = raw_datasets["train"].map(
    prepare_train_features,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

validation_dataset = raw_datasets["validation"].map(
    prepare_train_features,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)

print(train_dataset.features)

# %% [markdown]
# 6. Training arguments and Trainer
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",              # evaluate once per epoch
    learning_rate=3e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=0.01,
    save_total_limit=2,
    logging_strategy="steps",           # log training loss during the epoch
    logging_steps=50,                   # every 50 steps
    report_to="none",                   # disable W&B and other loggers
    fp16=torch.cuda.is_available(),     # mixed precision only when a GPU is available
)

# Use the default data collator: features were already padded to MAX_LENGTH during
# tokenization (padding="max_length"), so it only stacks them into tensors
from transformers import default_data_collator

def compute_metrics(p):
    # p.predictions is (start_logits, end_logits); p.label_ids is (start_positions, end_positions)
    start_logits, end_logits = p.predictions
    start_labels, end_labels = p.label_ids
    start_preds = np.argmax(start_logits, axis=1)
    end_preds = np.argmax(end_logits, axis=1)
    # Note: the official SQuAD EM/F1 requires mapping predictions back to the original text.
    # For simplicity, this demo reports the fraction of features whose predicted start AND end
    # token indices both match the labels (not the official SQuAD metric).
    em = ((start_preds == start_labels) & (end_preds == end_labels)).mean()
    return {"simple_em": float(em)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)

# %% [markdown]
# 7. Train (timed)
start = time.perf_counter()
trainer.train()
end = time.perf_counter()
print(f"Training time: {end - start:.2f} seconds")

# %% [markdown]
# 8. Basic evaluation (using the Trainer evaluate)
eval_res = trainer.evaluate()
print(eval_res)

# %% [markdown]
# 9. Inference: convert model logits back to text answers
from transformers import pipeline
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)

examples = [
    {
        "context": raw_datasets["train"][10]["context"],
        "question": raw_datasets["train"][10]["question"]
    },
    {
        "context": "Photosynthesis is the process by which green plants...",
        "question": "What is photosynthesis?"
    }
]

for ex in examples:
    res = qa_pipeline(question=ex["question"], context=ex["context"])
    print("Q:", ex["question"])
    print("A:", res)

# %% [markdown]
# 10. Save model & tokenizer
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

# End of full solution notebook



Device: cuda


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': List(Value('int32')), 'attention_mask': List(Value('int8')), 'start_positions': Value('int64'), 'end_positions': Value('int64')}


In [None]:
# Student's template
# %% [markdown]
# ---------------------------------------------------------------------
# Notebook 2 — Student Template (skeleton)
# Students should fill in the TODOs. Keep this file as their working notebook.

# %% [markdown]
# Requirements:
# pip install transformers datasets evaluate accelerate

# %%
# TODO: import necessary libraries
import time
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
import torch

# %% [markdown]
# 1. Configuration (students: choose model and hyperparams)
MODEL_NAME = "distilbert-base-uncased"  # TODO: try bert-base-uncased for better results
MAX_LENGTH = 384
DOC_STRIDE = 128
BATCH_SIZE = 8
NUM_EPOCHS = 2
OUTPUT_DIR = "./student_qa_model"
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', DEVICE)

# %% [markdown]
# 2. Load dataset (TODO: choose dataset from Hugging Face)
# Suggested: 'squad' or 'squad_v2'
raw_datasets = load_dataset("squad")
print(raw_datasets)

# %% [markdown]
# 3. Tokenizer & Model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(DEVICE)

# %% [markdown]
# 4. Preprocessing helper - TODO: implement or copy from instructor solution

def prepare_train_features(examples):
    # TODO: implement tokenization with return_overflowing_tokens and offsets
    raise NotImplementedError

# %% [markdown]
# 5. Tokenize dataset (TODO)
# train_dataset = raw_datasets['train'].map(...)
# validation_dataset = raw_datasets['validation'].map(...)

# %% [markdown]
# 6. TrainingArguments & Trainer (TODO)
# training_args = TrainingArguments(...)
# trainer = Trainer(...)

# %% [markdown]
# 7. Train and time it (TODO)
# start = time.perf_counter(); trainer.train(); end = time.perf_counter()

# %% [markdown]
# 8. Evaluate & test on custom examples (TODO)

# %% [markdown]
# Submission instructions:
# - Upload your completed notebook (.ipynb)
# - Include brief report (1 page) with: dataset choice, model, hyperparameters, metrics, challenges
# - Provide 2 custom QA examples and the model's answers

# Good luck!