# Lab 3: Prompt Tuning - Fine-Tuning a T5 Model for Summarization
---
## Notebook 2: The Training Process

**Goal:** In this notebook, you will fine-tune a `t5-small` model on a text summarization task using **Prompt Tuning**.

**You will learn to:**
-   Load a dataset for a sequence-to-sequence task (summarization) and preprocess it.
-   Load a pre-trained T5 model.
-   Understand and configure `peft.PromptTuningConfig`.
-   Apply soft prompts to the T5 model.
-   Fine-tune the model by training *only* the soft prompt embeddings using the `transformers.Trainer`.


### Step 1: Load Dataset and Preprocess

We will use the `billsum` dataset, which pairs the text of US congressional bills with reference summaries. Summarization is a classic sequence-to-sequence (seq2seq) task: the model reads one sequence and generates another.

#### Key Hugging Face Components:

-   `transformers.AutoTokenizer`: We'll load the tokenizer for `t5-small`. For T5, it's common practice to prepend a task-specific prefix to the input text (e.g., "summarize: ").
-   `dataset.map()`: We'll create a function to tokenize both the input text (`text`) and the target summary (`summary`). The tokenized summary will be our `labels`.


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# --- Load Dataset ---
# We'll take a small slice for a quick demonstration
dataset = load_dataset("billsum", split="train[:500]")
dataset = dataset.train_test_split(test_size=0.1, seed=42)  # fixed seed for a reproducible split

# --- Preprocessing Function ---
def preprocess_function(examples):
    # T5 expects a prefix for summarization tasks
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")
    
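    # Tokenize the target summaries; `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True, padding="max_length")

    # Replace padding ids with -100 so padded positions are ignored by the loss
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]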
    return model_inputs

# --- Apply Preprocessing ---
tokenized_datasets = dataset.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text", "summary", "title"])
tokenized_datasets.set_format("torch")

print("✅ Dataset loaded and preprocessed.")
print(tokenized_datasets["train"][0].keys())


### Step 2: Load the Base Model

Next, we load the `t5-small` model. Since this is a seq2seq task, we can use `AutoModelForSeq2SeqLM`.


In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

print("✅ Base T5 model loaded.")


### Step 3: Configure Prompt Tuning

Now we configure Prompt Tuning. The idea is to create a small, trainable embedding that acts as a "soft prompt" to guide the frozen base model.
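
To make the idea concrete, here is a minimal pure-Python sketch of what "prepending a soft prompt" means at the embedding level. It is an illustration under simplifying assumptions, not the `peft` implementation: the helper names `prepend_soft_prompt` and `extend_attention_mask` are hypothetical, and vectors are plain lists rather than tensors.

```python
from typing import List

Vector = List[float]


def prepend_soft_prompt(soft_prompt: List[Vector], token_embeddings: List[Vector]) -> List[Vector]:
    """Place the trainable prompt vectors in front of the frozen token embeddings."""
    return [list(v) for v in soft_prompt] + [list(v) for v in token_embeddings]


def extend_attention_mask(attention_mask: List[int], num_virtual_tokens: int) -> List[int]:
    """Virtual tokens are always attended to, so their mask entries are 1."""
    return [1] * num_virtual_tokens + list(attention_mask)


# Toy example: 2 virtual tokens, 3 real tokens (the last one is padding), hidden size 4
soft_prompt = [[0.1] * 4, [0.2] * 4]
token_embeddings = [[1.0] * 4, [2.0] * 4, [0.0] * 4]
attention_mask = [1, 1, 0]

model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
full_mask = extend_attention_mask(attention_mask, num_virtual_tokens=len(soft_prompt))

print(len(model_input), full_mask)  # 5 [1, 1, 1, 1, 0]
```

During training, only the values in `soft_prompt` would receive gradient updates; the token embeddings and every other model weight stay frozen.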

#### Key Hugging Face `peft` Components:

-   `peft.PromptTuningConfig`: The configuration class for this method.
    -   `task_type="SEQ_2_SEQ_LM"`: We must specify the task type. For T5, it's sequence-to-sequence language modeling.
    -   `prompt_tuning_init="TEXT"`: How to initialize the soft prompt embeddings. `"TEXT"` means we'll initialize them using the vocabulary embeddings of a specific text string. This can provide a better starting point than random initialization.
    -   `num_virtual_tokens`: The length of the soft prompt. This is the main hyperparameter to tune. It's the number of trainable embedding vectors we will create.
    -   `prompt_tuning_init_text`: The text to use for initialization if `prompt_tuning_init="TEXT"`. The text is tokenized, and the token ids are truncated or repeated to fill the virtual-token slots; each slot then starts as a copy of the corresponding token's vocabulary embedding.
    -   `tokenizer_name_or_path`: We must provide the path to the tokenizer to be used for the text initialization.
-   `peft.get_peft_model`: As before, this function applies the configuration to our base model.
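
The sizing and text-initialization rules above can be sketched in plain Python. This is a hedged illustration, not code from `peft`: `fit_init_token_ids` and `soft_prompt_param_count` are hypothetical helpers, and the token ids are made up. The hidden size of 512 is the `d_model` of `t5-small`. The `num_prompt_copies` argument is an assumption to cover the case where the library allocates more than one soft prompt for an encoder-decoder model (for example, one for the encoder and one for the decoder), which would multiply the count.

```python
import math
from typing import List


def fit_init_token_ids(init_token_ids: List[int], total_virtual_tokens: int) -> List[int]:
    """Trim or cycle the init text's token ids to exactly `total_virtual_tokens` entries."""
    if not init_token_ids:
        raise ValueError("the initialization text produced no tokens")
    reps = math.ceil(total_virtual_tokens / len(init_token_ids))
    return (list(init_token_ids) * reps)[:total_virtual_tokens]


def soft_prompt_param_count(num_virtual_tokens: int, hidden_size: int, num_prompt_copies: int = 1) -> int:
    """Each virtual token is one trainable vector of length `hidden_size`."""
    return num_virtual_tokens * hidden_size * num_prompt_copies


# Made-up token ids, for illustration only
short_text_ids = [11, 22, 33]
long_text_ids = [11, 22, 33, 44, 55]

print(fit_init_token_ids(short_text_ids, 8))  # [11, 22, 33, 11, 22, 33, 11, 22]
print(fit_init_token_ids(long_text_ids, 3))   # [11, 22, 33]
print(soft_prompt_param_count(8, 512))        # 4096
```

Compare the last number, or a small multiple of it, with what `print_trainable_parameters()` reports: it should be a tiny fraction of the base model's roughly 60 million parameters.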


In [None]:
from peft import get_peft_model, PromptTuningConfig, TaskType

# --- Prompt Tuning Configuration ---
prompt_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init="TEXT",
    num_virtual_tokens=8,
    prompt_tuning_init_text="Summarize the following congressional bill:",
    tokenizer_name_or_path=model_checkpoint,
)

# --- Create PeftModel ---
peft_model = get_peft_model(model, prompt_config)

# --- Print Trainable Parameters ---
peft_model.print_trainable_parameters()


### Step 4: Set Up Training

The final step is to configure and run training with `transformers.Trainer`, much as in the previous labs. ROUGE is the standard metric for summarization, but computing it requires generating summaries during evaluation; the `Trainer` below is not given a `compute_metrics` function, so it reports only the evaluation loss at the end of each epoch.
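
For intuition about what ROUGE measures, here is a minimal pure-Python sketch of ROUGE-1 F1, which scores unigram overlap between a generated summary and a reference. It is an illustration only: the function name `rouge1_f1` is hypothetical, and lowercasing plus whitespace splitting is a simplification. A real evaluation would normally use an established ROUGE implementation, which also handles stemming and reports ROUGE-2 and ROUGE-L.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the bill funds roads", "the bill funds schools"))  # 0.75
```

Three of the four words match in each direction, so precision and recall are both 0.75, and so is their harmonic mean.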


In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# --- Training Arguments ---
training_args = TrainingArguments(
    output_dir="./t5-prompt-tuning-billsum",
    auto_find_batch_size=True, # Automatically find a batch size that fits
    learning_rate=1e-3, # Higher learning rate is common for PEFT methods
    num_train_epochs=5,
    logging_steps=50,
    evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# --- Create Trainer ---
# DataCollatorForSeq2Seq batches the examples and, given the model, builds decoder_input_ids from the labels
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# --- Start Training ---
print("🚀 Starting training with Prompt Tuning...")
trainer.train()
print("✅ Training complete!")
