# Lab 4: Prefix Tuning - Fine-Tuning a GPT-2 Model for Generation
---
## Notebook 2: The Training Process

**Goal:** In this notebook, you will fine-tune a `gpt2` model on a text generation task using **Prefix Tuning**. We'll train it to generate positive reviews of movies.

**You will learn to:**
-   Load a dataset for text generation (`imdb`) and preprocess it.
-   Load a pre-trained GPT-2 model.
-   Deeply understand and configure `peft.PrefixTuningConfig`.
-   Apply prefixes to the GPT-2 model.
-   Fine-tune the model by training *only* the prefix vectors using the `transformers.Trainer`.


### Step 1: Load Dataset and Preprocess

We will use the `imdb` dataset, which contains movie reviews. We'll filter it to only use the positive reviews (`label=1`) to teach the model how to generate text in a specific, positive style.

#### Key Hugging Face Components:

-   `transformers.AutoTokenizer`: We'll load the tokenizer for `gpt2`. GPT-2's tokenizer does not define a padding token, so we reuse the `eos_token` as the `pad_token` to allow padded batches.
-   `dataset.filter()`: Used to select only the positive reviews from the dataset.
-   `dataset.map()`: We'll tokenize the reviews. For causal language modeling, the `labels` are a copy of the `input_ids`; the model shifts them by one position internally when computing the loss.
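
To make the relationship between `input_ids` and `labels` concrete, here is a minimal pure-Python sketch of next-token targets with padded positions ignored. The helper `build_causal_lm_targets` and the token IDs are hypothetical and only for illustration; `-100` is the label value that PyTorch's cross-entropy loss skips by default, and the `DataCollatorForLanguageModeling(mlm=False)` used in Step 4 is expected to mask padded labels in a similar way.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def build_causal_lm_targets(input_ids, attention_mask):
    """Illustrative helper (not part of transformers).

    Returns the masked labels and the (position, input_token, target_token)
    triples that would contribute to the loss.
    """
    # Labels start as a copy of input_ids, with padded positions masked out.
    labels = [tok if keep == 1 else IGNORE_INDEX
              for tok, keep in zip(input_ids, attention_mask)]
    # The prediction made at position t is scored against the label at t + 1.
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != IGNORE_INDEX:
            pairs.append((t, input_ids[t], target))
    return labels, pairs

PAD_ID = 50256  # GPT-2's eos_token_id, reused here as the pad token
input_ids = [1212, 3807, 318, PAD_ID, PAD_ID]  # three made-up tokens, then padding
attention_mask = [1, 1, 1, 0, 0]

labels, pairs = build_causal_lm_targets(input_ids, attention_mask)
print(labels)
print(pairs)
```

Only the first two positions produce a loss term in this toy case, because every later target is padding.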


In [45]:
from datasets import load_dataset
from transformers import AutoTokenizer

model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token # Set pad token

# --- Load and Filter Dataset ---
dataset = load_dataset("imdb", split="train")
# Filter for positive reviews only (in the imdb dataset, label 1 = positive, 0 = negative)
positive_dataset = dataset.filter(lambda example: example["label"] == 1)
positive_dataset = positive_dataset.train_test_split(test_size=0.1)


# --- Preprocessing Function ---
def preprocess_function(examples):
    # Tokenize the text
    outputs = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)
    # For language modeling, the labels are the same as the input_ids
    outputs["labels"] = outputs["input_ids"]
    return outputs

# --- Apply Preprocessing ---
tokenized_datasets = positive_dataset.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text", "label"])
tokenized_datasets.set_format("torch")




In [46]:
print("✅ Dataset loaded and preprocessed.")
print(tokenized_datasets["train"])

✅ Dataset loaded and preprocessed.
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 11250
})
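
As a quick sanity check on the row count above, the arithmetic can be sketched as follows. It assumes the IMDb training split has 25,000 reviews evenly divided between the two classes, and that a 10% test split simply rounds to the nearest whole example; `split_sizes` is a hypothetical helper, not a `datasets` function.

```python
def split_sizes(n_examples, test_size):
    """Approximate train/test sizes for a fractional test_size (simplified)."""
    n_test = round(n_examples * test_size)
    n_train = n_examples - n_test
    return n_train, n_test

n_single_class = 25_000 // 2  # assumed: half of the training split per class
train_rows, test_rows = split_sizes(n_single_class, 0.1)
print(train_rows, test_rows)
```

Because both classes have the same size, the row count alone cannot tell you which label was kept; printing a few decoded examples is a more reliable check.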


### Step 2: Load the Base Model

Next, we load the `gpt2` model. Since this is a text generation task, we use `AutoModelForCausalLM`.


In [47]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

print("✅ Base GPT-2 model loaded.")


✅ Base GPT-2 model loaded.


### Step 3: Configure Prefix Tuning

Here we configure Prefix Tuning. Prompt Tuning adds trainable vectors only at the input embedding layer. Prefix Tuning instead injects its trainable parameters (the prefix) as extra key/value vectors into the attention mechanism of *every* transformer layer, which gives it more direct influence over generation and generally makes it more expressive.
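
The idea can be sketched for a single attention head in one layer with plain Python. This is a toy illustration under simplifying assumptions (one query, two-dimensional vectors, made-up numbers), not how `peft` implements it: the frozen model supplies keys and values for the real tokens, and the trainable prefix contributes additional key/value vectors that the query also attends to.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Keys/values the frozen model computes from two real tokens (toy numbers).
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 2.0], [3.0, 4.0]]

# Trainable prefix for this layer: one extra key/value pair (toy numbers).
prefix_keys = [[0.5, 0.5]]
prefix_values = [[10.0, 10.0]]

query = [1.0, 1.0]

weights_plain, out_plain = attend(query, token_keys, token_values)
weights_prefix, out_prefix = attend(
    query, prefix_keys + token_keys, prefix_values + token_values
)
print(out_plain, out_prefix)
```

The prefix changes the attention output without touching any of the model's own weights, which is why only the prefix needs gradients during training.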

#### Key Hugging Face `peft` Components:

-   `peft.PrefixTuningConfig`: The configuration class for this method.
    -   `task_type="CAUSAL_LM"`: We specify the task type for causal language modeling.
    -   `num_virtual_tokens`: The length of the prefix, and the main hyperparameter. It sets how many trainable key/value positions are prepended in each attention layer.
-   `peft.get_peft_model`: Applies the configuration to our base model.


In [48]:
from peft import get_peft_model, PrefixTuningConfig, TaskType

# --- Prefix Tuning Configuration ---
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20 # This is the length of the prefix
)

# --- Create PeftModel ---
peft_model = get_peft_model(model, prefix_config)

# --- Print Trainable Parameters ---
peft_model.print_trainable_parameters()


trainable params: 368,640 || all params: 124,808,448 || trainable%: 0.2954
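
The trainable parameter count can be reproduced by hand. The sketch below assumes the standard `gpt2` (small) architecture with 12 transformer layers and a hidden size of 768, and assumes the prefix stores one key vector and one value vector per virtual token per layer, with no extra projection network.

```python
num_virtual_tokens = 20  # from the PrefixTuningConfig above
num_layers = 12          # assumed: gpt2 (small)
hidden_size = 768        # assumed: gpt2 (small)
kv_factor = 2            # one key vector and one value vector

prefix_params = num_virtual_tokens * num_layers * kv_factor * hidden_size
all_params = 124_808_448  # total reported by print_trainable_parameters()
base_params = all_params - prefix_params
trainable_percent = 100 * prefix_params / all_params

print(f"prefix params: {prefix_params:,}")
print(f"frozen base params: {base_params:,}")
print(f"trainable%: {trainable_percent:.4f}")
```

Under these assumptions the count scales linearly with `num_virtual_tokens`, so doubling the prefix length doubles the number of trainable parameters.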


### Step 4: Set Up Training

The final step is to configure and run the training process using the `transformers.Trainer`.
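
Before launching a run, it can help to estimate how many optimizer steps and evaluations the configuration implies. The sketch below is a rough estimate under stated assumptions: a single device, the default per-device batch size of 8, no gradient accumulation, and the default `save_steps` of 500. With `auto_find_batch_size=True` the batch size may be reduced if memory runs out, which would increase the step count.

```python
import math

train_rows = 11_250          # size of tokenized_datasets["train"]
per_device_batch_size = 8    # assumed: TrainingArguments default
num_devices = 1              # assumed
grad_accum_steps = 1         # assumed: TrainingArguments default
num_train_epochs = 5
eval_steps = 100
save_steps = 500             # assumed: TrainingArguments default

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
num_evals = total_steps // eval_steps

# load_best_model_at_end expects checkpoints to line up with evaluations.
saves_align_with_evals = save_steps % eval_steps == 0

print(steps_per_epoch, total_steps, num_evals, saves_align_with_evals)
```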


In [51]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# --- Training Arguments ---
training_args = TrainingArguments(
    output_dir="./gpt2-prefix-tuning-imdb",
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
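    logging_steps=25,            # Log more frequently
    logging_first_step=True,     # Log the first step
    eval_strategy="steps",       # Evaluate by steps so metrics appear more often
    eval_steps=100,              # Evaluate every 100 steps
    save_strategy="steps",       # Save by steps to match eval_strategy
    load_best_model_at_end=True,
    report_to="none",            # Disable external reporting integrations (None would not disable them)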
)

# --- Create Trainer ---
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# --- Start Training ---
print("🚀 Starting training with Prefix Tuning...")
trainer.train()
print("✅ Training complete!")



🚀 Starting training with Prefix Tuning...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 5.3958        | 4.585141        |
| 200  | 4.9623        | 4.284848        |
| 300  | 4.7236        | 4.111228        |
| 400  | 4.5378        | 4.017278        |
| 500  | 4.4738        | 3.933469        |
| 600  | 4.3228        | 3.881747        |
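
Loss values are easier to interpret as perplexity. Assuming the reported validation loss is the mean per-token cross-entropy in nats (the usual convention for causal language modeling), perplexity is simply `exp(loss)`. The sketch below converts the logged values; the `perplexity` helper is illustrative.

```python
import math

# (step, validation loss) pairs copied from the training log above.
validation_log = [
    (100, 4.585141),
    (200, 4.284848),
    (300, 4.111228),
    (400, 4.017278),
    (500, 3.933469),
    (600, 3.881747),
]

def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

perplexities = [(step, perplexity(loss)) for step, loss in validation_log]
for step, ppl in perplexities:
    print(f"step {step}: validation perplexity ~ {ppl:.1f}")
```

The steadily falling values suggest the prefix was still improving when the run stopped.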


KeyboardInterrupt: 