# Lab 2: Adapter Layers - Fine-Tuning BERT for Classification
---

## Notebook 2: The Training Process

**Goal:** In this notebook, you will fine-tune a `bert-base-uncased` model on a sequence classification task using **Traditional Adapter Layers**.

**Key Concept:** We will insert small, trainable "Adapter" modules into a pre-trained BERT model. By only training the adapters and keeping the base model frozen, we can achieve efficient fine-tuning with far fewer trainable parameters.

**You will learn to:**
-   Load a dataset for sequence classification and preprocess it with a tokenizer.
-   Load a pre-trained BERT model for `SequenceClassification`.
-   Implement traditional Adapter layers.
-   Insert adapters into the BERT model and freeze the base model's weights.
-   Fine-tune the adapter layers using the Hugging Face `Trainer`.

### Step 1: Load and Preprocess the Dataset

In [11]:
from datasets import load_dataset
from transformers import AutoTokenizer

# --- Load Dataset ---

dataset = load_dataset("glue", "mrpc")
model_checkpoint = "bert-base-uncased"

# --- Load Tokenizer ---

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# --- Preprocessing Function ---

def preprocess_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

# --- Apply Preprocessing ---

encoded_dataset = dataset.map(preprocess_function, batched=True)

# The Trainer expects columns named 'labels', but the dataset has 'label'. Let's rename it.
encoded_dataset = encoded_dataset.rename_column("label", "labels")

# We only need a few columns for training.
encoded_dataset.set_format("torch", columns=["input_ids", "attention_mask", "token_type_ids", "labels"])

In [12]:
print("✅ Dataset loaded and tokenized.")
print(f"Train samples: {len(encoded_dataset['train'])}")
print(f"Test samples: {len(encoded_dataset['test'])}")


✅ Dataset loaded and tokenized.
Train samples: 3668
Test samples: 1725
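
A note on the `padding="max_length"` choice above: it pads every example to the tokenizer's maximum length (512 tokens for `bert-base-uncased`), whereas padding only to the longest example in each batch is usually cheaper. The sketch below illustrates the difference with plain Python lists. The helper `pad_to_length` and the toy token ids are made up for illustration; in practice `transformers` provides `DataCollatorWithPadding` for per-batch padding.

```python
def pad_to_length(token_ids, length, pad_id=0):
    """Right-pad (or truncate) a list of token ids to exactly `length`."""
    return (token_ids + [pad_id] * length)[:length]


# Toy batch of already-tokenized examples (ids are illustrative only).
batch = [[101, 7592, 102], [101, 2129, 2024, 2017, 102]]

# Strategy 1: pad everything to a fixed maximum length (what padding="max_length" does).
max_length_batch = [pad_to_length(ids, 512) for ids in batch]

# Strategy 2: pad only to the longest example in this batch (dynamic padding).
longest = max(len(ids) for ids in batch)
dynamic_batch = [pad_to_length(ids, longest) for ids in batch]

print(len(max_length_batch[0]), len(dynamic_batch[0]))  # 512 5
```

Fixed-length padding keeps tensor shapes constant, which is simple to reason about, but most positions end up as padding for short sentence pairs.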


### Step 2: Load the Base Model

Next, we load the `bert-base-uncased` model. This will be the base model that we will enhance with adapter layers.

#### Key Hugging Face Components:
- `transformers.AutoModelForSequenceClassification`: Loads a pre-trained model with a classification head.
    - `num_labels`: Number of classes (2 for paraphrase detection: 0=not paraphrase, 1=paraphrase).

In [13]:
from transformers import AutoModelForSequenceClassification
import torch

# --- Device Setup ---

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Load Model ---

num_labels = 2
model_checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, 
    num_labels=num_labels,
).to(device)

print("✅ BERT model loaded.")
# print(model) # Uncomment to see the model architecture

Using device: cuda


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ BERT model loaded.


### Step 3: Configure Traditional Adapters

Now we implement traditional Adapter layers and insert them into our base model.

#### Traditional Adapter Components:

-   **Down-projection Layer**: Compresses features to a smaller bottleneck dimension.
-   **Non-linear Activation**: ReLU by default (the implementation below also accepts GELU), so the adapter can learn non-linear transformations.
-   **Up-projection Layer**: Projects features back to their original dimension.
-   **Skip Connection**: A critical residual connection that adds the adapter's output to the original input: `output = input + adapter(input)`.

By freezing the base model and only training these small adapter layers, we can adapt the model to a new task efficiently.
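
Written out for a single hidden vector, the four components combine as shown below. The symbols are notation introduced here for illustration: $h$ is the hidden state of size $d$, $r$ is the reduction factor, $m = \lfloor d / r \rfloor$ is the bottleneck size, and $\sigma$ is the activation.

$$
\mathrm{Adapter}(h) = h + W_{\text{up}}\,\sigma\!\left(W_{\text{down}}\,h + b_{\text{down}}\right) + b_{\text{up}},
\qquad W_{\text{down}} \in \mathbb{R}^{m \times d},\quad W_{\text{up}} \in \mathbb{R}^{d \times m}
$$

For `bert-base-uncased`, $d = 768$, so a reduction factor of $r = 16$ gives $m = 48$. When the weights are initialized near zero and the biases at zero, the second and third terms are close to zero and the adapter starts out as approximately the identity map.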

In [14]:
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """
    Traditional Adapter Layer implementation.
    """
    def __init__(self, input_dim, reduction_factor=16, non_linearity="relu"):
        super(AdapterLayer, self).__init__()
        
        # Calculate bottleneck dimension
        bottleneck_dim = input_dim // reduction_factor
        
        # Adapter components
        self.down_project = nn.Linear(input_dim, bottleneck_dim)
        self.up_project = nn.Linear(bottleneck_dim, input_dim)
        
        # Activation function
        if non_linearity == "relu":
            self.activation = nn.ReLU()
        elif non_linearity == "gelu":
            self.activation = nn.GELU()
        else:
            raise ValueError(f"Unsupported activation: {non_linearity}")
        
        # Initialize weights for stable training
        self._init_weights()
        
    def _init_weights(self):
        """Initialize weights close to identity mapping"""
        nn.init.normal_(self.down_project.weight, std=1e-3)
        nn.init.zeros_(self.down_project.bias)
        nn.init.normal_(self.up_project.weight, std=1e-3)
        nn.init.zeros_(self.up_project.bias)
        
    def forward(self, x):
        """Forward pass with skip connection."""
        adapter_output = self.down_project(x)
        adapter_output = self.activation(adapter_output)
        adapter_output = self.up_project(adapter_output)
        return x + adapter_output

class BertWithAdapters(nn.Module):
    """
    Wrapper for a BERT model to insert adapter layers into each transformer layer.
    """
    def __init__(self, base_model, reduction_factor=16):
        super(BertWithAdapters, self).__init__()
        
        self.base_model = base_model
        self.config = base_model.config
        
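        # Freeze all base model parameters.
        # Note: this also freezes the newly initialized classifier head,
        # so only the adapters created below receive gradient updates.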
        for param in self.base_model.parameters():
            param.requires_grad = False
        
        # Get the device of the base model
        base_device = next(self.base_model.parameters()).device
        
        # Create adapters for each layer and move to correct device
        self.adapters = nn.ModuleList()
        hidden_size = self.config.hidden_size
        
        for i in range(self.config.num_hidden_layers):
            adapter = AdapterLayer(hidden_size, reduction_factor, "relu")
            adapter.to(base_device)  # Ensure adapter is on same device as base model
            self.adapters.append(adapter)
            
    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        """Use the base model directly with manual adapter injection"""
        
        # Get embeddings
        embeddings = self.base_model.bert.embeddings(
            input_ids=input_ids, 
            token_type_ids=token_type_ids
        )
        
        # Process through encoder layers with adapter injection
        hidden_states = embeddings
        
        # Create proper attention mask format
        if attention_mask is not None:
            extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
            # Match the dtype of hidden states
            extended_attention_mask = extended_attention_mask.to(dtype=hidden_states.dtype)
            extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        else:
            extended_attention_mask = None
        
        # Process each layer
        for i, layer in enumerate(self.base_model.bert.encoder.layer):
            layer_output = layer(hidden_states, extended_attention_mask)
            hidden_states = layer_output[0]
            
            # Apply adapter after each layer (device/precision handled automatically)
            hidden_states = self.adapters[i](hidden_states)
        
        # Pooling and classification
        pooled_output = self.base_model.bert.pooler(hidden_states)
        logits = self.base_model.classifier(pooled_output)
        
        # Calculate loss if labels provided
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
            return {"loss": loss, "logits": logits}
        
        return {"logits": logits}
    
    def print_trainable_parameters(self):
        """Print number of trainable parameters"""
        trainable_params = 0
        all_param = 0
        
        for _, param in self.named_parameters():
            all_param += param.numel()
            if param.requires_grad:
                trainable_params += param.numel()
                
        print(f"trainable params: {trainable_params:,} || all params: {all_param:,} || trainable%: {100 * trainable_params / all_param:.4f}")

# Create adapter-enhanced model
adapter_model = BertWithAdapters(
    base_model=model,
    reduction_factor=16  # Bottleneck reduction factor
)

# Print parameter information
adapter_model.print_trainable_parameters()

# Print device information
print(f"🔧 Device Information:")
print(f"Base model device: {next(adapter_model.base_model.parameters()).device}")
print(f"Adapter[0] device: {adapter_model.adapters[0].down_project.weight.device}")
print(f"Model is on CUDA: {next(adapter_model.parameters()).is_cuda}")

trainable params: 894,528 || all params: 110,378,306 || trainable%: 0.8104
🔧 Device Information:
Base model device: cuda:0
Adapter[0] device: cuda:0
Model is on CUDA: True
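
The trainable parameter count reported above can be checked by hand. The sketch below redoes the arithmetic for one down-projection and one up-projection (weights plus biases) per encoder layer. The helper name `adapter_param_count` is made up for illustration, and the total of 110,378,306 is taken from the printed output rather than recomputed.

```python
def adapter_param_count(hidden_size, reduction_factor, num_layers):
    """Count parameters for one bottleneck adapter per layer (weights + biases)."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck    # down-projection weight + bias
    up = bottleneck * hidden_size + hidden_size     # up-projection weight + bias
    return num_layers * (down + up)


# bert-base-uncased: hidden size 768, 12 encoder layers; reduction factor 16.
trainable = adapter_param_count(hidden_size=768, reduction_factor=16, num_layers=12)
total = 110_378_306  # "all params" as printed by print_trainable_parameters()

print(f"trainable params: {trainable:,}")            # trainable params: 894,528
print(f"trainable%: {100 * trainable / total:.4f}")  # trainable%: 0.8104
```

Each adapter contributes 74,544 parameters, so halving the reduction factor roughly doubles the trainable count.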


### Step 4: Set Up Training

Now we'll train our adapter-enhanced model. We will use the standard Hugging Face `Trainer`.

#### Key Training Concepts:

-   **Parameter-Efficient Tuning**: We only train the adapter weights, which is a small fraction of the total model parameters. This saves time and computational resources.
-   **Frozen Base Model**: The original BERT model weights are not updated during training. It acts as a powerful feature extractor.
-   **Standard Training Loop**: The process is compatible with the standard `Trainer` from the `transformers` library, making it easy to set up.
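
To make the frozen-base idea concrete, here is a minimal sketch using toy stand-ins rather than BERT: a small `nn.Linear` plays the role of the frozen base layer and another plays the role of the adapter. All names are hypothetical. It takes one optimizer step and confirms that only the adapter weights move.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical): a "base" layer to freeze and a small trainable "adapter".
base = nn.Linear(4, 4)
adapter = nn.Linear(4, 4)

# Freeze the base layer, mirroring what BertWithAdapters does for the BERT weights.
for param in base.parameters():
    param.requires_grad = False

base_before = base.weight.detach().clone()
adapter_before = adapter.weight.detach().clone()

# Hand only the parameters that still require gradients to the optimizer.
all_params = list(base.parameters()) + list(adapter.parameters())
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# One training step with a skip connection around the adapter.
x = torch.randn(8, 4)
hidden = base(x)
output = hidden + adapter(hidden)
loss = output.pow(2).mean()
loss.backward()
optimizer.step()

base_unchanged = torch.equal(base.weight, base_before)
adapter_changed = not torch.equal(adapter.weight, adapter_before)
print(f"base unchanged: {base_unchanged}, adapter changed: {adapter_changed}")
```

Because frozen parameters never receive gradients, no gradient or optimizer state needs to be stored for them, which is where much of the memory saving comes from.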

In [None]:
import numpy as np
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score

# --- Metrics Calculation Function ---

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # Calculate metrics using sklearn
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    
    return {
        "accuracy": accuracy,
        "f1": f1,
    }

# --- Training Arguments ---

training_args = TrainingArguments(
    output_dir="./bert-adapters-mrpc",
    learning_rate=1e-3,  # Higher learning rate for adapters
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="steps",        # 改為 steps 以便更頻繁顯示指標
    eval_steps=50,               # 每50步評估一次
    save_strategy="steps",
    load_best_model_at_end=True,
    logging_steps=25,            # 更頻繁的日誌記錄
    logging_first_step=True,     # 記錄第一步
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to=None,              # 避免外部報告干擾
)

# --- Create Trainer ---

trainer = Trainer(
    model=adapter_model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# --- Start Training ---

print("🚀 Starting training with Traditional Adapters...")
print("📊 Training configuration:")
print(f"   • Base model: BERT (frozen)")
print(f"   • Adapter layers: Trainable")
print(f"   • Batch size: {training_args.per_device_train_batch_size}")
print(f"   • Learning rate: {training_args.learning_rate}")
print(f"   • Epochs: {training_args.num_train_epochs}")

trainer.train()

# --- Final Evaluation ---

print("📊 Final Evaluation Results:")
final_metrics = trainer.evaluate()
print(f"Final Accuracy: {final_metrics['eval_accuracy']:.4f}")
print(f"Final F1 Score: {final_metrics['eval_f1']:.4f}")
print(f"Final Loss: {final_metrics['eval_loss']:.4f}")

In [17]:
import torch
import gc

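# If you have variables you no longer need (such as the model or trainer), uncomment the lines below to delete them
# del model
# del trainer

# Run garbage collection
gc.collect()

# Clear PyTorch's CUDA cache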
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    
print("GPU memory has been released.")

GPU memory has been released.
