<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/Sentiment_Classification_Using_Adapters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Using adapters for a sentiment classification task. Instead of fine-tuning all of BERT's parameters, I add adapter modules to the model, which reduces training time and memory requirements.

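The general adapter recipe has three steps:

Add adapter layers after the feed-forward and attention layers of BERT.

Keep the original model weights frozen.

Fine-tune only the adapter layers for the new task.

The code below is a simplified variant of the first step: it applies one adapter to the output hidden state of each transformer layer instead of inserting adapters inside the layers.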

In [None]:
import torch
from torch import nn
from transformers import BertTokenizer, BertModel, BertConfig

# Define the Adapter Module
class Adapter(nn.Module):
    def __init__(self, input_dim, adapter_dim=64):
        super(Adapter, self).__init__()
        self.down_projection = nn.Linear(input_dim, adapter_dim)
        self.non_linearity = nn.ReLU()
        self.up_projection = nn.Linear(adapter_dim, input_dim)

    def forward(self, x):
        residual = x  # Save the residual connection
        x = self.down_projection(x)
        x = self.non_linearity(x)
        x = self.up_projection(x)
        return x + residual  # Add residual connection


# Define the Adapter-Enhanced BERT Model
class BertWithAdapters(nn.Module):
    def __init__(self, model_name="bert-base-uncased", adapter_dim=64, num_labels=2):
        super(BertWithAdapters, self).__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.bert.config.output_hidden_states = True  # To get intermediate layers

        # Freeze BERT weights
        for param in self.bert.parameters():
            param.requires_grad = False

        # Add adapters to each transformer layer
        self.adapters = nn.ModuleList(
            [Adapter(self.bert.config.hidden_size, adapter_dim) for _ in range(self.bert.config.num_hidden_layers)]
        )

        # Classification head
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Request hidden states explicitly rather than relying only on the config flag
        outputs = self.bert(input_ids, attention_mask=attention_mask, output_hidden_states=True)
        hidden_states = outputs.hidden_states  # Embedding output plus one tensor per layer

        # Apply each adapter to its layer's stored output. The adapted states are NOT
        # fed back into the next BERT layer, so only the last adapter affects the logits.
        adapted_outputs = []
        for i, hidden_state in enumerate(hidden_states[1:]):  # Skip embedding layer (index 0)
            adapted_output = self.adapters[i](hidden_state)
            adapted_outputs.append(adapted_output)

        # Use the last adapted layer's output for classification
        final_output = adapted_outputs[-1][:, 0, :]  # [CLS] token output
        logits = self.classifier(final_output)
        return logits


# Load Data and Train the Model
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from datasets import load_dataset

# Load the tokenizer and dataset
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb")

def preprocess_function(examples):
    # Pad every example to max_length so the default DataLoader collation can stack them
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

encoded_dataset = dataset.map(preprocess_function, batched=True)
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Create DataLoaders
train_loader = torch.utils.data.DataLoader(encoded_dataset["train"], batch_size=16, shuffle=True)
test_loader = torch.utils.data.DataLoader(encoded_dataset["test"], batch_size=16)

# Initialize the model
model = BertWithAdapters()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define optimizer and loss function (pass only the trainable parameters: adapters and classifier)
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Training Loop
for epoch in range(3):  # Number of epochs
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}")

# Evaluation
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs, dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {correct/total:.2f}")


Key Features of This Implementation
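Adapters:
Each transformer layer has its own adapter, which adds trainable parameters while the core model stays frozen. In this implementation the adapters are applied to each layer's output hidden state rather than inserted inside the layers, so only the last adapter's output reaches the classifier.
Each adapter consists of a down-projection, a non-linearity, and an up-projection, wrapped in a residual connection.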

Freezing the Pre-Trained Model:
Only adapters and the classification head are trained, making the approach efficient.

Modularity:
The adapter modules can be reused or extended for different transformer architectures or tasks.

Flexibility:
You can tune only specific layers, use different adapter dimensions, or extend adapters for tasks beyond classification.
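
In the model above, each adapter's output is not passed on to the next BERT layer. A minimal sketch of the alternative, where every adapter feeds the next frozen layer and therefore receives a gradient, is shown below. It is an illustration under assumptions, not the notebook's method: a toy stack of `nn.Linear` layers stands in for BERT's transformer layers, and `FrozenStackWithAdapters` and its sizes are hypothetical names. Wiring the same idea into `BertModel` would also require handling the attention mask for each layer.

```python
import torch
from torch import nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, input_dim, adapter_dim=8):
        super().__init__()
        self.down_projection = nn.Linear(input_dim, adapter_dim)
        self.non_linearity = nn.ReLU()
        self.up_projection = nn.Linear(adapter_dim, input_dim)

    def forward(self, x):
        return x + self.up_projection(self.non_linearity(self.down_projection(x)))


class FrozenStackWithAdapters(nn.Module):
    """Hypothetical toy model: frozen layers with a trainable adapter after each one."""

    def __init__(self, hidden_size=16, num_layers=3, adapter_dim=8):
        super().__init__()
        # Stand-ins for the pre-trained transformer layers
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        )
        for param in self.layers.parameters():
            param.requires_grad = False
        self.adapters = nn.ModuleList(
            [Adapter(hidden_size, adapter_dim) for _ in range(num_layers)]
        )

    def forward(self, x):
        # Each adapter's output becomes the input of the next frozen layer
        for layer, adapter in zip(self.layers, self.adapters):
            x = adapter(torch.tanh(layer(x)))
        return x


torch.manual_seed(0)
model = FrozenStackWithAdapters()
out = model(torch.randn(4, 16))
out.sum().backward()

# Count adapters that received a gradient during the backward pass
adapters_with_grad = sum(
    1 for a in model.adapters if a.down_projection.weight.grad is not None
)
print(f"Adapters with gradients: {adapters_with_grad} of {len(model.adapters)}")
```

Because the adapters sit inside the layer-to-layer path, all of them take part in training, while the frozen layers accumulate no gradients.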


For the adapter-based approach to work well in a machine learning project, several factors and challenges need attention. The sections below cover how to verify the adapter, the main challenges, deployment considerations, and best practices.

1. Ensuring the Adapter Works

To verify that the adapter is functioning as intended:

Baseline Comparison: Compare performance metrics (e.g., accuracy, F1 score) with and without the adapter to measure its impact.
Frozen Backbone: Ensure the pre-trained model's weights remain frozen during training. If weights accidentally update, it can lead to inconsistent behavior.
Loss Monitoring: Track the loss curve to confirm that the model is learning and not overfitting or underfitting.
Task-Specific Metrics: Evaluate performance using metrics appropriate for the task (e.g., precision, recall, BLEU scores for text tasks).
Visual Inspection: For tasks like image generation or sentiment analysis, inspect outputs qualitatively to ensure the adapter contributes meaningfully.
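
The frozen-backbone check above can be automated. The sketch below is one possible way to do it, using a toy `nn.Linear` backbone so that it runs without downloading BERT; the helper names `count_parameters`, `snapshot`, and `is_unchanged` are hypothetical and not part of any library.

```python
import torch
from torch import nn


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


def snapshot(module):
    """Copy every parameter so it can be compared after training."""
    return {name: p.detach().clone() for name, p in module.named_parameters()}


def is_unchanged(module, before):
    """True if every parameter still equals its snapshot."""
    return all(
        torch.equal(p.detach(), before[name]) for name, p in module.named_parameters()
    )


# Toy stand-in: a frozen "backbone" followed by a trainable head
backbone = nn.Linear(8, 8)
for param in backbone.parameters():
    param.requires_grad = False
head = nn.Linear(8, 2)
toy = nn.Sequential(backbone, head)

before = snapshot(backbone)
optimizer = torch.optim.AdamW([p for p in toy.parameters() if p.requires_grad], lr=1e-2)
loss = nn.CrossEntropyLoss()(toy(torch.randn(4, 8)), torch.tensor([0, 1, 0, 1]))
loss.backward()
optimizer.step()

trainable, total = count_parameters(toy)
backbone_frozen = is_unchanged(backbone, before)
print(f"Trainable: {trainable} / {total}, backbone unchanged: {backbone_frozen}")
```

With the model in this notebook, the same helpers would be called as `before = snapshot(model.bert)` ahead of the training loop and `is_unchanged(model.bert, before)` after it.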

2. Challenges of the Adapter Approach

A. Adapter Design
Bottleneck Size (Adapter Dimensionality):

The adapter’s down-projection and up-projection dimensions must balance computational efficiency and representational capacity.
Too small: The adapter may not capture sufficient task-specific knowledge.
Too large: The model may overfit or lose the efficiency advantage.

Placement:

Deciding where to insert adapters (e.g., after attention layers, feed-forward layers, or both) affects their ability to fine-tune the model effectively.
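
To make the bottleneck trade-off concrete: an adapter with hidden size d and bottleneck size m has 2·d·m + d + m parameters (two weight matrices plus two bias vectors). The sketch below checks that formula against `nn.Linear` and prints the counts for a few bottleneck sizes, assuming a hidden size of 768 and 12 layers as in `bert-base-uncased`. The function names are hypothetical.

```python
from torch import nn


def adapter_param_count(hidden_size, adapter_dim):
    """Closed-form count: down weights + up weights + both biases."""
    return 2 * hidden_size * adapter_dim + hidden_size + adapter_dim


def measured_param_count(hidden_size, adapter_dim):
    """Count the same parameters by building the two projections."""
    down = nn.Linear(hidden_size, adapter_dim)
    up = nn.Linear(adapter_dim, hidden_size)
    return sum(p.numel() for module in (down, up) for p in module.parameters())


for adapter_dim in (16, 64, 256):
    per_adapter = adapter_param_count(768, adapter_dim)
    print(f"adapter_dim={adapter_dim}: {per_adapter} per adapter, {12 * per_adapter} for 12 layers")
```

At the default `adapter_dim=64` used in this notebook, each adapter has 99,136 parameters, so the count grows linearly with the bottleneck size.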

B. Limited Capacity
Adapters can only adjust the model indirectly by modifying intermediate representations, which may limit their effectiveness on highly complex tasks requiring significant modifications to the backbone.

C. Fine-Tuning Stability
Adapters can cause instability during training, particularly if initialization or learning rates are not carefully managed.
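
One common way to reduce this instability is to initialize the adapter so that it starts as an identity function, leaving the frozen model's behavior unchanged at step zero. The sketch below does this by zero-initializing the up-projection; `ZeroInitAdapter` is a hypothetical variant of the `Adapter` class above, not something the original code uses.

```python
import torch
from torch import nn


class ZeroInitAdapter(nn.Module):
    """Adapter whose up-projection starts at zero, so forward(x) == x initially."""

    def __init__(self, input_dim, adapter_dim=64):
        super().__init__()
        self.down_projection = nn.Linear(input_dim, adapter_dim)
        self.non_linearity = nn.ReLU()
        self.up_projection = nn.Linear(adapter_dim, input_dim)
        # With zero weights and bias, the bottleneck branch outputs zeros,
        # so only the residual path contributes at the start of training.
        nn.init.zeros_(self.up_projection.weight)
        nn.init.zeros_(self.up_projection.bias)

    def forward(self, x):
        return x + self.up_projection(self.non_linearity(self.down_projection(x)))


adapter = ZeroInitAdapter(input_dim=32, adapter_dim=8)
x = torch.randn(2, 5, 32)
starts_as_identity = torch.equal(adapter(x), x)
print(f"Adapter is the identity at initialization: {starts_as_identity}")
```

The up-projection still receives gradients on the first step, so the adapter moves away from the identity as training proceeds.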

D. Dataset Size
Adapters may underperform with very small datasets due to insufficient task-specific signals for fine-tuning, even though they are designed to work efficiently with fewer parameters.

E. Model Compatibility
Not all pre-trained models are inherently compatible with adapters, and adding adapters might require careful engineering of the forward pass.

3. Considerations for Successful Deployment

A. Hyperparameter Tuning
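Learning Rate: Adapters typically work with a larger learning rate than full fine-tuning, because the adapter and classifier weights start from random initialization rather than from pre-trained values. The code above uses 1e-4, whereas full BERT fine-tuning commonly uses values in the 2e-5 to 5e-5 range.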
Adapter Dimension: Experiment with different adapter dimensions to balance capacity and efficiency.
Batch Size: Use a reasonable batch size to stabilize gradient updates.

B. Evaluation Pipeline
Use a robust evaluation pipeline that includes cross-validation or hold-out testing to ensure reliable performance.

C. Task-Specific Pretraining
If the downstream task differs substantially from the pre-trained model's original training objectives, consider task-specific pretraining to improve the adapter's effectiveness.

D. Inference Optimization
If deploying adapters in production, optimize inference using quantization or pruning to ensure low latency.

E. Scalability
Ensure that adapters scale efficiently across tasks, especially when deploying multiple models or working in multi-task scenarios.
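
For the multi-task scenario, one option is to share a single frozen backbone and keep a small adapter and classification head per task. The sketch below illustrates the idea with a toy `nn.Linear` backbone; `MultiTaskAdapterModel` and the task names are hypothetical and chosen only for illustration.

```python
import torch
from torch import nn


class MultiTaskAdapterModel(nn.Module):
    """Hypothetical sketch: one frozen backbone, one adapter and head per task."""

    def __init__(self, hidden_size=16, adapter_dim=4, tasks=None):
        super().__init__()
        tasks = tasks or {"sentiment": 2, "topic": 4}  # task name -> number of labels
        self.backbone = nn.Linear(hidden_size, hidden_size)  # stand-in for BERT
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(hidden_size, adapter_dim),
                nn.ReLU(),
                nn.Linear(adapter_dim, hidden_size),
            )
            for name in tasks
        })
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, num_labels) for name, num_labels in tasks.items()}
        )

    def forward(self, x, task):
        hidden = torch.tanh(self.backbone(x))
        hidden = hidden + self.adapters[task](hidden)  # residual around the task adapter
        return self.heads[task](hidden)


model = MultiTaskAdapterModel()
x = torch.randn(3, 16)
sentiment_logits = model(x, task="sentiment")
topic_logits = model(x, task="topic")
print(sentiment_logits.shape, topic_logits.shape)
```

Only the small per-task modules need to be stored and swapped, which is what keeps the memory cost of adding a task low.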


4. Best Practices

Experiment with Placement:

Adapters can be inserted after attention layers, feed-forward layers, or both. Experiment to identify the most impactful positions for your task.

Leverage Pretrained Adapters:

Consider using pretrained adapters from libraries like AdapterHub for common tasks to save time.

Use Residual Connections:

Residual connections in adapters help maintain stability and allow information flow from the frozen backbone.

Monitor Performance Across Iterations:

Continuously evaluate and refine the adapter design using metrics like validation accuracy and loss.

Avoid Overfitting:

Regularize training using dropout, weight decay, or early stopping to prevent overfitting, especially on small datasets.
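
Early stopping can be sketched in a few lines. The `EarlyStopping` class below is a hypothetical helper, and the validation losses are made-up numbers used only to show the control flow; in practice `step` would be called once per epoch with the loss measured on a held-out validation split.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
val_losses = [0.70, 0.55, 0.56, 0.58, 0.40]  # made-up values for illustration
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

Weight decay can be added separately through the optimizer, for example by passing `weight_decay=0.01` to `torch.optim.AdamW`.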


Tools and Frameworks

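AdapterHub: A framework and repository of pretrained adapters. Its `adapters` library (formerly `adapter-transformers`) adds adapter support to Hugging Face models.

Hugging Face Transformers: Provides the pretrained backbone models. Related parameter-efficient methods such as LoRA are available through the companion PEFT library.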

Optuna or Ray Tune: Use these tools to automate hyperparameter optimization for adapter-specific parameters.

Validation Checklist

 Compare with baseline performance without adapters.

 Experiment with different adapter configurations (dimensionality, placement).

 Monitor training metrics (loss, validation accuracy) to confirm stable learning.

 Validate against real-world test cases or benchmarks.

 Ensure compatibility with deployment constraints (e.g., latency, memory).