<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/Amazon_Personalized_Product_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here are two specific **use cases** for adapters at companies like **Amazon** and **Apple**, focusing on scenarios where this approach could shine.

---

### **1. Amazon: Personalized Product Recommendations**
**Use Case**: Fine-tuning pre-trained large language models (LLMs) for personalized product recommendations across millions of users.

#### **Scenario**:
Amazon uses a foundational transformer model trained on large-scale text data, such as customer reviews, product descriptions, and behavioral data. To cater to diverse customer preferences, adapters can be added to the pre-trained model for personalization tasks.

#### **Adapter Workflow**:
1. **Goal**: Predict product recommendations based on user interaction history.
2. **Process**:
   - Insert adapters after feed-forward and attention layers of the transformer model.
   - Fine-tune the adapters on specific customer groups (e.g., regional preferences, seasonal demands) while freezing the core model.
3. **Example**:
   - Train adapters for users in different regions (e.g., North America, Europe) to capture region-specific preferences.
   - Train another set of adapters for holiday shopping preferences like "Black Friday" vs. "Prime Day."
4. **Advantages**:
   - Reduce the computational cost by not fine-tuning the entire model for every customer segment.
   - Quickly deploy personalized adapters for new use cases (e.g., a new product line).
5. **Challenges**:
   - Handling sparse datasets for niche customer segments.
   - Ensuring inference latency remains low during high-traffic periods.

---

### **2. Apple: Voice Recognition and Personalization in Siri**
**Use Case**: Enhancing Siri's voice recognition and response accuracy for specific user accents or languages.

#### **Scenario**:
Apple uses foundational models for automatic speech recognition (ASR) and natural language understanding (NLU) in Siri. Instead of retraining the entire model to support a new accent or dialect, adapters can be added for quick adaptation.

#### **Adapter Workflow**:
1. **Goal**: Fine-tune Siri's model to improve recognition for a specific accent (e.g., Australian English or Indian English).
2. **Process**:
   - Freeze the core ASR/NLU model trained on a generic English corpus.
   - Add adapters trained specifically on datasets from Australian or Indian speakers.
   - Deploy the adapted model to users in these regions without requiring significant changes to the core architecture.
3. **Example**:
   - Use adapters to specialize in domain-specific tasks like medical terminology or legal language.
   - Train adapters for new device-specific commands (e.g., "AirPods Pro gestures" or "Apple Vision Pro navigation").
4. **Advantages**:
   - Improves Siri's accuracy for specific user demographics without retraining the full model.
   - Enables efficient deployment to edge devices like iPhones and Apple Watches with minimal storage overhead.
5. **Challenges**:
   - Ensuring adapters don't degrade performance for general users.
   - Maintaining real-time processing speeds on devices with limited computational power.

---

### **Comparison of Use Cases**

| **Aspect**             | **Amazon: Recommendations**                                     | **Apple: Siri Personalization**                            |
|-------------------------|-----------------------------------------------------------------|-----------------------------------------------------------|
| **Model Type**          | Transformer-based LLM for text understanding                   | Transformer-based ASR and NLU models                      |
| **Task**                | Product recommendations based on user interaction              | Voice recognition and response accuracy for accents       |
| **Dataset**             | Customer reviews, clickstream data, purchase history           | Speech samples, language corpora, accent-specific datasets |
| **Adapter Placement**   | After attention layers to capture user-specific preferences    | After feed-forward layers to capture accent-specific nuances |
| **Deployment**          | Server-side for personalized recommendations                  | On-device for real-time voice interactions                |
| **Challenges**          | Data sparsity, latency during peak traffic                     | Balancing accuracy and latency on edge devices            |

---

### **Key Benefits of Adapters in These Use Cases**
1. **Efficiency**: Adapters add minimal parameters, reducing the cost and time of fine-tuning large models.
2. **Scalability**: Different adapters can be trained and deployed for various tasks or regions, allowing rapid scaling.
3. **Compatibility**: Adapters can work seamlessly with existing deployment pipelines (e.g., server-side inference or on-device processing).

These examples demonstrate how companies like Amazon and Apple could use adapters to enhance their ML workflows, optimize resource usage, and improve user experience.
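
To make the efficiency point concrete, here is a rough parameter count for a single bottleneck adapter. The hidden size of 768, the bottleneck size of 64, and the backbone size of roughly 110 million parameters are illustrative assumptions (they approximate a BERT-base-sized model), not figures from either company.

```python
def adapter_param_count(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters in one bottleneck adapter: down- and up-projection, each with a bias."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


HIDDEN_DIM = 768                # assumption: BERT-base-like hidden size
BOTTLENECK_DIM = 64             # assumption: illustrative bottleneck size
BACKBONE_PARAMS = 110_000_000   # assumption: approximate backbone size

per_adapter = adapter_param_count(HIDDEN_DIM, BOTTLENECK_DIM)
share = per_adapter / BACKBONE_PARAMS
print(f"{per_adapter:,} parameters per adapter ({share:.3%} of the backbone)")
```

Under these assumptions each additional segment costs about a hundred thousand trainable parameters, which is why training one adapter per region or season is far cheaper than fine-tuning a full copy of the model.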

# 1. Amazon: Personalized Product Recommendations
This walkthrough uses a pre-trained transformer model (e.g., GPT, BERT) as the backbone and adds adapters for personalization. The approach includes dynamic adapter switching, multi-level personalization, and optimization for scalability and low-latency inference.

Dynamic Adapter Modules:

Adapters are dynamically selected based on user profiles (e.g., region, preferences, season).
Incorporate a gating mechanism to automatically route requests to the appropriate adapter.
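
A minimal sketch of profile-based adapter selection, assuming adapters are stored under hypothetical keys such as `region_eu` or `season_black_friday` and that a `global` adapter serves as the fallback. The `UserProfile` fields and the priority order (seasonal first, then regional) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserProfile:
    region: str
    active_event: Optional[str] = None  # e.g. "black_friday" while the event is running


# Hypothetical registry of adapters that have already been trained
AVAILABLE_ADAPTERS = {
    "global", "region_na", "region_eu", "season_black_friday", "season_prime_day",
}


def select_adapter(profile: UserProfile, available=AVAILABLE_ADAPTERS) -> str:
    """Return the most specific adapter key available for this profile."""
    candidates = []
    if profile.active_event:
        candidates.append(f"season_{profile.active_event}")
    candidates.append(f"region_{profile.region}")
    for key in candidates:
        if key in available:
            return key
    return "global"  # fall back to the shared adapter


print(select_adapter(UserProfile(region="eu")))
print(select_adapter(UserProfile(region="na", active_event="black_friday")))
print(select_adapter(UserProfile(region="apac")))
```

A rule-based router like this is easy to audit; the learned gating mechanism shown later in the notebook replaces the fixed priority order with a trainable layer.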

Multi-Task Learning:

Use shared adapters for global preferences and specialized adapters for region-specific, seasonal, or user-segment-specific preferences.

Federated Adapter Fine-Tuning:

Train adapters on-device using anonymized data to improve personalization while ensuring privacy.

Optimized Distributed Inference:

Employ model sharding and low-latency communication for scalable inference during high-traffic periods.

In [None]:
import torch
from torch import nn
from transformers import BertModel, BertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import random


# 1. Foundation Model Setup
Use a pre-trained LLM (e.g., GPT or BERT) trained on Amazon’s large-scale text data, including customer reviews, product descriptions, and clickstream data.

Pre-trained LLM as the Backbone

Model Choice: Use a pre-trained model like GPT-2, BERT, or DistilBERT from the Hugging Face library. These models have already learned a rich set of language representations, which we can fine-tune for our personalized recommendation use case.

Data Types:
Customer reviews: Capture user sentiment about products.
Product descriptions: Provide context for recommendations.
Clickstream data: Understand user preferences and behavior.

# Synthetic Data
1. Customer Reviews

Schema:
review_id: Unique ID for the review.
product_id: ID of the product being reviewed.
user_id: ID of the user leaving the review.
review_text: Text of the review.
star_rating: User rating (1 to 5).
timestamp: When the review was created.


In [None]:
customer_reviews = [
  {
    "review_id": "R001",
    "product_id": "P101",
    "user_id": "U001",
    "review_text": "This laptop has excellent battery life and is very lightweight.",
    "star_rating": 5,
    "timestamp": "2024-01-10T15:30:00Z"
  },
  {
    "review_id": "R002",
    "product_id": "P102",
    "user_id": "U002",
    "review_text": "The headphones are decent but not very comfortable for long use.",
    "star_rating": 3,
    "timestamp": "2024-01-09T18:45:00Z"
  },
  {
    "review_id": "R003",
    "product_id": "P103",
    "user_id": "U003",
    "review_text": "Great camera quality but overpriced for the features offered.",
    "star_rating": 4,
    "timestamp": "2024-01-08T11:20:00Z"
  }
]


2. Product Descriptions

Schema:

product_id: ID of the product.

product_name: Name of the product.

product_description: Text describing the product.

category: Product category.

In [None]:
product_descriptions = [
  {
    "product_id": "P101",
    "product_name": "Ultrabook X300",
    "product_description": "A sleek and lightweight laptop with a 15-hour battery life, perfect for professionals on the go.",
    "category": "Electronics"
  },
  {
    "product_id": "P102",
    "product_name": "Noise-Cancelling Headphones Pro",
    "product_description": "High-quality headphones with active noise cancellation, great sound clarity, and a comfortable fit.",
    "category": "Audio"
  },
  {
    "product_id": "P103",
    "product_name": "4K DSLR Camera",
    "product_description": "A professional-grade DSLR camera with 4K video recording and advanced low-light performance.",
    "category": "Photography"
  }
]


3. Clickstream Data

Schema:

user_id: ID of the user.

session_id: Unique session identifier.

timestamp: When the action occurred.

event_type: Type of user action (view, click, add_to_cart, purchase).

product_id: ID of the product involved in the action.

In [None]:
clickstream = [
  {
    "user_id": "U001",
    "session_id": "S001",
    "timestamp": "2024-01-10T14:00:00Z",
    "event_type": "view",
    "product_id": "P101"
  },
  {
    "user_id": "U001",
    "session_id": "S001",
    "timestamp": "2024-01-10T14:15:00Z",
    "event_type": "add_to_cart",
    "product_id": "P101"
  },
  {
    "user_id": "U002",
    "session_id": "S002",
    "timestamp": "2024-01-09T18:00:00Z",
    "event_type": "view",
    "product_id": "P102"
  },
  {
    "user_id": "U002",
    "session_id": "S002",
    "timestamp": "2024-01-09T18:10:00Z",
    "event_type": "purchase",
    "product_id": "P102"
  },
  {
    "user_id": "U003",
    "session_id": "S003",
    "timestamp": "2024-01-08T10:45:00Z",
    "event_type": "view",
    "product_id": "P103"
  },
  {
    "user_id": "U003",
    "session_id": "S003",
    "timestamp": "2024-01-08T11:00:00Z",
    "event_type": "click",
    "product_id": "P103"
  }
]


Data Integration

Merge Datasets:

Combine customer reviews, product descriptions, and clickstream data using product_id and user_id as keys.
Example: A user viewed a product (clickstream), purchased it, and left a review.

Feature Engineering:

Convert review_text and product_description into embeddings using a pre-trained LLM (e.g., BERT or GPT).
Encode user behavior (view, add_to_cart, purchase) as numerical features for model input.

Train Model:

Train adapters to predict product relevance or purchase likelihood using the combined dataset.

In [None]:
import pandas as pd

# Load customer reviews
customer_reviews = pd.DataFrame([
    {"review_id": "R001", "product_id": "P101", "user_id": "U001", "review_text": "This laptop has excellent battery life and is very lightweight.", "star_rating": 5, "timestamp": "2024-01-10T15:30:00Z"},
    {"review_id": "R002", "product_id": "P102", "user_id": "U002", "review_text": "The headphones are decent but not very comfortable for long use.", "star_rating": 3, "timestamp": "2024-01-09T18:45:00Z"},
    {"review_id": "R003", "product_id": "P103", "user_id": "U003", "review_text": "Great camera quality but overpriced for the features offered.", "star_rating": 4, "timestamp": "2024-01-08T11:20:00Z"}
])

# Load product descriptions
product_descriptions = pd.DataFrame([
    {"product_id": "P101", "product_name": "Ultrabook X300", "product_description": "A sleek and lightweight laptop with a 15-hour battery life, perfect for professionals on the go.", "category": "Electronics"},
    {"product_id": "P102", "product_name": "Noise-Cancelling Headphones Pro", "product_description": "High-quality headphones with active noise cancellation, great sound clarity, and a comfortable fit.", "category": "Audio"},
    {"product_id": "P103", "product_name": "4K DSLR Camera", "product_description": "A professional-grade DSLR camera with 4K video recording and advanced low-light performance.", "category": "Photography"}
])

# Load clickstream data
clickstream_data = pd.DataFrame([
    {"user_id": "U001", "session_id": "S001", "timestamp": "2024-01-10T14:00:00Z", "event_type": "view", "product_id": "P101"},
    {"user_id": "U001", "session_id": "S001", "timestamp": "2024-01-10T14:15:00Z", "event_type": "add_to_cart", "product_id": "P101"},
    {"user_id": "U002", "session_id": "S002", "timestamp": "2024-01-09T18:00:00Z", "event_type": "view", "product_id": "P102"},
    {"user_id": "U002", "session_id": "S002", "timestamp": "2024-01-09T18:10:00Z", "event_type": "purchase", "product_id": "P102"},
    {"user_id": "U003", "session_id": "S003", "timestamp": "2024-01-08T10:45:00Z", "event_type": "view", "product_id": "P103"},
    {"user_id": "U003", "session_id": "S003", "timestamp": "2024-01-08T11:00:00Z", "event_type": "click", "product_id": "P103"}
])

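# Merge datasets. Reviews and clickstream both carry a `timestamp` column,
# so explicit suffixes keep the two readable after the second join.
merged_data = pd.merge(customer_reviews, product_descriptions, on="product_id")
merged_data = pd.merge(merged_data, clickstream_data, on=["user_id", "product_id"], how="left", suffixes=("_review", "_event"))
print(merged_data.head())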
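
The behavior-encoding step listed under Feature Engineering could be sketched as follows. The event weights are an assumption chosen only to rank events by purchase intent; in practice they would be tuned or learned. Plain dictionaries are used so the sketch runs without pandas.

```python
from collections import defaultdict

# Hypothetical intent weights per event type (assumption, not from the dataset)
EVENT_WEIGHTS = {"view": 1.0, "click": 2.0, "add_to_cart": 3.0, "purchase": 5.0}


def encode_behavior(events):
    """Aggregate clickstream events into numeric features per (user, product) pair."""
    features = defaultdict(lambda: {"n_events": 0, "max_intent": 0.0, "purchased": 0})
    for event in events:
        key = (event["user_id"], event["product_id"])
        weight = EVENT_WEIGHTS.get(event["event_type"], 0.0)
        row = features[key]
        row["n_events"] += 1
        row["max_intent"] = max(row["max_intent"], weight)
        if event["event_type"] == "purchase":
            row["purchased"] = 1
    return dict(features)


sample_events = [
    {"user_id": "U001", "product_id": "P101", "event_type": "view"},
    {"user_id": "U001", "product_id": "P101", "event_type": "add_to_cart"},
    {"user_id": "U002", "product_id": "P102", "event_type": "view"},
    {"user_id": "U002", "product_id": "P102", "event_type": "purchase"},
]
behavior_features = encode_behavior(sample_events)
print(behavior_features[("U001", "P101")])
print(behavior_features[("U002", "P102")])
```

The resulting per-pair features can be joined back onto the merged table by `user_id` and `product_id` and concatenated with the text embeddings.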


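# 2. Adapter Integration
Conceptually, adapters sit after the attention and feed-forward layers in the transformer. To keep the example short, the implementation below takes a simpler route: it applies the selected adapter once, to the final-layer [CLS] representation of a frozen BERT model.
Use a bottleneck dimension (e.g., 64) to balance performance and efficiency.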

In [None]:
class Adapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64):
        super(Adapter, self).__init__()
        self.down_projection = nn.Linear(input_dim, bottleneck_dim)
        self.non_linearity = nn.ReLU()
        self.up_projection = nn.Linear(bottleneck_dim, input_dim)

    def forward(self, x):
        residual = x
        x = self.down_projection(x)
        x = self.non_linearity(x)
        x = self.up_projection(x)
        return x + residual

# class TransformerWithAdapters(nn.Module):
#     def __init__(self, transformer_model, bottleneck_dim=64, num_adapters=5):
#         super(TransformerWithAdapters, self).__init__()
#         self.transformer = transformer_model
#         self.adapters = nn.ModuleDict({
#             f"adapter_{i}": Adapter(self.transformer.config.hidden_size, bottleneck_dim)
#             for i in range(num_adapters)
#         })

#     def forward(self, input_ids, attention_mask, adapter_key):
#         outputs = self.transformer(input_ids, attention_mask=attention_mask, output_hidden_states=True)
#         hidden_states = outputs.hidden_states

#         # Apply selected adapter
#         adapted_states = self.adapters[adapter_key](hidden_states[-1])  # Use last hidden state
#         return adapted_states


In [None]:
class TransformerWithAdapters(nn.Module):
    def __init__(self, model_name="bert-base-uncased", bottleneck_dim=64, num_adapters=5):
        super(TransformerWithAdapters, self).__init__()
        self.bert = BertModel.from_pretrained(model_name)
        for param in self.bert.parameters():
            param.requires_grad = False  # Freeze the backbone

        # Add adapters for each user segment
        self.adapters = nn.ModuleDict({
            f"adapter_{i}": Adapter(self.bert.config.hidden_size, bottleneck_dim)
            for i in range(num_adapters)
        })

        # Classification head
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)  # Predict relevance score

    def forward(self, input_ids, attention_mask, adapter_key):
        outputs = self.bert(input_ids, attention_mask=attention_mask, output_hidden_states=True)
        hidden_states = outputs.hidden_states[-1]  # Use the last hidden state

        # Pass through the selected adapter
        adapted_states = self.adapters[adapter_key](hidden_states[:, 0, :])  # Use CLS token
        logits = self.classifier(adapted_states)
        return logits
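
The class above applies its adapter once, to the final [CLS] vector. For comparison, here is a minimal sketch of the in-layer placement described at the start of this section, with one adapter after the attention sublayer and one after the feed-forward sublayer. `BottleneckAdapter`, `AdapterBlock`, and all dimensions are hypothetical, and the block is a simplified stand-in for a transformer layer, not BERT's actual implementation.

```python
import torch
from torch import nn


class BottleneckAdapter(nn.Module):
    """Down-project, ReLU, up-project, plus a residual connection."""

    def __init__(self, dim, bottleneck_dim=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


class AdapterBlock(nn.Module):
    """Simplified post-LayerNorm transformer block with an adapter after each sublayer."""

    def __init__(self, dim=128, num_heads=4, ff_dim=256, bottleneck_dim=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn_adapter = BottleneckAdapter(dim, bottleneck_dim)
        self.ff_adapter = BottleneckAdapter(dim, bottleneck_dim)
        # Freeze every parameter except those belonging to the two adapters
        for name, param in self.named_parameters():
            param.requires_grad = "adapter" in name

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.attn_adapter(attn_out))   # adapter after attention
        x = self.norm2(x + self.ff_adapter(self.ff(x)))   # adapter after feed-forward
        return x


block = AdapterBlock()
out = block(torch.randn(2, 10, 128))  # (batch, seq_len, dim)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(out.shape, f"trainable: {trainable} / {total}")
```

With a pre-trained backbone, the same idea would require wrapping or patching each existing layer rather than defining a new block, which is the part this sketch leaves out.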


# 3. Personalization Workflow
A. User-Specific Adapters
Train separate adapters for each user segment (e.g., regional preferences, seasonal preferences):
Dataset: Split data by region (e.g., North America, Europe) and seasonal demand (e.g., "Black Friday" vs. "Prime Day").
Fine-Tuning: Use only the adapter parameters for training while freezing the core model.
B. Dynamic Adapter Selection
Use a gating mechanism based on metadata (e.g., user location, browsing history) to select the appropriate adapter during inference.

In [None]:
class AdapterSelector(nn.Module):
    def __init__(self, input_dim, num_adapters):
        super(AdapterSelector, self).__init__()
        self.gate = nn.Linear(input_dim, num_adapters)  # input_dim: size of the user-metadata vector

    def forward(self, user_metadata):
        adapter_weights = self.gate(user_metadata)
        # argmax is a hard, non-differentiable choice, so the gate receives no gradient from the task loss
        adapter_key = torch.argmax(adapter_weights, dim=1)  # Select the most relevant adapter
        return adapter_key


In [None]:
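# Load the tokenizer and dataset
# Note: "amazon_us_reviews" may no longer be hosted on the Hugging Face Hub; if loading fails,
# substitute any review dataset that provides a review text column and a star rating column.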
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("amazon_us_reviews", "Digital_Software_v1_00", split="train[:5000]")

def preprocess_function(examples):
    return tokenizer(examples["review_body"], padding="max_length", truncation=True, max_length=128)

# Preprocess dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "star_rating"])

# Split dataset
train_size = int(0.8 * len(encoded_dataset))
test_size = len(encoded_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(encoded_dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)


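# 4. Federated Adapter Fine-Tuning
Use federated learning to fine-tune adapters on-device, preserving user privacy:
Fine-tune adapters locally with anonymized user data.
Periodically aggregate adapter updates using secure aggregation.

Note: the training cell below is a conventional centralized loop that stands in for local fine-tuning; it does not implement federated training or secure aggregation.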
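
A minimal sketch of the aggregation step, assuming plain federated averaging (FedAvg): each device reports its locally tuned adapter weights together with the number of examples it trained on, and the server takes the example-weighted mean. Weights are shown as flat Python lists for readability and secure aggregation is omitted; both are simplifications of this sketch.

```python
def federated_average(client_updates):
    """Example-weighted average of adapter parameters.

    client_updates: list of (num_examples, {param_name: [float, ...]}) tuples.
    """
    total_examples = sum(n for n, _ in client_updates)
    if total_examples == 0:
        raise ValueError("no training examples reported")
    reference = client_updates[0][1]
    averaged = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(n * params[name][i] for n, params in client_updates) / total_examples
            for i in range(len(values))
        ]
    return averaged


# Two hypothetical devices with different amounts of local data
updates = [
    (10, {"down_projection": [1.0, 2.0], "up_projection": [0.0]}),
    (30, {"down_projection": [3.0, 6.0], "up_projection": [4.0]}),
]
global_adapter = federated_average(updates)
print(global_adapter)
```

Because only adapter weights are exchanged, each round transfers a small fraction of what full-model federated training would.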

In [None]:
# Initialize the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TransformerWithAdapters(num_adapters=5).to(device)
adapter_selector = AdapterSelector(input_dim=16, num_adapters=5).to(device)  # Dummy metadata dim

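# Optimizer and loss function
# Only parameters with requires_grad=True (adapters and classifier) are optimized; the BERT backbone is frozen.
# The selector is left out because its argmax routing passes no gradient back to the gate.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)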
criterion = nn.MSELoss()

# Training
for epoch in range(3):  # Train for 3 epochs
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        ratings = batch["star_rating"].float().unsqueeze(1).to(device)

        # Simulate user metadata (random for this example)
        user_metadata = torch.rand(input_ids.size(0), 16).to(device)
        adapter_keys = adapter_selector(user_metadata)

        # Forward pass
        outputs = torch.cat(
            [model(input_ids[i:i+1], attention_mask[i:i+1], f"adapter_{adapter_keys[i].item()}") for i in range(input_ids.size(0))]
        )

        loss = criterion(outputs, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_loader)}")


# 5. Evaluation and Validation
Metrics:
Accuracy: Precision and recall for personalized recommendations.
Coverage: Fraction of users covered by adapters.
Latency: Time to infer recommendations during high-traffic periods.
Testing:
Compare the performance of the adapter-enhanced model to:
The baseline model without adapters.
Full model fine-tuning approaches.
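
A minimal sketch of the metrics listed above, assuming recommendations are ranked lists of product IDs and ground truth is the set of products a user actually bought. The function names and sample values are hypothetical.

```python
import time


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the top-k recommended items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def coverage(users_with_adapter, all_users):
    """Fraction of users for whom a dedicated adapter exists."""
    all_users = set(all_users)
    if not all_users:
        return 0.0
    return len(set(users_with_adapter) & all_users) / len(all_users)


def mean_latency_ms(fn, *args, repeats=100):
    """Average wall-clock time of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats


p, r = precision_recall_at_k(["P101", "P103", "P102"], {"P101", "P102"}, k=2)
cov = coverage(["U001", "U002"], ["U001", "U002", "U003", "U004"])
latency = mean_latency_ms(precision_recall_at_k, ["P101"], {"P101"}, 1)
print(p, r, cov)
```

Running the same functions on the baseline, the adapter model, and a fully fine-tuned model gives the comparison described under Testing.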

In [None]:
# Evaluation Loop
model.eval()
adapter_selector.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        ratings = batch["star_rating"].float().unsqueeze(1).to(device)

        # Simulate user metadata (random for this example)
        user_metadata = torch.rand(input_ids.size(0), 16).to(device)
        adapter_keys = adapter_selector(user_metadata)

        outputs = torch.cat(
            [model(input_ids[i:i+1], attention_mask[i:i+1], f"adapter_{adapter_keys[i].item()}") for i in range(input_ids.size(0))]
        )

        # Evaluate predictions
        correct += ((outputs.round() == ratings.round()).sum().item())
        total += ratings.size(0)

print(f"Accuracy: {correct / total:.2f}")


# 6. Optimized Distributed Inference
A. Model Sharding:
Divide the model and adapters across multiple servers:
Store global adapters on central servers.
Cache region-specific adapters on edge servers.
B. Quantization:
Apply 8-bit quantization to reduce memory usage and improve inference speed for adapters.
C. Edge-Cloud Hybrid Deployment:
Perform initial recommendation generation on edge devices.
Use cloud servers for complex operations (e.g., reranking results).
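
To illustrate the quantization step, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is a toy under stated assumptions: the weights are made up, and a real deployment would rely on a framework's quantization tooling rather than hand-written code.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical adapter weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale.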

# Challenges and Solutions
Challenge 1: Sparse Data for Niche Segments
Solution: Use multi-task learning to share adapters between related segments, allowing them to learn from overlapping data.

Challenge 2: Low-Latency Inference
Solution:
Use adapter caching for frequently accessed segments.
Implement approximate nearest neighbor (ANN) search for fast user-to-adapter mapping.

Challenge 3: Scalability
Solution:
Use adapter distillation to reduce the number of active adapters during inference.
Dynamically load adapters based on traffic patterns.
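
Adapter caching and on-demand loading can be sketched with a small least-recently-used (LRU) cache. `AdapterCache` and `load_adapter_from_storage` are hypothetical names, and the loader is a stand-in for reading adapter weights from disk or a remote store.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters in memory, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # callable: adapter_key -> adapter weights
        self._cache = OrderedDict()
        self.misses = 0

    def get(self, adapter_key):
        if adapter_key in self._cache:
            self._cache.move_to_end(adapter_key)   # mark as most recently used
            return self._cache[adapter_key]
        self.misses += 1
        adapter = self.loader(adapter_key)          # load on demand
        self._cache[adapter_key] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)         # evict least recently used
        return adapter

    def keys(self):
        return list(self._cache)


def load_adapter_from_storage(adapter_key):
    # Stand-in for an expensive read of adapter weights
    return {"name": adapter_key}


cache = AdapterCache(capacity=2, loader=load_adapter_from_storage)
for key in ["region_na", "region_eu", "region_na", "season_prime_day", "region_eu"]:
    cache.get(key)
print(cache.misses, cache.keys())
```

Sizing the cache to the number of segments that are active during a traffic peak keeps the hot adapters resident while rarely used ones are loaded only when requested.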

# Deployment Example
Training:

Train adapters for regions (e.g., US, EU) and special events (e.g., "Holiday Season").
Use federated learning to refine adapters locally.
Inference:

Input: User metadata (e.g., region, preferences).
Select adapter: Use the gating mechanism to choose the appropriate adapter.
Generate recommendations: Pass user data through the selected adapter and transformer.

In [None]:
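def generate_recommendations(model, adapter_selector, input_data, user_metadata):
    # Assumes a single user: user_metadata has shape (1, metadata_dim)
    adapter_idx = adapter_selector(user_metadata)[0].item()
    with torch.no_grad():
        # The model looks adapters up by string key, so convert the selected index
        scores = model(input_data["input_ids"], input_data["attention_mask"], f"adapter_{adapter_idx}")
    return scores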


# Advantages of This Approach
Cost-Efficient Personalization:
Only adapter parameters are fine-tuned, reducing computational overhead.

Scalable Deployment:
Adapters can scale across millions of users with dynamic selection and caching.

Privacy-Preserving:
Federated learning ensures that user data remains local.

# Features of This Project

Adapters:
Dynamically switch based on user metadata.
Train separately for different user segments.

Scalability:
Efficient use of adapters reduces computational overhead.

Dynamic Selection:
AdapterSelector routes data to the correct adapter based on metadata.

Personalization:
Adapters specialize in user-specific preferences (e.g., regions, seasonal events).

1. Parallel Adapters

Key Idea: Instead of inserting adapters sequentially into the transformer layers, parallel adapters are attached alongside the main layer computations (e.g., attention or feed-forward layers).

Advantages:
Reduces interference with the original model's flow.
Improves flexibility by learning parallel paths for task-specific adaptation.
Applications: Multitask learning where tasks are vastly different.

In [None]:
class ParallelAdapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64):
        super(ParallelAdapter, self).__init__()
        self.down_projection = nn.Linear(input_dim, bottleneck_dim)
        self.non_linearity = nn.ReLU()
        self.up_projection = nn.Linear(bottleneck_dim, input_dim)

    def forward(self, x, sublayer_output):
        # x is the sublayer's *input*; the adapter branch runs alongside the frozen sublayer
        adapted = self.up_projection(self.non_linearity(self.down_projection(x)))
        return sublayer_output + adapted  # Add the parallel branch to the sublayer's own output


2. HyperAdapters

Key Idea: Use a hypernetwork to generate adapter parameters dynamically based on task metadata (e.g., task embeddings, user preferences, or contextual features).

Advantages:
Allows adaptation to a wide range of tasks without explicitly training separate adapters.
Scales efficiently for multitask and few-shot learning.
Applications: Personalized assistants, multilingual models, and dynamic task adaptation.

In [None]:
class HyperAdapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64, task_embedding_dim=32):
        super(HyperAdapter, self).__init__()
        # Keep the dimensions so forward() can split and reshape the generated weights
        self.input_dim = input_dim
        self.bottleneck_dim = bottleneck_dim
        self.task_embedding_to_adapter = nn.Linear(task_embedding_dim, input_dim * bottleneck_dim + bottleneck_dim * input_dim)

    def forward(self, x, task_embedding):
        # Generate adapter weights dynamically; task_embedding is a single 1-D vector of size task_embedding_dim
        adapter_weights = self.task_embedding_to_adapter(task_embedding)
        split = self.input_dim * self.bottleneck_dim
        down_projection_weights = adapter_weights[:split].reshape(self.input_dim, self.bottleneck_dim)
        up_projection_weights = adapter_weights[split:].reshape(self.bottleneck_dim, self.input_dim)

        # Apply adapter transformation
        down_projected = x @ down_projection_weights
        up_projected = down_projected @ up_projection_weights
        return x + up_projected


3. LoRA (Low-Rank Adaptation of Large Language Models)

Key Idea: Keep each pre-trained weight matrix frozen and learn its update as the product of two low-rank matrices, which further reduces the number of trainable parameters.

Advantages:
Highly efficient for fine-tuning massive models like GPT or T5.
Significantly reduces memory requirements.
Applications: Large-scale language models for domain-specific tasks.

In [None]:
class LoRAAdapter(nn.Module):
    def __init__(self, input_dim, rank=4):
        super(LoRAAdapter, self).__init__()
        self.low_rank_A = nn.Linear(input_dim, rank, bias=False)
        self.low_rank_B = nn.Linear(rank, input_dim, bias=False)

    def forward(self, x):
        return x + self.low_rank_B(self.low_rank_A(x))  # Add low-rank transformation
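
For reference, the LoRA formulation can be written out as follows. The notation ($W_0$, $A$, $B$, $r$, $\alpha$) follows common usage rather than anything defined in this notebook, and the numbers assume a square 768 × 768 weight matrix with rank 4 purely for illustration.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

\text{trainable parameters: } r\,(d + k)
\quad \text{vs.} \quad d\,k \text{ for the full matrix}

d = k = 768,\ r = 4:\qquad
4 \cdot (768 + 768) = 6144
\quad \text{vs.} \quad 768^2 = 589824
```

Only $A$ and $B$ are trained while $W_0$ stays frozen. The `LoRAAdapter` class above uses the same low-rank product but adds it as a residual on activations instead of attaching it to a specific weight matrix.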


4. Task-Attention Adapters

Key Idea: Use attention mechanisms within adapters to focus on task-relevant information dynamically.

Advantages:
Captures nuanced task-specific relationships.
Allows the adapter to adaptively decide which parts of the input to emphasize.
Applications: Multi-task and multi-domain models.

In [None]:
class TaskAttentionAdapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64):
        super(TaskAttentionAdapter, self).__init__()
        self.query = nn.Linear(input_dim, bottleneck_dim)
        self.key = nn.Linear(input_dim, bottleneck_dim)
        self.value = nn.Linear(input_dim, bottleneck_dim)
        self.output_layer = nn.Linear(bottleneck_dim, input_dim)

    def forward(self, x):
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
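        # Scaled dot-product attention over the sequence dimension; x has shape (batch, seq_len, input_dim)
        attention_weights = torch.softmax(Q @ K.transpose(-1, -2) / (Q.size(-1) ** 0.5), dim=-1)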
        attended = attention_weights @ V
        return x + self.output_layer(attended)  # Add residual connection


5. Mixture-of-Adapters (MoA)

Key Idea: Use a gating mechanism to dynamically route inputs to different adapters, enabling specialization for sub-tasks.

Advantages:
Allows the model to scale efficiently for multiple tasks.
Improves performance by specializing adapters for sub-groups.
Applications: Multilingual tasks or fine-grained user personalization.

In [None]:
class MixtureOfAdapters(nn.Module):
    def __init__(self, input_dim, bottleneck_dim=64, num_adapters=4):
        super(MixtureOfAdapters, self).__init__()
        self.adapters = nn.ModuleList([Adapter(input_dim, bottleneck_dim) for _ in range(num_adapters)])
        self.gate = nn.Linear(input_dim, num_adapters)

    def forward(self, x):
        adapter_weights = torch.softmax(self.gate(x), dim=-1)  # Gating scores: (..., num_adapters)
        outputs = torch.stack([adapter(x) for adapter in self.adapters], dim=-1)  # (..., input_dim, num_adapters)
        return (outputs * adapter_weights.unsqueeze(-2)).sum(dim=-1)  # Weighted sum of adapters


6. Prompt-Tuning with Adapters

Key Idea: Combine prompt-tuning (embedding-based task conditioning) with adapter modules to achieve lightweight adaptation.

Advantages:
Efficient for text generation or natural language understanding tasks.
Maintains high generalization with minimal overhead.
Applications: Few-shot learning and prompt engineering for LLMs.
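
A minimal sketch of the combination, assuming the soft prompt is a small matrix of trainable vectors prepended to the token embeddings and the adapter is the same bottleneck design used earlier. `PromptTunedAdapter` and all dimensions are hypothetical. In a real model the extended sequence would pass through the frozen transformer, with adapters inside its layers; here the adapter is applied directly to the embeddings for brevity.

```python
import torch
from torch import nn


class PromptTunedAdapter(nn.Module):
    """Prepends trainable soft-prompt vectors, then applies a bottleneck adapter."""

    def __init__(self, hidden_dim, num_prompt_tokens=8, bottleneck_dim=16):
        super().__init__()
        # Soft prompt: learned vectors that live in the embedding space
        self.soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim) * 0.02)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, token_embeddings, attention_mask):
        batch_size = token_embeddings.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        x = torch.cat([prompt, token_embeddings], dim=1)
        # Extend the mask so the prompt positions are always attended to
        prompt_mask = torch.ones(
            batch_size, prompt.size(1),
            dtype=attention_mask.dtype, device=attention_mask.device,
        )
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        x = x + self.up(torch.relu(self.down(x)))  # bottleneck adapter with residual
        return x, mask


module = PromptTunedAdapter(hidden_dim=32)
embeddings = torch.randn(4, 10, 32)            # (batch, seq_len, hidden_dim)
mask = torch.ones(4, 10, dtype=torch.long)
out, out_mask = module(embeddings, mask)
print(out.shape, out_mask.shape)
```

The sequence grows by the number of prompt tokens, so downstream components need to accept the longer input and the extended attention mask.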


### **Comparison of Methods**

| **Method**                 | **Parameters** | **Speed** | **Specialization** | **Best For**                                       |
|----------------------------|----------------|-----------|---------------------|---------------------------------------------------|
| **Parallel Adapters**       | Moderate       | Fast      | High               | Multitask learning                                |
| **HyperAdapters**           | Low            | Fast      | Very High          | Personalized or dynamic tasks                    |
| **LoRA**                    | Very Low       | Very Fast | High               | Fine-tuning large-scale LLMs                     |
| **Task-Attention Adapters** | Moderate       | Moderate  | Very High          | Multitask learning with nuanced relationships    |
| **Mixture-of-Adapters**     | High           | Moderate  | High               | Multilingual tasks or diverse sub-groups         |
| **Prompt-Tuning Adapters**  | Very Low       | Very Fast | Moderate           | Few-shot learning or lightweight text tasks      |

---

### **Choosing the Right Adapter Method**

1. **For Efficiency**: Use **LoRA** or **Prompt-Tuning** for large-scale models.
2. **For Flexibility**: Use **HyperAdapters** or **Task-Attention Adapters** for dynamic or multi-task scenarios.
3. **For Scalability**: Use **Mixture-of-Adapters** for multilingual or multi-domain tasks.

Each of these methods trades parameter count, inference cost, and degree of specialization differently, so the right choice depends on the model size and deployment constraints of the task at hand.

In summary, LoRA stands out as the most popular adapter method currently, owing to its efficiency and integration into widely used machine learning frameworks. Other methods like HyperAdapters and Task-Attention Adapters are emerging but have not yet reached the same level of adoption.