# Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters

**Note:** [This is a modified version of the article written by Sebastian Raschka on the Lightning AI blog.](https://lightning.ai/pages/community/article/understanding-llama-adapters/)

The purpose of this notebook is to learn how popular parameter-efficient finetuning methods for LLMs work: prefix tuning, adapters, and LLaMA-Adapter.

# 1 - Introduction

Benefits of parameter-efficient finetuning:

* **Reduced computational & resource footprint:** Reusing pre-trained models and fine-tuning a minimal number of parameters saves money, time, and energy.

* **Wider hardware compatibility:** Allows training on devices with limited power, like laptops, smartphones, and IoT.

* **Environmental sustainability:** Lower energy consumption reduces the carbon footprint of training large AI models.

# 2 - Finetuning LLMs

Since GPT-2 (Radford et al.) and GPT-3 (Brown et al.), we have seen that generative LLMs pretrained on a general text corpus are capable of **in-context learning**, which doesn’t require us to further train or finetune pretrained LLMs if we want to perform specific or new tasks that the LLM wasn’t explicitly trained on. Instead, we can directly provide a few examples of a target task via the input prompt, as illustrated in the example below:

<table>
    <tr>
        <td><img src="./images_1/in_context_learning.png" width="400"/></td>
    </tr>
</table>

In-context learning is a valuable and user-friendly method **for situations where direct access to the LLM is limited**, such as when interacting with the LLM through an API or user interface.
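As a minimal sketch, a few-shot prompt like the one described above can be assembled as plain text. The helper name `build_few_shot_prompt`, the `Input:`/`Output:` layout, and the example pairs are illustrative assumptions, not a format required by any particular LLM.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final entry leaves the output blank for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical English-to-German examples, chosen only for illustration.
examples = [
    ("good morning", "Guten Morgen"),
    ("thank you", "Danke"),
]
prompt = build_few_shot_prompt("Translate English to German.", examples, "cheese")
print(prompt)
```

The resulting string would be sent to the model as-is; no model weights are updated at any point.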

However, if we have access to the LLM, adapting and finetuning it on a target task using data from a target domain usually leads to superior results. So, how can we adapt a model to a target task? There are three conventional approaches:

<table>
    <tr>
        <td><img src="./images_1/fine_tuning_llms.png" width="700"/></td>
    </tr>
</table>

All three methods are compatible with generative (decoder-style) models such as GPT and embedding-focused (encoder-style) models such as BERT. 

In contrast to these three approaches, in-context learning only applies to generative models. It's also worth highlighting that when we finetune generative models, we work with and build on the embeddings they create instead of the generated output texts.

## 2.1 - Feature-based approach

In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model.

While this approach is particularly common for embedding-focused models like BERT, we can also extract embeddings from generative GPT-style models.

The classification model can then be a logistic regression/softmax model, a random forest, or XGBoost - whatever our hearts desire. However, based on my experience, linear classifiers like logistic regression perform best here.

### 2.1.1 - Example

[In this example, we are using the embeddings from a pretrained transformer to train a random forest and logistic regression model in scikit-learn.](https://github.com/rasbt/blog-finetuning-llama-adapters/blob/main/three-conventional-methods/1_distilbert-feature-extractor.ipynb)

<table>
    <tr>
        <td><img src="./images_1/1_feature-based.png" width="400"/></td>
    </tr>
</table>

In theory, finetuning only the output layers (finetuning I, covered in Section 2.2) should perform similarly well, in terms of modeling performance and speed, as the feature-based approach since both use the same frozen backbone model. However, since the feature-based approach makes it slightly easier to pre-compute and store the embedded features for the training dataset, it may be more convenient in specific practical scenarios.
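To illustrate what pre-computing and storing the embedded features could look like, here is a small sketch that writes features to disk once and reloads them later. The random arrays stand in for real embeddings and labels, and the file name `train_features.npz` is a hypothetical choice.

```python
import os
import tempfile

import numpy as np

# Stand-ins for real embeddings and labels (random values, illustrative shapes).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 768)).astype(np.float32)
labels = rng.integers(0, 2, size=8)

# Store the features once ...
cache_path = os.path.join(tempfile.mkdtemp(), "train_features.npz")
np.savez_compressed(cache_path, features=features, labels=labels)

# ... and reload them later without running the transformer again.
with np.load(cache_path) as cached:
    X_cached = cached["features"]
    y_cached = cached["labels"]

print(X_cached.shape, y_cached.shape)
```

Any classifier can then be trained on the cached arrays, so trying several classifiers costs only one pass through the frozen backbone.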

In [1]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


#### Loading the data

We are going to use the IMDB dataset.

In [2]:
import os.path as op

from datasets import load_dataset

import lightning as L
from lightning.pytorch.loggers import CSVLogger
from lightning.pytorch.callbacks import ModelCheckpoint

import numpy as np
import pandas as pd
import torch

from sklearn.feature_extraction.text import CountVectorizer

from local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset
from local_dataset_utilities import IMDBDataset



In [3]:
# These lines of code take more than 3.5 h to execute
# download_dataset()
# df = load_dataset_into_to_dataframe()
# partition_dataset(df)

In [4]:
df_train = pd.read_csv("train.csv")
df_val = pd.read_csv("val.csv")
df_test = pd.read_csv("test.csv")

In [5]:
# Sample to make the execution faster
df_train_sample = df_train.sample(3500)
df_val_sample = df_val.sample(500)
df_test_sample = df_test.sample(500)

df_train_sample.to_csv("train_sample.csv")
df_val_sample.to_csv("val_sample.csv")
df_test_sample.to_csv("test_sample.csv")

#### Tokenization

In [6]:
imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train_sample.csv",
        "validation": "val_sample.csv",
        "test": "test_sample.csv",
    },
)

print(imdb_dataset)

Generating train split: 3500 examples [00:00, 74391.96 examples/s]
Generating validation split: 500 examples [00:00, 48515.99 examples/s]
Generating test split: 500 examples [00:00, 43668.83 examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'index', 'text', 'label'],
        num_rows: 3500
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'index', 'text', 'label'],
        num_rows: 500
    })
    test: Dataset({
        features: ['Unnamed: 0', 'index', 'text', 'label'],
        num_rows: 500
    })
})





In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)

Tokenizer input max length: 512
Tokenizer vocabulary size: 30522


In [8]:
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

In [9]:
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)

Map: 100%|██████████| 3500/3500 [00:00<00:00, 3897.64 examples/s]


In [10]:
del imdb_dataset

#### Using DistilBERT as a Feature extractor

In [11]:
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
model.to(device)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

In [12]:
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [13]:
test_batch = {"attention_mask": imdb_tokenized["train"][:3]["attention_mask"].to(device),
              "input_ids": imdb_tokenized["train"][:3]["input_ids"].to(device)}

with torch.inference_mode():
    test_output = model(**test_batch)
    
test_output.last_hidden_state.shape

torch.Size([3, 512, 768])

In [14]:
cls_token_output = test_output.last_hidden_state[:, 0]
cls_token_output.shape

torch.Size([3, 768])

In [15]:
@torch.inference_mode()
def get_output_embeddings(batch):
    output = model(
        batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device)).last_hidden_state[:, 0]
    return {"features": output.cpu().numpy()}

In [16]:
import time
start = time.time()

imdb_features = imdb_tokenized.map(get_output_embeddings, batched=True, batch_size=10)

Map: 100%|██████████| 3500/3500 [00:47<00:00, 74.04 examples/s]


In [17]:
X_train = np.array(imdb_features["train"]["features"])
y_train = np.array(imdb_features["train"]["label"])

X_val = np.array(imdb_features["validation"]["features"])
y_val = np.array(imdb_features["validation"]["label"])

X_test = np.array(imdb_features["test"]["features"])
y_test = np.array(imdb_features["test"]["label"])

#### Train model on Embeddings (extracted features)

In [18]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Training accuracy", clf.score(X_train, y_train))
print("Validation accuracy", clf.score(X_val, y_val))
print("test accuracy", clf.score(X_test, y_test))

end = time.time()
elapsed = end - start
print(f"Time elapsed {elapsed/60:.2f} min")

Training accuracy 0.9
Validation accuracy 0.868
test accuracy 0.848
Time elapsed 1.07 min


In [19]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

print("Training accuracy", clf.score(X_train, y_train))
print("Validation accuracy", clf.score(X_val, y_val))
print("test accuracy", clf.score(X_test, y_test))

Training accuracy 1.0
Validation accuracy 0.842
test accuracy 0.802


## 2.2 - Finetuning I - Updating the output layers

Another popular approach is finetuning the output layers (we will refer to this approach as *finetuning I*). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. We only train the newly added output layers, analogous to training a logistic regression classifier or small multilayer perceptron on the embedded features.

<table>
    <tr>
        <td><img src="./images_1/2_finetune-last.png" width="400"/></td>
    </tr>
</table>

### 2.2.1 - Example

In this example, we are also going to use the IMDB dataset, but **we are going to assume that we already have the tokenized version from before**...

In [20]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#### Set up DataLoaders

In [21]:
from torch.utils.data import DataLoader, Dataset


class IMDBDataset(Dataset):
    def __init__(self, dataset_dict, partition_key="train"):
        self.partition = dataset_dict[partition_key]

    def __getitem__(self, index):
        return self.partition[index]

    def __len__(self):
        return self.partition.num_rows

In [22]:
train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True, 
    num_workers=4
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=12,
    num_workers=4
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    num_workers=4
)

#### Initializing DistilBERT

In [23]:
from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Freeze all layers**

In [24]:
for param in model.parameters():
    param.requires_grad = False

**Unfreeze the last two layers**

In [25]:
for param in model.pre_classifier.parameters():
    param.requires_grad = True

for param in model.classifier.parameters():
    param.requires_grad = True
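As a quick sanity check, the number of parameters in the two unfrozen layers can be worked out by hand. This sketch assumes `pre_classifier` is a 768-to-768 linear layer and `classifier` is a 768-to-2 linear layer, both with biases; the helper `linear_params` is a hypothetical name.

```python
hidden_dim = 768  # DistilBERT hidden size, as shown in the model printout
num_labels = 2


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


pre_classifier_params = linear_params(hidden_dim, hidden_dim)
classifier_params = linear_params(hidden_dim, num_labels)
trainable_params = pre_classifier_params + classifier_params

print(pre_classifier_params, classifier_params, trainable_params)
```

Under these assumptions the total comes to 592,130, which is consistent with the roughly 592 K trainable parameters that Lightning reports when training starts.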

#### Finetuning

In [26]:
import lightning as L
import torch
import torchmetrics


class CustomLightningModule(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)
        
    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])        
        self.log("train_loss", outputs["loss"])
        return outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])        
        self.log("val_loss", outputs["loss"], prog_bar=True)
        
        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.val_acc(predicted_labels, batch["label"])
        self.log("val_acc", self.val_acc, prog_bar=True)
        
    def test_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])        
        
        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.test_acc(predicted_labels, batch["label"])
        self.log("accuracy", self.test_acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer
    

lightning_model = CustomLightningModule(model)

In [27]:
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger


callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="finetuning-last")

In [28]:
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="gpu",
    precision="16-mixed",
    devices=[0],
    logger=logger,
    log_every_n_steps=10,
)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [29]:
import time
start = time.time()

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)

end = time.time()
elapsed = end - start
print(f"Time elapsed {elapsed/60:.2f} min")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type                                | Params
-----------------------------------------------------------------
0 | model    | DistilBertForSequenceClassification | 67.0 M
1 | val_acc  | MulticlassAccuracy                  | 0     
2 | test_acc | MulticlassAccuracy                  | 0     
-----------------------------------------------------------------
592 K     Trainable params
66.4 M    Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)


Epoch 0: 100%|██████████| 292/292 [00:22<00:00, 13.01it/s, v_num=3, val_loss=0.508, val_acc=0.818]

In [None]:
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")

In [None]:
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")

In [None]:
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")

## 2.3 - Finetuning II - Updating all layers

The original BERT paper reported that finetuning only the output layer can result in modeling performance comparable to finetuning all layers, which is substantially more expensive since more parameters are involved. For instance, a BERT base model has approximately 110 million parameters. However, the final layer of a BERT base model for binary classification consists of merely about 1,500 parameters. Furthermore, the last two layers of a BERT base model account for roughly 590,000 parameters, which is only around 0.5% of the total model size.

Our mileage will vary based on how similar our target task and target domain are to the dataset the model was pretrained on. But in practice, finetuning all layers almost always results in superior modeling performance.

So, when optimizing the modeling performance, the gold standard for using pretrained LLMs is to update all layers (here referred to as finetuning II). Conceptually, finetuning II is very similar to finetuning I. The only difference is that we do not freeze the parameters of the pretrained LLM but finetune them as well.

<table>
    <tr>
        <td><img src="./images_1/3_finetune-all.png" width="400"/></td>
    </tr>
</table>

### 2.3.1 - Example

In this example, we are also going to use the IMDB dataset, but **we are going to assume that we already have the tokenized version from before**. In addition, we can also use the dataloaders previously defined...

#### Initializing DistilBERT

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

In this case, all layers remain unfrozen (trainable).

In [None]:
lightning_model = CustomLightningModule(model)

In [None]:
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger


callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="finetuning-full")

In [None]:
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="gpu",
    precision="16-mixed",
    devices=[0],
    logger=logger,
    log_every_n_steps=10,
)

In [None]:
import time
start = time.time()

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)

end = time.time()
elapsed = end - start
print(f"Time elapsed {elapsed/60:.2f} min")

In [None]:
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")

In [None]:
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")

In [None]:
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")

-----

#### Results

**With our subset of the data (for faster execution):**
* Feature-based approach with random forest: **80.2%** test accuracy
* Feature-based approach with logistic regression: **84.8%** test accuracy
* Finetuning I, updating the last 2 layers: **TODO%** accuracy
* Finetuning II, updating all layers: **TODO%** accuracy.

**With ALL of the data (not sampled):**

* Feature-based approach with random forest: **83%** test accuracy
* Feature-based approach with logistic regression: **87%** test accuracy
* Finetuning I, updating the last 2 layers: **87%** accuracy
* Finetuning II, updating all layers: **92%** accuracy.

These results are consistent with the general rule of thumb that finetuning more layers often results in better performance, but it comes with increased cost.

<table>
    <tr>
        <td><img src="./images_1/performance_comparison.png" width="600"/></td>
    </tr>
</table>

# 3 - Parameter-efficient finetuning

In the previous sections, we learned that finetuning more layers usually leads to better results. Now, the experiments above are based on a DistilBERT model, which is relatively small. What if we want to finetune larger models that only barely fit into GPU memory, for example, the latest generative LLMs?

In that case, we can use the feature-based or finetuning I approach as above. But suppose we want modeling quality similar to that of finetuning II?

[Over the years, researchers developed several techniques (Lialin et al., 2023) to finetune LLMs with high modeling performance while requiring the training of only a small number of parameters. These methods are usually referred to as parameter-efficient finetuning techniques (PEFT).](https://arxiv.org/abs/2303.15647)

Some of the most widely used PEFT techniques are summarized in the figure below.

<table>
    <tr>
        <td><img src="./images_1/performance_comparison.png" width="600"/></td>
    </tr>
</table>

One PEFT technique that made big waves in 2023 was the LLaMA-Adapter, which was proposed for Meta's popular LLaMA model ([Touvron et al., 2023](https://arxiv.org/abs/2302.13971)). However, while LLaMA-Adapter was proposed in the context of LLaMA, the idea is model-agnostic.

To understand how LLaMA-Adapter works, we have to take a (small) step back and review two related techniques called *prefix tuning* and *adapters* - LLaMA-Adapter ([Zhang et al., 2023](https://arxiv.org/abs/2303.16199)) combines and extends both of these ideas.

## 3.1 - Prompt tuning and prefix tuning

The original concept of prompt tuning refers to techniques that vary the input to achieve better modeling results. For example, suppose we are interested in translating an English sentence into German. We can ask the model in various ways, as illustrated below.

<table>
    <tr>
        <td><img src="./images_1/hard-prompting.png" width="600"/></td>
    </tr>
</table>

Now, this concept illustrated above is referred to as ***hard* prompt tuning** since we directly change the discrete input tokens, which are not differentiable.

In contrast, ***soft* prompt tuning** concatenates the embeddings of the input tokens with a trainable tensor that can be optimized via backpropagation to improve the modeling performance on a target task.
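A minimal PyTorch sketch of this idea follows. The class name `SoftPromptEmbedding`, the toy vocabulary size, and the initialization scale are assumptions made for illustration; the pretrained embedding layer is frozen and only the soft prompt tensor is trainable.

```python
import torch
import torch.nn as nn


class SoftPromptEmbedding(nn.Module):
    """Prepends a trainable soft prompt to frozen token embeddings."""

    def __init__(self, embedding, num_prompt_tokens):
        super().__init__()
        self.embedding = embedding
        # Freeze the (pretrained) token embeddings.
        for param in self.embedding.parameters():
            param.requires_grad = False
        # The trainable tensor that is optimized via backpropagation.
        self.soft_prompt = nn.Parameter(
            torch.randn(num_prompt_tokens, embedding.embedding_dim) * 0.02
        )

    def forward(self, input_ids):
        token_embeds = self.embedding(input_ids)  # (batch, seq_len, dim)
        prompt = self.soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        # Concatenate along the sequence dimension.
        return torch.cat([prompt, token_embeds], dim=1)


torch.manual_seed(0)
embedding = nn.Embedding(100, 16)  # toy stand-in for a pretrained embedding table
layer = SoftPromptEmbedding(embedding, num_prompt_tokens=5)
input_ids = torch.randint(0, 100, (2, 7))
out = layer(input_ids)
print(out.shape)
```

The output sequence is longer than the input by the number of prompt tokens, so the downstream model sees the soft prompt as if it were extra input tokens.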

A specific flavor of prompt tuning is [**prefix tuning** (Li and Liang, 2021)](https://arxiv.org/abs/2101.00190). 

The idea in prefix tuning is to [**add a trainable tensor to each transformer block** instead of only the input embeddings, as in *soft* prompt tuning](https://medium.com/@musicalchemist/prefix-tuning-lightweight-adaptation-of-large-language-models-for-customized-natural-language-a8a93165c132). The following figure illustrates the difference between a regular transformer block and a transformer block modified with a prefix.

<table>
    <tr>
        <td><img src="./images_1/prefix-tuning.png" width="700"/></td>
    </tr>
</table>

Note that in the figure above, the "fully connected layers" refer to a small multilayer perceptron (two fully connected layers with a nonlinear activation function in-between). These fully connected layers embed the soft prompt in a feature space with the same dimensionality as the transformer-block input to ensure compatibility for concatenation. 
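The following sketch mirrors this simplified view in PyTorch: a trainable soft prompt is passed through a small MLP and concatenated with the input of a frozen transformer block. `PrefixedBlock`, `prefix_mlp`, and all dimensions are hypothetical, and `nn.TransformerEncoderLayer` is only a stand-in for a pretrained block.

```python
import torch
import torch.nn as nn


class PrefixedBlock(nn.Module):
    """Wraps a frozen transformer block and prepends a learned prefix to its input."""

    def __init__(self, block, embed_dim, prefix_len, prefix_hidden_dim):
        super().__init__()
        self.block = block
        for param in self.block.parameters():
            param.requires_grad = False  # the pretrained block stays frozen
        self.soft_prompt = nn.Parameter(torch.randn(prefix_len, prefix_hidden_dim) * 0.02)
        # Two fully connected layers with a nonlinearity in between; they map
        # the soft prompt to the same dimensionality as the block input.
        self.prefix_mlp = nn.Sequential(
            nn.Linear(prefix_hidden_dim, prefix_hidden_dim),
            nn.Tanh(),
            nn.Linear(prefix_hidden_dim, embed_dim),
        )

    def forward(self, x):
        prefix = self.prefix_mlp(self.soft_prompt)           # (prefix_len, embed_dim)
        prefix = prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        x = torch.cat([prefix, x], dim=1)                    # concat along sequence
        return self.block(x)


torch.manual_seed(0)
block = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
prefixed = PrefixedBlock(block, embed_dim=16, prefix_len=4, prefix_hidden_dim=8)
x = torch.randn(2, 7, 16)
out = prefixed(x)
print(out.shape)
```

In a full model, each transformer block would get its own wrapper of this kind, so every block has its own trainable prefix while the pretrained weights stay untouched.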


#### Benefits of Prefix Tuning:

* **Less Memory Intensive:** Storing just the prefix saves space compared to a full fine-tuned model.
* **Faster Training:** Only optimizing a small set of parameters speeds up the process.
* **Easy Adaptation:** Quickly adjust the prefix for different tasks with minimal effort.

Think of it this way: Instead of teaching the whole class a new topic, you provide a focused guide (the prefix) to help them understand it quickly.

#### Limitations:

* Prefix tuning might not achieve the same accuracy as full fine-tuning in all cases.
* Finding the optimal prefix length and design requires some experimentation.

## 3.2 - Adapters

The original *adapter* method ([Houlsby et al., 2019](https://arxiv.org/abs/1902.00751)) is somewhat related to the aforementioned *prefix tuning* as it also adds additional parameters to each transformer block. However, instead of prepending prefixes to the input embeddings, the adapter method adds adapter layers in two places, as illustrated in the figure below:

<table>
    <tr>
        <td><img src="./images_1/adapter-outline.png" width="700"/></td>
    </tr>
</table>

Or in Python pseudo-code, the adapter layer can be written as follows:

<table>
    <tr>
        <td><img src="./images_1/adapter-pseudocode.png" width="400"/></td>
    </tr>
</table>

Note that the fully connected layers of the adapters are usually relatively small and have a bottleneck structure similar to autoencoders. Each adapter block's first fully connected layer projects the input down onto a low-dimensional representation. The second fully connected layer projects this representation back up to the input dimension. How is this parameter-efficient?

For example, assume the first fully connected layer projects a 1024-dimensional input down to 24 dimensions, and the second fully connected layer projects it back into 1024 dimensions. This means we introduced `1,024 * 24 + 24 * 1,024 = 49,152` weight parameters.

In contrast, a single fully connected layer that reprojects a 1024-dimensional input into a 1024-dimensional space would have `1,024 * 1,024 = 1,048,576` parameters.
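The bottleneck arithmetic above can be checked with a small PyTorch sketch. The `Adapter` class, the GELU activation, and the internal skip connection are assumptions for illustration rather than a faithful reproduction of any specific implementation; the count below covers weights only, matching the numbers in the text.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back up."""

    def __init__(self, dim=1024, bottleneck_dim=24):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x):
        # Skip connection around the bottleneck keeps the input signal intact.
        return x + self.up(self.act(self.down(x)))


adapter = Adapter(dim=1024, bottleneck_dim=24)

# Weight matrices only (biases excluded).
weight_params = adapter.down.weight.numel() + adapter.up.weight.numel()
full_layer_params = 1024 * 1024

x = torch.randn(2, 10, 1024)
out = adapter(x)
print(weight_params, full_layer_params, out.shape)
```

With these sizes, the adapter has 49,152 weights versus 1,048,576 for a full 1024-to-1024 layer, and the output keeps the shape of the input so the adapter can be dropped into an existing block.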

According to [the original adapter paper](https://arxiv.org/abs/1902.00751), a BERT model trained with the adapter method reaches a modeling performance comparable to a fully finetuned BERT model while only requiring the training of 3.6% of the parameters.

Now, the question is how the adapter method compares to prefix tuning. Based on [the original prefix tuning paper](https://arxiv.org/abs/2101.00190), the adapter method performed slightly worse than the prefix tuning method **when 0.1% of the total number of model parameters were tuned**. However, **when the adapter method is used to tune 3% of the model parameters, the method ties with prefix tuning** of 0.1% of the model parameters. So, we may conclude that the prefix tuning method is the more efficient of the two.

# 4 - Extending prefix tuning and adapters: LLaMA-Adapter

Extending the ideas of prefix tuning and the original adapter method, researchers recently proposed [LLaMA-Adapter (Zhang et al., 2023)](https://arxiv.org/abs/2303.16199), a parameter-efficient finetuning method for LLaMA.

Like prefix tuning, the LLaMA-Adapter method prepends tunable prompt tensors to the embedded inputs. It's worth noting that in the LLaMA-Adapter method, the prefix is learned and maintained within an embedding table rather than being provided externally. Each transformer block in the model has its own distinct learned prefix, allowing for more tailored adaptation across different model layers.

In addition, LLaMA-Adapter introduces a zero-initialized attention mechanism coupled with gating. The motivation behind this so-called *zero-init* attention and gating is that adapters and prefix tuning could potentially disrupt the linguistic knowledge of the pretrained LLM by incorporating randomly initialized tensors (prefix prompts or adapter layers), resulting in unstable finetuning and high loss values during initial training phases.

Another difference compared to prefix tuning and the original adapter method is that LLaMA-Adapter adds the learnable adaption prompts only to the $L$ topmost transformer layers instead of all transformer layers. The authors argue that this approach enables more effective tuning of language representations focusing on higher-level semantic information.

While the basic idea of the LLaMA-Adapter method is related to prefix tuning (prepending tunable soft prompts, i.e., tensors), there are some additional subtle differences in how this is implemented. For instance, only a self-attention input's key and value sequences are modified via the tunable soft prompt. Then, depending on the gating factor (which is set to zero at the beginning of the training), the prefix-modified attention is either used or not. This concept is illustrated in the following visualization:

<table>
    <tr>
        <td><img src="./images_1/llama-adapter.png" width="600"/></td>
    </tr>
</table>

In pseudo-code, we may express this as follows:

<table>
    <tr>
        <td><img src="./images_1/llama-adapter-pseudocode.png" width="600"/></td>
    </tr>
</table>
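As a runnable illustration of the gating idea, here is a simplified single-head sketch in PyTorch. It is not the reference implementation: the class name `GatedPrefixAttention`, the absence of a causal mask, and the additive combination of the two attention outputs are simplifying assumptions. The key property it demonstrates is that a zero-initialized gate leaves the original attention output unchanged at the start of training.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedPrefixAttention(nn.Module):
    """Single-head self-attention plus a zero-gated attention over a learned prefix."""

    def __init__(self, embed_dim, prefix_len):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Learned prefix for this block, randomly initialized.
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        # Gating factor starts at zero, so the prefix has no effect initially.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scale = 1.0 / math.sqrt(q.size(-1))

        # Regular self-attention over the input tokens (no causal mask, for brevity).
        attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        out = attn @ v

        # The prefix only contributes keys and values; queries come from the input.
        prefix = self.prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        prefix_k, prefix_v = self.k_proj(prefix), self.v_proj(prefix)
        prefix_attn = F.softmax(q @ prefix_k.transpose(-2, -1) * scale, dim=-1)
        prefix_out = prefix_attn @ prefix_v

        # The gate controls how much of the prefix-based attention is mixed in.
        return out + self.gate * prefix_out


torch.manual_seed(0)
layer = GatedPrefixAttention(embed_dim=16, prefix_len=4)
x = torch.randn(2, 5, 16)
out = layer(x)
print(out.shape)
```

During finetuning, only `prefix` and `gate` would be trained while the projection weights stay frozen, so the prefix is blended in gradually as the gate moves away from zero.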

In short, the differences between LLaMA-Adapter and regular prefix tuning are:

* LLaMA-Adapter only modifies the topmost (i.e., the last few, closest to the output) transformer blocks.
* LLaMA-Adapter introduces a gating mechanism to stabilize the training.
  
While the researchers specifically experiment with LLaMA, their proposed Adapter method is a general method that can also be applied to other types of LLMs (like GPT).

Using the LLaMA-Adapter approach, the researchers were able to finetune a 7 billion parameter LLaMA model in only 1 hour (using A100 GPUs) on a dataset consisting of 52k instruction pairs. Furthermore, the finetuned LLaMA-Adapter model outperformed all other models compared in this study on question-answering tasks, while only 1.2M parameters (the adapter layers) needed to be finetuned.

[If you want to check out the LLaMA-Adapter method, you can find the original implementation on top of the GPL-licensed LLaMA code here.](https://github.com/ZrrSkywalker/LLaMA-Adapter)

[Alternatively, if your use cases are incompatible with the GPL license, which requires you to open source all derivative works under a similar license, check out the Lit-LLaMA GitHub repository. Lit-LLaMA is a readable implementation of LLaMA on top of the Apache-licensed nanoGPT code, which has less restrictive licensing terms.](https://github.com/Lightning-AI/lit-llama)