---
title: "PEFT - Adapter Tuning"
description: Parameter-efficient fine-tuning using adapters
author: "Uday"
date: "2024-09-18"
categories: [NLP, PEFT, Fine Tuning]
image: "images/adapters_1.png"
---

Large pre-trained language models (e.g., BERT, GPT) have revolutionized NLP tasks by leveraging massive amounts of unlabeled data. Transfer learning involves first pre-training these models on large corpora and then fine-tuning them on smaller, task-specific datasets. However, fine-tuning all the parameters of a model like BERT is computationally expensive and inefficient, particularly when there are multiple downstream tasks, because each task then needs its own full copy of the fine-tuned weights.

# Adapter Layers

- `Adapters` are small, task-specific layers added between the layers of the pre-trained model.
- Instead of fine-tuning all the parameters of the model, only the parameters of the adapter layers are updated during training for a specific task. The rest of the model's parameters remain frozen.
- This method significantly reduces the number of trainable parameters and, thus, the computational cost of fine-tuning.

# Adapter Design

![Adapter Design](images/adapters_1.png "https://arxiv.org/pdf/1902.00751")


- Each adapter consists of a `down-projection`, a `non-linearity`, and an `up-projection`, as shown in the image above.
- The down-projection reduces the dimensionality of the intermediate layer activations, and the up-projection restores it, thus keeping the adapter small and efficient.
- The adapters first project the original d-dimensional features into a smaller dimension, m, apply a nonlinearity, then project back to d dimensions. 
- The total number of parameters added per layer, including biases, is therefore `2md + d + m`: `md + m` for the down-projection and `md + d` for the up-projection.
- By setting `m << d`, we limit the number of parameters added per task.
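
The bottleneck described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the class name `Adapter`, the ReLU non-linearity, the sizes `d = 768` and `m = 64`, and the near-zero initialization scale are all assumptions chosen for the example. The skip connection around the bottleneck follows the adapter paper (reference 1), where it lets a near-zero-initialized adapter start out close to the identity function.

```python
import numpy as np


class Adapter:
    """Bottleneck adapter: down-projection, non-linearity, up-projection, plus a skip connection."""

    def __init__(self, d, m, rng):
        # Near-zero weights keep the adapter close to the identity at the start.
        self.w_down = rng.normal(0.0, 1e-3, size=(d, m))
        self.b_down = np.zeros(m)
        self.w_up = rng.normal(0.0, 1e-3, size=(m, d))
        self.b_up = np.zeros(d)

    def __call__(self, h):
        # Project d -> m and apply ReLU (an illustrative choice of non-linearity).
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)
        # Project m -> d and add the input back (skip connection).
        return h + z @ self.w_up + self.b_up

    def num_parameters(self):
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


d, m = 768, 64  # illustrative sizes with m << d
adapter = Adapter(d, m, np.random.default_rng(0))

h = np.ones((2, 5, d))  # (batch, sequence, hidden)
out = adapter(h)

print(out.shape)                 # same shape as the input
print(adapter.num_parameters())  # matches 2*m*d + d + m
```

With these sizes the parameter count works out to `2 * 64 * 768 + 768 + 64 = 99136` per adapter, compared with `768 * 768 = 589824` weights for a single full `d x d` projection.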


# Adapter Fusion

- Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing.
- AdapterFusion addresses this by non-destructively composing multiple tasks. Rather than overwriting model parameters for each task, the method fuses information from different adapters to solve a new task.
- Algorithm:
	- Train a separate adapter for each task.
	- AdapterFusion then learns a weighted combination of all the previously trained adapters, as shown in the figure below.

![Adapter Fusion](images/adapters_4.png "https://arxiv.org/pdf/2005.00247")

- This fusion mechanism allows the model to leverage knowledge from all tasks in a modular fashion.
- The adapters themselves remain independent, and the fusion weights can be tuned to emphasize adapters that are most relevant for a specific task.
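
The weighted combination can be sketched as attention over the adapter outputs, which is the form used in the AdapterFusion paper (reference 2): the layer output acts as the query, and each adapter's output provides a key and a value. The code below is a simplified NumPy sketch under that assumption. The function name `adapter_fusion`, the toy sizes, and the random stand-ins for the adapter outputs are hypothetical, and no training loop is shown.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adapter_fusion(h, adapter_outputs, w_q, w_k, w_v):
    """Combine N adapter outputs with attention weights.

    h:               (tokens, d)     layer output, used as the query
    adapter_outputs: (N, tokens, d)  outputs of the N task adapters
    """
    q = h @ w_q                                  # (tokens, d)
    k = adapter_outputs @ w_k                    # (N, tokens, d)
    v = adapter_outputs @ w_v                    # (N, tokens, d)
    scores = np.einsum("td,ntd->tn", q, k)       # one score per token and adapter
    weights = softmax(scores, axis=-1)           # sums to 1 over the N adapters
    fused = np.einsum("tn,ntd->td", weights, v)  # weighted combination of values
    return fused, weights


rng = np.random.default_rng(0)
tokens, d, n_adapters = 4, 16, 3  # toy sizes for illustration

h = rng.normal(size=(tokens, d))
adapter_outputs = rng.normal(size=(n_adapters, tokens, d))  # stand-ins for real adapter outputs

# Only these fusion matrices would be trained in the fusion stage.
w_q = rng.normal(size=(d, d)) * 0.1
w_k = rng.normal(size=(d, d)) * 0.1
w_v = np.eye(d)

fused, weights = adapter_fusion(h, adapter_outputs, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how much a given token draws on each task adapter, which is what makes the composition inspectable and modular.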




References:

1. https://arxiv.org/pdf/1902.00751
2. https://arxiv.org/pdf/2005.00247