
Add support for Task Arithmetics #698

Open · wants to merge 13 commits into main
Conversation

@lenglaender (Member) commented on May 8, 2024

This PR adds support for various task arithmetic options for LoRA. Until now, our library supported averaging only by linearly combining different adapters. This may be insufficient, especially for LoRA — hence, several publications have proposed other ways to perform task arithmetic.

This PR:

  • makes it easier to implement different weighting methods
  • adds two additional merging methods for LoRA following these papers (see the usage sketch below)
  • adds a method for merging prediction heads
  • adds documentation & a notebook
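A rough usage sketch of the merging API described here (a sketch only, based on the diff below; the exact argument names such as `adapter_list`, `combine_strategy` and `svd_rank` may differ in the final version):

```python
import adapters
from transformers import AutoModel

# Load a base model and enable the adapters library on it.
model = AutoModel.from_pretrained("roberta-base")
adapters.init(model)

# Two LoRA adapters; in practice these would be trained on different tasks.
model.add_adapter("task_a", config="lora")
model.add_adapter("task_b", config="lora")

# Merge them into a new adapter using one of the strategies added in this PR.
# combine_strategy and svd_rank are taken from this PR's diff (assumed API).
model.average_adapter(
    adapter_name="merged",
    adapter_list=["task_a", "task_b"],
    weights=[0.7, 0.3],
    combine_strategy="lora_delta_w_svd",
    svd_rank=8,
)
model.set_active_adapters("merged")
```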

@lenglaender changed the title from "WIP: Add support for Task Arithmetics" to "Add support for Task Arithmetics" on Jul 4, 2024
@lenglaender marked this pull request as ready for review on July 4, 2024
@lenglaender requested review from calpt, TimoImhof and hSterz and removed the review request for calpt and TimoImhof on July 10, 2024
@calpt (Member) left a comment


This looks very good overall!

Looked over everything except for the notebook and left some comments.

@@ -57,6 +57,9 @@ cd adapters
pip install .
```

> **Note**: The _Adapters_ library has replaced the [`adapter-transformers`](https://github.com/adapter-hub/adapter-transformers-legacy) package. All previously trained adapters are compatible with the new library. For transitioning, please read: https://docs.adapterhub.ml/transitioning.html.

Why was this moved down? I'd prefer to keep it close to the top, since there might still be adapter-transformers users who are redirected to this page.

@@ -36,9 +36,9 @@ A Unified Library for Parameter-Efficient and Modular Transfer Learning
[![GitHub](https://img.shields.io/github/license/adapter-hub/adapters.svg?color=blue)](https://github.com/adapter-hub/adapters/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/adapters)](https://pypi.org/project/adapters/)

_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [various adapter methods](https://docs.adapterhub.ml/overview.html) into [state-of-the-art pre-trained language models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference.
_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference. _Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.

Suggested change
_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference. _Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.
_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference.
_Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.

@@ -156,7 +159,7 @@ Currently, adapters integrates all architectures and methods listed below:
| UniPELT | [Mao et al. (2022)](https://arxiv.org/pdf/2110.07577.pdf) | [Docs](https://docs.adapterhub.ml/method_combinations.html#unipelt) |
| Prompt Tuning | [Lester et al. (2021)](https://aclanthology.org/2021.emnlp-main.243/) | [Docs](https://docs.adapterhub.ml/methods.html#prompt-tuning) |
| QLoRA | [Dettmers et al. (2023)](https://arxiv.org/pdf/2305.14314.pdf) | [Notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb) |
| ReFT | [Wu et al. (2024)](https://arxiv.org/pdf/2404.03592) | [Docs](https://docs.adapterhub.ml/methods.html#reft) |
| ReFT | [Wu et al. (2024)](https://arxiv.org/pdf/2404.03592) | [Docs](https://docs.adapterhub.ml/methods.html#reft) | |

Add task arithmetics paper here?

As this process is typically not done dynamically at runtime, `adapters` provides `average_adapter()` as a dedicated method for parameter averaging.
In the example below, the parameters of the adapters `m`, `n` and `o` are averaged (with weights `0.1` `0.6` and `0.3`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration.
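The example referenced above is not part of this diff hunk; a minimal sketch of what the call looks like (assuming the existing `average_adapter()` signature with an explicit `weights` list):

```python
# Create a new adapter "avg" as the weighted average of "m", "n" and "o".
model.average_adapter(
    adapter_name="avg",
    adapter_list=["m", "n", "o"],
    weights=[0.1, 0.6, 0.3],
)
```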
### Merging Adapters

Maybe we can think about adding a separate doc page for merging to make it more discoverable?

            else:
                avg_state_dict[k] = zhang_weight * v

        elif combine_strategy == "lora_delta_w_svd":

Can we move this strategy to a helper method since the implementation is slightly lengthy?

def _average_shared_parameters(self, adapter_name: str, input_adapters: Dict[str, float], combine_strategy: str):
    if combine_strategy != "linear":
        raise ValueError(
            f"Combine strategy {combine_strategy} not supported for Compacter. Only 'linear' is supported."

Suggested change
f"Combine strategy {combine_strategy} not supported for Compacter. Only 'linear' is supported."
f"Combine strategy {combine_strategy} not supported for shared parameters. Only 'linear' is supported."

Comment on lines +1394 to +1395
def _get_head_config_hash(config):
    return get_adapter_config_hash({k: v for k, v in config.items() if k not in keys_to_ignore})

Couldn't this directly call the hash method with the new ignore param?

        except ValueError as ex:
            self.delete_adapter(adapter_name)
            raise ex
        if set_active:
            self.set_active_adapters(adapter_name)

    def average_head(

Should this method live in the flexible heads mixin since it only applies to those model classes?

self.assertTrue(torch.allclose(expected_A, merged_lora.lora_A, atol=1e-5))
self.assertTrue(torch.allclose(expected_B, merged_lora.lora_B, atol=1e-5))

# 2. if we merge multiple adapters with weight 0 except one adapter with weight 1, the resulting adapter should be the same as the adapter with weight 1

I feel like cases 1 and 2 should live in separate test classes here.

Comment on lines +218 to +236
if isinstance(module, LoRALayer):
    if f"{name}_0" in module.loras and f"{combine_strategy}_case1" in module.loras:
        original_lora = module.loras[f"{name}_0"]
        merged_lora = module.loras[f"{combine_strategy}_case1"]

        # Compute SVD of the original delta_w
        u, s, v = torch.svd(original_lora.delta_w)
        u = u[:, :svd_rank]
        s = s[:svd_rank]
        v = v[:, :svd_rank]

        # Reconstruct A and B matrices
        expected_A = v.t()
        expected_B = u @ torch.diag(s)

        # Compare with merged adapter
        self.assertTrue(torch.allclose(expected_A, merged_lora.lora_A, atol=1e-5))
        self.assertTrue(torch.allclose(expected_B, merged_lora.lora_B, atol=1e-5))


Maybe move the SVD part to a helper method, since it is repeated multiple times in this test class?
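One possible shape for such a helper (a sketch only, not part of this PR; the method name is hypothetical, the body mirrors the existing test code):

```python
def assert_lora_svd_merge(self, original_lora, merged_lora, svd_rank, atol=1e-5):
    # Truncated SVD of the original adapter's delta_w.
    u, s, v = torch.svd(original_lora.delta_w)
    u, s, v = u[:, :svd_rank], s[:svd_rank], v[:, :svd_rank]

    # Expected LoRA factors reconstructed from the truncated SVD.
    expected_A = v.t()
    expected_B = u @ torch.diag(s)

    self.assertTrue(torch.allclose(expected_A, merged_lora.lora_A, atol=atol))
    self.assertTrue(torch.allclose(expected_B, merged_lora.lora_B, atol=atol))
```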

@hSterz (Member) left a comment

Looks good to me. I just have some small questions.

@@ -94,28 +94,6 @@ def add_adapter(self, adapter_name: str, layer_idx: int) -> bool:

        return False

    def average_adapter(self, adapter_name: str, input_adapters: Dict[str, float]) -> bool:

Where did this method go?


class DebertaModelAdaptersMixin(BertModelAdaptersMixin):
    # Same as BERT, except that Deberta does not support the "lora_delta_w_svd" combine_strategy
    support_lora_delta_w_svd = False

Why does DeBERTa not support lora_delta_w_svd?

\Phi_{merged} = \sum_{i=0}^{N} \lambda_i \Phi_i
$$

2. `combine_strategy = "lora_linear_only_negate_b"`: Following [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html), this method applies negative weights only to the B matrix; the A matrix is always weighted with the absolute value of the weight:
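In formula form, this corresponds to (a sketch based on the description above; $A_i$, $B_i$ are the LoRA factors of adapter $i$ and $\lambda_i$ its weight):

$$
A_{merged} = \sum_{i=0}^{N} |\lambda_i| A_i, \qquad B_{merged} = \sum_{i=0}^{N} \lambda_i B_i
$$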

The "only" in the name is redundant. I would remove it to make the name shorter.

In the example below, the parameters of the adapters `m`, `n` and `o` are averaged (with weights `0.1` `0.6` and `0.3`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration.
### Merging Adapters
We can create new adapters by combining the parameters of multiple trained adapters, i.e. merging multiple existing adapters into a new one. The `average_adapter()` method provides this functionality:
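The concrete call is not shown in this hunk; a minimal sketch, including how a negative weight can be used to subtract a skill (the adapter names are hypothetical; the strategy name comes from this PR's docs):

```python
# Keep the "polite_style" skill and subtract part of the "toxic_style" skill.
model.average_adapter(
    adapter_name="detoxified",
    adapter_list=["polite_style", "toxic_style"],
    weights=[1.0, -0.5],
    combine_strategy="lora_linear_only_negate_b",
    normalize_weights=False,
)
```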

Maybe we can briefly say why this is a cool feature (averaging adapters trained on the same task can increase performance, and combining adapters for different tasks can strengthen or reduce specific skills/features).

                expected_new_head_weights[base_k] += weight * v

        # Average the heads
        model.average_head(

Where is this `average_head` method implemented?
