Add support for Task Arithmetics #698
base: main
Conversation
This looks very good overall!
Looked over everything except for the notebook and left some comments.
````diff
@@ -57,6 +57,9 @@ cd adapters
 pip install .
 ```
 
+> **Note**: The _Adapters_ library has replaced the [`adapter-transformers`](https://github.com/adapter-hub/adapter-transformers-legacy) package. All previously trained adapters are compatible with the new library. For transitioning, please read: https://docs.adapterhub.ml/transitioning.html.
````
Why was this moved down? I'd prefer to keep it close to the top, since there might still be `adapter-transformers` users who are redirected to this page.
```diff
@@ -36,9 +36,9 @@ A Unified Library for Parameter-Efficient and Modular Transfer Learning
 [![GitHub](https://img.shields.io/github/license/adapter-hub/adapters.svg?color=blue)](https://github.com/adapter-hub/adapters/blob/main/LICENSE)
 [![PyPI](https://img.shields.io/pypi/v/adapters)](https://pypi.org/project/adapters/)
 
-_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [various adapter methods](https://docs.adapterhub.ml/overview.html) into [state-of-the-art pre-trained language models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference.
+_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference. _Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.
```
Suggested change (split into two paragraphs):

```diff
-_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference. _Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.
+_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference.
+_Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a myriad of features like full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters) or the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), allowing advanced research in parameter-efficient transfer learning for NLP tasks.
```
```diff
@@ -156,7 +159,7 @@ Currently, adapters integrates all architectures and methods listed below:
 | UniPELT | [Mao et al. (2022)](https://arxiv.org/pdf/2110.07577.pdf) | [Docs](https://docs.adapterhub.ml/method_combinations.html#unipelt) |
 | Prompt Tuning | [Lester et al. (2021)](https://aclanthology.org/2021.emnlp-main.243/) | [Docs](https://docs.adapterhub.ml/methods.html#prompt-tuning) |
 | QLoRA | [Dettmers et al. (2023)](https://arxiv.org/pdf/2305.14314.pdf) | [Notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb) |
 | ReFT | [Wu et al. (2024)](https://arxiv.org/pdf/2404.03592) | [Docs](https://docs.adapterhub.ml/methods.html#reft) |
```
Add task arithmetics paper here?
As this process is typically not done dynamically at runtime, `adapters` provides `average_adapter()` as a dedicated method for parameter averaging.
In the example below, the parameters of the adapters `m`, `n` and `o` are averaged (with weights `0.1`, `0.6` and `0.3`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration.
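For context, such a call might look like the following minimal sketch (the argument names are assumptions inferred from the description above, not confirmed API):

```python
# Minimal sketch: model already holds trained adapters "m", "n", "o",
# all created with the same adapter configuration.
model.average_adapter(
    "avg",                    # name of the new, averaged adapter
    ["m", "n", "o"],          # adapters whose parameters are averaged
    weights=[0.1, 0.6, 0.3],  # averaging weights
)
```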
### Merging Adapters
Maybe we can think about adding a separate doc page for merging to make it more discoverable?
```python
            else:
                avg_state_dict[k] = zhang_weight * v

        elif combine_strategy == "lora_delta_w_svd":
```
Can we move this strategy to a helper method since the implementation is slightly lengthy?
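One possible shape for such a helper, sketched under the assumption that it receives the input LoRA modules, their weights, and the target rank (the name and plumbing are illustrative, mirroring the SVD reconstruction used in the tests below):

```python
import torch

def _average_lora_delta_w_svd(loras, weights, svd_rank):
    # Weighted sum of the full-rank update matrices (delta_w = B @ A) of the input LoRAs.
    delta_w = sum(w * lora.delta_w for lora, w in zip(loras, weights))
    # Re-factorize the merged update into rank-`svd_rank` factors via truncated SVD.
    u, s, v = torch.svd(delta_w)
    lora_A = v[:, :svd_rank].t()
    lora_B = u[:, :svd_rank] @ torch.diag(s[:svd_rank])
    return lora_A, lora_B
```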
```python
    def _average_shared_parameters(self, adapter_name: str, input_adapters: Dict[str, float], combine_strategy: str):
        if combine_strategy != "linear":
            raise ValueError(
                f"Combine strategy {combine_strategy} not supported for Compacter. Only 'linear' is supported."
            )
```
f"Combine strategy {combine_strategy} not supported for Compacter. Only 'linear' is supported." | |
f"Combine strategy {combine_strategy} not supported for shared parameters. Only 'linear' is supported." |
```python
def _get_head_config_hash(config):
    return get_adapter_config_hash({k: v for k, v in config.items() if k not in keys_to_ignore})
```
Couldn't this directly call the hash method with the new ignore param?
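Hypothetically, that could reduce the helper to a single delegated call (assuming the hash method gained an ignore parameter in this PR; its exact name here is an assumption):

```python
def _get_head_config_hash(config):
    # hypothetical: let the hash function itself filter out the ignored keys
    return get_adapter_config_hash(config, ignore_params=keys_to_ignore)
```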
```python
        except ValueError as ex:
            self.delete_adapter(adapter_name)
            raise ex
        if set_active:
            self.set_active_adapters(adapter_name)

    def average_head(
```
Should this method live in the flexible heads mixin since it only applies to those model classes?
```python
        self.assertTrue(torch.allclose(expected_A, merged_lora.lora_A, atol=1e-5))
        self.assertTrue(torch.allclose(expected_B, merged_lora.lora_B, atol=1e-5))

        # 2. if we merge multiple adapters with weight 0 except one adapter with weight 1,
        # the resulting adapter should be the same as the adapter with weight 1
```
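For reference, case 2 boils down to a call of this shape (a sketch; the adapter names, weights, and strategy value are illustrative):

```python
# Sketch: one-hot weights should make the merged adapter identical to "b".
model.average_adapter(
    "case2",
    ["a", "b", "c"],
    weights=[0.0, 1.0, 0.0],
    combine_strategy="lora_linear_only_negate_b",
)
```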
I feel like cases 1 and 2 should live in separate test classes here.
```python
if isinstance(module, LoRALayer):
    if f"{name}_0" in module.loras and f"{combine_strategy}_case1" in module.loras:
        original_lora = module.loras[f"{name}_0"]
        merged_lora = module.loras[f"{combine_strategy}_case1"]

        # Compute SVD of the original delta_w
        u, s, v = torch.svd(original_lora.delta_w)
        u = u[:, :svd_rank]
        s = s[:svd_rank]
        v = v[:, :svd_rank]

        # Reconstruct A and B matrices
        expected_A = v.t()
        expected_B = u @ torch.diag(s)

        # Compare with merged adapter
        self.assertTrue(torch.allclose(expected_A, merged_lora.lora_A, atol=1e-5))
        self.assertTrue(torch.allclose(expected_B, merged_lora.lora_B, atol=1e-5))
```
Maybe move the SVD part to a helper method, since it is repeated multiple times in this test class?
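A sketch of what that helper could look like (the name is an assumption; the body mirrors the repeated SVD check above):

```python
def assert_lora_svd_merge(self, original_lora, merged_lora, svd_rank):
    # Recompute the truncated SVD of the original update matrix ...
    u, s, v = torch.svd(original_lora.delta_w)
    expected_A = v[:, :svd_rank].t()
    expected_B = u[:, :svd_rank] @ torch.diag(s[:svd_rank])
    # ... and check that the merged adapter's factors match it.
    self.assertTrue(torch.allclose(expected_A, merged_lora.lora_A, atol=1e-5))
    self.assertTrue(torch.allclose(expected_B, merged_lora.lora_B, atol=1e-5))
```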
Looks good to me. I just have some small questions.
```diff
@@ -94,28 +94,6 @@ def add_adapter(self, adapter_name: str, layer_idx: int) -> bool:
 
         return False
 
-    def average_adapter(self, adapter_name: str, input_adapters: Dict[str, float]) -> bool:
```
Where did this method go?
```python
class DebertaModelAdaptersMixin(BertModelAdaptersMixin):
    # Same as BERT, except that Deberta does not support the "lora_delta_w_svd" combine_strategy
    support_lora_delta_w_svd = False
```
Why does DeBERTa not support `lora_delta_w_svd`?
$$
\Phi_{merged} = \sum_{i=0}^{N} \lambda_i \Phi_i
$$
2. `combine_strategy = "lora_linear_only_negate_b"`: Following [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html), this method only uses negative weights for the B-matrix if the weight is negative:
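Spelled out, the rule corresponds to the following combination (reconstructed from the description above: only the B matrices carry the sign of negative weights, while the A matrices are combined with absolute values):

$$
B_{merged} = \sum_{i=0}^{N} \lambda_i B_i, \qquad A_{merged} = \sum_{i=0}^{N} |\lambda_i| A_i
$$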
The `only` in the name is redundant. I would remove it to make it shorter.
In the example below, the parameters of the adapters `m`, `n` and `o` are averaged (with weights `0.1`, `0.6` and `0.3`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration.

### Merging Adapters

We can create new adapters by combining the parameters of multiple trained adapters, i.e. merging multiple existing adapters into a new one. The `average_adapter()` method provides this functionality:
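A call of the kind the docs introduce here might look as follows (a sketch; the adapter names, the negative weight, and the strategy value are illustrative of this PR's options, not taken from the docs):

```python
# Sketch: merge two trained LoRA adapters into a new one.
# A negative weight subtracts a task's "skill" (task arithmetic).
model.average_adapter(
    "merged",
    ["task_a", "task_b"],
    weights=[1.0, -0.4],
    combine_strategy="lora_linear_only_negate_b",
)
```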
Maybe we can briefly say why this is a cool feature (averaging adapters trained on the same task can increase performance, and combining adapters for different tasks can amplify or reduce specific skills/features).
```python
            expected_new_head_weights[base_k] += weight * v

        # Average the heads
        model.average_head(
```
Where is this `average_head` method implemented?
This PR adds support for various task arithmetic options for LoRA. Until now, our library supported averaging only by linearly combining different adapters. This can be insufficient, especially for LoRA; hence, several publications have proposed other ways to perform task arithmetic.
This PR: