
Add ReFT (LoReFT, NoReFT, DiReFT) #705

Merged · 20 commits merged into adapter-hub:main from dev/reft on Jul 1, 2024
Conversation

@calpt (Member) commented on May 31, 2024

This PR integrates multiple ReFT variants as new adapter methods.

Paper: https://arxiv.org/pdf/2404.03592
Original code: https://github.com/stanfordnlp/pyreft

Changes

Compatibility

Tested that Pyreft and Adapters produce the same outputs at inference by converting Pyreft checkpoints to Adapters checkpoints (tested settings: LoReFT, NoReFT, DiReFT, weight tying, prefix, suffix, rank; mostly using roberta-base).

Script for testing & checkpoint conversion here: https://github.com/calpt/pyreft/blob/main/compatibility.py.
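
For reference, the equivalence check in that script boils down to comparing hidden states of the two implementations on identical inputs. A minimal sketch of that pattern (model_a and model_b stand in for the Pyreft model and the converted Adapters model; they are assumptions here, not names from the script):

import torch
from transformers import AutoTokenizer

def outputs_match(model_a, model_b, tokenizer_name="roberta-base", atol=1e-5):
    # Compare the final hidden states of two models on the same batch.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    batch = tokenizer(["ReFT equivalence check."], return_tensors="pt")
    with torch.no_grad():
        h_a = model_a(**batch, output_hidden_states=True).hidden_states[-1]
        h_b = model_b(**batch, output_hidden_states=True).hidden_states[-1]
    return torch.allclose(h_a, h_b, atol=atol)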

Evaluation

Roberta-base with LoReFT on GLUE, using hyperparameters similar to those in the paper:

Task                      Score
CoLA (Matthews corr.)     53.95
MNLI (acc.)               83.23
MRPC (F1)                 91.70
QNLI (acc.)               90.94
QQP (acc.)                86.82
RTE (acc.)                76.53
SST-2 (acc.)              93.81
STS-B (Spearman corr.)    88.99

Todos

  • Modeling implementations
  • Add test methods
  • Make all checks pass
  • Add documentation
  • Make sure implementation produces same outputs as original code
  • Sanity check training runs

@frankaging commented:
@calpt Thanks for the PR! I took a quick look, and it looks promising. Here are two minor questions:

  1. It seems like tied and untied weights are handled here:
    https://github.com/adapter-hub/adapters/pull/705/files#diff-e791dd9c62ff127d32170821d7571b69e641f7ecc9462a58227fa5a80f3502f1R80

Does this mean that if I want untied weights between prefix and suffix, I would create two adapters and set self.prefix_positions or self.suffix_positions to None?

  2. Since in the ReFT paper we only run experiments where we add interventions to the residual stream (e.g., the transformer layer/block output), will ReftConfig assume this by default? If I want to add interventions to different streams (e.g., attention output, MLP up-projection), what would the ReftConfig look like?

Thanks!

@calpt (Member, Author) commented on May 31, 2024

@frankaging Thanks for looking over this! Re your questions:

Re 1: In the current implementation, one or two modules per layer will be created depending on whether the tied_weights attribute in the config is set or not. See here:

n_units = 1 if config.tied_weights else 2  # one shared unit if tied, separate prefix/suffix units otherwise
self.units = nn.ModuleList(
    [
        ReftUnit(
            in_features,
            config.r,
            config.orthogonality,
            config.subtract_projection,
            config.non_linearity,
            config.dropout,
        )
        for _ in range(n_units)
    ]
)

This makes it very easy for a user to tie or not tie weights when adding a single ReFT adapter, e.g.:

from adapters import AutoAdapterModel, ReftConfig

model = AutoAdapterModel.from_pretrained("...")

config = ReftConfig(
    layers="all", prefix_positions=1, suffix_positions=1, r=1,
    tied_weights=True  # True: share one intervention unit between prefix and suffix; False: use separate units
)
model.add_adapter("my_reft", config=config)
model.set_active_adapters("my_reft")

Re 2: Currently, the ReFT implementation always assumes interventions are added to the residual stream, as you explained, since this is the method proposed in the paper. This is done via a PyTorch forward hook here:

def init_reft(model):
    def hook_fn(module, args, output):
        # Apply the ReFT intervention to the layer's hidden states (first element of the output tuple)
        return (module.reft_layer(output[0]),) + output[1:]

    for _, layer in model.iter_layers():
        if not hasattr(layer, "reft_layer"):
            layer.reft_layer = ReftLayer(model.config, model.adapters_config)
            layer.register_forward_hook(hook_fn)

While no other intervention points are supported for now, the implementation can easily be extended with similar hooks for additional intervention points where it makes sense to do so.
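
To illustrate the idea, a similar hook targeting an attention sub-module could look roughly like the following sketch (the attribute name layer.attention is an assumption and varies by architecture; this is not code from this PR):

def init_attention_reft(model):
    # Hypothetical sketch: intervene on the attention output instead of the block output.
    def hook_fn(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = module.reft_layer(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    for _, layer in model.iter_layers():
        attn = getattr(layer, "attention", None)  # assumed attribute name
        if attn is not None and not hasattr(attn, "reft_layer"):
            attn.reft_layer = ReftLayer(model.config, model.adapters_config)
            attn.register_forward_hook(hook_fn)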

Thanks again for looking over this! Please let us know if you have any suggestions or ideas for what we should add or change in the first version!
Currently, the PR is still in a draft state. Once the remaining issues are fixed and we have some documentation, we'd be happy to get your feedback again before we merge.

@frankaging commented:
@calpt Thanks for your responses! It makes sense to me.

Will the hook work out of the box with accelerated training (e.g., DeepSpeed)? Are there any existing tests for this? Thanks!

@calpt (Member, Author) commented on Jun 1, 2024

@frankaging No extensive training tests at this point. DeepSpeed support in this library is unfortunately flaky in general and not really a focus at the moment, but using e.g. torch distributed or HF Accelerate should work.

@calpt marked this pull request as ready for review on June 8, 2024.
@hSterz (Member) left a comment:
This looks great! Just some small comments and questions.

Inline review comments on:
  • src/adapters/methods/adapter_layer_base.py (outdated, resolved)
  • src/adapters/methods/reft.py (outdated, resolved)
  • tests/methods/test_reft.py (resolved)
@@ -968,6 +972,17 @@ def forward_context(self, context: ForwardContext, *args, **kwargs):
        if hasattr(self.base_model, "prefix_tuning"):
            context.prefix_states = self.base_model.prefix_tuning(*args, **kwargs)

        # TODO this does not support padding on the left
A reviewer (Member) commented on this hunk:
Do we want to leave this TODO open?

@calpt (Member, Author) replied:
Added, please check if this is correct.

@calpt requested review from hSterz and frankaging on June 13, 2024.
@calpt marked this pull request as a draft on June 20, 2024.
@lenglaender (Member) left a comment:
Commented on some minor things; everything else looks good and correctly implemented to me.

Once the open comments are resolved and left padding is implemented, this is ready to merge.

Inline review comments on:
  • docs/methods.md (outdated, resolved)
  • docs/methods.md (outdated, resolved)
  • src/adapters/methods/adapter_layer_base.py (resolved)
  • tests/methods/test_reft.py (resolved)
@calpt marked this pull request as ready for review on June 22, 2024.
@lenglaender (Member) left a comment:
Looks good to me. This is ready to merge.

@calpt (Member, Author) commented on Jun 25, 2024

Thanks! I've added some quick training results on GLUE tasks to the description; based on those, the implementation looks good.

@frankaging re distributed training: I've verified it works with torch distributed and HF Accelerate via the Trainer class, e.g. for GLUE:

torchrun --standalone --nnodes=1 --nproc-per-node=2 examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path roberta-large \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length $SEQ \
  --pad_to_max_length False \
  --per_device_train_batch_size 32 \
  --learning_rate $LR \
  --warmup_ratio $WARMUP \
  --num_train_epochs $EPOCH \
  --output_dir output/$TASK_NAME \
  --overwrite_output_dir \
  --train_adapter \
  --adapter_config "loreft[prefix_positions=$POS]"

@frankaging commented:
> Thanks! I've added some quick training results on GLUE tasks to the description […] I've verified it works with torch distributed and HF Accelerate via the Trainer class […]

@calpt Thanks! It's great to see this approach works for different kinds of parallel training!

I was looking again into the orthogonal matrix initialization (i.e., I referenced the PEFT repo ticket and asked whether we should remove a redundant init), and I found that in some cases removing that init step might cause unstable results. Have you looked into this again by running some tests on your side? Thanks.

@calpt (Member, Author) commented on Jun 27, 2024

> I was looking again into the orthogonal matrix initialization […] I found that in some cases removing that init step might cause unstable results. Have you looked into this again by running some tests on your side?

Interesting, I haven't tested this specifically. From looking at the code, it makes sense that the orthogonal init is redundant; would you suggest we re-add it anyway?
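
For context on why the explicit init looks redundant: assuming the low-rank projection is wrapped with PyTorch's orthogonal parametrization (an assumption based on this discussion, not a quote of the PR code), the effective weight is recomputed to be (semi-)orthogonal from the underlying parameter on every access, regardless of how that parameter was initialized. A minimal sketch:

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

# Rank-16 projection over a hidden size of 768 (illustrative sizes).
proj = orthogonal(nn.Linear(768, 16, bias=False))

# The parametrization yields orthonormal rows no matter how the underlying
# parameter was initialized, which is why an extra orthogonal init is redundant.
W = proj.weight  # shape (16, 768)
print(torch.allclose(W @ W.T, torch.eye(16), atol=1e-4))  # True

Whether removing the explicit init nonetheless affects training stability, as observed above, is a separate empirical question.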

@calpt merged commit d8c991f into adapter-hub:main on Jul 1, 2024 (4 checks passed).
@calpt deleted the dev/reft branch on July 1, 2024.