
Let Huggingface Properly Initialize Arguments, and Fix FSDP-LORA Checkpoint-Saves and Resumption #53

Merged — 6 commits merged into foundation-model-stack:main from the fix/arguments branch on Mar 9, 2024

Conversation

fabianlim (Collaborator) commented Feb 23, 2024

@raghukiran1224 @Ssukriti @anhuong I'm suggesting two fixes here:

  1. We observe failures with newer versions of transformers because of the newly added xla_fsdp_v2 flag here. The current strategy is to manually patch missing flags, because transformers.TrainingArguments.__post_init__ is never called, but that means the manual patches have to be updated whenever the HF code changes, which is not ideal. The Hugging Face trainer will properly initialize everything based on TrainingArguments (including gradient checkpointing), so no manual patching is needed (see the sketch after this list).
  2. The PeftSavingCallback is no longer the ideal patch now that "Support saving only PEFT adapter in checkpoints when using PEFT + FSDP" huggingface/transformers#28297 has been merged: FSDP now properly saves and resumes adapter checkpoints. That merged fix supersedes the PeftSavingCallback strategy, since PeftSavingCallback does not correctly handle resumption or the different state-dict saving strategies, whereas the merged fix does. The PEFT callback has also been removed in trl here.
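
A minimal sketch of what points 1 and 2 amount to in code (the class and function names below are illustrative, not the actual diff in this PR): subclass transformers.TrainingArguments so that its __post_init__ still runs, and register no PeftSavingCallback, relying instead on the trainer's built-in FSDP + PEFT checkpoint handling. Any newly added HF flag (such as xla_fsdp_v2) is then initialized by __post_init__ rather than patched by hand.

```python
from dataclasses import dataclass

from transformers import Trainer, TrainingArguments


@dataclass
class SFTTrainingArguments(TrainingArguments):
    # Hypothetical subclass: tuning-specific fields would be added here.
    # Because this is a dataclass subclass, TrainingArguments.__post_init__
    # still runs, so gradient checkpointing, FSDP defaults, and any newly
    # added HF flags are populated without manual patching.
    pass


def run_training(model, train_dataset, output_dir: str, resume: bool = False):
    args = SFTTrainingArguments(
        output_dir=output_dir,
        gradient_checkpointing=True,
    )
    # No PeftSavingCallback is registered: with the merged transformers fix
    # (huggingface/transformers#28297), FSDP + PEFT runs save only the
    # adapter in checkpoints and can resume from them directly.
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train(resume_from_checkpoint=resume)
    return trainer
```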

fabianlim (Collaborator, Author) commented Feb 29, 2024

I noticed that any changes made here might also need to be reflected in https://github.ibm.com/ai-foundation/sft-trainer-image/blob/main/launch_training.py.

Review comment on tuning/sft_trainer.py (outdated, resolved)
Ssukriti (Collaborator) left a comment

  1. Left some questions and comments.
  2. The DCO check is failing.
  3. Now that we have linted and formatted our files in the main branch, the PR will need to be rebased on the latest main. You need to update your fork to the latest main and then update this branch with git merge main.
     Sorry for the trouble, but the major upstream changes for linting and formatting are done now, so it's a one-time pain :)

Thank you for the contribution

Ssukriti (Collaborator) commented Mar 7, 2024

Also, due to the lack of unit tests at the moment, please confirm that prompt tuning still works in a single-GPU environment with this branch. I know this PR has been verified using multiple GPUs. Has it also been verified in a single-GPU environment?

Commits pushed to the branch, each signed off by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim (Collaborator, Author) commented Mar 8, 2024

@Ssukriti I have rebased the changes and linted.

> Also, due to the lack of unit tests at the moment, please confirm that prompt tuning still works in a single-GPU environment with this branch. I know this PR has been verified using multiple GPUs. Has it also been verified in a single-GPU environment?

Yes, and I have verified that it works for:

  • single GPU
  • multi GPU
  • prompt tuning

Ssukriti previously approved these changes on Mar 8, 2024
Ssukriti (Collaborator) left a comment

Thanks a lot @fabianlim for thoroughly testing your PR!

Review comment on tuning/config/configs.py (outdated, resolved)
Ssukriti (Collaborator) commented Mar 8, 2024

Can't merge because of failing pylint, which seems to be a valid warning. Since this is your first commit, Fabian, I have to manually start the workflow checks (pylint etc.), which is why you didn't catch it earlier. Once this PR is merged and your fork is recognized as a contributor, the pylint checks will run automatically on your PRs going forward and you won't have to wait for an admin to start them :)

To run pylint locally on your machine you can run tox -e lint, as documented here:
https://github.com/foundation-model-stack/fms-hf-tuning/pull/84/files

Our contributing guides should be merged soon.

I have approved the PR and will merge as soon as pylint checks pass

Commit pushed to the branch, signed off by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim (Collaborator, Author) commented

@Ssukriti I have added one commit on top of your merge above (#84) that removes the extraneous __post_init__. I have also run pylint (which passes). Can you help trigger the approval workflows?

Ssukriti (Collaborator) left a comment

thank you!!!

Ssukriti merged commit 0729820 into foundation-model-stack:main on Mar 9, 2024 (3 checks passed).
Ssukriti mentioned this pull request on Mar 9, 2024.
fabianlim deleted the fix/arguments branch on March 9, 2024 at 02:44.

jbusche pushed a commit to jbusche/fms-hf-tuning that referenced this pull request on Mar 25, 2024:
Let Huggingface Properly Initialize Arguments, and Fix FSDP-LORA Checkpoint-Saves and Resumption (foundation-model-stack#53)

* training args should call post init to initialize all HF flags

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* remove run_distribtued flag and peft_saving callback

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* revert deletion of validation checks on some train args

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* revert the addition of __post_init__ as it is actually not needed

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

---------

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com>

anhuong pushed a commit to anhuong/fms-hf-tuning that referenced this pull request on Apr 3, 2024 (same commit message as above).