
Let Huggingface Properly Initialize Arguments, and Fix FSDP-LORA Checkpoint-Saves and Resumption #53

Merged — 6 commits merged into foundation-model-stack:main from the fix/arguments branch on Mar 9, 2024

Conversation

fabianlim (Collaborator) commented Feb 23, 2024

@raghukiran1224 @Ssukriti @anhuong I'm suggesting two fixes here:

  1. We observe failures with newer versions of transformers because of the newly added xla_fsdp_v2 flag here. The current strategy is to manually patch missing flags, because transformers.TrainingArguments.__post_init__ is never called, but that means the manual patches have to be updated whenever the HF code changes, which is not ideal. The Hugging Face trainer will properly initialize everything based on TrainingArguments (including gradient checkpointing), so no manual patching is needed (see the sketch after this list).
  2. The PeftSavingCallback is no longer the ideal patch now that "Support saving only PEFT adapter in checkpoints when using PEFT + FSDP" huggingface/transformers#28297 has been merged: FSDP now properly saves and resumes adapter checkpoints. That merged fix supersedes the PeftSavingCallback strategy, since PeftSavingCallback does not correctly handle resumption or the different state-dict saving strategies, whereas the merged fix does. The PEFT callback has also been removed in trl here.
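
A minimal sketch of what points 1 and 2 amount to in code (the class and function names below are illustrative, not the actual diff in this PR): subclass transformers.TrainingArguments so that its __post_init__ still runs, and register no PeftSavingCallback, relying instead on the trainer's built-in FSDP + PEFT checkpoint handling. Any newly added HF flag (such as xla_fsdp_v2) is then initialized by __post_init__ rather than patched by hand.

```python
from dataclasses import dataclass

from transformers import Trainer, TrainingArguments


@dataclass
class SFTTrainingArguments(TrainingArguments):
    # Hypothetical subclass: tuning-specific fields would be added here.
    # Because this is a dataclass subclass, TrainingArguments.__post_init__
    # still runs, so gradient checkpointing, FSDP defaults, and any newly
    # added HF flags are populated without manual patching.
    pass


def run_training(model, train_dataset, output_dir: str, resume: bool = False):
    args = SFTTrainingArguments(
        output_dir=output_dir,
        gradient_checkpointing=True,
    )
    # No PeftSavingCallback is registered: with the merged transformers fix
    # (huggingface/transformers#28297), FSDP + PEFT runs save only the
    # adapter in checkpoints and can resume from them directly.
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train(resume_from_checkpoint=resume)
    return trainer
```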

fabianlim (Collaborator, Author) commented Feb 29, 2024

I noticed that any changes made here might also need to be reflected in https://github.ibm.com/ai-foundation/sft-trainer-image/blob/main/launch_training.py.

Review comment on tuning/sft_trainer.py (outdated, resolved)
Ssukriti (Collaborator) left a comment

  1. Left some questions and comments.
  2. The DCO check is failing.
  3. Now that we have linted and formatted our files in the main branch, the PR will need to be rebased on the latest main. You need to update your fork to the latest main and then update this branch with git merge main.
     Sorry for the trouble, but the major upstream changes for linting and formatting are done now, so it's a one-time pain :)

Thank you for the contribution

Ssukriti (Collaborator) commented Mar 7, 2024

Also, due to the lack of unit tests at the moment, please confirm that prompt tuning still works in a single-GPU environment with this branch. I know this PR has been verified using multiple GPUs. Has it also been verified in a single-GPU environment?

Commits pushed to the branch, each signed off by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim (Collaborator, Author) commented Mar 8, 2024

@Ssukriti I have rebased the changes and linted.

> Also, due to the lack of unit tests at the moment, please confirm that prompt tuning still works in a single-GPU environment with this branch. I know this PR has been verified using multiple GPUs. Has it also been verified in a single-GPU environment?

Yes, and I have verified that it works for:

  • single GPU
  • multi GPU
  • prompt tuning

Ssukriti previously approved these changes on Mar 8, 2024
Ssukriti (Collaborator) left a comment

Thanks a lot @fabianlim for thoroughly testing your PR!

Review comment on tuning/config/configs.py (outdated, resolved)
Ssukriti (Collaborator) commented Mar 8, 2024

Can't merge because of failing pylint, which seems to be a valid warning. Since this is your first commit, Fabian, I have to manually start the workflow checks (pylint etc.), which is why you didn't catch it earlier. Once this PR is merged and your fork is recognized as a contributor, the pylint checks will run automatically on your PRs going forward and you won't have to wait for an admin to start them :)

To run pylint locally on your machine you can run tox -e lint, as documented here:
https://github.com/foundation-model-stack/fms-hf-tuning/pull/84/files

Our contributing guides should be merged soon.

I have approved the PR and will merge as soon as pylint checks pass

Commit pushed to the branch, signed off by: Yu Chin Fabian Lim <flim@sg.ibm.com>
fabianlim (Collaborator, Author) commented

@Ssukriti I have added one commit on top of your merge above (#84) that removes the extraneous __post_init__. I have also run pylint (which passes). Can you help trigger the approval workflows?

Ssukriti (Collaborator) left a comment

thank you!!!

Ssukriti merged commit 0729820 into foundation-model-stack:main on Mar 9, 2024 (3 checks passed).
Ssukriti mentioned this pull request on Mar 9, 2024.
fabianlim deleted the fix/arguments branch on March 9, 2024 at 02:44.

jbusche pushed a commit to jbusche/fms-hf-tuning that referenced this pull request on Mar 25, 2024:
Let Huggingface Properly Initialize Arguments, and Fix FSDP-LORA Checkpoint-Saves and Resumption (foundation-model-stack#53)

* training args should call post init to initialize all HF flags

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* remove run_distribtued flag and peft_saving callback

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* revert deletion of validation checks on some train args

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* revert the addition of __post_init__ as it is actually not needed

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

---------

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com>

anhuong pushed a commit to anhuong/fms-hf-tuning that referenced this pull request on Apr 3, 2024 (same commit message as above).