
model checkpoints are broken when doing full parameter tuning #23

Closed
lchu-ibm opened this issue Jan 26, 2024 · 1 comment

lchu-ibm commented Jan 26, 2024

Currently, when doing full-parameter tuning (`peft_config=None`), the model can be fine-tuned and a checkpoint can be saved; however, the checkpoint cannot be loaded back.

The root cause was identified in offline discussions. Specifically:

  1. PeftSavingCallback is not necessary when saving a full-parameter-tuning model. This callback is designed to separately save an adapter-only copy; the rationale is that Trainer by itself saves everything in the root folder, but for PEFT we want a clean adapter-only folder, so some files are duplicated into a separate place.
  2. There is a corner-case bug when saving with safetensors via `save_pretrained` directly: it drops some shared tensors (e.g. `lm_head`) during saving. The better approach is `save_pretrained(..., state_dict=state_dict)`, explicitly passing the full state dict, the same way Trainer natively saves it.

So, due to 2, the checkpoint is missing some tensors and cannot be loaded back; and due to 1, this callback-generated broken checkpoint is either used directly or overwrites the original checkpoint.

The solution is to revise the callback so it does nothing during full-parameter tuning.
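The proposed fix could look like the following sketch. The class name matches the callback discussed in this issue; the hook signature loosely follows the transformers `TrainerCallback` convention, and the adapter-saving body is a placeholder:

```python
class PeftSavingCallback:
    """Sketch: save an adapter-only copy for PEFT runs, and do nothing
    at all for full-parameter tuning (peft_config is None)."""

    def on_save(self, args, state, control, model=None, **kwargs):
        # Full-parameter tuning: Trainer already writes a complete
        # checkpoint, so writing another (broken) copy here would
        # shadow or overwrite it. Skip entirely.
        if model is None or getattr(model, "peft_config", None) is None:
            return control
        # PEFT run: save the adapter weights into a clean subfolder.
        # (Illustrative path and logic; the real callback's saving
        # code would go here.)
        model.save_pretrained(f"{args.output_dir}/adapter")
        return control
```

The key design point is the early return: the callback only does extra work when a PEFT adapter is actually attached to the model.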

@lchu-ibm lchu-ibm self-assigned this Jan 26, 2024
@lchu-ibm lchu-ibm added the bug Something isn't working label Jan 26, 2024
fabianlim (Collaborator) commented:
@lchu-ibm PeftSavingCallback is not needed anymore after this fix: huggingface/transformers#28297

I have made PR #53 to remove it.
