
model checkpoints are broken when doing full parameter tuning #23

Closed
lchu-ibm opened this issue Jan 26, 2024 · 1 comment

lchu-ibm commented Jan 26, 2024

Currently, when doing full-parameter tuning (`peft_config=None`), the model can be fine-tuned and a checkpoint can be saved; however, the checkpoint cannot be loaded back.

The root cause was identified in offline discussions. Specifically:

  1. PeftSavingCallback is not necessary when saving a full-parameter-tuning model. This callback is designed to separately save an adapter-only copy; the rationale is that Trainer by itself saves everything in the root folder, but for PEFT we want a clean adapter-only folder, so some files are duplicated into a separate place.
  2. There is a corner-case bug when saving with safetensors via `save_pretrained` directly: it drops some shared tensors (e.g. `lm_head`) during saving. The better approach is `save_pretrained(..., state_dict=state_dict)`, explicitly passing the full state dict, the same way Trainer natively saves it.

So, due to 2, the checkpoint is missing some tensors and cannot be loaded back; and due to 1, this callback-generated broken checkpoint is either used directly or overwrites the original checkpoint.

The solution is to revise the callback so it does nothing during full-parameter tuning.
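The proposed fix could look like the following sketch. The class name matches the callback discussed in this issue; the hook signature loosely follows the transformers `TrainerCallback` convention, and the adapter-saving body is a placeholder:

```python
class PeftSavingCallback:
    """Sketch: save an adapter-only copy for PEFT runs, and do nothing
    at all for full-parameter tuning (peft_config is None)."""

    def on_save(self, args, state, control, model=None, **kwargs):
        # Full-parameter tuning: Trainer already writes a complete
        # checkpoint, so writing another (broken) copy here would
        # shadow or overwrite it. Skip entirely.
        if model is None or getattr(model, "peft_config", None) is None:
            return control
        # PEFT run: save the adapter weights into a clean subfolder.
        # (Illustrative path and logic; the real callback's saving
        # code would go here.)
        model.save_pretrained(f"{args.output_dir}/adapter")
        return control
```

The key design point is the early return: the callback only does extra work when a PEFT adapter is actually attached to the model.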

@lchu-ibm lchu-ibm self-assigned this Jan 26, 2024
@lchu-ibm lchu-ibm added the bug Something isn't working label Jan 26, 2024
fabianlim (Collaborator) commented:
@lchu-ibm PeftSavingCallback is not needed anymore after this fix: huggingface/transformers#28297

I have made PR #53 to remove it.
