
Saving pytorch_model.bin with QLORA #123

Closed
grimulkan opened this issue Nov 3, 2023 · 7 comments

Comments

@grimulkan

At least for me, the per-epoch/step saving function of transformers trainer only saves the intermediate adapter_model.bin, but this does not include the trainable embed and norm layers. Is there some other strategy to get those layers (or force pytorch_model.bin to be saved)? Are the embed/norm layers also being trained with QLORA?

@gianlucamacri
Contributor

hi @grimulkan , kind of a late reply, but I've been facing the same issue when not using deepspeed, so I'm sharing what I did in case it helps you or anyone else who reads this. Specifically, I did two things:

  • use 8-bit quantization instead of 4-bit, because saving 4-bit quantized models is not yet supported in an official release of bitsandbytes (they are working on it, so this may become unnecessary in the future);
  • add model.base_model.save_pretrained(training_args.output_dir) after training. This saves a pytorch_model.bin for the base model that includes the embed and norm layer weights as updated by training.

If you or the repo original authors found another method, let me know 😄

@grimulkan
Copy link
Author

grimulkan commented Nov 15, 2023

Thanks. I should have posted what I did earlier as well. Here is how I addressed it:

import os

import torch
from transformers import (
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
)
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR


class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        # The PEFT checkpoint only contains the adapter weights, so the
        # trainable (non-quantized) embed and norm layers are saved separately.
        modules_to_save = ["embed", "norm"]
        state_dict = kwargs["model"].state_dict()
        to_save = {}
        for key, value in state_dict.items():
            if any(module_name in key for module_name in modules_to_save):
                # Strip the PEFT wrapper prefix so the keys match the base model.
                to_save[key.replace("base_model.model.", "")] = value
        torch.save(to_save, os.path.join(checkpoint_folder, "trainable_params.bin"))

        return control
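To illustrate what the callback keeps, here is a quick check of the filter against some representative (hypothetical, Llama-style) parameter names; only keys containing "embed" or "norm" survive, with the PEFT wrapper prefix stripped:

```python
# Hypothetical parameter names from a PEFT-wrapped Llama-style model.
modules_to_save = ["embed", "norm"]

state_dict_keys = [
    "base_model.model.model.embed_tokens.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.weight",
    "base_model.model.model.layers.0.input_layernorm.weight",
    "base_model.model.model.norm.weight",
]

# Same filter-and-rename logic as the callback, on key names only.
to_save = [
    key.replace("base_model.model.", "")
    for key in state_dict_keys
    if any(module_name in key for module_name in modules_to_save)
]
print(to_save)
# ['model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.norm.weight']
```

Note that the attention projection weight is dropped: those layers are covered by the LoRA adapter, not by this extra file.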

and then when creating the Trainer

        trainer = Trainer(
            ...
            callbacks=[SavePeftModelCallback],
        )

This saves the trainable parameters at intermediate steps according to the save strategy specified, the same as the LoRA adapter weights.
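At inference time the saved file can be merged back over the base model's weights, e.g. with load_state_dict(torch.load("trainable_params.bin"), strict=False). The merge semantics can be sketched with plain dicts standing in for tensors (all names here are illustrative, not from the repo):

```python
# Dict-based sketch of restoring the saved embed/norm weights over the base
# model's state dict, which is what load_state_dict(..., strict=False) does
# for the matching keys. String values stand in for tensors.
base_state_dict = {
    "model.embed_tokens.weight": "pretrained-embed",
    "model.layers.0.self_attn.q_proj.weight": "pretrained-q",
    "model.norm.weight": "pretrained-norm",
}

trainable_params = {  # contents of trainable_params.bin
    "model.embed_tokens.weight": "trained-embed",
    "model.norm.weight": "trained-norm",
}

restored = {**base_state_dict, **trainable_params}
print(restored["model.embed_tokens.weight"])             # trained-embed
print(restored["model.layers.0.self_attn.q_proj.weight"])  # pretrained-q
```

Keys not present in trainable_params.bin (everything the LoRA adapter covers) are left untouched, which is why strict=False is needed on the real call.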

@gianlucamacri
Contributor

grimulkan

Great solution! With a minor tweak to generalize the modules to save, I think this should be merged into the sftq script @yukang2017, since it's a rather annoying issue that one only discovers once training has completed.

@yukang2017
Member

Thanks for your great contribution. I will mention this in the README.md.

@grimulkan Would you mind providing a PR to fix this? I will merge it into the main branch.

@grimulkan
Author

Will do

@RonanKMcGovern

@grimulkan , is the reason this works that the norm and embed layers are not quantized?

If they were quantized, I assume saving would run into issues, since saving 4-bit models is not supported.

@grimulkan
Author

Yes, they are being trained, so they are not quantized; and since they are small, you don't need LoRA/PEFT for them.

This also reminds me to actually submit this PR. I somehow forgot, so thanks for the reminder!
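To give a sense of scale: with illustrative Llama-2-7B-style dimensions (assumed here, not stated in the thread: 32,000-token vocabulary, hidden size 4096, 32 layers), the embed and norm parameters are only a few percent of the model, so keeping them unquantized and fully trainable is cheap:

```python
# Illustrative Llama-2-7B-style dimensions (assumptions, not from this thread).
vocab_size = 32_000
hidden_size = 4_096
num_layers = 32

# Input embedding matrix; an untied output head would be the same size again.
embed_params = vocab_size * hidden_size           # 131,072,000
# Two RMSNorm weight vectors per decoder layer, plus the final norm.
norm_params = (2 * num_layers + 1) * hidden_size  # 266,240

trainable_extra = 2 * embed_params + norm_params
print(f"{trainable_extra:,}")  # 262,410,240 — roughly 4% of a ~6.7B-param model
```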
