
Resuming training from a checkpoint: how to set resume_from_checkpoint #52

Closed
simonqian opened this issue Apr 9, 2023 · 4 comments


@simonqian

A beginner question about resuming training from a checkpoint: when restarting training, which directory should resume_from_checkpoint be set to?

My current finetune script is:

DATA_PATH="./sample/merge.json" #"../dataset/instruction/guanaco_non_chat_mini_52K-utf8.json" #"./sample/merge_sample.json"
OUTPUT_PATH="my-lora-Vicuna"
MODEL_PATH="../llama-13b-hf/"
lora_checkpoint="../Chinese-Vicuna-lora-13b-belle-and-guanaco/"
TEST_SIZE=2000

python finetune.py \
--data_path $DATA_PATH \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH \
--eval_steps 200 \
--save_steps 200 \
--test_size $TEST_SIZE

Training currently needs about 240 hours.
Suppose I stop training now, and the output directory OUTPUT_PATH="my-lora-Vicuna" looks like this:

my-lora-Vicuna/
├── checkpoint-200
│   ├── optimizer.pt
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
└── checkpoint-400
    ├── optimizer.pt
    ├── pytorch_model.bin
    ├── rng_state.pth
    ├── scaler.pt
    ├── scheduler.pt
    ├── trainer_state.json
    └── training_args.bin

2 directories, 14 files

If I want to resume training, should the resume_from_checkpoint argument be set to my-lora-Vicuna/checkpoint-400?

@Facico
Owner

Facico commented Apr 9, 2023

Yes, just set it to the last saved checkpoint. You can refer to the settings in our finetune_continue.sh.
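
For reference, a minimal sketch of what this resumption looks like at the transformers Trainer level is below. The names model, train_data, and val_data are placeholders for the LLaMA + LoRA objects that finetune.py actually builds; only the resume_from_checkpoint argument is the point of the example.

from transformers import Trainer, TrainingArguments

# Placeholder setup: `model`, `train_data`, and `val_data` stand in for the
# objects constructed in finetune.py; the arguments mirror the script above.
training_args = TrainingArguments(
    output_dir="my-lora-Vicuna",
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# Resuming from an explicit checkpoint directory restores the optimizer,
# LR scheduler, RNG state, and global step from optimizer.pt, scheduler.pt,
# rng_state.pth, and trainer_state.json saved there.
trainer.train(resume_from_checkpoint="my-lora-Vicuna/checkpoint-400")

# Passing True instead of a path makes the Trainer pick the newest
# checkpoint-* directory under output_dir automatically:
# trainer.train(resume_from_checkpoint=True)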

@simonqian
Author

OK, thank you.

@1530426574

Why do the saved checkpoints contain files such as optimizer.pt when the weights are loaded in int8 precision, but not when the model is loaded in 16-bit precision?
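
As general background rather than a repo-specific answer: optimizer.pt, scheduler.pt, rng_state.pth, and scaler.pt are written by the Trainer's own periodic checkpointing, while a code path that saves only the LoRA adapter via model.save_pretrained produces none of them. Whether the int8 and 16-bit branches of finetune.py actually diverge this way is an assumption, not something confirmed in this thread. Continuing the placeholder names from the sketch above:

# Trainer-managed checkpoints (every `save_steps` steps) write the full
# resumable state under output_dir/checkpoint-N/: model weights plus
# optimizer.pt, scheduler.pt, rng_state.pth, scaler.pt, trainer_state.json,
# and training_args.bin.
trainer.train()

# A manual save of a PEFT/LoRA-wrapped model keeps only the adapter weights
# and config (adapter_model.bin, adapter_config.json); there is no optimizer
# state here to resume from.
model.save_pretrained("my-lora-Vicuna")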

