
Resuming training from a checkpoint: how to set resume_from_checkpoint #52

Closed
simonqian opened this issue Apr 9, 2023 · 4 comments


@simonqian

A beginner question about resuming training from a checkpoint: when restarting training, which directory should resume_from_checkpoint be set to?

My current finetune script is:

DATA_PATH="./sample/merge.json" #"../dataset/instruction/guanaco_non_chat_mini_52K-utf8.json" #"./sample/merge_sample.json"
OUTPUT_PATH="my-lora-Vicuna"
MODEL_PATH="../llama-13b-hf/"
lora_checkpoint="../Chinese-Vicuna-lora-13b-belle-and-guanaco/"
TEST_SIZE=2000

python finetune.py \
--data_path $DATA_PATH \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH \
--eval_steps 200 \
--save_steps 200 \
--test_size $TEST_SIZE

Training currently needs about 240 hours.
Suppose I stop training now, and the output directory OUTPUT_PATH="my-lora-Vicuna" looks like this:

my-lora-Vicuna/
├── checkpoint-200
│   ├── optimizer.pt
│   ├── pytorch_model.bin
│   ├── rng_state.pth
│   ├── scaler.pt
│   ├── scheduler.pt
│   ├── trainer_state.json
│   └── training_args.bin
└── checkpoint-400
    ├── optimizer.pt
    ├── pytorch_model.bin
    ├── rng_state.pth
    ├── scaler.pt
    ├── scheduler.pt
    ├── trainer_state.json
    └── training_args.bin

2 directories, 14 files

If I want to resume training, should the resume_from_checkpoint argument be set to my-lora-Vicuna/checkpoint-400?

@Facico
Owner

Facico commented Apr 9, 2023

Yes, just set it to the last saved checkpoint. You can refer to the settings in our finetune_continue.sh.
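
For reference, a minimal sketch of what this resumption looks like at the transformers Trainer level is below. The names model, train_data, and val_data are placeholders for the LLaMA + LoRA objects that finetune.py actually builds; only the resume_from_checkpoint argument is the point of the example.

from transformers import Trainer, TrainingArguments

# Placeholder setup: `model`, `train_data`, and `val_data` stand in for the
# objects constructed in finetune.py; the arguments mirror the script above.
training_args = TrainingArguments(
    output_dir="my-lora-Vicuna",
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

# Resuming from an explicit checkpoint directory restores the optimizer,
# LR scheduler, RNG state, and global step from optimizer.pt, scheduler.pt,
# rng_state.pth, and trainer_state.json saved there.
trainer.train(resume_from_checkpoint="my-lora-Vicuna/checkpoint-400")

# Passing True instead of a path makes the Trainer pick the newest
# checkpoint-* directory under output_dir automatically:
# trainer.train(resume_from_checkpoint=True)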

@simonqian
Author

OK, thank you.

@1530426574

Why do the saved checkpoints contain files such as optimizer.pt when the weights are loaded in int8 precision, but not when the model is loaded in 16-bit precision?
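
As general background rather than a repo-specific answer: optimizer.pt, scheduler.pt, rng_state.pth, and scaler.pt are written by the Trainer's own periodic checkpointing, while a code path that saves only the LoRA adapter via model.save_pretrained produces none of them. Whether the int8 and 16-bit branches of finetune.py actually diverge this way is an assumption, not something confirmed in this thread. Continuing the placeholder names from the sketch above:

# Trainer-managed checkpoints (every `save_steps` steps) write the full
# resumable state under output_dir/checkpoint-N/: model weights plus
# optimizer.pt, scheduler.pt, rng_state.pth, scaler.pt, trainer_state.json,
# and training_args.bin.
trainer.train()

# A manual save of a PEFT/LoRA-wrapped model keeps only the adapter weights
# and config (adapter_model.bin, adapter_config.json); there is no optimizer
# state here to resume from.
model.save_pretrained("my-lora-Vicuna")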

