Fine-tuning automatically deletes ep files, so the required ep files cannot be found after fine-tuning finishes #1668

Open
bird-9 opened this issue Apr 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

bird-9 commented Apr 26, 2024

🐛 Bug

Fine-tuning automatically deletes ep files, so the ep files that are needed cannot be found once fine-tuning has finished.

Code sample

Training parameters:

torchrun \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node ${gpu_num} \
../../../funasr/bin/train.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.batch_size=40000 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=8 \
++train_conf.max_epoch=100 \
++train_conf.log_interval=1 \
++train_conf.resume=false \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}

Error log:

Looking in outputs, only 20 ep files are left at the end, which leads to the "Checkpoint file not found" errors below.

[2024-04-26 07:54:21,218][root][INFO] - Update best acc: 0.1071, /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.best
[2024-04-26 07:54:21,220][root][INFO] - Delete: /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep80  (note: during training it deletes some ep files)
[2024-04-26 07:54:21,367][root][INFO] - rank: 0, time_escaped_epoch: 0.014 hours, estimated to finish 100 epoch: 0.000 hours

average_checkpoints: ['/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep0', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep1', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep2', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep3', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep4', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep5', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep6', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep7', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep8', '/diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep9']
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep0 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep1 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep2 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep3 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep4 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep5 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep6 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep7 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep8 not found.
Checkpoint file /diskb/code/dev/FunASR/examples/industrial_data_pretraining/paraformer/outputs/model.pt.ep9 not found.
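To confirm which epoch checkpoints actually survived training, the output directory can be compared against the list that the averaging step asks for. The following is a minimal diagnostic sketch, not part of FunASR: the helper names `list_epoch_checkpoints` and `missing_epochs` are hypothetical, and the only thing taken from the log above is the `model.pt.ep<N>` file naming.

```python
import re
from pathlib import Path

# File naming taken from the log above: model.pt.ep0, model.pt.ep1, ...
EP_PATTERN = re.compile(r"^model\.pt\.ep(\d+)$")


def list_epoch_checkpoints(output_dir):
    """Return {epoch: path} for every model.pt.ep<N> file in output_dir.

    Hypothetical helper for diagnosis only; it does not call FunASR.
    """
    directory = Path(output_dir)
    if not directory.is_dir():
        return {}
    found = {}
    for path in directory.iterdir():
        match = EP_PATTERN.match(path.name)
        if match and path.is_file():
            found[int(match.group(1))] = path
    # Sort by epoch number so the output is easy to read.
    return dict(sorted(found.items()))


def missing_epochs(output_dir, wanted_epochs):
    """Return the epochs from wanted_epochs that have no file on disk."""
    present = list_epoch_checkpoints(output_dir)
    return [epoch for epoch in wanted_epochs if epoch not in present]
```

Calling `missing_epochs(output_dir, range(10))` on the output directory from this report would be expected to return all of ep0 to ep9, matching the "not found" lines above, while `list_epoch_checkpoints(output_dir)` shows which files were kept.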

Expected behavior

Environment

  • OS (e.g., Linux): ubuntu
  • FunASR Version (e.g., 1.0.0): 1.0.25
  • ModelScope Version (e.g., 1.11.0):
  • PyTorch Version (e.g., 2.0.0): 2.3.0
  • How you installed funasr (pip, source): pip
  • Python version: 3.10.14
  • GPU (e.g., V100M32) 3096
  • CUDA/cuDNN version (e.g., cuda11.7):
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
  • Any other relevant information:

@bird-9 bird-9 added the bug Something isn't working label Apr 26, 2024
@chenmiaotian

I ran into this as well. The files were there originally, but they got deleted. Have you solved it on your side?
Checkpoint file ./outputs/model.pt.ep1 not found.
Checkpoint file ./outputs/model.pt.ep2 not found.
Checkpoint file ./outputs/model.pt.ep3 not found.
Checkpoint file ./outputs/model.pt.ep4 not found.
Checkpoint file ./outputs/model.pt.ep5 not found.
Checkpoint file ./outputs/model.pt.ep6 not found.
Checkpoint file ./outputs/model.pt.ep7 not found.
Checkpoint file ./outputs/model.pt.ep8 not found.
Checkpoint file ./outputs/model.pt.ep9 not found.
Checkpoint file ./outputs/model.pt.ep10 not found.
Error executing job with overrides: ['++model=iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch', '++train_data_set_list=data/train.jsonl', '++valid_data_set_list=data/val.jsonl', '++dataset_conf.batch_size=20000', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=4', '++train_conf.max_epoch=50', '++train_conf.log_interval=1', '++train_conf.resume=false', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
File "/mnt/workspace/FunASR/funasr/bin/train.py", line 250, in <module>
main_hydra()
File "/opt/conda/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/mnt/workspace/FunASR/funasr/bin/train.py", line 51, in main_hydra
main(**kwargs)
File "/mnt/workspace/FunASR/funasr/bin/train.py", line 244, in main
average_checkpoints(trainer.output_dir, trainer.avg_nbest_model)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/workspace/FunASR/funasr/train_utils/average_nbest_models.py", line 65, in average_checkpoints
raise RuntimeError("No checkpoints found for averaging.")
RuntimeError: No checkpoints found for averaging.
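The traceback shows the averaging step failing because none of the requested files exist. One possible manual workaround, sketched here under stated assumptions, is to build the list of checkpoints from what is actually on disk rather than from a fixed ep0..epN range. This is not FunASR's selection logic: the function name `select_surviving_checkpoints` is hypothetical, and choosing the highest-numbered epochs is an assumption made for illustration (the log suggests FunASR ranks checkpoints by validation accuracy instead).

```python
import re
from pathlib import Path

# File naming taken from the logs in this thread: model.pt.ep<N>
EP_PATTERN = re.compile(r"^model\.pt\.ep(\d+)$")


def select_surviving_checkpoints(output_dir, n):
    """Return up to n existing model.pt.ep<N> paths, highest epoch first.

    Hypothetical helper: it only inspects the file system and makes no
    attempt to reproduce how FunASR ranks its n-best checkpoints.
    """
    candidates = []
    for path in Path(output_dir).glob("model.pt.ep*"):
        match = EP_PATTERN.match(path.name)
        if match and path.is_file():
            candidates.append((int(match.group(1)), path))
    # Assumption for this sketch: prefer the most recent epochs.
    candidates.sort(key=lambda item: item[0], reverse=True)
    selected = [path for _, path in candidates[:n]]
    if not selected:
        raise FileNotFoundError(f"no model.pt.ep* files left in {output_dir}")
    return selected
```

The returned paths could then be loaded and averaged by hand, which sidesteps the missing-file lookup but does not address why the files were deleted in the first place.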

@LauraGPT
Collaborator

LauraGPT commented May 7, 2024

Try keeping ++train_conf.keep_nbest_models equal to ++train_conf.avg_nbest_model.

@chenmiaotian

Try keeping ++train_conf.keep_nbest_models equal to ++train_conf.avg_nbest_model.

Do the values of the three parameters below have to be the same? I tried the settings below and still get the same error as above:
max_epoch=50
keep_nbest_models=20
avg_nbest_model=20
