Use --resume, error: FileNotFoundError: [Errno 2] No such file or directory: '.cfg' #378

Closed
Joker9194 opened this issue Dec 7, 2021 · 13 comments


@Joker9194

I used the command: python train.py --device 0 --batch-size 4 --img 640 640 --cfg cfg/yolov4-pacsp-x-mish.cfg --data data/mydata.yaml --weight weights/yolov4-csp-x-mish.weights --name yolov4-pacsp-x-mish --resume, but got the error: FileNotFoundError: [Errno 2] No such file or directory: '.cfg'

So what happened? Is my command wrong?

@Joker9194 (Author)

Looking at the code, why is opt.cfg set to ''?
if opt.resume:  # resume an interrupted run
    ckpt = opt.resume if isinstance(opt.resume, str) else get_latest_run()  # specified or most recent path
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
    opt.cfg, opt.weights, opt.resume = '', ckpt, True
    logger.info('Resuming training from %s' % ckpt)
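
For what it's worth, the empty string also explains the odd '.cfg' filename in the error: the Darknet cfg parser appends the suffix when it is missing, so '' becomes the literal path '.cfg'. A minimal sketch of that behavior (reconstructed, not copied verbatim from the repo's parse_config code):

def parse_model_cfg(path):
    # sketch: '' does not end with '.cfg', so the parser turns it into the
    # literal filename '.cfg' and then fails to open it
    if not path.endswith('.cfg'):
        path += '.cfg'
    with open(path, 'r') as f:  # FileNotFoundError: [Errno 2] No such file or directory: '.cfg'
        return f.read()

parse_model_cfg('')  # reproduces the error from the title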

@Grutschus commented Dec 14, 2021

I ran into the same issue.

It appears that the config file is needed regardless of whether a checkpoint is loaded:

pretrained = weights.endswith('.pt')
if pretrained:
    with torch_distributed_zero_first(rank):
        attempt_download(weights)  # download if not found locally
    ckpt = torch.load(weights, map_location=device)  # load checkpoint
    model = Darknet(opt.cfg).to(device)  # create  <-- needs opt.cfg even when loading a checkpoint
    state_dict = {
        k: v
        for k, v in ckpt['model'].items()
        if model.state_dict()[k].numel() == v.numel()
    }
    model.load_state_dict(state_dict, strict=False)
    print('Transferred %g/%g items from %s' %
          (len(state_dict), len(model.state_dict()), weights))  # report
else:
    model = Darknet(opt.cfg).to(device)  # create

To me it seems like the plan was originally to save (and serialize) the config together with the model here:

        save = (not opt.nosave) or (final_epoch and not opt.evolve)
        if save:
            with open(results_file, 'r') as f:  # create checkpoint
                ckpt = {'epoch': epoch,
                        'best_fitness': best_fitness,
                        'best_fitness_p': best_fitness_p,
                        'best_fitness_r': best_fitness_r,
                        'best_fitness_ap50': best_fitness_ap50,
                        'best_fitness_ap': best_fitness_ap,
                        'best_fitness_f': best_fitness_f,
                        'training_results': f.read(),
                        'model': ema.ema.module.state_dict() if hasattr(ema, 'module') else ema.ema.state_dict(),
                        'optimizer': None if final_epoch else optimizer.state_dict(),
                        'wandb_id': wandb_run.id if wandb else None}

This is similar to what has been done here

However, the config is not saved in the ckpt at the moment, which is why the *.cfg file is needed. I feel like saving it there would be a good idea, though.
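
If one wanted to follow that idea, a minimal sketch could be the following (the 'cfg' key here is hypothetical; it is not part of the current checkpoint format):

# when saving: carry the cfg path along in the checkpoint dict (hypothetical key)
ckpt['cfg'] = opt.cfg

# when resuming: restore it before building the model
ckpt = torch.load(weights, map_location=device)
if not opt.cfg and ckpt.get('cfg'):
    opt.cfg = ckpt['cfg']  # fall back to the cfg path stored in the checkpoint
model = Darknet(opt.cfg).to(device)

One could also store the full text of the cfg file instead of the path, which would make checkpoints self-contained across machines.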

TLDR:
Config file is needed. Changing the commented line to the one below fixed the issue for me.

    # opt.cfg, opt.weights, opt.resume = '', ckpt, True
    opt.weights, opt.resume = ckpt, True

@Joker9194 (Author) commented Dec 20, 2021

Continue training with the following code? I will try it.
# opt.cfg, opt.weights, opt.resume = '', ckpt, True
opt.weights, opt.resume = ckpt, True

@Grutschus

Continue training with the following code? I will try it. # opt.cfg, opt.weights, opt.resume = '', ckpt, True opt.weights, opt.resume = ckpt, True

At least that worked for me. Be aware, though, that once you've tried to continue training without that change to the code, the script will overwrite the run's opt.yaml file, so the cfg entry in that .yaml will be empty (cfg: ''). You might have to fix that too...
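
For reference, a one-off repair of the overwritten file could look like this (the run path and cfg name below are examples taken from this thread; adjust them to your own run):

import yaml

opt_path = 'runs/train/yolov4-pacsp-x-mish/opt.yaml'  # example run directory
with open(opt_path) as f:
    opt = yaml.safe_load(f)

opt['cfg'] = 'cfg/yolov4-pacsp-x-mish.cfg'  # restore the emptied entry

with open(opt_path, 'w') as f:
    yaml.dump(opt, f, sort_keys=False)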

@Joker9194 (Author)

At least that worked for me. Be aware, though, that once you've tried to continue training without that change to the code, the script will overwrite the run's opt.yaml file, so the cfg entry in that .yaml will be empty (cfg: ''). You might have to fix that too...

ok, thx.

@YoungjaeDev

with open(save_dir / 'opt.yaml', 'w') as f:
    yaml.dump(vars(opt), f, sort_keys=False)

It doesn't look like the dumped yaml is being written here.

@Joker9194 (Author)

with open(save_dir / 'opt.yaml', 'w') as f:
    yaml.dump(vars(opt), f, sort_keys=False)

It doesn't look like the dumped yaml is being written here.

Where is that code?

@YoungjaeDev

@Joker9194

PyTorch_YOLOv4/train.py

Lines 57 to 60 in eb5f166

with open(save_dir / 'hyp.yaml', 'w') as f:
    yaml.dump(hyp, f, sort_keys=False)
with open(save_dir / 'opt.yaml', 'w') as f:
    yaml.dump(vars(opt), f, sort_keys=False)

@Joker9194 (Author)

@Joker9194

PyTorch_YOLOv4/train.py

Lines 57 to 60 in eb5f166

with open(save_dir / 'hyp.yaml', 'w') as f:
    yaml.dump(hyp, f, sort_keys=False)
with open(save_dir / 'opt.yaml', 'w') as f:
    yaml.dump(vars(opt), f, sort_keys=False)

I checked runs/train/<name>/: the opt.yaml is saved during training, but it is not loaded when using --resume.

@YoungjaeDev

Continue training with the following code? I will try it. # opt.cfg, opt.weights, opt.resume = '', ckpt, True opt.weights, opt.resume = ckpt, True

@Joker9194

@Joker9194 (Author)

Continue training with the following code? I will try it. # opt.cfg, opt.weights, opt.resume = '', ckpt, True opt.weights, opt.resume = ckpt, True

@Joker9194

It did not work for me.

@Joker9194 (Author)

I ran into the same issue. [...] TLDR: Config file is needed. Changing the commented line to the one below fixed the issue for me. # opt.cfg, opt.weights, opt.resume = '', ckpt, True opt.weights, opt.resume = ckpt, True

I checked the code, and it works for me. In the code at
https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/eb5f1663ed0743660b8aa749a43f35f505baa325/train.py#L500-501
the new opt replaces the old opt, and with this change opt.cfg always exists. I think the original line is a bug?

@thinktu2

This problem is caused by the cfg field in opt.yaml not being loaded correctly, I think.
To resolve it, pass the --cfg /path/to/cfg param in the shell and modify the code in train.py from line 497:

# Resume
if opt.resume:  # resume an interrupted run
    ckpt = opt.resume if isinstance(opt.resume, str) else get_latest_run()  # specified or most recent path
    assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'
    cfg = opt.cfg if opt.cfg is not None else ''  ###################ADD#######################
    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
    opt.cfg, opt.weights, opt.resume = cfg, ckpt, True ###################CHANGE#######################
    logger.info('Resuming training from %s' % ckpt)

It works for me, anyway.
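
Assuming --resume also accepts an explicit checkpoint path (which the isinstance(opt.resume, str) check above suggests), resuming would then look something like this; the paths below are examples, adjust them to your run:

python train.py --resume --cfg cfg/yolov4-pacsp-x-mish.cfg
# or, pointing at a specific checkpoint:
python train.py --resume runs/train/yolov4-pacsp-x-mish/weights/last.pt --cfg cfg/yolov4-pacsp-x-mish.cfg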
