Resume from checkpoint #49

Closed
darshangera opened this issue Oct 6, 2018 · 3 comments
Comments

@darshangera

Sairam.
Hi, I trained LightCNN-29 v2 on the MS-Celeb-1M dataset for 14 epochs. The learning rate was reduced from 0.001 to 0.0004575 with a step size of 10. Validation accuracy improved from 86 to 95.95 and the average loss dropped from 11 to 0.28 over the 14 epochs.
Now, when I resume from the saved model, it starts well, as shown below:
Test set: Average loss: 0.28508767582333855, Accuracy: (95.95295308032168)
But loss is decreasing and Precision is also not improving as shown below:
Epoch: [14][0/38671] Loss 0.1796 (0.1796) Prec@1 94.531 (94.531) Prec@5 97.656 (97.656)
Epoch: [14][100/38671] Loss 0.2190 (0.2752) Prec@1 93.750 (93.379) Prec@5 99.219 (98.337)
Epoch: [14][200/38671] Loss 0.4000 (0.3149) Prec@1 89.062 (92.405) Prec@5 96.875 (98.084)
......
Epoch: [14][6100/38671] Loss 0.8118 (0.5597) Prec@1 85.938 (86.939) Prec@5 92.188 (95.961)
Epoch: [14][6200/38671] Loss 0.7565 (0.5611) Prec@1 80.469 (86.915) Prec@5 93.750 (95.945)
Epoch: [14][6300/38671] Loss 0.6364 (0.5625) Prec@1 84.375 (86.893) Prec@5 96.875 (95.927)

Did you face the above issue? Could you help me figure out what the issue might be?
Thanks.
Darshan, SSSIHL.

@darshangera
Author

Sorry, typo in the above: it should read "But loss is not decreasing and precision is also not improving, as shown below."

@AlfredXiangWu
Owner

AlfredXiangWu commented Oct 9, 2018

The reason is that we don't save the optimizer state in the checkpoint. Light CNN is trained with SGD with momentum. If you resume from the saved model, the weights are loaded from the checkpoint, but the optimizer state (including the momentum buffers) is not.
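
For intuition, here is a minimal, self-contained sketch (a toy torch.nn.Linear model, not this repo's training script) showing that SGD's momentum buffers live in optimizer.state_dict() and therefore disappear if only the model weights are checkpointed:

import torch

# Minimal sketch (toy model, not from this repo): after one SGD step with
# momentum, the optimizer state holds a 'momentum_buffer' for each parameter.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# 'state' maps each parameter to its momentum buffer; this is exactly
# what is lost when only model.state_dict() is saved.
print(optimizer.state_dict()['state'])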

You can modify these lines so that the optimizer state is saved as well:

save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),  # save the optimizer state alongside the weights
    'prec1': prec1,
}, save_name)

And you also need to modify the lines that load the checkpoint:

if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])  # restore the optimizer state (momentum buffers)
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))

@darshangera
Author

Thanks a lot. Sairam
