Resume from checkpoint #49

Closed
darshangera opened this issue Oct 6, 2018 · 3 comments
Comments

@darshangera

Sairam.
Hi, I trained LightCNN-29 v2 on the MS-Celeb-1M dataset for 14 epochs. The learning rate was reduced from 0.001 to 0.0004575 with a step size of 10. Validation accuracy improved from 86 to 95.95 and the average loss dropped from 11 to 0.28 over the 14 epochs.
Now, when I resume from the saved model, it starts well, as shown below:
Test set: Average loss: 0.28508767582333855, Accuracy: (95.95295308032168)
But loss is decreasing and Precision is also not improving as shown below:
Epoch: [14][0/38671] Loss 0.1796 (0.1796) Prec@1 94.531 (94.531) Prec@5 97.656 (97.656)
Epoch: [14][100/38671] Loss 0.2190 (0.2752) Prec@1 93.750 (93.379) Prec@5 99.219 (98.337)
Epoch: [14][200/38671] Loss 0.4000 (0.3149) Prec@1 89.062 (92.405) Prec@5 96.875 (98.084)
......
Epoch: [14][6100/38671] Loss 0.8118 (0.5597) Prec@1 85.938 (86.939) Prec@5 92.188 (95.961)
Epoch: [14][6200/38671] Loss 0.7565 (0.5611) Prec@1 80.469 (86.915) Prec@5 93.750 (95.945)
Epoch: [14][6300/38671] Loss 0.6364 (0.5625) Prec@1 84.375 (86.893) Prec@5 96.875 (95.927)

Did you face the above issue? Could you help me figure out what the issue might be?
Thanks.
Darshan, SSSIHL.

@darshangera
Author

Sorry, typo in the above: it should read "But loss is not decreasing and precision is also not improving, as shown below."

@AlfredXiangWu
Owner

AlfredXiangWu commented Oct 9, 2018

The reason is that we don't save the optimizer state in the checkpoint. Light CNN is trained with SGD with momentum. If you resume from the saved model, the weights are loaded from the checkpoint, but the optimizer state (including the momentum buffers) is not.
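
For intuition, here is a minimal, self-contained sketch (a toy torch.nn.Linear model, not this repo's training script) showing that SGD's momentum buffers live in optimizer.state_dict() and therefore disappear if only the model weights are checkpointed:

import torch

# Minimal sketch (toy model, not from this repo): after one SGD step with
# momentum, the optimizer state holds a 'momentum_buffer' for each parameter.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# 'state' maps each parameter to its momentum buffer; this is exactly
# what is lost when only model.state_dict() is saved.
print(optimizer.state_dict()['state'])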

You can modify these lines so that the optimizer state is saved as well:

save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),  # save the optimizer state alongside the weights
    'prec1': prec1,
}, save_name)

And you also need to modify the lines that load the checkpoint:

if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])  # restore the optimizer state (momentum buffers)
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))

@darshangera
Author

Thanks a lot. Sairam
