
[Torch classifier agent][bug fix] Fix optimizer loading in classifier agent #4406

Merged
merged 1 commit into main from fix-optim-torch-classifier on Mar 9, 2022

Conversation

dexterju27
Contributor

Patch description
I encountered an issue where my jobs would crash with KeyError: "param 'initial_lr' is not specified in param_groups[0]" when resuming an optimizer during final evaluation.

After some digging, I found the issue was the following:

  1. In the torch classifier agent, we didn't load the optimizer state back when loading a checkpoint; instead, we created a new optimizer from model.parameters().
  2. This breaks training when the model was saved during warm-up, in the following way:
  3. The optimizer was originally created without initial_lr; initial_lr is only added by LambdaLR when last_epoch = -1 (torch/optim/lr_scheduler.py:35).
  4. The optimizer was then saved with initial_lr in its param groups.
  5. However, when loading the optimizer back, we ignored the saved optimizer state_dict and created a new optimizer that does not have the initial_lr key.
  6. This crashes the warm-up schedule initialization: since we resume the scheduler with last_epoch equal to the number of training steps, it expects an initial_lr that is not in the optimizer.
The offending line in the classifier agent, which builds a fresh optimizer instead of restoring the saved one (a minimal sketch of the resulting failure follows right after):

    self.init_optim(optim_params)
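A minimal, standalone sketch of this failure mode in plain PyTorch (not ParlAI code; the model, learning rate, and warm-up lambda are made up for illustration). It shows that LambdaLR resumed with last_epoch >= 0 requires initial_lr in every param group, which is only present if the saved optimizer state is restored:

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(4, 2)
    warmup = lambda step: min(1.0, (step + 1) / 100)  # toy warm-up schedule

    # Fresh run: last_epoch == -1, so LambdaLR adds 'initial_lr' to each param group.
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    LambdaLR(opt, warmup, last_epoch=-1)
    saved_state = opt.state_dict()  # param_groups now carry 'initial_lr'

    # Buggy resume (old classifier agent behavior): build a brand-new optimizer
    # from model.parameters() and ignore the saved state, so 'initial_lr' is missing.
    new_opt = torch.optim.SGD(model.parameters(), lr=0.1)
    try:
        LambdaLR(new_opt, warmup, last_epoch=50)  # resuming at training step 50
    except KeyError as err:
        print(err)  # param 'initial_lr' is not specified in param_groups[0] ...

    # Fixed resume (what this PR does, conceptually): restore the saved optimizer
    # state first, then the scheduler can be resumed without error.
    new_opt.load_state_dict(saved_state)
    LambdaLR(new_opt, warmup, last_epoch=50)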

Proposed changes:
Change this line to match what the torch generator agent already does: load the optimizer state back from the checkpoint instead of creating a new optimizer from model.parameters().

was_reset = self.init_optim(
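The call above is truncated; a rough sketch of the full call, modeled on the torch generator agent's existing behavior, is shown below. The keyword names optim_states and saved_optim_type are assumed from that agent and may differ slightly in the actual diff:

    # Pass the optimizer state_dict (and optimizer type) saved in the checkpoint
    # into init_optim, instead of rebuilding the optimizer from model.parameters()
    # alone; init_optim reports whether it had to reset the optimizer.
    was_reset = self.init_optim(
        [p for p in self.model.parameters() if p.requires_grad],
        optim_states=states.get('optimizer'),
        saved_optim_type=states.get('optimizer_type'),
    )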

Testing steps
You can reproduce the issue by setting a high number of warm-up updates so that a checkpoint is saved during warm-up, then resuming training from that checkpoint. The issue no longer occurs after the proposed change.

@dexterju27 dexterju27 merged commit d4fded0 into main Mar 9, 2022
@dexterju27 dexterju27 deleted the fix-optim-torch-classifier branch March 9, 2022 15:21