This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Fix LR scheduler cooldown #3719

Merged: stephenroller merged 3 commits into master from lrschedulemax on Jun 15, 2021

Conversation

stephenroller (Contributor)

Patch description
Context:

  • Originally in ParlAI, fixed LR schedulers like cosine/linear would consume (warmup_updates + max_lr_steps) updates, eventually cooling down to 0.
  • [train] New training options for logging/validation based on number of steps #3379 changed this behavior so that only max_lr_steps updates would be consumed, but did not speed up the cooldown accordingly.
  • This PR changes it so that the full cooldown is completed by the end of max_lr_steps (see the sketch below).
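To make the intended behavior concrete, here is a minimal editorial sketch (not ParlAI's implementation; warmup_updates and max_lr_steps mirror the options named above, everything else is illustrative) of a LambdaLR-style linear schedule whose cooldown reaches zero exactly at max_lr_steps:

```python
from torch import nn, optim

# Editorial sketch only -- not ParlAI's code. The model/optimizer are dummies.
warmup_updates = 100
max_lr_steps = 1000

def lr_mult(step: int) -> float:
    if step < warmup_updates:
        # linear warmup toward the base LR
        return (step + 1) / warmup_updates
    # linear cooldown that hits 0 at max_lr_steps (the post-PR behavior);
    # pre-PR, reaching 0 would have taken warmup_updates + max_lr_steps updates
    return max(0.0, 1.0 - step / max_lr_steps)

model = nn.Linear(4, 4)
optimizer = optim.SGD(model.parameters(), lr=1.0)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_mult)

for _ in range(max_lr_steps):
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())  # [0.0]: cooldown is finished by max_lr_steps
```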

Testing steps
Adjusted CI, new assertions.

@emilydinan (Contributor) left a comment


thanks for the fix! this lgtm

```diff
         if optim_states and saved_optim_type != opt['optimizer']:
             # we changed from adam to adamax, or sgd to adam, or similar
             logging.warning('Not loading optim state since optim class changed.')
-            return False
+            return True
         elif optim_states:
             # check for any fp16/fp32 conversions we need to do
             optimstate_fp16 = 'loss_scaler' in optim_states
```
Contributor

I can't leave a comment for it, but are the semantics correct for line 1099, the `elif not optimstate_fp16 and self.fp16` block? Are we returning True always because of the lower-precision conversion?

stephenroller (Contributor, Author)

agreed
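For readers following the exchange above, here is a hypothetical sketch of the kind of precision check under discussion (the identifiers optim_states, optimstate_fp16, 'loss_scaler', and fp16 come from the snippet and comment above; the control flow and return semantics are illustrative assumptions, not ParlAI's actual code):

```python
def optim_state_needs_reset(optim_states: dict, fp16_enabled: bool) -> bool:
    """Illustrative only: decide whether saved optimizer state should be discarded.

    A saved fp16 optimizer state carries a 'loss_scaler' entry (as in the
    snippet above); a mismatch between saved and current precision is treated
    here as grounds for a reset.
    """
    optimstate_fp16 = 'loss_scaler' in optim_states
    if optimstate_fp16 and not fp16_enabled:
        # saved in fp16, now running fp32: drop the fp16-specific state
        return True
    elif not optimstate_fp16 and fp16_enabled:
        # saved in fp32, now running fp16: the branch the question above asks
        # about -- whether always returning True here is the right semantics
        return True
    return False
```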

```diff
         self.scheduler = optim.lr_scheduler.LambdaLR(optimizer, self._linear_lr)

     def _linear_lr(self, step):
         # this multiplicative factor ensures linear decay rate
         # lr_mult = float(self.max_lr_steps - step - 1) / float(self.max_lr_steps - step)
-        lr_mult = max(0.0, 1e-6 + (1.0 - step / self.max_lr_steps) * (1 - 1e-6))
+        lr_mult = max(0.0, 1.0 - step / self.max_lr_steps)
```
Contributor

we don't need 1e-6 anymore?

stephenroller (Contributor, Author)

I made an executive call to let it actually go to 0 :P
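A quick numeric check (editorial sketch using the two formulas from the diff above, with an arbitrary max_lr_steps) shows the difference at the final update:

```python
max_lr_steps = 1000
step = max_lr_steps  # the final update

# removed formula: the 1e-6 term floors the multiplier just above zero
old_mult = max(0.0, 1e-6 + (1.0 - step / max_lr_steps) * (1 - 1e-6))
# new formula: the multiplier actually reaches zero
new_mult = max(0.0, 1.0 - step / max_lr_steps)

print(old_mult)  # 1e-06
print(new_mult)  # 0.0
```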

@stephenroller stephenroller merged commit d3713fe into master Jun 15, 2021
@stephenroller stephenroller deleted the lrschedulemax branch June 15, 2021 23:53