This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Generalize LR scheduler #2345

Merged: 23 commits merged into allenai:master on Jan 28, 2019

Conversation

@epwalsh (Member) commented Jan 11, 2019

This is the follow-up PR to feature request #2334.

So far I've implemented a base Scheduler class and a LearningRateScheduler class, where LearningRateScheduler inherits from Scheduler. I also refactored all of the existing LR schedulers to inherit from LearningRateScheduler and cleaned up the wrappers for the PyTorch LR schedulers. Still to do: implement a momentum scheduler and integrate it into the Trainer class.
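
A rough sketch of what that hierarchy could look like, based only on the description above and the __init__ fragment quoted later in this review; the method names and signatures below are assumptions for illustration, not the merged code:

import torch


class Scheduler:
    # Base class: adjusts one field of each optimizer param group (e.g. "lr" or "momentum") over time.
    def __init__(self, optimizer: torch.optim.Optimizer, param_group_field: str, last_epoch: int = -1) -> None:
        self.optimizer = optimizer
        self.param_group_field = param_group_field
        self._initial_param_group_field = f"initial_{param_group_field}"
        if last_epoch == -1:
            # First run: remember the starting value of each param group.
            for group in self.optimizer.param_groups:
                group.setdefault(self._initial_param_group_field, group[param_group_field])
        self.base_values = [group[self._initial_param_group_field] for group in self.optimizer.param_groups]
        self.last_epoch = last_epoch
        # (The real __init__ also calls step() at this point; see the fragment discussed below.)

    def get_values(self):
        # Subclasses return one new value per param group.
        raise NotImplementedError

    def step(self, metric: float = None, epoch: int = None) -> None:
        # metric is unused in this sketch; metric-based schedulers would use it.
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch
        for group, value in zip(self.optimizer.param_groups, self.get_values()):
            group[self.param_group_field] = value


class LearningRateScheduler(Scheduler):
    # Specialization that schedules the "lr" field of each param group.
    def __init__(self, optimizer: torch.optim.Optimizer, last_epoch: int = -1) -> None:
        super().__init__(optimizer, "lr", last_epoch)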

@matt-peters (Contributor) left a comment

Thanks for this! It's hard to review the PR since the diff is very large, but it's mostly moving things around. Are there any changes to the existing schedulers or tests other than moving locations? We could also merge this before adding the momentum schedulers, to break up the PRs.

@joelgrus should also take a look at this as it touches the trainer (although minimally).

@@ -274,7 +274,7 @@ def test_trainer_can_resume_with_lr_scheduler(self):
                                   num_epochs=4, serialization_dir=self.TEST_DIR)
         epoch, _ = new_trainer._restore_checkpoint()
         assert epoch == 2
-        assert new_trainer._learning_rate_scheduler.lr_scheduler.last_epoch == 1
+        assert new_trainer._learning_rate_scheduler.lr_scheduler.last_epoch == 2
@matt-peters (Contributor):

Why does this value change?

                if self._initial_param_group_field not in group:
                    raise KeyError(f"{self._initial_param_group_field} missing from param_groups[{i}]")
        self.base_values = [group[self._initial_param_group_field] for group in self.optimizer.param_groups]
        self.step(epoch=last_epoch)
@matt-peters (Contributor):

This might account for the difference in the trainer test -- in the PyTorch base class torch.optim.lr_scheduler._LRScheduler, this line is self.step(last_epoch + 1).

@epwalsh (Member, Author) commented Jan 11, 2019

@matt-peters yeah, I think the code/test was wrong before due to the way PyTorch LR schedulers are implemented. I left a note in the comments right above explaining it. I think GitHub is having some issues right now, though, so my latest commits are not showing up.

@epwalsh (Member, Author) commented Jan 11, 2019

The commits are showing up now; it was just slow. Sorry, I hadn't yet seen your comment about holding off on the momentum schedulers. But I still haven't added anything to the trainer class.

@epwalsh (Member, Author) commented Jan 11, 2019

@matt-peters so far there have only been slight changes to the existing LR schedulers, to account for the slightly different base class. The behavior is unchanged (I split up the tests, but didn't change anything significant), other than fixing (I believe) how the PyTorch schedulers behave: previously they were updating the learning rate an epoch early.

@matt-peters (Contributor) commented:

Got it -- even though the PyTorch base class prematurely updates the learning rate, does changing that behavior break backward compatibility with any of the existing schedulers? Or did the existing schedulers already account for the issue and work around it? My concern is getting different behavior from a given scheduler before vs. after this change.

@epwalsh (Member, Author) commented Jan 12, 2019

I don't think this breaks backwards compatibility, in the sense that the schedules from the PyTorch LR schedulers will be the same, just shifted by one epoch. But IMHO the new shifted schedule is actually what is expected. I'm not really a fan of how PyTorch's LR schedulers are implemented: step() is called when they are initialized, and then step() is supposed to be called again before the first epoch, which seems redundant, as opposed to calling it after each epoch (as the AllenNLP trainer does).

Just my opinion though... it's an easy fix to change it back if you don't agree.
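
To make the timing concrete, here is a small standalone illustration (not from the PR itself) using the stock torch.optim.lr_scheduler.StepLR; the exact values depend on the PyTorch version, but it shows that the scheduler has already stepped once by the time the first epoch starts:

import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

# The constructor itself has already called step() once:
print(scheduler.last_epoch, optimizer.param_groups[0]["lr"])  # 0 1.0

for epoch in range(3):
    # ... train for one epoch ...
    scheduler.step()  # the AllenNLP trainer steps after each epoch,
                      # whereas the PyTorch docs step before each epoch
    print(scheduler.last_epoch, optimizer.param_groups[0]["lr"])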

@epwalsh (Member, Author) commented Jan 17, 2019

@matt-peters would it be easier if I broke this PR up? It might make sense to split this into 3 sequential PRs as follows:

  1. Move LR schedulers to their own directory, break out cosine, noam, and slanted triangular into their own files, do the same for the tests (so this only involves moving files)
  2. Implement the Scheduler and LearningRateScheduler abstractions and modify the existing LR schedulers to inherit from the new abstractions
  3. Implement a MomentumScheduler abstraction along with a couple of useful concrete momentum schedulers (a rough sketch of what this could look like follows below)
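
Purely as an illustration of item 3, a momentum scheduler could reuse the Scheduler base class sketched near the top of this thread by targeting the "momentum" param-group field; this is a hypothetical sketch, not the momentum schedulers that were eventually implemented:

import torch


class MomentumScheduler(Scheduler):  # Scheduler as sketched earlier in this thread
    # Schedules the "momentum" field of each param group instead of "lr".
    def __init__(self, optimizer: torch.optim.Optimizer, last_epoch: int = -1) -> None:
        super().__init__(optimizer, "momentum", last_epoch)


class LinearMomentumDecay(MomentumScheduler):
    # Hypothetical example: linearly anneal momentum from its initial value to final_momentum
    # over num_epochs epochs, then hold it there.
    def __init__(self, optimizer: torch.optim.Optimizer, num_epochs: int,
                 final_momentum: float = 0.85, last_epoch: int = -1) -> None:
        self.num_epochs = num_epochs
        self.final_momentum = final_momentum
        super().__init__(optimizer, last_epoch)

    def get_values(self):
        # Fraction of the annealing period completed so far (step() is called once per epoch).
        frac = min(max(self.last_epoch + 1, 0) / self.num_epochs, 1.0)
        return [base + frac * (self.final_momentum - base) for base in self.base_values]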

@matt-peters (Contributor) commented:

I don't think it's necessary to break it up into pieces. But I'm worried about breaking backward compatibility, even if the PyTorch behavior of calling step() is a little wonky. Can you revert to the old behavior?

@epwalsh (Member, Author) commented Jan 17, 2019

Sounds good, just did!

@matt-peters (Contributor) left a comment

Thanks! Looks good, I'll merge this next week.

@epwalsh changed the title from "WIP: generalize LR scheduler and implement momentum schedulers" to "Generalize LR scheduler and implement momentum schedulers" on Jan 19, 2019
@epwalsh changed the title from "Generalize LR scheduler and implement momentum schedulers" to "Generalize LR scheduler" on Jan 20, 2019
@matt-peters merged commit 5ff923c into allenai:master on Jan 28, 2019
@epwalsh deleted the schedulers branch on Jan 28, 2019 at 22:47