Conversation
Thanks for this! It's hard to review the PR as the diff is very large, but it's mostly moving things around. Are there any changes to the existing schedulers or tests other than moving locations? We could also merge this before adding the momentum schedulers to break up the PRs.
@joelgrus should also take a look at this, as it touches the trainer (although minimally).
```diff
@@ -274,7 +274,7 @@ def test_trainer_can_resume_with_lr_scheduler(self):
                               num_epochs=4, serialization_dir=self.TEST_DIR)
         epoch, _ = new_trainer._restore_checkpoint()
         assert epoch == 2
-        assert new_trainer._learning_rate_scheduler.lr_scheduler.last_epoch == 1
+        assert new_trainer._learning_rate_scheduler.lr_scheduler.last_epoch == 2
```
Why does this value change?
```python
if self._initial_param_group_field not in group:
    raise KeyError(f"{self._initial_param_group_field} missing from param_groups[{i}]")
self.base_values = [group[self._initial_param_group_field] for group in self.optimizer.param_groups]
self.step(epoch=last_epoch)
```
This might account for the difference in the trainer test -- in the pytorch base `torch.optim.lr_scheduler._LRScheduler`, this line is `self.step(last_epoch + 1)`.
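For concreteness, here is a minimal sketch (toy classes, not the actual AllenNLP or PyTorch code) contrasting the two initialization conventions:

```python
from typing import Optional


class PyTorchStyleScheduler:
    """Mimics torch.optim.lr_scheduler._LRScheduler, whose __init__ ends with step(last_epoch + 1)."""

    def __init__(self, last_epoch: int = -1) -> None:
        self.last_epoch = last_epoch
        self.step(last_epoch + 1)  # advances the counter during construction

    def step(self, epoch: Optional[int] = None) -> None:
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch


class NewBaseStyleScheduler:
    """Mimics the new base class above, whose __init__ ends with step(epoch=last_epoch)."""

    def __init__(self, last_epoch: int = -1) -> None:
        self.last_epoch = last_epoch
        self.step(epoch=last_epoch)  # re-applies the schedule without advancing

    def step(self, epoch: Optional[int] = None) -> None:
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch


# Resuming from a checkpoint saved at epoch 2:
print(PyTorchStyleScheduler(last_epoch=2).last_epoch)  # 3
print(NewBaseStyleScheduler(last_epoch=2).last_epoch)  # 2
```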
@matt-peters yeah, I think the code/test was wrong before due to the way PyTorch LR schedulers are implemented. I left a note in the comments right above explaining it. I think GitHub is having some issues right now though, so my latest commits are not showing up.
The commits are showing up now, it was just slow. Sorry, I didn't see your comment about holding off on the momentum schedulers until now. But I still haven't added anything to the trainer class.
@matt-peters so far there have only been slight changes to the existing LR schedulers to account for a slightly different base class. Their behavior is unchanged (I split up the tests, but didn't change anything significant), other than fixing (I believe) how the PyTorch schedulers behave: previously they were updating the learning rate an epoch early.
Got it -- even though the pytorch base class prematurely updates the learning rate, does changing the behavior break backward compatibility with any of the existing schedulers? Or were they already written to work around the issue? My concern is a given scheduler behaving differently before and after this change.
I don't think this breaks backwards compatibility, insofar as the schedules from the PyTorch LR schedulers will be the same, just shifted by one epoch. And IMHO the new shifted schedule is actually what is expected. I'm not really a fan of how PyTorch's LR schedulers are implemented. Just my opinion though -- it's an easy fix to change it back if you don't agree.
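To make the one-epoch shift concrete, here is a quick check with a stock PyTorch scheduler (the counter behavior shown matches PyTorch at the time of this discussion; later releases changed the recommended `step()` ordering, but construction still advances `last_epoch` from -1 to 0):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

param = torch.nn.Parameter(torch.zeros(1))
optimizer = SGD([param], lr=0.1)
scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

# _LRScheduler.__init__ ends with self.step(last_epoch + 1), so merely
# constructing the scheduler advances last_epoch from -1 to 0.
print(scheduler.last_epoch)  # 0, before any training epoch has run

# Under the old convention of calling step() at the start of each epoch,
# the decay intended for epoch 1 is applied while epoch 0 is still being
# trained -- i.e. one epoch early.
scheduler.step()
print(optimizer.param_groups[0]["lr"])  # 0.05
```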
@matt-peters would it be easier if I broke this PR up? It might make sense to split this into 3 sequential PRs as follows:
I don't think it's necessary to break it up into pieces. But I'm worried about breaking backward compatibility, even if the pytorch behavior calling …
Sounds good, just did!
Thanks! Looks good, I'll merge this next week.
This is the follow-up PR to feature request #2334.

So far I've implemented a base `Scheduler` and `LearningRateScheduler` class, where `LearningRateScheduler` inherits from `Scheduler`. I also refactored all existing LR schedulers to inherit from `LearningRateScheduler`, and cleaned up the wrappers for PyTorch LR schedulers. Still to do: implement a momentum scheduler and integrate into the `Trainer` class.
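For reference, a rough sketch of the hierarchy described above (simplified; beyond the `Scheduler`/`LearningRateScheduler` split and the bookkeeping visible in the diff excerpt earlier, the details and the `ConstantLearningRateScheduler` example are assumptions rather than the exact AllenNLP API):

```python
from typing import List, Optional

import torch


class Scheduler:
    """Schedules an arbitrary per-parameter-group field, e.g. "lr" or "momentum"."""

    def __init__(self, optimizer: torch.optim.Optimizer,
                 param_group_field: str, last_epoch: int = -1) -> None:
        self.optimizer = optimizer
        self.param_group_field = param_group_field
        self._initial_param_group_field = f"initial_{param_group_field}"
        if last_epoch == -1:
            # Fresh run: stash each group's initial value of the field.
            for i, group in enumerate(optimizer.param_groups):
                if param_group_field not in group:
                    raise KeyError(f"{param_group_field} missing from param_groups[{i}]")
                group.setdefault(self._initial_param_group_field, group[param_group_field])
        else:
            # Resuming: the stashed initial values must already be present.
            for i, group in enumerate(optimizer.param_groups):
                if self._initial_param_group_field not in group:
                    raise KeyError(f"{self._initial_param_group_field} missing from param_groups[{i}]")
        self.base_values = [group[self._initial_param_group_field]
                            for group in optimizer.param_groups]
        self.last_epoch = last_epoch
        self.step(epoch=last_epoch)  # apply the schedule for last_epoch without advancing it

    def get_values(self) -> List[float]:
        raise NotImplementedError  # concrete schedulers compute the new values here

    def step(self, metric: Optional[float] = None, epoch: Optional[int] = None) -> None:
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch
        for group, value in zip(self.optimizer.param_groups, self.get_values()):
            group[self.param_group_field] = value


class LearningRateScheduler(Scheduler):
    """Specializes the base class to the "lr" field of each parameter group."""

    def __init__(self, optimizer: torch.optim.Optimizer, last_epoch: int = -1) -> None:
        super().__init__(optimizer, "lr", last_epoch)


class ConstantLearningRateScheduler(LearningRateScheduler):
    """A trivial concrete scheduler (hypothetical) that keeps the initial lr."""

    def get_values(self) -> List[float]:
        return self.base_values
```

A momentum scheduler would then be another thin subclass that passes `"momentum"` as the `param_group_field`, which is presumably why the base class schedules a generic field rather than hard-coding `"lr"`.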