Update lr_finder.py #42
Conversation
Thanks for the PR!
```python
# what's not visible here comes from the lrfinder_mnist example
import torch.optim as optim
from torch_lr_finder import LRFinder

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.5)
lr_finder = LRFinder(model, optimizer, criterion, device="cuda")
lr_finder.range_test(trainloader, end_lr=10, num_iter=5, step_mode="exp")
```
So the logic is definitely wrong in both cases; the history should start at 0.001 and end at 10 😕
@yongduek, @davidtvs Sorry for replying late.

For this statement, I want to mention an implementation detail. The reason why the history did not start with the initial learning rate we set is that the history is used to save the values changed after the scheduler steps. So it is correct to change the order of the following two lines as you said:

pytorch-lr-finder/torch_lr_finder/lr_finder.py, lines 203 to 204 in ed21825

but we should keep the line of setting … . In conclusion, with the example provided by @davidtvs, the lr history you expected is the sequence of values *before* each step (the lr actually used in each iteration), but the history recorded by the current implementation of this package is the sequence of values *after* each step.

For the point 2 you mentioned, I agree with @davidtvs's decision. We should care about backward compatibility.
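Since the concrete history values from that comment are not preserved in this excerpt, here is a hedged sketch of the two interpretations using the example settings (base_lr=0.001, end_lr=10, num_iter=5); `lr_at` is an illustrative stand-in for the scheduler's formula, not code from the package:

```python
def lr_at(i, base_lr=0.001, end_lr=10, num_iter=5):
    # Illustrative exponential schedule: the lr at iteration i.
    return base_lr * (end_lr / base_lr) ** (i / (num_iter - 1))

# Recording *before* each step: the lr actually used in each iteration,
# starting at the initial value.
expected = [lr_at(i) for i in range(5)]      # ~[0.001, 0.01, 0.1, 1.0, 10.0]

# Recording *after* each step: the lr prepared for the *next* iteration,
# so the initial value never appears in the history.
recorded = [lr_at(i + 1) for i in range(5)]  # ~[0.01, 0.1, 1.0, 10.0, 100.0]
```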
I vaguely remember getting pretty confused by how `_LRScheduler` works when I was writing the schedulers for this package, and even now I still find the logic hard to follow for something that seems pretty simple. Anyway, rant over; it looks much better after 1.4.0. There's still something wrong besides the order of those two lines, because in my experiment I locally fixed the order and the learning rate never reaches 10. The history @NaleRaphael posted as expected has 6 elements instead of 5; I would definitely expect 5 elements. It seems to me that the computation of the learning rates is incorrect.
@davidtvs Maybe we should modify that line. If this is what we want, then the following revision should make it work:

```python
class ExponentialLR(_LRScheduler):
    def get_lr(self):
        # Note that we should handle the case where the given `num_iter` is 1,
        # since it would trigger a `ZeroDivisionError` here.
        r = self.last_epoch / (self.num_iter - 1)
        return [base_lr * (self.end_lr / base_lr) ** r for base_lr in self.base_lrs]
```

A quick explanation: with an execution order like this:

```python
for i in range(num_iter):
    train()
    eval()
    optimizer.step()
    scheduler.step()
```

the optimizer's lr in the first iteration will be exactly the initial value we set, so we can ensure that the lr used in each iteration is exactly the same as the one saved in the history.
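For what it's worth, a quick sanity check of the proposed formula with the example settings from earlier in the thread (this snippet and its names are illustrative, not part of the package):

```python
# Evaluate the proposed formula r = last_epoch / (num_iter - 1) directly,
# with last_epoch running 0..num_iter-1 as in the loop above.
base_lr, end_lr, num_iter = 0.001, 10, 5
lrs = [base_lr * (end_lr / base_lr) ** (i / (num_iter - 1)) for i in range(num_iter)]
print(lrs)  # approximately [0.001, 0.01, 0.1, 1.0, 10.0]
```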
Sorry for the unclear description. What I wanted to explain in that comment is the recorded history. Besides, there is one thing I was considering while writing that comment. As we know so far, …
I made a test that replicates the issue and ran it with PyTorch v0.4.1 and v1.4.0. Two things can be concluded:

* The current computation of the exponential learning rate is incorrect in all versions of PyTorch above 0.4.1.
* PyTorch 1.4.0 introduced a different design for the LRScheduler; the fix for the exponential learning rate for version 0.4.1 is not the same as the one for 1.4.0.

Setup for the above:

1. torch==0.4.1 torchvision==0.2.1
2. torch==1.4.0 torchvision==0.5.0

The fix for v0.4.1:

```python
def get_lr(self):
    curr_iter = self.last_epoch + 1
    r = curr_iter / (self.num_iter - 1)
    return [base_lr * (self.end_lr / base_lr) ** r for base_lr in self.base_lrs]
```

The fix for v1.4.0:

```python
def get_lr(self):
    r = self.last_epoch / (self.num_iter - 1)
    return [base_lr * (self.end_lr / base_lr) ** r for base_lr in self.base_lrs]
```

I'm thinking this means that there will be one last release with this fixed for v0.4.1 and, after that, this package will support only PyTorch v1.4.0+.
@davidtvs Thanks for your effort on this; I only tested it on PyTorch 1.3.0 before. 🙇 I found this issue has been addressed in PyTorch #7889, and it has been fixed in PyTorch v1.1.0 (see also this commit). The relevant lines in `_LRScheduler.__init__()` are:

```python
# source: https://github.com/pytorch/pytorch/blob/3749c58/torch/optim/lr_scheduler.py#L22-L23
# Given `last_epoch` is the default value -1.
self.step(last_epoch + 1)  # after this line is executed, self.last_epoch = 0
self.last_epoch = last_epoch  # after this line is executed, self.last_epoch = -1
```

In my opinion, we can keep supporting at least PyTorch v1.1.0+ on the master branch with the second approach, and perhaps create a new branch for PyTorch v0.4 with the first approach.

UPDATE: One more thing about the fix. Since the denominator of `r` is `num_iter - 1`, a `num_iter` of 1 would trigger a `ZeroDivisionError`.
Ya, I think we can just raise an exception when `num_iter` is less than or equal to 1. As to how we handle the learning rate computation, I thought about using the …
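A minimal sketch of such a guard (the released fix raises a `ValueError`, per the commit message further below; the surrounding context of this snippet is assumed):

```python
# Reject degenerate ranges up front instead of hitting a ZeroDivisionError
# later in get_lr() when computing r = last_epoch / (num_iter - 1).
if num_iter <= 1:
    raise ValueError("`num_iter` must be larger than 1")
```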
Great, so I think it's time to keep moving forward with this PR.
Thanks for asking, and thanks for your effort on open source. I would like to join the activity but have to rely on you because of a tight schedule until the end of June. One thing I would like to mention: because of the smoothing inside the LRFinder class, the loss-lr graph shows a lagged version of the one without smoothing, and I am not sure how much this affects the actual learning process afterwards. Well, this must be beyond the scope of this project.
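For readers following along, here is a hedged sketch of the kind of exponential smoothing being described; `smooth_f` mirrors the `range_test` argument of the same name, but the helper itself is illustrative, not the package's code:

```python
def smooth(losses, smooth_f=0.05):
    # Each recorded value is a blend of the new loss and the previous
    # recorded value, so the smoothed curve lags behind the raw one.
    smoothed = [losses[0]]
    for loss in losses[1:]:
        smoothed.append(smooth_f * loss + (1 - smooth_f) * smoothed[-1])
    return smoothed
```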
@yongduek thanks for the PR; at least now we know about the issue and it'll get fixed. Since I had already created a branch for this and made some changes there, I'll finish it there so that this issue is fixed for the next release. It also brought to my attention that different PyTorch versions can silently break things, so I'll start running different versions of PyTorch in the CI.
…#43, #42)

* Add unit test related to #42. Two things can be concluded:
  * The current computation of the exponential learning rate is incorrect in all versions of PyTorch above 0.4.1
  * PyTorch 1.4.0 introduced a different design for the LRScheduler; the fix for the exponential learning rate for version 0.4.1 is not the same as the one for 1.4.0

  Setup for the above:
  1. torch==0.4.1 torchvision==0.2.1
  2. torch==1.4.0 torchvision==0.5.0
* Fix learning rate computation in schedulers
* Add PyTorch matrix to the CI job. Also, added caching to make the CI job faster.
* Raise ValueError for num_iter<=1
* Fix syntax error in CI yaml
* Fix syntax error in CI yaml v2
* Fix CI job. The combo of py3.7 and torch 0.4.1 breaks type inference for torch.tensor with np.int64
* Allow CI to be skipped
I was looking for a pytorch lrfinder. Very nice, and thank you for sharing it.

When `lr_finder.history['lr']` was printed out, it did not start with the initial learning rate set in the optimizer declaration. So two things seem to need modification to make it happen:

1. Call `get_lr()` and append it to the history before stepping the scheduler.
2. Use `get_last_lr()` to get the latest lr. This seems to be a recent change in pytorch.

Also, `self.last_epoch` does not need to be incremented in `ExponentialLR` and `LinearLR`: `super().__init__()` calls `step()` within `_LRScheduler`, then `step()` calls `get_lr()` to update the values. The results are saved in variables in the base class (`_LRScheduler`) and retrieved with `get_last_lr()`. This is reportedly the recent pytorch way of using it.
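To make the described flow concrete, here is a minimal runnable sketch (assuming PyTorch >= 1.4; the scheduler mirrors this package's `ExponentialLR`, but the snippet itself is illustrative, not the package's implementation):

```python
import torch
from torch.optim.lr_scheduler import _LRScheduler

class ExponentialLR(_LRScheduler):
    def __init__(self, optimizer, end_lr, num_iter, last_epoch=-1):
        self.end_lr = end_lr
        self.num_iter = num_iter
        # super().__init__() calls step() once, which calls get_lr() and
        # stores the result in the base class for get_last_lr().
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        r = self.last_epoch / (self.num_iter - 1)
        return [base_lr * (self.end_lr / base_lr) ** r for base_lr in self.base_lrs]

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = ExponentialLR(optimizer, end_lr=10, num_iter=5)

history = [scheduler.get_last_lr()[0]]  # starts at the initial lr, 0.001
for _ in range(4):
    optimizer.step()
    scheduler.step()
    history.append(scheduler.get_last_lr()[0])
print(history)  # approximately [0.001, 0.01, 0.1, 1.0, 10.0]
```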