Added in Large-Batch SGD with a warmup, and a LARS strategy. Also add… #8918
Conversation
@zhreshold could you help review?
python/mxnet/optimizer.py
Outdated
elif (strategy == 'power2'):
    mult = 1.0 + (maxmult - 1) * (nup * nup) / (nwup * nwup)
elif (strategy == 'power3'):
    mult = 1.0 + (maxmult - 1) * (nup * nup) / (nwup * nwup)
Power3 is wrong
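For reference, a presumed fix (illustrative only, not the PR's code) would use a cubic term in the 'power3' branch so it actually differs from 'power2':

```python
def warmup_multiplier(strategy, nup, nwup, maxmult):
    """Illustrative warmup multiplier. nup = updates done so far, nwup = total
    warmup updates, maxmult = multiplier reached at the end of warmup."""
    if nwup == 0 or nup >= nwup:
        return maxmult
    if strategy == 'power2':
        return 1.0 + (maxmult - 1) * (nup ** 2) / (nwup ** 2)
    if strategy == 'power3':
        # cubic ramp, rather than the duplicated quadratic formula flagged above
        return 1.0 + (maxmult - 1) * (nup ** 3) / (nwup ** 3)
    return 1.0 + (maxmult - 1) * nup / nwup  # 'linear' fallback
```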
@ashokei See comments. Could you rebase and make pylint pass as well?
@zhreshold thank you, will make requested changes.
Force-pushed from 20deebf to 4bbea48.
@zhreshold I made the requested changes, please review. Thank you.
Please see comments. The rest LGTM and is good to be merged once this issue is resolved.
python/mxnet/lr_scheduler.py
Outdated
self.max_update = max_update
self.power = pwr
self.count = num_update
if num_update <= max_update:
This is a duplicate of line 173. I understand it is for resuming training, but that should be handled in __call__; see the example in MultiFactorScheduler. Therefore, num_update is not necessary in __init__.
@zhreshold I removed the duplicate num_update line, can you please check? Thanks.
python/mxnet/lr_scheduler.py
Outdated
"""

def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
num_update is useless here.
python/mxnet/lr_scheduler.py
Outdated
self.base_lr_orig = self.base_lr
self.max_update = max_update
self.power = pwr
self.count = num_update
same for self.count
python/mxnet/lr_scheduler.py
Outdated
if num_update <= self.max_update:
    self.base_lr = self.base_lr_orig * pow(1.0 - float(num_update) / float(self.max_update),
                                           self.power)
self.count += 1
and here
@zhreshold Thanks. I see that "count" is not being used, so I removed it. Though MultiFactorScheduler does seem to be tracking count.
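For reference, a minimal polynomial-decay scheduler along the lines the review converges on (no num_update or count in __init__, with resume handled through the num_update passed to __call__) might look like the sketch below; names are illustrative, not the merged code:

```python
class PolyDecaySketch(object):
    """Illustrative: lr = base_lr * (1 - num_update / max_update) ** power."""

    def __init__(self, max_update, base_lr=0.01, pwr=2):
        self.base_lr_orig = base_lr
        self.max_update = max_update
        self.power = pwr

    def __call__(self, num_update):
        if num_update <= self.max_update:
            return self.base_lr_orig * pow(
                1.0 - float(num_update) / float(self.max_update), self.power)
        return 0.0
```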
@szha test_io.test_LibSVMIter fails occasionally; we should host the data (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/news20.t.bz2).
@ashokei Try rebasing and trigger the CI once more. We can merge once it passes.
Force-pushed from a1717f6 to 860eaeb.
@zhreshold all done, thanks!
state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
weight = weight - state

For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
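A plain-NumPy sketch of the update rule quoted above (function and variable names are illustrative; the real operator additionally runs on the device and handles multi-precision states):

```python
import numpy as np

def momentum_wd_update(weight, state, grad, lr, momentum, wd,
                       rescale_grad=1.0, clip_gradient=None):
    """One step of the momentum + weight-decay update shown in the docstring."""
    g = rescale_grad * grad
    if clip_gradient is not None:
        g = np.clip(g, -clip_gradient, clip_gradient)
    state = momentum * state + lr * g + wd * weight
    weight = weight - state
    return weight, state
```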
@ashokei @zhreshold Where is lbsgd_update defined? I don't see it. Please add a proper reference to the relevant paper.
It is the update method in the LBSGD class. We can fix that to resolve to the right method; the '_' is misleading.
The paper is here: https://arxiv.org/pdf/1708.03888.pdf
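For context, the layer-wise adaptive rate (LARS) from that paper scales each layer's learning rate by the ratio of the weight norm to the gradient norm. A rough sketch of the coefficient, with eta as the trust ratio and eps an assumed numerical guard:

```python
import numpy as np

def lars_local_lr(weight, grad, wd, eta=0.001, eps=1e-8):
    """Layer-wise LARS coefficient, per arXiv:1708.03888 (sketch)."""
    w_norm = np.linalg.norm(weight)
    g_norm = np.linalg.norm(grad)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    return eta * w_norm / (g_norm + wd * w_norm + eps)
```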
self.adaptive = False
self.admult = 1  # adaptation constant

def create_state(self, index, weight):
Is this copied from SGD?
Why not inherit SGD instead?
@ashokei As suggested, could you change this to inherit from SGD and override create_state_multi_precision, create_state, update, and update_multi_precision only if necessary? It seems like you are mixing the multi_precision part into the normal one.
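A structural sketch of that suggestion (method bodies elided; class and parameter names are illustrative, not the PR's final code):

```python
from mxnet.optimizer import SGD

class LBSGDSketch(SGD):
    """Illustrative structure: reuse SGD's state handling, override only what differs."""

    def __init__(self, warmup_strategy='linear', warmup_epochs=5, **kwargs):
        super(LBSGDSketch, self).__init__(**kwargs)
        self.warmup_strategy = warmup_strategy
        self.warmup_epochs = warmup_epochs

    # create_state and create_state_multi_precision are inherited from SGD unchanged.

    def update(self, index, weight, grad, state):
        # A full implementation would compute the warmup/LARS multiplier here
        # and fold it into the learning rate before delegating to SGD.
        super(LBSGDSketch, self).update(index, weight, grad, state)
```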
warmup_strategy: string ('linear', 'power2', 'sqrt', 'lars'; default: 'linear')
warmup_epochs: unsigned, default: 5
batch_scale: unsigned, default: 1 (same as batch size * numworkers)
updates_per_epoch: updates_per_epoch (default: 32. The default might not reflect the true number of batches per epoch; used for warmup.)
Why use warmup_epochs and updates_per_epoch? Why not just warmup_updates?
Why should it have a default value?
I guess it requires the epoch number to stop warming up, which does not depend on the number of updates.
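Presumably the two parameters combine into a total warmup length in updates, roughly:

```python
def total_warmup_updates(warmup_epochs, updates_per_epoch):
    # Assumption: warmup spans warmup_epochs * updates_per_epoch optimizer updates.
    return warmup_epochs * updates_per_epoch
```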
warmup_epochs: unsigned, default: 5
batch_scale: unsigned, default: 1 (same as batch size * numworkers)
updates_per_epoch: updates_per_epoch (default: 32. The default might not reflect the true number of batches per epoch; used for warmup.)
begin_epoch: unsigned, default 0, starting epoch.
What is the starting epoch? What would it do before the start epoch?
@ashokei please add more details describing the strategy.
@piiswrong The begin_epoch flag is there because our data scientist saw that it was possible to stop a training run partway, save it, then resume it again. We wanted to have that option, which required passing in the starting epoch so the learning rate decay calculations are done correctly.
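A rough illustration of how such an offset could enter the warmup/decay bookkeeping when resuming (names are assumptions, not the PR's exact code):

```python
def initial_update_count(begin_epoch, updates_per_epoch):
    # Resuming at begin_epoch: shift the update counter so warmup and decay
    # continue from where the interrupted run left off.
    return begin_epoch * updates_per_epoch
```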
Have you tested the accuracy of ResNet-50 (or AlexNet-BN) on ImageNet using the LARS method?
@ashokei Similar question as above. How do I use this? I'm unable to train ResNet-50 with lbsgd. What configuration works? I have an effective batch size of about 20k across 20 worker nodes.
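For anyone looking for a starting point, a hedged usage sketch at the Python API level; the keyword names follow the docstring quoted earlier in this thread, the registered optimizer name 'lbsgd' is assumed, and the values are placeholders rather than a validated 20k-batch recipe:

```python
import mxnet as mx

# Illustrative only: create the large-batch SGD optimizer with warmup options.
opt = mx.optimizer.create(
    'lbsgd',
    learning_rate=0.1,
    momentum=0.9,
    warmup_strategy='lars',
    warmup_epochs=5,
    batch_scale=1,
    updates_per_epoch=32,
    begin_epoch=0,
)

# mod is an mx.mod.Module built elsewhere; pass the optimizer instance to fit():
# mod.fit(train_iter, optimizer=opt, num_epoch=90)
```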
initializer = mx.init.Xavier(
    rnd_type='gaussian', factor_type="in", magnitude=2)
# A limited number of optimizers have a warmup period
has_warmup = {'lbsgd', 'lbnag'}
It looks like LBNAG doesn't exist?
@chaseadams509 Can you please address the above comments? Thanks.
Correct, lbnag was an optimizer we were experimenting with, incorporating the large-batch algorithm into Nesterov accelerated gradient. LBNAG was not showing the desired improvements, so we hadn't pushed it yet.
@chaseadams509 How large were the batch sizes you experimented with?
@ashokei Were the comments above ever addressed in a follow-up pull request? See:
Large-Batch SGD with a warmup, and a LARS strategy.
Description
Added in Large-Batch SGD with a warmup, and a LARS strategy. Also added in a Polynomial Decay learning rate scheduler. Modified the example image fit code to allow these options to be selectable.
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments