
Apex support for mixed precision training #183

Merged
merged 4 commits on May 28, 2019

Conversation

@BloodAxe (Contributor) commented May 13, 2019

Makes mixed precision training as simple as possible:

from apex import amp
from catalyst.dl import SupervisedRunner  # exact import path may differ across catalyst versions

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    ...)

Breaking changes (mostly internal, but still):

  1. Experiment.get_optimizer has been renamed to Experiment.get_optimizer_and_model and now returns the optimizer and model as a tuple. Motivation: Apex needs two objects for initialization, the model and the optimizer, and the model must already be on the GPU and must not be wrapped in DataParallel/DistributedDataParallel. This change keeps amp initialization in one clear place.
  2. UtilsFactory.prepare_model no longer wraps the model in DataParallel, since that would break Apex initialization. Instead, DataParallel is applied in Experiment.get_optimizer_and_model when fp16 is not enabled and more than one GPU is available (a rough sketch of the new flow follows below).
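
A rough sketch of the flow described in the two points above, assuming the model and optimizer are already constructed; the function name and body here are illustrative, and the actual Experiment.get_optimizer_and_model may differ in details:

import torch
from torch.nn import DataParallel
from apex import amp

def get_optimizer_and_model(model, optimizer, fp16: bool, opt_level: str = "O1"):
    # Apex needs the bare model on the GPU, not wrapped in DataParallel.
    model = model.cuda()
    if fp16:
        # All amp initialization happens in this single place.
        model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    elif torch.cuda.device_count() > 1:
        # DataParallel is applied only when fp16 is off and several GPUs are available.
        model = DataParallel(model)
    return optimizer, model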

@Scitator (Member)

@BloodAxe Proposal: add an fp16 flag to SupervisedRunner.train and .infer, with an automatic model cast to fp16 mode. What do you think about it?
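
From the user's side the proposed flag might look like this; the fp16 argument here is hypothetical and not implemented in this PR:

runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    fp16=True,  # hypothetical flag: apply mixed precision automatically
    ...)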

@BloodAxe (Contributor, Author)

This needs discussion. Apart from a global enable/disable flag for fp16 training, there are a couple of options the user may want to change (optimization level, keeping batch norm in fp32, loss scaling, etc.); see the sketch below.
This would expand the set of parameters in SupervisedRunner. That's my only objection so far.
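
For reference, these are the kinds of knobs Apex itself exposes; a minimal sketch using amp.initialize arguments, following the combination shown in the Apex docs:

from apex import amp

# O2 with batch norm kept in fp32 and dynamic loss scaling;
# a fixed loss scale (e.g. 128.0) can be passed instead of "dynamic".
model, optimizer = amp.initialize(
    model,
    optimizer,
    opt_level="O2",
    keep_batchnorm_fp32=True,
    loss_scale="dynamic",
)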

@Scitator (Member)

@BloodAxe hmm, you are right...
Then what about an fp16 flag with the following logic (see the sketch below)?

  • do nothing if the model & optimizer are already fp16
  • otherwise turn on the default optimization level (O1?), keep batch norm in fp32, and use loss autoscaling (or scale by 128)
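
A minimal sketch of that logic, assuming the helper name and the "already fp16" heuristic are illustrative only:

import torch
from apex import amp

def cast_to_fp16_if_needed(model, optimizer, fp16: bool):
    if not fp16:
        return model, optimizer
    # "Do nothing if model & optimizer are already fp16": a simple heuristic check.
    if next(model.parameters()).dtype == torch.float16:
        return model, optimizer
    # Default O1 keeps batch norm in fp32; dynamic loss scaling
    # (or a fixed scale such as 128) handles gradient underflow.
    return amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")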

.travis.yml review thread: outdated, resolved
@BloodAxe (Contributor, Author)

@Scitator I think it still makes sense to have a quick option to turn on fp16 mode for SupervisedRunner.
I'll fix the remaining issues and add the fp16 flag in a while.

@Scitator force-pushed the feature/apex branch 3 times, most recently from 7d483b8 to 3a664bb on May 28, 2019 at 07:58