
Apex support for mixed precision training #183

Merged
merged 4 commits on May 28, 2019

Conversation

@BloodAxe (Contributor) commented May 13, 2019

Makes mixed precision training as simple as possible:

from apex import amp
from catalyst.dl import SupervisedRunner  # exact import path may differ across catalyst versions

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    ...)

Breaking changes (mostly internal, but still):

  1. Experiment.get_optimizer has been renamed to Experiment.get_optimizer_and_model and now returns the optimizer and model as a tuple. Motivation: Apex needs two objects for initialization, the model and the optimizer, and the model must already be on the GPU and must not be wrapped in DataParallel/DistributedDataParallel. This change keeps amp initialization in one clear place.
  2. UtilsFactory.prepare_model no longer wraps the model in DataParallel, since that would break Apex initialization. Instead, DataParallel is applied in Experiment.get_optimizer_and_model when fp16 is not enabled and more than one GPU is available (a rough sketch of the new flow follows below).
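
A rough sketch of the flow described in the two points above, assuming the model and optimizer are already constructed; the function name and body here are illustrative, and the actual Experiment.get_optimizer_and_model may differ in details:

import torch
from torch.nn import DataParallel
from apex import amp

def get_optimizer_and_model(model, optimizer, fp16: bool, opt_level: str = "O1"):
    # Apex needs the bare model on the GPU, not wrapped in DataParallel.
    model = model.cuda()
    if fp16:
        # All amp initialization happens in this single place.
        model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    elif torch.cuda.device_count() > 1:
        # DataParallel is applied only when fp16 is off and several GPUs are available.
        model = DataParallel(model)
    return optimizer, model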

@Scitator (Member)

@BloodAxe Proposal: add an fp16 flag to SupervisedRunner.train and .infer, with an automatic model cast to fp16 mode. What do you think about it?
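
From the user's side the proposed flag might look like this; the fp16 argument here is hypothetical and not implemented in this PR:

runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    fp16=True,  # hypothetical flag: apply mixed precision automatically
    ...)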

@BloodAxe (Contributor, Author)

This needs discussion. Apart from a global enable/disable flag for fp16 training, there are a couple of options the user may want to change (optimization level, keeping batch norm in fp32, loss scaling, etc.); see the sketch below.
This would expand the set of parameters in SupervisedRunner. That's my only objection so far.
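
For reference, these are the kinds of knobs Apex itself exposes; a minimal sketch using amp.initialize arguments, following the combination shown in the Apex docs:

from apex import amp

# O2 with batch norm kept in fp32 and dynamic loss scaling;
# a fixed loss scale (e.g. 128.0) can be passed instead of "dynamic".
model, optimizer = amp.initialize(
    model,
    optimizer,
    opt_level="O2",
    keep_batchnorm_fp32=True,
    loss_scale="dynamic",
)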

@Scitator (Member)

@BloodAxe hmm, you are right...
Then what about an fp16 flag with the following logic (see the sketch below)?

  • do nothing if the model & optimizer are already fp16
  • otherwise turn on the default optimization level (O1?), keep batch norm in fp32, and use loss autoscaling (or scale by 128)
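
A minimal sketch of that logic, assuming the helper name and the "already fp16" heuristic are illustrative only:

import torch
from apex import amp

def cast_to_fp16_if_needed(model, optimizer, fp16: bool):
    if not fp16:
        return model, optimizer
    # "Do nothing if model & optimizer are already fp16": a simple heuristic check.
    if next(model.parameters()).dtype == torch.float16:
        return model, optimizer
    # Default O1 keeps batch norm in fp32; dynamic loss scaling
    # (or a fixed scale such as 128) handles gradient underflow.
    return amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")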

.travis.yml review thread: outdated, resolved
@BloodAxe (Contributor, Author)

@Scitator I think it still makes sense to have a quick option to turn on fp16 mode for SupervisedRunner.
I'll fix the remaining issues and add the fp16 flag in a while.

@Scitator force-pushed the feature/apex branch 3 times, most recently from 7d483b8 to 3a664bb on May 28, 2019 at 07:58