
Aggregate SGD #13346

Open · wants to merge 9 commits into master
@ptrendx
Contributor

ptrendx commented Nov 21, 2018

Description

Currently, MXNet optimizers are invoked one weight at a time. This leads to a lot of synchronization overhead, as the updates (especially for convolutions and batchnorm) tend to be small, but each one needs to be synchronized on.
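The overhead described above can be illustrated with a plain-Python sketch (illustrative only; the function names are placeholders and the real implementation launches fused GPU kernels rather than Python loops). The point is that the aggregated form produces identical results with one call instead of one call per weight:

```python
# Conceptual sketch of per-weight vs. aggregated SGD. Each call to
# sgd_update stands in for one kernel launch plus its synchronization.

def sgd_update(weight, grad, lr):
    # One "launch" per weight: fixed overhead regardless of weight size.
    return [w - lr * g for w, g in zip(weight, grad)]

def aggregated_sgd_update(weights, grads, lr):
    # One "launch" updates every weight in the bundle, amortizing the
    # per-launch overhead across all of them.
    return [[w - lr * g for w, g in zip(weight, grad)]
            for weight, grad in zip(weights, grads)]

weights = [[1.0, 2.0], [0.5], [3.0, 4.0, 5.0]]  # many small parameters
grads   = [[0.1, 0.1], [0.5], [1.0, 1.0, 1.0]]

# Per-weight: three separate calls.
updated_separately = [sgd_update(w, g, lr=0.1) for w, g in zip(weights, grads)]
# Aggregated: a single call over the whole bundle, same math.
updated_together = aggregated_sgd_update(weights, grads, lr=0.1)
assert updated_separately == updated_together
```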

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • the ability to control the update_on_kvstore value via the environment variable MXNET_UPDATE_ON_KVSTORE (default is 1, consistent with the current behavior)
  • if update_on_kvstore is False, the SGD optimizer attempts to bundle updates of multiple weights together and launches a single kernel to perform them all, reducing the number of kernel calls and synchronizations
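With the variables above, opting in looks roughly like this (a minimal sketch; the training code that follows is elided, and the exact point at which MXNet reads these variables may vary, so setting them before any training objects are created is the safe choice):

```python
import os

# Set before the Trainer/Module is created, since the update mode is
# decided when training is set up.
os.environ["MXNET_UPDATE_ON_KVSTORE"] = "0"            # update outside the kvstore
os.environ["MXNET_OPTIMIZER_AGGREGATION_SIZE"] = "4"   # bundle up to 4 weights

# ... regular MXNet training code follows; with update_on_kvstore False
# and the SGD optimizer, eligible dense updates are bundled automatically.
```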

Comments

  • Current test_sgd automatically uses the new code paths, so no new tests are needed.
  • The code does not support sparse arrays; it falls back to non-aggregated calls when it encounters a sparse array in the bundle of weights/gradients
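The fallback rule in the last bullet can be sketched as follows (illustrative only; the entry dictionaries, flags, and update callables here are placeholders, not the PR's actual data structures):

```python
# Dense entries in a bundle get one aggregated call; any sparse entry
# drops back to an individual (non-aggregated) update.

def apply_bundle(bundle, aggregated_update, single_update):
    dense = [entry for entry in bundle if not entry["is_sparse"]]
    if dense:
        aggregated_update(dense)      # one call covering all dense entries
    for entry in bundle:
        if entry["is_sparse"]:
            single_update(entry)      # per-entry fallback for sparse arrays

calls = []
bundle = [{"name": "conv0_weight", "is_sparse": False},
          {"name": "embed_weight", "is_sparse": True},
          {"name": "bn0_gamma", "is_sparse": False}]
apply_bundle(
    bundle,
    aggregated_update=lambda d: calls.append(("multi", [e["name"] for e in d])),
    single_update=lambda e: calls.append(("single", e["name"])),
)
# One "multi" call for conv0_weight + bn0_gamma, one "single" for embed_weight.
```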

@ptrendx ptrendx requested review from anirudh2290 and szha as code owners Nov 21, 2018

@szha szha requested a review from eric-haibin-lin Nov 21, 2018

@stu1130

Contributor

stu1130 commented Nov 21, 2018

@mxnet-label-bot add [pr-awaiting-review]
Thanks for your contribution @ptrendx

@anirudhacharya

Contributor

anirudhacharya commented Nov 21, 2018

@ptrendx can you share a benchmark of SGD performance with MXNET_UPDATE_ON_KVSTORE set for aggregate SGD vs. when it is not?

@ptrendx

Contributor

ptrendx commented Nov 21, 2018

This PR is part of upstreaming improvements to MXNet that are available in NVIDIA's NGC 18.11 MXNet container. I will use results from that container to show the full impact once all the other improvements are in place. The benchmark shown is ResNet v1.5 training on a single V100 32GB in a DGX-1V, batch size 32.

  1. MXNET_UPDATE_ON_KVSTORE=1 (default)
root@dgx1v-loki-19:/opt/mxnet/example/image-classification# numactl --physcpubind=0-4 ./train_imagenet_runner -n 1 -b 32 --network resnet-v1b --disp-batches 50 -e 1 --no-val -s 12800                             
INFO:root:start with arguments Namespace(batch_size=32, batchnorm_eps=2e-05, batchnorm_layout='NHWC', batchnorm_mom=0.9, benchmark=0, bn_gamma_init0=False, brightness=0.4, contrast=0.4, conv_algo=-1, conv_layout='NHWC', custom_bn_off=0, dali_nvjpeg_memory_padding=16, dali_prefetch_queue=3, dali_threads=3, data_nthreads=40, data_train='/data/imagenet/train-480-val-256-recordio/train.rec', data_train_idx='/data/imagenet/train-480-val-256-recordio/train.idx', data_val=None, data_val_idx='', disp_batches=50, dtype='float16', epoch_size=0, fill_value=127, force_tensor_core=0, fuse_bn_add_relu=1, fuse_bn_relu=1, gc_threshold=0.5, gc_type='none', gpus='0', image_shape='4,224,224', initializer='default', input_layout='NCHW', kv_store='device', load_epoch=None, log='', logging_dir='logs', loss='', lr=0.0125, lr_factor=0.1, lr_step_epochs='30,60,80', macrobatch_size=0, max_crop_size=-1, max_random_area=1.0, max_random_aspect_ratio=1.33, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1.0, max_random_shear_ratio=0.0, min_crop_size=-1, min_random_area=0.08, min_random_aspect_ratio=0.75, min_random_scale=1.0, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1b-fl', num_classes=1000, num_epochs=1, num_examples=12800, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.0, pooling_layout='NHWC', profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=1, random_resized_crop=1, resize=256, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0.4, save_period=1, seed=None, separ_val=False, set_data_aug_level=None, set_resnet_aug=None, test_io=0, top_k=0, use_dali=True, verbose=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
/opt/mxnet/example/image-classification/common/dali.py:142: UserWarning: 12800 training examples will be used, although full training set contains 1281167 examples
  warnings.warn("{} training examples will be used, although full training set contains {} examples".format(args.num_examples, trainpipes[0].epoch_size("Reader")))
[17:04:56] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:119: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [50]   Speed: 912.56 samples/sec lr:0.000313   accuracy=0.000613
INFO:root:Epoch[0] Batch [100]  Speed: 922.14 samples/sec lr:0.000625   accuracy=0.000625
INFO:root:Epoch[0] Batch [150]  Speed: 919.71 samples/sec lr:0.000937   accuracy=0.000625
INFO:root:Epoch[0] Batch [200]  Speed: 924.12 samples/sec lr:0.001250   accuracy=0.001875
INFO:root:Epoch[0] Batch [250]  Speed: 922.34 samples/sec lr:0.001563   accuracy=0.000625
INFO:root:Epoch[0] Batch [300]  Speed: 923.93 samples/sec lr:0.001875   accuracy=0.000625
INFO:root:Epoch[0] Batch [350]  Speed: 923.90 samples/sec lr:0.002188   accuracy=0.002500
INFO:root:Epoch[0] Train-accuracy=0.001276
INFO:root:Epoch[0] Time cost=15.579
  2. MXNET_UPDATE_ON_KVSTORE=0
    MXNET_OPTIMIZER_AGGREGATION_SIZE=1 (no aggregation)
    The speedup here comes from the lack of the (in the single-GPU case unnecessary) broadcast call in the kvstore.
root@dgx1v-loki-19:/opt/mxnet/example/image-classification# numactl --physcpubind=0-4 ./train_imagenet_runner -n 1 -b 32 --network resnet-v1b --disp-batches 50 -e 1 --no-val -s 12800                             
INFO:root:start with arguments Namespace(batch_size=32, batchnorm_eps=2e-05, batchnorm_layout='NHWC', batchnorm_mom=0.9, benchmark=0, bn_gamma_init0=False, brightness=0.4, contrast=0.4, conv_algo=-1, conv_layout='NHWC', custom_bn_off=0, dali_nvjpeg_memory_padding=16, dali_prefetch_queue=3, dali_threads=3, data_nthreads=40, data_train='/data/imagenet/train-480-val-256-recordio/train.rec', data_train_idx='/data/imagenet/train-480-val-256-recordio/train.idx', data_val=None, data_val_idx='', disp_batches=50, dtype='float16', epoch_size=0, fill_value=127, force_tensor_core=0, fuse_bn_add_relu=1, fuse_bn_relu=1, gc_threshold=0.5, gc_type='none', gpus='0', image_shape='4,224,224', initializer='default', input_layout='NCHW', kv_store='device', load_epoch=None, log='', logging_dir='logs', loss='', lr=0.0125, lr_factor=0.1, lr_step_epochs='30,60,80', macrobatch_size=0, max_crop_size=-1, max_random_area=1.0, max_random_aspect_ratio=1.33, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1.0, max_random_shear_ratio=0.0, min_crop_size=-1, min_random_area=0.08, min_random_aspect_ratio=0.75, min_random_scale=1.0, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1b-fl', num_classes=1000, num_epochs=1, num_examples=12800, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.0, pooling_layout='NHWC', profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=1, random_resized_crop=1, resize=256, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0.4, save_period=1, seed=None, separ_val=False, set_data_aug_level=None, set_resnet_aug=None, test_io=0, top_k=0, use_dali=True, verbose=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
/opt/mxnet/example/image-classification/common/dali.py:142: UserWarning: 12800 training examples will be used, although full training set contains 1281167 examples
  warnings.warn("{} training examples will be used, although full training set contains {} examples".format(args.num_examples, trainpipes[0].epoch_size("Reader")))
[17:12:43] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:119: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [50]   Speed: 959.50 samples/sec lr:0.000313   accuracy=0.000613
INFO:root:Epoch[0] Batch [100]  Speed: 968.80 samples/sec lr:0.000625   accuracy=0.000625
INFO:root:Epoch[0] Batch [150]  Speed: 966.11 samples/sec lr:0.000937   accuracy=0.000625
INFO:root:Epoch[0] Batch [200]  Speed: 969.05 samples/sec lr:0.001250   accuracy=0.001875
INFO:root:Epoch[0] Batch [250]  Speed: 971.04 samples/sec lr:0.001563   accuracy=0.000625
INFO:root:Epoch[0] Batch [300]  Speed: 971.68 samples/sec lr:0.001875   accuracy=0.000625
INFO:root:Epoch[0] Batch [350]  Speed: 971.70 samples/sec lr:0.002188   accuracy=0.002500
INFO:root:Epoch[0] Train-accuracy=0.001276
INFO:root:Epoch[0] Time cost=14.874
  3. MXNET_UPDATE_ON_KVSTORE=0
    MXNET_OPTIMIZER_AGGREGATION_SIZE=4 (default in this PR)
root@dgx1v-loki-19:/opt/mxnet/example/image-classification# numactl --physcpubind=0-4 ./train_imagenet_runner -n 1 -b 32 --network resnet-v1b --disp-batches 50 -e 1 --no-val -s 12800
INFO:root:start with arguments Namespace(batch_size=32, batchnorm_eps=2e-05, batchnorm_layout='NHWC', batchnorm_mom=0.9, benchmark=0, bn_gamma_init0=False, brightness=0.4, contrast=0.4, conv_algo=-1, conv_layout='NHWC', custom_bn_off=0, dali_nvjpeg_memory_padding=16, dali_prefetch_queue=3, dali_threads=3, data_nthreads=40, data_train='/data/imagenet/train-480-val-256-recordio/train.rec', data_train_idx='/data/imagenet/train-480-val-256-recordio/train.idx', data_val=None, data_val_idx='', disp_batches=50, dtype='float16', epoch_size=0, fill_value=127, force_tensor_core=0, fuse_bn_add_relu=1, fuse_bn_relu=1, gc_threshold=0.5, gc_type='none', gpus='0', image_shape='4,224,224', initializer='default', input_layout='NCHW', kv_store='device', load_epoch=None, log='', logging_dir='logs', loss='', lr=0.0125, lr_factor=0.1, lr_step_epochs='30,60,80', macrobatch_size=0, max_crop_size=-1, max_random_area=1.0, max_random_aspect_ratio=1.33, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1.0, max_random_shear_ratio=0.0, min_crop_size=-1, min_random_area=0.08, min_random_aspect_ratio=0.75, min_random_scale=1.0, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1b-fl', num_classes=1000, num_epochs=1, num_examples=12800, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.0, pooling_layout='NHWC', profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=1, random_resized_crop=1, resize=256, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0.4, save_period=1, seed=None, separ_val=False, set_data_aug_level=None, set_resnet_aug=None, test_io=0, top_k=0, use_dali=True, verbose=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
/opt/mxnet/example/image-classification/common/dali.py:142: UserWarning: 12800 training examples will be used, although full training set contains 1281167 examples
  warnings.warn("{} training examples will be used, although full training set contains {} examples".format(args.num_examples, trainpipes[0].epoch_size("Reader")))
[17:14:43] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:119: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [50]   Speed: 1005.45 samples/sec lr:0.000313  accuracy=0.000613
INFO:root:Epoch[0] Batch [100]  Speed: 1020.27 samples/sec lr:0.000625  accuracy=0.000625
INFO:root:Epoch[0] Batch [150]  Speed: 1016.28 samples/sec lr:0.000937  accuracy=0.000625
INFO:root:Epoch[0] Batch [200]  Speed: 1020.46 samples/sec lr:0.001250  accuracy=0.001875
INFO:root:Epoch[0] Batch [250]  Speed: 1018.46 samples/sec lr:0.001563  accuracy=0.000625
INFO:root:Epoch[0] Batch [300]  Speed: 1020.25 samples/sec lr:0.001875  accuracy=0.000625
INFO:root:Epoch[0] Batch [350]  Speed: 1020.17 samples/sec lr:0.002188  accuracy=0.002500
INFO:root:Epoch[0] Train-accuracy=0.001276
INFO:root:Epoch[0] Time cost=14.256
  4. MXNET_UPDATE_ON_KVSTORE=0
    MXNET_OPTIMIZER_AGGREGATION_SIZE=60 (maximum possible)
root@dgx1v-loki-19:/opt/mxnet/example/image-classification# numactl --physcpubind=0-4 ./train_imagenet_runner -n 1 -b 32 --network resnet-v1b --disp-batches 50 -e 1 --no-val -s 12800
INFO:root:start with arguments Namespace(batch_size=32, batchnorm_eps=2e-05, batchnorm_layout='NHWC', batchnorm_mom=0.9, benchmark=0, bn_gamma_init0=False, brightness=0.4, contrast=0.4, conv_algo=-1, conv_layout='NHWC', custom_bn_off=0, dali_nvjpeg_memory_padding=16, dali_prefetch_queue=3, dali_threads=3, data_nthreads=40, data_train='/data/imagenet/train-480-val-256-recordio/train.rec', data_train_idx='/data/imagenet/train-480-val-256-recordio/train.idx', data_val=None, data_val_idx='', disp_batches=50, dtype='float16', epoch_size=0, fill_value=127, force_tensor_core=0, fuse_bn_add_relu=1, fuse_bn_relu=1, gc_threshold=0.5, gc_type='none', gpus='0', image_shape='4,224,224', initializer='default', input_layout='NCHW', kv_store='device', load_epoch=None, log='', logging_dir='logs', loss='', lr=0.0125, lr_factor=0.1, lr_step_epochs='30,60,80', macrobatch_size=0, max_crop_size=-1, max_random_area=1.0, max_random_aspect_ratio=1.33, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1.0, max_random_shear_ratio=0.0, min_crop_size=-1, min_random_area=0.08, min_random_aspect_ratio=0.75, min_random_scale=1.0, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1b-fl', num_classes=1000, num_epochs=1, num_examples=12800, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.0, pooling_layout='NHWC', profile_server_suffix='', profile_worker_suffix='', random_crop=0, random_mirror=1, random_resized_crop=1, resize=256, rgb_mean='123.68,116.779,103.939', rgb_std='1,1,1', saturation=0.4, save_period=1, seed=None, separ_val=False, set_data_aug_level=None, set_resnet_aug=None, test_io=0, top_k=0, use_dali=True, verbose=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
/opt/mxnet/example/image-classification/common/dali.py:142: UserWarning: 12800 training examples will be used, although full training set contains 1281167 examples
  warnings.warn("{} training examples will be used, although full training set contains {} examples".format(args.num_examples, trainpipes[0].epoch_size("Reader")))
[17:15:54] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:119: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [50]   Speed: 1035.09 samples/sec lr:0.000313  accuracy=0.000613
INFO:root:Epoch[0] Batch [100]  Speed: 1047.64 samples/sec lr:0.000625  accuracy=0.000625
INFO:root:Epoch[0] Batch [150]  Speed: 1042.25 samples/sec lr:0.000937  accuracy=0.000625
INFO:root:Epoch[0] Batch [200]  Speed: 1047.25 samples/sec lr:0.001250  accuracy=0.001875
INFO:root:Epoch[0] Batch [250]  Speed: 1045.58 samples/sec lr:0.001563  accuracy=0.000625
INFO:root:Epoch[0] Batch [300]  Speed: 1044.48 samples/sec lr:0.001875  accuracy=0.000625
INFO:root:Epoch[0] Batch [350]  Speed: 1045.78 samples/sec lr:0.002188  accuracy=0.002500
INFO:root:Epoch[0] Train-accuracy=0.001276
INFO:root:Epoch[0] Time cost=13.927
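As a back-of-the-envelope check (my own arithmetic on the throughput lines above, not an official benchmark), averaging the reported Batch [50]..[350] speeds for each run gives roughly +5%, +10%, and +13% over the default configuration:

```python
# Mean samples/sec over the seven reported batches of each run above.
runs = {
    "kvstore update (default)":  [912.56, 922.14, 919.71, 924.12, 922.34, 923.93, 923.90],
    "no kvstore, aggregation=1": [959.50, 968.80, 966.11, 969.05, 971.04, 971.68, 971.70],
    "no kvstore, aggregation=4": [1005.45, 1020.27, 1016.28, 1020.46, 1018.46, 1020.25, 1020.17],
    "no kvstore, aggregation=60": [1035.09, 1047.64, 1042.25, 1047.25, 1045.58, 1044.48, 1045.78],
}
means = {name: sum(v) / len(v) for name, v in runs.items()}
baseline = means["kvstore update (default)"]
for name, mean in means.items():
    # Relative speedup of each configuration over the default.
    print(f"{name}: {mean:.0f} samples/sec ({100 * (mean / baseline - 1):+.1f}%)")
```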

@ptrendx ptrendx requested a review from nswamy as a code owner Nov 21, 2018

ptrendx added some commits Nov 26, 2018

Fix
@lupesko

Contributor

lupesko commented Dec 5, 2018

Thanks for the contribution @ptrendx !
Adding @nswamy and @sandeep-krishnamurthy to help review/merge.

@@ -98,6 +99,9 @@ def dict_equ(a, b):
@with_seed()

@eric-haibin-lin

eric-haibin-lin Dec 10, 2018

Contributor

Is it not tested with test_trainer?
