
[v1.x][KVStore]1Bit gradient compression #17952

Merged
merged 1 commit into apache:master on Mar 10, 2021

Conversation


@shuo-ouyang shuo-ouyang commented Apr 1, 2020

Description

Added a 1-bit gradient compression implementation that achieves a speedup similar to 2-bit compression. It works with a threshold: values in the gradient above the threshold are quantized to +1, while values below the threshold are quantized to -1. Unlike 2-bit compression, this 1-bit implementation supports a zero threshold. In addition, 1-bit compression seems to perform better than the current implementation of 2-bit compression as the batch size increases.
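For intuition, the per-element rule with error feedback can be sketched as follows (the notation is ours: g_i is a gradient element, e_i the residual carried between iterations, theta the threshold; the exact update lives in gradient_compression-inl.h):

r_i = g_i + e_i, \qquad
q_i =
\begin{cases}
  +1 & \text{if } r_i > \theta \\
  -1 & \text{otherwise}
\end{cases}
\qquad
e_i \leftarrow r_i - \hat{q}_i

where \hat{q}_i is the value the receiver reconstructs for q_i (in the 2-bit kernels below, the residual is reduced by the threshold itself; the precise 1-bit analogue is defined in the implementation).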

Important files to review

gradient_compression-inl.h
gradient_compression.cc

Test accuracy of ResNet20 on CIFAR-10 with 1/2/4 workers

Each worker is equipped with 4 Tesla V100 GPUs.
Training command:

python /mxnet/tools/launch.py --launcher ssh -H hosts -n 4 python train_cifar10.py \
--num-epochs 200 --mode hybrid --num-gpus 4 -j 8 --batch-size 128 --wd 0.0001 \
--lr 0.1 --lr-decay 0.1 --lr-decay-epoch 100,150 --model cifar_resnet20_v1 \
--kv-store dist_sync_device --gc-type 1bit --gc-threshold 0 

1 worker

(figure: resnet20_test, test accuracy curve)

2 workers

(figure: resnet20_test, test accuracy curve)

4 workers

(figure: resnet20_test, test accuracy curve)

4 workers, update_on_kvstore=false

(figure: resnet20_test, test accuracy curve)

Results of a 2-layer LSTM trained on PTB with 1 worker

train perplexity

(figure: lstm_train, training perplexity curve)

test perplexity

(figure: lstm_test, test perplexity curve)

Throughput

In this part, we use example/image-classification/benchmark.py to evaluate the performance of gradient compression. We have tried our best to optimize the kernel functions of the gradient compression operations. In the original kernel function, each thread manipulates 32 bits (4 bytes) of compressed gradients, whereas in our kernel function, each thread writes 8 bits (1 byte) at a time. The differences between the original and our implementations are as follows.

original kernel implementation:

struct quantize_2bit {
  MSHADOW_XINLINE static void Map(int out_block_id,
                                  int original_size,
                                  float *out,
                                  float *grad,
                                  float *residual,
                                  const float neg_threshold,
                                  const float pos_threshold) {
    // this block contains the compressed representation of
    // up to 16 values starting from out_block_id*16
    float *compr_block = out + out_block_id;
    // init to 0
    *compr_block = 0;
    // start and end are indices in original grad array
    const int start = out_block_id << 4;
    const int end = (start + 16 <= original_size) ? start + 16 : original_size;
    // cast as char* to manipulate bits of float addresses
    char *block_ptr = reinterpret_cast<char *>(compr_block);
    // masks to set bits when value meets pos_threshold
    // 0xc0 is mask when value is to be represented by the first two bits in a char*
    // 0xc0 means first two bits are set to 11
    const uint8_t posbits[] = {0xc0, 0x30, 0x0c, 0x03};
    // masks to set bits when value meets neg_threshold
    const uint8_t negbits[] = {0x80, 0x20, 0x08, 0x02};
    for (int i = start; i < end; i++) {
      // adds offset to reach appropriate byte
      char *curr_byte = block_ptr + ((i - start) >> 2);
      // adds gradient to existing residual to get updated grad
      residual[i] += grad[i];
      if (residual[i] >= pos_threshold) {
        // set data to 11
        *curr_byte |= posbits[(i & 3)];
        // reduce residual by pos_threshold
        residual[i] -= pos_threshold;
      } else if (residual[i] <= neg_threshold) {
        // set data to 10
        *curr_byte |= negbits[(i & 3)];
        residual[i] -= neg_threshold;
      }
    }
  }
};

our kernel implementation:

struct quantize_2bit {
  MSHADOW_XINLINE static void Map(int out_byte_id,
                                  int original_size,
                                  float *out,
                                  float *grad,
                                  float *residual,
                                  const float neg_threshold,
                                  const float pos_threshold) {
    // this block contains the compressed representation of
    // up to 4 values starting from (char*)out + out_byte_id
    char *compr_byte = reinterpret_cast<char *>(out) + out_byte_id;
    // init to 0
    *compr_byte = 0;
    // start and end are indices in original grad array
    const int start = out_byte_id << 2;
    const int end = (start + 4 <= original_size) ? start + 4 : original_size;

    // masks to set bits when value meets pos_threshold
    // 0xc0 is mask when value is to be represented by the first two bits in a char*
    // 0xc0 means first two bits are set to 11
    const uint8_t posbits[] = {0xc0, 0x30, 0x0c, 0x03};
    // masks to set bits when value meets neg_threshold
    const uint8_t negbits[] = {0x80, 0x20, 0x08, 0x02};
    for (int i = start; i < end; i++) {
      // adds gradient to existing residual to get updated grad
      residual[i] += grad[i];
      if (residual[i] >= pos_threshold) {
        // set data to 11
        *compr_byte |= posbits[(i & 3)];
        // reduce residual by pos_threshold
        residual[i] -= pos_threshold;
      } else if (residual[i] <= neg_threshold) {
        // set data to 10
        *compr_byte |= negbits[(i & 3)];
        residual[i] -= neg_threshold;
      }
    }
  }
};
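
For reference, here is a minimal sketch of what a 1-bit quantizer could look like under the same byte-per-thread scheme. This is illustrative only: the actual kernel lives in gradient_compression-inl.h, and the struct name, bit layout, and the omitted residual update are our assumptions. With 1 bit per value, each output byte covers 8 gradient values; a set bit encodes +1 and a cleared bit encodes -1:

struct quantize_1bit_sketch {
  MSHADOW_XINLINE static void Map(int out_byte_id,
                                  int original_size,
                                  float *out,
                                  float *grad,
                                  float *residual,
                                  const float threshold) {
    // each output byte holds the signs of up to 8 values
    // starting from (char*)out + out_byte_id
    char *compr_byte = reinterpret_cast<char *>(out) + out_byte_id;
    // init to 0 (all bits cleared, i.e. all values default to -1)
    *compr_byte = 0;
    // start and end are indices in the original grad array
    const int start = out_byte_id << 3;
    const int end = (start + 8 <= original_size) ? start + 8 : original_size;
    for (int i = start; i < end; i++) {
      // error feedback: fold the previous quantization error into the gradient
      residual[i] += grad[i];
      if (residual[i] > threshold) {
        // set this value's bit in the byte: it is quantized to +1
        *compr_byte |= 1 << (7 - (i - start));
      }
      // the residual update against the reconstructed +-1 values is
      // omitted here; see the implementation for the exact bookkeeping
    }
  }
};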

Our optimized implementation performs well when using multiple GPUs on a single machine. Here is a benchmark of the VGG16 model on a machine with 4 GPUs (unit: samples/sec).

| VGG16 | original kernel | our kernel |
| --- | --- | --- |
| onebit | ~626 | ~752 |
| twobit | ~705 | ~761 |

baseline (no compression): ~564

However, our implementation does not perform well when the compression operator is launched on the CPU, especially when we use multiple nodes and set kvstore=dist_sync. Under such circumstances, our new implementation may lead to a small throughput reduction. IMHO, one possible solution is to adopt different kernel functions for the CPU and the GPU respectively: use the original kernel for the CPU compression operation and the new kernel for the GPU operation (still a work in progress; a sketch of such a dispatch follows).
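
For illustration, here is a minimal sketch of such per-device dispatch using MXNet's Kernel<Op, xpu>::Launch pattern. The function and kernel names (Quantize2BitDispatch, quantize_2bit_block, quantize_2bit_byte) are hypothetical stand-ins for the two variants shown above:

#include <type_traits>

template <typename xpu>
void Quantize2BitDispatch(mshadow::Stream<xpu> *s, float *out, float *grad,
                          float *residual, int original_size,
                          const float neg_threshold, const float pos_threshold) {
  if (std::is_same<xpu, mshadow::gpu>::value) {
    // GPU: one thread per output byte (4 gradient values each),
    // maximizing parallelism across the many GPU threads
    mxnet_op::Kernel<quantize_2bit_byte, xpu>::Launch(
        s, (original_size + 3) / 4, original_size, out, grad, residual,
        neg_threshold, pos_threshold);
  } else {
    // CPU: one thread per 32-bit output block (16 gradient values each),
    // keeping per-thread work large enough for the few CPU threads
    mxnet_op::Kernel<quantize_2bit_block, xpu>::Launch(
        s, (original_size + 15) / 16, original_size, out, grad, residual,
        neg_threshold, pos_threshold);
  }
}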

The following are benchmark tests of AlexNet and ResNet-50 on a cluster with up to 4 nodes.

AlexNet, batch size=128 on each GPU, dist_sync_device

(figure: alexnet-speed, throughput plot)

ResNet-50, batch size=128 on each GPU, dist_sync_device

(figure: resnet50-speed, throughput plot)

More performance tests will be released soon...

Related issue or pr

signum with grad compression #9558
2bit gradient compression #8662

Reference

Seide F., Fu H., Droppo J., et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. Interspeech 2014.

Acknowledgment

Thanks to HiNA group for providing the experiments testbed.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@shuo-ouyang shuo-ouyang requested a review from szha as a code owner April 1, 2020 10:17
@mxnet-bot

Hey @shuo-ouyang , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-gpu, miscellaneous, clang, unix-cpu, centos-cpu, sanity, website, windows-gpu, windows-cpu, unix-gpu, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.


@wkcn wkcn left a comment


Great Work! LGTM : )
Thanks for your contribution!

@shuo-ouyang

@mxnet-bot run ci [unix-gpu, centos-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, centos-gpu]

@shuo-ouyang

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@shuo-ouyang

@mxnet-label-bot add [KVStore, Distributed]


wkcn commented Apr 4, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@shuo-ouyang

@wkcn Thanks for your help and support. To be honest, I am not familiar with the Jenkins workflow, and I don't know why some checks failed when only one whitespace character was deleted in code comments.


wkcn commented Apr 5, 2020

@shuo-ouyang It is not related to the code. The reason is that the CI is unstable now.

@shuo-ouyang

@wkcn I get it, thanks!


wkcn commented Apr 7, 2020

Hi @rahul003 , could you please help take a review? Thank you!

@shuo-ouyang

@mxnet-bot run ci [centos-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu]

@wkcn wkcn added the pr-awaiting-review PR is waiting for code review label Apr 13, 2020

wkcn commented Apr 24, 2020

Hi @eric-haibin-lin and @szha , could you please help take a review?

Thank you!


szha commented May 23, 2020

@mxnet-bot run ci [all]

@szha szha requested a review from eric-haibin-lin May 23, 2020 19:25
@mxnet-bot

Jenkins CI successfully triggered : [edge, centos-gpu, windows-cpu, centos-cpu, windows-gpu, clang, unix-cpu, miscellaneous, website, unix-gpu, sanity]


wkcn commented Jun 10, 2020

Hi @eric-haibin-lin , could you please help take a review?
The PR is related to gradient compression.

Thank you!


wkcn commented Jun 28, 2020

Hi @szha , could this PR be merged?
There seems to be no other committer reviewing this PR, and it has been blocked for about 3 months.
The author has provided detailed training results, and I have reviewed the code.


szha commented Jun 28, 2020

@wkcn since @eric-haibin-lin recently updated the kvstore interface, it would be best to have a review from him.

@eric-haibin-lin can you help?


@eric-haibin-lin eric-haibin-lin left a comment


@shuo-ouyang shuo-ouyang changed the title 1bit gradient compression [WIP]1Bit gradient compression Jul 21, 2020
@eric-haibin-lin eric-haibin-lin added pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review labels Jul 28, 2020

@eric-haibin-lin eric-haibin-lin left a comment


What is the end2end speedup when using the 1bit compressor?
The accuracy drop on resnet20 is also very significant.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 25, 2021
@shuo-ouyang

@eric-haibin-lin
Very sorry for the delayed response... The experiment results have been updated, please check them.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Mar 4, 2021
@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Mar 4, 2021
@szha

szha commented Mar 4, 2021

The results show that 2-bit compression yields a higher test perplexity than 1-bit compression, which seems surprising. Any explanation of why that is the case?

@shuo-ouyang

The reason may be that the one-bit quantizer can exactly capture the direction (sign) of each gradient element when the threshold is 0, whereas the two-bit quantizer cannot due to its limitation (threshold != 0). By the way, the previous results were trained with an additional code modification to the error-feedback part, which may have led to counterintuitive results. We have corrected them and uploaded the right results.
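
A small constructed example of this dead-zone effect (the numbers are ours, not from the PR): take a single gradient element g = 0.3 with zero residual.

1-bit, threshold 0: q = sign(0.3) = +1, so the direction reaches the server in this very step.
2-bit, threshold 0.5: since -0.5 < 0.3 < 0.5, nothing is sent (the two bits stay 00) and 0.3 is carried in the residual; the direction only reaches the server once the accumulated residual crosses +-0.5.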

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Mar 10, 2021

@szha szha left a comment


Thanks for the explanation. The change looks good to me.

@lanking520 lanking520 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Mar 10, 2021
@szha szha merged commit 5aceafc into apache:master Mar 10, 2021