
[v1.x][KVStore]1Bit gradient compression #17952

Merged
merged 1 commit into apache:master on Mar 10, 2021

Conversation


@shuo-ouyang shuo-ouyang commented Apr 1, 2020

Description

Added a 1-bit gradient compression implementation that achieves a speedup similar to 2-bit compression. It works with a threshold: values in the gradient above the threshold are quantized to +1, while values below the threshold are quantized to -1. Unlike 2-bit compression, this 1-bit implementation supports a zero threshold. In addition, 1-bit compression seems to perform better than the current implementation of 2-bit compression as the batch size increases.
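For intuition, the per-element rule with error feedback can be sketched as follows (the notation is ours: g_i is a gradient element, e_i the residual carried between iterations, theta the threshold; the exact update lives in gradient_compression-inl.h):

r_i = g_i + e_i, \qquad
q_i =
\begin{cases}
  +1 & \text{if } r_i > \theta \\
  -1 & \text{otherwise}
\end{cases}
\qquad
e_i \leftarrow r_i - \hat{q}_i

where \hat{q}_i is the value the receiver reconstructs for q_i (in the 2-bit kernels below, the residual is reduced by the threshold itself; the precise 1-bit analogue is defined in the implementation).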

Important files to review

gradient_compression-inl.h
gradient_compression.cc

Test accuracy of ResNet20 on CIFAR-10 with 1/2/4 workers

Each worker is equipped with 4 Tesla V100 GPUs.
Training command:

python /mxnet/tools/launch.py --launcher ssh -H hosts -n 4 python train_cifar10.py \
--num-epochs 200 --mode hybrid --num-gpus 4 -j 8 --batch-size 128 --wd 0.0001 \
--lr 0.1 --lr-decay 0.1 --lr-decay-epoch 100,150 --model cifar_resnet20_v1 \
--kv-store dist_sync_device --gc-type 1bit --gc-threshold 0 

1 worker

(figure: resnet20_test, test accuracy curve)

2 workers

(figure: resnet20_test, test accuracy curve)

4 workers

(figure: resnet20_test, test accuracy curve)

4 workers, update_on_kvstore=false

(figure: resnet20_test, test accuracy curve)

Results of a 2-layer LSTM trained on PTB with 1 worker

train perplexity

(figure: lstm_train, training perplexity curve)

test perplexity

(figure: lstm_test, test perplexity curve)

Throughput

In this part, we use example/image-classification/benchmark.py to evaluate the performance of gradient compression. We have tried our best to optimize the kernel functions of the gradient compression operations. In the original kernel function, each thread manipulates 32 bits (4 bytes) of compressed gradients, whereas in our kernel function, each thread writes 8 bits (1 byte) at a time. The differences between the original and our implementations are as follows.

original kernel implementation:

struct quantize_2bit {
  MSHADOW_XINLINE static void Map(int out_block_id,
                                  int original_size,
                                  float *out,
                                  float *grad,
                                  float *residual,
                                  const float neg_threshold,
                                  const float pos_threshold) {
    // this block contains the compressed representation of
    // up to 16 values starting from out_block_id*16
    float *compr_block = out + out_block_id;
    // init to 0
    *compr_block = 0;
    // start and end are indices in original grad array
    const int start = out_block_id << 4;
    const int end = (start + 16 <= original_size) ? start + 16 : original_size;
    // cast as char* to manipulate bits of float addresses
    char *block_ptr = reinterpret_cast<char *>(compr_block);
    // masks to set bits when value meets pos_threshold
    // 0xc0 is mask when value is to be represented by the first two bits in a char*
    // 0xc0 means first two bits are set to 11
    const uint8_t posbits[] = {0xc0, 0x30, 0x0c, 0x03};
    // masks to set bits when value meets neg_threshold
    const uint8_t negbits[] = {0x80, 0x20, 0x08, 0x02};
    for (int i = start; i < end; i++) {
      // adds offset to reach appropriate byte
      char *curr_byte = block_ptr + ((i - start) >> 2);
      // adds gradient to existing residual to get updated grad
      residual[i] += grad[i];
      if (residual[i] >= pos_threshold) {
        // set data to 11
        *curr_byte |= posbits[(i & 3)];
        // reduce residual by pos_threshold
        residual[i] -= pos_threshold;
      } else if (residual[i] <= neg_threshold) {
        // set data to 10
        *curr_byte |= negbits[(i & 3)];
        residual[i] -= neg_threshold;
      }
    }
  }
};

our kernel implementation:

struct quantize_2bit {
  MSHADOW_XINLINE static void Map(int out_byte_id,
                                  int original_size,
                                  float *out,
                                  float *grad,
                                  float *residual,
                                  const float neg_threshold,
                                  const float pos_threshold) {
    // this block contains the compressed representation of
    // up to 4 values starting from (char*)out + out_byte_id
    char *compr_byte = reinterpret_cast<char *>(out) + out_byte_id;
    // init to 0
    *compr_byte = 0;
    // start and end are indices in original grad array
    const int start = out_byte_id << 2;
    const int end = (start + 4 <= original_size) ? start + 4 : original_size;

    // masks to set bits when value meets pos_threshold
    // 0xc0 is mask when value is to be represented by the first two bits in a char*
    // 0xc0 means first two bits are set to 11
    const uint8_t posbits[] = {0xc0, 0x30, 0x0c, 0x03};
    // masks to set bits when value meets neg_threshold
    const uint8_t negbits[] = {0x80, 0x20, 0x08, 0x02};
    for (int i = start; i < end; i++) {
      // adds gradient to existing residual to get updated grad
      residual[i] += grad[i];
      if (residual[i] >= pos_threshold) {
        // set data to 11
        *compr_byte |= posbits[(i & 3)];
        // reduce residual by pos_threshold
        residual[i] -= pos_threshold;
      } else if (residual[i] <= neg_threshold) {
        // set data to 10
        *compr_byte |= negbits[(i & 3)];
        residual[i] -= neg_threshold;
      }
    }
  }
};
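
For reference, here is a minimal sketch of what a 1-bit quantizer could look like under the same byte-per-thread scheme. This is illustrative only: the actual kernel lives in gradient_compression-inl.h, and the struct name, bit layout, and the omitted residual update are our assumptions. With 1 bit per value, each output byte covers 8 gradient values; a set bit encodes +1 and a cleared bit encodes -1:

struct quantize_1bit_sketch {
  MSHADOW_XINLINE static void Map(int out_byte_id,
                                  int original_size,
                                  float *out,
                                  float *grad,
                                  float *residual,
                                  const float threshold) {
    // each output byte holds the signs of up to 8 values
    // starting from (char*)out + out_byte_id
    char *compr_byte = reinterpret_cast<char *>(out) + out_byte_id;
    // init to 0 (all bits cleared, i.e. all values default to -1)
    *compr_byte = 0;
    // start and end are indices in the original grad array
    const int start = out_byte_id << 3;
    const int end = (start + 8 <= original_size) ? start + 8 : original_size;
    for (int i = start; i < end; i++) {
      // error feedback: fold the previous quantization error into the gradient
      residual[i] += grad[i];
      if (residual[i] > threshold) {
        // set this value's bit in the byte: it is quantized to +1
        *compr_byte |= 1 << (7 - (i - start));
      }
      // the residual update against the reconstructed +-1 values is
      // omitted here; see the implementation for the exact bookkeeping
    }
  }
};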

Our optimized implementation performs well when using multiple GPUs on a single machine. Here is a benchmark of the VGG16 model on a machine with 4 GPUs (unit: samples/sec).

| VGG16 | original kernel | our kernel |
| --- | --- | --- |
| onebit | ~626 | ~752 |
| twobit | ~705 | ~761 |

baseline (no compression): ~564

However, our implementation does not perform well when the compression operator is launched on the CPU, especially when we use multiple nodes and set kvstore=dist_sync. Under such circumstances, our new implementation may lead to a small throughput reduction. IMHO, one possible solution is to adopt different kernel functions for the CPU and the GPU respectively: use the original kernel for the CPU compression operation and the new kernel for the GPU operation (still a work in progress; a sketch of such a dispatch follows).
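
For illustration, here is a minimal sketch of such per-device dispatch using MXNet's Kernel<Op, xpu>::Launch pattern. The function and kernel names (Quantize2BitDispatch, quantize_2bit_block, quantize_2bit_byte) are hypothetical stand-ins for the two variants shown above:

#include <type_traits>

template <typename xpu>
void Quantize2BitDispatch(mshadow::Stream<xpu> *s, float *out, float *grad,
                          float *residual, int original_size,
                          const float neg_threshold, const float pos_threshold) {
  if (std::is_same<xpu, mshadow::gpu>::value) {
    // GPU: one thread per output byte (4 gradient values each),
    // maximizing parallelism across the many GPU threads
    mxnet_op::Kernel<quantize_2bit_byte, xpu>::Launch(
        s, (original_size + 3) / 4, original_size, out, grad, residual,
        neg_threshold, pos_threshold);
  } else {
    // CPU: one thread per 32-bit output block (16 gradient values each),
    // keeping per-thread work large enough for the few CPU threads
    mxnet_op::Kernel<quantize_2bit_block, xpu>::Launch(
        s, (original_size + 15) / 16, original_size, out, grad, residual,
        neg_threshold, pos_threshold);
  }
}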

The following are benchmark tests of AlexNet and ResNet-50 on a cluster with up to 4 nodes.

AlexNet, batch size=128 on each GPU, dist_sync_device

(figure: alexnet-speed, throughput plot)

ResNet-50, batch size=128 on each GPU, dist_sync_device

(figure: resnet50-speed, throughput plot)

More performance tests will be released soon...

Related issue or pr

signum with grad compression #9558
2bit gradient compression #8662

Reference

Seide F., Fu H., Droppo J., et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. Interspeech 2014.

Acknowledgment

Thanks to HiNA group for providing the experiments testbed.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@shuo-ouyang shuo-ouyang requested a review from szha as a code owner April 1, 2020 10:17
@mxnet-bot

Hey @shuo-ouyang , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-gpu, miscellaneous, clang, unix-cpu, centos-cpu, sanity, website, windows-gpu, windows-cpu, unix-gpu, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.


@wkcn wkcn left a comment


Great Work! LGTM : )
Thanks for your contribution!

@shuo-ouyang

@mxnet-bot run ci [unix-gpu, centos-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, centos-gpu]

@shuo-ouyang

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@shuo-ouyang

@mxnet-label-bot add [KVStore, Distributed]


wkcn commented Apr 4, 2020

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@shuo-ouyang

@wkcn Thanks for your help and support. To be honest, I am not familiar with the Jenkins workflow, and I don't know why some checks failed when only one whitespace character was deleted in code comments.


wkcn commented Apr 5, 2020

@shuo-ouyang It is not related to the code. The reason is that the CI is unstable now.

@shuo-ouyang

@wkcn I get it, thanks!


wkcn commented Apr 7, 2020

Hi @rahul003 , could you please help take a review? Thank you!

@shuo-ouyang

@mxnet-bot run ci [centos-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-gpu]

@wkcn wkcn added the pr-awaiting-review PR is waiting for code review label Apr 13, 2020

wkcn commented Apr 24, 2020

Hi @eric-haibin-lin and @szha , could you please help take a review?

Thank you!


szha commented May 23, 2020

@mxnet-bot run ci [all]

@szha szha requested a review from eric-haibin-lin May 23, 2020 19:25
@mxnet-bot

Jenkins CI successfully triggered : [edge, centos-gpu, windows-cpu, centos-cpu, windows-gpu, clang, unix-cpu, miscellaneous, website, unix-gpu, sanity]


wkcn commented Jun 10, 2020

Hi @eric-haibin-lin , could you please help take a review?
The PR is related to gradient compression.

Thank you!


wkcn commented Jun 28, 2020

Hi @szha , could this PR be merged?
There seems to be no other committer reviewing this PR, and it has been blocked for about 3 months.
The author has provided detailed training results, and I have reviewed the code.


szha commented Jun 28, 2020

@wkcn since @eric-haibin-lin recently updated the kvstore interface, it would be best to have a review from him.

@eric-haibin-lin can you help?


@eric-haibin-lin eric-haibin-lin left a comment


@shuo-ouyang shuo-ouyang changed the title 1bit gradient compression [WIP]1Bit gradient compression Jul 21, 2020
@eric-haibin-lin eric-haibin-lin added pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review labels Jul 28, 2020

@eric-haibin-lin eric-haibin-lin left a comment


What is the end2end speedup when using the 1bit compressor?
The accuracy drop on resnet20 is also very significant.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 25, 2021
@shuo-ouyang

@eric-haibin-lin
Very sorry for the delayed response... The experiment results have been updated, please check them.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Mar 4, 2021
@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Mar 4, 2021
@szha

szha commented Mar 4, 2021

The results show that 2-bit compression yields a higher test perplexity than 1-bit compression, which seems surprising. Any explanation of why that is the case?

@shuo-ouyang

The reason may be that the one-bit quantizer can exactly capture the direction (sign) of each gradient element when the threshold is 0, whereas the two-bit quantizer cannot due to its limitation (threshold != 0). By the way, the previous results were trained with an additional code modification to the error-feedback part, which may have led to counterintuitive results. We have corrected them and uploaded the right results.
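
A small constructed example of this dead-zone effect (the numbers are ours, not from the PR): take a single gradient element g = 0.3 with zero residual.

1-bit, threshold 0: q = sign(0.3) = +1, so the direction reaches the server in this very step.
2-bit, threshold 0.5: since -0.5 < 0.3 < 0.5, nothing is sent (the two bits stay 00) and 0.3 is carried in the residual; the direction only reaches the server once the accumulated residual crosses +-0.5.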

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Mar 10, 2021

@szha szha left a comment


Thanks for the explanation. The change looks good to me.

@lanking520 lanking520 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Mar 10, 2021
@szha szha merged commit 5aceafc into apache:master Mar 10, 2021