[MXNET-331] Single machine All Reduce Topology-aware Communication (Updated) #11591

ctcyang · 2018-07-06T20:25:50Z

Description

Single machine All Reduce Topology-aware Communication

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

The proposed communication method shows a speed-up over both existing methods (parameter server and NCCL) at small batch sizes for ResNet-50, VGG-16, Inception-v3 and AlexNet.
The communication method queries the single-machine multi-GPU link topology and determines a suitable communication pattern to use.
Enable the feature by setting MXNET_KVSTORE_USETREE=1; the default is 0.
Add knobs for tuning this communication method.
In the future, an auto-tuner will be added to choose automatically between the single-machine communication protocols (parameter server, NCCL, and the method proposed here).
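As a minimal sketch of how the switch might be toggled from Python: the variable name MXNET_KVSTORE_USETREE comes from this PR, while the `EnvVar` helper is a hypothetical name used only for illustration. It sets the variable for the duration of a block and restores the previous state afterwards, including the case where the variable was not set at all. The sketch assumes that the variable has to be set before the kvstore is created.

```python
import os


class EnvVar:
    """Temporarily set an environment variable (hypothetical helper)."""

    def __init__(self, key, value):
        self._key = key
        self._value = str(value)
        self._prev = None

    def __enter__(self):
        # Remember the previous value; None means the variable was unset.
        self._prev = os.environ.get(self._key)
        os.environ[self._key] = self._value
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self._prev is None:
            # The variable did not exist before, so remove it again.
            os.environ.pop(self._key, None)
        else:
            os.environ[self._key] = self._prev


with EnvVar("MXNET_KVSTORE_USETREE", 1):
    # Code that creates and uses a single-machine multi-GPU kvstore
    # would go here (assumption: the flag is read at creation time).
    enabled_inside = os.environ["MXNET_KVSTORE_USETREE"]
```

Restoring the unset state explicitly avoids assigning `None` into `os.environ`, which raises a `TypeError`.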

Comments

Design Proposal
Comments from several reviewers are here: [MXNET-331] Single machine All Reduce Topology-aware Communication #11357

ctcyang · 2018-07-20T18:31:48Z

Yeah, I'm blocked by the test. I can't reproduce the failure on my local machine, but I can reproduce it in the Docker image.

ctcyang · 2018-07-23T17:10:41Z

@haojin2 @eric-haibin-lin @rahul003 Trying to get this into 1.3 release

haojin2 · 2018-07-23T18:06:07Z

src/kvstore/comm_tree.h

+            if (dest_id != topo_id) {
+              CopyFromTo(buf_from.merged[merged_row],
+                  &(buf_dest.copy_buf[merged_row][is_dest-1]),
+                  priority);


Align the lines like:

CopyFromTo(buf_..., &(buf_..., priority);

rahul003 · 2018-07-23T18:06:17Z

tests/cpp/kvstore/gpu_topology_test.cc

+
+// ComputeTreesTest with backtracking
+// TODO(carlyang): comment out test for now
+/*TEST(GpuTopology, TestComputeTrees1) {


What's wrong with these tests? Do they not work?

They used to segfault only on CI, but now they should be fine. I fixed an off-by-1 bug.

rahul003 · 2018-07-23T18:15:36Z

tests/python/gpu/test_nccl.py

+    def __exit__(self, ptype, value, trace):
+        os.environ[self._key] = self._prev_val
+
+def test_device_pushpull():


Why is this test in the file test_nccl?

rahul003 · 2018-07-23T18:22:43Z

src/kvstore/comm_tree.h

+  CommDeviceTree() {
+    inited_ = false;
+    gpuarray_bound_ = dmlc::GetEnv("MXNET_KVSTORE_GPUARRAY_BOUND", 10000000);
+    backtrack_ = dmlc::GetEnv("MXNET_KVSTORE_BACKTRACK", 0);


Could you document these environment variables in https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md?

Added documentation.

rahul003 · 2018-07-23T18:29:09Z

src/kvstore/comm_tree.h

+    std::vector<float> link_matrix(devs_.size()*devs_.size());
+    GetP2PWeight(devs_, &link_matrix);
+    if (backtrack_)
+      LOG(WARNING) << "Using Backtracking to generate trees";


This should be LOG(INFO)

The same change should be made in many other places as well.
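For readers unfamiliar with the `link_matrix` allocated in the diff above (one float per ordered pair of devices, `devs_.size()*devs_.size()` in total), the following Python sketch shows a flat, row-major layout of that kind. The 4-GPU topology, the weights, the symmetry of links and the helper names `make_link_matrix` and `get_weight` are all assumptions made for illustration; they are not taken from `GetP2PWeight`.

```python
def make_link_matrix(n_gpus, links):
    """Build a flat, row-major n_gpus x n_gpus matrix of link weights.

    `links` maps a pair (i, j) to a weight; pairs that are missing get 0.0.
    """
    matrix = [0.0] * (n_gpus * n_gpus)
    for (i, j), weight in links.items():
        matrix[i * n_gpus + j] = weight
        matrix[j * n_gpus + i] = weight  # assumption: links are symmetric
    return matrix


def get_weight(matrix, n_gpus, i, j):
    """Look up the weight of the link from GPU index i to GPU index j."""
    return matrix[i * n_gpus + j]


# Made-up topology: four GPUs connected in a ring with uneven weights.
links = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 2.0}
link_matrix = make_link_matrix(4, links)
```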

rahul003 · 2018-07-23T18:31:16Z

src/kvstore/gpu_topology.h

+  }
+}
+
+// Performs partition on each existing partition in graph W if partition has


Thanks for the great comments on the functions, but could you change the comment style to the standard used in the codebase? Here's an example: https://github.com/apache/incubator-mxnet/blob/b4156da26cfe741619227ae726872b1255194900/src/kvstore/kvstore_utils.h#L37

rahul003 · 2018-07-23T18:34:57Z

src/kvstore/comm_tree.h

+  std::vector<Context> devs_;
+
+  /// \brief Highest numbered device
+  int max_dev_;


I tried to see where this variable is used, to check that cases where GPUs '1,5,3,7' are given work, but it looks like this variable is not used. If so, please remove it.

rahul003 · 2018-07-23T18:50:34Z

src/kvstore/comm_tree.h

+      //   dev_id: 4 2 3 1 7 5 0
+      // and generated an n_gpus x n_gpus link topology matrix:
+      //
+      // 1) The reduction trees are saved as indices on 0, 1, ..., n_gpus


Could you clarify how many are generated?
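The code comment in the diff above separates device ids from the indices on which the reduction trees are stored. One way to picture that separation, purely as a sketch: map an arbitrary list of device ids (here the '1,5,3,7' case raised in the earlier comment) to dense indices and back. The function name `build_index_maps` is hypothetical and the mapping rule (position in the input list) is an assumption, not something confirmed by comm_tree.h.

```python
def build_index_maps(dev_ids):
    """Map arbitrary GPU device ids to dense indices 0..n-1 and back."""
    index_of = {dev: idx for idx, dev in enumerate(dev_ids)}
    if len(index_of) != len(dev_ids):
        raise ValueError("duplicate device id in list")
    return index_of, list(dev_ids)


index_of, dev_of = build_index_maps([1, 5, 3, 7])

# A tree edge stored on indices can be translated back to device ids.
edge_on_indices = (0, 2)
edge_on_devices = tuple(dev_of[i] for i in edge_on_indices)
```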

ctcyang · 2018-07-23T19:50:03Z

@Roshrini For keeping track of PR

rahul003 · 2018-07-24T17:39:05Z

docs/faq/env_var.md

+  - Values: 0(false) or 1(true) ```(default=0)```
+  - If true and MXNET_KVSTORE_USETREE is set to 1, MXNet will log the reduction trees that have been generated.
+
+* MXNET_KVSTORE_GPUARRAY_BOUND


I realize it says multiple trees, but could you call out that this is for the tree kvstore? That matters especially because we have a similarly named variable, MXNET_KVSTORE_BIGARRAY_BOUND.

rahul003 · 2018-07-24T17:39:13Z

docs/faq/env_var.md

+  - When the array size is bigger than this threshold and MXNET_KVSTORE_USETREE is set to 1, multiple trees are used to load balance the big gradient being communicated in order to better saturate link bandwidth.
+
+* MXNET_KVSTORE_BACKTRACK
+  - Values: 0(false) or 1(true) ```(Default=0)


Formatting issue
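The documented behaviour of the array bound in the diff above, where a gradient larger than the threshold is spread over multiple trees, can be pictured with the sketch below. It is an illustration under assumptions: the even-slicing rule and the name `split_for_trees` are hypothetical, and the actual partitioning in comm_tree.h may differ. The default of 10000000 is the value that appears in the `CommDeviceTree` constructor diff earlier in this review.

```python
def split_for_trees(size, num_trees, bound=10000000):
    """Return a list of (start, end) slices of a gradient of `size` elements.

    Arrays at or below `bound` stay in one piece; larger arrays are cut
    into `num_trees` contiguous slices of near-equal length.
    """
    if size <= bound or num_trees <= 1:
        return [(0, size)]
    base, extra = divmod(size, num_trees)
    slices = []
    start = 0
    for t in range(num_trees):
        # The first `extra` slices take one additional element each.
        length = base + (1 if t < extra else 0)
        slices.append((start, start + length))
        start += length
    return slices
```

With a small bound for demonstration, `split_for_trees(10, 4, bound=5)` yields four slices that together cover all ten elements.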

…pdated) (apache#11591)
* add multiroot all-reduce communication pattern
* fix bug with UpdateWeight
* fix PCI-E links appearing in weight matrix bug
* optimization to skip CopyFromTo in ReduceInner gains a bit of throughput
* remove unnecessary if statement
* Add tests
* add more tests, 6 tests left to add
* get rid of some dead code
* Add comments
* Add randomized tests for backtrack and kernighan-lin
* Fix Postprocess
* Add switch for first valid tree when num_gpus > 8, and for maximum weight when num_gpus <= 8
* Kernighan-Lin seems to find better trees
* get rid of printfs
* change defaults
* inherit from CommDevice instead of Comm
* Fix lint errors
* Add Python test using MXNET_KVSTORE_USETREE, fix CMake compilation problem, add header guard
* fix lint errors
* better header guard that works for tests
* get rid of unused variable warning
* retrigger jenkins
* resolve 2 comments
* address comment using Class to do test, get rid of extraneous test, use PCI-E as fallback for GPUs that are not linked by NVLink
* address comments
* fix a few bugs
* get rid of printfs
* get rid of print
* Comment out test for now
* fix 2 more bugs
* fix segfault
* change PrintVector, PrintTopo, PrintMatrix to LOG(INFO) instead of stdout
* Fix code alignment
* get rid of todo
* Make changes to env variable names to indicate they are TREE-related
* Add note saying when ARRAY_BOUND env var takes effect

Carl Yang added 29 commits June 4, 2018 03:51

add multiroot all-reduce communication pattern

9678143

fix bug with UpdateWeight

d5e51d6

fix PCI-E links appearing in weight matrix bug

0708dbc

optimization to skip CopyFromTo in ReduceInner gains a bit of throughput

5590920

remove unnecessary if statement

4f8f58b

Add tests

908534a

add more tests, 6 tests left to add

25cbbdc

get rid of some dead code

310ee4d

Add comments

9cce8ea

Add randomized tests for backtrack and kernighan-lin

4d2790d

Fix Postprocess

b5b42bc

Add switch for first valid tree when num_gpus > 8, and for maximum weight when num_gpus <= 8

6327ceb

Kernighan-Lin seems to find better trees

8694fe7

get rid of printfs

c6cd67a

change defaults

7466c4d

Merge branch 'feature_multirootv9' of https://github.com/ctcyang/incubator-mxnet into feature_multirootv9

153ec0b

Merge branch 'master' of https://github.com/apache/incubator-mxnet into feature_multirootv9

7c61b6c

inherit from CommDevice instead of Comm

cc935a2

Fix lint errors

ba60aaa

Add Python test using MXNET_KVSTORE_USETREE, fix CMake compilation problem, add header guard

972e9c0

fix lint errors

6627dcf

better header guard that works for tests

4de89a7

get rid of unused variable warning

317c66b

retrigger jenkins

c364fd3

resolve 2 comments

3241d71

address comment using Class to do test, get rid of extraneous test, use PCI-E as fallback for GPUs that are not linked by NVLink

bd926bf

resolve merge conflicts

0e1a704

Merge remote-tracking branch 'apache/master' into feature_multirootv9

47b0b63

Merge remote-tracking branch 'apache/master' into feature_multirootv9merge

781a7fe

ctcyang mentioned this pull request Jul 6, 2018

[MXNET-331] Single machine All Reduce Topology-aware Communication #11357

Closed

8 tasks

Carl Yang added 11 commits July 20, 2018 20:22

Merge remote-tracking branch 'apache/master' into feature_multirootv9merge

24b9c62

Merge remote-tracking branch 'apache/master' into feature_multirootv9merge

7d0da7b

fix a few bugs

18c1700

get rid of printfs

c65a620

Merge branch 'feature_multirootv9merge3' into feature_multirootv9

a70b1b8

Merge remote-tracking branch 'apache/master' into feature_multirootv9

263a4cb

get rid of print

628ba6e

Merge branch 'feature_multirootv9' into feature_multirootv9merge

b3f3235

Comment out test for now

a0e1366

fix 2 more bugs

63fd14e

Merge branch 'feature_multirootv9merge3' into feature_multirootv9merge

6c0bff8

haojin2 reviewed Jul 23, 2018

rahul003 reviewed Jul 23, 2018

Carl Yang added 3 commits July 23, 2018 23:42

fix segfault

9f5c24a

change PrintVector, PrintTopo, PrintMatrix to LOG(INFO) instead of stdout

9cc24d0

Merge branch 'feature_multiv9merge4' into feature_multirootv9merge

691d5ac

ctcyang requested a review from szha as a code owner July 24, 2018 00:51

Carl Yang added 2 commits July 24, 2018 00:52

Fix code alignment

67b0db0

get rid of todo

c8ebb87

rahul003 reviewed Jul 24, 2018

Carl Yang added 2 commits July 24, 2018 11:56

Make changes to env variable names to indicate they are TREE-related

5f7da5e

Add note saying when ARRAY_BOUND env var takes effect

16b8fb4

eric-haibin-lin merged commit fe07d50 into apache:master Jul 24, 2018

ctcyang deleted the feature_multirootv9merge branch July 24, 2018 21:40
