Improve sparse pull performance for gluon trainer #11429

eric-haibin-lin · 2018-06-27T20:48:07Z

Description

Introduce a GPU priority queue for row-sparse pull operations.
Add an ignore_sparse option to kv.pull, which improves performance for hybrid blocks with a dense weight and a sparse gradient.
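
As a rough illustration of what a priority queue buys here, the toy sketch below pops higher-priority tasks first, so a prioritized row-sparse pull can jump ahead of work queued earlier. The class and task names are hypothetical, and this is not the engine's C++ implementation.

```python
import heapq
import itertools


class ToyPriorityQueue:
    """Toy task queue: larger priority values are popped first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter so tasks with equal priority keep FIFO order.
        self._counter = itertools.count()

    def push(self, priority, name):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def pop(self):
        _, _, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


queue = ToyPriorityQueue()
queue.push(0, "dense_copy")
queue.push(5, "row_sparse_pull_layer0")  # hypothetical prioritized task
queue.push(0, "other_copy")
order = [queue.pop() for _ in range(len(queue))]
print(order)
```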

@leezu @rahul003

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

eric-haibin-lin · 2018-06-27T20:49:20Z

@haojin2

eric-haibin-lin · 2018-06-28T17:17:48Z

@junrushao1994 pls help review engine code. Thanks!

junrushao · 2018-06-29T15:19:42Z

The engine code looks good to me

szha · 2018-06-29T18:02:36Z

python/mxnet/gluon/trainer.py

-                update_on_kvstore = False
+            if 'dist' in kvstore.type:
+                # kv.pull(row_sparse_grad) is not supported for dist kvstore
+                update_on_kvstore = self._contains_sparse_weight or self._contains_sparse_grad


From the comment, I'm guessing you meant not self._contains_sparse_weight and not self._contains_sparse_grad?

eric-haibin-lin (2018-07-01): This is intended. kv.pull(row_sparse_grad) is not supported for dist kvstore, so we want to set update_on_kvstore = True if there's a sparse grad.
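
The two expressions discussed in this thread are exact logical opposites, which a small truth table makes visible. The function names below are hypothetical stand-ins for the expression in the diff and the reviewer's suggested alternative.

```python
from itertools import product


def as_written_in_pr(sparse_weight, sparse_grad):
    # Expression from the diff: update on the kvstore when anything is sparse.
    return sparse_weight or sparse_grad


def suggested_by_reviewer(sparse_weight, sparse_grad):
    # Expression the reviewer guessed was intended.
    return not sparse_weight and not sparse_grad


rows = [
    (sw, sg, as_written_in_pr(sw, sg), suggested_by_reviewer(sw, sg))
    for sw, sg in product([False, True], repeat=2)
]
for row in rows:
    print(row)
```

By De Morgan's law the last two columns always disagree, so for a dist kvstore the version in the diff yields update_on_kvstore = True exactly when a sparse weight or sparse gradient is present.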

szha · 2018-06-29T18:05:18Z

how does the performance look? @eric-haibin-lin @leezu

eric-haibin-lin · 2018-07-01T18:08:49Z

@szha For a dense weight with a sparse grad, this PR pulls the sparse grad instead of the dense weight. This reduces GPU-to-GPU copy time from 60 ms to less than 1 ms.
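
A back-of-the-envelope sketch of why pulling a row-sparse gradient can be much cheaper than pulling the dense weight. All sizes are hypothetical, 4-byte values and 8-byte row indices are assumptions, and the sketch does not attempt to reproduce the timings reported above.

```python
def dense_bytes(num_rows, row_width, itemsize=4):
    """Bytes needed to copy a full dense matrix."""
    return num_rows * row_width * itemsize


def row_sparse_bytes(num_active_rows, row_width, itemsize=4, index_itemsize=8):
    """Bytes needed to copy only the non-zero rows plus their row indices."""
    data = num_active_rows * row_width * itemsize
    indices = num_active_rows * index_itemsize
    return data + indices


# Hypothetical embedding layer: 1M rows of width 128, 512 rows touched per batch.
vocab_size, embed_dim, active_rows = 1_000_000, 128, 512

dense = dense_bytes(vocab_size, embed_dim)
sparse = row_sparse_bytes(active_rows, embed_dim)
print(f"dense weight: {dense} bytes, row-sparse grad: {sparse} bytes")
print(f"ratio: {dense / sparse:.0f}x")
```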

haojin2

LGTM

piiswrong · 2018-07-02T17:15:09Z

include/mxnet/c_api.h

 * \return 0 when success, -1 when failure happens
 */
 MXNET_DLL int MXKVStorePull(KVStoreHandle handle,
                            mx_uint num,
                            const int* keys,
                            NDArrayHandle* vals,
-                            int priority);
+                            int priority,
+                            bool ignore_sparse = true);


C API doesn't support default value

piiswrong · 2018-07-02T17:15:13Z

include/mxnet/c_api.h

 * \return 0 when success, -1 when failure happens
 */
 MXNET_DLL int MXKVStorePullEx(KVStoreHandle handle,
                              mx_uint num,
                              const char** keys,
                              NDArrayHandle* vals,
-                              int priority);
+                              int priority,
+                              bool ignore_sparse = true);


eric-haibin-lin (2018-07-03): Added extra C APIs instead of adding a default value to this one.

piiswrong · 2018-07-02T17:15:27Z

include/mxnet/engine.h

@@ -84,6 +84,8 @@ enum class FnProperty {
  kCopyToGPU,
  /*! \brief Prioritized sync operation on CPU */
  kCPUPrioritized,
+  /*! \brief Prioritized sync operation on GPU */
+  kGPUPrioritized,


Add it at the end

eric-haibin-lin (2018-07-04): Moved to the end.
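
The thread does not state why the new member should go at the end. One likely motivation is that appending keeps the integer values of existing members unchanged, whereas inserting in the middle shifts every later member. The sketch below illustrates this with Python's IntEnum; kLater is a hypothetical stand-in for whatever members follow in the real enum, and the numeric values are illustrative only.

```python
from enum import IntEnum

# Members before the change (kLater is a hypothetical placeholder).
ORIGINAL = IntEnum(
    "FnPropertySketch", ["kCopyToGPU", "kCPUPrioritized", "kLater"], start=0
)
# New member inserted in the middle, as in the first version of the diff.
INSERTED = IntEnum(
    "FnPropertySketch",
    ["kCopyToGPU", "kCPUPrioritized", "kGPUPrioritized", "kLater"],
    start=0,
)
# New member appended at the end, as requested in review.
APPENDED = IntEnum(
    "FnPropertySketch",
    ["kCopyToGPU", "kCPUPrioritized", "kLater", "kGPUPrioritized"],
    start=0,
)


def shifted_members(old, new):
    """Names of existing members whose integer value changed."""
    return [m.name for m in old if new[m.name].value != m.value]


print("inserted in middle:", shifted_members(ORIGINAL, INSERTED))
print("appended at end:   ", shifted_members(ORIGINAL, APPENDED))
```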

rahul003 · 2018-07-03T21:28:05Z

python/mxnet/gluon/trainer.py

-                raise RuntimeError("Cannot set update_on_kvstore to False when sparse "
-                                   "gradients and/or sparse weights are present for "
-                                   "Parameter '%s'."%param.name)
+                raise RuntimeError("Cannot set update_on_kvstore to False when sparse weights "


Does this have to be an error, or can it be a warning that automatically uses update_on_kvstore?

Also, shouldn't this be outside the if contains_sparse_weight condition?

eric-haibin-lin (2018-07-04): By default, update_on_kvstore is None. It's only set if the user provides a value on purpose. I think an explicit error is better, since we cannot satisfy the user's original intent.
If the user sets update_on_kvstore to False and the model contains no sparse weight, it's totally fine. Why should this be outside the if condition?
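
The behavior described in this reply can be sketched as a small standalone function. This is a simplified illustration and not the trainer's actual code: the function name, the error message wording, and the `default` used when the user gives no value are all assumptions.

```python
def resolve_update_on_kvstore(user_setting, contains_sparse_weight, default=True):
    """Toy resolution of update_on_kvstore.

    user_setting: None (not specified by the user), True, or False.
    default: hypothetical value used when the user specified nothing.
    """
    if contains_sparse_weight:
        if user_setting is False:
            # The explicit request cannot be honored, so fail loudly
            # instead of silently overriding it with a warning.
            raise RuntimeError(
                "update_on_kvstore=False was requested, "
                "but the model contains sparse weights"
            )
        return True
    # No sparse weights: an explicit False (or True) is respected as-is.
    return default if user_setting is None else user_setting


print(resolve_update_on_kvstore(None, contains_sparse_weight=True))
print(resolve_update_on_kvstore(False, contains_sparse_weight=False))
```

Keeping the check inside the sparse-weight branch means an explicit False only fails in the case where it cannot be satisfied.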

rahul003 · 2018-07-03T21:56:42Z

python/mxnet/kvstore.py

-        For `RowSparseNDArray` values, this call is ignored,
-        please use ``row_sparse_pull`` instead.
+        pull with `RowSparseNDArray` is not supported for dist kvstore.
+        Please use ``row_sparse_pull`` instead.


Should ignore_sparse default to false to be consistent with the previous behavior?

eric-haibin-lin (2018-07-04): The previous behavior was to always ignore sparse, so it's consistent.
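
To make the default concrete, here is a toy in-memory store whose pull skips row-sparse outputs unless ignore_sparse is turned off. ToyKVStore and its dict-based outputs are hypothetical and only mimic the flag's semantics as described in this thread; they are not the MXNet KVStore implementation.

```python
class ToyKVStore:
    """Minimal in-memory key-value store used only to illustrate ignore_sparse."""

    def __init__(self):
        self._store = {}

    def init(self, key, value):
        self._store[key] = list(value)

    def pull(self, key, out, ignore_sparse=True):
        """Copy the stored value into `out`; return True if a copy happened."""
        if out["stype"] == "row_sparse" and ignore_sparse:
            # Matches the old behavior: the call is silently ignored.
            return False
        out["data"] = list(self._store[key])
        return True


kv = ToyKVStore()
kv.init("grad", [0.0, 1.5, 0.0])

dense_out = {"stype": "default", "data": None}
sparse_out = {"stype": "row_sparse", "data": None}

pulled_dense = kv.pull("grad", dense_out)
pulled_sparse_default = kv.pull("grad", sparse_out)  # skipped by default
pulled_sparse_forced = kv.pull("grad", sparse_out, ignore_sparse=False)
print(pulled_dense, pulled_sparse_default, pulled_sparse_forced)
```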

eric-haibin-lin · 2018-07-04T22:15:22Z

@piiswrong @rahul003 pls review again, thanks.

* clip sparse grad. fix _reduce for rowsparse param
* fix kvstore init for local kv
* trigger
* pull with ignore sparse
* rsp pull with priority
* add doc;
* fix bug in sparse kvstore
* +kvstore test
* add dist kvstore test
* enhance dist kv test
* fix lint
* fix lint
* CR comments

eric-haibin-lin added 14 commits June 13, 2018 23:02

clip sparse grad. fix _reduce for rowsparse param

f0f7bd6

fix kvstore init for local kv

3e6c4c2

trigger

62b2c6a

Merge remote-tracking branch 'upstream/master' into sparse-fix

1548d03

Merge remote-tracking branch 'upstream/master' into sparse-fix

762f816

pull with ignore sparse

4cafce1

rsp pull with priority

932bf49

add doc;

1a6dc36

fix bug in sparse kvstore

6f38f75

Merge remote-tracking branch 'upstream/master' into sparse-fix

0b9be78

Merge remote-tracking branch 'upstream/master' into sparse-fix

a533c23

+kvstore test

7281797

add dist kvstore test

a834826

enhance dist kv test

2622924

eric-haibin-lin requested a review from szha as a code owner June 27, 2018 20:48

eric-haibin-lin added 2 commits June 27, 2018 20:52

fix lint

46ff1a0

fix lint

47b143d

eric-haibin-lin requested a review from piiswrong June 28, 2018 20:49

Merge remote-tracking branch 'upstream/master' into sparse-fix

3d8d666

szha reviewed Jun 29, 2018

haojin2 approved these changes Jul 1, 2018

piiswrong suggested changes Jul 2, 2018

CR comments

a2b1cc9

rahul003 reviewed Jul 3, 2018

rahul003 approved these changes Jul 5, 2018

piiswrong approved these changes Jul 9, 2018

eric-haibin-lin merged commit 266de6b into apache:master Jul 9, 2018

This was referenced Jul 10, 2018

Fix nccl compilation error #11623

Merged

Fix dist kvstore for trainer and flaky dist kvstore test #11633

Merged
