
[MXNET-374] handle row_sparse weight in parameter and trainer #11001

Merged
piiswrong merged 23 commits into apache:master from eric-haibin-lin:sparse-block on May 29, 2018

Conversation

eric-haibin-lin (Member) commented May 19, 2018

Description

@piiswrong @szha @ZiyueHuang @haojin2 @safrooze please review.

  • added row_sparse stype to Parameter
  • added a Trainer reference in Parameter
  • added an API to fetch row-sparse data from a Parameter (sketched below)
  • in Trainer, separated KVStore creation and parameter initialization in the KVStore into two functions: _init_kvstore and _init_params
  • added a check for loading parameters when the Trainer's KVStore is present
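A minimal sketch of how these pieces fit together. The method name row_sparse_data is an assumption based on the docstring in the diff excerpts below ("Get row_sparse data from row_sparse parameters based on row_id"); shapes, names, and row ids are illustrative:

    import mxnet as mx
    from mxnet import gluon

    # a weight stored as row_sparse whose gradient is also row_sparse
    weight = gluon.Parameter('weight', shape=(100, 8),
                             stype='row_sparse', grad_stype='row_sparse')
    weight.initialize(ctx=mx.cpu(0))

    # a Trainer must be associated with the parameter before sparse rows
    # can be fetched; it creates a KVStore for the row_sparse param
    trainer = gluon.Trainer([weight], 'sgd', {'learning_rate': 0.1})

    # fetch only the rows needed for the current batch
    row_id = mx.nd.array([0, 4, 7], dtype='int64')
    rows = weight.row_sparse_data(row_id)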

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

if stype != 'default':
    raise ValueError("Cannot create a HybridBlock with Parameter '%s' " \
                     "because its storage type is %s. Please consider " \
                     "using a SparseBlock instead." % (param.name, stype))
Member Author:

A PR for SparseBlock will be created separately after this one is merged.
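For illustration, a hypothetical block that would trip this check. Depending on where the check finally lands (the discussion below covers moving it out of __init__), the error surfaces at construction or at the first forward call:

    import mxnet as mx
    from mxnet import gluon

    class Net(gluon.HybridBlock):
        def __init__(self, **kwargs):
            super(Net, self).__init__(**kwargs)
            with self.name_scope():
                # a row_sparse parameter inside a HybridBlock is rejected
                self.weight = self.params.get('weight', shape=(10, 10),
                                              stype='row_sparse')

        def hybrid_forward(self, F, x, weight):
            return F.dot(x, weight)

    net = Net()
    net.initialize()
    net(mx.nd.ones((2, 10)))  # expect the ValueError shown above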

Member:

"please consider using" -> "please use"


 p.reset_ctx(ctx=[mx.cpu(1), mx.cpu(2)])
 assert p.list_ctx() == [mx.cpu(1), mx.cpu(2)]

 @with_seed()
 def test_sparse_parameter():
-    p = gluon.Parameter('weight', shape=(10, 10), grad_stype='row_sparse')
+    p = gluon.Parameter('weight', shape=(10, 10), stype='row_sparse', grad_stype='row_sparse')
     p.initialize(init='xavier', ctx=[mx.cpu(0), mx.cpu(1)])
Contributor:

It seems like constraining the contexts to CPU is causing test failures on GPU. Is this necessary?

Member Author:

Updated.
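One plausible shape for the fix (default_context() is the helper the MXNet test suite uses to pick the device under test; the actual change in the PR may differ):

    from mxnet.test_utils import default_context

    @with_seed()
    def test_sparse_parameter():
        ctx = default_context()  # cpu or gpu, chosen by the test harness
        p = gluon.Parameter('weight', shape=(10, 10),
                            stype='row_sparse', grad_stype='row_sparse')
        p.initialize(init='xavier', ctx=ctx)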

"grad_stype for Parameter '%s' must be one of 'default', 'row_sparse', or 'csr'," \
" but got '%s'" % (name, grad_stype)
# sparse related storage type information
valid_stypes = ['default', 'row_sparse', 'csr']
Member:

Might as well make it a set.

Member Author (eric-haibin-lin, May 21, 2018):

It only has 3 elements; I don't think this makes any real difference.

""" Set the trainer this parameter is associated with. """
if self._trainer and self._trainer is not trainer:
raise RuntimeError(
"Failed to set the trainer for Parameter '%s' to %s because it was set to %s. " \
Member:

How can a user detach a parameter from its trainer without exiting Python?

Member Author:

Updated. Users can just call _set_trainer(None). I don't think this will be used by most users, hence it remains private.
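A sketch of the detach-and-reattach flow this enables (param is any existing Parameter; the helper is private, as noted):

    param._set_trainer(None)                   # detach from the current trainer
    trainer2 = gluon.Trainer([param], 'adam')  # the new Trainer re-associates itself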

"""
def __init__(self, prefix=None, params=None):
# check if any parameter is row_sparse
if isinstance(params, ParameterDict):
Contributor:

This check shouldn't be done here: parameters are only added to the current block when self.params.get is called.

Member Author:

Removed. Will the checks in param.list_data() and param.data() be sufficient?

    raise RuntimeError(
        "Failed to set the trainer for Parameter '%s' to %s because it was set to %s. " \
        "More than one Trainer for a single Parameter is not supported." % (
            self.name, str(trainer), str(self._trainer)))
Contributor:

What does str(trainer) show? It's likely not meaningful to users.

Contributor:

This is a breaking change. Suppose users want to train with SGD for 10 epochs and then switch to Adam; this would prevent that.

Member Author:

It now only throws an exception for row_sparse parameters.

""" Get row_sparse data from row_sparse parameters based on row_id. """
# get row sparse params based on row ids
if not isinstance(row_id, ndarray.NDArray):
raise TypeError("Cannot get 'row_sparse' Parameter %s with %s type. "
Contributor:

"row_id must have NDArray type, but %s is given"

"NDArray type is expected." % (self.name, type(row_id)))
if not self._trainer:
raise RuntimeError("Cannot get row_sparse data for Parameter '%s' when no " \
"Trainer is created with it."%self.name)
Contributor:

What if the user wants to train with a single device?

Member Author:

For a single device, we encourage the user to use normal hybrid blocks with sparse_grad=True; there's no need to use a row_sparse weight.
Even if the user chooses to use a row_sparse weight, a KVStore is created for the row_sparse param and the code still works.
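A sketch of that single-device alternative, using the existing sparse_grad option on gluon.nn.Embedding: the weight stays dense and only its gradient is row_sparse, so no KVStore is required (sizes are illustrative):

    import mxnet as mx
    from mxnet import gluon

    embedding = gluon.nn.Embedding(input_dim=1000, output_dim=8, sparse_grad=True)
    embedding.initialize(ctx=mx.cpu(0))
    trainer = gluon.Trainer(embedding.collect_params(), 'sgd', {'learning_rate': 0.1})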

"""(Re)initializes by loading from data."""
if self._trainer and self._trainer._kv_initialized and self._trainer._update_on_kvstore:
raise RuntimeError("Cannot (Re)initialize Parameter '%s' when its Trainer " \
"already initialized the parameter on KVStore."%(self.name))
Contributor:

The message is cryptic. The reason is multi-device training with update_on_kvstore set to true; the error message should describe the reason and suggest a solution.

Member Author:

Updated message.
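Roughly, the guarded scenario looks like this (net, batch_size, and the file name are illustrative; assumes a multi-device setup where update_on_kvstore ends up true):

    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
    trainer.step(batch_size)  # parameters are now initialized on the KVStore
    # a later load would diverge from the KVStore copy, hence the error:
    net.collect_params().load('net.params', ctx=mx.cpu())  # raises RuntimeError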

@@ -396,11 +485,25 @@ def data(self, ctx=None):
        -------
        NDArray on ctx
        """
        if self._stype != 'default':
            raise ValueError("Cannot return a copy of Parameter '%s' on ctx %s via data() " \
Contributor:

Should these be UserError?

Member Author:

Maybe I should change it to RuntimeError? There's a UserWarning, but I am not aware of a UserError.

        self._params.append(param)
        self._params_to_init.append(param)
        param._set_trainer(self)
Contributor:

Do we need to set_trainer when stype='default' and update_on_kvstore=False?

@@ -109,38 +117,54 @@ def _init_optimizer(self, optimizer, optimizer_params):
        self._updaters = [opt.get_updater(self._optimizer) \
                          for _ in self._contexts]

    def _init_params(self):
        """Initialize parameters in the KVStore. Parameters whose
Contributor:

Wrong docstring format.

"when KVStore is not initialized."
params_to_init = []
if self._kvstore:
params = [param for param in self._params_to_init \
Contributor:

Better to use a for loop and if/else here.
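A hedged sketch of that refactor. The names come from the snippet below, while the deferred-init condition and the _init_param helper are guesses at what the comprehension filters on:

    params_to_init = []
    if self._kvstore:
        for param in self._params_to_init:
            if param._deferred_init:
                params_to_init.append(param)  # shape not known yet, retry later
            else:
                self._init_param(param)       # hypothetical helper: kv.init + pull
    else:
        params_to_init = self._params_to_init
    self._params_to_init = params_to_init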

@@ -191,6 +224,8 @@ def step(self, batch_size, ignore_stale_grad=False):
        """
        if not self._kv_initialized:
            self._init_kvstore()
        if self._params_to_init:
Contributor:

I don't quite understand this. If there are uninitialized parameters, wouldn't step fail?

Member Author:

I moved the logic of kv.init(param) from _init_kvstore to _init_params. _params_to_init refers to params that are not yet initialized on the KVStore.
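So the flow in step() becomes the following sketch, mirroring the hunk above (the trailing update logic is elided):

    def step(self, batch_size, ignore_stale_grad=False):
        if not self._kv_initialized:
            self._init_kvstore()   # create the KVStore once
        if self._params_to_init:
            self._init_params()    # lazily push params not yet on the KVStore
        # ... proceed with gradient reduction and the optimizer update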

piiswrong merged commit 482e50b into apache:master on May 29, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
…#11001)

* + rsp parameter
* draft
* Fix optimizer pickle
* refactor and document
* add test for save load with cast_stype
* refactor trainer tests
* add test
* add back test
* raise error for load params
* add comment
* remove print
* fix doc
* CR comments
* CR comments
* change error
* remove cast stype
* fix test
* add reset kvstore to trainer
* lint
* add test to CI
* add more checks
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
…#11001)
eric-haibin-lin deleted the sparse-block branch on September 18, 2018