
update_on_kvstore error setting with multiple machines #9557

Closed
yuewu001 opened this issue Jan 25, 2018 · 4 comments

Comments


yuewu001 commented Jan 25, 2018

When I was training with multiple machines, I found that the _create_kvstore function in model.py sets update_on_kvstore to True. In the Gluon interface (trainer.py), I found the following code:

if 'dist' in kvstore.type:
    update_on_kvstore = False
for i, param in enumerate(self._params):
    param_arrays = param.list_data()
    kvstore.init(i, param_arrays[0])
    kvstore.pull(i, param_arrays, priority=-i)

while in module.py, update_on_kvstore is not set to False.

Is this a bug?
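
For reference, here is a minimal single-machine probe of that helper (the toy arg_params dict and shapes below are placeholders, and 'local' stands in for a 'dist_*' kvstore, which would only be creatable under the distributed launcher):

import mxnet as mx
from mxnet.model import _create_kvstore

# Toy parameter dict just to satisfy the helper's signature (placeholder).
arg_params = {'weight': mx.nd.zeros((128, 128))}

# With more than one device a real kvstore is created; the second return value
# is the update_on_kvstore flag discussed above.
kv, update_on_kvstore = _create_kvstore('local', num_device=2,
                                        arg_params=arg_params)
print(kv.type, update_on_kvstore)   # e.g. "local True" on the 1.x code path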

Besides, the Gluon interface pulls all param_arrays regardless of the value of update_on_kvstore, but in the Python interface (model.py) the parameters are pulled only when update_on_kvstore is True. Is there a reason for that?

def _initialize_kvstore(kvstore, param_arrays, arg_params, param_names, update_on_kvstore):
    """Initialize kvstore"""
    for idx, param_on_devs in enumerate(param_arrays):
        name = param_names[idx]
        kvstore.init(name, arg_params[name])

        if update_on_kvstore:
            kvstore.pull(name, param_on_devs, priority=-idx)

Overall, what is the proper way to call save_optimizer_states in distributed training?

@sandeep-krishnamurthy
Contributor

@rahul003 @piiswrong - You want to take a look at this?

Member

rahul003 commented Mar 1, 2018

It doesn't look like a bug per se. The update is still done after pulling from the kvstore. I'm not sure whether there is a reason why the update is done here rather than on the kvstore.

Regarding the second point, that is just the initialization, and update_on_kvstore is only False when the kvstore is local. Maybe there is one pull that can be saved; I need to look into that.

Author

yuewu001 commented Mar 9, 2018

Thank you for your reply. Do you have any ideas on how to save optimizer states properly in distributed training, especially when update_on_kvstore is True?

@sandeep-krishnamurthy
Contributor

@yuewu001

  1. Trainer is now fixed and uses the same logic as in Module. [Here in Trainer](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/trainer.py#L188) is where the KVStore is created using model._create_kvstore. However, please note that it is not recommended to use the KVStore when the gradient is sparse; hence, it will be set to False in that case.
  2. To save optimizer states, you can use trainer.save_states(file_name): https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/trainer.py#L376 (a usage sketch follows below).
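
A minimal single-machine sketch of that save/restore flow (the Dense layer, hyperparameters, and file names below are placeholders; in a real distributed job the Trainer would be constructed with kvstore='dist_sync' or 'dist_async' under the distributed launcher):

import mxnet as mx
from mxnet import autograd, gluon

# Placeholder model with fixed in_units so parameters are allocated immediately.
net = gluon.nn.Dense(10, in_units=20)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9},
                        kvstore='device')  # 'dist_sync' in a launched cluster job

# One dummy training step so the optimizer actually has states to save.
with autograd.record():
    loss = net(mx.nd.ones((2, 20))).sum()
loss.backward()
trainer.step(batch_size=2)

# Checkpoint: weights and optimizer states are saved separately.
net.collect_params().save('model.params')   # placeholder file name
trainer.save_states('model.states')         # momentum buffers etc.

# Restore: load weights, then the optimizer states into a matching Trainer.
net.collect_params().load('model.params', ctx=mx.cpu())
trainer.load_states('model.states')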

Resolving the issue. Please reopen if you still have questions/issues.
