
update_on_kvstore error setting with multiple machines #9557

Closed
yuewu001 opened this issue Jan 25, 2018 · 4 comments

Comments


yuewu001 commented Jan 25, 2018

When I was training with multiple machines, I found that the _create_kvstore function in model.py sets update_on_kvstore to True. In the Gluon interface (trainer.py), I found the following code:

if 'dist' in kvstore.type:
    update_on_kvstore = False
for i, param in enumerate(self._params):
    param_arrays = param.list_data()
    kvstore.init(i, param_arrays[0])
    kvstore.pull(i, param_arrays, priority=-i)

while in module.py, update_on_kvstore is not set to False.

Is this a bug?
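
For reference, here is a minimal single-machine probe of that helper (the toy arg_params dict and shapes below are placeholders, and 'local' stands in for a 'dist_*' kvstore, which would only be creatable under the distributed launcher):

import mxnet as mx
from mxnet.model import _create_kvstore

# Toy parameter dict just to satisfy the helper's signature (placeholder).
arg_params = {'weight': mx.nd.zeros((128, 128))}

# With more than one device a real kvstore is created; the second return value
# is the update_on_kvstore flag discussed above.
kv, update_on_kvstore = _create_kvstore('local', num_device=2,
                                        arg_params=arg_params)
print(kv.type, update_on_kvstore)   # e.g. "local True" on the 1.x code path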

Besides, the Gluon interface pulls all param_arrays regardless of the value of update_on_kvstore, but in the Python interface (model.py) the parameters are pulled only when update_on_kvstore is True. Is there a reason for that?

def _initialize_kvstore(kvstore, param_arrays, arg_params, param_names, update_on_kvstore):
    """Initialize kvstore"""
    for idx, param_on_devs in enumerate(param_arrays):
        name = param_names[idx]
        kvstore.init(name, arg_params[name])

        if update_on_kvstore:
            kvstore.pull(name, param_on_devs, priority=-idx)

Overall, what is the proper way to call save_optimizer_states in distributed training?

@sandeep-krishnamurthy
Contributor

@rahul003 @piiswrong - You want to take a look at this?

Member

rahul003 commented Mar 1, 2018

It doesn't look like a bug per se. The update is still done after pulling from the kvstore. I'm not sure whether there is a reason why the update is done here rather than on the kvstore.

Regarding the second point, that is just the initialization, and update_on_kvstore is only False when the kvstore is local. Maybe there is one pull that can be saved; I need to look into that.

Author

yuewu001 commented Mar 9, 2018

Thank you for your reply. Do you have any ideas on how to save optimizer states properly in distributed training, especially when update_on_kvstore is True?

@sandeep-krishnamurthy
Contributor

@yuewu001

  1. Trainer is now fixed and uses the same logic as in Module. [Here in Trainer](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/trainer.py#L188) is where the KVStore is created using model._create_kvstore. However, please note that it is not recommended to use the KVStore when the gradient is sparse; hence, it will be set to False in that case.
  2. To save optimizer states, you can use trainer.save_states(file_name): https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/trainer.py#L376 (a usage sketch follows below).
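
A minimal single-machine sketch of that save/restore flow (the Dense layer, hyperparameters, and file names below are placeholders; in a real distributed job the Trainer would be constructed with kvstore='dist_sync' or 'dist_async' under the distributed launcher):

import mxnet as mx
from mxnet import autograd, gluon

# Placeholder model with fixed in_units so parameters are allocated immediately.
net = gluon.nn.Dense(10, in_units=20)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9},
                        kvstore='device')  # 'dist_sync' in a launched cluster job

# One dummy training step so the optimizer actually has states to save.
with autograd.record():
    loss = net(mx.nd.ones((2, 20))).sum()
loss.backward()
trainer.step(batch_size=2)

# Checkpoint: weights and optimizer states are saved separately.
net.collect_params().save('model.params')   # placeholder file name
trainer.save_states('model.states')         # momentum buffers etc.

# Restore: load weights, then the optimizer states into a matching Trainer.
net.collect_params().load('model.params', ctx=mx.cpu())
trainer.load_states('model.states')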

Resolving the issue. Please reopen if you still have questions/issues.
