This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
When I was training with multiple machines, I found that the _create_kvstore function in model.py sets update_on_kvstore to True. In the gluon interface (trainer.py), I found the following code:
while in module.py, update_on_kvstore is not set to False.
Is this a bug?
Besides, the gluon interface pulls all param_arrays regardless of the value of update_on_kvstore. But in the python interface (model.py), the params are pulled only when update_on_kvstore is True. Is there a reason for this?
It doesn't look like a bug per se. The update is still done after pulling from the kvstore. I'm not sure whether there is a reason the update is done here rather than on the kvstore.
Regarding the second point, that is just the init, and update_on_kvstore is only False when the kvstore is local. Maybe there is one pull that can be saved; that needs a closer look.
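To make the difference concrete, here is a toy in-process stand-in (all names hypothetical, no MXNet involved) for the two paths: with update_on_kvstore, a pull after a push returns the already-updated weight; without it, the pull returns the aggregated gradient and the worker applies the optimizer locally.

```python
class ToyKVStore:
    """Toy stand-in for a kvstore; not MXNet's API."""
    def __init__(self, optimizer=None):
        self._store = {}
        self._optimizer = optimizer  # set => updates happen on the store

    def init(self, key, value):
        self._store[key] = value

    def push(self, key, grad):
        if self._optimizer is not None:
            # update_on_kvstore=True: the store applies the optimizer,
            # so a later pull returns the updated weight.
            self._store[key] = self._optimizer(self._store[key], grad)
        else:
            # update_on_kvstore=False: the store only holds the
            # (aggregated) gradient; the worker updates locally.
            self._store[key] = grad

    def pull(self, key):
        return self._store[key]

def sgd(weight, grad, lr=0.1):
    return weight - lr * grad

# Path 1: update on the kvstore -- pull returns the new weight.
kv = ToyKVStore(optimizer=sgd)
kv.init('w', 1.0)
kv.push('w', 0.5)
w1 = kv.pull('w')            # 1.0 - 0.1 * 0.5 = 0.95

# Path 2: local update -- pull returns the gradient, worker applies sgd.
kv = ToyKVStore()
kv.init('w', 1.0)
kv.push('w', 0.5)
w2 = sgd(1.0, kv.pull('w'))  # same result, computed on the worker
```

Either way the worker ends up with the same weight; the difference is only where the optimizer runs, which matters for where the optimizer state lives.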
Thank you for your reply. Do you have any ideas on how to use save_optimizer_states properly in distributed training, especially when update_on_kvstore is True?
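One way to frame the difficulty (a sketch with toy classes; the branch structure is an assumption about how a Module-style save_optimizer_states is organized, so verify against the actual source): when update_on_kvstore is True, the optimizer states live wherever the kvstore applies updates, which for a distributed kvstore means the servers, so a plain worker-side save has nothing local to write. When it is False, the states sit in the worker's updater and can be serialized directly.

```python
import os
import tempfile

class ToyModule:
    """Toy stand-in; mirrors the assumed branch, not MXNet's exact code."""
    def __init__(self, update_on_kvstore, kvstore=None, updater=None):
        self._update_on_kvstore = update_on_kvstore
        self._kvstore = kvstore
        self._updater = updater

    def save_optimizer_states(self, fname):
        if self._update_on_kvstore:
            # States live where updates run: on the kvstore. For a
            # distributed kvstore that means the servers, so saving
            # from a worker is the hard case raised in this issue.
            self._kvstore.save_optimizer_states(fname)
        else:
            # States live in the worker-side updater.
            with open(fname, 'wb') as f:
                f.write(self._updater.get_states())

class ToyUpdater:
    def get_states(self):
        return b'momentum-and-friends'  # placeholder state blob

mod = ToyModule(update_on_kvstore=False, updater=ToyUpdater())
fname = os.path.join(tempfile.mkdtemp(), 'opt.states')
mod.save_optimizer_states(fname)
```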