This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

distributed kvstore bug in MXNet #12713

Closed
eric-haibin-lin opened this issue Oct 1, 2018 · 6 comments · Fixed by #14377

Comments

@eric-haibin-lin
Member

I'm using distributed kvstore with the Gluon trainer. I found the following two bugs:

  1. Initializing trainer = gluon.Trainer(update_on_kvstore=True) doesn't work. Inspecting trainer._update_on_kvstore shows that the value is still set to False.

  2. When distributed kvstore is used, by default gluon.Trainer doesn't work with mx.optimizer.LRScheduler if a worker has more than 1 GPU. To be more specific, the trainer updates once per GPU, and the LRScheduler object is shared across GPUs, so it gets a wrong update count.

This means one cannot train ImageNet classification using ResNet with the Gluon trainer.
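To illustrate the second bug, here is a minimal sketch in plain Python. It does not call MXNet; StepScheduler and lr_after are hypothetical names, and the model assumes a single update counter that is bumped once per GPU instead of once per training iteration. Under that assumption, a step-decay schedule decays several times faster than intended.

```python
class StepScheduler:
    """Toy step-decay schedule (hypothetical, not MXNet's class):
    multiply base_lr by `factor` once every `step` updates."""

    def __init__(self, base_lr, step, factor):
        self.base_lr = base_lr
        self.step = step
        self.factor = factor

    def __call__(self, num_update):
        return self.base_lr * self.factor ** (num_update // self.step)


def lr_after(iterations, num_gpus, scheduler, count_per_gpu):
    """Return the learning rate the scheduler reports after `iterations`
    training iterations on a worker with `num_gpus` GPUs."""
    num_update = 0
    for _ in range(iterations):
        # Assumed buggy behaviour: the shared counter advances once per GPU.
        # Intended behaviour: it advances once per iteration.
        num_update += num_gpus if count_per_gpu else 1
    return scheduler(num_update)


sched = StepScheduler(base_lr=0.1, step=100, factor=0.5)
expected = lr_after(100, num_gpus=4, scheduler=sched, count_per_gpu=False)
observed = lr_after(100, num_gpus=4, scheduler=sched, count_per_gpu=True)
print("intended lr:", expected)
print("lr with per-GPU counting:", observed)
```

With 4 GPUs the scheduler sees 400 updates after 100 iterations, so it applies four decay steps where one was intended.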

@vrakesh
Contributor

vrakesh commented Oct 1, 2018

@eric-haibin-lin Thank you for reporting the issue.
@mxnet-label-bot [Bug, Gluon]

@ragavvenkatesan

+1

@sandeep-krishnamurthy
Contributor

Working on this. Will update my findings.

@lupesko
Contributor

lupesko commented Nov 5, 2018

@sandeep-krishnamurthy is this issue fixed with #12786?
If so, please close the issue or comment.

@sandeep-krishnamurthy
Contributor

  • Initializing trainer = gluon.Trainer(update_on_kvstore=True) doesn't work. Inspecting trainer._update_on_kvstore shows that the value is still set to False.

This is fixed.

  • When distributed kvstore is used, by default gluon.Trainer doesn't work with mx.optimizer.LRScheduler if a worker has more than 1 GPU. To be more specific, the trainer updates once per GPU, and the LRScheduler object is shared across GPUs, so it gets a wrong update count.

This needs to be fixed.
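One possible direction, sketched in plain Python: keep a separate count per (device, parameter index) pair and report the maximum as the global update count. UpdateCounter is a hypothetical helper for illustration only, and this is not necessarily how the eventual fix works.

```python
from collections import defaultdict


class UpdateCounter:
    """Toy counter (hypothetical): tracks updates per (device, index)
    so that N devices updating the same parameter count as one step."""

    def __init__(self):
        self._counts = defaultdict(int)
        self.num_update = 0

    def update(self, device, index):
        key = (device, index)
        self._counts[key] += 1
        # The global count is the largest per-device count, not the sum.
        self.num_update = max(self.num_update, self._counts[key])


counter = UpdateCounter()
for _ in range(100):          # 100 training iterations
    for device in range(4):   # 4 GPUs on this worker
        counter.update(device, index=0)

steps_seen = counter.num_update
print("updates reported to the scheduler:", steps_seen)
```

Because each device advances its own count, the value handed to a learning rate scheduler tracks training iterations regardless of how many GPUs the worker has.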

@lebeg
Contributor

lebeg commented Nov 12, 2018

As mentioned in #12786, the fix for the 1st problem has issues on the v1.3.x release branch.
