This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Correct init idx2name on multi-CPU/GPU training for FeedForward #2366

Merged
tqchen merged 4 commits into apache:master from Answeror:bug/idx2name-in-feedforward
Jun 8, 2016

Conversation

@Answeror (Contributor) commented Jun 7, 2016

This patch is related to #2327 and #2337, which fixed a bug in idx2name for Module in multi-GPU environments.
Note that all previous code that used lr_mult or wd_mult in a multi-GPU environment may suffer from this bug;
for example, fine-tuning on multiple GPUs with the lr_mult of lower layers fixed to zero would give incorrect results.
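The failure mode can be sketched without MXNet: the optimizer looks up each parameter's lr_mult by the name that idx2name maps its index to, and an empty or wrong mapping silently falls back to the default multiplier of 1.0, so a "frozen" layer keeps updating. This is a minimal illustration of that lookup; the parameter names, learning rate, and `effective_lr` helper below are hypothetical, not MXNet's actual code.

```python
# Hypothetical sketch of why idx2name matters on multi-GPU training.
param_names = ['conv1_weight', 'conv1_bias', 'fc1_weight', 'fc1_bias']
lr_mult = {'conv1_weight': 0.0, 'conv1_bias': 0.0}  # freeze the lower layer

def effective_lr(idx, idx2name, base_lr=0.1):
    # The optimizer resolves the parameter name by index; if idx2name is
    # missing the entry, lr_mult lookup falls back to the default of 1.0.
    name = idx2name.get(idx)
    return base_lr * lr_mult.get(name, 1.0)

correct = {i: n for i, n in enumerate(param_names)}
broken = {}  # idx2name never initialized, as in the pre-patch FeedForward

print(effective_lr(0, correct))  # frozen layer: learning rate 0.0
print(effective_lr(0, broken))   # bug: layer still trained at base_lr
```

With the broken mapping, the supposedly frozen conv1 weights receive the full base learning rate, which is exactly the incorrect fine-tuning result described above.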

With this patch, the ResNet example works on multi-GPU setting:

2016-06-07 11:19:56,313 Node[0] Start training with [gpu(0), gpu(1)]
2016-06-07 11:20:22,748 Node[0] Epoch[0] Batch [50]     Speed: 2504.86 samples/sec      Train-accuracy=0.122969
2016-06-07 11:20:25,337 Node[0] Epoch[0] Batch [100]    Speed: 2472.41 samples/sec      Train-accuracy=0.176094
2016-06-07 11:20:27,888 Node[0] Epoch[0] Batch [150]    Speed: 2509.07 samples/sec      Train-accuracy=0.225781
2016-06-07 11:20:30,465 Node[0] Epoch[0] Batch [200]    Speed: 2483.77 samples/sec      Train-accuracy=0.249531
2016-06-07 11:20:33,081 Node[0] Epoch[0] Batch [250]    Speed: 2446.37 samples/sec      Train-accuracy=0.286250
2016-06-07 11:20:35,754 Node[0] Epoch[0] Batch [300]    Speed: 2394.48 samples/sec      Train-accuracy=0.324062
2016-06-07 11:20:38,324 Node[0] Epoch[0] Batch [350]    Speed: 2490.45 samples/sec      Train-accuracy=0.337344
2016-06-07 11:20:40,480 Node[0] Epoch[0] Resetting Data Iterator
2016-06-07 11:20:40,480 Node[0] Epoch[0] Time cost=20.615
2016-06-07 11:20:40,541 Node[0] Saved checkpoint to "cifar10/resnet-0001.params"
2016-06-07 11:20:41,888 Node[0] Epoch[0] Validation-accuracy=0.398240
2016-06-07 11:20:44,471 Node[0] Epoch[1] Batch [50]     Speed: 2505.95 samples/sec      Train-accuracy=0.379688
2016-06-07 11:20:47,206 Node[0] Epoch[1] Batch [100]    Speed: 2339.58 samples/sec      Train-accuracy=0.409062
2016-06-07 11:20:49,807 Node[0] Epoch[1] Batch [150]    Speed: 2460.86 samples/sec      Train-accuracy=0.412656
2016-06-07 11:20:52,377 Node[0] Epoch[1] Batch [200]    Speed: 2491.13 samples/sec      Train-accuracy=0.445469
2016-06-07 11:20:55,055 Node[0] Epoch[1] Batch [250]    Speed: 2390.02 samples/sec      Train-accuracy=0.449375
2016-06-07 11:20:57,744 Node[0] Epoch[1] Batch [300]    Speed: 2379.72 samples/sec      Train-accuracy=0.468750
2016-06-07 11:21:00,305 Node[0] Epoch[1] Batch [350]    Speed: 2499.19 samples/sec      Train-accuracy=0.472500
2016-06-07 11:21:02,550 Node[0] Epoch[1] Resetting Data Iterator
2016-06-07 11:21:02,550 Node[0] Epoch[1] Time cost=20.662
2016-06-07 11:21:02,611 Node[0] Saved checkpoint to "cifar10/resnet-0002.params"
2016-06-07 11:21:03,633 Node[0] Epoch[1] Validation-accuracy=0.507612

Review comment (Contributor) on the BatchNorm arguments in the ResNet example:

    fix_gamma=False,
    momentum=bn_momentum,
    # Same as https://github.com/soumith/cudnn.torch/blob/master/BatchNormalization.lua
    eps=1e-5

change to 2e-5 instead of removing
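For context, cuDNN's batch normalization rejects epsilon values below its minimum constant CUDNN_BN_MIN_EPSILON (1e-5), so sitting exactly at the boundary is fragile and raising the value is safer than dropping the argument. The suggested change might look like the following sketch, assuming MXNet's `mx.sym.BatchNorm` API; `data` and `bn_momentum` are placeholders from the surrounding ResNet example, not definitions introduced here.

```python
import mxnet as mx

# Sketch only: `data` and `bn_momentum` come from the ResNet example's scope.
bn = mx.sym.BatchNorm(
    data=data,
    fix_gamma=False,
    momentum=bn_momentum,
    eps=2e-5,  # keep a safe margin above cuDNN's minimum epsilon
)
```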

@piiswrong (Contributor) commented:

@tqchen good for merging after test finishes

@tqchen tqchen merged commit db6f6a9 into apache:master Jun 8, 2016


3 participants