This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Correct init idx2name on multi-CPU/GPU training for FeedForward #2366

Merged
tqchen merged 4 commits into apache:master from Answeror:bug/idx2name-in-feedforward
Jun 8, 2016

Conversation

@Answeror (Contributor) commented Jun 7, 2016

This patch is related to #2327 and #2337, which fixed a bug in idx2name for Module in multi-GPU environments.
Note that all previous code that used lr_mult or wd_mult in a multi-GPU environment may suffer from this bug;
for example, fine-tuning on multiple GPUs with the lr_mult of lower layers fixed to zero would give incorrect results.
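The failure mode can be sketched without MXNet: the optimizer looks up each parameter's lr_mult by the name that idx2name maps its index to, and an empty or wrong mapping silently falls back to the default multiplier of 1.0, so a "frozen" layer keeps updating. This is a minimal illustration of that lookup; the parameter names, learning rate, and `effective_lr` helper below are hypothetical, not MXNet's actual code.

```python
# Hypothetical sketch of why idx2name matters on multi-GPU training.
param_names = ['conv1_weight', 'conv1_bias', 'fc1_weight', 'fc1_bias']
lr_mult = {'conv1_weight': 0.0, 'conv1_bias': 0.0}  # freeze the lower layer

def effective_lr(idx, idx2name, base_lr=0.1):
    # The optimizer resolves the parameter name by index; if idx2name is
    # missing the entry, lr_mult lookup falls back to the default of 1.0.
    name = idx2name.get(idx)
    return base_lr * lr_mult.get(name, 1.0)

correct = {i: n for i, n in enumerate(param_names)}
broken = {}  # idx2name never initialized, as in the pre-patch FeedForward

print(effective_lr(0, correct))  # frozen layer: learning rate 0.0
print(effective_lr(0, broken))   # bug: layer still trained at base_lr
```

With the broken mapping, the supposedly frozen conv1 weights receive the full base learning rate, which is exactly the incorrect fine-tuning result described above.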

With this patch, the ResNet example works on multi-GPU setting:

2016-06-07 11:19:56,313 Node[0] Start training with [gpu(0), gpu(1)]
2016-06-07 11:20:22,748 Node[0] Epoch[0] Batch [50]     Speed: 2504.86 samples/sec      Train-accuracy=0.122969
2016-06-07 11:20:25,337 Node[0] Epoch[0] Batch [100]    Speed: 2472.41 samples/sec      Train-accuracy=0.176094
2016-06-07 11:20:27,888 Node[0] Epoch[0] Batch [150]    Speed: 2509.07 samples/sec      Train-accuracy=0.225781
2016-06-07 11:20:30,465 Node[0] Epoch[0] Batch [200]    Speed: 2483.77 samples/sec      Train-accuracy=0.249531
2016-06-07 11:20:33,081 Node[0] Epoch[0] Batch [250]    Speed: 2446.37 samples/sec      Train-accuracy=0.286250
2016-06-07 11:20:35,754 Node[0] Epoch[0] Batch [300]    Speed: 2394.48 samples/sec      Train-accuracy=0.324062
2016-06-07 11:20:38,324 Node[0] Epoch[0] Batch [350]    Speed: 2490.45 samples/sec      Train-accuracy=0.337344
2016-06-07 11:20:40,480 Node[0] Epoch[0] Resetting Data Iterator
2016-06-07 11:20:40,480 Node[0] Epoch[0] Time cost=20.615
2016-06-07 11:20:40,541 Node[0] Saved checkpoint to "cifar10/resnet-0001.params"
2016-06-07 11:20:41,888 Node[0] Epoch[0] Validation-accuracy=0.398240
2016-06-07 11:20:44,471 Node[0] Epoch[1] Batch [50]     Speed: 2505.95 samples/sec      Train-accuracy=0.379688
2016-06-07 11:20:47,206 Node[0] Epoch[1] Batch [100]    Speed: 2339.58 samples/sec      Train-accuracy=0.409062
2016-06-07 11:20:49,807 Node[0] Epoch[1] Batch [150]    Speed: 2460.86 samples/sec      Train-accuracy=0.412656
2016-06-07 11:20:52,377 Node[0] Epoch[1] Batch [200]    Speed: 2491.13 samples/sec      Train-accuracy=0.445469
2016-06-07 11:20:55,055 Node[0] Epoch[1] Batch [250]    Speed: 2390.02 samples/sec      Train-accuracy=0.449375
2016-06-07 11:20:57,744 Node[0] Epoch[1] Batch [300]    Speed: 2379.72 samples/sec      Train-accuracy=0.468750
2016-06-07 11:21:00,305 Node[0] Epoch[1] Batch [350]    Speed: 2499.19 samples/sec      Train-accuracy=0.472500
2016-06-07 11:21:02,550 Node[0] Epoch[1] Resetting Data Iterator
2016-06-07 11:21:02,550 Node[0] Epoch[1] Time cost=20.662
2016-06-07 11:21:02,611 Node[0] Saved checkpoint to "cifar10/resnet-0002.params"
2016-06-07 11:21:03,633 Node[0] Epoch[1] Validation-accuracy=0.507612

Review comment (Contributor) on the BatchNorm arguments in the ResNet example:

    fix_gamma=False,
    momentum=bn_momentum,
    # Same as https://github.com/soumith/cudnn.torch/blob/master/BatchNormalization.lua
    eps=1e-5

change to 2e-5 instead of removing
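For context, cuDNN's batch normalization rejects epsilon values below its minimum constant CUDNN_BN_MIN_EPSILON (1e-5), so sitting exactly at the boundary is fragile and raising the value is safer than dropping the argument. The suggested change might look like the following sketch, assuming MXNet's `mx.sym.BatchNorm` API; `data` and `bn_momentum` are placeholders from the surrounding ResNet example, not definitions introduced here.

```python
import mxnet as mx

# Sketch only: `data` and `bn_momentum` come from the ResNet example's scope.
bn = mx.sym.BatchNorm(
    data=data,
    fix_gamma=False,
    momentum=bn_momentum,
    eps=2e-5,  # keep a safe margin above cuDNN's minimum epsilon
)
```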

@piiswrong (Contributor) commented:

@tqchen good for merging after test finishes

@tqchen tqchen merged commit db6f6a9 into apache:master Jun 8, 2016


3 participants