[MXNET-614] Adding Synchronized Batch Normalization #11502

Merged
merged 38 commits into apache:master on Jul 14, 2018

Conversation

7 participants
@zhanghang1989
Contributor

zhanghang1989 commented Jun 29, 2018

Description

Adding Synchronized Batch Normalization
Thanks @eric-haibin-lin for great help!

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zhanghang1989 zhanghang1989 requested a review from szha as a code owner Jun 29, 2018

@zhanghang1989 zhanghang1989 changed the title from [MXNET-614] Adding Synchronized Batch Normalization to [MXNET-614] [WIP] Adding Synchronized Batch Normalization Jun 30, 2018

@zhanghang1989 zhanghang1989 referenced this pull request Jun 30, 2018

Closed

[MXNET-246] operators for Synchronized BatchNorm #10303

1 of 8 tasks complete
@zhanghang1989


Contributor

zhanghang1989 commented Jun 30, 2018

Help Wanted for passing the CI Test!!

@zhanghang1989 zhanghang1989 changed the title from [MXNET-614] [WIP] Adding Synchronized Batch Normalization to [MXNET-614] [Help Wanted for CI Test] Adding Synchronized Batch Normalization Jun 30, 2018

@zhanghang1989 zhanghang1989 changed the title from [MXNET-614] [Help Wanted for CI Test] Adding Synchronized Batch Normalization to [MXNET-614] Adding Synchronized Batch Normalization Jul 2, 2018

'ndev': num_devices, 'key': self.prefix}
def _get_num_devices(self):
# Caution: if not using all the GPUs, please manually set num_devices


@zhreshold

zhreshold Jul 2, 2018

Member

add the warning to docstring rather than showing a comment here
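For example, a minimal sketch of how the caution could be moved into the docstring (the method name comes from the diff above; the body here is just a placeholder):

```python
def _get_num_devices(self):
    """Return the number of devices to synchronize across.

    Caution
    -------
    If not all GPUs on the machine are used for training, please set
    ``num_devices`` manually when constructing the layer.
    """
    ...  # device-counting logic as in the PR
```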

#include <dmlc/logging.h>
#include <dmlc/parameter.h>
#include <mxnet/operator.h>
# include <condition_variable>


@zhreshold

zhreshold Jul 2, 2018

Member

space between # and include?

template<class T>
class SharedND {
private:
int nDev;


@zhreshold

zhreshold Jul 2, 2018

Member

convention for variables is xxx_ for private members


@zhreshold

zhreshold Jul 2, 2018

Member

and camel for functions, which is correct right now

std::lock_guard<std::mutex> lock(mutex_);
auto it = registry_.find(key);
if (it != registry_.end()) return it->second;
T *newT = new T(ndev);


@zhreshold

zhreshold Jul 2, 2018

Member

The memory pointed to by these raw pointers is not released.

@RogerChern

RogerChern commented Jul 8, 2018

Setting Rank and Barrier in forward and backward as separate variables won't resolve the deadlock issue. I suggest instead that we postfix their key parameter with "forward" and "backward".
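To illustrate the idea, a conceptual sketch in Python (the real change would live in the C++ operator; `BarrierRegistry`, `get_barrier`, and the key suffixes are made-up names): keep one keyed registry of barriers and give the forward and backward passes distinct keys, so the two phases can never wait on the same barrier.

```python
import threading

class BarrierRegistry:
    """Toy keyed barrier registry illustrating the forward/backward key split."""

    def __init__(self):
        self._lock = threading.Lock()
        self._barriers = {}

    def get_barrier(self, key, ndev):
        # One barrier per key. Forward and backward use different keys
        # (e.g. "bn0_forward" vs "bn0_backward"), so a worker waiting in
        # forward can never be matched with one waiting in backward.
        with self._lock:
            if key not in self._barriers:
                self._barriers[key] = threading.Barrier(ndev)
            return self._barriers[key]

registry = BarrierRegistry()

def forward_sync(prefix, ndev):
    registry.get_barrier(prefix + "_forward", ndev).wait()

def backward_sync(prefix, ndev):
    registry.get_barrier(prefix + "_backward", ndev).wait()
```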

@RogerChern


RogerChern commented Jul 8, 2018

The gradients of beta and gamma are summed over only a single device, which deviates from both BatchNormV1 and cuDNN BatchNorm, while the gradient of the input is summed over multiple devices. Can somebody shed some light on this issue?

@RogerChern


RogerChern commented Jul 8, 2018

After addressing the issues mentioned, I got exactly the same outputs and gradients (max 1e-5 atol) for BatchNormV1/cuDNN BatchNorm on a single device and SyncBatchNorm on two devices.

@zhanghang1989 zhanghang1989 requested a review from anirudh2290 as a code owner Jul 10, 2018

@zhanghang1989


Contributor

zhanghang1989 commented Jul 10, 2018

Thanks @RogerChern! The comments on the destructor are really helpful.

@zhanghang1989


Contributor

zhanghang1989 commented Jul 11, 2018

@eric-haibin-lin

some minor suggestions

_assert_tensor_close(_find_bn(bn1).running_var.data(ctx_list[0]),
_find_bn(bn2).running_var.data(ctx_list[0]))
input2grad = mx.nd.concat(*[output.grad.as_in_context(input.context) for output in inputs2], dim=0)
#print('input1.grad', input1.grad)


@eric-haibin-lin

eric-haibin-lin Jul 12, 2018

Contributor

Remove unused code


@zhanghang1989

zhanghang1989 Jul 12, 2018

Contributor

Yeah, will do. Thx

_assert_tensor_close(input1.grad, input2grad)
def test_sync_batchnorm():
def get_num_devices():


@eric-haibin-lin

eric-haibin-lin Jul 12, 2018

Contributor

There's test_utils.list_gpus()


@zhanghang1989

zhanghang1989 Jul 12, 2018

Contributor

That is slightly different. list_gpus() doesn’t consider CUDA_VISIBLE_DEVICES
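For illustration, a hedged sketch of a device-count helper that probes GPU contexts one by one and therefore honours CUDA_VISIBLE_DEVICES; the helper actually used in the test may differ:

```python
import mxnet as mx

def get_num_devices():
    """Count usable GPUs by probing contexts, so CUDA_VISIBLE_DEVICES is honoured."""
    n = 0
    while True:
        try:
            # Force synchronization so an unusable or masked device fails here.
            mx.nd.zeros((1,), ctx=mx.gpu(n)).wait_to_read()
            n += 1
        except Exception:
            return n
```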

@@ -1909,6 +1909,91 @@ def test_context_num_gpus():
# Test that num_gpus reports at least one GPU, as the test is run on a GPU host.
assert mx.context.num_gpus() > 0
def _check_batchnorm_result(input, num_devices=1, cuda=False):
from mxnet.gluon.utils import split_and_load
def _assert_tensor_close(a, b, atol=1e-3, rtol=1e-3):


@eric-haibin-lin

eric-haibin-lin Jul 12, 2018

Contributor

will assert_almost_equal do?
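For reference, a minimal usage sketch of the suggested helper from mxnet.test_utils, shown on plain NumPy arrays with the same 1e-3 tolerances as the test's local helper:

```python
import numpy as np
from mxnet.test_utils import assert_almost_equal

a = np.random.rand(3, 4).astype('float32')
b = a + 1e-6  # small perturbation within tolerance
assert_almost_equal(a, b, rtol=1e-3, atol=1e-3)
```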

}
~SharedND() {
mshadow::FreeSpace(&mean_);


@zhreshold

zhreshold Jul 12, 2018

Member

check for data_inited_ before freeing memory


@zhanghang1989

zhanghang1989 Jul 12, 2018

Contributor

I agree. Will make the changes. Thx

}
}
T* Retrieve(mshadow::Shape<1> shape, int index) {


@zhreshold

zhreshold Jul 12, 2018

Member

need doc for these member functions

~GlobalShared() {
for (auto it = registry_.begin(); it != registry_.end(); it++) {
T *ptr = it->second;
delete ptr;


@zhreshold

zhreshold Jul 12, 2018

Member

Again, you have to guarantee that you are deleting valid pointers, since you don't init them in the constructor but in a public function.


@zhanghang1989

zhanghang1989 Jul 12, 2018

Contributor

If not inited, the map should be empty

}
~GlobalSharedRank() {
for (auto it = registry_.begin(); it != registry_.end(); it++) {
T *ptr = it->second;


@zhreshold

zhreshold Jul 12, 2018

Member

same here


@zhanghang1989

zhanghang1989 Jul 12, 2018

Contributor

If not inited, the hash map should be empty


@zhreshold

zhreshold Jul 12, 2018

Member

ok, should be fine

mshadow::Shape2(5, mean.shape_[0]), s);
Tensor<xpu, 1> gmean = workspace[0];
Tensor<xpu, 1> gvar = workspace[1];
// Tensor<xpu, 1> tmp = workspace[2];


@zhreshold

zhreshold Jul 12, 2018

Member

remove unused

@zhreshold


Member

zhreshold commented Jul 12, 2018

Comments added. The rest LGTM now.

@eric-haibin-lin eric-haibin-lin merged commit 3ae4331 into apache:master Jul 14, 2018

1 check passed

continuous-integration/jenkins/pr-merge This commit looks good
@eric-haibin-lin


Contributor

eric-haibin-lin commented Jul 14, 2018

@indhub FYI

@miteshyh


miteshyh commented Jul 17, 2018

The SyncBatchNorm class doesn't seem to be available from the mxnet-cu91 nightly. It's visible for the regular mxnet nightly. Are these changes fully merged?

@eric-haibin-lin


Contributor

eric-haibin-lin commented Jul 17, 2018

@miteshyh mxnet-cu91 is for the stable release. SyncBatchNorm will only appear in the nightly distribution via --pre

@szha


Member

szha commented Jul 17, 2018

@miteshyh would you be able to update and use cu92? I heard from @bhavinthaker that nvidia discontinued support for cu91 so we intend to do the same.

@miteshyh


miteshyh commented Jul 20, 2018

Thanks @szha, I downgraded to cu90 as cu92 doesn't have clean support on my hardware yet, and it works.

However, while training ADE20K with GluonCV I get "socket.error: [Errno 111] Connection refused" after a few iterations (at iteration 551); I have raised a separate issue for this. It happens both with and without SyncBatchNorm.

dmlc/gluon-cv#215

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

[MXNET-614] Adding Synchronized Batch Normalization (apache#11502)
* sync batch norm

* global rank and barrier

* lint

* cpplint

* pylint

* doc

* add ref

* customized barrier

* cpplint

* get rid of pthread

* address comments

* warning

* pylint

* gpu unitest

* gpu 0

* mv to cpu test

* Revert "mv to cpu test"

This reverts commit 24543c9.

* ndev = 2

* debuging

* sum prod

* lint

* contrib, ngpu

* code style

* code style

* forward backward

* test

* cpu test

* fix deconstruction

* doc indent

* doc

* doc

* address comments

* typo

* asnumpy
@jianchao-li


jianchao-li commented Sep 24, 2018

Setting Rank and Barrier in forward and backward as separate variables won't resolve the deadlock issue. I suggest instead that we postfix their key parameter with "forward" and "backward".

Hello, @RogerChern. I also ran into a deadlock issue while training PSPNet on gluon-cv. Regarding the "key parameter" you mentioned above, do you mean the one in this line? Could you please share more details about the fix? Thank you.

@zhanghang1989


Contributor

zhanghang1989 commented Sep 24, 2018

Please set ndev to the number of GPUs used. In GluonCV, pass the parameter --ngpus 4 if you are using 4 GPUs.
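For illustration, a minimal Gluon sketch with num_devices set explicitly; SyncBatchNorm(num_devices=...) is the parameter added by this PR, while the surrounding layers and sizes are arbitrary:

```python
import mxnet as mx
from mxnet.gluon import nn
from mxnet.gluon.contrib.nn import SyncBatchNorm

num_gpus = 4  # must match the number of devices actually used for training
ctx_list = [mx.gpu(i) for i in range(num_gpus)]

net = nn.HybridSequential()
net.add(nn.Conv2D(16, kernel_size=3, padding=1),
        SyncBatchNorm(num_devices=num_gpus),
        nn.Activation('relu'))
net.initialize(ctx=ctx_list)
```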

@jianchao-li


jianchao-li commented Sep 24, 2018

Hello, @zhanghang1989. Thank you for your reply. I will try it tomorrow morning and get back to you with the result.

Update

Hello, @zhanghang1989. I am not quite sure whether you were suggesting that I explicitly set --ngpus 4. I have only 4 GPUs on the machine, and the default value of ngpus is len(mx.test_utils.list_gpus()), which returned 4 in my case. The output of print(args) also confirmed this.
