
[MXNET-614] Adding Synchronized Batch Normalization #11502

Merged
38 commits merged into apache:master on Jul 14, 2018

Conversation

@zhanghang1989 (Contributor) commented Jun 29, 2018

Description

Adding Synchronized Batch Normalization
Thanks @eric-haibin-lin for the great help!

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here
@zhanghang1989 zhanghang1989 requested a review from szha as a code owner Jun 29, 2018
@zhanghang1989 zhanghang1989 changed the title [MXNET-614] Adding Synchronized Batch Normalization [MXNET-614] [WIP] Adding Synchronized Batch Normalization Jun 30, 2018
@zhanghang1989 zhanghang1989 mentioned this pull request Jun 30, 2018
@zhanghang1989 (Contributor, Author) commented Jun 30, 2018

Help Wanted for passing the CI Test!!

@zhanghang1989 zhanghang1989 changed the title [MXNET-614] [WIP] Adding Synchronized Batch Normalization [MXNET-614] [Help Wanted for CI Test] Adding Synchronized Batch Normalization Jun 30, 2018
@zhanghang1989 zhanghang1989 changed the title [MXNET-614] [Help Wanted for CI Test] Adding Synchronized Batch Normalization [MXNET-614] Adding Synchronized Batch Normalization Jul 2, 2018
'ndev': num_devices, 'key': self.prefix}

def _get_num_devices(self):
# Caution: if not using all the GPUs, please mannually set num_devices

@zhreshold (Member) commented Jul 2, 2018

Add the warning to the docstring rather than showing a comment here.
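
For example, a minimal sketch of what that could look like, with the caution moved into the docstring and surfaced as a runtime warning (illustrative only, not the merged code):

import warnings
import mxnet as mx

def _get_num_devices():
    """Return the number of devices to synchronize across.

    Caution: if not all GPUs are used for training, set ``num_devices``
    manually instead of relying on this helper.
    """
    warnings.warn('SyncBatchNorm: if not using all the GPUs, please set num_devices manually')
    num_gpus = mx.context.num_gpus()
    # Fall back to a single device for CPU-only runs.
    return num_gpus if num_gpus > 0 else 1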

#include <dmlc/logging.h>
#include <dmlc/parameter.h>
#include <mxnet/operator.h>
# include <condition_variable>

@zhreshold (Member) commented Jul 2, 2018

space between # and include?

template<class T>
class SharedND {
private:
int nDev;

@zhreshold (Member) commented Jul 2, 2018

The convention for private member variables is a trailing underscore (xxx_)

@zhreshold (Member) commented Jul 2, 2018

and CamelCase for functions, which is correct as it is now

std::lock_guard<std::mutex> lock(mutex_);
auto it = registry_.find(key);
if (it != registry_.end()) return it->second;
T *newT = new T(ndev);

@zhreshold (Member) commented Jul 2, 2018

The memory pointed to by these raw pointers is never released.

This reverts commit 24543c9.
@zhanghang1989 zhanghang1989 requested a review from anirudh2290 as a code owner Jul 10, 2018
@zhanghang1989 (Contributor, Author) commented Jul 10, 2018

Thanks @RogerChern! The comments on the destructor are really helpful.

@zhanghang1989 (Contributor, Author) commented Jul 11, 2018

@eric-haibin-lin (Member) left a review

some minor suggestions

_assert_tensor_close(_find_bn(bn1).running_var.data(ctx_list[0]),
_find_bn(bn2).running_var.data(ctx_list[0]))
input2grad = mx.nd.concat(*[output.grad.as_in_context(input.context) for output in inputs2], dim=0)
#print('input1.grad', input1.grad)

@eric-haibin-lin (Member) commented Jul 12, 2018

Remove unused code

@zhanghang1989 (Contributor, Author) commented Jul 12, 2018

Yeah, will do. Thanks!

_assert_tensor_close(input1.grad, input2grad)

def test_sync_batchnorm():
def get_num_devices():

@eric-haibin-lin (Member) commented Jul 12, 2018

There's test_utils.list_gpus()

@zhanghang1989 (Contributor, Author) commented Jul 12, 2018

That is slightly different: list_gpus() doesn't consider CUDA_VISIBLE_DEVICES.
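
To illustrate the difference, a hypothetical device-counting helper that honors CUDA_VISIBLE_DEVICES when it is set (the name and logic here are illustrative, not the PR's implementation):

import os
import mxnet as mx

def _visible_num_devices():
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is not None:
        # Only the GPUs listed in CUDA_VISIBLE_DEVICES are usable by this process.
        ids = [d for d in visible.split(',') if d.strip()]
        return len(ids) if ids else 1
    # list_gpus() enumerates every GPU on the machine, regardless of the env var.
    gpus = mx.test_utils.list_gpus()
    return len(gpus) if gpus else 1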

@@ -1909,6 +1909,91 @@ def test_context_num_gpus():
# Test that num_gpus reports at least one GPU, as the test is run on a GPU host.
assert mx.context.num_gpus() > 0

def _check_batchnorm_result(input, num_devices=1, cuda=False):
from mxnet.gluon.utils import split_and_load
def _assert_tensor_close(a, b, atol=1e-3, rtol=1e-3):

@eric-haibin-lin (Member) commented Jul 12, 2018

will assert_almost_equal do?
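
For reference, a small sketch of what that substitution could look like (the sample arrays are made up):

import mxnet as mx
from mxnet.test_utils import assert_almost_equal

a = mx.nd.array([1.0, 2.0, 3.0])
b = mx.nd.array([1.0002, 1.9999, 3.0001])
# Raises an AssertionError if any element differs beyond the given tolerances.
assert_almost_equal(a.asnumpy(), b.asnumpy(), atol=1e-3, rtol=1e-3)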

}

~SharedND() {
mshadow::FreeSpace(&mean_);

@zhreshold (Member) commented Jul 12, 2018

check for data_inited_ before freeing memory

@zhanghang1989 (Contributor, Author) commented Jul 12, 2018

I agree. Will make the changes. Thanks!

}
}

T* Retrieve(mshadow::Shape<1> shape, int index) {

@zhreshold (Member) commented Jul 12, 2018

need doc for these member functions

~GlobalShared() {
for (auto it = registry_.begin(); it != registry_.end(); it++) {
T *ptr = it->second;
delete ptr;

@zhreshold (Member) commented Jul 12, 2018

Again, you have to guarantee that you are deleting valid pointers, since they are not initialized in the constructor but in a public function.

@zhanghang1989 (Contributor, Author) commented Jul 12, 2018

If nothing has been initialized, the map is empty, so the loop deletes nothing.

}
~GlobalSharedRank() {
for (auto it = registry_.begin(); it != registry_.end(); it++) {
T *ptr = it->second;

@zhreshold (Member) commented Jul 12, 2018

same here

@zhanghang1989 (Contributor, Author) commented Jul 12, 2018

Same as above: if nothing has been initialized, the hash map is empty.

@zhreshold (Member) commented Jul 12, 2018

ok, should be fine

mshadow::Shape2(5, mean.shape_[0]), s);
Tensor<xpu, 1> gmean = workspace[0];
Tensor<xpu, 1> gvar = workspace[1];
// Tensor<xpu, 1> tmp = workspace[2];

@zhreshold (Member) commented Jul 12, 2018

remove unused

@zhreshold (Member) commented Jul 12, 2018

Comments added. The rest LGTM now.

@eric-haibin-lin eric-haibin-lin merged commit 3ae4331 into apache:master Jul 14, 2018
1 check passed: continuous-integration/jenkins/pr-merge (This commit looks good)

@eric-haibin-lin (Member) commented Jul 14, 2018

@indhub FYI

@miteshyh commented Jul 17, 2018

The SyncBatchNorm class doesn't seem to be available in the mxnet-cu91 nightly; it is visible in the regular mxnet nightly. Are these changes fully merged?

@eric-haibin-lin (Member) commented Jul 17, 2018

@miteshyh mxnet-cu91 is the stable release. SyncBatchNorm will only appear in the nightly distribution, installed via pip --pre.

@szha (Member) commented Jul 17, 2018

@miteshyh would you be able to update and use cu92? I heard from @bhavinthaker that NVIDIA discontinued support for cu91, so we intend to do the same.

@miteshyh commented Jul 20, 2018

Thanks @szha, I downgraded to cu90 since cu92 doesn't yet have clean support on my hardware, and it works.

However, while training on ADE20K with GluonCV I get "socket.error: [Errno 111] Connection refused" after a number of iterations (at iteration 551). I have raised a separate issue for this; it happens both with and without SyncBatchNorm.

dmlc/gluon-cv#215

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* sync batch norm
* global rank and barrier
* lint
* cpplint
* pylint
* doc
* add ref
* customized barrier
* cpplint
* get rid of pthread
* address comments
* warning
* pylint
* gpu unitest
* gpu 0
* mv to cpu test
* Revert "mv to cpu test" (this reverts commit 24543c9)
* ndev = 2
* debuging
* sum prod
* lint
* contrib, ngpu
* code style
* code style
* forward backward
* test
* cpu test
* fix deconstruction
* doc indent
* doc
* doc
* address comments
* typo
* asnumpy

@jianchao-li commented Sep 24, 2018

Setting Rank and Barrier in forward and backward as separate variables won't resolve the deadlock issue. I suggest instead we postfix their key parameter with "forward" and "backward".

Hello, @RogerChern. I also met a deadlock issue while training PSPNet on gluon-cv. For the "key parameter" you mentioned above, do you mean the one in this line? Could you please share more details about the fix? Thank you.

@zhanghang1989 (Contributor, Author) commented Sep 24, 2018

Please set ndev to the number of GPUs used. In GluonCV, pass the parameter --ngpus 4 if you are using 4 GPUs.
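
For illustration, a minimal sketch of constructing a Gluon block with an explicit device count (the surrounding layers are made up, and this assumes 4 GPUs are actually available):

import mxnet as mx
from mxnet.gluon import nn
from mxnet.gluon.contrib.nn import SyncBatchNorm

num_gpus = 4  # must match the number of GPUs actually used for training
net = nn.HybridSequential()
net.add(nn.Conv2D(64, kernel_size=3, padding=1))
net.add(SyncBatchNorm(num_devices=num_gpus))  # statistics are synchronized across the GPUs
net.add(nn.Activation('relu'))
net.initialize(ctx=[mx.gpu(i) for i in range(num_gpus)])

The key point is that num_devices should match the number of contexts the batch is actually split across; a mismatch appears to be what leads to the hangs discussed in this thread.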

@jianchao-li commented Sep 24, 2018

Hello, @zhanghang1989. Thank you for your reply. I will try it tomorrow morning and update you with the result.

Update

Hello, @zhanghang1989. I am not quite sure whether you were suggesting that I explicitly set --ngpus 4. I have only 4 GPUs on the machine, and the default value of ngpus is len(mx.test_utils.list_gpus()), which returned 4 in my case. The output of print(args) also confirmed this.

@pengwangucla commented Apr 17, 2019

Hi Hang, I used your sync_bn implementation with the MXNet Symbol API. However, it reduced the performance of my network. I wonder whether you have ever tried sync_bn with the Symbol API rather than Gluon. Thanks.

@zhanghang1989 (Contributor, Author) commented Apr 17, 2019

Asked here #8458 (comment)

@ngunauj commented Jul 26, 2019

How to use it?

@zhanghang1989 (Contributor, Author) commented Jul 26, 2019

How to use it?

See #11502 (comment)
