
sync_bn #44

Open
MJITG opened this issue Sep 4, 2019 · 4 comments

Comments

@MJITG

MJITG commented Sep 4, 2019

Hi author, I'm using sync_bn in one of my stereo matching projects, but training always freezes after just a few iterations (GPU memory stays occupied, yet nothing happens). My environment is ubuntu + RTX 2080Ti + cuda10, and my syncbn usage follows your code. What could be going wrong?
#23 My situation seems to be the same as yours in that issue.

@feihuzhang
Owner

It's probably a hidden bug in sync_bn, though I currently cannot reproduce it on my own ubuntu + RTX 2080 + cuda10 setup.
I need some time to fix it.

Temporary solutions:

  1. change the sync_bn to the unsynchronized built-in "torch.nn.BatchNorm2d" / "torch.nn.BatchNorm3d"
  2. try to replace the sync_bn with another third-party implementation:
    refer to Model stuck in the forward pass #35
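Workaround 1 can be done at the import level. A minimal sketch, assuming the repo imports its sync_bn classes under the names BatchNorm2d/BatchNorm3d; the commented-out original import path is an assumption about the repo layout, not copied from the code:

```python
# Workaround 1: use the unsynchronized built-in BatchNorm instead of sync_bn.
# The original import below is a guess at the repo layout; adjust it to
# whatever GANet_deep.py actually imports.
# from libs.sync_bn.modules.sync_bn import BatchNorm2d, BatchNorm3d
from torch.nn import BatchNorm2d, BatchNorm3d  # drop-in, per-GPU statistics
```

Because the built-in classes accept the same `num_features` argument, the rest of the model code can stay unchanged; the trade-off is that batch statistics are no longer synchronized across GPUs.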

@feihuzhang
Owner

It's a hidden bug in sync_bn. Here is a possible temporary workaround:
Adding some unused layers sometimes helps avoid the training-freeze problem.
For example, uncomment or add an unused layer like the following to SGABlock (around line 257 in "GANet_deep.py"):
self.conv_refine = BasicConv(8, 8, is_3d=True, kernel_size=1, padding=1)
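The workaround only changes which modules are registered, not the forward computation. A minimal self-contained illustration of the idea; `SGABlock` and `BasicConv` are approximated here with plain `nn.Conv3d`, so only the `conv_refine` idea comes from the comment above:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Minimal stand-in for SGABlock. The extra conv_refine layer is
    # registered in __init__ but never called in forward(), mirroring
    # the "useless layer" workaround above.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(8, 8, kernel_size=3, padding=1)
        self.conv_refine = nn.Conv3d(8, 8, kernel_size=1)  # unused on purpose

    def forward(self, x):
        return self.conv(x)  # conv_refine intentionally not applied

block = Block()
x = torch.randn(1, 8, 4, 6, 6)
y = block(x)
# The output is unaffected by the unused layer, but its parameters are
# still registered, which changes memory layout and kernel scheduling.
```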

@Walims

Walims commented Sep 30, 2019

Hi author, I tried adding some unused layers, but training still gets stuck in the first few iterations. Is there any other way to solve this problem?

@feihuzhang
Owner

Solution:
Replace the sync_bn with another implementation.

  1. Install NVIDIA-Apex package https://github.com/NVIDIA/apex
    $ git clone https://github.com/NVIDIA/apex
    $ cd apex
    $ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Revise the "GANet_deep.py":
    add import apex
    change all BatchNorm2d and BatchNorm3d to apex.parallel.SyncBatchNorm
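Step 2 can be sketched as a small helper that walks the model and swaps the BatchNorm layers; `apex.parallel.SyncBatchNorm` accepts the same `num_features` argument and handles both 2-D and 3-D inputs. The helper itself is an assumption, not code from the repo:

```python
import torch.nn as nn

def convert_batchnorm(module, bn_cls):
    # Recursively replace BatchNorm2d/BatchNorm3d children with bn_cls,
    # keeping the same number of features. Affine parameters and running
    # statistics are re-initialized, so do this before training.
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm2d, nn.BatchNorm3d)):
            setattr(module, name, bn_cls(child.num_features))
        else:
            convert_batchnorm(child, bn_cls)
    return module

# Usage (assumes apex is installed as in step 1):
#   import apex
#   model = convert_batchnorm(model, apex.parallel.SyncBatchNorm)
```

Editing "GANet_deep.py" directly, as the comment above suggests, has the same effect; the helper just avoids touching every layer definition by hand.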
