
[BUGFIX] Fix segfault at training exit #21182

Merged
josephevans merged 2 commits into apache:v1.9.x from DickJC123:pr_fix_segfault_at_exit on Feb 28, 2023

Conversation

@DickJC123
Contributor

@DickJC123 DickJC123 commented Feb 17, 2023

Description

This provides an updated fix for issue #19379. To understand this fix, a bit of history is needed:

  • At some point (possibly the arrival of ubuntu 20.04) the destruction order at program exit of the CUDA context and MXNet's singleton engine became non-deterministic. When the engine destruction occurred after the CUDA context destruction, a segfault would occur due to the release of CUDA resources to a non-existent context.
  • @ptrendx supplied the fix of not destroying MXNet's Stream objects at exit in master PR Remove cleanup on side threads #19378, which also was back-ported to v1.x.
  • Since that time, improvements to CUDA, possibly starting with CUDA 11.2, have eliminated this susceptibility. CUDA 10.2 and 11.0 are confirmed to still be susceptible.
  • A different issue, "GPU memory leak when using gluon.data.DataLoader with num_workers>0" (#20959), was found to be due to the lack of CUDA resource cleanup at exit. As a result, @ptrendx's PR was reverted on the v1.9.x branch by "[v1.9.x] Revert #20916" (#20998). Due to the improvements in CUDA, most users did not experience the return of the original segfault-at-exit problem.
  • But clearly, for users still on CUDA 10.2 or 11.0, the segfault behavior has returned (see recent posts to issue "Better handling of the engine destruction" #19379).

This PR resupplies the segfault fix of not destroying MXNet's Stream objects, but only in the parent process (detected by shutdown_phase_ == true). If the main process is exiting, the CUDA runtime will take care of releasing the resources. This PR does not affect the release of resources in child threads (shutdown_phase_ == false). Thus, the destruction of Streams in the dataloader child processes is unchanged, and the memory leak problem of #20959 will not resurface.
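A minimal sketch of the idea, not the actual MXNet source: GpuStream, DeleteStream, and StreamPool below are illustrative stand-ins for the real mshadow/engine types, and only the shutdown_phase_ guard reflects the approach described above.

// Illustrative sketch only, assuming a stream-teardown path roughly like MXNet's;
// all names here are hypothetical stand-ins, not the library's actual API.
#include <atomic>
#include <vector>

struct GpuStream {};               // stand-in for mshadow::Stream<gpu>

void DeleteStream(GpuStream* s) {  // stand-in for mshadow::DeleteStream;
  delete s;                        // the real call releases CUDA resources
}

class StreamPool {
 public:
  // Set to true by the engine when the main (parent) process begins exiting.
  std::atomic<bool> shutdown_phase_{false};

  void Add(GpuStream* s) { streams_.push_back(s); }

  ~StreamPool() {
    for (GpuStream* s : streams_) {
      // Parent-process exit: the CUDA runtime reclaims the resources itself,
      // and destroying the stream here could touch an already-destroyed CUDA
      // context and segfault, so the explicit teardown is skipped.
      // Child processes and threads (shutdown_phase_ == false) still clean up,
      // so the dataloader leak of #20959 does not return.
      if (!shutdown_phase_.load()) {
        DeleteStream(s);
      }
    }
  }

 private:
  std::vector<GpuStream*> streams_;
};

int main() {
  StreamPool pool;
  pool.Add(new GpuStream());
  pool.shutdown_phase_ = true;  // simulate parent-process exit: teardown skipped
  return 0;
}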

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Regarding testing, I was first able to repro the segfault on a DGX1V (8 V100's) running the mxnetci/build.ubuntu_gpu_cu110 container with the command sequence:

$ HOROVOD_WITH_MXNET=1 pip install horovod[mxnet]
$ python3 /work/mxnet/example/distributed_training-horovod/module_mnist.py

It is possible the segfault only occurs with horovod use. I verified that the segfault occurred consistently with MXNET_ENGINE_TYPE set to all 3 possibilities: ThreadedEnginePerDevice, ThreadedEngine, and NaiveEngine. After the fix of the PR was applied, the segfault did not appear for any of the engines.

@mxnet-bot

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-gpu, windows-cpu, website, miscellaneous, centos-gpu, windows-gpu, edge, clang, sanity, centos-cpu, unix-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-awaiting-testing (PR is reviewed and waiting CI build and test) label Feb 17, 2023
Contributor

@samskalicky samskalicky left a comment


Thanks for the quick fix @DickJC123 !

@mseth10 mseth10 added pr-work-in-progress (PR is still work in progress) and removed pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels Feb 17, 2023
Contributor

@waytrue17 waytrue17 left a comment


Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959

@DickJC123
Contributor Author

I went back to the repro python program given in #20959 and have verified that this PR does not inadvertently reintroduce the memory leak. Output produced:

num_workers: 0 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.

@mseth10 mseth10 added pr-awaiting-testing (PR is reviewed and waiting CI build and test) and removed pr-work-in-progress (PR is still work in progress) labels Feb 22, 2023
@DickJC123
Contributor Author

Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959

To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'? Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)? Do we have multi-GPU testing in CI?

@waytrue17
Contributor

Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959

To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'? Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)? Do we have multi-GPU testing in CI?

It seems the CI installs horovod only on master (see here), and it was disabled due to a hanging issue.
Approved the PR since we have verified the fix with local testing. Let's bring it in and unblock the customers.

@mseth10 mseth10 added pr-work-in-progress (PR is still work in progress) and removed pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels Feb 23, 2023
@mseth10 mseth10 added pr-awaiting-testing (PR is reviewed and waiting CI build and test), pr-awaiting-review (PR is waiting for code review) and removed pr-work-in-progress (PR is still work in progress), pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels Feb 27, 2023
@chinakook
Contributor

LGTM

@josephevans josephevans merged commit 76d73db into apache:v1.9.x Feb 28, 2023

Labels

pr-awaiting-review (PR is waiting for code review)


7 participants