
[BUGFIX] Fix segfault at training exit #21182

Merged
josephevans merged 2 commits into apache:v1.9.x from DickJC123:pr_fix_segfault_at_exit on Feb 28, 2023

Conversation

@DickJC123
Contributor

@DickJC123 DickJC123 commented Feb 17, 2023

Description

This provides an updated fix for issue #19379. To understand this fix, a bit of history is needed:

  • At some point (possibly the arrival of ubuntu 20.04) the destruction order at program exit of the CUDA context and MXNet's singleton engine became non-deterministic. When the engine destruction occurred after the CUDA context destruction, a segfault would occur due to the release of CUDA resources to a non-existent context.
  • @ptrendx supplied the fix of not destroying MXNet's Stream objects at exit in master PR Remove cleanup on side threads #19378, which also was back-ported to v1.x.
  • Since that time, improvements to CUDA, possibly starting with CUDA 11.2, have eliminated this susceptibility. CUDA 10.2 and 11.0 are confirmed to still be susceptible.
  • A different issue, "GPU memory leak when using gluon.data.DataLoader with num_workers>0" (#20959), was found to be due to the lack of CUDA resource cleanup at exit. As a result, @ptrendx's PR was reverted on the v1.9.x branch by "[v1.9.x] Revert #20916" (#20998). Due to the improvements in CUDA, most users did not experience the return of the original segfault-at-exit problem.
  • But clearly, for users still on CUDA 10.2 or 11.0, the segfault behavior has returned (see recent posts to issue "Better handling of the engine destruction" #19379).

This PR resupplies the segfault fix of not destroying MXNet's Stream objects, but only in the parent process (detected by shutdown_phase_ == true). If the main process is exiting, the CUDA runtime will take care of releasing the resources. This PR does not affect the release of resources in child threads (shutdown_phase_ == false). Thus, the destruction of Streams in the dataloader child processes is unchanged, and the memory leak problem of #20959 will not resurface.
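A minimal sketch of the idea, not the actual MXNet source: GpuStream, DeleteStream, and StreamPool below are illustrative stand-ins for the real mshadow/engine types, and only the shutdown_phase_ guard reflects the approach described above.

// Illustrative sketch only, assuming a stream-teardown path roughly like MXNet's;
// all names here are hypothetical stand-ins, not the library's actual API.
#include <atomic>
#include <vector>

struct GpuStream {};               // stand-in for mshadow::Stream<gpu>

void DeleteStream(GpuStream* s) {  // stand-in for mshadow::DeleteStream;
  delete s;                        // the real call releases CUDA resources
}

class StreamPool {
 public:
  // Set to true by the engine when the main (parent) process begins exiting.
  std::atomic<bool> shutdown_phase_{false};

  void Add(GpuStream* s) { streams_.push_back(s); }

  ~StreamPool() {
    for (GpuStream* s : streams_) {
      // Parent-process exit: the CUDA runtime reclaims the resources itself,
      // and destroying the stream here could touch an already-destroyed CUDA
      // context and segfault, so the explicit teardown is skipped.
      // Child processes and threads (shutdown_phase_ == false) still clean up,
      // so the dataloader leak of #20959 does not return.
      if (!shutdown_phase_.load()) {
        DeleteStream(s);
      }
    }
  }

 private:
  std::vector<GpuStream*> streams_;
};

int main() {
  StreamPool pool;
  pool.Add(new GpuStream());
  pool.shutdown_phase_ = true;  // simulate parent-process exit: teardown skipped
  return 0;
}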

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Regarding testing, I was first able to repro the segfault on a DGX1V (8 V100's) running the mxnetci/build.ubuntu_gpu_cu110 container with the command sequence:

$ HOROVOD_WITH_MXNET=1 pip install horovod[mxnet]
$ python3 /work/mxnet/example/distributed_training-horovod/module_mnist.py

It is possible the segfault only occurs with horovod use. I verified that the segfault occurred consistently with MXNET_ENGINE_TYPE set to all 3 possibilities: ThreadedEnginePerDevice, ThreadedEngine, and NaiveEngine. After the fix of the PR was applied, the segfault did not appear for any of the engines.

@mxnet-bot

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-gpu, windows-cpu, website, miscellaneous, centos-gpu, windows-gpu, edge, clang, sanity, centos-cpu, unix-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-awaiting-testing (PR is reviewed and waiting CI build and test) label Feb 17, 2023
Contributor

@samskalicky samskalicky left a comment


Thanks for the quick fix @DickJC123 !

@mseth10 mseth10 added pr-work-in-progress (PR is still work in progress) and removed pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels Feb 17, 2023
Contributor

@waytrue17 waytrue17 left a comment


Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959

@DickJC123
Contributor Author

I went back to the repro python program given in #20959 and have verified that this PR does not inadvertently reintroduce the memory leak. Output produced:

num_workers: 0 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 0 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 4 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 0: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 1: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 2: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 3: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.
num_workers: 8 epoch 4: current gpu memory 1.334716796875 GB, Total gpu memory 15.78173828125 GB.

@mseth10 mseth10 added pr-awaiting-testing (PR is reviewed and waiting CI build and test) and removed pr-work-in-progress (PR is still work in progress) labels Feb 22, 2023
@DickJC123
Contributor Author

Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959

To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'? Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)? Do we have multi-GPU testing in CI?

@waytrue17
Contributor

Thanks for the fix! Shall we add a test to cover it? Maybe something like the example in #20959

To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'? Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)? Do we have multi-GPU testing in CI?

It seems the CI installs horovod only on master (see here), and it was disabled due to a hanging issue.
Approved the PR since we have verified the fix with local testing. Let's bring it in and unblock the customers.

@mseth10 mseth10 added pr-work-in-progress (PR is still work in progress) and removed pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels Feb 23, 2023
@mseth10 mseth10 added pr-awaiting-testing (PR is reviewed and waiting CI build and test), pr-awaiting-review (PR is waiting for code review) and removed pr-work-in-progress (PR is still work in progress), pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels Feb 27, 2023
@chinakook
Contributor

LGTM

@josephevans josephevans merged commit 76d73db into apache:v1.9.x Feb 28, 2023

Labels

pr-awaiting-review (PR is waiting for code review)


7 participants