[BUGFIX] Fix segfault at training exit #21182
Conversation
Hey @DickJC123 , Thanks for submitting the PR
CI supported jobs: [unix-gpu, windows-cpu, website, miscellaneous, centos-gpu, windows-gpu, edge, clang, sanity, centos-cpu, unix-cpu]
samskalicky left a comment
Thanks for the quick fix @DickJC123 !
I went back to the repro python program given in #20959 and have verified that this PR does not inadvertently reintroduce the memory leak. Output produced:
To clarify @waytrue17, are you asking for the example in #20959 repackaged into a 'dataloader leak unittest'? Regarding a unittest targeting the segfault issue, do you know if the CI builds have horovod installed (a requirement)? Do we have multi-GPU testing in CI?
It seems the CI installs horovod only on master (here), and it was disabled due to a hanging issue.
LGTM |
Description
This provides an updated fix for issue #19379. To understand this fix, a bit of history is needed:
This PR reapplies the segfault fix, skipping the destruction of MXNet's Stream objects, but only in the parent process (detected by shutdown_phase_ == true). When the main process is exiting, the CUDA runtime takes care of releasing these resources. This PR does not affect the release of resources in child processes (shutdown_phase_ == false). Thus, the destruction of Streams in the dataloader child processes is unchanged, and the data memory leak problem of #20959 will not resurface.
Checklist
Essentials
Regarding testing, I was first able to repro the segfault on a DGX1V (8 V100s) running the mxnetci/build.ubuntu_gpu_cu110 container with the command sequence:
It is possible the segfault only occurs with horovod use. I verified that the segfault occurred consistently with MXNET_ENGINE_TYPE set to each of the three possible values: ThreadedEnginePerDevice, ThreadedEngine, and NaiveEngine. After the PR's fix was applied, the segfault did not appear for any of the engines.
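For readers following along, the shutdown-phase guard from the Description can be sketched as below. This is a hypothetical illustration, not MXNet's actual code: only the names Stream and shutdown_phase_ come from the PR text, while StreamManager, Add(), and g_destroyed are invented here to make the guard's behavior observable.

```cpp
#include <cassert>
#include <vector>

// Counts Stream destructions so the two teardown paths can be distinguished.
static int g_destroyed = 0;

struct Stream {
  ~Stream() { ++g_destroyed; }
};

class StreamManager {
 public:
  void Add(Stream* s) { streams_.push_back(s); }
  void set_shutdown_phase(bool v) { shutdown_phase_ = v; }
  ~StreamManager() {
    if (shutdown_phase_) {
      // Parent process is exiting: deliberately skip Stream destruction and
      // let the CUDA runtime reclaim device resources at process exit.
      // Destroying them at this point is what triggered the segfault of #19379.
      return;
    }
    // Child process (e.g. a dataloader worker): destroy Streams normally,
    // so the memory leak of #20959 does not resurface.
    for (Stream* s : streams_) delete s;
  }

 private:
  std::vector<Stream*> streams_;
  bool shutdown_phase_ = false;
};
```

The design point is that the guard is asymmetric on purpose: leaking Streams is safe only in the exiting parent (the OS and CUDA runtime clean up), while child workers are created and torn down repeatedly, so they must keep releasing their Streams.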