GPU memory leak when using gluon.data.DataLoader with num_workers>0 #20959
Some additional resources I've found:
One additional finding: the memory leak happens with the default thread_pool option set to False (i.e., when using multiprocessing.Pool). If I switch to the ThreadPool, there is no memory leak anymore! This could be a good indicator of an issue in shared memory.
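For reference, a minimal sketch of the two configurations being compared; the dataset and batch size here are placeholder assumptions, not from the original report:

```python
import numpy as np
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(np.random.rand(1024, 32).astype('float32'))  # placeholder dataset

# Default: process-based workers via multiprocessing.Pool (thread_pool=False) -- the leaking path in this issue.
proc_loader = DataLoader(dataset, batch_size=128, num_workers=4)

# Workaround: thread-based workers via ThreadPool -- no leak observed, but slower data preparation.
thread_loader = DataLoader(dataset, batch_size=128, num_workers=4, thread_pool=True)
```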
@ptrendx not that it's your fault, but you had the most recent commit in the dataloader code: https://github.com/apache/incubator-mxnet/blob/v1.9.x/python/mxnet/gluon/data/dataloader.py#L654-L657. Do you have any thoughts about this memory leak issue?
We are checking this issue. Thanks for the feedback @ann-qin-lu. It does look more related to the multiprocessing package.
@ann-qin-lu I reproduced the same behavior you mentioned in your comments. We don't have a conclusion yet. If you need a workaround, using thread_pool might be a good choice. Thanks.
Hi @TristonC, thanks a ton for looking into the issue. I tried the thread_pool option, and it did work without a memory leak. However, since the thread_pool option is slower at preparing the data, I do observe increased end-to-end latency (mostly during validation time). My production use cases are very sensitive to training time, and we'd still like to explore the multiprocessing.Pool option (assuming the memory leak issue can be resolved soon). Do you have any hunch about what changes in CUDA/cuDNN might lead to this issue?
@ann-qin-lu Not that I am aware of. I will seek help from the relevant NVIDIA teams. Stay tuned.
Tested the above script on two MXNet nightly builds:
This may indicate that the issue was introduced by commits made between 02/27 and 03/01. Possibly: 8041c0d
After more deep dives, this issue is actually not caused by the CUDA upgrade from 10 to 11, but introduced by this specific commit: Remove cleanup on side threads, which skips CUDA deinitialization when the engine is destructed. I've confirmed that after reverting this commit, the memory leak is gone. I'll work with the MXNet team to see whether this commit should be reverted in both the MXNet master and 1.9 branches. (Another user reported a similar memory issue when using multiprocessing and also tried reverting this commit.) Here is the open issue for better handling of engine destruction, which needs to be addressed first if the above workaround is to be reverted.
@TristonC False alarm on the CUDA version. Thanks a lot for your help!
Thanks @ann-qin-lu for your update. I will address this issue with the MXNet team soon.
I wonder if you might be interested in digging a little deeper, @ann-qin-lu. It seems the current Gluon DataLoader uses fork to start worker processes in the multiprocessing.Pool(...) call (the default on Unix-like systems). That might be part of the problem here, since the child process inherits everything from its parent process. It might be a good idea to use spawn instead of fork for this call. Unfortunately, I ran into an issue that blocks my test of multiprocessing.get_context('spawn').Pool(...):

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/opt/mxnet/python/mxnet/gluon/data/dataloader.py", line 58, in rebuild_ndarray
return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 193, in _new_from_shared_mem
check_call(_LIB.MXNDArrayCreateFromSharedMemEx(
File "/opt/mxnet/python/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../src/storage/./cpu_shared_storage_manager.h", line 179
MXNetError: Check failed: ptr != ((void *) -1) (0xffffffffffffffff vs. 0xffffffffffffffff) : Failed to map shared memory. mmap failed with error Permission denied
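The change being tested above amounts to replacing the fork-based worker pool with a spawn-based one. A rough, standalone sketch of the difference (not the actual dataloader.py code):

```python
import multiprocessing

def work(i):
    # stand-in for the DataLoader worker function, which fetches and batchifies samples
    return i * i

if __name__ == '__main__':
    # Default on Unix-like systems: fork -- the child inherits everything from the parent,
    # including any CUDA/engine state that was already initialized.
    fork_pool = multiprocessing.Pool(2)
    print(fork_pool.map(work, range(4)))
    fork_pool.close()

    # Alternative: spawn -- the child starts from a fresh interpreter and inherits nothing,
    # but everything passed to the workers must be picklable; the shared-memory NDArray
    # rebuild shown in the traceback above fails along this path.
    spawn_pool = multiprocessing.get_context('spawn').Pool(2)
    print(spawn_pool.map(work, range(4)))
    spawn_pool.close()
```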
It looks like the memory leak in the above script is due to instantiating multiple dataloader objects in the for loop. Having one dataloader object seems to mitigate the issue:
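A sketch of that mitigation, assuming a placeholder dataset: create the DataLoader once, outside the epoch loop, and reuse it.

```python
import numpy as np
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(np.random.rand(4096, 32).astype('float32'))  # placeholder dataset

# Create the DataLoader (and its worker pool) once and reuse it across epochs,
# instead of instantiating a new one inside the loop.
loader = DataLoader(dataset, batch_size=128, num_workers=4)

for epoch in range(10):
    for batch in loader:
        pass  # training step would go here
```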
Hi @waytrue17, thanks for sharing the info above. Yep, skipping the re-creation of the DataLoader for each epoch does prevent this issue, but in my use case I need to shard the big dataset into smaller ones each epoch, so the DataLoader has to be created multiple times. Side question: could you share more insight into how the workaround commit, which skips the GPU memory cleanup in the NaiveEngine, affects the DataLoader usage pattern?
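For illustration, the usage pattern in question looks roughly like this; the sharding scheme is a hypothetical stand-in for the real one:

```python
import numpy as np
from mxnet.gluon.data import ArrayDataset, DataLoader

big_data = np.random.rand(20000, 32).astype('float32')  # placeholder for the full dataset
num_epochs = 4

for epoch in range(num_epochs):
    # each epoch trains on a different shard of the data, so a fresh
    # DataLoader (and a fresh worker pool) is created every time
    shard = ArrayDataset(big_data[epoch::num_epochs])
    loader = DataLoader(shard, batch_size=128, num_workers=4)
    for batch in loader:
        pass  # training step would go here
```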
@TristonC I think your error is due to the fact that the DataLoader uses shared memory to hold the dataset. I am not sure if using spawn is compatible with that shared-memory mechanism.
The workaround skips the cleanup for all engines, not just the NaiveEngine. So the general problem here is that when you create the DataLoader, it creates a pool of workers by forking the main process, which copies everything, including the engine and the resources it holds. The forked process then destroys its copy of the engine to become a much leaner DataLoader worker. That would normally destroy the stream the engine uses, but with the workaround commit in place, the destruction of the stream does not happen. Now, the real problem is that CUDA does not in fact survive forking, and the fact that it seems to work is just a lucky coincidence. That is why the spawn method should be used to fix the DataLoader: with spawn, the worker processes do not inherit anything from the parent and start from a clean state, with nothing copied and nothing to destroy.
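A small standalone illustration of the fork problem described above (assuming an MXNet build with GPU support; the child's behavior after fork is undefined, so this is only a demonstration):

```python
import os
import mxnet as mx

# The parent initializes CUDA, just as the main training process does
# before the DataLoader forks its workers.
a = mx.nd.ones((2, 2), ctx=mx.gpu(0))
a.wait_to_read()

pid = os.fork()
if pid == 0:
    # Child: the inherited CUDA context is not valid after fork. Using it may
    # fail outright, or appear to work only by lucky coincidence.
    try:
        b = mx.nd.ones((2, 2), ctx=mx.gpu(0))
        b.wait_to_read()
        print('child: GPU call appeared to work (not guaranteed)')
    except Exception as e:
        print('child: GPU call failed after fork:', e)
    os._exit(0)
else:
    os.waitpid(pid, 0)
```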
Hi @ptrendx, thanks a lot for the explanation! Now I have a much clearer picture of what's going wrong. If the actual root cause is that "CUDA does not in fact survive forking", does it mean multiprocessing with fork should be avoided here altogether? Just a quick summary of the two approaches we discussed:
We recommend the first solution that @ann-qin-lu proposed.
@DickJC123 will help follow up on this issue.
Description
GPU memory leak when using gluon.data.DataLoader after upgrading to CUDA 11.1 / cuDNN 8.2.x (also tested with the latest CUDA 11.5 + cuDNN 8.3.x, which still leaks). Minimal code to reproduce is attached below.
No memory leak with the older CUDA version (CUDA 10.1 + cuDNN 7.6.5).
Error Message
GPU memory keeps increasing during training.
To Reproduce
Steps to reproduce
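The original repro script is not preserved here; a minimal sketch of the pattern described in this thread (a new DataLoader with num_workers>0 created inside the epoch loop, with batches copied to the GPU) would look roughly like:

```python
import numpy as np
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

ctx = mx.gpu(0)
dataset = ArrayDataset(np.random.rand(4096, 32).astype('float32'))  # placeholder dataset

for epoch in range(100):
    # a new DataLoader per epoch with process-based workers (thread_pool=False is the default)
    loader = DataLoader(dataset, batch_size=128, num_workers=4)
    for batch in loader:
        x = batch.as_in_context(ctx)  # copy the batch to GPU
        x.wait_to_read()
    # with the affected builds, GPU memory reported by nvidia-smi grows every epoch
```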
What have you tried to solve it?
Environment
We recommend using our script for collecting the diagnostic information with the following command:
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3
Environment Information