Unmanaged (Old) memory hanging #6232
Comments
Have you tried using the new Active Memory Manager? When I enable it and run your gist notebook on my local machine, I get the following plot. I'm using dask and distributed 2022.01.0. FYI: cc @crusaderky
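(For reference, a minimal sketch of what enabling the AMM from the client can look like, assuming an already-running scheduler; the address below is a placeholder:)

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
client.amm.start()   # start the Active Memory Manager on the scheduler
# ... run the computation being profiled ...
client.amm.stop()    # stop it again afterwards
```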
@josuemtzmo your notebook gist doesn't seem to reproduce the issue. With the same number of workers, threads per worker, and memory limit, for me it finishes in 3.5s instead of your 2m5s (reading from local SSD). Does the generator code for test.nc in the gist produce the full sized data? Or did you make it smaller before you published it?
This is barely more than just xarray and all its dependencies:

```
$ python
>>> import xarray, psutil
>>> psutil.Process().memory_info().rss / 2**20  # RSS in MiB
126.58203125
```

What magnitude of leak are we talking about? Each worker holds its logs in a fixed-size, fairly long deque, so it's normal to see a bit of an increase towards the beginning.
Thanks for the suggestions. @hayesgb I had no idea about the Active Memory Manager. Although it improves things a bit by reducing the memory usage of the cluster, the issue persists (see figures below). @crusaderky I ran the notebook both on my local computer (macOS) and on the server referred to in the issue, and the results are the same. Furthermore, I ran only the code provided in #6241 and the issue is still there on both my local and server machines.

Run on SLES Linux (server) using dask 2022.4.1 on Python 3.10.4 (I created a new env from scratch to test this). On my personal computer (macOS, ARM chip) using dask 2022.4.1 on Python 3.9.12 (new env too).

Furthermore, if I rerun the same code on the same cluster, the unmanaged (old) memory continues to increase... (two runs with a 15-second pause in between them)
The substantial growth of unmanaged memory on macOS is a well known issue: #5840. On SLES, what your plots are showing me is that, in the middle of a computation, the unmanaged memory is ~10 GiB, or ~1.4 GiB per worker. Considering that it includes the heap of the tasks and the network buffers, this is perfectly normal and healthy. If it's too much for you, you need to reduce your chunk size.
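(For illustration, a minimal sketch of reducing the chunk size when opening the dataset; the file name and dimension names are taken from elsewhere in this thread and are otherwise assumptions:)

```python
import xarray as xr

# Smaller chunks mean a smaller per-task heap, and therefore less transient
# unmanaged memory per worker while the computation is running.
ds = xr.open_dataset("test.nc", chunks={"time": 100, "x": 500, "y": 500})
result = ds.mean(("x", "y")).compute()
```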
Run it 10 times and I expect it to plateau. To recap:
- unmanaged memory that appears while a computation is running (the tasks' heap plus network buffers) is normal and healthy;
- some growth at the beginning (e.g. the workers' log deques) and the macOS-specific growth from #5840 are known and expected;
- unmanaged memory that keeps growing run after run, without ever plateauing, would be an actual problem.
P.S. I notice that your SLES notebook lost the `MALLOC_TRIM_THRESHOLD_` setting.
After running it several times, here is the memory usage graph for 10 runs; effectively, it plateaus. The drop at the end occurs when the manual trim is performed. I guess I expected the memory to be released more often.
To clarify: are you doing this on Linux? On macOS that variable won't do anything.
To clarify, I'm testing the `MALLOC_TRIM_THRESHOLD_` variable on the Linux (SLES) server.
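(A minimal sketch, under the assumption of glibc-based Linux workers, of getting the variable into the environment before the worker processes are spawned; the worker count and value here are arbitrary:)

```python
import os

# glibc reads MALLOC_TRIM_THRESHOLD_ at process start-up, so it must be set
# before the workers are created; it has no effect on macOS.
os.environ["MALLOC_TRIM_THRESHOLD_"] = "16384"

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
```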
[EDIT]: false alarm. See pangeo-data/rechunker#100 (comment) for the solution to my problem. Previous message below for reference. Just to add my use case here: on Linux, and after trying all the solutions I could gather here, I have exactly the same problem with huge memory leaks, even in a "fits in memory" dask problem: https://nbviewer.ipython.org/gist/fmaussion/5212e3155256e84e53d033e61085ca30
@crusaderky's healthy/not-healthy list would be a pretty great addition to the docs!
1500 * 1500 * 8 bytes ≈ 17 MiB, which is much larger than the `MALLOC_TRIM_THRESHOLD_` of 16 KiB, so, at face value, this is not expected.
This definitely must not happen, but it's unrelated to memory management. Could you open a new ticket with a reproducer, logs, and a scheduler dump?
We are facing this issue too. I tried the snippet shared earlier:

```python
import ctypes

import dask.array as da
from dask.distributed import get_client
from distributed.diagnostics import MemorySampler

client = get_client('10.10.10.1:8786')


def trim_memory() -> int:
    # manually release free glibc arenas back to the OS
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


client.amm.start()

a = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
b = (a @ a.T).mean()

ms = MemorySampler()
with ms.sample("With AMM"):
    b.compute()

client.run(trim_memory)
ms.plot(figsize=(15, 10), grid=True)
```

The memory on the workers increases from around 20 GB to 50 GB and keeps growing steadily. With

```python
a = da.random.random((200_000, 200_000), chunks=(20_000, 20_000))
```

the memory usage crossed 80 GB and stayed there, even with the manual trim. I have tried setting `MALLOC_TRIM_THRESHOLD_` as well. This is an issue for us, because we run DAGs which do significantly more processing than the above snippet, and the memory utilization of the cluster crosses 250 GB; we have a cluster capacity of around 2 TB of memory. I tried the suggestions in this issue and also tried basic jemalloc tuning, but nothing has helped so far. I would like to understand if we can quickly address this.
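(Not part of the original comment, but as a follow-up sketch: `MemorySampler` can overlay several labelled samples on one plot, which makes it easy to compare the same computation before and after the manual trim. This reuses `a`, `client`, and `trim_memory` from the snippet above:)

```python
ms2 = MemorySampler()

with ms2.sample("first run"):
    (a @ a.T).mean().compute()

client.run(trim_memory)  # trim between the two runs

with ms2.sample("second run, after malloc_trim"):
    (a @ a.T).mean().compute()

# Each labelled sample becomes a separate line on the same plot
ms2.plot(figsize=(15, 10), grid=True)
```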
@n3rV3 does the unmanaged memory stay there when there are no tasks running at all? |
Q: Does the unmanaged memory stay there when there are no tasks running at all?
Q: Does it stay there after you release all your futures and persisted collections (so that your managed memory becomes 0)?
Q: What's your dask version?
Q: What's the capacity per worker?
Q: What kind of cluster does your client connect to? (dask-kubernetes, coiled, in-house...?)
Q: On what OS do the workers run?
The dashboard must show managed memory = 0 in the top left corner. Once you have no keys in memory: how much memory per node are you talking about, at rest? If it's significant (>2 GiB), you are likely suffering from a memory leak, potentially something that dask is not responsible for. What libraries are you calling from your tasks?
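(A small sketch of how the "memory at rest" question can be answered, assuming the scheduler address from the earlier snippet; it reports the RSS of every worker process once no futures or persisted collections are alive:)

```python
import psutil
from dask.distributed import get_client

client = get_client("10.10.10.1:8786")  # address reused from the snippet above

def worker_rss_mib() -> float:
    # RSS of the worker process, in MiB
    return psutil.Process().memory_info().rss / 2**20

# With managed memory at 0 on the dashboard, this shows what each worker
# still holds on to at rest.
print(client.run(worker_rss_mib))
```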
What happened:
I am using Xarray and Dask to analyse large datasets (100s of GB), reducing their dimensions from f(x,y,z,t) to f(t) by taking averages or sums. Several times I've encountered that Unmanaged (Old) memory hangs around in the cluster memory until it kills the client.

Below is a MemorySampler plot for one of my analyses (a spatial average). As shown in the graph, the cluster memory never decreases towards its pre-run state (it looks like a memory leak, but the computation is only `dataset.mean()`).
I've tried setting `MALLOC_TRIM_THRESHOLD_` to 16284 and to 0, and the issue persists. I've also tried manually trimming memory:
However, this usually only frees <1 GB, and the cluster's unmanaged memory remains on the order of tens of GB.
In order to better report this issue, I've attached an example that leaves around 1GB of unmanaged (old) memory (I'm not sure if this is related to issue dask/dask#3530)
https://gist.github.com/josuemtzmo/3620e01c9caf88c18109809c52d77180
After lazy loading a test dataset, the process memory is ~159.01 MiB according to `psutil`. After running the computation (a mean over space, i.e. f(t,x,y).mean(('x','y'))), the process memory is ~256.83 MiB according to `psutil`. This increase in memory is expected; however, the cluster memory remains at 1.27 GiB, where each of the workers has around 180 MiB of Unmanaged (Old) memory hanging.

Running the manual memory trim only decreases the cluster memory from 1.26 GiB to 1.23 GiB. If the same code is run several times, the unmanaged memory continues to increase.
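(For completeness, a hedged sketch of the workflow described above; the file, dimension names, and local cluster are assumptions, and the gist remains the authoritative version:)

```python
import xarray as xr
from dask.distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()  # local cluster, stand-in for the one in the gist
ds = xr.open_dataset("test.nc", chunks={"time": 100})

ms = MemorySampler()
with ms.sample("spatial mean"):
    # f(t, x, y) -> f(t): average over the spatial dimensions
    ds.mean(("x", "y")).compute()

ms.plot(grid=True)
```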
What you expected to happen:
Environment:
Thanks!