Memory leak in Table indices #16089

Open

taldcroft opened this issue Feb 21, 2024 · 24 comments

Comments
@taldcroft
Member

taldcroft commented Feb 21, 2024

Description

Repeatedly accessing an indexed Table causes memory use to grow in an unexpected and undesired way. In a real-world application on a large table this caused memory use to exceed 18 GB; removing the table index and repeating the same access code kept memory use below 1 GB. We used memray to watch memory climb continuously during a loop that repeatedly accessed elements of an indexed table.

Expected behavior

Memory use should remain approximately constant after the first access.

How to Reproduce

The following should reproduce the problem. You can use a package like memray to monitor memory or just watch a process monitor for memory use of the Python process. For me this starts with about 180 MB of memory after the first table t is created. After running this, memory use is around 1 GB, whereas I would expect something under 400 MB.

import numpy as np
from astropy.table import MaskedColumn
from astropy.table.table_helpers import simple_table
from astropy.time import Time
from tqdm import tqdm

size = 250000
t = simple_table(size=size, cols=26)
idxs = Time(np.random.randint(0, size // 20, size=size), format="cxcsec").isot
t["idx"] = MaskedColumn(idxs)  # THIS IS THE PROBLEM
t.add_index(["idx"])

idxs = np.random.choice(t["idx"], size=100, replace=False)
for idx in tqdm(idxs):
    star_obs = t[t["idx"] == idx]

Versions

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import astropy; print("astropy", astropy.__version__)
import numpy; print("Numpy", numpy.__version__)
import erfa; print("pyerfa", erfa.__version__)
try:
    import scipy
    print("Scipy", scipy.__version__)
except ImportError:
    print("Scipy not installed")
try:
    import matplotlib
    print("Matplotlib", matplotlib.__version__)
except ImportError:
    print("Matplotlib not installed")
macOS-14.2.1-x86_64-i386-64bit
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:27:35) [Clang 14.0.6 ]
astropy 5.3.1
Numpy 1.23.5
pyerfa 2.0.0.1
Scipy 1.10.0
Matplotlib 3.6.3
@neutrinoceros
Contributor

I can reproduce this (on macOS too), so that's a start.
First, I attempted to run garbage collection every 10th iteration of the loop (just because it's easy to test): no change, so it seems safe to conclude the problem isn't trivial (though garbage collection isn't completely out of the picture; maybe some unreachable reference cycles are being generated).
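
In terms of the MWE, that experiment looked roughly like this (sketch only):

import gc

for i, idx in enumerate(tqdm(idxs)):
    star_obs = t[t["idx"] == idx]
    if i % 10 == 0:
        gc.collect()  # forcing collection every 10th iteration: no visible change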

Memray indeed shows a slow but steady growth in resident memory, while the heap size stays flat.
I found that with memray's default behaviour you don't get much more detail (most of the allocations are not traced).
Enabling the --native flag allows tracing allocations from C/C++ extensions, which seems to be what we want here; however, there are a number of known limitations for this on macOS:

quoting https://bloomberg.github.io/memray/native_mode.html

For the best native mode experience, we recommend running your program on Linux using an interpreter and libraries built with as much debugging information as possible.

in general native mode symbolification will not work well in macOS.

While trying to use it, I was also unfortunate enough to stumble upon what now looks like a CPython bug (reported and discussed at bloomberg/memray#553). Switching to Python 3.12.2 resolved the problem, so I'm now able to get a first view of native allocations. Here's the script I'm using (with t.py containing the MWE from @taldcroft):

# t.sh
set -euxo pipefail
rm report.bin memray-flamegraph-report.html || true
python -m memray run -o report.bin --native t.py
python -m memray flamegraph report.bin
open memray-flamegraph-report.html

This took me long enough to figure out, which is why I'm reporting at an early stage. I will now try to actually inspect the profile and see if it contains enough information to find the bug (or get a sense of where to look more closely).

If it doesn't suffice, running with CPython + numpy + astropy all compiled with debug symbols would be necessary; however, I know that for numpy this is significantly simpler on Linux (macOS is supposed to be supported too, but I could never get anything out of it).
I was planning to set up a Linux VM at some point; this may be the excuse I've been waiting for. Before I do that, though, I want to ask @MridulS whether he happens to be in a better starting position to try this.

@neutrinoceros
Contributor

Update: the basic profile I get with --native but a non-debug interpreter + numpy isn't really giving anything useful. What I don't understand yet is that it seems to only trace heap allocations, which is not where the leak shows up.

@taldcroft
Member Author

My suspicion is that the index is generating some kind of reference cycle that doesn't get collected. I'm not sure how to approach this, but I did see a package called refcycle that might be useful.
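
As a first check without extra dependencies, something like this (a rough sketch using the names from the MWE) could tell us whether an old slice stays alive after being rebound:

import gc
import weakref

star_obs = t[t["idx"] == idxs[0]]
ref = weakref.ref(star_obs)   # weak reference to the first slice
star_obs = t[t["idx"] == idxs[1]]
gc.collect()
print("previous slice still alive:", ref() is not None)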

@taldcroft
Member Author

@neutrinoceros - one thing I just thought about: you might try using the SCEngine sorting engine (instead of the default sorted array) and see if the leak persists. That might help localize the problem. See: https://docs.astropy.org/en/stable/table/indexing.html#engines
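
Something along these lines (untested sketch; SCEngine needs the sortedcontainers package installed):

from astropy.table import SCEngine

t.remove_indices("idx")                # drop the existing index from the MWE
t.add_index(["idx"], engine=SCEngine)  # rebuild it with the SCEngine backend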

@MridulS
Contributor

MridulS commented Feb 22, 2024

I'm trying to dig into the code here, and yes another +1 I guess on this issue :)

Plus SCEngine seems to make it worse!

@MridulS
Contributor

MridulS commented Feb 22, 2024

Alright, I did some more digging with memray and it doesn't seem like setting the index makes any difference (?)
I ran the following script and have attached the resident and heap size graphs.

from astropy.table.table_helpers import simple_table
import numpy as np

t = simple_table(size=250000, cols=26)
# t.add_index('a')
for _ in range(2_000):
    t[np.random.randint(240000, size=250000)]

Without index: [screenshot: memray resident memory and heap size plot]

With index (it just took a lot more time): [screenshot: memray resident memory and heap size plot]

Is there any other minimal example I can try playing around with?

@taldcroft
Member Author

@MridulS - can you try with t2 = t[np.random.randint(240000, size=250000)]? I know that should not make a difference, but in my testing it seemed like it did. This sort of testing is always tricky, and maybe I got confused.

@taldcroft
Member Author

@MridulS - if you are still having trouble reproducing, I can try adapting the original problem. Of course it is more complicated and depends on data I can't export. In that case, removing the index line changed memory use from 18 GB to 1 GB.

@MridulS
Contributor

MridulS commented Feb 23, 2024

I am not able to properly reproduce this, at least with memray diagnostics. I would assume that if I ran this bit of code for a lot longer, the resident memory usage should keep increasing (?). It seems to level off after a certain time.

In #16089 (comment) I was just using htop to look at memory usage of the ipython process and the memory peaking was visible, but I guess I didn't have a proper quantitative diagnostic for it.

I tried to tweak the script a bit more and now I hardly see anything off?

from astropy.table.table_helpers import simple_table
import numpy as np

t = simple_table(size=2500000, cols=26)
t.add_index('a')
slice_ = np.random.randint(240000, size=2500000)

for _ in range(400):
    t2 = t[slice_]

[screenshot: memray resident memory and heap size plot]

Maybe I am missing something here @neutrinoceros any thoughts?

@pllim
Member

pllim commented Feb 23, 2024

Any chance this is hardware specific? This was reported in OP:

macOS-14.2.1-x86_64-i386-64bit
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:27:35) [Clang 14.0.6 ]

@taldcroft
Member Author

It's been seen in the wild on both Linux and Mac.

@neutrinoceros
Contributor

Thanks @MridulS for taking a look!

I tried to tweak the script a bit more and now I hardly see anything off?

The main difference I see here is that instead of generating a new selector at each iteration, we generate just one and access it repeatedly in a loop. I also do not see a leak with this version of the script, though I don't have a clear idea yet why that would make a difference.
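
Spelled out, the two access patterns being compared are roughly (sketch, reusing the setup from your latest script):

import numpy as np
from astropy.table.table_helpers import simple_table

t = simple_table(size=2500000, cols=26)
t.add_index('a')

for _ in range(400):
    t2 = t[np.random.randint(240000, size=2500000)]   # a fresh selector every iteration

sel = np.random.randint(240000, size=2500000)
for _ in range(400):
    t2 = t[sel]                                        # one selector, reused (as in your latest script)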

I would assume if I ran this bit of code for a lot more time, the resident memory usage should keep on increasing (?) It seems to level off a certain time.

Even though Python's garbage collector is, understandably, not very efficient at collecting dead reference cycles, I believe it does catch up with them, just at a lower frequency than regular collections, so maybe that's why the leak saturates at some point? (This is still just guessing.)

Anyway, I think proper inspection will indeed require heavy machinery. I'll make a serious attempt at it this week!

@taldcroft
Member Author

@neutrinoceros @MridulS - since you are not reproducing the really troublesome leak we saw, why don't you hold off on further debugging of this until I get you a better example.

@neutrinoceros
Contributor

Works for me.

@taldcroft
Member Author

taldcroft commented Feb 26, 2024

@neutrinoceros @MridulS - sorry for not understanding the problem a little better before opening this, but I think the example (now updated in the description) reliably generates the memory leak. I see it consume about 2.4 GB each time I run the final for loop.

The missing ingredient before was the MaskedColumn, but I also made this more like our original example. I'm not sure exactly what is important but maybe having a wider string column as the index column is contributing to more memory use.

@taldcroft
Member Author

One obvious mitigation is to have add_index not allow using a MaskedColumn.
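
Something like this, roughly (hypothetical guard for illustration, not a proposed final API):

from astropy.table import MaskedColumn

def add_index_checked(table, colnames, **kwargs):
    """Refuse to build an index that involves a MaskedColumn (illustration only)."""
    if isinstance(colnames, str):
        colnames = [colnames]
    for name in colnames:
        if isinstance(table[name], MaskedColumn):
            raise TypeError(f"cannot index masked column {name!r} (see #16089)")
    table.add_index(colnames, **kwargs)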

@neutrinoceros
Contributor

neutrinoceros commented Feb 26, 2024

Thanks a lot! I confirm that the updated example leaks very efficiently on my machine too. I also confirm that using

t["idx"] = MaskedColumn(idxs).data.data

avoids the leak.
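
In terms of the MWE, I believe this is equivalent to assigning the underlying ndarray directly, so that the index is built on a plain (unmasked) Column:

t["idx"] = np.asarray(idxs)   # plain ndarray, so the column is a regular Column
t.add_index(["idx"])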

@neutrinoceros
Contributor

Also tried

t["idx"] = np.ma.MaskedArray(idxs)

And this still reproduces the leak, so I'm starting to suspect that the bug lives in numpy.
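
A pure-numpy experiment that could test this theory (just a sketch, no Table involved):

import numpy as np
from tqdm import tqdm

size = 250000
vals = np.array([f"val_{i:06d}" for i in np.random.randint(0, size // 20, size=size)])
arr = np.ma.MaskedArray(vals)
for v in tqdm(np.random.choice(vals, size=100, replace=False)):
    sel = arr[arr == v]   # does repeated boolean indexing of a masked string array leak?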

@MridulS
Contributor

MridulS commented Feb 26, 2024

Ohhhh, this produces a beautiful leak.
[screenshot: memray memory plot]

@mhvk
Contributor

mhvk commented Feb 27, 2024

@neutrinoceros - assigning a MaskedArray to a table gives a MaskedColumn - so no joy there...
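
That is:

t["idx"] = np.ma.MaskedArray(idxs)
print(type(t["idx"]))   # -> astropy.table.column.MaskedColumn, so astropy is still in the loop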

@neutrinoceros
Contributor

Ah, of course. Then there's a chance we're just producing reference cycles in the Python layer. I'll dig in this direction today!

@pablogsal

👋 Hey, memray author and Python core dev here. I ran your example with native symbols and debug info and this is what I get:
[screenshot: memray flamegraph showing ~2.1 GB allocated at astropy/table/column.py:529]

So it looks like most of the memory (2.1 GB) is allocated at astropy/table/column.py:529. You can easily reproduce this using docker:

$ docker run --platform linux/amd64 -v $PWD:/src --rm -it ubuntu:latest bash
$ apt-get update && apt-get install -y python3-numpy debuginfod python3-pip
$ python3 -m pip install astropy memray
$ export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com"
$ python3 -m memray run -o output.bin --trace-python-allocators --native example.py
$ python3 -m memray flamegraph output.bin --leaks

@pablogsal

pablogsal commented Feb 27, 2024

If you want to give the reference-cycle theory a go, you can use gc.set_debug (https://docs.python.org/3/library/gc.html#gc.set_debug) with gc.DEBUG_LEAK (https://docs.python.org/3/library/gc.html#gc.DEBUG_LEAK) to confirm which objects are part of cycles.
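
For example (sketch, using the names from the MWE above):

import gc

gc.set_debug(gc.DEBUG_LEAK)   # implies DEBUG_SAVEALL, so collected objects end up in gc.garbage
for idx in tqdm(idxs):
    star_obs = t[t["idx"] == idx]
gc.collect()
print(len(gc.garbage), "objects found by the collector")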

@neutrinoceros
Contributor

Thanks a ton @pablogsal!

@MridulS @taldcroft just to keep you posted, I haven't forgotten about this. I'm just prioritising #16070 for now, but this one is the next item on my list!
