
io.ascii.read() uses way too much memory #13092

Open
Gabriel-p opened this issue Apr 9, 2022 · 10 comments
@Gabriel-p
Contributor

Description

io.ascii.read() uses too much memory

Expected behavior

A memory footprint similar to that of pandas

Actual behavior

Uses way more memory to load the exact same file compared to pandas.read_csv()

Steps to Reproduce

  1. Use a somewhat large file (>500 MB)
  2. Load it using ascii.read()
  3. Load it using pd.read_csv()
  4. Compare how much memory each process used (one way to do this is sketched after the snippet below)
file = "some_large_file"
data = pd.read_csv(file, index_col=False)
data = ascii.read(file)
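
One way to do the comparison in step 4 (a minimal sketch, assuming Linux, where ru_maxrss is reported in kilobytes; the file name is the same placeholder as above) is to run each reader in its own process and print the peak RSS afterwards:

# compare_memory.py -- run once per reader, in separate processes,
# since ru_maxrss is a per-process high-water mark
import resource

import pandas as pd
from astropy.io import ascii

file = "some_large_file"  # placeholder from the snippet above

data = pd.read_csv(file, index_col=False)   # or: data = ascii.read(file)

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.0f} MB")

Alternatively, memray (used later in this thread) gives a per-line breakdown of the allocations.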

System Details

Linux-5.5.0-050500-generic-x86_64-with-glibc2.17
Python 3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0]
Numpy 1.21.2
pyerfa 2.0.0
astropy 5.0
Scipy 1.7.3
Matplotlib 3.5.0
@Gabriel-p Gabriel-p added the Bug label Apr 9, 2022
@pllim
Member

pllim commented Apr 14, 2022

Do you have an example file and the benchmarks? Thanks!

@Gabriel-p
Contributor Author

Gabriel-p commented Apr 14, 2022

Data file

The pd.read_csv() call uses less than 2 GB of memory on my system to load the data file. If I call ascii.read() I have to manually kill the process after almost 5 GB are used by it and the data file is still not read.

import pandas as pd
from astropy.io import ascii

path = "upk_53.dat"
data = ascii.read(path, delimiter=' ')
data = pd.read_csv(path, delimiter=' ')

@neutrinoceros
Contributor

The example file given above appears to be lost to the limbo of the cloud.
However, the issue is simple enough to demonstrate with fake data:

# write.py
import numpy as np

arr = np.ones((32*1024, 1024), dtype="float64") # about a GB on disk
np.savetxt("/tmp/test.csv", arr)
# read_pandas.py
import pandas as pd

pd.read_csv("/tmp/test.csv", index_col=False)
# read_astropy.py
from astropy.io import ascii

ascii.read("/tmp/test.csv")

Using memray, I observe that pandas is indeed pretty efficient (consuming ~1.2 GB of RAM for a ~800 MB file), while astropy takes way too much (about 16 GB) for the same file.

memray also pinpoints the problem to the following line:

lines = table.splitlines()

Here, we consume about 8 GB of memory to construct a temporary list of strings each time we run it, and for reasons that are not yet clear to me, it seems that two instances of that list co-exist at some point, and there go our 16 GB.
I do stress that these lists are temporary and are garbage-collected after the reading process is over, so it seems likely that we could avoid the excessive memory consumption by using an iterator instead of a full-blown list. I'm going to give it a try.
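
To illustrate (this is just a sketch with hypothetical helper names, not the actual io.ascii internals): splitting the whole table string at once materialises every line simultaneously, while a generator keeps only one line alive at a time.

# sketch only: hypothetical helpers, not astropy code
def lines_as_list(table):
    # materialises every line at once: O(file size) extra memory
    return table.splitlines()

def lines_as_generator(table):
    # yields one line at a time: roughly constant extra memory
    start = 0
    while start < len(table):
        end = table.find("\n", start)
        if end == -1:
            end = len(table)
        yield table[start:end]
        start = end + 1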

@neutrinoceros
Contributor

Writing a lazy line generator is simple enough; the difficult part is refactoring the internals of io.ascii, because many places expect lines not to be consumed after the first loop, as a generator would be. I'll give it some more time soon (hopefully tomorrow unless something comes up).
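
A toy example of that hurdle (none of this is astropy code): a list of lines can be iterated any number of times, but a generator is exhausted after its first pass, which breaks code paths that loop over the same lines more than once (header detection, format guessing, then data parsing).

# toy example, not astropy code
text = "a b\n1 2\n3 4"

lines = text.splitlines()        # a list: can be looped over repeatedly
assert list(lines) == list(lines)

lines = iter(text.splitlines())  # generator-like: single use
first_pass = list(lines)         # ['a b', '1 2', '3 4']
second_pass = list(lines)        # [] -- already consumed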

@Gabriel-p
Contributor Author

Is there a need for this module to be around? Why not just let pandas handle the data IO?

@pllim
Member

pllim commented May 2, 2024

Why not just let pandas handle the data IO?

pandas does not integrate with all the astropy features, e.g., ECSV, Quantity, etc.

@astrofrog
Member

One way to mitigate this would be to add the ability for io.ascii to load data in chunks and then vstack them at the end?

@astrofrog
Member

Alternatively, what if we allowed lines to be a list-like object which lazily loads data as needed/accessed? That way we might not need to modify a lot of the existing code.
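
A rough sketch of what such a list-like object could look like (hypothetical class, not an existing astropy API; it still keeps one byte offset per line in memory, but no line strings):

# hypothetical sketch of a lazily-loading, list-like "lines" object
class LazyLines:
    def __init__(self, path):
        self._path = path
        self._offsets = [0]
        with open(path, "rb") as fh:
            for line in fh:
                self._offsets.append(self._offsets[-1] + len(line))
        self._offsets.pop()  # drop the offset past the last line

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        # read a single line from disk only when it is accessed
        with open(self._path, "rb") as fh:
            fh.seek(self._offsets[i])
            return fh.readline().decode().rstrip("\n")

# lines = LazyLines("/tmp/test.csv"); lines[5] touches only one line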

@saimn
Contributor

saimn commented May 3, 2024

We already have chunked loading (#6458), and the memory issue is not new (#7871, #3334).
It would indeed be better to use a generator and load lines only when needed, but, as far as I remember from looking at this long ago, that does not work with the headers, the guessing mode, and a few other things where the first lines are used multiple times. It might be fixable, but it is not easy given the number of readers.
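
For reference, the chunked reading from #6458 looks roughly like this (a sketch based on the documented fast-reader options, if I recall the signature correctly: chunk_size is in bytes and chunk_generator=True yields one Table per chunk; the file name is a placeholder):

from astropy.io import ascii
from astropy.table import vstack

chunks = ascii.read(
    "large_table.csv",       # placeholder file name
    format="csv",
    guess=False,
    fast_reader={"chunk_size": 100_000_000, "chunk_generator": True},
)
data = vstack(list(chunks))  # or process each chunk and discard it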

@neutrinoceros
Contributor

Thanks for pointing these out @saimn !
I've started experimenting with a generator and indeed, the first hurdle to overcome is that some lines need to be read more than once, but I'll be thinking about it.
