
io.ascii.read() uses way too much memory #13092

Open
Gabriel-p opened this issue Apr 9, 2022 · 10 comments
@Gabriel-p
Contributor

Description

io.ascii.read() uses too much memory

Expected behavior

A memory footprint similar to that of pandas

Actual behavior

Uses way more memory to load the exact same file compared to pandas.read_csv()

Steps to Reproduce

  1. Use a somewhat large file (>500 MB)
  2. Load it using ascii.read()
  3. Load it using pd.read_csv()
  4. Compare how much memory each process used (one way to do this is sketched after the snippet below)
file = "some_large_file"
data = pd.read_csv(file, index_col=False)
data = ascii.read(file)
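
One way to do the comparison in step 4 (a minimal sketch, assuming Linux, where ru_maxrss is reported in kilobytes; the file name is the same placeholder as above) is to run each reader in its own process and print the peak RSS afterwards:

# compare_memory.py -- run once per reader, in separate processes,
# since ru_maxrss is a per-process high-water mark
import resource

import pandas as pd
from astropy.io import ascii

file = "some_large_file"  # placeholder from the snippet above

data = pd.read_csv(file, index_col=False)   # or: data = ascii.read(file)

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.0f} MB")

Alternatively, memray (used later in this thread) gives a per-line breakdown of the allocations.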

System Details

Linux-5.5.0-050500-generic-x86_64-with-glibc2.17
Python 3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0]
Numpy 1.21.2
pyerfa 2.0.0
astropy 5.0
Scipy 1.7.3
Matplotlib 3.5.0
@Gabriel-p Gabriel-p added the Bug label Apr 9, 2022
@pllim
Member

pllim commented Apr 14, 2022

Do you have an example file and the benchmarks? Thanks!

@Gabriel-p
Contributor Author

Gabriel-p commented Apr 14, 2022

Data file

The pd.read_csv() call uses less than 2 GB of memory on my system to load the data file. If I call ascii.read() I have to manually kill the process after almost 5 GB are used by it and the data file is still not read.

import pandas as pd
from astropy.io import ascii

path = "upk_53.dat"
data = ascii.read(path, delimiter=' ')
data = pd.read_csv(path, delimiter=' ')

@neutrinoceros
Contributor

The example file given above appears to be lost to the limbo of the cloud.
However, the issue is simple enough to demonstrate with fake data:

# write.py
import numpy as np

arr = np.ones((32*1024, 1024), dtype="float64") # about a GB on disk
np.savetxt("/tmp/test.csv", arr)
# read_pandas.py
import pandas as pd

pd.read_csv("/tmp/test.csv", index_col=False)
# read_astropy.py
from astropy.io import ascii

ascii.read("/tmp/test.csv")

Using memray, I observe that pandas is indeed pretty efficient (consuming ~1.2 GB of RAM for a ~800 MB file), while astropy takes way too much (about 16 GB) for the same file.

memray also pinpoints the problem to the following line:

lines = table.splitlines()

Here, we consume about 8 GB of memory to construct a temporary list of strings each time we run it, and for reasons that are not yet clear to me, it seems that two instances of that list co-exist at some point, and there go our 16 GB.
I do stress that these lists are temporary and are garbage-collected after the reading process is over, so it seems likely that we could avoid the excessive memory consumption by using an iterator instead of a full-blown list. I'm going to give it a try.
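
To illustrate (this is just a sketch with hypothetical helper names, not the actual io.ascii internals): splitting the whole table string at once materialises every line simultaneously, while a generator keeps only one line alive at a time.

# sketch only: hypothetical helpers, not astropy code
def lines_as_list(table):
    # materialises every line at once: O(file size) extra memory
    return table.splitlines()

def lines_as_generator(table):
    # yields one line at a time: roughly constant extra memory
    start = 0
    while start < len(table):
        end = table.find("\n", start)
        if end == -1:
            end = len(table)
        yield table[start:end]
        start = end + 1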

@neutrinoceros
Contributor

Writing a lazy line generator is simple enough; the difficult part is refactoring the internals of io.ascii, because many places expect lines not to be consumed after the first loop, as a generator would be. I'll give it some more time soon (hopefully tomorrow unless something comes up).
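
A toy example of that hurdle (none of this is astropy code): a list of lines can be iterated any number of times, but a generator is exhausted after its first pass, which breaks code paths that loop over the same lines more than once (header detection, format guessing, then data parsing).

# toy example, not astropy code
text = "a b\n1 2\n3 4"

lines = text.splitlines()        # a list: can be looped over repeatedly
assert list(lines) == list(lines)

lines = iter(text.splitlines())  # generator-like: single use
first_pass = list(lines)         # ['a b', '1 2', '3 4']
second_pass = list(lines)        # [] -- already consumed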

@Gabriel-p
Contributor Author

Is there a need for this module to be around? Why not just let pandas handle the data IO?

@pllim
Member

pllim commented May 2, 2024

Why not just let pandas handle the data IO?

pandas does not integrate with all the astropy features, e.g., ECSV, Quantity, etc.

@astrofrog
Member

One way to mitigate this would be to add the ability for io.ascii to load data in chunks and then vstack them at the end?

@astrofrog
Member

Alternatively, what if we allowed lines to be a list-like object which lazily loads data as needed/accessed? That way we might not need to modify a lot of the existing code.
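
A rough sketch of what such a list-like object could look like (hypothetical class, not an existing astropy API; it still keeps one byte offset per line in memory, but no line strings):

# hypothetical sketch of a lazily-loading, list-like "lines" object
class LazyLines:
    def __init__(self, path):
        self._path = path
        self._offsets = [0]
        with open(path, "rb") as fh:
            for line in fh:
                self._offsets.append(self._offsets[-1] + len(line))
        self._offsets.pop()  # drop the offset past the last line

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        # read a single line from disk only when it is accessed
        with open(self._path, "rb") as fh:
            fh.seek(self._offsets[i])
            return fh.readline().decode().rstrip("\n")

# lines = LazyLines("/tmp/test.csv"); lines[5] touches only one line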

@saimn
Contributor

saimn commented May 3, 2024

We already have chunked loading (#6458), and the memory issue is not new (#7871, #3334).
It would indeed be better to use a generator and load lines only when needed, but, as far as I remember from looking at this long ago, that does not work with the headers, the guessing mode, and a few other things where the first lines are used multiple times. It might be fixable, but it is not easy given the number of readers.
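
For reference, the chunked reading from #6458 looks roughly like this (a sketch based on the documented fast-reader options, if I recall the signature correctly: chunk_size is in bytes and chunk_generator=True yields one Table per chunk; the file name is a placeholder):

from astropy.io import ascii
from astropy.table import vstack

chunks = ascii.read(
    "large_table.csv",       # placeholder file name
    format="csv",
    guess=False,
    fast_reader={"chunk_size": 100_000_000, "chunk_generator": True},
)
data = vstack(list(chunks))  # or process each chunk and discard it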

@neutrinoceros
Contributor

Thanks for pointing these out @saimn !
I've started experimenting with a generator and indeed, the first hurdle to overcome is that some lines need to be read more than once, but I'll be thinking about it.
