io.ascii.read() uses way too much memory #13092
Comments
Do you have an example file and the benchmarks? Thanks!
The example file given above appears to be lost to the limbo of the cloud.

```python
# write.py
import numpy as np

arr = np.ones((32 * 1024, 1024), dtype="float64")  # about a GB on disk
np.savetxt("/tmp/test.csv", arr)
```

```python
# read_pandas.py
import pandas as pd

pd.read_csv("/tmp/test.csv", index_col=False)
```

```python
# read_astropy.py
from astropy.io import ascii

ascii.read("/tmp/test.csv")
```

Using memray, I observe that pandas is indeed pretty efficient (consuming ~1.2 GB of RAM for a ~800 MB file), while astropy takes, indeed, way too much (about 16 GB) for the same file. memray also pinpoints the problem to the following line: astropy/astropy/io/ascii/core.py, line 336 in cec24e8.
Here, we consume about 8 GB of memory to construct a temporary list of strings each time we run it, and for reasons that are not yet clear to me, it seems that two instances of that list co-exist at some point, and there go our 16 GB. I do stress that these lists are temporary, and are garbage-collected after the reading process is over, so it seems likely that we could avoid that excessive memory consumption by using an iterator instead of a full-blown list. I'm going to give it a try.
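When memray isn't at hand, the peak-memory gap between the two patterns described above can be roughly demonstrated with the stdlib's `tracemalloc` (this is my own illustrative sketch, not part of the issue; `peak_memory` is a hypothetical helper):

```python
import tracemalloc

def peak_memory(fn):
    """Return the peak traced allocation (in bytes) while running fn()."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

lines = ["1.0 2.0 3.0\n"] * 50_000

# Materializing a stripped copy of every line keeps all the new strings
# alive at once, on top of the originals...
list_peak = peak_memory(lambda: [ln.strip() for ln in lines])

# ...while draining a generator keeps only one stripped line alive at a time.
gen_peak = peak_memory(lambda: sum(1 for _ in (ln.strip() for ln in lines)))

print(f"list peak: {list_peak:,} B, generator peak: {gen_peak:,} B")
```

On a real GB-scale file the same ratio is what makes the list at core.py:336 so expensive.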
Writing a lazy line generator is simple enough; the difficult part is refactoring the internals of
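A minimal sketch of such a lazy line generator (names are mine for illustration, not astropy's actual internals):

```python
def iter_lines(path):
    """Lazily yield cleaned lines one at a time, instead of materializing
    the whole file as a list of strings the way the eager
    `[line.strip() for line in ...]` pattern does."""
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if line:  # skip blank lines, as a basic splitter might
                yield line
```

Downstream code that only iterates would work unchanged; anything that calls `len()` on the line list, indexes into it, or makes multiple passes is what needs refactoring, which is presumably the hard part.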
Is there a need for this module to be around? Why not just let pandas handle the data IO?
One way to mitigate this would be to add the ability for io.ascii to load data in chunks and then vstack them at the end? |
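That chunked approach can be sketched in plain numpy (a hypothetical helper of my own, not astropy's actual chunked reader): parse a bounded number of lines at a time, so only one chunk of raw text is alive at once, then stack the results at the end.

```python
import io
from itertools import islice

import numpy as np

def read_chunked(path, chunk_rows=10_000):
    """Parse a whitespace-delimited numeric file chunk_rows lines at a
    time, then vstack the chunks at the end."""
    chunks = []
    with open(path) as fh:
        while True:
            block = list(islice(fh, chunk_rows))  # at most one chunk of text
            if not block:
                break
            # ndmin=2 keeps single-row chunks two-dimensional for vstack.
            chunks.append(np.loadtxt(io.StringIO("".join(block)), ndmin=2))
    return np.vstack(chunks)
```

The trade-off is that the parsed chunks still all coexist before the final `vstack`, so the saving is on the raw-text side, not the numeric side.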
Alternatively, what if we allowed lines to be a list-like object which lazily loads data as needed/accessed? That way we might not have to modify a lot of existing code.
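One way to sketch that list-like object (purely illustrative; `LazyLines` is my name, not an astropy class): index the file once, store only byte offsets, and decode a line only when it is accessed.

```python
from collections.abc import Sequence

class LazyLines(Sequence):
    """A read-only, list-like view over a file's lines, loaded on access."""

    def __init__(self, path):
        self._path = path
        # Byte offsets of line starts: a few bytes per line instead of
        # keeping every line's text in memory.
        self._offsets = []
        with open(path, "rb") as fh:
            pos = 0
            for raw in fh:
                self._offsets.append(pos)
                pos += len(raw)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        # Integer indices only in this sketch; slices are left out.
        with open(self._path, "rb") as fh:
            fh.seek(self._offsets[i])
            return fh.readline().decode().rstrip("\r\n")
```

Because it satisfies the `Sequence` interface (`len`, integer indexing, iteration), much existing list-consuming code could take it unmodified, at the cost of re-reading a line from disk on every access.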
We already have chunked loading (#6458), and the memory issue is not new (#7871, #3334).
Thanks for pointing these out, @saimn!
Description

io.ascii.read() uses too much memory.

Expected behavior

A memory footprint similar to that of pandas.

Actual behavior

Uses way more memory to load the exact same file compared to pandas.read_csv().

Steps to Reproduce

Compare ascii.read() and pd.read_csv() on the same large file and observe their memory usage.

System Details