FITS: Performance issue #4259
Comments
Note that I tried whether gunzipping the file makes any difference; it doesn't, AFAICS.
Weird--the data loaded immediately for me. You'll have to tell me more about the Python version and the platform.
Just to make sure I have the right data, is the table in question supposed to have only one row?
How do you define row?
Just look at:

```python
['PRODUCT_ID',
 'COLLECTION_ID',
 'BUNDLE_ID',
 'CODE_SVN_REVISION',
 'ANC_SVN_REVISION',
 'PRODUCT_CREATION_DATE',
 'OBSERVATION_TYPE',
 'MISSION_PHASE',
 'TARGET_NAME',
 'ORBIT_SEGMENT',
 'ORBIT_NUMBER',
 'SOLAR_LONGITUDE',
 'GRATING_SELECT',
 'KEYHOLE_SELECT',
 'BIN_PATTERN_INDEX',
 'CADENCE',
 'INT_TIME',
 'DUTY_CYCLE',
 'CHANNEL',
 'WAVELENGTH',
 'WAVELENGTH_WIDTH',
 'KERNELS']
```
It's really weird; even with 1-element fields it takes about 3 seconds to get the prompt back when I do:

```python
t0 = time.time()
print(data['MISSION_PHASE'])
print(time.time() - t0)
```

```
['PRIME']
3.31076288223
```
Yes, there's one row with many multi-element fields. Loading the table, or any individual column, took under a second for me. Have you ever used kern_prof?
Just so that you don't think it's my computer, I did the same with another file:

```
# first timer
0.189756155014
# second timer
['PRIME']
0.00368285179138
```
No experience with kern_prof, but willing to learn. I tried with astropy master, to no avail.
It's not that I think it's your computer. I just want to narrow down things like the Numpy version, etc. The OS could matter too. Actually, instead of going straight to kernprof (since I'm not really sure which functions you need to be profiling), it might be nice if you made a simple test script that simply opens that file and loads the data (such that it demonstrates the slowdown), and then run it under cProfile.
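In case it helps, here's a self-contained sketch of that workflow using the stdlib `cProfile`/`pstats` modules. `load_table` is a stand-in for the real FITS read; the commented-out `fits.open` call and filename are placeholders, not from the issue:

```python
import cProfile
import io
import pstats

def load_table():
    # Stand-in for the real workload, which would look something like:
    #   from astropy.io import fits
    #   data = fits.open('observation.fits')[6].data  # placeholder path
    # A tiny structured array keeps this sketch self-contained.
    import numpy as np
    arr = np.zeros(1, dtype=[('MISSION_PHASE', 'S8'), ('ORBIT_NUMBER', 'i4')])
    return arr['MISSION_PHASE']

profiler = cProfile.Profile()
profiler.enable()
load_table()
profiler.disable()

# Sort by cumulative time so the expensive call chain floats to the top
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(15)
report = stream.getvalue()
print(report)
```

The cumulative-time sort is what makes the slow path obvious: whichever `astropy`/`numpy` internals dominate `cumtime` are the ones worth a closer look.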
I used pandas.show_versions() for the current stuff in miniconda:

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
```
Oh I see, this is your problem:
Let me see if this is fixed in #4228 |
Wow, never would have thought that. Thanks!
If you downgrade to Numpy 1.9 for now you'll have no problem.
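A small sketch of how one might guard against the affected version at runtime until a fix lands; the version check and warning message are my own, not part of astropy:

```python
import numpy as np

# Parse just the major/minor components of the NumPy version string
major, minor = (int(part) for part in np.__version__.split('.')[:2])

# NumPy 1.10.x is the series with the recarray field-access regression
affected = (major, minor) == (1, 10)
if affected:
    print('NumPy 1.10 detected; FITS table access may be very slow (astropy#4259)')
else:
    print('NumPy %s is not the known-slow 1.10 series' % np.__version__)
```

Pinning `numpy>=1.9,<1.10` in the environment achieves the same thing declaratively.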
If you are still interested in the cProfile output, it's here: https://gist.github.com/michaelaye/cbec1c5c9e4e6dcf306e
Yep: 0.025 seconds, maaaaan. 👊
What kind of data are affected by this? Is this only a problem with FITS? I want to put a warning out to my planetary buddies...
Thanks! The good news is we know this is specific to Numpy 1.10, so at least there's a simple workaround. The bad news is the issue does not appear to be fixed by #4228 and I have no idea what's causing it. The worse news is I'm going on vacation starting tomorrow so if it's not obvious this will have to wait to be fixed :\ (which is too bad because I wanted to promise Numpy 1.10 support in the next release--if I can't find an obvious fix to this we'll have to say that Numpy 1.10 works but has performance issues on FITS tables).
I'm not sure yet... Certainly not most data; otherwise this would have been noticed sooner.
I also noticed that whatever is going on here, it's going crazy with memory usage too, topping out at as much as 2.4 GB while trying to read this file. Which is absurd, because the file itself is only 25 MB uncompressed.
(Even considering that on Python 3 reading string columns quadruples the size of those columns, it's still absurd.)
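For context on the quadrupling: NumPy stores fixed-width unicode strings as UCS-4, i.e. four bytes per character, versus one byte per character for byte strings. A minimal illustration (the column width and length here are made up):

```python
import numpy as np

# FITS stores string cells as fixed-width bytes; on Python 3 they are
# commonly presented as fixed-width unicode, which NumPy stores as UCS-4.
bytes_col = np.zeros(1000, dtype='S8')   # 8 bytes per element
unicode_col = bytes_col.astype('U8')     # 4 bytes per character -> 32 bytes

ratio = unicode_col.itemsize // bytes_col.itemsize
print('unicode column is %dx larger per element' % ratio)
```

That 4x factor alone doesn't explain 25 MB ballooning to 2.4 GB, which is why the excess looks like a separate bug.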
Well, thanks for your interactive help, it surely removes some hurdles for now! Try to enjoy your vacation nevertheless! ;)
```
        ncalls  tottime  percall  cumtime  percall filename:lineno(function)
             1    0.001    0.001  906.673  906.673 test.py:1(<module>)
            66    3.165    0.048  850.936   12.893 {method 'getfield' of 'numpy.ndarray' objects}
---->       66    5.955    0.090  847.764   12.845 _internal.py:425(_getfield_is_safe)
        249/89    0.001    0.000  620.248    6.969 decorators.py:507(__get__)
             1    0.000    0.000  620.248  620.248 table.py:401(data)
             1    0.000    0.000  620.248  620.248 table.py:151(_get_tbdata)
---->      112  339.191    3.028  618.556    5.523 _internal.py:372(_check_field_overlap)
            44    0.001    0.000  568.663   12.924 records.py:535(field)
         99/63    0.000    0.000  310.548    4.929 {method 'view' of 'numpy.ndarray' objects}
             1    0.000    0.000  309.512  309.512 table.py:184(_init_tbdata)
           5/3    0.000    0.000  285.600   95.200 numeric.py:1737(array_str)
           9/3    0.000    0.000  285.600   95.200 arrayprint.py:344(array2string)
           5/1    0.000    0.000  285.600  285.600 arrayprint.py:234(_array2string)
         90/59    0.000    0.000  285.597    4.841 {repr}
          17/1    0.000    0.000  285.597  285.597 arrayprint.py:470(_formatArray)
          10/1    0.000    0.000  285.597  285.597 arrayprint.py:231(repr_format)
             1    0.000    0.000  285.597  285.597 fitsrec.py:116(__repr__)
            22    0.000    0.000  285.589   12.981 fitsrec.py:76(__getitem__)
            22    0.001    0.000  285.588   12.981 fitsrec.py:619(field)
            22    0.000    0.000  285.577   12.981 fitsrec.py:1186(_get_recarray_field)
             2    0.000    0.000  282.319  141.159 column.py:1053(__init__)
             1    0.000    0.000  282.295  282.295 fitsrec.py:234(__array_finalize__)
             1    0.001    0.001  282.295  282.295 column.py:1115(_init_from_array)
            22    0.000    0.000  282.274   12.831 _internal.py:287(_index_fields)
      2056/224    0.984    0.000  275.827    1.231 _internal.py:337(_get_all_field_offsets)
          1948  111.653    0.057  273.478    0.140 {method 'extend' of 'list' objects}
---->153093045  161.825    0.000  161.825    0.000 _internal.py:366(<genexpr>)
     153094164  146.572    0.000  146.572    0.000 {method 'update' of 'set' objects}
     153095111  131.871    0.000  131.871    0.000 {range}
```

Wow, I don't know what's going on in `_internal.py:366(<genexpr>)`; it's being called over 153 million times.
I'd guess this slowdown is related to the recarray performance issues in numpy 1.10, i.e., numpy/numpy#6467. If so, numpy/numpy#6208 nominally resolved this on the numpy side, so the next bug-fix release of numpy should also get this one resolved.
@mhvk That was my guess too, but I'm testing right now against numpy master and it's still painfully slow. I'm taking a closer look for a bit, but don't have a lot of time.
Yeah, even on numpy master it's still spending an incredible amount of time in those functions:
For reference, Michael's table has 22 fields, so in the process of reading and printing the table each field is accessed twice (I think this is normal for pyfits, anyway). Each field access is taking almost 13 seconds on my system. Times 44, that's nearly 10 minutes right there... :(
At the same time, I made a plain Numpy array using the same dtype as that table and didn't have any noticeable problem. So maybe there's something crazy going on on our side too...
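A sketch of that kind of plain-NumPy comparison; the field names, dtypes, and iteration count below are illustrative, not the actual schema of Michael's file:

```python
import time
import numpy as np

# One-row structured array loosely modeled on the 22-field table above
dtype = [('MISSION_PHASE', 'S8'),
         ('ORBIT_NUMBER', 'i4'),
         ('SOLAR_LONGITUDE', 'f8'),
         ('WAVELENGTH', 'f8', (1024,))]  # a multi-element field
plain = np.zeros(1, dtype=dtype)

t0 = time.time()
for _ in range(10000):
    plain['MISSION_PHASE']   # plain ndarray field access by name
elapsed = time.time() - t0
print('10000 field accesses took %.4f s' % elapsed)
```

If the plain array is fast while the `FITS_rec` path is slow, the regression lives in whatever extra views, byte-order handling, or safety checks the FITS wrapper triggers on top of this.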
Probably the best one can do right now is to see if one can construct an equivalent recarray-only case which clearly still is much slower on master than on 1.9, and then post it to numpy/numpy#6467.
Sorry, it seems I'm getting in a habit of commenting when you are just writing what I thought... So, partly it is on our side, although the number of calls to things in
It is part of a section that was changed in the PR that made |
Note the fix has not been merged into numpy master yet |
@esheldon numpy/numpy#6208 has been merged, and that's the only fix to this I'm aware of. Or is there another? I confirmed that I was testing against it. @mhvk The test case I just added in numpy/numpy#6467 (comment) is totally independent of anything particularly weird astropy does (except maybe assigning a new dtype to an existing array but with a different byte order?)
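For reference, the byte-order pattern being alluded to might look like the following sketch; this is not astropy's actual code path, just the general shape of viewing an array through a byte-swapped dtype, the way big-endian FITS data on disk ends up wrapped on little-endian machines:

```python
import numpy as np

# An explicitly little-endian array...
native = np.arange(5, dtype='<i4')

# ...viewed through the byte-swapped (big-endian) version of its dtype.
# Same underlying bytes, different interpretation.
swapped_view = native.view(native.dtype.newbyteorder('>'))

# Element 1 (bytes 01 00 00 00) now reads as 0x01000000 == 1 << 24
print(int(swapped_view[1]))
```

Field access on arrays whose dtype byte order differs from native is exactly the sort of case the 1.10 field-offset safety checks could handle badly.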
@embray Yeah, sorry, I thought I was getting email updates about that pull request, but I wasn't.
@embray - good that you could narrow it down. Hopefully, there will be a fix soon. |
…so noted the known issue of astropy#4259 [skip ci]
@michaelaye - with the recent versions of numpy master and maintenance/1.10.x (see numpy/numpy#6562) this issue is resolved on my system. So, I'm closing this, but feel free to reopen if it turns out there still is a problem (if so, also note at numpy/numpy#6467). And: thanks for the report -- nice to see us helping numpy get better! -- or at least not get worse... |
In this FITS file, reading out the `.data` object of the 6th HDU (0-indexed), named "Observation", takes over 3 minutes on my iMac 5K. The table indicates a total size of 16 MB, but I don't believe that should be an issue, as indicated by our IDL reader, which has no noticeable delay in opening that file.
In case this is not an easy-to-fix performance bug with pyfits: our group is creating these files, and while we can't change the structure in terms of where what is stored, maybe there are some flags we could use to improve the internal structure on the binary level somehow?
Here's the code I used to show the issue: