
FITS: Performance issue #4259

Closed · michaelaye opened this issue Oct 21, 2015 · 34 comments

Comments

@michaelaye

In this FITS file, reading out the .data attribute of HDU 6 (0-indexed), the HDU named "Observation", takes over 3 minutes on my iMac 5K.
The table indicates a total size of about 16 MB:

data.itemsize
16777661

but I don't believe that should be an issue; our IDL reader opens the same file with no noticeable delay.
In case this is not an easy-to-fix performance bug in pyfits: our group creates these files, and we can't change where things are stored within them, but maybe there are some flags we could use to improve the internal binary layout somehow?

Here's the code I used to show the issue:

from astropy.io import fits
import time

fname = 'mvn_iuv_l1a_periapse-orbit01063-muvdark_20150417T234058_v02_r01.fits.gz'
hdulist = fits.open(fname)

t0 = time.time()
data = hdulist[6].data  # the lazy read of the "Observation" HDU happens here
print(time.time() - t0)

193.38006591796875
@michaelaye (Author)

Note that I tried whether gunzipping the file makes any difference; it doesn't, as far as I can see.

@embray (Member) commented Oct 22, 2015

Weird--the data loaded immediately for me. You'll have to tell me more about the Python version and the platform.

@embray (Member) commented Oct 22, 2015

Just to make sure I have the right data, is the table in question supposed to have only one row?

@michaelaye (Author)

How do you define row? len(data) is 1, yes, but there's lots of stuff in it. My system is OS X with a Miniconda distribution; both Python 2 and 3 show this. Will try with GitHub astropy now.

@michaelaye (Author)

Just look at data.names to see all the fields in there:

['PRODUCT_ID',
 'COLLECTION_ID',
 'BUNDLE_ID',
 'CODE_SVN_REVISION',
 'ANC_SVN_REVISION',
 'PRODUCT_CREATION_DATE',
 'OBSERVATION_TYPE',
 'MISSION_PHASE',
 'TARGET_NAME',
 'ORBIT_SEGMENT',
 'ORBIT_NUMBER',
 'SOLAR_LONGITUDE',
 'GRATING_SELECT',
 'KEYHOLE_SELECT',
 'BIN_PATTERN_INDEX',
 'CADENCE',
 'INT_TIME',
 'DUTY_CYCLE',
 'CHANNEL',
 'WAVELENGTH',
 'WAVELENGTH_WIDTH',
 'KERNELS']

@michaelaye (Author)

It's really weird: even with 1-element fields it takes 3 seconds (!) or so to get the prompt back when I do:

t0 = time.time()
print(data['MISSION_PHASE'])
print(time.time() - t0)

['PRIME']
3.31076288223

@embray (Member) commented Oct 22, 2015

Yes, there's one row with many multi-element fields. Loading the table, or any individual column, took under a second for me.

Have you ever used kernprof?
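
For reference, kernprof ships with the line_profiler package. A rough, hypothetical sketch of how it could be pointed at the slow step here (the profile decorator is injected by kernprof when the script is run as kernprof -l -v profile_me.py):

# profile_me.py -- hypothetical sketch; run with:  kernprof -l -v profile_me.py
from astropy.io import fits

@profile  # noqa: F821 -- this name is provided by kernprof's -l option
def load_observation(fname):
    hdulist = fits.open(fname)
    return hdulist[6].data   # the slow step we want line-by-line timings for

load_observation('mvn_iuv_l1a_periapse-orbit01063-muvdark_20150417T234058_v02_r01.fits.gz')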

@michaelaye (Author)

Just so that you don't think it's my computer, I did the same with another file:

# first timer
0.189756155014
# second timer
['PRIME']
0.00368285179138

@michaelaye (Author)

No experience with kernprof, but I'm willing to learn. I tried with astropy master, to no avail.

@embray (Member) commented Oct 22, 2015

It's not that I think it's your computer. I just want to narrow down things like the Numpy version, etc. The OS could matter too.

Actually, instead of going straight to kernprof (since I'm not really sure which functions you need to be profiling), it might be nice if you made a simple test script that just opens that file and loads the data (so that it demonstrates the slowdown), then ran python -m cProfile --sort=cumtime test_script.py and copied the basic profiler output here first. That might help give an idea.
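
For concreteness, a minimal test_script.py along those lines could just be the reproduction from the top of this issue:

# test_script.py -- minimal reproduction; run with:
#   python -m cProfile --sort=cumtime test_script.py
from astropy.io import fits

fname = 'mvn_iuv_l1a_periapse-orbit01063-muvdark_20150417T234058_v02_r01.fits.gz'
hdulist = fits.open(fname)
data = hdulist[6].data          # the slow lazy read
print(data['MISSION_PHASE'])    # also exercises individual field access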

@michaelaye (Author)

I used pandas.show_versions() to list what is currently installed in Miniconda:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None

@embray (Member) commented Oct 22, 2015

Oh I see, this is your problem:

numpy: 1.10.1

Let me see if this is fixed in #4228

@michaelaye (Author)

Wow, never would have thought that. Thanks!

@embray (Member) commented Oct 22, 2015

If you downgrade to Numpy 1.9 for now you'll have no problem.

@michaelaye (Author)

If you are still interested in the cProfile output, it's here: https://gist.github.com/michaelaye/cbec1c5c9e4e6dcf306e

@michaelaye (Author)

Yep: 0.025 seconds, maaaaan. 👊

@michaelaye (Author)

What kind of data are affected by this? Is this only a problem with FITS? I want to put a warning out to my planetary buddies...

@embray (Member) commented Oct 22, 2015

Thanks!

The good news is we know this is specific to Numpy 1.10, so at least there's a simple workaround.

The bad news is the issue does not appear to be fixed by #4228 and I have no idea what's causing it.

The worse news is that I'm going on vacation starting tomorrow, so if it's not obvious, this will have to wait to be fixed :\ (which is too bad, because I wanted to promise Numpy 1.10 support in the next release; if I can't find an obvious fix for this we'll have to say that Numpy 1.10 works but has performance issues on FITS tables).

@embray (Member) commented Oct 22, 2015

What kind of data are affected by this? Is this only a problem with FITS? I want to put a warning out to my planetary buddies...

I'm not sure yet... Certainly not most data otherwise this would have been noticed sooner.

@embray (Member) commented Oct 22, 2015

I also noticed that whatever is going on here is going crazy with memory usage too, topping out at as much as 2.4 GB while trying to read this file, which is absurd because the file itself is only 25 MB uncompressed.

@embray (Member) commented Oct 22, 2015

(Even considering that on Python 3 reading string columns quadruples the size of those columns, it's still absurd.)
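
The factor of four presumably comes from numpy storing unicode strings as UCS-4, i.e. 4 bytes per character, versus 1 byte per character for raw FITS byte strings. For example:

import numpy as np
print(np.dtype('S100').itemsize)  # 100 bytes -- FITS-style byte string column
print(np.dtype('U100').itemsize)  # 400 bytes -- unicode (UCS-4), 4x larger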

@michaelaye (Author)

Well, thanks for your interactive help, it surely removes some hurdles for now! Try to enjoy your vacation nevertheless! ;)

@embray (Member) commented Oct 22, 2015

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001  906.673  906.673 test.py:1(<module>)
       66    3.165    0.048  850.936   12.893 {method 'getfield' of 'numpy.ndarray' objects}
---->       66    5.955    0.090  847.764   12.845 _internal.py:425(_getfield_is_safe)
   249/89    0.001    0.000  620.248    6.969 decorators.py:507(__get__)
        1    0.000    0.000  620.248  620.248 table.py:401(data)
        1    0.000    0.000  620.248  620.248 table.py:151(_get_tbdata)
---->      112  339.191    3.028  618.556    5.523 _internal.py:372(_check_field_overlap)
       44    0.001    0.000  568.663   12.924 records.py:535(field)
    99/63    0.000    0.000  310.548    4.929 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000  309.512  309.512 table.py:184(_init_tbdata)
      5/3    0.000    0.000  285.600   95.200 numeric.py:1737(array_str)
      9/3    0.000    0.000  285.600   95.200 arrayprint.py:344(array2string)
      5/1    0.000    0.000  285.600  285.600 arrayprint.py:234(_array2string)
    90/59    0.000    0.000  285.597    4.841 {repr}
     17/1    0.000    0.000  285.597  285.597 arrayprint.py:470(_formatArray)
     10/1    0.000    0.000  285.597  285.597 arrayprint.py:231(repr_format)
        1    0.000    0.000  285.597  285.597 fitsrec.py:116(__repr__)
       22    0.000    0.000  285.589   12.981 fitsrec.py:76(__getitem__)
       22    0.001    0.000  285.588   12.981 fitsrec.py:619(field)
       22    0.000    0.000  285.577   12.981 fitsrec.py:1186(_get_recarray_field)
        2    0.000    0.000  282.319  141.159 column.py:1053(__init__)
        1    0.000    0.000  282.295  282.295 fitsrec.py:234(__array_finalize__)
        1    0.001    0.001  282.295  282.295 column.py:1115(_init_from_array)
       22    0.000    0.000  282.274   12.831 _internal.py:287(_index_fields)
 2056/224    0.984    0.000  275.827    1.231 _internal.py:337(_get_all_field_offsets)
     1948  111.653    0.057  273.478    0.140 {method 'extend' of 'list' objects}
----> 153093045  161.825    0.000  161.825    0.000 _internal.py:366(<genexpr>)
153094164  146.572    0.000  146.572    0.000 {method 'update' of 'set' objects}
153095111  131.871    0.000  131.871    0.000 {range}

Wow, I don't know what's going on in _getfield_is_safe and _check_field_overlap but something is definitely fishy in there. I'm thinking this is probably an upstream bug. Incidentally, this may already be fixed by numpy/numpy#6208. I'll check.

@mhvk (Contributor) commented Oct 22, 2015

I'd guess this slowdown is related to the recarray performance issues in numpy 1.10, i.e., numpy/numpy#6467. If so, numpy/numpy#6208 nominally resolved this on the numpy side, so the next bug-fix release of numpy should also get this one resolved.

@embray (Member) commented Oct 22, 2015

@mhvk That was my guess too, but I'm testing right now against numpy master and it's still painfully slow. I'm taking a closer look for a bit, but don't have a lot of time.

@embray (Member) commented Oct 22, 2015

Yeah, even on numpy master it's still spending an incredible amount of time in those functions:

       44    0.001    0.000  570.352   12.963 records.py:543(field)
       44    2.667    0.061  570.351   12.963 {method 'getfield' of 'numpy.ndarray' objects}
       44    5.754    0.131  567.681   12.902 _internal.py:397(_getfield_is_safe)
       45  218.477    4.855  410.559    9.124 _internal.py:344(_check_field_overlap)

For reference, Michael's table has 22 fields, so in the process of reading and printing the table each field is accessed twice (I think this is normal for pyfits anyway). Each field access takes almost 13 seconds on my system; times 44, that's nearly 10 minutes right there... :(

@embray (Member) commented Oct 22, 2015

At the same time, I made a plain Numpy array using the same dtype as that table and didn't see any noticeable problem. So maybe there's something crazy happening on our side too....

@mhvk (Contributor) commented Oct 22, 2015

Probably the best one can do right now is to see if one can construct an equivalent recarray-only case that is still clearly much slower on master than on 1.9, and then post it to numpy/numpy#6467.
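
Something along these lines might do as a rough sketch (the field names and shapes below are made up, but they mimic a single row with a ~16 MB itemsize):

import time
import numpy as np

# Made-up structured dtype: one row containing a couple of large sub-array
# fields, loosely mimicking the IUVS table.
dt = np.dtype([('SMALL', 'S6'),
               ('BIG_FLOAT', '>f8', (1024, 1024)),     # ~8 MB field
               ('BIG_UINT', '>u2', (4, 1024, 1024))])  # ~8 MB field
arr = np.zeros(1, dtype=dt).view(np.recarray)

t0 = time.time()
for name in dt.names:
    arr.field(name)   # goes through ndarray.getfield / _getfield_is_safe
print('field access took %.3f s' % (time.time() - t0))
# Fast on numpy 1.9; on an affected 1.10.x build this should spend most of
# its time in _check_field_overlap.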

@mhvk (Contributor) commented Oct 22, 2015

Sorry, it seems I'm getting into a habit of commenting just as you are writing what I thought... So, partly it is on our side, although the number of calls to things in _internal.py seems more than a little strange. Line 366 in that file is:

def _get_all_field_offsets(dtype, base_offset=0):
...
            fields.extend((typ, off + dtype.base.itemsize*j)
                           for j in range(count) for (typ, off) in sub_offsets)

It is part of a section that was changed in the PR that made recarray so slow. cc @ahaldane just in case he has ideas.
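
To get a feel for the scale: that comprehension emits one (typ, off) pair per scalar element of a sub-array field, so even a single multi-megabyte field yields millions of entries. A back-of-the-envelope count for a made-up dtype:

import numpy as np

# Hypothetical dtype with a couple of big sub-array fields (shapes invented).
dt = np.dtype([('WAVELENGTH', '>f8', (7, 1024)),
               ('SPECTRUM', '>u2', (7, 1024, 1024))])
# One (typ, off) pair per scalar element of each field:
pairs = sum(int(np.prod(f[0].shape or (1,))) for f in dt.fields.values())
print(pairs)   # ~7.3 million pairs for this dtype alone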

@esheldon

Note the fix has not been merged into numpy master yet

@embray (Member) commented Oct 22, 2015

@esheldon numpy/numpy#6208 has been merged, and that's the only fix for this I'm aware of. Or is there another? I confirmed that I was testing against it.

@mhvk The test case I just added in numpy/numpy#6467 (comment) is totally independent of anything particularly weird astropy does (except maybe assigning a new dtype to an existing array but with a different byte order?)

@esheldon

@embray Yeah, sorry, I thought I was getting email updates about that pull request, but I wasn't.

@mhvk (Contributor) commented Oct 22, 2015

@embray - good that you could narrow it down. Hopefully, there will be a fix soon.

@mhvk (Contributor) commented Oct 27, 2015

@michaelaye - with recent versions of numpy master and maintenance/1.10.x (see numpy/numpy#6562), this issue is resolved on my system. So I'm closing this, but feel free to reopen if it turns out there still is a problem (if so, please also note it at numpy/numpy#6467). And: thanks for the report -- nice to see us helping numpy get better! -- or at least not get worse...

@mhvk mhvk closed this as completed Oct 27, 2015
dhomeier pushed a commit to dhomeier/astropy that referenced this issue Jun 12, 2016