Writing many small HDUs gets progressively slower #32

Closed
dkirkby opened this issue Jan 19, 2015 · 6 comments

@dkirkby

dkirkby commented Jan 19, 2015

I am writing a FITS file with many small HDUs and have noticed that the time to write each HDU increases ~linearly with the number of HDUs already written. The following test program shows the behavior:

import numpy as np
import fitsio
import time

fits = fitsio.FITS('test.fits', mode=fitsio.READWRITE, clobber=True)
cube = np.zeros((8, 8, 8), dtype=np.float32)
now = time.time()
for i in range(2000):
    fits.write_image(cube)
    if (i + 1) % 100 == 0:
        last = now
        now = time.time()
        # Seconds per 100 writes, times 10, is milliseconds per write.
        print('%6d %6.1f' % (i + 1, 10. * (now - last)))
fits.close()

I find that writing the first HDU takes ~0.02ms but that this slows down to ~1.0ms for the 2000th HDU. If you set clobber=False above, you can check that what matters is the number of HDUs in the file, not the number written by the current process.

Any idea if this is something peculiar about my system, a feature of the cfitsio library, or possibly due to the python wrapping of cfitsio? Any suggestions for avoiding this slowdown? I would like to be able to write ~100K HDUs.
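The batch timing used in the test program can be factored into a small helper so that the same measurement runs against any writer. This is only a sketch: `time_batches` and `write_one` are hypothetical names, not part of fitsio, and the stand-in workload below does no file I/O.

```python
import time


def time_batches(write_one, n_total, batch=100):
    """Call write_one() n_total times.

    Returns a list of (calls_so_far, mean_ms_per_call) tuples,
    one per completed batch.
    """
    results = []
    start = time.perf_counter()
    for i in range(n_total):
        write_one()
        if (i + 1) % batch == 0:
            end = time.perf_counter()
            results.append((i + 1, 1e3 * (end - start) / batch))
            start = end
    return results


# Stand-in workload; with fitsio one would pass something like
# lambda: fits.write_image(cube) instead.
for count, ms in time_batches(lambda: sum(range(1000)), 500):
    print('%6d %8.4f' % (count, ms))
```

A per-call time that grows from batch to batch indicates a cost that depends on the number of HDUs already in the file; a flat series indicates constant-time writes.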

@esheldon
Owner

I think this is due to calling update_hdu_list after each write.

It might make more sense to just append to the list rather than regenerate it from scratch, but only if it can be guaranteed there were no side effects on other extensions
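To see why regenerating the list after every write would give the observed scaling, here is a toy model that only counts list entries touched. It is not fitsio's code (the internals of update_hdu_list are not shown in this thread); both function names are made up for illustration.

```python
def entries_touched_rebuild(n_hdus):
    """Rebuild the whole list after each write: 1 + 2 + ... + n entries."""
    hdu_list, touched = [], 0
    for k in range(1, n_hdus + 1):
        hdu_list = list(range(k))  # regenerate from scratch
        touched += len(hdu_list)
    return touched


def entries_touched_append(n_hdus):
    """Append one entry per write: n entries in total."""
    hdu_list, touched = [], 0
    for k in range(n_hdus):
        hdu_list.append(k)
        touched += 1
    return touched


print(entries_touched_rebuild(2000))  # 2001000
print(entries_touched_append(2000))   # 2000
```

In the rebuild model the work per write is proportional to the number of HDUs already present, which matches the linear growth per write reported above; in the append model it is constant.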

@dkirkby
Author

dkirkby commented Jan 19, 2015

Thanks for the pointer. I see that each call to fits.write_image(img) calls update_hdu_list() twice, so I replaced these with calls to append_image(fits,img), followed by a single call to update_hdu_list() after all HDUs have been appended:

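def append_image(fits, img):
    # Relies on fitsio internals (fitsio.fitslib helpers and fits._FITS).
    comptype = fitsio.fitslib.get_compress_type(None)
    tile_dims = fitsio.fitslib.get_tile_dims(None, img)
    # Create the new image HDU directly, skipping update_hdu_list().
    fits._FITS.create_image_hdu(img,
        comptype=comptype, tile_dims=tile_dims, extname='', extver=0)
    fits[-1].write(img)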

This is significantly faster, but the time per call still increases linearly (so the total time required grows quadratically), just with a smaller slope. I suspect this is because the file is closed and reopened after writing each HDU, via a call to fits_flush_file():

// this is a full file close and reopen

Is this always necessary? Would it be possible to provide an option to _FITS.write_image() that bypasses this?
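To make the "linear per call, quadratic in total" point concrete, the sketch below extrapolates from the timings quoted in the first comment (~0.02 ms for the first HDU, ~1.0 ms for the 2000th). It assumes the linear trend continues unchanged up to 100K HDUs, which has not been measured; `total_seconds` is a hypothetical helper.

```python
def total_seconds(n_hdus, first_ms=0.02, last_ms=1.0, n_measured=2000):
    """Total write time if the k-th write costs first_ms + slope * (k - 1)."""
    slope = (last_ms - first_ms) / (n_measured - 1)  # ms per existing HDU
    # Sum over k = 1..N of first_ms + slope * (k - 1)
    total_ms = n_hdus * first_ms + slope * n_hdus * (n_hdus - 1) / 2.0
    return total_ms / 1e3


print('%.2f s for 2000 HDUs' % total_seconds(2000))
print('%.0f min for 100000 HDUs' % (total_seconds(100000) / 60.0))
```

Under these assumptions 2000 HDUs take about one second in total, while 100K HDUs would take roughly 40 minutes, almost all of it from the quadratic term.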

@dkirkby
Author

dkirkby commented Jan 19, 2015

For comparison, this test program using astropy.io.fits has a larger constant term but no linear term, so it is much better overall for writing many HDUs:

import numpy as np
import astropy.io.fits
import time

fits = astropy.io.fits.open('test2.fits', mode='ostream', memmap=False)
cube = np.zeros((8, 8, 8), dtype=np.float32)
now = time.time()
for i in range(2000):
    fits.append(astropy.io.fits.ImageHDU(data=cube))
    if (i + 1) % 100 == 0:
        last = now
        now = time.time()
        # Seconds per 100 appends, times 10, is milliseconds per append.
        print('%6d %6.1f' % (i + 1, 10. * (now - last)))
print(len(fits))
fits.flush()
fits.close()

The final fits.flush() is what actually writes the file from memory to disk, so this only works with enough memory to hold the entire file.
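For a rough sense of scale, the on-disk size can be estimated from the FITS layout, in which headers and data are each padded to 2880-byte blocks. The sketch below assumes each header fits in a single block (at most 36 cards of 80 bytes); the in-memory footprint inside astropy may differ from the file size, and the helper names are hypothetical.

```python
BLOCK = 2880  # FITS headers and data units are padded to 2880-byte blocks


def padded(nbytes):
    """Round nbytes up to a whole number of FITS blocks."""
    return -(-nbytes // BLOCK) * BLOCK


def hdu_bytes(shape, itemsize, n_header_cards=36):
    """Approximate size of one image HDU (header plus data)."""
    n_pixels = 1
    for dim in shape:
        n_pixels *= dim
    return padded(80 * n_header_cards) + padded(n_pixels * itemsize)


per_hdu = hdu_bytes((8, 8, 8), 4)  # float32 cube: 2048 data bytes
print(per_hdu)                     # 5760
print(100000 * per_hdu / 1e6)      # 576.0 (MB for 100K HDUs)
```

So a file of 100K of these cubes would be around 576 MB, which gives a lower bound on the memory needed if the whole file is held before flushing.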

@esheldon
Owner

Right, I should have been clearer. I plan to replace the call to update_hdu_list with something that just appends to the hdu_list. This should result in a constant time for adding a new extension.

I'm out sick today, I might not get to it until tomorrow.

dkirkby added a commit to LSSTDESC/WeakLensingDeblending that referenced this issue Jan 19, 2015
@esheldon
Owner

OK, I pushed a change to master that makes writing a new image extension constant time.

Currently I'm only using this to write images. If your tests show it works for that I'll look into tables as well.

@dkirkby
Author

dkirkby commented Jan 19, 2015

Your changes are working with my test program and fitsio is now about 10x faster for writing 2K HDUs than the astropy version above. Thanks!
