Make sdss loader (and other loaders?) work with file-like objects and fits objects #450
Comments
This is a really good idea and I agree with the general philosophy here. However, I wonder if doing this would be in conflict with conventions (if not hard restrictions) imposed by the Astropy io registry? It seems like all other readers/writers/identifiers assume filename input. That might not necessarily prevent us from doing this here though. |
I don't think so. If you look at https://github.com/astropy/astropy/blob/0c64572a2e8531fac1ce660002e0c92b70674fc1/astropy/io/registry.py#L509 you'll see a line in the |
@eteq It sounds like you already know of loaders that handle the open file object case. Could you point me to one? I've looked in a couple of places and haven't found one yet.
Additionally, while the HDUList use case is clear, the use case for open file-like objects is less obvious to me, especially since e.g. Spectrum1D.read("https://data.com/remote_file.fits") already uses the astropy machinery to download the file to the local python cache and deal with it successfully. I don't have a good idea of what streaming a file from the internet in a manner requiring this would look like. Any chance you could point me to or explain a science workflow for some more context on this?
FYI I've already been poking a bit at the HDUList case for the SDSS loader, and while it's trivial to implement in the loader functions, I haven't yet successfully handled it in the identifier functions. So right now I have it working if one explicitly passes Spectrum1D.read() a format string along with the HDUList, but it's not auto-identifying the format. I'm digging into the astropy.io.registry code to figure out what I'm missing about the identifier functionality. |
Having dug into the identifiers, it seems like auto-identifying the correct loader in the HDUList case would require changes to the underlying astropy.io.registry code. For data loader auto-identification, that code requires the input to be either (a) a string or pathlib.Path, or (b) an object with a 'read' attribute. Because an HDUList has no 'read' attribute, both path and fileobj end up as None, and determining the format using the available identifiers fails. Would it be acceptable to document that one can pass an HDUList to Spectrum1D.read(), but that the format must be explicitly specified in that case? Or do we want to look into improving the underlying astropy code or figuring out a workaround? |
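To make the failure mode described above concrete, here is a minimal sketch of that branching. It is a simplified illustration written from the description in this comment, not the actual astropy.io.registry code; `split_path_and_fileobj`, `FakeHDUList`, and the file name `spectrum.fits` are hypothetical names used only for this example.

```python
import io
import os
from pathlib import Path


def split_path_and_fileobj(source):
    """Return (path, fileobj) following the two cases described above.

    Simplified illustration only; this is NOT astropy's implementation.
    """
    path = None
    fileobj = None
    if isinstance(source, (str, os.PathLike)):
        # Case (a): a string or pathlib.Path
        path = os.fspath(source)
    elif hasattr(source, "read"):
        # Case (b): anything that exposes a read() method
        fileobj = source
    # Anything else falls through with both values left as None
    return path, fileobj


class FakeHDUList(list):
    """Hypothetical stand-in for an HDUList: list-like, with no read()."""


print(split_path_and_fileobj("spectrum.fits"))
print(split_path_and_fileobj(Path("spectrum.fits")))
print(split_path_and_fileobj(io.BytesIO(b"SIMPLE")))
print(split_path_and_fileobj(FakeHDUList()))
```

The last call returns `(None, None)`, which is the situation where identifiers that only look at a path or a file object have nothing to inspect.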
Also tagging in @nmearl and @jdavies-st - see the above |
I'm really not a fan of this approach. There doesn't seem to be any precedent for using an HDUList as a file object, and it does not mesh well with the current loaders, which always have an accessible file path (even open file objects still have a property for the file name). An alternative approach for pseudo-support would be to create a new file object from an HDUList object by overloading the
In [3]: import tempfile
In [4]: with tempfile.NamedTemporaryFile() as f:
   ...:     hdulist.writeto(f)
   ...:
In [5]: f # File-like object
Out[5]: <tempfile._TemporaryFileWrapper at 0x10e4694a8>
In [6]: f.name # Get file path
Out[6]: '/var/folders/__/0lwnrhqs667grt4jr51hwpbm0001d3/T/tmpsu0f2tea' |
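One detail worth spelling out about the temporary-file route: with the default settings, a NamedTemporaryFile is deleted when its with block exits, so the path has to be used inside the block. A minimal stdlib-only sketch, assuming the path-based loader is passed in as a callable (`call_with_temp_path` and `path_based_loader` are hypothetical names; in the FITS case `hdulist.writeto(f)` from the session above would take the place of `f.write(data)`):

```python
import os
import tempfile


def call_with_temp_path(data, path_based_loader):
    """Write `data` to a named temporary file and pass its path to a loader."""
    with tempfile.NamedTemporaryFile(suffix=".fits") as f:
        f.write(data)
        f.flush()  # make sure the bytes are on disk before the loader reopens the file
        # The path is only valid here: the file is removed when the block exits.
        # Note: reopening by name while the file is still open may fail on Windows.
        return path_based_loader(f.name)


# Usage: any function that takes a file path works as the loader.
print(call_with_temp_path(b"SIMPLE", os.path.getsize))  # prints 6
```

This still writes to disk, so it is a workaround for path-only loaders rather than true in-memory support.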
The example/expectation I was working from is the
And note that the exact same thing works if you open a fits file, either as an
|
Oh, and to clarify one thing from @rosteen's comment above: I'm ok with the auto-identifying not working for a file-like object if that's a blocker - in the examples above I definitely had to explicitly give the format, and that's ok given the case that this is sort of an "advanced" usage, as well as not having the file-extension information. That can always be a later follow-on project once we get it working with explicit formats. |
@eteq Well, if the auto-identification doesn't need to work, then the file-like object case is already handled because
already works to get an HDUList and thus for example this also works:
|
Now on the question of a use-case for Also more prosaically, More broadly, the idea is that it should be possible to write a notebook where the data are downloaded from an archive (like the SDSS) and never saved to disk. For many notebooks that's a more reproducible scientific workflow than depending on on-disk files. This is also forward-looking, because some cloud platforms discourage regular file I/O of this kind: in AWS Lambda, for example, it makes a lot more sense to stream from an archive straight into memory and never bother with writing to disk. This is possible with file-like objects but not with filenames. TL;DR: the future is file-less. We should be ready. 😉 Re: @nmearl
Can you clarify what you mean here? You mean in |
Oh really?? Maybe this got fixed sometime between when this issue was created? If so then I guess this issue is just "write a test to prove that"... and perhaps similar tests for the other loaders which load fits files (including |
Yes, sorry in specutils. However, if it works with
Interestingly, auto-identification works fine with |
Possible... honestly, thinking/looking at this more, it's unclear to me why the auto-identification doesn't already work in the case of the open file object. As @eteq pointed out in the initial discussion here, the astropy.io.registry machinery looks like it should handle file objects in the identifier workflow, and the SDSS identifier functions also work for the file object case if you just pass one in directly. It seems like there's a bug somewhere in between that's causing the ball to be dropped. |
Aha! Here's an example use case that at least for me is currently failing:
Note that for a test it might be overkill to use Oh, and could even use the stdlib |
Hmm...you say that it's loading the file from that URL with fits.open, but your example results in:
whereas if one just pointed fits.open at the remote file you would get:
It seems like the problem with your example isn't that your |
@eteq to your comment about testing with
|
@rosteen he uses the first element in the returned list in the example, which seems to be an HDUList:
In [2]: specs = SDSS.get_spectra(plate=751, mjd=52251, fiberID=160)
In [3]: type(specs[0])
Out[3]: astropy.io.fits.hdu.hdulist.HDUList |
But the reason this fails is because the "SDSS-III/IV spec" loader sees a file-like object and thinks it's just a file and attempts to use |
Argh, of course specs[0] is an HDUList, somehow in my testing I had dropped the [0] and convinced myself that spec[0] was also a simple list. Clearly need to eat some lunch. In that case this circles back around to my comment that the HDUList case is pretty much trivial in the loader if we don't care about the auto-identification; I already have a branch where it's working:
Should I create a PR with just the HDUList handling added (plus some tests demonstrating that the examples here work) then? |
Right now the SDSS loaders (and I think many of the other loaders?) only accept file names. E.g., specutils.Spectrum1D.read('filename') works but specutils.Spectrum1D.read(open('filename')) does not. Most of the other loaders in astropy (and elsewhere) understand how to work directly with file-like objects, important for e.g. streaming files from the internet or other non-filesystem sources of spectral data.
Relatedly, to support use cases like the astroquery.sdss module where an HDUList is created, it would be nice for FITS formats (like the SDSS spectral loaders) to also accept HDULists instead of needing a file at all.
cc @drdavella @nmearl
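As an illustration of the file-like case requested here, the following is a minimal stdlib-only sketch of a reader that accepts either a file name or an already-open binary file object. It is not specutils or astropy code: `open_binary` and `read_first_card` are hypothetical helpers, and the in-memory payload is a hand-built stand-in for bytes fetched from an archive.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_binary(source):
    """Yield a readable binary file object for a path or a file-like object."""
    if hasattr(source, "read"):
        # The caller owns an already-open object, so it is not closed here.
        yield source
    else:
        with open(source, "rb") as fh:
            yield fh


def read_first_card(source, nbytes=80):
    """Hypothetical loader: return the first `nbytes` bytes decoded as ASCII."""
    with open_binary(source) as fh:
        return fh.read(nbytes).decode("ascii")


# Stand-in for downloaded bytes; nothing is written to disk.
payload = ("SIMPLE  =" + "T".rjust(21)).ljust(80).encode("ascii")
print(read_first_card(io.BytesIO(payload)).split())
```

The same `read_first_card` call works with a file name, so a loader written this way does not need separate code paths for the two kinds of input.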