Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sdss loader (and other loaders?) work with file-like objects and fits objects #450

Closed
eteq opened this issue Feb 18, 2019 · 19 comments · Fixed by #637
Closed

Make sdss loader (and other loaders?) work with file-like objects and fits objects #450

eteq opened this issue Feb 18, 2019 · 19 comments · Fixed by #637
Labels

Comments

@eteq
Copy link
Member

eteq commented Feb 18, 2019

Right now the SDSS loaders (and I think many of the other loaders?) only accept file names. E.g., specutils.Spectrum1D.read('filename') works but specutils.Spectrum1D.read(open('filename')) does not. Most of the other loaders in astropy (and elsewhere) understand how to work directly with file-like objects, important for e.g. streaming files from the internet or other non-filesystem sources of spectral data.

Relatedly, to support use cases like the astroquery.sdss module where an HDUList is created, it would be nice for fits formats (like the SDSS spectral loaders) to also accept HDULists instead of needing a file at all.

cc @drdavella @nmearl

@eteq eteq added the io label Feb 18, 2019
@drdavella
Copy link
Contributor

This is a really good idea and I agree with the general philosophy here. However, I wonder if doing this would be in conflict with conventions (if not hard restrictions) imposed by the Astropy io registry? It seems like all other readers/writers/identifiers assume filename input. That might not necessarily prevent us from doing this here though.

@eteq
Copy link
Member Author

eteq commented Feb 18, 2019

I wonder if doing this would be in conflict with conventions (if not hard restrictions) imposed by the Astropy io registry?

I don't think so. If you look at https://github.com/astropy/astropy/blob/0c64572a2e8531fac1ce660002e0c92b70674fc1/astropy/io/registry.py#L509 you'll see a line in the read function itself seems to be expecting a file-like object but no file name is a valid option. Whether all the io loaders are aware of that is unclear... but at least for specutils I think this means it's safe to assume as an option.

@rosteen
Copy link
Contributor

rosteen commented Mar 31, 2020

@eteq It sounds like you already know of loaders that handle the open file object case - could you point me to one? I've been looking a couple places and haven't found one yet. Additionally, while the HDUList use case is clear, the use case for open file-like objects is less obvious to me - especially since e.g. Spectrum1D.read("https://data.com/remote_file.fits") already uses the astropy machinery to download the file to the local python cache and deal with it successfully. I don't have a good idea of what streaming a file from the internet in a manner requiring this would look like, any chance you could point me to/explain a science workflow for some more context on this?

FYI I've already been poking a bit at the HDUList case for the SDSS loader, and while it's trivial to implement in the loader functions I haven't yet successfully handled it in the identifier functions. So right now I have it working if one explicitly passes Spectrum1D.read() a format string along with the HDUList, but it's not auto-identifying the format. I'm digging into the astropy.io.registry code to figure out what I'm missing about the identifier functionality.

@rosteen
Copy link
Contributor

rosteen commented Mar 31, 2020

Having dug into the identifiers, it seems like auto-identifying the correct loader in the HDUList case would require changes to the underlying astropy.io.registry code, since that code requires for data loader auto-identification either (a) a string or pathlib.Path, or (b) the input to have a 'read' attribute. Because an HDUList has no 'read' attribute, this ends up with both path and fileobj being None and thus determining the format using the available identifiers fails. Would it be acceptable to document that one can pass an HDUList to Spectrum1D.read(), but that the format must be explicitly specified in that case? Or do we want to look into improving the underlying astropy code/figuring out a workaround?

@rosteen
Copy link
Contributor

rosteen commented Apr 1, 2020

Also tagging in @nmearl and @jdavies-st - see the above

@nmearl
Copy link
Contributor

nmearl commented Apr 1, 2020

it would be nice for fits formats (like the SDSS spectral loaders) to also accept HDULists instead of needing a file at all.

I'm really not a fan of this approach. There doesn't seem to be any precedent for supporting using HDUList as file objects, and it does not mesh well with the current approach with loaders in that they always have an accessible file path (even open file objects still have a property for the file name).

An alternative approach for pseudo-support would be to create a new file object from an HDUList object by overloading the .read attribute in Spectrum1D, or possibly in the identifier/reader functions. But this may be a bit of work.

In [3]: import tempfile

In [4]: with tempfile.NamedTemporaryFile() as f:
   ...:     hdulist.writeto(f)
   ...:

In [5]: f  # File-list object
Out[5]: <tempfile._TemporaryFileWrapper at 0x10e4694a8>

In [6]: f.name  # Get file path
Out[6]: '/var/folders/__/0lwnrhqs667grt4jr51hwpbm0001d3/T/tmpsu0f2tea'

@eteq
Copy link
Member Author

eteq commented Apr 1, 2020

@rosteen

It sounds like you already know of loaders that handle the open file object case - could you point me to one?

The example/expectation I was working from is the astropy.table.Table ascii readers (which are actually implemented in astropy.io.ascii) - they happily take file-like objects if you provide them:

>>> from io import BytesIO
>>> bio=BytesIO("""a b
... 1 2
... 3 4
... """.encode())
...
>>> table.Table.read(bio, format='ascii', guess=False)
<Table length=2>
  a     b
int64 int64
----- -----
    1     2
    3     4

And note that the exact same thing works if you open a fits file, either as an HDUList or as a regular file-like object:

>>> t1 = table.Table.read(open('/path/to/a/fits/table.fits', 'rb'), format='fits')
>>> from astropy.io import fits
>>> t2 = table.Table.read(fits.open('/path/to/a/fits/table.fits'), format='fits')

@eteq
Copy link
Member Author

eteq commented Apr 1, 2020

Oh, and to clarify one thing from @rosteen's comment above: I'm ok with the auto-identifying not working for a file-like object if that's a blocker - in the examples above I definitely had to explicitly give the format, and that's ok given the case that this is sort of an "advanced" usage, as well as not having the file-extension information. That can always be a later follow-on project once we get it working with explicit formats.

@rosteen
Copy link
Contributor

rosteen commented Apr 1, 2020

@eteq Well, if the auto-identification doesn't need to work, then the file-like object case is already handled because

test = fits.open(open('/path/to/a/fits/file.fits'))

already works to get an HDUList and thus for example this also works:

test_spectrum = specutils.Spectrum1D.read(open('/Users/rosteen/Data/spec-9321-58069-0010.fits'), format="SDSS-III/IV spec")

@eteq
Copy link
Member Author

eteq commented Apr 1, 2020

Now on the question of a use-case for HDUList: if I recall correctly the original issue was a case where an HDUList was the output of some other function, and the user (who I think was me?) was intentionally modifying something in the fits file before wanting it to be loaded by specutils.

Also more prosaically, astroquery.sdss yields HDUList objects and that's an important source of science spectra...

More broadly, the idea is that it should be possible to write a notebook where the data are downloaded from an archive (like the SDSS) and never saved to disk. For many notebooks that's a more reproducible scientific workflow than depending on on-disk files. More broadly this is forward-looking because some cloud platforms discourage using regular file i/o in this way - e.g. in AWS Lambda it makes a lot more sense to stream from an archive straight into memory and never bother with writing to disk. This is possible with file-like objects but not filenames.

TL;DR: the future is file-less. We should be ready. 😉

Re: @nmearl

There doesn't seem to be any precedent for supporting using HDUList as file objects

Can you clarify what you mean here? You mean in specutils or some wider context? As I indicated above this already works in astropy.table for loading fits files and is used liberally in various science notebooks I've seen.

@eteq
Copy link
Member Author

eteq commented Apr 1, 2020

Well, if the auto-identification doesn't need to work, then the file-like object case is already handled because

Oh really?? Maybe this got fixed sometime between when this issue was created? If so then I guess this issue is just "write a test to prove that"... and perhaps similar tests for the other loaders which load fits files (including SpectralCollection)

@nmearl
Copy link
Contributor

nmearl commented Apr 1, 2020

Can you clarify what you mean here? You mean in specutils or some wider context?

Yes, sorry in specutils. However, if it works with Table, then it can certainly work with Spectrum1D, since they use the same io machinery.

Oh, and to clarify one thing from @rosteen's comment above: I'm ok with the auto-identifying not working for a file-like object if that's a blocker

Interestingly, auto-identification works fine with HDUList objects and the astropy Table machinery, so it certainly is possible, it seems, to use HDUList file objects in the identifier.

@rosteen
Copy link
Contributor

rosteen commented Apr 1, 2020

Maybe this got fixed sometime between when this issue was created?

Possible... honestly, thinking/looking at this more, it's unclear to me why the auto-identification doesn't already work in the case of the open file object, since as @eteq pointed out in the initial discussion here the astropy.io.registry machinery looks like it should handle file objects in the identifier workflow, and the SDSS identifier functions also work for the file object case if you just pass one in directly. It seems like there's a bug to find somewhere that's causing the ball to be dropped somewhere in between.

@eteq
Copy link
Member Author

eteq commented Apr 1, 2020

Aha! Here's an example use case that at least for me is currently failing:

>>> from astroquery.sdss import SDSS
>>> specs = SDSS.get_spectra(plate=751, mjd=52251, fiberID=160)
>>> from specutils import Spectrum1D
>>> Spectrum1D.read(specs[0], format="SDSS-III/IV spec")
TypeError: stat: path should be string, bytes, os.PathLike or integer, not method

Note that for a test it might be overkill to use astroquery and add the dependency. Under the hood it goes to the URL 'https://data.sdss.org/sas/dr14/sdss/spectro/redux/26/spectra/0751/spec-0751-52251-0160.fits' and then loads that with astropy.io.fits.open, so that's probably the easiest way to mock it up without adding an astroquery dependency.

Oh, and could even use the stdlib urllib.request.urlopen at that same url to test "file-like object that's not backed by a file".

@rosteen
Copy link
Contributor

rosteen commented Apr 1, 2020

Hmm...you say that it's loading the file from that URL with fits.open, but your example results in:

>>> specs = SDSS.get_spectra(plate=751, mjd=52251, fiberID=160)
>>> type(specs)
<class 'list'>
>>> hasattr(specs, 'read')
False

whereas if one just pointed fits.open at the remote file you would get:

>>> spec2 = fits.open('https://data.sdss.org/sas/dr14/sdss/spectro/redux/26/spectra/0751/spec-0751-52251-0160.fits')
>>> type(spec2)
<class 'astropy.io.fits.hdu.hdulist.HDUList'>

It seems like the problem with your example isn't that your specs is a file-like object but rather that it's a list.

@rosteen
Copy link
Contributor

rosteen commented Apr 1, 2020

@eteq to your comment about testing with urllib.request.urlopen, that does in fact work if you provide a format string:

>>> specs = urllib.request.urlopen('https://data.sdss.org/sas/dr14/sdss/spectro/redux/26/spectra/0751/spec-0751-52251-0160.fits')
>>> test = specutils.Spectrum1D.read(specs, format="SDSS-III/IV spec")
>>> test
<Spectrum1D(flux=<Quantity [30.596626,...]...>

@nmearl
Copy link
Contributor

nmearl commented Apr 1, 2020

@rosteen he uses the first element in the returned list in the example, which seems to be an HDUList object:

In [2]: spec = SDSS.get_spectra(plate=751, mjd=52251, fiberID=160)
In [3]: type(specs[0])

Out[3]: astropy.io.fits.hdu.hdulist.HDUList

@nmearl
Copy link
Contributor

nmearl commented Apr 1, 2020

But the reason this fails is because the "SDSS-III/IV spec" loader sees a file-like object and thinks it's just a file and attempts to use fits.open with it. Essentially, the loader would have to also check that it's not already an HDUList object.

@rosteen
Copy link
Contributor

rosteen commented Apr 1, 2020

Argh, of course specs[0] is an HDUList, somehow in my testing I had dropped the [0] and convinced myself that spec[0] was also a simple list. Clearly need to eat some lunch. In that case this circles back around to my comment that the HDUList case is pretty much trivial in the loader if we don't care about the auto-identification; I already have a branch where it's working:

>>> from astroquery.sdss import SDSS
>>> specs = SDSS.get_spectra(plate=751, mjd=52251, fiberID=160)
>>> from specutils import Spectrum1D
>>> Spectrum1D.read(specs[0], format="SDSS-III/IV spec")
<Spectrum1D(flux=<Quantity [30.596626,]...>

Should I create a PR with just the HDUList handling added (plus some tests demonstrating that the examples here work) then?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants