Support remote and cloud-hosted FITS files #13238
Conversation
Quick comments without carefully reading the whole PR... @astrofrog tried something like this a while ago so maybe he can have a look. He had commented that this was hard to generalize to n-dimensional data. Have you tried this on 3D or 4D data cubes? Also, does this support non-contiguous slicing? I tried to build upon Tom R's proof-of-concept once but quickly hit a roadblock because AWS did not support multi-range GET. Perhaps that has changed since?
setup.cfg
Outdated
fsspec
s3fs
gcsfs
I wonder if we should dump these in `all` or whether we need special categories for each cloud service. You are unlikely to use both AWS and Google Cloud together, are you?
We could definitely create a special `[cloud]` category for these if we are worried about bloating the list of optional dependencies.
Some background details:

- `s3fs` and `gcsfs` are both fairly lightweight pure-Python packages which enable logic specific to AWS or Google Cloud to be left out of the `fsspec` core package.
- Including `s3fs` made sense to me because the data archives of several NASA missions are hosted in S3 storage.
- We could remove `gcsfs` because I am not aware of any astronomy projects which use Google Cloud storage right now, but I included it to avoid giving the impression that we prefer one vendor over another.
> astronomy projects which use Google Cloud storage

Rubin?
But on topic: we already have all the backends for hdf5, asdf, dask, pyarrow, pandas, etc. listed; I don't see why these small pure-Python packages should be shuffled into a different section while the rest of the very-optional dependencies are not.
Since `fsspec` supports many different services (see below), many of those require some optional dependency, and `fsspec` will tell the user if a dependency is missing, I think we should just have an optional dependency on `fsspec` itself and delegate to it to warn the user if some other package is needed.
Agree with this; except that since fsspec itself is so small and has no deps of its own, you may well want to include it as required and not need to check for it.
Hmm, I see the interest of doing that, but Astropy has always had a very limited number of required dependencies and, probably more important, the code in `io.fits` dealing with file descriptors is already quite complicated and I'm not sure I want to have another layer dealing with the fd.
These are good points! There is no doubt that adding dependencies to Astropy (optional or otherwise) risks creating work for maintainers in the future, so it's definitely important to trade this risk against the benefits.

My two cents:

- Adding the core `fsspec` package as an optional dependency is relatively low risk, because it is a pure-Python package, has zero dependencies of its own, and is already an optional dependency of popular packages such as `pandas` and `dask`. The core package would enable FITS files to be opened efficiently via the `http` protocol.
- There is a case to be made for including `s3fs` alongside `fsspec`, because NASA already hosts Hubble/Kepler/TESS/Chandra data in S3, and it appears to be moving towards adopting S3 for many other science missions (can anyone confirm exact plans?). It is true however that the adoption of S3 in astronomy isn't exactly ubiquitous right now, and `s3fs` does have two dependencies of its own (`aiobotocore` and `aiohttp`), so I'd be OK with waiting to adopt `s3fs` as an optional dependency until S3 gains more significant adoption.
- The case for `gcsfs` is weak because I am not aware of any facilities currently providing data via Google Storage. Rubin Observatory (LSST) appears to rely on Google Cloud, but someone would have to confirm whether or not the observatory plans to enable users to access FITS files via `gs://` paths. Until then, I'd be OK with skipping `gcsfs` as an optional dependency.

Any thoughts?
Unfortunately, fsspec does also require aiohttp, even for http. The principal advantage from a user's point of view is that there would be a concrete message saying which additional dependencies should be installed when trying to access a storage backend that is not currently available.

Of the protocols, s3 may well be the most important, partly because other services outside of AWS support the same API. However, I have been surprised by the amount of data funnelling into Azure (abfs), albeit not FITS.
Thanks @pllim! I must acknowledge that the idea of accessing partial FITS files from the cloud has been explored by many others before me, whose work inspired this PR, including:

A key difference from these previous efforts is that this PR does not attempt to compute byte offsets for data slices in multi-dimensional arrays itself (this is indeed challenging; I tried and failed to find an elegant solution). Instead, it re-uses the existing `ImageHDU.section` machinery.
Excellent question! This PR works for multi-dimensional cubes; e.g., the new documentation includes an example with 4D TESS cubes. Tests which verify the correctness of using `.section` on such cubes are included.

The good news is that the limitations here are well understood. It should be possible to add support for more complicated slicing operations to `.section` in the future.
I already reviewed this PR when it was in draft form and won't nitpick the details of getting CI to pass here, so I approve it from my side.

I think this is a great improvement, providing an out-of-the-box solution for those working in cloud environments. Besides, timing-wise it's done perfectly: right at the beginning of a dev cycle, providing ample opportunity to fix any potential issues and get integration downstream by the time it lands in an actual release.
@barentsen - do rebase rather than pull in merge commits to update the branch.

Oh cool! Glad to see this finally made it into Astropy. In case it helps, the core issue I ran into was that either I couldn't calculate the offset correctly or there were corrupt S3 files. The math lined up, but my ability to ask a coherent question didn't seem to work at the time. I'm definitely willing to collaborate on this again in the future.

Re coverage failure: very likely a red herring, as the coverage for this new functionality requires remote-data access and all dependencies being installed.

Yes, the remote data job does not have coverage. If you want to do due diligence, you can always run coverage locally with remote data turned on. 😸
I haven't read the code, but fsspec is happy to help here. Most other libraries doing this kind of thing don't use a flag for fsspec, but determine whether the path is fsspec-like; the kwargs to pass usually go through a dedicated keyword (pandas and dask call it `storage_options`). I'll comment on buffering and non-contiguous access separately when I have the chance.
Good point! We have the option to default to fsspec for every fsspec-like path, but we'd probably want fsspec to be a required rather than an optional dependency in that case. Does anyone have thoughts?
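For illustration, here is one stdlib-only way such path sniffing could look. This is a sketch; `is_fsspec_like` is a hypothetical helper name of mine, not astropy's or fsspec's actual implementation:

```python
from urllib.parse import urlsplit

def is_fsspec_like(path: str) -> bool:
    # Treat anything with a multi-character URL scheme other than
    # "file" (s3://, gs://, abfs://, https://, ...) as a candidate
    # for fsspec.  A single-letter "scheme" is really a Windows
    # drive letter (C:\...), i.e. a local path.
    scheme = urlsplit(path).scheme
    return len(scheme) > 1 and scheme != "file"
```

In practice, pandas and dask delegate this decision to fsspec itself, which maintains the registry of known protocols.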
I am not excited about yet another refactor of that code.
I daresay this is the most disruptive, so I can well understand caution.
I don't think we would necessarily need to touch that code. We do have the option to default to fsspec for fsspec-like paths in a future PR.
On the matter of contiguous reads and such... When you create an fsspec file-like object, the interface to it is file-like (seek/read).

If you have an N-D array and read a small selection of anything but the smallest dimension (i.e., strided), then you will vary between reading all of the data anyway and reading extra data along with every piece you actually need - a real fail case.

All caching provided for the file-like interface is limited by the fact that a file object is stateful, and so cannot do anything concurrently, even when the backing store allows concurrent operations.

Anyone who would like to talk to me about cloud-friendly astronomy data access, whether FITS or not (zarr), I am more than happy to!
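To make the strided-read fail case concrete, here is a back-of-the-envelope sketch (my own illustration, not code from the PR) of the byte ranges a reader must fetch for a small 2-D cutout of a row-major (C-order) array stored contiguously:

```python
def byte_ranges_2d(n_rows, n_cols, itemsize, row_slice, col_slice):
    """Byte ranges needed to read arr[row_slice, col_slice] from a
    C-order 2-D array stored contiguously.

    Returns one (start, stop) offset pair per selected row, because
    the selected columns are not contiguous across rows.
    """
    ranges = []
    for r in range(*row_slice.indices(n_rows)):
        start = (r * n_cols + col_slice.start) * itemsize
        stop = (r * n_cols + col_slice.stop) * itemsize
        ranges.append((start, stop))
    return ranges

# A 10x20 cutout of a 2048x2048 float32 image needs 10 separate
# 80-byte ranges - one per row - unless the reader buffers or
# coalesces them.
ranges = byte_ranges_2d(2048, 2048, 4, slice(100, 110), slice(200, 220))
```

Ten tiny requests for one small cutout is exactly why readahead buffering (or a concurrent multi-range fetch) matters here.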
Commits:
- Address doctest failure
- Update docs; codestyle fixes; typo fixes (Co-authored-by: Brigitta Sipőcz <b.sipocz@gmail.com>)
- Address additional PR feedback; fine-tune the docs
- Mention fsspec & s3fs in the installation instructions; improve links
- Fix codestyle; fix CI with --pre
- Fix whatsnew; CI codestyle fixes; fix CI failure; additional CI fixes; avoid Matplotlib deprecation warning; attempt to fix CI --open-files failure; fix syntax error in setup.cfg; try git+https instead of git+ssh
- Resolve rebase conflict; fix isort issues; experiment with daily cron failure; address Python bug; keep the skipif decorator to prevent the daily cron from failing
(force-pushed from 24fe935 to d8c668a)
I don't grok everything but I am confident other reviewers went through the cloud guts. And CI is green, which is a good sign.
Just minor comments and one last concern about the global warning ignore (can it be local?).
setup.cfg
Outdated
@@ -132,6 +134,7 @@ filterwarnings =
error
ignore:unclosed <socket:ResourceWarning
ignore:unclosed <ssl.SSLSocket:ResourceWarning
ignore:unclosed transport <asyncio.sslproto
If this was happening only for one test, can we ignore it just for that test instead of adding a global ignore like this?
True! I'll go ahead and see if I can replace it with a local `@pytest.mark.filterwarnings` decorator instead.
Co-authored-by: P. L. Lim <2090236+pllim@users.noreply.github.com>
linkcheck failure is unrelated and can be ignored for this PR.

Geert, go take a break. The cron is gonna take a few hours. 😆 ☕

Merged. Thank you so much for your contribution and patience! Hopefully we will smoke out any remaining issues before the v5.2 release. 😸

Thanks @pllim! I'm on standby to resolve any issues before v5.2.
Summary
This PR would enable users to seamlessly extract data from FITS files stored on a web server or in the cloud without downloading the entire file to local storage. Specifically, this PR proposes adding the fsspec package as an optional astropy dependency to enable remote FITS files to be accessed using the following usage pattern:
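A minimal sketch of that usage pattern (the URL is hypothetical; the PR's real example used a 213 MB file hosted at MAST):

```python
from astropy.io import fits

def fetch_cutout(uri):
    # use_fsspec=True asks astropy to open the path via fsspec
    # instead of io.FileIO; .section then reads only the bytes
    # covering the requested pixels.
    with fits.open(uri, use_fsspec=True) as hdul:
        return hdul[1].section[10:20, 30:50]

# e.g. fetch_cutout("https://archive.example.org/big_image.fits")
```

For `s3://` paths, credentials (or `{"anon": True}` for public buckets) can be supplied via the `fsspec_kwargs` parameter described below.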
A key feature is that the example above does not download the entire 213 MB file. Instead, only the necessary chunks of the FITS file are transferred on demand.
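Under the hood, the partial transfer relies on standard HTTP Range headers. A stdlib-only sketch of the kind of request fsspec issues (the URL and the `range_request` helper are my own illustration, not part of astropy or fsspec):

```python
import urllib.request

FITS_BLOCK = 2880  # FITS files are organized in 2880-byte blocks

def range_request(url, start, length):
    # Ask the server for exactly [start, start+length) via the HTTP
    # Range header; a compliant server answers 206 Partial Content
    # with only those bytes.
    end = start + length - 1
    return urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{end}"})

# Fetching only the first header block of a remote FITS file:
req = range_request("https://archive.example.org/big_image.fits", 0, FITS_BLOCK)
```

fsspec batches such ranged reads behind a buffered, file-like interface so that callers never build these requests by hand.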
How does this work?
The efficient behavior is achieved by using the well-established fsspec package to open remote files. Fsspec provides seamless access to cloud data by providing a `file`-like interface to a range of remote file systems. In the case of files hosted on web servers or in cloud storage, fsspec translates random-access `file.read()` operations into buffered HTTP Range Requests.

Once a remote FITS file has been opened with fsspec, `astropy.io.fits` can use it in nearly the same way as a local file (with the exception of memory mapping). As a result, we can leverage two existing "lazy data loading" features which have been part of Astropy for many years:

- The `lazy_load_hdus=True` parameter takes care of reading HDU header and data sections on demand rather than loading all HDUs at once.
- The `ImageHDU.section` property enables a subset of a data array to be read into memory without downloading the entire image or cube.

Fsspec is already a dependency of major packages such as `dask` and `pandas`, so we can reasonably expect it to be maintained long term. For example, pandas uses fsspec to enable users to open data from S3 seamlessly using `pandas.read_csv("s3://...")`
, similar to what is being proposed here.

Which changes does this PR make?

The changes required to Astropy are fairly modest. In summary, this PR:

- Adds `fsspec` as an optional dependency, alongside the fsspec-affiliated package `s3fs` (for Amazon S3 support, which is the cloud vendor used by the Hubble/JWST data archive at MAST).
- Adds a `use_fsspec` parameter to `astropy.utils.data.get_readable_fileobj`, which triggers file paths to be opened with `fsspec.open()` rather than `io.FileIO()`.
- Adds an `fsspec_kwargs` parameter to `astropy.utils.data.get_readable_fileobj`, which enables buffering options and cloud credentials to be configured.
- Adds documentation (`docs/io/fits/usage/cloud.rst`) and tests (`io/fits/tests/test_fsspec.py`), including tests of `.section` on remote files.
I wrote a lightweight tool called fsspec-monitor which enables the exact network traffic to be explored when opening a FITS file with astropy+fsspec. For example, we can use this tool to monitor what happens when we request a 10-by-20 pixel cutout from a 213 MB FITS file as follows:
The above example triggers the following diagnostic log to be printed to stdout:
These diagnostics confirm that it took just 2 HTTP requests and less than 1 MB of data to extract a cutout from a 213 MB FITS file. 🎉
(Disclosure: the example above was optimized by using `fsspec_kwargs` to set the minimum `block_size` to 500 KB. The block size matters because fsspec uses a buffered readahead strategy to minimize the number of requests. The default minimum block size is 5 MB, which would have caused 10 MB of data transfer here; that said, the default is probably a good choice for many use cases.)
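The arithmetic behind that disclosure can be checked directly, assuming (as the monitored session showed) two range requests, each reading at least one fsspec block:

```python
# Each HTTP range request reads at least one readahead block, so the
# block size bounds the total transfer from below.
N_REQUESTS = 2
DEFAULT_BLOCK_SIZE = 5 * 1024 * 1024   # fsspec's 5 MiB default
TUNED_BLOCK_SIZE = 500 * 1000          # 500 kB set via fsspec_kwargs

default_transfer = N_REQUESTS * DEFAULT_BLOCK_SIZE   # ~10 MB
tuned_transfer = N_REQUESTS * TUNED_BLOCK_SIZE       # ~1 MB
```

This is why shrinking `block_size` pays off for tiny cutouts, while the larger default is better when reading substantial contiguous chunks.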