
Support remote and cloud-hosted FITS files #13238

Merged Oct 3, 2022 (19 commits merged into astropy:main)
Conversation

@barentsen (Contributor) commented May 10, 2022

Summary

This PR would enable users to seamlessly extract data from FITS files stored on a web server or in the cloud without downloading the entire file to local storage. Specifically, this PR proposes adding the fsspec package as an optional astropy dependency to enable remote FITS files to be accessed using the following usage pattern:

from astropy.io import fits

# URI of a 213 MB FITS file hosted in a free Amazon S3 cloud storage bucket
uri = "s3://stpubdata/hst/public/j8pu/j8pu0y010/j8pu0y010_drc.fits"

# Download the primary header
with fits.open(uri, fsspec_kwargs={"anon": True}) as hdul:

    # Download a single header
    header = hdul[1].header

    # Download a single image
    mydata = hdul[1].data

    # Download a small cutout
    myslice = hdul[2].section[10:12, 20:22]

A key feature is that the example above does not download the entire 213 MB file. Instead, only the necessary chunks of the FITS file are transferred on demand.

How does this work?

The efficient behavior is achieved by using the well-established fsspec package to open remote files. Fsspec provides seamless access to cloud data by providing a file-like interface to a range of remote file systems. In the case of files hosted on web servers or in cloud storage, fsspec translates random access file.read() operations into buffered HTTP Range Requests.
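For intuition, the Range header that such a buffered read translates into can be constructed with the standard library alone. This is only a sketch: fsspec actually issues these requests via aiohttp, and example.org stands in for a real server, so no request is sent here.

```python
import urllib.request

# Hypothetical URL for illustration only; no request is actually sent.
url = "https://example.org/large_image.fits"

# An HTTP Range request asks the server for bytes start..end inclusive,
# so a 500 kB block at the start of the file is bytes 0-499999.
start, end = 0, 499_999
req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

print(req.get_header("Range"))  # bytes=0-499999
```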

Once a remote FITS file has been opened with fsspec, astropy.io.fits can use it in nearly the same way as a local file (with the exception of memory mapping). As a result, we can leverage two existing "lazy data loading" features which have been part of Astropy for many years:

  1. The lazy_load_hdus=True parameter takes care of reading HDU header and data sections on demand rather than loading all HDUs at once.
  2. The ImageHDU.section property enables a subset of a data array to be read into memory without downloading the entire image or cube.

Fsspec is already a dependency of major packages such as dask and pandas, so we can reasonably expect it to be maintained long term. For example, pandas uses fsspec to enable users to open data from S3 seamlessly using pandas.read_csv("s3://..."), similar to what is being proposed here.
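The buffered-block behavior described above can be illustrated with a toy block cache. This is a simplified sketch for intuition only, not fsspec's actual implementation; `BlockCachedFile` and its `fetch` callable are invented for this example.

```python
# Sketch (not fsspec's code) of how a buffered file-like object turns
# small read() calls into block-sized fetches that are cached locally.

class BlockCachedFile:
    """Wrap a fetch(start, end) callable and cache fixed-size blocks."""

    def __init__(self, fetch, size, block_size):
        self.fetch = fetch          # fetches bytes [start, end) from the store
        self.size = size            # total file size in bytes
        self.block_size = block_size
        self.pos = 0
        self.blocks = {}            # block index -> cached bytes
        self.requests = 0           # number of remote fetches issued

    def seek(self, pos):
        self.pos = pos

    def read(self, n):
        start, end = self.pos, min(self.pos + n, self.size)
        first, last = start // self.block_size, (end - 1) // self.block_size
        for i in range(first, last + 1):
            if i not in self.blocks:   # fetch each missing block only once
                lo = i * self.block_size
                self.blocks[i] = self.fetch(lo, min(lo + self.block_size, self.size))
                self.requests += 1
        data = b"".join(self.blocks[i] for i in range(first, last + 1))
        offset = start - first * self.block_size
        self.pos = end
        return data[offset:offset + (end - start)]

# Simulate a 1024-byte remote file; each fetch stands in for one HTTP request.
remote = bytes(range(256)) * 4
f = BlockCachedFile(lambda a, b: remote[a:b], len(remote), block_size=256)
f.seek(10); chunk = f.read(4)    # touches block 0 only -> one fetch
f.seek(12); chunk2 = f.read(2)   # served from the cache -> no new fetch
print(f.requests)                # 1
```

Reading two nearby slices costs a single simulated request, which is the effect the PR relies on.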

Which changes does this PR make?

The changes required to Astropy are fairly modest. In summary, this PR:

  • Adds fsspec as an optional dependency, alongside the fsspec-affiliated package s3fs (for Amazon S3 support, which is the cloud vendor used by the Hubble/JWST data archive at MAST).
  • Adds the use_fsspec parameter to astropy.utils.data.get_readable_fileobj, which triggers file paths to be opened with fsspec.open() rather than io.FileIO().
  • Adds the fsspec_kwargs parameter to astropy.utils.data.get_readable_fileobj, which enables buffering options and cloud credentials to be configured.
  • Adds a new docs chapter titled "Obtaining subsets from cloud-hosted FITS files" (docs/io/fits/usage/cloud.rst).
  • Adds unit tests (io/fits/tests/test_fsspec.py).
  • Makes minor edits to the docs and the implementation of .section.

How can we verify it really works?

I wrote a lightweight tool called fsspec-monitor which makes it possible to inspect the exact network traffic generated when opening a FITS file with astropy+fsspec. For example, we can use this tool to monitor what happens when we request a 10-by-10 pixel cutout from a 213 MB FITS file as follows:

from fsspecmonitor import FsspecMonitor  # requires `barentsen/fsspec-monitor`
from astropy.io import fits

# URL of a 213 MB Hubble image
url = "https://mast.stsci.edu/api/v0.1/Download/file/?uri=mast:HST/product/j8pu0y010_drc.fits"

with FsspecMonitor() as monitor:
    with fits.open(url, use_fsspec=True, fsspec_kwargs={"block_size": 500_000}) as hdul:
        cutout = hdul[2].section[10:20, 30:40]
    monitor.summary()

The above example triggers the following diagnostic log to be printed to stdout:

Reading https://mast.stsci.edu/api/v0.1/Download/file/?uri=mast:HST/product/j8pu0y010_drc.fits (213.82 MB)
Fetch bytes 0-500080 (0.91 MB/s)
Fetch bytes 74759040-75261920 (2.71 MB/s)
Summary: fetched 1002960 bytes (0.96 MB) in 0.70 s (1.36 MB/s) using 2 requests.

These diagnostics confirm that it took just 2 HTTP requests and less than 1 MB of data to extract a cutout from a 213 MB FITS file. 🎉

(Disclosure: the example above was tuned by using fsspec_kwargs to set the block_size to 500 kB. The block size matters because fsspec uses a buffered readahead strategy to minimize the number of requests: every fetch is expanded to at least one full block. The default block size of 5 MB would have caused 10 MB of data transfer in this example, but it is probably a good default for many use cases.)
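The arithmetic behind that disclosure can be checked with a toy model of block-aligned fetching. This is a sketch under stated assumptions: `bytes_fetched` is an invented helper, and the byte ranges below are approximations read off the diagnostic log above.

```python
def bytes_fetched(ranges, block_size):
    """Total bytes transferred if each requested (start, end) range is
    expanded to whole blocks of `block_size` bytes (simplified model)."""
    blocks = set()
    for start, end in ranges:  # end is exclusive
        blocks.update(range(start // block_size, (end - 1) // block_size + 1))
    return len(blocks) * block_size

# The two reads from the log: header region and cutout region (approximate).
ranges = [(0, 80), (74_759_040, 74_759_120)]

print(bytes_fetched(ranges, 500_000))    # 1000000  -> ~1 MB with 500 kB blocks
print(bytes_fetched(ranges, 5_000_000))  # 10000000 -> 10 MB with the 5 MB default
```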

@pllim (Member) commented May 10, 2022

Quick comments without carefully reading the whole PR...

@astrofrog tried something like this a while ago so maybe he can have a look. He had commented that this was hard to generalize to n-dimensional data. Have you tried this on 3D or 4D data cubes?

Also, does this support non-contiguous slicing? I tried to build upon Tom R's proof-of-concept once but quickly hit a roadblock because AWS did not support multi-range GET. Perhaps that has changed since?

Review threads (resolved) on: astropy/io/fits/file.py, astropy/io/fits/hdu/image.py, docs/io/fits/usage/cloud.rst

setup.cfg, comment on lines 96 to 100:

    fsspec
    s3fs
    gcsfs
Member:
I wonder if we dump this in all or we need special categories for each cloud service. You are unlikely to use both AWS and Google Cloud together, are you?

@barentsen (Contributor, Author), May 10, 2022:
We could definitely create a special [cloud] category for these if we are worried about bloating the list of optional dependencies.

Some background details:

  • s3fs and gcsfs are both fairly lightweight pure Python packages which enable logic specific to AWS or Google Cloud to be left out of the fsspec core package.
  • Including s3fs made sense to me because the data archives of several NASA missions are hosted in S3 storage.
  • We could remove gcsfs because I am not aware of any astronomy projects which use Google Cloud storage right now, but I included it to avoid giving the impression that we prefer one vendor over another.

Member:
astronomy projects which use Google Cloud storage

rubin?

@bsipocz (Member), May 10, 2022:

But on topic: we already list all the backends for hdf5, asdf, dask, pyarrow, pandas, etc. I don't see why these small pure-Python packages should be shuffled into a different section while the rest of the very optional dependencies are not.

Contributor:

Since fsspec supports many different services (see below), many of those require some optional dependency, and fsspec will tell the user if a dependency is missing, I think we should just have an optional dependency on fsspec itself and delegate to it to warn the user if some other package is needed.

```python
In [5]: known_implementations
Out[5]:
{'abfs': {'class': 'adlfs.AzureBlobFileSystem',
          'err': 'Install adlfs to access Azure Datalake Gen2 and Azure Blob Storage'},
 'adl': {'class': 'adlfs.AzureDatalakeFileSystem',
         'err': 'Install adlfs to access Azure Datalake Gen1'},
 'arrow_hdfs': {'class': 'fsspec.implementations.arrow.HadoopFileSystem',
                'err': 'pyarrow and local java libraries required for HDFS'},
 'az': {'class': 'adlfs.AzureBlobFileSystem',
        'err': 'Install adlfs to access Azure Datalake Gen2 and Azure Blob Storage'},
 'blockcache': {'class': 'fsspec.implementations.cached.CachingFileSystem'},
 'cached': {'class': 'fsspec.implementations.cached.CachingFileSystem'},
 'dask': {'class': 'fsspec.implementations.dask.DaskWorkerFileSystem',
          'err': 'Install dask distributed to access worker file system'},
 'dbfs': {'class': 'fsspec.implementations.dbfs.DatabricksFileSystem',
          'err': 'Install the requests package to use the DatabricksFileSystem'},
 'dropbox': {'class': 'dropboxdrivefs.DropboxDriveFileSystem',
             'err': 'DropboxFileSystem requires "dropboxdrivefs", "requests" and "dropbox" to be installed'},
 'file': {'class': 'fsspec.implementations.local.LocalFileSystem'},
 'filecache': {'class': 'fsspec.implementations.cached.WholeFileCacheFileSystem'},
 'ftp': {'class': 'fsspec.implementations.ftp.FTPFileSystem'},
 'gcs': {'class': 'gcsfs.GCSFileSystem',
         'err': 'Please install gcsfs to access Google Storage'},
 'gdrive': {'class': 'gdrivefs.GoogleDriveFileSystem',
            'err': 'Please install gdrivefs for access to Google Drive'},
 'git': {'class': 'fsspec.implementations.git.GitFileSystem',
         'err': 'Install pygit2 to browse local git repos'},
 'github': {'class': 'fsspec.implementations.github.GithubFileSystem',
            'err': 'Install the requests package to use the github FS'},
 'gs': {'class': 'gcsfs.GCSFileSystem',
        'err': 'Please install gcsfs to access Google Storage'},
 'hdfs': {'class': 'fsspec.implementations.hdfs.PyArrowHDFS',
          'err': 'pyarrow and local java libraries required for HDFS'},
 'http': {'class': 'fsspec.implementations.http.HTTPFileSystem',
          'err': 'HTTPFileSystem requires "requests" and "aiohttp" to be installed'},
 'https': {'class': 'fsspec.implementations.http.HTTPFileSystem',
           'err': 'HTTPFileSystem requires "requests" and "aiohttp" to be installed'},
 'jlab': {'class': 'fsspec.implementations.jupyter.JupyterFileSystem',
          'err': 'Jupyter FS requires requests to be installed'},
 'jupyter': {'class': 'fsspec.implementations.jupyter.JupyterFileSystem',
             'err': 'Jupyter FS requires requests to be installed'},
 'libarchive': {'class': 'fsspec.implementations.libarchive.LibArchiveFileSystem',
                'err': 'LibArchive requires to be installed'},
 'memory': {'class': 'fsspec.implementations.memory.MemoryFileSystem'},
 'oci': {'class': 'ocifs.OCIFileSystem',
         'err': 'Install ocifs to access OCI Object Storage'},
 'reference': {'class': 'fsspec.implementations.reference.ReferenceFileSystem'},
 's3': {'class': 's3fs.S3FileSystem', 'err': 'Install s3fs to access S3'},
 's3a': {'class': 's3fs.S3FileSystem', 'err': 'Install s3fs to access S3'},
 'sftp': {'class': 'fsspec.implementations.sftp.SFTPFileSystem',
          'err': 'SFTPFileSystem requires "paramiko" to be installed'},
 'simplecache': {'class': 'fsspec.implementations.cached.SimpleCacheFileSystem'},
 'smb': {'class': 'fsspec.implementations.smb.SMBFileSystem',
         'err': 'SMB requires "smbprotocol" or "smbprotocol[kerberos]" installed'},
 'ssh': {'class': 'fsspec.implementations.sftp.SFTPFileSystem',
         'err': 'SFTPFileSystem requires "paramiko" to be installed'},
 'tar': {'class': 'fsspec.implementations.tar.TarFileSystem'},
 'wandb': {'class': 'wandbfs.WandbFS', 'err': 'Install wandbfs to access wandb'},
 'webhdfs': {'class': 'fsspec.implementations.webhdfs.WebHDFS',
             'err': 'webHDFS access requires "requests" to be installed'},
 'zip': {'class': 'fsspec.implementations.zip.ZipFileSystem'}}
```

Agree with this; except that since fsspec itself is so small and has no deps of its own, you may well want to include it as required and not need to check for it.

Contributor:

Hmm, I see the appeal of doing that, but Astropy has always had a very limited number of required dependencies and, probably more important, the code in io.fits dealing with the file descriptors is already quite complicated; I'm not sure I want another layer dealing with the fd.

@barentsen (Contributor, Author):

These are good points! There is no doubt that adding dependencies to Astropy (optional or otherwise) risks creating work for maintainers in the future, so it's definitely important to trade this risk against the benefits.

My two cents:

  • Adding the core fsspec package as an optional dependency is relatively low risk, because it is a pure Python package, has zero dependencies of its own, and is already an optional dependency of popular packages such as pandas and dask. The core package would enable FITS files to be opened efficiently via the http protocol.
  • There is a case to be made for including s3fs alongside fsspec, because NASA already hosts Hubble/Kepler/TESS/Chandra data in S3, and it appears to be moving towards adopting S3 for many other science missions (can anyone confirm exact plans?). It is true however that the adoption of S3 in astronomy isn't exactly ubiquitous right now, and s3fs does have two dependencies of its own (aiobotocore and aiohttp), so I'd be OK with waiting to adopt s3fs as an optional dependency until S3 gains more significant adoption.
  • The case for gcsfs is weak because I am not aware of any facilities currently providing data via Google Storage. Rubin Observatory (LSST) appears to rely on Google Cloud, but someone would have to confirm whether or not the observatory plans to enable users to access FITS files via gs:// paths. Until then, I'd be OK with skipping gcsfs as an optional dependency.

Any thoughts?

Unfortunately, fsspec does also require aiohttp, even for http. The principal advantage from a user's point of view is that there would be a concrete message saying which additional dependencies should be installed when trying to access a storage backend that is not currently available.

Of the protocols, s3 may well be the most important, partly because other services exist outside of AWS which support the same API. However, I have been surprised by the amount of data funnelling to Azure (abfs), albeit not FITS.

@barentsen (Contributor, Author) commented May 10, 2022

@astrofrog tried something like this a while ago so maybe he can have a look.

Thanks @pllim! I must acknowledge that the idea of accessing partial FITS files from the cloud has been explored by many others before me, whose work inspired this PR, including:

A key difference from these previous efforts is that this PR does not attempt to compute byte offsets for data slices in multi-dimensional arrays itself (this is indeed challenging, I tried and failed to find an elegant solution). Instead it re-uses the existing lazy_load_hdus and ImageHDU.section features of Astropy to do this. I'm not sure who originally wrote .section (@embray?) but I am impressed by whoever achieved it.

He had commented that this was hard to generalize to n-dimensional data. Have you tried this on 3D or 4D data cubes?

Excellent question!

This PR works for multi-dimensional cubes, e.g. see here for an example with 4D TESS cubes.

Tests which verify the correctness of using ImageHDU.section in 3D/4D currently exist on the main branch in io/fits/tests/test_image.py.

Also, does this support non-contiguous slicing? I tried to build upon Tom R's proof-of-concept once but quickly hit a roadblock because AWS did not support multi-range GET. Perhaps that has changed since?

ImageHDU.section only appears to support the basic forms of slicing right now. For example, it does not support parsing keys with tuples such as ImageHDU.section[(0, 5, 10), (0, 5, 10)].

The good news is that the limitations of .section are not directly related to the lack of multi-range GET support. This PR does not worry about making GET requests directly and passes this problem off to fsspec via its file-like interface. Behind the scenes, fsspec implements a smart buffered readahead strategy which minimizes the number of GET requests in other ways (e.g., by enforcing a minimum block_size of 5 MB and caching those blocks).

It should be possible to add support for more complicated slicing operations to ImageHDU.section; I believe it is difficult merely because the implementation of .section is quite hard to follow. I did already add support for negative indexing to .section as part of this PR, and I am going to add a comment to the docstring of .section to explain that not every form of NumPy slicing is supported.
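For intuition on why the supported simple slices map cleanly onto a few small reads: in a row-major (C-order) array, a contiguous rectangular cutout touches one short byte run per selected row. A minimal sketch of that offset arithmetic, assuming a 2-D array; this is not astropy's implementation, and the shapes and offsets below are made up for illustration:

```python
def cutout_byte_ranges(data_offset, shape, itemsize, rows, cols):
    """Byte ranges (start, end) covering section[rows, cols] of a 2-D
    C-order array stored at `data_offset` in the file. Simplified model,
    not astropy's actual code. `rows` and `cols` are (start, stop) pairs."""
    n_cols = shape[1]
    ranges = []
    for r in range(*rows):
        start = data_offset + (r * n_cols + cols[0]) * itemsize
        ranges.append((start, start + (cols[1] - cols[0]) * itemsize))
    return ranges

# A 10x10 cutout of a 4000x4000 array of 4-byte pixels whose data starts
# at byte 8640 (illustrative numbers, not taken from the real file above).
ranges = cutout_byte_ranges(8640, (4000, 4000), 4, (10, 20), (30, 40))
print(len(ranges), ranges[0])   # 10 rows -> 10 short ranges of 40 bytes each
```

Ten 40-byte runs is tiny compared to the full array, which is why a buffered remote reader only needs a handful of requests.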

@bsipocz (Member) left a comment

I've already reviewed this PR when it was in draft form and won't nitpick the details of getting CI to pass here, so I approve it from my side.

I think this is a great improvement, providing an out-of-the-box solution for those working in cloud environments. Besides, timing-wise it's done perfectly: right at the beginning of a dev cycle, providing ample opportunity to fix any potential issues and get downstream integration by the time it lands in an actual release.

@bsipocz (Member) commented May 10, 2022

@barentsen - do rebase rather than pull in merge commits to update the branch.

@jbcurtin commented May 10, 2022

Oh cool! Glad to see this finally made it into Astropy. In case it helps, the core issue I ran into was that either I couldn't calculate the offsets correctly or there were corrupt S3 files. The math lined up, but my ability to ask a coherent question didn't seem to work at the time. I'm definitely willing to collaborate on this again in the future.

@bsipocz (Member) commented May 10, 2022

Re coverage failure: very likely a red herring, as coverage for this new functionality requires remote-data access and all dependencies being installed.

@pllim (Member) commented May 10, 2022

Yes, remote data job does not have coverage. If you want to do due diligence, you can always run coverage locally with remote data turned on. 😸

@martindurant commented
I haven't read the code, but fsspec is happy to help here. Most other libraries doing this kind of thing don't use a flag for fsspec, but determine if the path is fsspec-like; also the kwargs to pass are usually storage_options.

I'll comment on buffering and non-contiguous access separately when I have the chance.

@barentsen (Contributor, Author) commented May 10, 2022

Most other libraries doing this kind of thing don't use a flag for fsspec, but determine if the path is fsspec-like

Good point! For paths with prefix ['http', 'https', 'ftp', 'sftp', 'ssh', 'file'], Astropy's existing behavior is to download the file using urllib. I introduced use_fsspec to avoid changing the existing behavior unless explicitly requested.

We have the option to default to fsspec for every fsspec-like path, but we'd probably want fsspec to be a required rather than an optional dependency in that case. Does anyone have thoughts?

(Note: for prefix s3:// and gs://, this PR does default to fsspec because there was no prior support.)
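The dispatch rule described in this comment can be sketched with standard-library scheme parsing. This is illustrative only: `should_use_fsspec` is a made-up helper for this example, not astropy's actual code.

```python
from urllib.parse import urlparse

# Schemes astropy already handled via urllib before this PR (per the comment above).
URLLIB_SCHEMES = {"http", "https", "ftp", "sftp", "ssh", "file"}

def should_use_fsspec(uri, use_fsspec=None):
    """Sketch of the dispatch rule: an explicit flag wins; otherwise default
    to fsspec only for schemes astropy had no prior support for (s3, gs)."""
    if use_fsspec is not None:
        return use_fsspec
    scheme = urlparse(uri).scheme
    return scheme in {"s3", "gs"}

print(should_use_fsspec("s3://stpubdata/hst/file.fits"))         # True
print(should_use_fsspec("https://mast.stsci.edu/file.fits"))     # False
print(should_use_fsspec("https://mast.stsci.edu/f.fits", True))  # True
```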

@pllim (Member) commented May 11, 2022

I am not excited at another refactor of utils if we can help it. Lots of people use it, including HPC people.

@martindurant commented

We have the option to default to fsspec for every fsspec-like path

I daresay this is the most disruptive, so I can well understand caution.

@barentsen (Contributor, Author) commented

I am not excited at another refactor of utils if we can help it.

I don't think we would necessarily touch utils, we just wouldn't be calling it from astropy.io.fits for downloading files any longer.

We do have the option to default to use_fsspec=False for now, and change the default in a future major version release of Astropy.

@martindurant commented

On the matter of contiguous reads and such...

When you create an fsspec file-like object, the interface to it is seek()/read(), so the storage layer knows nothing about the intended access pattern. There is indeed caching by default on most filesystems including google/amazon/azure; and the most common default is 5MB read-ahead. This is great for a typical case of processing nearly linearly through a file in small chunks, e.g., csv readers.

If you have an N-D array and read a small selection of anything but the smallest dimension (i.e., strided), then you will vary between reading all of the data anyway, and reading extra data along with every piece you actually need - a real fail case.
So we provide a variety of caching techniques, including "none" (unbuffered); as well as "caching" in terms of saving some or all of a requested file to local storage.

All caching provided for the file-like interface is limited by the fact that a file object is stateful, and so cannot do anything concurrently, even when the backing store allows concurrent operations.

However, you could use fs.cat directly, with one or more files and many ranges. Those ranges are all sent concurrently (well, batches of up to 1000, typically), so you amortise the latency and get massive speedups when the reads are small. This, again, is for specific backends: http, s3, gcs, azure. You can push this further in the case that you know a lot about your file format. https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/ shows this in the context of parquet, where we analyse the metadata, pre-fetch all the pieces of a given file concurrently and quickly (joining ranges as required), and then generate a "pre-cached" file-like object for the target library to read from. Obviously that's not for this PR, but worth thinking about!
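Two of the ideas above, joining nearby byte ranges and issuing the resulting requests concurrently, can be sketched with the standard library. This is a simplified illustration, not fsspec's code; `merge_ranges` and `fetch` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_ranges(ranges, gap=0):
    """Merge overlapping or adjacent (start, end) ranges; ranges closer
    than `gap` bytes are joined so one request covers both."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

data = bytes(range(200))   # stand-in for a remote object

def fetch(rng):            # one simulated "GET" per merged range
    return data[rng[0]:rng[1]]

ranges = merge_ranges([(0, 10), (12, 20), (100, 120)], gap=5)
with ThreadPoolExecutor() as pool:   # requests issued concurrently
    pieces = list(pool.map(fetch, ranges))

print(ranges)              # [(0, 20), (100, 120)] -- first two joined
```

Coalescing turns three small reads into two requests, and the thread pool stands in for the concurrent dispatch that amortises per-request latency.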

Anyone that would like to talk to me about cloud-friendly astronomy data access, whether FITS or not (zarr), I am more than happy to!

barentsen and others added 13 commits September 30, 2022 15:56
Address doctest failure

Address doctest failure
Update docs

Codestyle fixes

Update docs

Typo fix

Typo fix

Docs typo.

Co-authored-by: Brigitta Sipőcz <b.sipocz@gmail.com>
Address additional PR feedback

Fine tune the docs
Mention fsspec & s3fs in the installation instructions; improve links
Fix codestyle

Fix CI with --pre
Fix whatsnew CI

codestyle fixes

Fix CI failure

Additional CI fixes

Avoid Matplotlib deprecation warning

Attempt to fix CI --open-files failure

Fix syntax error in setup.cfg

Try git+https instead of git+ssh
Resolve rebase conflict

Fix isort issues

Experiment with daily cron failure

Address Python bug

Check if we still need the skipif decorator

We do need the skipif to prevent the daily cron from failing
@barentsen (Contributor, Author) commented Oct 1, 2022

@pllim @bsipocz I squashed it down to 15 commits now. Happy to squash it down into 1 commit as well?

CI continues to be happy. Ready to be merged on my end 👍

@pllim (Member) left a comment

I don't grok everything but I am confident other reviewers went through the cloud guts. And CI is green, which is a good sign.

Just minor comments and one last concern about the global warning ignore (can it be local?).

Review threads (resolved) on: astropy/io/fits/tests/test_fsspec.py, astropy/io/fits/hdu/hdulist.py, docs/install.rst
setup.cfg:

    @@ -132,6 +134,7 @@ filterwarnings =
        error
        ignore:unclosed <socket:ResourceWarning
        ignore:unclosed <ssl.SSLSocket:ResourceWarning
        ignore:unclosed transport <asyncio.sslproto
Member:

If this was happening only for one test, can we just ignore this one for that test instead of a global ignore like this?

@barentsen (Contributor, Author):

True! I'll go ahead and see if I can replace it with a local @pytest.mark.filterwarnings decorator instead.

barentsen and others added 4 commits October 3, 2022 08:51
Co-authored-by: P. L. Lim <2090236+pllim@users.noreply.github.com>
Co-authored-by: P. L. Lim <2090236+pllim@users.noreply.github.com>
Co-authored-by: P. L. Lim <2090236+pllim@users.noreply.github.com>

@pllim (Member) commented Oct 3, 2022

linkcheck failure is unrelated and can be ignored for this PR.

@pllim (Member) commented Oct 3, 2022

Geert, go take a break. The cron is gonna take a few hours. 😆 ☕

@pllim pllim merged commit 62cdba2 into astropy:main Oct 3, 2022
@pllim
Copy link
Member

pllim commented Oct 3, 2022

Merged. Thank you so much for your contribution and patience!

Hopefully we will smoke out any remaining issues before v5.2 release. 😸

@barentsen (Contributor, Author) commented

Thanks @pllim! I'm on standby to resolve any issues before v5.2.
