Make xarray datasets discoverable #6

Open
jasongilman opened this issue Sep 15, 2018 · 2 comments

@jasongilman

There are a large and growing number of publicly-available datasets that are loadable into xarray from buckets in the Cloud. Currently, however, there is no effective way to discover these datasets.

Using standards like the OGC Catalog Service for the Web (CSW) and OpenSearch, it would be possible to discover these xarray datasets via sites like data.gov (and data.gov.uk, data.gov.au, etc.), but this requires producing the ISO metadata that these sites consume.

It would also be possible to discover xarray datasets via sites like Google's dataset search, but it would be necessary to produce the json-ld metadata that these sites consume.

Since xarray preserves the content of datasets that follow the CF and ACDD metadata conventions, it should be possible to generate both types of metadata in a straightforward way from the xarray dataset object, using metadata tools that have already been developed for datasets that adhere to the CF conventions. The ncISO tool already generates ISO records from netCDF files or OPeNDAP endpoints, so its mapping from CF/ACDD attributes to ISO could be reused for records from xarray. Similarly, work has already been done to create nco-json metadata from netCDF files, a complete metadata representation from which the json-ld content could be extracted.

Proposed Work:

  • Develop code that integrates the nco-json spec, which represents the complete metadata of the xarray object, into the xarray package.

  • Develop code that, from the complete nco-json metadata associated with xarray objects, generates the more restrictive ISO and json-ld metadata formats.
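The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `collect_metadata` and `to_json_ld` are hypothetical names, the intermediate dictionary is loosely modeled on nco-json rather than a faithful implementation of the spec, and the ACDD-to-schema.org mapping is a small hand-picked subset, not the mapping ncISO or any other existing tool uses. The stand-in dataset only mimics the `.attrs`, `.sizes`, and `.variables` members of an xarray dataset, and its attribute values are invented.

```python
import json
from types import SimpleNamespace


def collect_metadata(ds):
    """Step 1: gather dimensions, variable attributes, and global attributes.

    `ds` is any object exposing `.attrs`, `.sizes`, and `.variables` the way
    an xarray dataset does. The layout is loosely modeled on nco-json.
    """
    return {
        "dimensions": {name: int(size) for name, size in ds.sizes.items()},
        "variables": {
            name: {
                "shape": list(var.dims),
                "type": str(var.dtype),
                "attributes": dict(var.attrs),
            }
            for name, var in ds.variables.items()
        },
        "attributes": dict(ds.attrs),
    }


# Assumption: a partial, illustrative mapping from ACDD global attributes
# to schema.org Dataset properties.
ACDD_TO_SCHEMA_ORG = {
    "title": "name",
    "summary": "description",
    "license": "license",
}


def to_json_ld(metadata):
    """Step 2: derive a more restrictive json-ld record from the full metadata."""
    attrs = metadata["attributes"]
    record = {"@context": "https://schema.org/", "@type": "Dataset"}
    for acdd_key, schema_key in ACDD_TO_SCHEMA_ORG.items():
        if acdd_key in attrs:
            record[schema_key] = attrs[acdd_key]
    if "keywords" in attrs:
        # ACDD stores keywords as one comma-separated string.
        record["keywords"] = [
            word.strip() for word in attrs["keywords"].split(",") if word.strip()
        ]
    start = attrs.get("time_coverage_start")
    end = attrs.get("time_coverage_end")
    if start and end:
        # schema.org accepts an ISO 8601 interval written as "start/end".
        record["temporalCoverage"] = f"{start}/{end}"
    return record


# Stand-in for an xarray dataset; every value here is invented for the demo.
fake_ds = SimpleNamespace(
    attrs={
        "title": "Example sea surface temperature",
        "summary": "Made-up dataset used only to exercise the sketch.",
        "keywords": "ocean, temperature",
        "time_coverage_start": "2000-01-01",
        "time_coverage_end": "2000-12-31",
    },
    sizes={"time": 12, "lat": 180},
    variables={
        "sst": SimpleNamespace(
            dims=("time", "lat"), dtype="float32", attrs={"units": "K"}
        )
    },
)

print(json.dumps(to_json_ld(collect_metadata(fake_ds)), indent=2))
```

With a real xarray dataset, attribute values may be NumPy scalars or arrays, which `json.dumps` does not serialize by default, so a conversion step would likely be needed before writing the records out.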

@rabernat

> There are a large and growing number of publicly-available datasets that are loadable into xarray from buckets in the Cloud.

Can you give some examples of this?

The ones I know about are the datasets we have put online in zarr format in Pangeo. (Some docs about this process here: http://pangeo.io/data.html#data-in-the-cloud). Cataloging these datasets is an open issue (pangeo-data/pangeo#39)

The current problem with hosting xarray data in the cloud is that HDF does not play well with cloud storage. This is a technical obstacle that is being discussed in many places across xarray, zarr, netCDF, etc. That's why I'm curious about your claim that there are already a large number of publicly available cloud datasets that play well with xarray.

All that said, I am supportive of this idea in general.

@apawloski

We were actually thinking about the Pangeo datasets. The term "large" is subjective, of course; we meant large enough to warrant a catalog, as in pangeo-data/pangeo#39. We experimented with something along these lines a few weeks ago at the Pangeo workshop: https://gist.github.com/rsignell-usgs/88cfae22896bf9fed5bd36a6689e7210. The goal would be to facilitate discovery of these datasets through their attributes/metadata.
