
access to (ICON model) output and input #96

Closed
felix-mue opened this issue Jun 15, 2023 · 12 comments
@felix-mue

I am interested in data from the ICON model runs (in- and output), but this question could be generalised to other data as well.

  1. Is the output from the ICON model runs publicly available? The eurec4a intake repository folder for ICON contains yaml files with links to dkrz, which I suspect to be the location of the output data, but I do not understand how to access them.
  2. Are the corresponding input settings files publicly available as well? (They might be bundled directly with the output, of course.)
@observingClouds added the documentation and question labels Jun 15, 2023
@observingClouds observingClouds self-assigned this Jun 15, 2023
@observingClouds
Collaborator

Hi @felix-mue,
Thank you very much for reaching out to us and raising your question here.

To 1)
You can access most (if not all) datasets listed on the howto.eurec4a.eu page via our intake catalog, without even needing to know where they are stored 😎.

import eurec4a
cat = eurec4a.get_intake_catalog()
datasets = list(cat.simulations.ICON.LES_CampaignDomain_control)  # show all available entries of a catalog level
ds = cat.simulations.ICON.LES_CampaignDomain_control.surface_DOM01.to_dask()  # lazy loading of data

In addition, this will only download the data that you are actually using in your analysis (keyword: lazy loading). No need to download all the TB of output 🥳

Please try it out! Does this answer your first question?

@observingClouds
Collaborator

To 2)
The run-scripts are available at the experiment repository. Please let me know if you have access to those.

@felix-mue
Author

Thanks for the quick reply!
Yes, I have access to the other repository. I will work through the data handling and the files and get back to you when something comes up.

@felix-mue
Author

About accessing the data: while lazy loading is great in many cases, for me it would actually be helpful to have one big download of the data (maybe subset by variables). Is that available as well?

@observingClouds
Collaborator

May I ask what your application is? The latency to access the files here should be fairly low and loading the data lazily ensures that you will always access the latest version.

At https://howto.eurec4a.eu/eurec4a_mip.html we show how you can download data with wget. You can find the paths in the eurec4a catalog files, e.g. here
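For illustration, here is a minimal Python sketch of how the download links could be pulled out of a catalog file. The YAML excerpt below is made up; real entries live in the eurec4a-intake repository and may be structured differently.

```python
import re

# Hypothetical excerpt of a catalog entry; the real catalog files in the
# eurec4a-intake repository may use a different layout.
catalog_text = """
sources:
  surface_DOM01:
    driver: zarr
    args:
      urlpath: https://swiftbrowser.dkrz.de/public/example/surface_DOM01.zarr
"""

# Grep all urlpath entries; these are the links you would pass to wget.
links = re.findall(r"urlpath:\s*(\S+)", catalog_text)
print(links)
```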

@felix-mue
Author

A simple barrier, sadly: our code runs in MATLAB, not Python. So I have to access the data from MATLAB, and I assumed that isn't possible with the Python package.

@observingClouds
Collaborator

observingClouds commented Jun 22, 2023

Sorry to hear that! Maybe it's time for a change 🥳 MATLAB supports yaml files, so you could read those files and grep the links. But honestly, it seems like you would need to reinvent the wheel. MATLAB's Python support might also be something to look into, but I'd be surprised if it works well.

Another issue you might face with MATLAB is that the simulations are saved in the zarr format. It seems that MATLAB has no dedicated driver for this format yet. However, netCDF now supports zarr alongside HDF5 as a backend in its newer library versions. You should therefore be able to load the zarr files (after downloading them) through the netCDF library. The syntax is, however, a bit unusual.

So, here is an example of how you can download a zarr file from the catalog and read it with the netCDF library:

  1. Download the data with wget:
wget -r -H -N --cut-dirs=3 --include-directories="/v1/" "https://swiftbrowser.dkrz.de/public/dkrz_948e7d4bbfbb445fbff5315fc433e36a/EUREC4A_LES/experiment_2/meteograms/EUREC4A_ICON-LES_control_meteogram_DOM03_BCO.zarr/?show_all"

Note the change of the prefix and ending of the URL compared to the one given in the catalog.

  2. Note that wget creates two directories (swift.dkrz.de, swiftbrowser.dkrz.de). The actual dataset is in swift.dkrz.de.
  3. Construct the absolute path of the zarr file following the scheme: file:///path/to/zarr/file.zarr#mode=xarray
  4. You should be able to use this path with your favourite netCDF tool, e.g.
ncdump -h "file:///path/to/swift.dkrz.de/experiment_2/meteograms/EUREC4A_ICON-LES_control_meteogram_DOM03_BCO.zarr#mode=xarray"

Unfortunately, reading a variable from this dataset does not work on my end in this particular case. It might be that the compressor used is not supported (although it seems to be), or that the blosc library (we use lz4 as the compressor here) is not linked to the netCDF library.

ncdump -v time "file:///path/to/swift.dkrz.de/experiment_2/meteograms/EUREC4A_ICON-LES_control_meteogram_DOM03_BCO.zarr#mode=xarray"

returns the metadata and then

data:

NetCDF: Filter error: undefined filter encountered
Location: file ?; fcn ? line 478
 time = % 

@observingClouds
Collaborator

@d70-t do you have an idea what is going on here? The filter in .zmetadata/.zarray is actually null and deleting it entirely does not help to solve the problem.

@d70-t
Collaborator

d70-t commented Jun 22, 2023

.zmetadata is a zarr-python extension, which as far as I know hasn't been adopted by netCDF yet. But .zarray is used.

You probably need netCDF >= 4.9 and there are some steps required for setting up netCDF to run with filters.

@d70-t
Collaborator

d70-t commented Jun 22, 2023

Download the data with wget

If you really, really want a download of a subset, I'd probably recommend just opening the data with intake / xarray and then doing some ds[[vars...]].sel(...).to_netcdf(). But just as @observingClouds said, I haven't yet come across a case in which downloading would be so much better that it would justify the additional hassle involved.
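As a sketch of that subset-and-save pattern, here is a self-contained example with a synthetic stand-in dataset; the variable names are invented for illustration, and in practice you would open the real dataset via the intake catalog instead.

```python
import numpy as np
import xarray as xr

# Stand-in for a dataset opened lazily from the catalog, e.g.
# ds = cat.simulations.ICON.LES_CampaignDomain_control.surface_DOM01.to_dask()
# The variable names below are hypothetical.
ds = xr.Dataset(
    {
        "t_2m": (("time", "cell"), np.random.rand(4, 3)),
        "qv_2m": (("time", "cell"), np.random.rand(4, 3)),
        "pres_sfc": (("time", "cell"), np.random.rand(4, 3)),
    },
    coords={"time": np.arange(4, dtype="int32")},
)

# Keep only the variables and time steps of interest, then write a netCDF file.
subset = ds[["t_2m", "qv_2m"]].isel(time=slice(0, 2))
subset.to_netcdf("icon_subset.nc")
```

With a lazily opened catalog dataset, only the selected chunks are fetched before the write, so this doubles as a way to download just a subset.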

@felix-mue
Author

Thanks a lot to both of you! I agree, of course I'd rather not download. I just didn't see a way to access it otherwise (within MATLAB).

I will try the MATLAB–Python interoperability features @observingClouds mentioned, but I don't have high hopes.
I am also downloading some data in parallel to see if that gets me further.

@felix-mue
Author

I ended up downloading the data with a Python script and saving them as netCDF files. This is of course unfortunate, because the pythonic way of accessing this data is much more comfortable! Thanks a lot again for your help and for providing the data in the first place!
