Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep metadata of sample datasets in the xarray objects #184

Merged
merged 5 commits into from Apr 7, 2021

Conversation

nshea3
Copy link
Member

@nshea3 nshea3 commented Aug 3, 2020

Fixes #125

Many global attributes are included in the netcdf files, however they are dropped in the fetch_* functions in sample_data.py with the float64 cast: data = xr.open_dataset(fname, engine="scipy").astype("float64"). This is a known issue in xarray (pydata/xarray#2049) and the current solution is to open the NetCDF file as-is, cast to the new datatype, and then copy the global attributes to the new object.

Also added tests to test_sample_data.py to check that the attrs attribute is not an empty dictionary.

Reminders:

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
  • Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
  • Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

Copy link
Member

@santisoler santisoler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nshea3 for spotting the true problem and solving it!

I would make small changes on you code.
By defining data and cast_data we will be storing the same data twice: one in float64 and another one in the original type. So, what if we cache the attrs on a variable, change the datatype and then set the attrs equal to the cached dictionary. Something like:

    data = xr.open_dataset(fname, engine="scipy")
    attrs = data.attrs.copy()
    data = data.astype("float64")
    data.attrs = attrs

I think this should work just fine, but worth testing it of course.

Besides, I would also test if the attrs, besides not being empty, has some important attributes, like refsysname.

What do you think?

@santisoler
Copy link
Member

Since we are about to move all datasets to Rockhound (see #179), we should remember to apply the same changes we perform here on fatiando/rockhound#84.

@nshea3
Copy link
Member Author

nshea3 commented Aug 7, 2020

Excellent point thank you @santisoler I will make that change so we avoid duplicating the xarray.

I'll add a test for the refsysname attribute as well. Are there any others I should include?

@santisoler
Copy link
Member

I'll add a test for the refsysname attribute as well. Are there any others I should include?

Maybe we can add those attributes of which the data depends on.
For example, the refsysname is crucial because it sets the reference frame (and which normal gravity was removed if we are requesting for gravity anomalies or disturbances).
I would also test the modelname, the max_used_degree and probably the tide_system.
Another things we might want to test if the min and maximum values agree with the ones that are hardcoded in the tests/sample_data.py:

def test_gravity_earth():
"Sanity checks for the loaded grid"
grid = fetch_gravity_earth()
assert grid.gravity.shape == (361, 721)
npt.assert_allclose(grid.gravity.max(), 9.8018358e05)
npt.assert_allclose(grid.gravity.min(), 9.7476403e05)
assert grid.height_over_ell.shape == (361, 721)
npt.assert_allclose(grid.height_over_ell, 10000)

Feel free to add as many tests as you think, more tests is usually a good thing to have!

@nshea3
Copy link
Member Author

nshea3 commented Aug 17, 2020

Hi @santisoler, I have made the change you suggested, the attrs are now copied to a dictionary and set after the conversion.

I have also added several tests for the presence of specific attributes: refsysname, max_used_degree, modelname, and tide_system.

I have not yet changed the netcdf files as discussed in #124. I started editing the files with the netcdf4-python library but I realized that the global attributes do not update as changes are made to the files, so the global attributes would be incorrect.

I am currently trying to regenerate the files from the ICGEM website (http://icgem.gfz-potsdam.de/calcgrid) without the repeated longitudes but the only output option is a .gdf file and I do not know how to convert that to a .cdl or .nc file?

Copy link
Member

@santisoler santisoler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nshea3. Thanks for making the changes! I think they are good to go! I will just make small changes on the comment lines, just to update them.

I have not yet changed the netcdf files as discussed in #124. I started editing the files with the netcdf4-python library but I realized that the global attributes do not update as changes are made to the files, so the global attributes would be incorrect.

Don't worry about #124 here. Now that you figured out that the missing attributes weren't in fact loaded by the function (but still in the files), these two issues are somewhat unrelated. So we can solve them in different PRs.

I am currently trying to regenerate the files from the ICGEM website (http://icgem.gfz-potsdam.de/calcgrid) without the repeated longitudes but the only output option is a .gdf file and I do not know how to convert that to a .cdl or .nc file?

Oh, you can use harmonica.load_icgem_gdf to load .gdf files into xarray.Dataset, which can be stored as netCDF files with xr.Dataset.to_netcdf.

@santisoler
Copy link
Member

@nshea3 Sorry for stalling this PR for too long. I don't know why it kept under my radar for too long.
I'll merge right away if you agree.

We are planning a new Harmonica release for the next week(s). Thanks for the contribution!

@santisoler santisoler added this to the v0.2.0 milestone Mar 29, 2021
@santisoler santisoler changed the title Missing attributes copied from xarray before cast Keep metadata of sample datasets in the xarray objects Apr 7, 2021
@santisoler santisoler merged commit 62f3e65 into fatiando:master Apr 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing attributes on ICGEM datasets
2 participants