Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design: active versus passive coordinate assignment #2

Closed
djhoese opened this issue Nov 11, 2018 · 8 comments
Closed

Design: active versus passive coordinate assignment #2

djhoese opened this issue Nov 11, 2018 · 8 comments
Assignees
Labels
question Further information is requested

Comments

@djhoese
Copy link
Contributor

djhoese commented Nov 11, 2018

I'm going to start making issues here on this repository so pydata/xarray#2288 doesn't get any longer. This issue is regarding what the best and most useful way is for geoxarray to add or provide information to xarray objects. This library will likely be heavily influenced by metpy's handling (@dopplershift) of CF metadata: https://github.com/Unidata/MetPy/blob/master/metpy/xarray.py#L152. From talking with @dopplershift on gitter, he mentioned @shoyer had suggested adding a crs coordinate to DataArrays.

The main question is should geoxarray assign things to the user provided xarray object (what I'm calling "active" behavior) or should geoxarray make things available for other libraries to use by calculating things when needed ("passive"). I haven't fully laid out the design for the passive behavior, that's what this issue is for.

Active

Basic behavior:

  • Requires user to request the change (ex. new_ds = my_dataset.metpy.parse_cf())
  • Existing coordinates and attribute may be changed (x and y dimensions) to match expectations and "standardize" the data object
  • Coordinates will be added (my_data_arr.coords['crs'] = CRS(...))
  • Anyone using the new DataArray or Dataset can assume that x and y coordinates are in meters and that a crs coordinate defines the CRS.

Pros:

  • Arithmetic and merging takes the CRS coordinate in to account
  • Makes the data more useable in the xarray model without knowing of the parsing library (metpy/geoxarray)

Cons:

  • Requires changing metadata that the user may have wanted use
  • Cannot handle Datasets with multiple CRSs.

Passive

Basic behavior:

  • CRS and coordinate information is calculated/determined when requested (ex. my_data_arr.geo.crs where geo is the geoxarray accessor)
  • Information is cached and stored in the xarray accessor
  • Any special handling of metadata can be configured (ex. my_data_arr.geo.set_xy_dims('x2', 'y2')) where handling all variants of dimension/coordinate names would be difficult otherwise.

Pros:

  • Doesn't modify the data the user is using
  • Handles DataArrays separately so multiple CRSes in the same Dataset are allowed (technically)

Cons:

  • No handling of CRS in arithmetic and merging
  • Requires the using library (cartopy, etc) to know stuff about geoxarray or for geoxarray to pass information in a way that the library understands. This is technically needed by both.

Example usages

See also MetPy's example.

import xarray as xr
import geoxarray as gxr

my_dataset = xr.open_dataset('/path/to/file.nc')

# active (taken from metpy)
data_var = my_dataset.geo.parse_cf('Temperature')
...
ax = fig.add_subplot(1, 1, 1, projection=data_var.geo.cartopy_crs)
x = data_var.x
y = data_var.y
ax.imshow(im_data, extent=(x.min(), x.max(), y.min(), y.max()),
          cmap='RdBu', origin='lower' if y[0] < y[-1] else 'upper')

# passive

data_var = my_dataset['Temperature']
ax = fig.add_subplot(1, 1, 1, projection=data_var.geo.cartopy_crs)
x = data_var.geo.x
y = data_var.geo.y
ax.imshow(im_data, extent=(x.min(), x.max(), y.min(), y.max()),
          cmap='RdBu', origin='lower' if y[0] < y[-1] else 'upper')

Conclusion

I like active better I think even though passive was my original plan for solving things. I think both methods require more information from the user in special cases (accepting non-CF/non-standard coordinate/dimension names). Both require the user or using library to expect certain things about the data or require the user to know how to pass things to that library (cartopy_crs, etc). Some of the metpy CRS handling may also be handled by the pycrs library.

My biggest flexibility issue was the active method not being able to handle multiple CRS in one Dataset, but realize now that the user just needs to split the dataset up by CRS.

Obviously feedback and concerns are welcome (@mraspaud, @pnuu, @leouieda, @fmaussion, @snowman2, @karimbahgat).

@djhoese djhoese added the question Further information is requested label Nov 11, 2018
@djhoese djhoese self-assigned this Nov 11, 2018
@snowman2
Copy link
Collaborator

This is a great start for the discussion. I like how you have clearly laid out everything and have some great points. I have a couple of items to add to your comments that could potentially aid in the discussion.

  1. From my experience, the active method can support both geographic and projected coordinate systems. You just specify the projection in the crs coordinate connected to the grid_mapping attribute and put the correct coordinates in x and y and it works. One alternative I have seen to differentiate that the system is to put the coordinates in longitute and latitude variables. This is a nice alternative as it makes it more clear that it is a geographic coordinate system. However, it is not required as the crs coordinate can store the projection information and would be just fine without.

  2. With a slightly more complex design, the active method could support multiple projections in the same dataset via the grid_mapping attribute on each variable and additional crs coordinates. There is probably a more elegant method, but this is just food for thought at the moment.

Here is an example of how multiple CRS in the same dataset might work in the active method:

<xarray.Dataset>
Dimensions:      (x: 65, y: 31, x1: 30, y1: 30)
Coordinates:
  * x            (x) float64 ...
  * y            (y) float64  ...
  * x1           (x1) float64 ...
  * y1           (y1) float64  ...
    time         datetime64[ns] ...
    crs          int64 ...
    crs1          int64 ...
Data variables:
    ndvi         (y, x) float64 ...
    ndvi1         (y1, x1) float64 ...
Attributes:

And on the variables:

<xarray.DataArray 'ndvi' (y: 31, x: 65)>
array([[ ...]])
Coordinates:
  * x            (x) float64 ...
  * y            (y) float64 ...
    time         datetime64[ns] ...
    crs          int64 0
Attributes:
    grid_mapping:  crs
<xarray.DataArray 'ndvi1` (y1: 30, x1: 30)>
array([[ ...]])
Coordinates:
  * x1           (x1) float64 ...
  * y1            (y1) float64 ...
    time         datetime64[ns] ...
    crs1          int64 0
Attributes:
    grid_mapping:  crs1

@djhoese
Copy link
Contributor Author

djhoese commented Nov 12, 2018

@snowman2 Thanks. For your first point, metpy currently does this by creating a CRS stating it is handling lat/lon data. Really it looks like it is trying to see if the coordinates for a variable have coordinates that are lon/lat variables: https://github.com/Unidata/MetPy/blob/master/metpy/xarray.py#L187

For 2, I could see one big issue with this being that attributes tend to disappear by default in xarray so any arithmetic on these DataArrays/Dataset and the grid_mapping attribute would disappear. Right?

@djhoese
Copy link
Contributor Author

djhoese commented Nov 12, 2018

Also related: Unidata/MetPy#960

@snowman2
Copy link
Collaborator

I could see one big issue with this being that attributes tend to disappear by default in xarray so any arithmetic on these DataArrays/Dataset and the grid_mapping attribute would disappear. Right?

Yep, that is definitely something to be wary about. If you choose to go down this route, you would have to have it well documented and maybe raise warnings/exceptions if it is missing when more than one crs like coordinate is present. If you want to produce a CF compliant file, the grid_mapping attribute will definitely have to be there. 🤔

@jthielen
Copy link

jthielen commented Nov 12, 2018

@djhoese MetPy's handling of simultaneous x/y and longitude/latitude might not be optimal right now. Currently, the line you referenced should only come into play if the grid_mapping on a variable is missing (either the variable corresponding to the attribute or the attribute itself) and longitude and latitude coordinates are found. Otherwise, the attributes of the variable corresponding to the grid_mapping are used directly create the CFProjection object, which becomes the crs coordinate (see here).

In terms of coordinate recognition, if both projection x/y coordinates and lon/lat coordinates are present for a variable, MetPy will try to default to the projection coordinates, and if those are not unique, then to dimension coordinates, as in this function:

https://github.com/Unidata/MetPy/blob/master/metpy/xarray.py#L345-L368

I'm not sure if this is the best approach, but this is what we had settled on this past summer.

@fmaussion
Copy link

Thanks for the summary @djhoese ! My 2 cents:

  • I think you won't be able to avoid active handling altogether - especially if you want to support less conventional datasets. I know of modelling products which have parsable CRS at the dataset level only (not the variable level). If you want to support these, you'll have to actively add (meta)data to all variables of the dataset beforehand.
  • an elegant way to solve this (I believe) it to offer a dataset constructor at the library level: gxr.open_geodataset() (or simply: gxr.open_dataset). This is how salem does it.
  • The active version doesn't need to be too intrusive, i.e. if possible add things as opposed to change things.
  • As mentioned earlier, the CRS info has to be stored as a coordinate, not an attr. I don't see much downsides with this approach though.

@djhoese
Copy link
Contributor Author

djhoese commented Nov 14, 2018

  • For your first point: understood. Things like these are why the SatPy library exists for satellite instrument data. People keep inventing new standards even within standards ("your file says it follows CF conventions but why are your flag meanings only in a 'description' attribute?").
  • Agreed. Although, since geoxarray will probably be limited to only geolocation information parsing I could see having an open_dataset that does extra stuff that is only available in that function being an issue. For example, if another library provides an open_dataset to handle some specific file format but doesn't handle geolocation then geoxarray would need to provide a way to parse the geolocation post-open. This may be obvious, just thinking "out loud".
  • I guess that would be the goal and if the user calls a "parse" like method then I suppose there could be keyword arguments for what it is allowed to do with the understanding by the user that future operations may not work perfectly. For example, you tell geoxarray not to add/overwrite x/y coordinates then you run the risk of xarray merging things in undesired ways.
  • The only downside is as I said earlier, that if the coordinate is called crs in every DataArray then putting data in to a Dataset is not allowed by xarray because of conflicting CRS coordinates. This is mainly an issue for CF-compliant NetCDF files where they could have multiple "grid_mapping" variables.

I guess an extension of this issue's general question is: should users be expected to know what "structure" their data is in? Should they know that geolocation can be retrieved from conventional CF variables (open_cf_dataset) or should geoxarray be able to figure it out? I think geoxarray could probably figure things out in most cases and either raise an exception or produce a warning or maybe silently not provide any geolocation information if it can't figure it out. Geoxarray can ultimately provide a way to customize how it parses geolocation.

@djhoese
Copy link
Contributor Author

djhoese commented Jul 17, 2023

I'm going to close this issue as I think in #24 and the initial 0.1.0 version (now released), geoxarray follows kind of an half-way design between active and passive behavior. On one hand, doing .geo.crs will try its best to find "standard" CRS information. On the other, geoxarray needs some in-memory way of storing CRS information that doesn't get lost between operations...or at least not easily lost. @snowman2's experience with rioxarray is a huge benefit to geoxarray when it comes to this point and I've followed those lessons learned in the current version of geoxarray. Mainly:

  1. When asked to store CRS information (via .geo.write_crs), store it in a CF-compatible manner.
  2. Store the grid mapping variable as a coordinate variable in .coords.
  3. Store the name of the grid mapping variable to use for each DataArray in the .encoding of that DataArray.
  4. Use crs_wkt as a fallback for CRS cases that CF doesn't completely define while still being CF compliant.
  5. Replicate that WKT to a spatial_ref attribute in the grid mapping variable for GDAL compatibility.

Until I/we/geoxarray have more use cases (data loading all the way to final usage), I don't think we can say for certain which way each possible feature of geoxarray should be implemented. The basic functionality have an inplace keyword argument gets around a lot of the "hhhmmm should we really be modifying the user's object?" type of confusion/decisions.

People can feel free to continue to comment here, but in my opinion specific cases should probably be discussed in new issues.

@djhoese djhoese closed this as completed Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants