Proposal for a hierarchical parent-child metadata standard #62
Comments
Hi all - The GOES Geostationary Lightning Mapper (GLM) data is stored as netCDF-4 files containing CF DSG point data (though running one through the CF checker gave me some errors). Each file contains three different point data features each with its own observation dimension. The three point data features (events, groups, and flashes) are related by a hierarchical parent-child relationship. It is this hierarchical parent-child relationship the CF-Tree proposal would like to standardize. In the GLM data, each event represents a pixel where lightning was detected during a time step. Each group represents a cluster (in time and space) of events. Each flash represents a cluster (again in time and space) of groups. Here’s an example CDL (a simplified version of the CDL in the CF-Tree Google Document that Eric, @deeplycloudy, referenced above):
The above CDL is the bare bones of the CF-Tree proposal. The CF-Tree document proposes some explicit declarations of the various roles ( So, to get started, any thoughts on how/where this might fit in CF? To me, the CF-Tree concept seems related to Cell Bounds and Geometries -- in the GLM example, the @davidhassell - Any other data model thoughts? [In case it's useful, current GLM datasets are a bit different and can be found here. Navigate down to a recent dataset; selecting the “CdmRemote” access method will return the CDL for the dataset.]
Dear @ethanrd, I think you could regard this as a DSG with two element dimensions, like It could be called a Best wishes Jonathan
Hi @ethanrd, Thanks for this summary - very useful for getting into it. I'm thinking that this fits in logically with cell methods. It seems different to the current encoding of DSG that @JonathanGregory describes (for which thanks). It feels instead more like a modified cell methods, that re-uses some of the DSG machinery. In your example, one would need to know what the relationship between two adjacent tree elements is. For example, you have connected a subset of groups to a particular flash, but I don't know how those groups have been combined to produce each flash value. What I'm thinking of is cell methods modified such that in the usual "name: method", the name is actually another data variable, or other dimension, rather than a dimension of the data or a standard name. In essence, this would tell us that the data values in our data variable are in fact a function of the elements of another data variable (rather than a function of elements of a higher resolution version of itself, as is usual). This also provides a link from the data variable to the next parent level in the tree, i.e. the adjacent one with finer granularity, although that link will also always be explicit through other data variable attributes. In your CDL this could look something like:
The first thing to note is that we may, quite correctly, not want or have actual cell method attributes to attach, so we still need an explicit indicator of the child (data variable or dimension), along the lines of the This example would be interpreted as follows:
Data model
The data model would not be affected if this were only fancy compression (like DSG), but this doesn't look like that to me. Based on what I've been thinking above, this looks like it is straying into the realm of connecting two field constructs in such a way that one logically depends on the other. This would be a new concept. (It is also something that linking vector components would want to do, for example.)
Hi all, I've read through the proposal and comments and skim-read the detailed proposal. It seems to me that there are good motivations and benefits, though I am still getting my head around the specific nature of the proposal. I thought I'd comment because there is a lot of text, with some CDL examples, to describe the proposal and the context that motivated it, without any diagrams except the figure from the paper Bruning et al. included in the external document which is specific to your use-case. Personally I think it is easier to understand the meat of the proposal if it is abstracted out to not refer to any particular use context, and think others may find that too (though I appreciate it is important to demonstrate why the idea would be helpful in practice). Therefore I think it would be very helpful to have a diagram if you could produce some sort of schematic that covers the abstract nature of the proposal (no mention of GLM, flashes, etc.). The following:
implies the general idea so I suppose it could be a tree with labelling to describe for instance what can and cannot be inherited from parent to child, and/or what must be defined on a parent and on a child. UML or similar for the inheritance details would also be really useful, if you were able to illustrate in that way. Would it be possible for you to create a diagram to illustrate the key ideas in the abstract? I don't think it would need to be very detailed to be useful (at the very least to me!) Thanks.
Title
CF-Tree: Hierarchical parent-child data in the Climate and Forecast Metadata framework
Authors
Eric Bruning (@deeplycloudy): Texas Tech University, Lubbock, TX
Ethan Davis (@ethanrd), Ryan May (@dopplershift), Sean Arms (@lesserwhirls): UCAR/Unidata, Boulder, CO.
Requirement Summary
This is a proposal to define a formal metadata standard under the CF conventions for hierarchical parent-child relationships of arbitrary depth, for data with zero to many associated spatiotemporal or other dimensions. We propose the name CF-tree to help the user picture the data linkages implied by the metadata.
Much of the text here is repeated from the complete proposal linked below, which should be fully open for comments. I kept things brief here out of courtesy, but am not opposed to pasting the complete proposal as necessary. I assume the complete proposal text, if encouraged, will eventually need to be posted in full as an issue/PR on the cf-conventions repository.
Technical Proposal Summary
The datasets to which this proposed standard applies have in common a parent-child ID variable that links two or more tree dimensions. Other variables might be associated with these IDs by dimension, or by explicit use of the ID key on a variable that does not share the dimension.
Such data structures are like the foreign key relationships used in databases. They specify a one-to-many hierarchical relationship. This may be visualized as a directed graph down a tree. Other relevant theory includes connected-component labeling.
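As a minimal sketch of the foreign-key analogy, assuming a two-level tree held in plain Python lists: each child carries the ID of its parent, and a one-to-many index is built by grouping children on that key. The variable names and values here (`flash_id`, `group_id`, `group_parent_flash_id`) and the helper `children_by_parent` are illustrative assumptions, not taken from the proposal text or from an actual GLM file.

```python
# Hypothetical toy data: names and values are illustrative only.
flash_id = [10, 11]                     # parent dimension (2 flashes)
group_id = [100, 101, 102]              # child dimension (3 groups)
group_parent_flash_id = [10, 10, 11]    # foreign key: one parent ID per group


def children_by_parent(parent_ids, child_ids, child_parent_ids):
    """Map each parent ID to the list of child IDs that reference it."""
    index = {pid: [] for pid in parent_ids}
    for cid, pid in zip(child_ids, child_parent_ids):
        if pid not in index:
            # A child pointing at a missing parent breaks the one-to-many tree.
            raise ValueError(f"child {cid} references unknown parent {pid}")
        index[pid].append(cid)
    return index


groups_of_flash = children_by_parent(flash_id, group_id, group_parent_flash_id)
print(groups_of_flash)
```

Because each child stores exactly one parent ID, the relationship is one-to-many by construction; a many-to-many structure would need a separate link table.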
Recognizing these common features, the mind wanders to many-to-many relationships, arbitrary and possibly directed graph structures, and even the specification of unstructured grids or, better, standardized handling of ragged arrays. However, those applications are out of scope for this proposal, which is focused on the more straightforward one-to-many problem.
We also note that database-like groupby functionality and labeled coordinate indexing exist in two popular Python data science libraries, pandas and xarray. The latter is aimed at extending the ideas in pandas to multidimensional data, and seeks to implement the CF conventions. In our proof-of-concept implementation of machine-facilitated traversal of the Geostationary Lightning Mapper data tree, we recognized several fundamental operations, which we implemented in a generic way using xarray.groupby:
Benefits
Development of a concrete metadata standard would stimulate progress toward a standardized implementation of machine-automated traversal of hierarchical tree structures in CF-honoring packages (e.g., xarray) that are directly used by domain science practitioners.
This proposal arises from our work with the Geostationary Lightning Mapper (GLM) on the GOES-16 and GOES-17 meteorological satellites, which already implement the data model proposed herein; we are proposing a formalization of some implicit conventions and minor additions to flag the conformance to those conventions.
The GLM traversal code example we developed to demonstrate this is open source, and includes unit tests for data structures beyond GLM itself. The future we have in mind is sufficient metadata so that xarray or similar libraries could recognize the hierarchical structure, and perform such fundamental operations for the user without the user having to walk the tree. Right now, the traversal is manually configured from user knowledge of the hierarchy’s ID variables, and probably not as generalized as it could be.
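To illustrate what "manually configured" traversal means, here is a rough sketch in plain Python rather than xarray. It is not the authors' implementation. The `levels` list plays the role of the user's knowledge of the hierarchy's ID variables, which is the information the proposed metadata would let a library discover on its own. All variable names, values, and the `descend` helper are assumptions made for illustration.

```python
# Hypothetical three-level tree; names and values are illustrative only.
data = {
    "flash_id": [1, 2],
    "group_id": [10, 11, 12],
    "group_parent_flash_id": [1, 1, 2],
    "event_id": [100, 101, 102, 103],
    "event_parent_group_id": [10, 11, 11, 12],
}

# Manual configuration: (ID variable, parent-ID variable) per level, top down.
levels = [
    ("flash_id", None),
    ("group_id", "group_parent_flash_id"),
    ("event_id", "event_parent_group_id"),
]


def descend(data, levels, top_ids):
    """Return, per level, the IDs reachable from the selected top-level IDs."""
    wanted = set(top_ids)
    top_var = levels[0][0]
    selected = [i for i in data[top_var] if i in wanted]
    result = {top_var: selected}
    for id_var, parent_var in levels[1:]:
        keep = set(selected)
        # Keep only children whose parent survived the previous level.
        selected = [
            cid for cid, pid in zip(data[id_var], data[parent_var]) if pid in keep
        ]
        result[id_var] = selected
    return result


subset = descend(data, levels, top_ids=[1])
print(subset)
```

With standardized attributes naming the ID and parent-ID variables, the `levels` list could in principle be derived from the file itself instead of being written by hand.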
We foresee application to lightning datasets, thunderstorm cell tracking, and weather and climate model validation, among others as detailed in the complete proposal linked below.
Status Quo
Our proposal is related to some elements of the discrete sampling geometry standards, but extends those ideas to generalized one-to-many foreign key-type data models.
Associated pull request
None yet. Please see some proposed uses of CF-tree, including draft format specifications, as linked in the detailed proposal.
Detailed Proposal
Please see the Google Doc we prepared as part of our development of this proposal.