Incorporating the CFA convention for aggregated datasets into CF #508

Open
davidhassell opened this issue Feb 7, 2024 · 3 comments
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@davidhassell

Incorporating the CFA convention for aggregated datasets into CF

Moderator

To be decided

Moderator Status Review [last updated: 2024-02-07]

  • New issue created on 2024-02-07

Requirement Summary

This is a proposal to incorporate the CFA conventions into CF.

CFA (Climate and Forecast Aggregation) is a convention for recording aggregations of data, without copying their data.

The CFA conventions were discussed at the 2021 and 2023 annual CF workshops, the latter discussion resulting in an agreement to propose their incorporation into CF.

By an “aggregation” we mean a single dataset which has been formed by combining several datasets stored in any number of files. In the CFA convention, an aggregation is recorded by variables with a special function, called “aggregation variables”, in a single netCDF file called an “aggregation file”. The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array. An aggregation variable will almost always take up a negligible amount of disk space compared with the space taken up by the data that belongs to it, because each constituent piece, called a “fragment”, of the aggregated data array is represented solely by file and netCDF variable names and a few indices that describe where its data should be placed relative to the other fragments (see examples 1 and 2).

Example 1: For a timeseries of surface air temperature from 1861 to 2100 that is archived across 24 files, each spanning 10 years, it is useful to view this as if it were a single netCDF dataset spanning 240 years.
[Figure: CFA_1]
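As a rough illustration of Example 1, the sketch below lists the 24 decade-long fragments and checks that together they cover 240 years. The `Fragment` record, the file-name pattern and the variable name `tas` are hypothetical, chosen only for illustration; none of them is defined by the CFA conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One piece of the aggregated time series (hypothetical record)."""
    filename: str    # hypothetical file name, for illustration only
    variable: str    # netCDF variable name inside the fragment file (assumed)
    first_year: int
    last_year: int   # inclusive


def decade_fragments(start=1861, end=2100, years_per_file=10):
    """Return one Fragment per file covering start..end inclusive."""
    fragments = []
    for first in range(start, end + 1, years_per_file):
        last = first + years_per_file - 1
        fragments.append(Fragment(f"tas_{first}-{last}.nc", "tas", first, last))
    return fragments


fragments = decade_fragments()
total_years = sum(f.last_year - f.first_year + 1 for f in fragments)
print(len(fragments), total_years)
```

Each record holds only a file name, a variable name and a position in time, which is why an aggregation variable can stay small however large the underlying data are.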

CFA has been developed since 2012 and is now a stable and versioned convention that has been fully implemented by cf-python for both aggregation file creation and reading.

Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array. That is a software implementation decision. For instance, cf-python has an algorithm for this purpose. (We think that the cf-python aggregation rules are complete and consistent because they are entirely based on the CF data model.)

Storing aggregations of existing datasets is useful for data analysis and archive curation. Data analysis benefits from being able to view an aggregation as a single entity and from avoiding the computational expense of creating aggregations on-the-fly; and aggregation files can act as metadata-rich archive indices that consume a very small amount of disk space.

The CFA conventions only affect the representation of a variable’s data, and thus they work alongside all CF metadata, i.e. the CFA conventions do not duplicate, extend, or redefine any of the metadata elements defined by the CF conventions.

An aggregation file may, and often will, contain both aggregation variables and normal CF-netCDF variables, i.e. those with data arrays. All kinds of CF-netCDF variables (e.g. data variables, coordinate variables, cell measures) can be aggregated using the CFA conventions. For instance, an aggregated data variable (whose actual data are in other files) may have normal CF-netCDF coordinate variables (whose data are in the aggregation file).

Another approach to file aggregation without copying data is NcML Aggregation, which has been extensively used. CFA is similar in intent to NcML but is more general and efficient, because it

  • keeps the CF metadata in the same place as the aggregation instructions;
  • allows aggregations over any number of dimensions in any array positions;
  • places no restrictions on netCDF elements that are not standardised by CF (such as variable names);
  • uses the binary netCDF format to speed up read times for large aggregations.

Technical Proposal Summary

The CFA conventions currently have their own document (https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md) which describes in detail how to create and interpret an "aggregation variable", i.e. a netCDF variable that does not contain a data array but instead has attributes that contain instructions on how to assemble the data array as an aggregation of data from other sources.

A Pull Request to incorporate CFA into CF has not been created yet. Before starting any work on translating the content of the CFA document into the CF conventions document, it is important to get the community’s consensus that this is a good idea, and about how the new content should be structured (e.g. a new section, a new appendix, both, or something else).

The main features of CFA are summarised in example 2, a CDL view of an aggregation of two 6-month datasets into a single 12-month variable (see the CFA document for details).

Example 2: An aggregation data variable whose aggregated data comprises two fragments. Each fragment spans half of the aggregated time dimension and the whole of the other three aggregated dimensions, and is stored in an external netCDF file in a variable called temp. The fragment URIs define the file locations. Both fragment files have the same format, so the format variable can be stored as a scalar variable.
[Figure: CFA_diagram_CF]
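To make the layout in Example 2 concrete, here is a minimal sketch of how the position of each fragment in the aggregated array follows from the fragment sizes along each dimension. It is not the CFA encoding itself (see the CFA document for that). The function name `fragment_slices` is hypothetical, and the sizes of the level, latitude and longitude dimensions (1, 73, 144) are assumed values; only the two 6-month halves of the time dimension come from the example.

```python
from itertools import accumulate, product


def fragment_slices(fragment_sizes):
    """Map each fragment's position to its slices in the aggregated array.

    fragment_sizes: one list per aggregated dimension, giving the size of
    each fragment along that dimension, in order.
    """
    per_dim = []
    for sizes in fragment_sizes:
        ends = list(accumulate(sizes))   # exclusive end index of each fragment
        starts = [0] + ends[:-1]         # start index of each fragment
        per_dim.append([slice(s, e) for s, e in zip(starts, ends)])

    index_ranges = [range(len(slices)) for slices in per_dim]
    return {
        index: tuple(per_dim[d][i] for d, i in enumerate(index))
        for index in product(*index_ranges)
    }


# Two 6-month fragments along time; one fragment along each other dimension.
# The level, latitude and longitude sizes (1, 73, 144) are assumed values.
sizes = [[6, 6], [1], [73], [144]]
aggregated_shape = tuple(sum(s) for s in sizes)
locations = fragment_slices(sizes)

for index in sorted(locations):
    print(index, locations[index])
```

The first fragment lands at time indices 0 to 5 and the second at 6 to 11, each covering the full extent of the other three dimensions.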

Benefits

Aggregations persisted to disk allow users and software libraries to access pre-created aggregations with no complicated and time-consuming processing.

Status Quo

Not being able to persist fully generalised aggregations to disk means that every user/software library has to be able to create their own aggregations every time the data files are accessed. This is a complicated and time-consuming task.

Associated pull request

None yet (see above).

CFA authors

CFA has been developed by David Hassell, Jonathan Gregory, Neil Massey, Bryan Lawrence, and Sadie Bartholomew.

Contributors to CFA discussions at the CF workshops

Chris Barker, Ethan Davies, Roland Schweitzer, Karl Taylor, Charlie Zender, and Klaus Zimmermann (please let us know if we have accidentally missed you off this list).

@larsbarring

David,
This will be an excellent and very useful addition to the CF Conventions! I have not yet wrapped my head around the technical details. There is one thing I do not quite understand. First you write:

The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array

And below Figure 1 you write:

Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array.

Probably I am missing something here, but to me this seems contradictory? Anyway, that is a detail, and I think the more important questions are the ones you raise in the Technical Proposal Summary:

... incorporate CFA into CF ... that this is a good idea, ...

To me this is no doubt a good idea, which already has a strong community backing.

... how the new content should be structured (e.g. a new section, a new appendix, both, or something else).

Perhaps an outline somewhere in the main text: end of Chapter 2 regarding aggregation files and their relation to the fragment files, somewhere in Chapter 3 regarding aggregation variables? And then an exhaustive description in an Appendix?

This brings me to a more general thought that I have been mulling over for some time:
I think that the CF Conventions document is getting increasingly long and complex/difficult to get an overview of. The Table of Contents takes 8 full screens (5 pdf pages), then 5 screens of Tables of tables/figures/examples (3 pdf pages). I have no idea how to improve upon this, but it becomes more and more of a concern as we add new features to the Conventions. However, this is not something to discuss and solve here in this enhancement proposal, but I wanted to bring it up here anyway.

@davidhassell

davidhassell commented Apr 9, 2024

Thank you for your comments, Lars, and sorry that it has taken me some time to respond.

Even though you are the only person to have commented here (and in support), this proposal has been scrutinised carefully at two CF workshops, with a group decision being reached in 2023 to work towards incorporating CFA into CF. I'm therefore minded to move to writing the PR, now that Lars has made a good suggestion of how and where the content could go into the existing CF conventions. This shouldn't take too long, because it will largely be a "cut and paste" job from the existing CFA description, which was deliberately written in a CF-ish style in anticipation of this :).

The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array
...
Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array.

Good point. The first statement applies to the reading of the data, and the second to the writing of the data. The CFA conventions do not give any guidance on deciding how fragment files can be combined prior to creating an aggregation variable. Rather, once you have an aggregation in mind, they provide a framework in which you can encode it in such a way that other people can decode it.

If I give you two datasets (A and B) then the CFA conventions won't give you any help in working out if A and B can be sensibly combined into a single larger dataset (C). There are various ways in which you could work this out yourself: you could inspect the metadata, either by applying an aggregation algorithm (e.g. this one) or by visual inspection, or base it on file names (e.g. I know that model outputs from March.nc and April.nc are safe to combine into a 2-month dataset), etc.
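As a deliberately naive sketch of what such a "can A and B be combined?" check might look like, the example below compares a few metadata items held in plain dictionaries. This is not the cf-python algorithm, which considers far more of the CF data model. The function name, the dictionary keys and the sample values are all hypothetical.

```python
def can_concatenate_in_time(a, b):
    """Naive check that dataset b can follow dataset a along time.

    Each dataset is a plain dict with hypothetical keys: 'standard_name',
    'units', 'shape_without_time' and 'months' (a list of month numbers).
    """
    same_quantity = (
        a["standard_name"] == b["standard_name"] and a["units"] == b["units"]
    )
    same_grid = a["shape_without_time"] == b["shape_without_time"]
    # b must start in the month immediately after a ends
    contiguous = b["months"][0] == a["months"][-1] + 1
    return same_quantity and same_grid and contiguous


# Hypothetical metadata for three single-month datasets
march = {"standard_name": "air_temperature", "units": "K",
         "shape_without_time": (73, 144), "months": [3]}
april = dict(march, months=[4])
june = dict(march, months=[6])

print(can_concatenate_in_time(march, april))
print(can_concatenate_in_time(march, june))
```

Whatever method is used to reach the decision, CFA only comes into play afterwards, to record the chosen aggregation.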

Perhaps an outline somewhere in the main text: end of Chapter 2 regarding aggregation files and their relation to the fragment files, somewhere in Chapter 3 regarding aggregation variables? And then an exhaustive description in an Appendix?

I like the idea of a Chapter 2 outline. I might suggest content from Introduction, Terminology, Aggregation variables, and Aggregation instructions (without its subsections) for Chapter 2, and everything else - which is most of the existing CFA document - (Standardized aggregation instructions, Non-standardized terms, Fragment Storage and examples) for the appendix.

The Table of Contents takes 8 full screens (5 pdf pages), then 5 screens of Tables of tables/figures/examples (3 pdf pages).

Just a thought - the TOC currently shows all subsections - maybe it could be restricted to just one level of subsection, so for instance Chapter 7 would go from

[7. Data Representative of Cells](https://cfconventions.org/cf-conventions/cf-conventions.html#_data_representative_of_cells)
    [7.1. Cell Boundaries](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries)
    [7.2. Cell Measures](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-measures)
    [7.3. Cell Methods](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-methods)
        [7.3.1. Statistics for more than one axis](https://cfconventions.org/cf-conventions/cf-conventions.html#statistics-more-than-one-axis)
        [7.3.2. Recording the spacing of the original data and other information](https://cfconventions.org/cf-conventions/cf-conventions.html#recording-spacing-original-data)
        [7.3.3. Statistics applying to portions of cells](https://cfconventions.org/cf-conventions/cf-conventions.html#statistics-applying-portions)
        [7.3.4. Cell methods when there are no coordinates](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-methods-no-coordinates)
    [7.4. Climatological Statistics](https://cfconventions.org/cf-conventions/cf-conventions.html#climatological-statistics)
    [7.5. Geometries](https://cfconventions.org/cf-conventions/cf-conventions.html#geometries)

to

[7. Data Representative of Cells](https://cfconventions.org/cf-conventions/cf-conventions.html#_data_representative_of_cells)
    [7.1. Cell Boundaries](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries)
    [7.2. Cell Measures](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-measures)
    [7.3. Cell Methods](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-methods)
    [7.4. Climatological Statistics](https://cfconventions.org/cf-conventions/cf-conventions.html#climatological-statistics)
    [7.5. Geometries](https://cfconventions.org/cf-conventions/cf-conventions.html#geometries)

That alone would remove 71 lines from the TOC! But as you say, any more on that should be discussed elsewhere, which I would welcome.
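For illustration only, the Chapter 7 reduction above amounts to dropping entries whose section number has more than two parts. The sketch below does this on a plain list of titles. The function name `limit_toc_depth` is hypothetical, and in practice the depth would more likely be set in the document build rather than by a script like this.

```python
def limit_toc_depth(entries, max_depth=2):
    """Keep TOC entries whose section number has at most max_depth parts."""
    kept = []
    for entry in entries:
        # "7.3.1. Title" -> "7.3.1" -> three parts
        number = entry.split(" ", 1)[0].rstrip(".")
        if len(number.split(".")) <= max_depth:
            kept.append(entry)
    return kept


chapter_7 = [
    "7. Data Representative of Cells",
    "7.1. Cell Boundaries",
    "7.2. Cell Measures",
    "7.3. Cell Methods",
    "7.3.1. Statistics for more than one axis",
    "7.3.2. Recording the spacing of the original data and other information",
    "7.3.3. Statistics applying to portions of cells",
    "7.3.4. Cell methods when there are no coordinates",
    "7.4. Climatological Statistics",
    "7.5. Geometries",
]

shortened = limit_toc_depth(chapter_7)
print(len(chapter_7) - len(shortened))
```

For Chapter 7 this removes the four 7.3.x entries and leaves the six higher-level ones.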

@taylor13

taylor13 commented Apr 9, 2024

I think this is generally a good idea and have been meaning to go over the details.

A quick thought about the table of contents: Would it be easy in the web view to collapse the subsection hierarchy to 1 or 2 levels, then click on an upper level to display its subsections? That might give a newbie a more accessible overview. On the other hand, I usually just execute "find" for some key word I know is relevant to what I want to look up, and if that word becomes hidden (in a hidden low level subsection), then I may have a harder time navigating quickly to the relevant section. So I can see arguments for the current expanded table of contents.
