Clarify the use of compressed dimensions in related variables #147

Closed
davidhassell opened this issue Oct 10, 2018 · 25 comments · Fixed by #466
Labels
defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors

Comments

@davidhassell
Contributor

The problem

When a data variable has dimensions that have been compressed by gathering (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#compression-by-gathering), the conventions currently make no statement on whether or not any related auxiliary coordinate, cell measure or ancillary variable that spans those dimensions must also be compressed, or whether such attached variables may be compressed across a different list of dimensions.

The situation is, however, clarified in the conformance document (http://cfconventions.org/Data/cf-documents/requirements-recommendations/requirements-recommendations-1.7.html), which states that attached variables must span a subset of the actual dimensions spanned by the data variable (sections 5 and 7.2). There are exceptions to this rule for auxiliary coordinate variables, for character arrays and DSG ragged arrays, but they do not apply to compression by gathering.

Proposed solution

I propose that the conformance document is correct, and we would be fixing a defect in the conventions to clarify this in the main text.

Backwards compatibility

I see no problems, because if there are existing datasets that apply compression differently to related variables than to the data variable, then these datasets would already fail the tests from the conformance document.

Changes required

A. Add a third paragraph to section 8.2

Any auxiliary coordinate, cell measure or ancillary variable related
to a compressed data variable must not be compressed unless it can be
compressed using the same list variable indices as used by the data
variable, in which case compression must be applied. This will occur
for a related variable for which, when uncompressed, the individual
compressed axes appear adjacently and in the same order as the data
variable.

B. Update Example 8.1 to include a 2-d auxiliary coordinate variable:

Example 8.1. Horizontal compression of a three-dimensional array

dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth,landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
    landsoilt:coordinates="soil_type";
  float depth(depth);
  float lat(lat);
  float lon(lon);
  int soil_type(landpoint);
    soil_type:long_name="integer code defining the soil type";
data:
  landpoint=363, 364, 365, ...;
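
As a rough illustration of what the list variable in Example 8.1 encodes, the sketch below converts landpoint values into zero-based (lat, lon) index pairs. It assumes the zero-based, last-dimension-varies-fastest indexing described in section 8.2; the function name `unravel_index` is hypothetical and is not part of any CF library.

```python
def unravel_index(flat_index, shape):
    """Convert a zero-based flat index into a tuple of indices for `shape`,
    with the last dimension varying fastest (C order)."""
    indices = []
    for size in reversed(shape):
        flat_index, remainder = divmod(flat_index, size)
        indices.append(remainder)
    if flat_index != 0:
        raise ValueError("flat index out of range for shape")
    return tuple(reversed(indices))


# Sizes of the compressed dimensions, in the order given by compress="lat lon"
shape = (73, 96)
landpoint = [363, 364, 365]  # the first few list values from the example
lat_lon_indices = [unravel_index(i, shape) for i in landpoint]
```

Under these assumptions, the first list value 363 corresponds to lat index 3 and lon index 75, since 363 = 3*96 + 75.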
@davidhassell davidhassell added the defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors label Oct 10, 2018
@davidhassell
Contributor Author

Sorry - this is more complicated than I first thought! The clarification needs to be redesigned, which I shall do shortly.

The issue I've realised is that we need to allow auxiliary coordinates and other related variables to span a subset of the compressed axes, but not be compressed themselves. In the following example, this would be the case for a 1-d auxiliary coordinate aux_var that spans the lat dimension:

dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth,landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
    landsoilt:coordinates="soil_type aux_var";
  float depth(depth);
  float lat(lat);
  float lon(lon);
  int soil_type(landpoint);
    soil_type:long_name="integer code defining the soil type";
  float aux_var(lat);

If this is allowed, then it would be the conformance document that is wrong, rather than the conventions document, as we are allowing the related variable to span a dimension (lat) that is not explicitly spanned by the data variable (which spans depth and landpoint).
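
To make the indirection concrete, here is a minimal sketch that expands a data variable's stored dimensions by replacing each list dimension with the dimensions named in its compress attribute. The helper `uncompressed_dims` and the dictionary layout are assumptions made purely for illustration.

```python
def uncompressed_dims(data_dims, compress):
    """Return data_dims with each list dimension replaced by the
    dimensions named in its `compress` attribute.

    `compress` maps a list dimension name to the tuple of dimensions
    that it gathers.
    """
    expanded = []
    for dim in data_dims:
        expanded.extend(compress.get(dim, (dim,)))
    return tuple(expanded)


compress = {"landpoint": ("lat", "lon")}  # landpoint:compress="lat lon"
landsoilt_dims = ("depth", "landpoint")
aux_var_dims = ("lat",)

real_dims = uncompressed_dims(landsoilt_dims, compress)

# aux_var does not span a subset of the stored dimensions of landsoilt,
# but it does span a subset of the uncompressed ones.
spans_stored = set(aux_var_dims) <= set(landsoilt_dims)
spans_uncompressed = set(aux_var_dims) <= set(real_dims)
```

Here `spans_stored` is false and `spans_uncompressed` is true, which is exactly the case that the current conformance wording does not cover.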

@JonathanGregory
Contributor

Dear David

It's good to see the use of the Defect label. I agree, this inconsistency is a defect. Is it the case that any variable which has all the compressed dimensions in the same order as the data variable must be compressed?

Cheers

Jonathan

@davidhassell
Contributor Author

Well, I've realised that this is a lot simpler than I thought. There is, I now believe, no problem with the conventions - they don't disallow anything you might want to do, nor allow anything that you shouldn't do.

All that is needed, I think, is to amend the conformance document to allow for the fact that compression by gathering may result in some indirection when finding the "real" dimensions of auxiliary coordinate variables.

The existing text reads:

The dimensions of each auxiliary coordinate must be a subset of the dimensions of the variable they are attached to, with two exceptions. First, a label variable which will have a trailing dimension for the maximum string length. Second a ragged array (Chapter 9, Discrete sampling geometries and Appendix H) uses special, more indirect, methods to connect the data and coordinates.

I propose adding a third exception (new text in bold):

The dimensions of each auxiliary coordinate must be a subset of the dimensions of the variable they are attached to, with three exceptions. First, a label variable which will have a trailing dimension for the maximum string length. Second, a ragged array (Chapter 9, Discrete sampling geometries and Appendix H) uses special, more indirect, methods to connect the data and coordinates. Third, a data variable that has been compressed by gathering (chapter 8.2. Compression by Gathering) may also use special, more indirect, methods to connect the data and coordinates.

@davidhassell
Contributor Author

Is it the case that any variable which has all the compressed dimensions in the same order as the data variable must be compressed?

I now don't think that this needs to be proscribed, as we might want to share auxiliary coordinates between compressed and non-compressed data variables, e.g.

dimensions:
  lat=73;
  lon=96;
  landpoint=2381;
  depth=4;
variables:
  int landpoint(landpoint);
    landpoint:compress="lat lon";
  float landsoilt(depth, landpoint);
    landsoilt:long_name="soil temperature";
    landsoilt:units="K";
    landsoilt:coordinates="aux1";
  float tas(lat, lon);
    tas:long_name="air temperature";
    tas:units="K";
    tas:coordinates="aux1";
  float landsoilq(depth, lat, lon);
    landsoilq:long_name="soil moisture";
    landsoilq:units="1";
    landsoilq:coordinates="aux1";
  float depth(depth);
  float lat(lat);
  float lon(lon);
  int aux1(landpoint);

@JonathanGregory
Contributor

Dear David

I'm sorry to say I'm confused now about what you're proposing. Also, by "proscribed" do you mean "prescribed", which is the opposite?

It seems to me that it's simplest not to add a third exception after all. That means auxiliary coordinates which use the compressed dimension must themselves be compressed, and hence cannot be shared with data variables where those dimensions are not compressed. Or, even simpler, state that only data variables can be compressed.

Best wishes

Jonathan

@davidhassell
Contributor Author

Dear Jonathan,

I did indeed mean prescribed - thanks.

My motivation for this comes from trying to write generic library code to read/write CF-netCDF datasets in a wholly CF-compliant fashion. As you can tell, I have never been clear in my mind what the problem I was trying to solve actually was, for which apologies.

Your first suggestion (that there is in fact no defect after all) is fine by me, so I am tempted to close and withdraw the issue, pending further opinions.

All the best,
David

@JonathanGregory
Contributor

Dear David

I think you were quite right to point out there's an inconsistency, or at least a lack of clarity. Something should be said in the section on compression by gathering about auxiliary coordinate variables, cell measures and other things which might share the compressed dimensions of the data variables, especially aux coord vars, because of the explicit rule about their dimensions having to be a subset of the data variable's. I think that compression should either be required, or prohibited, in the latter case requiring an exception to that rule. Maybe I'm confused now?

Best wishes and thanks

Jonathan

@davidhassell
Contributor Author

Thanks, Jonathan.

It would be very useful to hear from people who create or use datasets that employ compression by gathering. The main question arising from the previous discussion here is essentially:

  • Should compression on related variables - such as auxiliary coordinates - be
    • required, or
    • optional, or
    • prohibited?

If required, then all that is required is some conventions text to support what is already in the conformance document.

If optional or prohibited then the conformance document will also need to be changed to allow for related variables to span dimensions that are not spanned by the parent variable (à la DSG auxiliary coordinate variables) when compression is in play.

If required or prohibited then it will not be possible to share related variables between uncompressed and compressed data variables.

My slight preference is for required because it puts compression by gathering on the same footing as compression by DSG ragged array, i.e. controlled by the parent data variable, and requires the least intervention to the conventions. However, I have no problem with any of the other options if the consensus moves their way.

Thanks, David

@JonathanGregory
Contributor

Dear all

@davidhassell and I agreed in this issue (approaching two years ago) that the convention is defective in section 8.2 on "Lossless compression by gathering" in not saying what should happen to other variables which span the compressed dimensions. He and I preferred that this defect should be remedied by requiring them to be compressed as well, because that is consistent with what the conformance document implies. No-one else expressed a view.

On reflection, I now think that we can't require them to be compressed, because that only works if all the affected dimensions are present and in the same order as in the data variable. Hence, I'm now inclined to say that they should not be compressed, meaning that we need exceptions in the conformance document.

What do you think now, David?

Jonathan

@taylor13

taylor13 commented Sep 1, 2022

Just to add half a pence to the discussion ... it seems like there is indeed a need to allow flexibility in compressing or not variables associated with a compressed variable (auxiliary coordinates, cell measures, and ???). So I support Jonathan's last comment. Nevertheless, I have little confidence that I've grasped all the nuances here, so we should wait for a more authoritative opinion from @davidhassell or others.

@davidhassell
Contributor Author

Hello. Thanks to Jonathan and Karl for resurrecting this. It was amusing to read about myself going through multiple U-turns.

I agree with Jonathan and Karl that all that we need are exceptions in the conformance document. Does the text I suggested at the end of #147 (comment) still make sense? Reprinted here for convenience (new text in italics):

The dimensions of each auxiliary coordinate must be a subset of the dimensions of the variable they are attached to, with three exceptions. First, a label variable which will have a trailing dimension for the maximum string length. Second, a ragged array (Chapter 9, Discrete sampling geometries and Appendix H) uses special, more indirect, methods to connect the data and coordinates. Third, a data variable that has been compressed by gathering (chapter 8.2. Compression by Gathering) may also use special, more indirect, methods to connect the data and coordinates.

@taylor13

taylor13 commented Sep 1, 2022

I find the last sentence in italics too vague. I think we ought to clarify somewhere (probably in the conventions document itself) that an auxiliary coordinate or a cell_measures variable associated with a variable with compressed coordinates may either have its relevant coordinates compressed as in the variable itself, or they could remain uncompressed. [If I've correctly understood what we've agreed on.]

I suppose there is also a question about whether a variable with uncompressed coordinates can associate itself with an auxiliary coordinate and/or cell_measures that have their coordinates compressed. There may be practical reasons to forbid this, but I can't think of any. What do you think?

@JonathanGregory
Contributor

Dear @davidhassell and Karl @taylor13

I agree with Karl that we should be explicit in the conformance document; we have to say precisely what the checker should examine to detect an error. Also I think we should say explicitly in the standard document either that these variables should not be compressed, or that the data-writer can decide. I favour the first option, which is better-defined. David, if you've already implemented this in software, you probably have an informed opinion.

Best wishes

Jonathan

@davidhassell
Contributor Author

Hi @taylor13 and @JonathanGregory,

I think we ought to clarify somewhere (probably in the conventions document itself) that an auxiliary coordinate or a cell_measures variable associated with a variable with compressed dimensions may either have its relevant coordinates compressed as in the variable itself, or they could remain uncompressed. [If I've correctly understood what we've agreed on.]

Yes.

I suppose there is also a question about whether a variable with uncompressed coordinates can associate itself with an auxiliary coordinate and/or cell_measures that have their coordinates compressed. There may be practical reasons to forbid this, but I can't think of any. What do you think?

I think that this should be forbidden. It seems too much to match up the auxiliary coordinate's dimensions to those of its parent data variable via a coordinate variable that doesn't otherwise apply to the data variable.

I agree that some text in the main conventions is needed. In the meantime, the conformance text could cover these cases with:

The dimensions of each auxiliary coordinate must be a subset of the dimensions of the variable they are attached to, with three exceptions. First, a label variable which will have a trailing dimension for the maximum string length. Second, a ragged array (Chapter 9, Discrete sampling geometries and Appendix H) uses special, more indirect, methods to connect the data and coordinates. Third, for a data variable that has been compressed by gathering (chapter 8.2. Compression by Gathering), if an auxiliary coordinate does not span the compressed dimension then its dimensions may be any subset of the data variable's uncompressed dimensions.

If we agree on this scenario, then I'll create a PR that updates the conformance document for auxiliary coordinate and cell measure variables, and try to put some text into the main document.

What about ancillary variables?

@JonathanGregory
Contributor

Dear David

At the moment, the convention says, "If any auxiliary coordinate variable has all the dimensions to be compressed, adjacent and in the same order as in the data variable, and if the auxiliary coordinate variable has missing data at all the points which are to be eliminated from the data variable, then the affected dimensions can optionally be replaced by the list dimension for the auxiliary coordinate variable just as for the data variable." It's not possible to compress an aux coord var if it does not have all of the dimensions to be compressed, or if they aren't the same order as in the data variable. Given these exceptions, maybe it was a mistake to have allowed this possibility! But there it is; I don't think we should seem to invalidate data which has already been written, according to the usual principle.

Going back to the original question, about lack of clarity, I think that the sentence I quoted should say "If any auxiliary coordinate variable, cell measures variable or ancillary variable has all ...". If it has only some of those dimensions, or they're in the wrong order, or there isn't missing data at all the affected points, or the data-writer simply prefers not to compress them, then the variable will have dimensions that aren't dimensions of the data variable. So we do need an exception.
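
The dimension part of that condition can be sketched as a simple check for a contiguous, in-order run of the compressed dimensions. This is only an illustration: the function name `has_adjacent_in_order` is hypothetical, and the additional requirement that the variable has missing data at every eliminated point is not checked here.

```python
def has_adjacent_in_order(var_dims, compressed_dims):
    """Return True if compressed_dims occurs in var_dims as a contiguous
    run, in the same order."""
    n = len(compressed_dims)
    target = tuple(compressed_dims)
    return any(
        tuple(var_dims[i:i + n]) == target
        for i in range(len(var_dims) - n + 1)
    )


compressed = ("lat", "lon")  # from compress="lat lon"

all_present = has_adjacent_in_order(("time", "lat", "lon"), compressed)
wrong_order = has_adjacent_in_order(("lon", "lat"), compressed)
only_some = has_adjacent_in_order(("lat",), compressed)
not_adjacent = has_adjacent_in_order(("lat", "depth", "lon"), compressed)
```

Only the first case could have its lat and lon dimensions replaced by the list dimension; the other three would have to stay uncompressed and therefore need the exception.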

Given all this, I think you're right about the exception, but we could be more explicit! I didn't understand the subtlety of what you wrote until I'd done all the above thinking! I would say, "Third, if an auxiliary coordinate variable of a data variable that has been compressed by gathering (chapter 8.2. Compression by Gathering) does not span the compressed dimension, then its dimensions may be any subset of the data variable's uncompressed dimensions, i.e. any of the dimensions of the data variable except the compressed dimension, and any of the dimensions listed by the compress attribute of the compressed coordinate variable."
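
One way a checker might apply this wording is sketched below. It is an assumption-laden illustration rather than the conformance document's actual test: the function `aux_dims_conform` and its arguments are hypothetical, and the label-variable and ragged-array exceptions are ignored.

```python
def aux_dims_conform(aux_dims, data_dims, compress):
    """Check an auxiliary coordinate's dimensions against the subset rule,
    including the proposed exception for compression by gathering.

    `compress` maps each list (compressed) dimension of the data variable
    to the tuple of dimensions named in its `compress` attribute.
    """
    aux = set(aux_dims)
    allowed = set()
    for dim in data_dims:
        if dim in compress and dim not in aux:
            # The compressed dimension is not spanned, so the dimensions
            # that it gathers may be used in its place.
            allowed.update(compress[dim])
        else:
            allowed.add(dim)
    return aux <= allowed


data_dims = ("depth", "landpoint")
compress = {"landpoint": ("lat", "lon")}

compressed_aux = aux_dims_conform(("landpoint",), data_dims, compress)
uncompressed_aux = aux_dims_conform(("lat", "lon"), data_dims, compress)
partial_aux = aux_dims_conform(("lat",), data_dims, compress)
mixed_aux = aux_dims_conform(("landpoint", "lat"), data_dims, compress)
```

In this sketch the first three cases pass, while `mixed_aux` fails because the variable spans both the compressed dimension and one of its uncompressed counterparts.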

I presume we may need similar exceptions for cell measures and ancillary variables. I haven't looked at the conformance doc.

Cheers

Jonathan

@davidhassell
Contributor Author

Dear Jonathan,

Thanks for making things clearer - for me as well! I think it's time to get a PR together ...

@davidhassell
Contributor Author

Hang on! I see that PR #431 already put some text in relating to this: e821eab:

"Second if the data variable to which the auxiliary coordinate variable is attached has a dimension whose coordinate variable has a compress attribute, the auxiliary coordinate variable may have any of the dimensions".

This is logically similar to our proposed text but not quite the same, as it doesn't preclude the auxiliary coordinate variable from spanning both the compressed dimension and one of its uncompressed counterparts.

Is there any objection to replacing this text with what has been discussed here?

Thanks,
David

@JonathanGregory
Contributor

Thanks, @davidhassell. I didn't remember about that. I agree that our new text is more accurate. What do you think, @fmanzano-pde?

@fmanzano-pde
Contributor

Dear @JonathanGregory, @davidhassell, @taylor13,

there's no objection from my side, as my concern is the compression of a single dimension, a case that is already covered by the new definition. As you know, we're focused on DSGs and we only use compression for including the deployment positions, as in Example H.5 (A single timeseries with time-varying deviations from a nominal point spatial location).

By the way, to be honest, I found it quite difficult to follow the discussion. It is true that I have arrived late, but I think it would be advisable to add usage examples that cover all the cases, to be sure that everything is coherent.

All the best,
Fer

@davidhassell
Contributor Author

Dear @fmanzano-pde,

Thanks - I'll press on with my PR on this, then.

You're not alone in finding the discussion confusing, as can be seen by the number of times I changed my mind on what I thought was required! I think some CDL examples are an excellent idea, and I shall brew some up.

David

@davidhassell
Contributor Author

Hello,

PR #466 implements the changes agreed here. It would be great if these could be reviewed in the next couple of weeks, in time for CF-1.11.

Many thanks to all who have taken part in this discussion, both years ago and more recently!

David

@JonathanGregory
Contributor

Thanks, David. I think it is good. I have marked a couple of typos in the PR.

@davidhassell davidhassell linked a pull request Nov 2, 2023 that will close this issue
@davidhassell
Contributor Author

Thanks, Jonathan. Typos fixed.

@davidhassell
Contributor Author

Three weeks have passed with no further comment, so I think we can merge PR #466 now, in time for CF-1.11. Would someone like to do that?

Many thanks to everyone who took part in the discussion,

David

@JonathanGregory
Contributor

JonathanGregory commented Nov 21, 2023

I will merge it with pleasure, having just resolved a trivial and inevitable conflict in history.adoc. Thanks, @davidhassell et al.
