
Lossy Compression by Coordinate Sampling #327

Closed
AndersMS opened this issue May 5, 2021 · 79 comments · Fixed by #326
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format
Comments

@AndersMS
Contributor

AndersMS commented May 5, 2021

Title

Lossy Compression by Coordinate Sampling

Moderator

@JonathanGregory

Moderator Status Review [last updated: YYYY-MM-DD]

Brief comment on current status, update periodically

Requirement Summary

The spatiotemporal, spectral, and thematic resolutions of Earth science data are increasing rapidly. This presents a challenge for all types of Earth science data, whether derived from models, in-situ measurements, or remote sensing observations.

In particular, when coordinate information varies with time, the domain definition can be many times larger than the (potentially already very large) data which it describes. This is often the case for remote sensing products, such as swath measurements from a polar-orbiting satellite (e.g. slide 4 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).

Such datasets are often prohibitively expensive to store, and so some form of compression is required. However, native compression, such as is available in the HDF5 library, does not generally provide enough of a saving, due to the nature of the values being compressed (e.g. few missing or repeated values).

An alternative form of compression-by-convention amounts to storing only a small subsample of the coordinate values, alongside an interpolation algorithm that describes how the subsample can be used to generate the original, unsampled set of coordinates. This form of compression has been shown to out-perform native compression by "orders of magnitude" (e.g. slide 6 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).

Various implementations following this broad methodology are currently in use (see cf-convention/discuss#37 (comment) for examples); however, the steps required to reconstitute the full-resolution coordinates are not necessarily well defined within a dataset.

This proposal offers a standardized approach covering the complete end-to-end process, including a detailed description of the required steps. At the same time it is a framework where new methods can be added or existing methods can be extended.

Unlike compression by gathering, this form of compression is lossy due to rounding and approximation errors in the required interpolation calculations. However, the loss in accuracy is a function of the degree to which the coordinates are subsampled and of the choice of interpolation algorithm (of which there are configurable standardized and non-standardized options), and so may be determined by the data creator to be within acceptable limits. For example, in one application with cell sizes of approximately 750 metres by 750 metres, interpolation of a stored subsample comprising every 16th value in each dimension was able to recreate the original coordinate values to a mean accuracy of ~1 metre. (Details of this test are available.)
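To make the idea concrete, here is a minimal numpy sketch of the principle (synthetic data and a plain 1-D linear scheme for illustration only; the actual CF encoding and interpolation methods are those defined in the proposal):

import numpy as np

# Synthetic 1-D coordinate array (values are illustrative only)
full = np.linspace(0.0, 1000.0, 1025) + 5.0 * np.sin(np.linspace(0.0, 20.0, 1025))

stride = 16
tie_indices = np.arange(0, full.size, stride)      # includes both end points (0 and 1024)
tie_points = full[tie_indices]                     # only these values need to be stored

# Reconstitute the full-resolution coordinates by linear interpolation
reconstituted = np.interp(np.arange(full.size), tie_indices, tie_points)

print(tie_points.size / full.size)                 # storage fraction, roughly 6% of the original
print(np.abs(reconstituted - full).max())          # maximum reconstitution error (lossy, but small)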

Whilst remote sensing applications are the motivating concern for this proposal, the approach presented has been designed to be fully general, and so can be applied to structured coordinates describing any domain, such as one describing model outputs.

Technical Proposal Summary

See PR #326 for details. In summary:

The approach and encoding are fully described in the new section 8.3 "Lossy Compression by Coordinate Sampling" to Chapter 8: Reduction of Dataset Size.

A new appendix J describes the standardized interpolation algorithms, and includes guidance for data creators.

Appendix A has been updated for a new data and domain variable attribute.

The conformance document has new checks for all of the new content.

The new "interpolation variable" has been included in the Terminology in Chapter 1.

The list of examples in toc-extra.adoc has been updated for the new examples in section 8.3.

Benefits

Anyone who has prohibitively large domain descriptions, and for whom absolute accuracy of cell locations is not an issue, may benefit.

Status Quo

The storage of large, structured domain descriptions is either prohibitively expensive or is handled in non-standardized ways.

Associated pull request

PR #326

Detailed Proposal

PR #326

Authors

This proposal has been put together by (in alphabetic order)

Aleksandar Jelenak
Anders Meier Soerensen
Daniel Lee
David Hassell
Lucile Gaultier
Sylvain Herlédan
Thomas Lavergne

@AndersMS AndersMS added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label May 5, 2021
@erget erget linked a pull request May 5, 2021 that will close this issue
@davidhassell
Contributor

Hello @taylor13, @AndersMS,

It might be better to continue the conversation from #37 on the precision of interpolation calculations (the comment thread starting at cf-convention/discuss#37 (comment)) here in this issue, as this is now the main place for discussing the PR containing the details of this proposal, of which this precision question is one part.

I hope that's alright, thanks,
David

@AndersMS
Contributor Author

AndersMS commented May 9, 2021

Hi @taylor13

Thank you very much for your comments. We did have a flaw or a weakness in the algorithm, which we have corrected following your comments.

To briefly explain: the method of the proposal stores coordinates at a set of tie points, from which the coordinates in the target domain may then be reconstituted by interpolation. The source of the problem was the computation of the distance squared between two such tie points. The distance will never be zero and could, for example, be of the order of a few kilometres. As the line between the two tie points forms a right triangle with two other lines of known length, the fastest way to compute the distance squared is to use Pythagoras's theorem. However, as the two other sides are both significantly longer than the one we wish to calculate, the result was very sensitive to rounding in 32-bit floating-point calculations and occasionally returned zero. We have now changed the algorithm to compute the distance squared as x*x + y*y + z*z, where (x, y, z) is the vector between the two tie points. This expression does not have the weakness explained above and has now been tested to work well.
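To make the rounding issue concrete, here is a minimal numpy sketch (the geometry and the numbers are illustrative assumptions, not the proposal's actual algorithm): the squared length of a ~1.5 km side of a right triangle is computed once as the difference of two much larger, nearly equal squares, and once as the sum of the squared components of the difference vector.

import numpy as np

b = np.float32(6371.0)                                   # km, a long side of the right triangle
d_true = 1.5                                             # km, the short side whose square we want
c = np.float32(np.sqrt(float(b) ** 2 + d_true ** 2))     # km, the hypotenuse

# Cancellation-prone form: difference of two large, nearly equal squares
d2_difference = c * c - b * b             # 0.0 in float32, although the true value is 2.25

# Robust form: sum of the squared components of the difference vector
p1 = np.array([b, 0.0, 0.0], dtype=np.float32)
p2 = np.array([b, d_true, 0.0], dtype=np.float32)
x, y, z = p2 - p1
d2_components = x * x + y * y + z * z     # 2.25, as expected

print(d2_difference, d2_components)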

In terms of how accurately the method reconstitutes the original coordinates, the change improved the results when the internal calculations are performed in 32-bit floating-point, although the errors are still a couple of times larger than when using 64-bit floating-point calculations.

I would therefore support the proposal put forward by @davidhassell. The proposal avoids setting a general rule, which, as you point out, may not cover all cases. It permits setting a requirement when needed to reconstitute data with the accuracy intended by the data creator.

Once again, thank you very much for your comments – further comments from your side on the proposal would be highly welcome!

Cheers
Anders

@davidhassell
Contributor

For convenience, here is the proposal for specifying the precision to be used for the interpolation calculations (slightly robustified):

  • By default, the user may use any precision they like for the interpolation calculatins, but if the interpolation_precision attribute has been set to a numerical value then the precision should match the precision of the given numerical value.
// Interpolation variable with NO 'interpolation_precision' attribute
// => the creator is saying "you can use whatever precision you like when uncompressing" 
char interp ;
    interp:interpolation_name = "bi_linear" ;
// Interpolation variable with 'interpolation_precision' attribute
// => the creator is saying "you must use the precision I have specified when uncompressing"
char interp ;
    interp:interpolation_name = "bi_linear" ;
    interp:interpolation_precision = 0D ;  // use double precision when uncompressing

Do you think that this might work, @taylor13?

Thanks,
David

@taylor13

Thanks @AndersMS for the care taken to address my concern, and thanks @davidhassell for the proposed revision. A few minor comments:

  1. I wonder if users could be given the freedom to do their interpolation at a higher precision than specified by interpolation_precision. I would hate to think that the interpolation method would be degraded by doing so. I suggest, therefore, replacing "the precision should match" with "the precision should match or exceed" or something similar. Also, a comma should follow the introductory clause, "if the interpolation_precision attribute has been set to a numerical value" and typo in "calculatins" should be corrected.
  2. I think we should avoid using a specified numerical value in a format that is language dependent. Rather, I would prefer following the IEEE 754 standard identifiers of precision. Also, I think "interpolation_precision" might imply something about the accuracy of the interpolation method. To avoid this confusion, perhaps the attribute could be named "computational_precision". I therefore propose the following alternative:
  • By default, the user may use any precision they like for the interpolation calculations, but if accuracy requires a certain precision, the calculational_precision attribute should be defined and set to one of the names defined by the IEEE 754 technical standard for floating point arithmetic (e.g., "decimal32", "decimal64", "decimal128"). If the calculational_precision attribute has been defined, all interpolation calculations should be executed at the specified precision (or higher).

In the example, then, "0D" would be replaced by "decimal64".

@davidhassell
Contributor

Hi @taylor13,

1: I agree that higher precisions should be allowed. A modified description (which could do with some rewording, but the intent is clear for now, I hope):

  • By default, the user may use any precision they like for the interpolation calculations. If the computational_precision attribute has been set, then the precision should match or exceed the precision specified by the computational_precision attribute. <some text about allowed values and their interpretation>.

2: computational_precision is indeed better. You mention "calculational_precision" in 3 - was that intentional? That term is also OK for me.

3: A controlled vocabulary is certainly clearer than my original proposal, both in terms of defining the concept and the encoding, and the IEEE standard does indeed provide what we need. I wonder if it might be good to define the subset of IEEE terms ourselves in a table (I'm reminded of https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#table-supported-units) rather than relying on the contents of the external standard, to avoid the potential governance issues we always have when standards outside of CF's influence are brought in. Would the "binary" terms be valid, as well as the "decimal" ones?

@taylor13

yes, calculational_precision was a mistake; I prefer computational_precision. Also I'd be happy with not referring to an external standard, and for now, just suggesting that two values, "decimal32" and "decimal64", are supported, unless someone thinks others are needed at this time.

@AndersMS
Contributor Author

Thank you @taylor13 for the proposals and @davidhassell for the implementation details.

I fully agree with your points 1, 2 and 3.

There is possibly one situation that might need attention. If the coordinates subject to compression are stored in decimal64, typically we would require the computations to be decimal64 too, rather than decimal32.

We could deal with that either by:

A. Using the scheme proposed above, requiring the data creator to set the computational_precision accordingly.
B. Requiring that the interpolation calculations are never carried out at a lower precision than that of the coordinates subject to compression, even if the computational_precision attribute is not set.

Probably A would be the cleanest, what do you think?

@davidhassell
Contributor

Thanks, @taylor13 and @AndersMS,

I, too, would favour A (Using the scheme proposed above, requiring the data creator to set the computational_precision accordingly.).

I'm starting to think that we need to be clear about "decimal64" (or 32, 128, etc.). I'm fairly sure that we only want to specify a precision, rather than also insisting/implying that the user should use decimal64 floating-point format numbers in their calculations. The same issue would arise with "binary64", although I suspect that most code would use double precision floating-point by default.

Could the answer be to define our own vocabulary of "16", "32", "64", and "128"?

Or am I over complicating things?

@taylor13

I don't understand the difference between decimal64 and binary64 or what they precisely mean. If these terms specify things beyond precision, it's probably not appropriate to use them here, so I would support defining our own vocabulary, which would not confuse precision with anything else.

And I too would favor (or favour) A over B.

@AndersMS
Contributor Author

Hi @taylor13 and @davidhassell,

I am not fully up to date on the data types, but following the links that David sent, it appears that decimal64 is a base-10 floating-point number representation that is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. I think we can disregard that for now.

binary32 and binary64 are the new official IEEE 754 names for what used to be called single- and double-precision floating-point numbers respectively, and are what most of us are familiar with.

I would suggest that we do not require a specific floating-point arithmetic standard to be used, but more a level of precision. If we adopt the naming convention proposed by David, it could look like:

By default, the user may use any floating-point arithmetic precision they like for the interpolation calculations. If the computational_precision attribute has been set, then the precision should match or exceed the precision specified by the computational_precision attribute.

The allowed values of the computational_precision attribute are:

(table)
"32": 32-bit base-2 floating-point arithmetic, such as IEEE 754 binary32 or equivalent
"64": 64-bit base-2 floating-point arithmetic, such as IEEE 754 binary64 or equivalent

I think that would achieve what we are after, while leaving the implementers the freedom to use what their programming language and computing platform offers.

What do you think?

@taylor13

looks good to me. Can we omit "base-2" from the descriptions, or is that essential? Might even reduce description to, for example:

"32": 32-bit floating-point arithmetic

@AndersMS
Contributor Author

Leaving out "base-2" is fine. Shortening the description further as you suggest would also be fine with me.

I am wondering if we could change the wording to:

"The floating-point arithmetic precision should match or exceed the precision specified by computational_precision attribute. The allowed values of computational_precision attribute are:

(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic

If the computational_precision attribute has not been set, then the default value "32" applies."

That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for them (a consequence of CPU design).

@erget
Member

erget commented May 17, 2021

@taylor13 by the way I'm still on the prowl for a moderator for this discussion. As I see you've taken an interest, would you be willing to take on that role? I'd be able to do it as well, but as I've been involved in this proposal for quite some time it would be nice to have a fresh set of eyes on it.

@davidhassell
Contributor

Hi Anders,

"The floating-point arithmetic precision should match or exceed the precision specified by computational_precision attribute. The allowed values of computational_precision attribute are:

(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic

This is good for me.

If the computational_precision attribute has not been set, then the default value "32" applies."

That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for them (a consequence of CPU design).

I'm not so sure about having a default value. In the absence of guidance from the creator, I'd probably prefer that the user is free to use whatever precision they would like.

Thanks, David

@AndersMS
Contributor Author

AndersMS commented Jun 7, 2021

Hi David,

Fine, I take your advice regarding not having a default value. That is probably also simpler - one rule less.

Anders

@davidhassell
Contributor

Hi Anders - thanks, it sounds like we're currently in agreement - do you want to update the PR?

@AndersMS
Contributor Author

AndersMS commented Jun 9, 2021

Hi David,

Yes, I would be happy to update the PR. However, I still have one concern regarding the computational_precision attribute.

In the introduction to Lossy Compression by Coordinate Sampling in chapter 8, I am planning to change the last sentence from

The creator of the compressed dataset can control the accuracy of the reconstituted coordinates through the degree of subsampling and the choice of interpolation method, see [Appendix J].

to

The creator of the compressed dataset can control the accuracy of the reconstituted coordinates through the degree of subsampling, the choice of interpolation method (see [Appendix J]) and the choice of computational precision (see section X).

where section X will be a new short section in chapter 8 describing the computational_precision attribute.

Recalling that we also write in the introduction to Lossy Compression by Coordinate Sampling in chapter 8 that

The metadata that define the interpolation formula and its inputs are complete, so that the results of the coordinate reconstitution process are well defined and of a predictable accuracy.

I think it would be more consistent if we make the computational_precision attribute mandatory and not optional. Otherwise the accuracy would not be predictable.

Would that be agreeable?

@davidhassell
Contributor

Hi Anders,

I think it would be more consistent if we make the computational_precision attribute mandatory and not optional. Otherwise the accuracy would not be predictable.

That's certainly agreeable to me, as is your outline of how to change chapter 8.

Thanks,
David

@taylor13

taylor13 commented Jun 9, 2021

Wouldn't the statement be correct as is (perhaps rewritten slightly; see below), if we indicated that if the computational_precision attribute is not specified, a default precision of "32" should be assumed? I would think that almost always the default precision would suffice, so for most data writers, it would be simpler if we didn't require this attribute. (But I don't feel strongly about this.)

Not sure how to word this precisely. Perhaps:

The attributes and default values defined for the interpolation formula and its inputs ensure 
that the results of the coordinate reconstitution process are reproducible and of predictable 
accuracy.

@AndersMS
Contributor Author

Hi @taylor13 and @davidhassell,

Regarding the computational_precision attribute, it appears that we currently have two proposals: Either an optional attribute with a default value or a mandatory attribute.

I have written two versions of the new section 8.3.8, one for each of the two proposals. I hope that will help deciding!

Anders

Optional attribute version:

8.3.8 Computational Precision

The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.

To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable’s computational_precision attribute to one of the following values:

(table)
32: 32-bit floating-point arithmetic (default)
64: 64-bit floating-point arithmetic

For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the computational_precision attribute, or match or exceed 32-bit floating-point arithmetic if the computational_precision attribute has not been set.

Mandatory attribute version:

8.3.8 Computational Precision

The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.

To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset must specify the floating-point arithmetic precision by setting the interpolation variable’s computational_precision attribute to one of the following values:

(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic

For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the computational_precision attribute.
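For either variant, honouring the attribute on the reader side could be as simple as the following sketch (the function and mapping are illustrative assumptions, not part of the proposed text):

import numpy as np

# Illustrative mapping from the proposed computational_precision values to numpy dtypes
_MIN_DTYPE = {"32": np.float32, "64": np.float64}

def interpolation_dtype(computational_precision=None):
    """Return a numpy dtype that matches or exceeds the requested precision.

    If the attribute is absent, 64-bit is used here, which satisfies either
    draft above (it matches or exceeds the optional variant's default of "32").
    """
    if computational_precision is None:
        return np.float64
    return _MIN_DTYPE[computational_precision]

# e.g. tie_points.astype(interpolation_dtype("64")) before carrying out the interpolation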

@taylor13

I have a preference for "optional" because I suspect in most cases 32-bit will be sufficient and this would relieve data writers from including this attribute. There may be good reasons for making it mandatory; what are they?

Not sure about this, but I think "should" rather than "shall" is better.

@JonathanGregory
Contributor

Dear all

I've studied the text of proposed changes to Sect 8, as someone not at all involved in writing it or using these kinds of technique. (It's easier to read the files in Daniel's repo than the pull request in order to see the diagrams in place.) I think it all makes sense. It's well-designed and consistent with the rest of CF. Thanks for working it out so thoughtfully and carefully. The diagrams are very good as well.

I have not yet reviewed Appendix J or the conformance document. I'm going to be on leave next week, so I thought I'd contribute just this part before going.

Best wishes

Jonathan

There is one point where I have a suggestion for changing the content of the proposal, although probably you've already discussed this possibility. If I understand correctly, you must always have both the tie_point_dimensions and tie_point_indices attributes of the interpolation variable, and they must refer to the same tie point dimensions. Therefore I think a simpler design, easier for both the data-writer and the data-reader to use, would combine these two attributes into one attribute, whose contents would be "interpolation_dimension: tie_point_interpolation_dimension tie_point_index_variable [interpolation_zone_dimension] [interpolation_dimension: ...]".

Also, I have some suggestions for naming:

  • If you adopt my suggestion for a single attribute to replace tie_point_dimensions and tie_point_indices, an obvious name for it would be tie_points. You've used that name for the attribute of the data variable. However, I would suggest that the attribute of the data variable could equally well be called interpolation, since it names the interpolation variable, and signals that interpolation is to be used.

  • Your terminology has "tie point interpolation dimension" and "interpolation dimension", but the former is not a special case of the latter. That could be confusing, in the same way that (unfortunately) in CF terminology an auxiliary coordinate variable is not a special kind of coordinate variable. I suggest you rename "tie point interpolation dimension" as e.g. "tie point reduced dimension" to avoid this misunderstanding.

  • A similar possible confusion is that a tie point index variable is not a special kind of tie point variable. To avoid this confusion and add clarity, I suggest you could rename "tie point variable" as "tie point coordinate variable".

  • The terms "interpolation zone" and "interpolation area" are unhelpful because it's not obvious from the words which one is bigger, so it's hard to remember. If you stick with "zone" for the small one, for area it would be better to use something which is more obviously much bigger, such as "province" or "realm"! Or perhaps you could use "division" or "department", since the defining characteristic is the discontinuity.

In the first paragraph of Sect 8 we distinguish three methods of reduction of dataset size. I would suggest minor clarifications:

There are three methods for reducing dataset size: packing, lossless compression, and lossy compression. By packing we mean altering the data in a way that reduces its precision (but has no other effect on accuracy). By lossless compression we mean techniques that store the data more efficiently and result in no loss of precision or accuracy. By lossy compression we mean techniques that store the data more efficiently and retain its precision but result in some loss in accuracy.

Then I think we could start a new paragraph with "Lossless compression only works in certain circumstances ...". By the way, isn't it the case that HDF supports per-variable gzipping? That wasn't available in the old netCDF data format for which this section was first written, so it's not mentioned, but perhaps it should be now.

There are a few points where I found the text of Sect 8.3 possibly unclear or difficult to follow:

  • "This form of compression may also be used on a domain variable with the same effect." I think this is an unclear addition. If I understand you correctly, insead of this final sentence you could begin the paragraph with "For some applications the coordinates of a data variable or a domain variable can require considerably more storage than the data in its domain."

  • Tie Point Dimensions Attribute. If you adopt my suggestion above, this subsection would change its name to "Tie points attribute". It would be good to begin the section by saying what the attribute is for. As it stands, it plunges straight into details. The second sentence in particular, about interpolation zones, bewildered me - I didn't know what it was talking about.

  • I follow this sentence: "For instance, interpolation dimension dimension1 could be mapped to two different tie point interpolation dimensions with dimension1: tp_dimension1 dimension1: tp_dimension2." But I don't understand the next sentence: "This is necessary when different tie point variables for a particular interpolation dimension do not contain the same number of tie points, and therefore define different numbers of interpolation zones, as is the case in Multiple interpolation variables with interpolation parameter attributes." The situation described does not occur in the example quoted, I think. I wonder if it should say, "This occurs when data variables that share an interpolation dimension and interpolation variable have different tie points for that dimension."

  • Instead of "A tie point variable must span at most one of the tie point interpolation dimensions associated with a given interpolation dimension." I would add a sentence to the first para of "Interpolation and non-interpolation dimension", which I would rewrite as follows:

For each interpolation variable identified in the tie_points attribute, all the associated tie point variables must share the same set of one or more dimensions. Each of the dimensions of a tie point variable must be either a dimension of the data variable, or a dimension which is to be interpolated to a dimension of the data variable. A tie point variable must not have more than one dimension corresponding to any given dimension of the data variable, and may have fewer dimensions than the data variable. Dimensions of the tie point variable which are interpolated are called tie point reduced dimensions, and the corresponding data variable dimensions are called interpolation dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point variable, are called non-interpolation dimensions. The size of a tie point reduced dimension must be less than or equal to the size of the corresponding interpolation dimension.

  • In one place, you say "For each interpolation dimension, the number of interpolation zones is equal to the number of tie points minus the number of interpolation areas," and in another place, "An interpolation zone must span at least two points of each of its corresponding interpolation dimensions." It seems to me that "at least" is wrong - it should be "exactly two".

  • "The dimensions of an interpolation parameter variable must be a subset of zero or more of the ...".

  • I suggest a rewriting of the part about the dimensions of the interpolation parameter variable, for clarity, if I've understood it correctly, as follows:

Where an interpolation zone dimension is provided, the variable provides a single value along that dimension for each interpolation zone, assumed to be defined at the centre of the interpolation zone.

Where a tie point reduced dimension is provided, the variable provides a value for each tie point along that dimension. The value applies to the two interpolation zones on either side of the tie point, and is assumed to be defined at the interpolation zone boundary (figure 3).

In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable applies equally to all interpolation zones along any interpolation dimension which it does not span.

  • For "The bounds of a tie point must be the same as the bounds of the corresponding target grid cells," I would suggest, "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."

  • I don't understand this sentence: "In this case, though, the tie point index variables are the identifying target domain cells to which the bounds apply, rather than bounds values themselves." A tie point index variable could not possibly contain bounds values.

  • In Example 8.5, you need only one (or maybe two) data variables since they're all the same in structure.

@AndersMS
Contributor Author

Dear @JonathanGregory

Thank you very much for your rich and detailed comments and suggestions, very appreciated.

The team behind the proposal met today and discussed all the points you raised. We have prepared or are in the process of preparing replies to each of the points. However, before sharing these here, we would like to update the proposal text accordingly via pull requests, in order to see if the changes have other effects on the overall proposal, which we have not yet identified.

Best regards,
Anders

@AndersMS
Contributor Author

AndersMS commented Jun 17, 2021

Dear All,

Following a discussion yesterday in the team behind the proposal, we propose that the computational_precision attribute be optional. Here is the proposed text, which now has a reference to [IEEE_754]. Feel free to comment.

Anders

8.3.8 Computational Precision

The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.

To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable’s computational_precision attribute to one of the following values:

(table)
32: 32-bit floating-point arithmetic (default), comparable to the binary32 standard in [IEEE_754]
64: 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]

For the coordinate reconstitution process, the floating-point arithmetic precision should match or exceed the precision specified by the computational_precision attribute, or match or exceed 32-bit floating-point arithmetic if the computational_precision attribute has not been set.

Bibliography
References

[IEEE_754] "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.

@davidhassell
Contributor

Maybe we could append a new item at the end of "Coordinate Compression Steps" in Appendix J recommending that data producers check the positional error by comparing the reconstructed coordinates against the original data, and then provide as many details as possible regarding the reconstruction process and results (computational precision, positional error, etc...) in the comment attribute of the subsampled coordinates variables.

Thanks, Sylvain, I support this suggestion
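A producer-side check along those lines could look something like the sketch below (haversine distance on a spherical Earth; the function name, radius and reporting are illustrative assumptions):

import numpy as np

def positional_error_m(lat_orig, lon_orig, lat_rec, lon_rec, radius_m=6371000.0):
    """Great-circle distance (haversine, spherical Earth) between original and
    reconstituted cell positions, in metres."""
    lat0, lon0, lat1, lon1 = map(np.radians, (lat_orig, lon_orig, lat_rec, lon_rec))
    a = (np.sin((lat1 - lat0) / 2.0) ** 2
         + np.cos(lat0) * np.cos(lat1) * np.sin((lon1 - lon0) / 2.0) ** 2)
    return 2.0 * radius_m * np.arcsin(np.sqrt(a))

# err = positional_error_m(lat_original, lon_original, lat_reconstituted, lon_reconstituted)
# The mean and maximum of err could then be reported in the comment attribute of the
# subsampled coordinate variables, together with the computational precision used.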

@AndersMS
Contributor Author

AndersMS commented Jul 8, 2021

I too like the idea to recommend that data producers report positional errors (and I guess other coordinate value errors) between the original data coordinates and the reconstituted coordinates in the comment attribute of the subsampled coordinate variables.

Regarding the specification of the computational precision, which is required as input for the method to achieve an accuracy within the errors reported in the comment of the coordinate variable, my preference would still be the computational_precision attribute of the interpolation variable. I believe that for our Lossy Compression by Coordinate Subsampling to become popular, it should be easy and straightforward to use, in particular the data uncompression process performed by the data user. There should be no need for the user to look into the data variable comments in order to be able to uncompress the dataset. The computational_precision attribute makes this readable by software in a safe and automated manner.

The reason for this preference is that I have tried out different selections of interpolation method, degree of subsampling (4x4, 8x8, 16x16, 64x16) and computational precision (64-bit, 32-bit floating-point arithmetic) on a test data set. All three components can have a comparable effect on the positional error between the original and the uncompressed file, which I think justifies specifying the computational precision in the same way as we specify the interpolation method and the degree of subsampling.

@erget: It is true that the Conventions do not address computational precision, but I guess there are a number of undocumented and implicit assumptions. Say, if you have specified a grid mapping for coordinates represented as 64-bit floating-point, one would assume that the conversion between the two reference frames has been performed using 64-bit floating-point arithmetic, otherwise significant errors would be introduced. Considering the complexity of what we are doing, I think that stating the computational precision explicitly would be the safest.

Best regards,
Anders

@AndersMS
Contributor Author

AndersMS commented Jul 20, 2021

Dear team,

Following our meeting this afternoon, I propose the following new paragraph at the end of the section "Tie Points and Interpolation Subareas":

Tie point coordinate variables for both coordinate and auxiliary coordinate variables must be defined as numeric data types and are not allowed to have missing values.

Please let me know if you have comments.

Anders

Done: 0c5b732

@AndersMS
Contributor Author

Dear @JonathanGregory,

Just an update regarding the Lossy Compression by Coordinate Subsampling.

We have completed the implementation of the 16 changes in response to your comments on chapter 8. I have edited the comment above to include a link to the related commit(s) for each of the changes.

Generally we are very happy with the outcome and in particular the renaming of terms and attributes that you proposed has made the text easier to read.

You might wish to take a look at the rewritten section "Interpolation of Cell Boundaries". In response to your proposed change 15, we have had several discussions and meetings, resulting in a new concept for bounds interpolation. You will find the new section as the last in f3de508.

We will still do one more iteration on the section on Computational Precision; we will publish it here within the next few days.

Regarding the Appendix J, we have nearly completed the changes required to reflect the changes in Chapter 8. We expect to complete the update tomorrow or Thursday and I think it would make sense for you to wait for that before reading the appendix J.

Best regards,
Anders

@JonathanGregory
Contributor

Dear @AndersMS

Thanks for the update and your hard work on this. I will read the section again in conjunction with Appendix J, once you announce that the latter is ready.

Best wishes

Jonathan

@AndersMS
Contributor Author

AndersMS commented Jul 21, 2021

Dear All,

Just to let you know that, as agreed during the discussion of the new "Interpolation of Cell Boundaries" section (f3de508), I have added the following sentence to the "Interpolation Parameters" section (2ce5d66):

The interpolation parameters are not permitted to contain absolute coordinate information, such as additional tie points, but may contain relative coordinate information, for example an offset with respect to a tie point or with respect to a combination of tie points. This is to ensure that interpolation methods are equally applicable to both coordinate and bounds interpolation.

Anders

@AndersMS
Contributor Author

Dear @JonathanGregory,

Appendix J is now ready for your review.

The only remaining open issue is that we will do one more iteration on the section on Computational Precision for Chapter 8; we will publish it here within the next few days.

Best regards,
Anders

@JonathanGregory
Contributor

Dear @AndersMS et al.

Thanks for the new version. Can you tell me where to find versions of Ch 8 and App J with the figures in place? That would make it easier to follow.

I've just read the text of Ch 8, which I found much clearer than before. I don't recall reading about bounds last time. Is that new, or was I asleep?

Best wishes

Jonathan

@AndersMS
Contributor Author

Dear @JonathanGregory ,

I am still a bit new to documents on GitHub, but these two links do the job in my browser:

I got these links by going to #326, then selecting the Files changed tab, then scrolling down to ch08.adoc or appj.adoc and then selecting View File in the "... " pull-down menu on the right-hand side, opposite the file name.

Hope this will work at your end.

We had a section on boundary interpolation in the first version you read, but it was short and didn't do the job we would like it to do. For example, it did not guarantee to reconstitute contiguous bounds as contiguous bounds. The new section is our consolidated version, which does all that we wanted it to do.

Best regards,
Anders

@JonathanGregory
Contributor

Great, thanks, @AndersMS. I am still learning about GitHub. I was using the Diff, which doesn't show the diagrams, rather than Viewing the file, which works fine. Jonathan

@JonathanGregory
Contributor

Dear @AndersMS and colleagues

Thanks again for the new version. I find it very clear and comprehensive. I have a few comments.

Chapter 8

"Tie point mapping attribute" mentions "target dimension", which is not a phrase used elsewhere. Should this be "interpolated dimension"?

You say, "For the purpose of bounds interpolation, a single bounds tie point is created for each coordinate tie point, and is selected as the vertex of the tie point cell that is the closest to the boundary of the interpolation subarea with respect to each interpolated dimension." I don't understand why there is a choice of bounds tie points, because there's no index variable for them. Doesn't the tie point index variable dictate the choice of tie points for bounds?

Appendix J

The title says Appendix A. Presumably that's something to do with automatic numbering.

All of the subsections listed at the start (Common Definitions and Notation, Common conversions and formulas, Interpolation Methods, Coordinate Compression Steps, Coordinate Uncompression Steps) should have subsection headings, I think. They will be Sections J.1 etc. At the moment the last two are labelled as Tables J.1 and J.2 rather than subsections, but they're never referenced as tables.

Fig 1ff. s is explained beneath the fig, but it would be useful to explain it at the side of the fig as well, as you do for tp, i and u. Also, it would be useful to put the paragraph explaining notation before Fig 1, because Fig 1 uses the notation.

You say, "When an interpolation method is referred to as linear or quadratic, it means that the method is linear or quadratic in the indices of the interpolated dimensions." Linear also means that the coordinates of the interpolated points are evenly spaced, doesn't it; if so, that would be helpful to state.

You say, "In the case of two dimensional interpolation, the two variables are equivalently computed as ...". I would say "similarly", not "equivalently", which I would understand to mean that s1 and s2 are equivalent.

quadratic. It would be better not to use c for the coefficient, because it can be confused with the point c.

Please put the "Common conversion and formulae" table before the interpolation methods, or at least refer to it. Otherwise the reader encounters fdot fcross fplus fminus fmultiply etc. without having seen their definitions. Actually you list it before the interpolation methods in the preamble.

[bi_]quadratic_remote_sensing. Why not call it [bi_]quadratic_latitude_longitude, which describes the method, rather than its typical application? What does it mean to treat them as Cartesian or not? I would describe bilinear interpolation in lat,lon as treating them as Cartesian coordinates, but you must mean something different. Is there a projection plane involved?

Where is latitude_limit defined?

A couple of times, you write, "For each of the interpolated dimension". There should be an -s.

Conformance

For "Each tie_point_variable token specifies a tie point variable that must exist in the file, and each interpolation_variable token specifies an interpolation variable that must exist in the file," I think all you can say is that there are variables of these names in the file, since a checker can't tell they are definitely the "kind" of variable you intend.

Regarding, "The legal values for the interpolation_name attribute are contained in Appendix J," it would be helpful for the author of the checker to say where they can be found in the appendix.

Best wishes

Jonathan

@JonathanGregory
Contributor

Dear all

@AndersMS and colleagues have proposed a large addition to Chapter 8 and an accompanying new appendix to the CF convention, defining methods for storing subsampled coordinate variables and the descriptions of the interpolation methods that should be used to reconstruct the entire (uncompressed) coordinate variables. I've reviewed this in detail and it makes sense and seems clear to me, as someone who's never used these methods. Those who wrote this proposal are the experts. Enough support has been expressed for this proposal to be adopted, after allowing the time prescribed by the rules for further comments, and no objections have been expressed.

Therefore this proposal is on course for adoption in the next release of the CF convention as things stand. If anyone else who wasn't involved in preparing it has the time and interest to review it, that would no doubt be helpful, and now is the time to do that, in order not to delay its approval. It definitely requires careful reading and thinking, but it's logical and well-illustrated.

Best wishes

Jonathan

@AndersMS
Contributor Author

AndersMS commented Jul 23, 2021

Dear @JonathanGregory ,

Thank you for your rich set of comments and suggestions. I have provided replies below, in the same format we used for the first set of comments. Several of the replies I have already implemented in the document and indicated the corresponding commit. For others, the reply is not conclusive and if you find time, your feedback on the reply would be valuable.

I would prefer that @davidhassell look at the comments on the conformance chapter when he is available again.

Best regards,
Anders

Comment/Proposed Change 17

Chapter 8: "Tie point mapping attribute" mentions "target dimension", which is not a phrase used elsewhere. Should this be "interpolated dimension"?

Reply to Comment/Proposed Change 17
You are right, it should be "interpolated dimension" in that section. I have updated the text.
Commit(s) related to Comment/Proposed Change 17
ca81618

Comment/Proposed Change 18

Chapter 8: You say, "For the purpose of bounds interpolation, a single bounds tie point is created for each coordinate tie point, and is selected as the vertex of the tie point cell that is the closest to the boundary of the interpolation subarea with respect to each interpolated dimension." I don't understand why there is a choice of bounds tie points, because there's no index variable for them. Doesn't the tie point index variable dictate the choice of tie points for bounds?

Reply to Comment/Proposed Change 18
In the compressed data set we only store one bounds tie point per coordinate tie point. However, the existing boundary variable defined in Section 7.1, Cell Boundaries, requires you to store, in the case of 2D bounds for example, four bounds. The selection is between those four bounds, of which only one is the correct choice.
Text updated based on your feedback
Commit(s) related to Comment/Proposed Change 18
a30c58f

Comment/Proposed Change 19

Appendix J: The title says Appendix A. Presumably that's something to do with automatic numbering.

Reply to Comment/Proposed Change 19
Correct, that will be updated as part of the publishing magic.
Commit(s) related to Comment/Proposed Change 19
None.

Comment/Proposed Change 20

Appendix J: All of the subsections listed at the start (Common Definitions and Notation, Common conversions and formulas, Interpolation Methods, Coordinate Compression Steps, Coordinate Uncompression Steps) should have subsection headings, I think. They will be Sections J.1 etc. At the moment the last two are labelled as Tables J.1 and J.2 rather than subsections, but they're never referenced as tables.

Reply to Comment/Proposed Change 20
I agree and have introduced section numbering and removed table captions in Appendix J.
Commit(s) related to Comment/Proposed Change 20
f6f48fb

Comment/Proposed Change 21

Appendix J: Fig 1ff. s is explained beneath the fig, but it would be useful to explain it at the side of the fig as well, as you do for tp, i and u.

Reply to Comment/Proposed Change 21
Done. Also, I have named s as the “interpolation argument”, which I think is what it is.
Commit(s) related to Comment/Proposed Change 21
1002806

Comment/Proposed Change 22

Appendix J: Also, it would be useful to put the paragraph explaining notation before Fig 1, because Fig 1 uses the notation.

Reply to Comment/Proposed Change 22
Agree. Done.
Commit(s) related to Comment/Proposed Change 22
0fdc7e4

Comment/Proposed Change 23

You say, "When an interpolation method is referred to as linear or quadratic, it means that the method is linear or quadratic in the indices of the interpolated dimensions." Linear also means that the coordinates of the interpolated points are evenly spaced, doesn't it; if so, that would be helpful to state.
Appendix J:

Reply to Comment/Proposed Change 23
The answer is a bit tricky. If the coordinates are latitude and longitude, then the steps in each of these coordinates on its own will be evenly spaced. However, the points that the combined latitude/longitude describe on the reference ellipsoid will in general not be evenly spaced, only in some special cases, like along a meridian. The easiest place to visualize the non-evenly spaced points is around one of the poles, but it applies globally.

Actually, the best of our current methods to generate evenly spaced coordinate points is the "quadratic_remote_sensing" method. It can utilize its quadratic terms to counteract the distorting effect of the latitude/longitude coordinates.

Commit(s) related to Comment/Proposed Change 23
None (5f9ad9a, 06dac3a (reverts 5f9ad9a))

Comment/Proposed Change 24

Appendix J: You say, "In the case of two dimensional interpolation, the two variables are equivalently computed as ...". I would say "similarly", not "equivalently", which I would understand to mean that s1 and s2 are equivalent.

Reply to Comment/Proposed Change 24
Done.
Commit(s) related to Comment/Proposed Change 24
0116283

Comment/Proposed Change 25

Appendix J: It would be better not to use c for the coefficient, because it can be confused with the point c.

Reply to Comment/Proposed Change 25
Agreed and renamed from “c” to “w”. Also renamed related function fc() to fw().
Commit(s) related to Comment/Proposed Change 25
ea474a5

Comment/Proposed Change 26

Appendix J: Please put the "Common conversion and formulae" table before the interpolation methods, or at least refer to it. Otherwise the reader encounters fdot fcross fplus fminus fmultiply etc. without having seen their definitions. Actually you list it before the interpolation methods in the preamble.

Reply to Comment/Proposed Change 26
Good point, section moved.
Commit(s) related to Comment/Proposed Change 26
4efef82

Comment/Proposed Change 27

Appendix J: [bi_]quadratic_remote_sensing. Why not call it [bi_]quadratic_latitude_longitude, which describes the method, rather than its typical application?
What does it mean to treat them as Cartesian or not? I would describe bilinear interpolation in lat,lon as treating them as Cartesian coordinates, but you must mean something different. Is there a projection plane involved?

Reply to Comment/Proposed Change 27
We hope to find time after this review to prepare a small paper or a presentation that provides more insight into the interpolation methods and their performance.

You are right, it is like a projection plane, but we are using 3D cartesian coordinates. The problem we are addressing is that interpolating directly in latitude/longitude is inadequate when we are close to the poles. So, we temporarily convert the four tie points from lat/lon to xyz, do the interpolation and then convert the result back from xyz to lat/lon. Another common way to address this problem is to project the lat/lon point on the xy plane, do the interpolation and project the point back to lat/lon. However, by using xyz, we can also solve the problem that arises when our interpolation subarea crosses +/-180 deg longitude.

Let me try to support the above with a simple example (hoping that I am not upsetting anybody with such a simple example...)

Think of a hypothetical remote sensing instrument that scans the Earth in a way that can be approximated as arcs of a great circle on the Earth surface. So, if the instrument scans from point A to point B, then the points it scanned between A and B will be on the great circle between A and B. It will follow this simple principle for any location on Earth.

If you are near Equator and A = (0W, 0N) and B= (4W, 4N), then you can generate three points between A and B by interpolating in longitude and latitude separately and will get (1W, 1N), (2W, 2N) and (3W, 3N), which are approximately aligned with the great circle arc between A and B.

If you are near the North Pole and A = (0W, 88N) and B= (180W, 88N) and do the interpolation in longitude and latitude separately, you will get (45W, 88N), (90W, 88N) and (135W, 88N), which are on an arc of a small circle and is the wrong result. By first converting to cartesian coordinates, then interpolating and then converting back to longitude latitude, you will get the correct result: (0W, 89N), (0W, 90N) and (180W, 89N), which are on a great circle.
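The polar example can be reproduced with a small sketch (unit-sphere geometry, illustrative only; not the exact formulas of Appendix J):

import numpy as np

def to_xyz(lon_deg, lat_deg):
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

def to_lonlat(v):
    x, y, z = v
    return np.degrees(np.arctan2(y, x)), np.degrees(np.arctan2(z, np.hypot(x, y)))

a = (0.0, 88.0)      # point A: 0E, 88N
b = (180.0, 88.0)    # point B: 180E, 88N, on the other side of the North Pole

# Interpolating longitude and latitude separately puts the midpoint on a small circle:
print((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)            # 90.0 88.0 -- wrong
# Interpolating in Cartesian space and converting back follows the great circle:
lon_mid, lat_mid = to_lonlat((to_xyz(*a) + to_xyz(*b)) / 2.0)
print(lat_mid)       # ~90.0: the midpoint is at the pole (longitude is arbitrary there)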

That was also why we suggested the name [bi_]quadratic_remote_sensing.
We agree to change the name to [bi_]quadratic_latitude_longitude too (done).
I have changed cartesian to three-dimensional cartesian as you suggested.

Commit(s) related to Comment/Proposed Change 27
d9436e4
af2a2ea
4cc00cb

Comment/Proposed Change 28

Appendix J: Where is latitude_limit defined?

Reply to Comment/Proposed Change 28
It is a value to be decided by the creator of the data set, as a trade-off between speed and accuracy, considering that the conversions to and from cartesian coordinates take longer. Practically, it will be of the order of 85 degrees of latitude. I could add that. But I suggest that we also discuss it in the group during our meeting on Tuesday.
Text improved.

Commit(s) related to Comment/Proposed Change 28
546d288

Comment/Proposed Change 29

Appendix J: A couple of times, you write, "For each of the interpolated dimension". There should be an -s.

Reply to Comment/Proposed Change 29
Corrected.
Commit(s) related to Comment/Proposed Change 29
303aaa4

@AndersMS
Contributor Author

Dear All,

Here are the links to the easy-to-read versions including all the above changes:

Anders

@AndersMS
Contributor Author

Dear @JonathanGregory,

Just to let you know that I have updated my Reply to Comment/Proposed Change 23 above.

Anders

@JonathanGregory
Contributor

Dear @AndersMS

Thanks for your detailed replies. I think there are only two outstanding points in those you have answered.

18: Now I understand what you mean, thanks. To make this clearer to myself, I would say something like this: Bounds interpolation uses the same tie point index variables and therefore the same tie point cells as coordinate interpolation. One of the vertices of each coordinate tie point cell is chosen as the bounds tie point for the cell. For 1D bounds, the vertex chosen is the one which is on the side closer to the boundary of the interpolation subarea. For 2D bounds, the vertex chosen is the one which is closest to the boundary of the interpolation subarea, considering all the interpolated coordinates together, or in other words, the one closest to the corner of the interpolation subarea.

Are you restricting the consideration of 2D bounds to rectangular cells, or are polygons of n vertices allowed?

27: I think the key point is that you mean three-dimensional Cartesian interpolation. I didn't think of that. If you could clarify this, it would be fine.

Cheers

Jonathan

@davidhassell
Contributor

Dear @JonathanGregory, @AndersMS, and all,

Conformance

For "Each tie_point_variable token specifies a tie point variable that must exist in the file, and each interpolation_variable token specifies an interpolation variable that must exist in the file," I think all you can say is that there are variables of these names in the file, since a checker can't tell they are definitely the "kind" of variable you intend.

Regarding, "The legal values for the interpolation_name attribute are contained in Appendix J," it would be helpful for the author of the checker to say where they can be found in the appendix.

Addressed in AndersMS@8b8c185

I have also added some conformance requirements and recommendations for bounds tie point variables: AndersMS@bdac108

Thanks,
David

@erget
Member

erget commented Aug 3, 2021

Dear @JonathanGregory et al.,

Due to the heroic contributions primarily of @AndersMS and @davidhassell, as well as the expert review of @oceandatalab and friends, we can present to you the now-finalised version of the pull request associated with this issue.

To see all points listed and addressed one-by-one you can check #327 (comment); hopefully that is traceable.

We have completed our proposal, finalising the section regarding computational precision - this is now found at the end of chapter 8.3.

#326 contains the documents in their latest state, which I have also attached in a compiled form for your perusal:

Note that before finalisation of this version of the Conventions the following items will need to be addressed; these are, however, of a purely editorial nature, so in the interest of time we are not correcting them during the 3-week freeze:

  • Fix numbering and labelling of figures
  • Fix image scaling so that the PDF doesn't make you squint
    This will need to be fixed globally before publishing so that figures don't get squeezed in.

A clever idea here would be to name e.g. the first figure in chapter 7 "Figure 7.1" so that the figures are always numbered correctly independently of previous chapters. I leave this to future minds to solve.

I therefore thank all contributors again for the loads of precise and hard work, and motion that the 3 week period start for this proposal so that we are on time to get it adopted into CF-1.9.

I look forward to hearing hopefully a resounding silence in response to the finalised proposal!

@JonathanGregory
Contributor

Dear @AndersMS @davidhassell @erget @oceandatalab and collaborators

Thanks for the enormous amount of hard and thorough work you have put into this, and for answering all my questions and comments. I have no more concerns. Looking through the rendered PDF of App J, I see boxes, probably indicating some character which Chrome can't print, in "Common Conversions and Formulas", after sin and cos.

If anyone else would like to review and comment, they are welcome to do so. If no further concerns are raised, the proposal will be accepted on 24th August.

Cheers

Jonathan

@AndersMS
Contributor Author

Dear @JonathanGregory ,

Regarding the interpolation of bounds, you asked:

Are you restricting the consideration of 2D bounds to rectangular cells, or are polygons of n vertices allowed?

We are restricting the interpolation of bounds to contiguous cell bounds. I think that the consequence of this is that we are restricting the consideration of 2D bounds to rectangular cells. Possibly @davidhassell can confirm.

What we do support is interpolation of 1D, 2D, etc bounds. Hence the sentence:

One of the vertices of each coordinate tie point cell is chosen as the bounds tie point for the cell. It is selected as the vertex of the tie point cell that is the closest to the boundary of the interpolation subarea with respect to each interpolated dimension.

that applies for any number of interpolated dimensions.

Cheers

Anders

@AndersMS
Contributor Author

Dear @JonathanGregory,

Once again, thank you very much for your thorough review and valuable comments, which significantly improved the proposal.

Cheers

Anders

@AndersMS
Contributor Author

Dear @JonathanGregory ,

We have just discussed the matter of the cell bounds interpolation and the question you raised.

To make the conditions for bounds interpolation clearer, we have changed (b10fb67) the first part of the first paragraph in the section on bounds interpolation to:

Coordinates may have cell bounds. For the case that the reconstituted cells are contiguous and have exactly two cell bounds along each interpolated dimension, cell bounds of interpolated dimensions can be stored as bounds tie points and reconstituted through interpolation.

We hope you are fine with that change.

Best regards,
The interpolation team

@JonathanGregory
Contributor

JonathanGregory commented Aug 16, 2021

Dear @AndersMS

Thanks for the clarification. That's fine. The proposal will be approved next Tuesday 24th if no further concern is raised. [edited twice - I was accidentally reading the calendar for next month]

Best wishes

Jonathan

@JonathanGregory
Contributor

Dear @AndersMS @erget et al.

I would be pleased to merge the pull request and close this issue, but I see that the PR has conflicts which have to be resolved. I expect there is some GitHub incantation which you can pronounce to resolve them.

Best wishes

Jonathan

@erget
Member

erget commented Aug 24, 2021

@JonathanGregory @AndersMS et al., chanting is all done and the merge is complete. Thanks all for your many varied contributions - this was a lot of work on all sides and my hope is that it proves useful to both data producers and consumers moving forward!

@JonathanGregory
Contributor

Congratulations and thanks to all who contributed to this successful piece of work.
