Standard way to define subsampled coordinates #37

Closed
oceandatalab opened this issue Mar 6, 2020 · 113 comments

@oceandatalab

At OceanDataLab we process a wide variety of EO data (satellite, in-situ, model) on a daily basis and develop tools to analyse and visualise these data.

Frame of the question

The resolution of sensors has steadily increased in recent years, producing data at higher resolution with each new generation of instruments. As finer and finer scales become observable, more and more storage is also required to save these data.
Most remote sensing data are also provided in the geometry of the sensor, meaning that coordinates must be supplied along with each acquisition. So, for satellites, data at higher resolution also means more coordinates to store, download, load into memory, etc. To reduce the size overhead induced by these coordinates, a solution which seems to be gaining traction consists of providing coordinates only for a subset of cells and relying on interpolation to compute the coordinates of the other cells of the data matrix. This is what has been done for Sentinel-1 (using ground control points) and Sentinel-3 data, for example, and it is very likely that this approach will become prevalent in the future.

Question

NetCDF is the main file format for satellite data, but the CF conventions do not provide a standard way to define coordinate arrays or matrices at a resolution that does not match the resolution of the data. This is a major issue because it is impossible to develop generic software for reading these data, as each satellite mission implements this method in a different way: users have to adapt to each satellite product to reconstruct the coordinates. This is a barrier for many users and a real obstacle to the exploitation of satellite data.

Has this problem already been discussed and are there plans to standardize the mapping of data matrices with coordinates provided at a different resolution?

@JimBiardCics

@oceandatalab This has been discussed, but we haven't come to any agreement to move forward with any mapping scheme. We've also talked about allowing one netCDF file to reference coordinate variables in another file as a way to address redundancy in the use case where many files cover the same swath.

The basic CF approach of connecting coordinates to data via dimensions makes it difficult to figure out how to solve this problem. If you have a robust and concrete suggestion of a way to handle the issues, I'd love to hear it.

@oceandatalab
Author

oceandatalab commented Mar 12, 2020

Well, in order to make our code as reusable as possible, our strategy so far has been to homogenize the EO data we use as input files, which implies a pre-processing step to convert them to a common format.
Right now this common format is NetCDF with a list of mandatory attributes that our tools require (most of them are defined in the CF conventions) and some rules to define the subsampled coordinates, depending on which data model (CDM_data_type with a few tweaks) matches best.
The description of the format we use is available here.

Basically, for each data model we have:

  • a set of dimensions with predefined names for our data variables (time, lat,
    lon, etc... we want to remain compatible with other software that support
    CF-compliant files)
  • a set of dimensions and variables with predefined names to define the
    subsampled coordinates

For example, for gridded data with axes aligned with meridians and parallels, we have:

 dimensions:
        lon = 7200 ;
        lat = 521 ;
        lon_gcp = 720 ;
        lat_gcp = 52 ;
        time = UNLIMITED ;
    double time(time) ;
        time:long_name = "time" ;
        time:standard_name = "time" ;
        time:units = "seconds since 1970-01-01T00:00:00.000000Z" ;
        time:calendar = "standard" ;
        time:axis = "T" ;
    float lat_gcp(lat_gcp) ;
        lat_gcp:long_name = "ground control points latitude" ;
        lat_gcp:standard_name = "latitude" ;
        lat_gcp:units = "degrees_north" ;
        lat:axis = "Y" ;
        lat:comment = "geographical coordinates, WGS84 projection" ;
    float lon_gcp(lon_gcp) ;
        lon_gcp:long_name = "ground control points longitude" ;
        lon_gcp:standard_name = "longitude" ;
        lon_gcp:units = "degrees_east" ;
        lon_gcp:axis = "X" ;
        lon_gcp:comment = "geographical coordinates, WGS84 projection" ;
    int index_lat_gcp(lat_gcp) ;
        index_lat_gcp:long_name = "index of ground control points in lat dimension" ;
    int index_lon_gcp(lon_gcp) ;
        index_lon_gcp:long_name = "index of ground control points in lon dimension" ;

The dimensions of a data variable are (time, lat, lon):

    ubyte variable(time, lat, lon) ;
        variable:_FillValue = 255UB ;
        variable:long_name = "variable long name" ;
        variable:standard_name = "variable_standard_name" ;
        variable:units = "variable Unit" ; variable:add_offset = offset (float);
        variable:scale_factor = offset (float) ;

Each tuple of (lat_gcp, lon_gcp, index_lat_gcp, index_lon_gcp) associates
geographical coordinates (lat_gcp, lon_gcp) with coordinates in the data space
(index_lat_gcp, index_lon_gcp).
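
For illustration, here is a minimal sketch (not part of the format specification) of how a reader could reconstitute full-resolution coordinates from the variables above, assuming simple linear interpolation between GCPs; the file name is hypothetical and a GCP-aware reader could of course use a different interpolation method:

    import numpy as np
    import netCDF4

    # Hypothetical file following the layout in the CDL above.
    ds = netCDF4.Dataset("example_gcp.nc")

    lat_gcp = ds.variables["lat_gcp"][:]        # latitudes at the GCPs
    lon_gcp = ds.variables["lon_gcp"][:]        # longitudes at the GCPs
    i_lat = ds.variables["index_lat_gcp"][:]    # GCP positions along the lat dimension
    i_lon = ds.variables["index_lon_gcp"][:]    # GCP positions along the lon dimension

    n_lat = ds.dimensions["lat"].size           # 521 in the example
    n_lon = ds.dimensions["lon"].size           # 7200 in the example

    # The grid is aligned with meridians and parallels, so lat and lon are two
    # independent 1-D interpolation problems from GCP indices to all cell indices.
    lat_full = np.interp(np.arange(n_lat), i_lat, lat_gcp)
    lon_full = np.interp(np.arange(n_lon), i_lon, lon_gcp)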

This is the simplest case, but you can already see that it adds quite a lot of non-standard dimensions and variables. It also makes the file impossible to use with other mapping software because there are no lat and lon variables (we avoided giving them standard names because it would be misleading for users and tools that expect full-resolution coordinates in these variables).

@oceandatalab
Author

We are currently revisiting this approach to support more data models and are trying to remove rules that are specific to our tools by adapting our format specifications so that they follow well-defined standards as much as possible.

In the example above, one possibility would be to rename the lat_gcp and lon_gcp variables to lat and lon, but also add an attribute that references the variable which describes the distribution of the coordinates (the "index" variable).

This is only a minor improvement over the previous version but it would avoid using a non-standard naming scheme for coordinate variables and allow software to detect that coordinates are not provided at full resolution.

@erget
Member

erget commented Mar 24, 2020

In my understanding the variable names themselves are less important than the attributes attached to them, which provide the metadata, so choosing a different name for lat_gcp and co. wouldn't have a huge impact from a CF perspective.

This ties in with work driven by Aleksandar Jelenak - specifically the situation described in this figure. However, in that case the coordinates are still provided in full.

I'd also be happy to have a well-described way of doing this as we'd all save a lot of bandwidth! In your example, how do you relate the data points in variable back to geolocated coordinates?

@ethanrd
Member

ethanrd commented Mar 24, 2020

Yes, Aleksandar and others also brought this up a number of times outside of the swath proposal.

@ajelenak - Was there ever a proposal or discussion in Trac? Or was it mostly conversations?

@JimBiardCics

@oceandatalab I see two weaknesses in the scheme you describe. There is no explicit declaration of how the different coordinate-type variables and dimensions relate to one another. There is also no specification of how to interpolate between GCPs. I think both of these issues would need to be addressed.

@ajelenak

It never reached the level of serious discussion that would produce a proposal. NASA's HDF-EOS convention has supported this since its inception, and that's what we mentioned in our swath proposal. I think this issue is another indication of the need.

@oceandatalab
Author

@erget - Thank you for the link, very interesting!

We use GDAL to interpolate geolocated coordinates for the variable data points (setting the dataset GCPs then using a transformer), but we will probably switch to another method in the future (see our reply to @JimBiardCics below).

@JimBiardCics - Thank you for the feedback!

We are not sure we understand your first comment: "how the different coordinate-type variables and dimensions relate to one another". Can you give us an example?

As for the second comment, the interpolation method used to compute coordinates between GCPs depends on the accuracy you need to achieve and on the data model.

For grids that can be described with a geotransform, using linear interpolation with the geotransform parameters is the obvious way.
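
As an illustration of that geotransform case, a minimal sketch (gt follows the GDAL affine-geotransform convention; it is not something defined by the format described earlier):

    def geotransform_to_lonlat(gt, row, col):
        """Affine geotransform, GDAL convention: gt = (x0, dx_col, dx_row, y0, dy_col, dy_row)."""
        # The +0.5 offsets give the coordinates of the cell centre rather than its upper-left corner.
        lon = gt[0] + (col + 0.5) * gt[1] + (row + 0.5) * gt[2]
        lat = gt[3] + (col + 0.5) * gt[4] + (row + 0.5) * gt[5]
        return lon, lat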

For swaths we currently use thin plate splines (via GDAL) but we have tested other methods as well (Akima, bivariate spline approximation over a rectangular mesh, etc...) and noticed that these methods may provide better accuracy or reduce computation time in some cases.

So it would be possible to specify a recommended method, but it would not necessarily be optimal or suitable for the reader constraints (accuracy and processing time).
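
For the swath case, here is a minimal thin-plate-spline sketch using SciPy rather than GDAL (the GCP positions and lon/lat values are synthetic, invented purely for illustration; other interpolators could be substituted, as discussed above):

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    # Synthetic 100 x 200 swath with GCPs every 20th row and column.
    n_rows, n_cols = 100, 200
    gcp_r, gcp_c = np.meshgrid(np.arange(0, n_rows, 20), np.arange(0, n_cols, 20), indexing="ij")
    gcp_rc = np.column_stack([gcp_r.ravel(), gcp_c.ravel()]).astype(float)

    # Invented lon/lat values at the GCPs; in a real product these come from the file.
    gcp_lonlat = np.column_stack([5.0 + 0.01 * gcp_rc[:, 1], 45.0 + 0.01 * gcp_rc[:, 0]])

    # Thin plate spline mapping from image (row, col) space to geographic space.
    tps = RBFInterpolator(gcp_rc, gcp_lonlat, kernel="thin_plate_spline")

    # Evaluate at every cell of the full-resolution grid.
    rows, cols = np.mgrid[0:n_rows, 0:n_cols]
    cells = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
    lonlat = tps(cells)
    lon_full = lonlat[:, 0].reshape(n_rows, n_cols)
    lat_full = lonlat[:, 1].reshape(n_rows, n_cols)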

@JimBiardCics

@oceandatalab There will need to be an unambiguous way to declare exactly what the relationship is between the different dimensions in your example (lat, lon, lat_gcp, and lon_gcp). There will also need to be a way to specify the recommended interpolation method and a statement of the interpolation accuracy. Using an inappropriate interpolation method with some swath-type lat/lon fields can produce significantly incorrect results.

I envision a variable akin to a grid_mapping variable that would hold all the above information in attributes. Such a variable might also hold data values that would correspond, for example, to rational function coefficients.
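
To make the idea concrete, here is a hypothetical sketch of such a container variable written with netCDF4-python; every attribute name below is invented for illustration only and none of them is part of CF or of any proposal:

    import netCDF4

    ds = netCDF4.Dataset("example_gcp.nc", "a")
    # Scalar char variable used only as a container for attributes, like a grid_mapping variable.
    interp = ds.createVariable("coordinate_interpolation", "c")
    interp.interpolation_method = "thin_plate_spline"            # recommended method (invented name)
    interp.interpolation_accuracy = "maximum horizontal error 0.5 km"
    interp.gcp_index_variables = "index_lat_gcp index_lon_gcp"   # ties GCPs to data dimensions
    ds.close()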

@erget
Member

erget commented Mar 25, 2020

I agree with @ajelenak that this is an indication that there's a need to address this in CF. Hopefully next week I can take a look - @oceandatalab would you be interested in contributing to a corresponding proposal?

@JimBiardCics

@erget I'd be up for working on this as well.

@ajelenak

We are talking about a proposal to support "sampled" coordinates in general, not specific to satellite observations, correct?

@oceandatalab
Author

@erget - Sure, we are really interested in participating in a proposal and glad to see this issue being pushed forward. Let us know how we can help.

@JimBiardCics

@ajelenak I'd say yes.

@ajelenak

Good, that's the right approach.

@TomLav

TomLav commented Mar 25, 2020

From earlier discussions I have seen on this topic, it was never clear whether down-sampling lat/lon onto GCPs is clearly advantageous with respect to letting internal compression do the size reduction. Has such a trade-off study been conducted for several types of EO swath data? From a user's point of view, having access to the lat/lon in the files is clearly an advantage.

Also I note that we should maybe not limit ourselves to subsampling coordinates. In the satellite world, what if we want to specify Sun/view angles on GCPs, and provide a routine to interpolate to each FoV? In such a case, the angles are not a coordinate but a regular field to interpolate from GCPs.

Finally, several space missions already implement different downsampling strategies (e.g. MWI, AMSR2) so we should take a look at what they do.

I would also be interested in contributing to such a proposal.

@davidhassell
Collaborator

Any acceptable solution to this issue will need to be made with some reflection on the CF data model. If the solution amounts to "fancy compression", then the data model is likely unaffected, but other cases that have been alluded to (as I understand them) could possibly require an extension.

At this stage, I think the discussion should carry on without worrying about this - it's too early in the process - but this is just a reminder for later.

(Note that whilst extending the data model is certainly possible (as was done for simple geometries), it should be the last resort - unnecessary extensions will move the data model away from its minimalistic design principles and increase the chance of the data model not meeting future needs. If an acceptable solution can be found that fits in with the current model, then that is ideal.)

@oceandatalab
Author

@TomLav - Several satellite observations already require the use of GCPs (Sentinel-1 SAR, Sentinel-3 SLSTR & OLCI), but the methods and formats are all different, as there is no consensus at the moment. The idea here is to come up with recommendations so that good practice can be provided for cases wherein GCPs or subsampled coordinates/data are required.
From our point of view, having relevant ancillary data on GCPs is indeed the best practice and is already the method employed for the aforementioned sensors. CF recommendations should also address this case to homogenize the format.

@erget
Member

erget commented Mar 26, 2020

This is a lot of great input, I'm glad there's so much interest.

@TomLav - for our satellites we have indeed done trade-off studies of compression vs. GCPs and found that GCPs do reduce data volume. There might be a way of regularising these data so that they compress better, but in general reduction seems to be more effective than simple compression.

@davidhassell - agreed, I think an optimal solution would be to implement this in a way that can be considered "fancy compression" rather than an extension to the data model.

I think it might be worthwhile to have a telco on this issue - who's interested and from what time zones?

@oceandatalab
Author

@erget - we would be interested in participating in such a teleconference; our time zone is CET (we switch to DST on March 29, so GMT + 1 until Saturday, then GMT + 2).

@JimBiardCics

@erget UTC-4 (US Eastern Daylight Time)

@TomLav

TomLav commented Mar 26, 2020

@erget - I am same as @oceandatalab .

@davidhassell
Collaborator

@erget I am GMT until Saturday, then GMT + 1

@ajelenak

@erget You say Apr 2-6 but the Doodle poll says Apr 9-13.

@JimBiardCics

@erget Did I miss the meeting?

@JimBiardCics

Yes. It didn't allow me to join. I'll try again.

@JimBiardCics

The final "Join Meeting" button is grayed out.

@JimBiardCics

I'll leave things in your collective capable hands and go walk my dog, who is standing here with her legs crossed.

@erget
Member

erget commented Apr 3, 2020

Minutes of first discussion on subsampled coordinates in CF

Executive summary

We have collected different approaches that can be used to reduce data volumes in Earth Observation data (and, to some extent, in gridded data in general). There is interest and capacity to try to fit this cleanly into the CF Conventions.

A follow-up meeting will be scheduled for the week from 6-10 April 2020. Please mark your availability in this Doodle. @erget will schedule the meeting on Monday, 2020-04-06 at 16 UTC.

Introduction

Purpose: Understand common practices for providing subsampled coordinates in a standardised way.

Present:

The utility of subsampled coordinates

The group agreed that subsampled coordinates can have several benefits:

  • decreased storage volume
  • decreased transmission volume
  • fidelity to original data; often, ancillary data is provided only at reduced resolution, and resampling this ancillary data to the full resolution of observation data doesn't necessarily make sense.

This means that in some cases, subsampling coordinates can not only provide a description of data products that avoids unnecessary assumptions; it also can have large cost impacts on all users of the data products in question.

Known practices of providing only subsampled coordinates in data products

Three approaches were presented. I list them here in what I consider to be increasing order of complexity, not the order in which they were introduced.

HDF EOS

@ajelenak provided an introduction to HDF EOS, which is specified for HDF rather than netCDF data. The approach could, however, be transferred.

In this approach, geolocation data is "thinned" and metadata is provided to let the user know what offsets and increments need to be applied in order to select applicable data from geolocation arrays and apply them to the longer observation data arrays. It is left to the user to interpolate coordinates, if desired, or to use another approach.
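
As a rough illustration (not the HDF-EOS specification itself, and with values invented for the example), the offset/increment mapping amounts to something like:

    import numpy as np

    # With offset=2 and increment=4, thinned geolocation sample k sits at
    # observation column 2 + 4*k of the full-resolution data array.
    offset, increment = 2, 4          # illustrative values
    n_geo = 50                        # number of thinned geolocation samples
    data_index = offset + increment * np.arange(n_geo)
    # A reader that wants full-resolution coordinates would interpolate between
    # the geolocation samples placed at these data_index positions.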

He also noted that it is important to ensure that, if we adopt such an approach in CF, users are given enough information to recognise that coordinates are subsampled, and that it may be useful to provide guidance on how to interpolate between Ground Control Points (GCPs). GCPs were understood as the points at which geolocation data is provided; in the case of HDF EOS, more observations are present than GCPs.

Providing offsets for channels with similar but non-identical view centres

@TomLav presented approaches used for conically-scanning passive microwave imagers. For such imagers, netCDF is not a common format but this is changing with the advent of new missions.

Two approaches were presented, both using the example of the AMSR2 instrument. The L1b products are provided with geolocation data for only one channel, whereas formulae are provided that allow users to reconstruct the view centre of every other channel for the same position in the data array. This is because the instrument observes multiple channels in a consistent geospatial relation to each other.

Additionally, a L1R product is available, for which the individual views of each channel have been resampled in order to simulate having the same coordinates as the baseline channel. The individual values are of course changed by the interpolation process.

It was also mentioned that the MWI instrument from EUMETSAT's EPS-SG mission will supply only tie-points; users are expected to interpolate between the tie-points in order to recover the coordinates of individual views. The coordinates are provided in Earth-Centre Earth Fixed coordinates.

Recovery of coordinates via parametrised interpolation within tie-point zones

Anders (EUMETSAT) presented the approach used to compress VIIRS Sensor Data Records (SDRs) within the EUMETSAT Advanced Retransmission Service. As the VIIRS instrument uses an irregular pixel aggregation scheme, linear interpolation of coordinates is not accurate. The specific non-linear aspects of the interpolation change along the instrument's scan, meaning that different terms must be applied at different points in the SDR.

To solve this problem, VIIRS SDRs are divided into Tie Point Zones (TPZs), for which the corner coordinates are provided for groups of pixels. The centre coordinates for all pixels within a TPZ can then be interpolated given the interpolation correction factors that are encoded into the product as metadata. These factors are applied to all TPZs within a given TPZ group; there may be several TPZ groups within a product.
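
A hedged sketch of the bilinear part of that expansion for a single tie point zone is shown below (the real Compact VIIRS algorithm adds the non-linear correction factors mentioned above, which are omitted here; corner values in the example are invented):

    import numpy as np

    def expand_tie_point_zone(corner_lon, corner_lat, n_track, n_scan):
        """Bilinear expansion of one tie point zone from its 2x2 corner coordinates."""
        s = (np.arange(n_track) + 0.5) / n_track       # fractional position along track
        t = (np.arange(n_scan) + 0.5) / n_scan         # fractional position along scan
        s, t = np.meshgrid(s, t, indexing="ij")

        def bilinear(c):
            return ((1 - s) * (1 - t) * c[0, 0] + (1 - s) * t * c[0, 1]
                    + s * (1 - t) * c[1, 0] + s * t * c[1, 1])

        return bilinear(corner_lon), bilinear(corner_lat)

    # Example for a 16 x 16 pixel zone with invented corner coordinates.
    lon, lat = expand_tie_point_zone(np.array([[10.0, 10.2], [10.1, 10.3]]),
                                     np.array([[50.0, 50.0], [49.9, 49.9]]), 16, 16)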

Multiple implementations of this approach exist, including in the FOSS world (see PyTroll). Reducing the coordinate data in this way roundly beats classical compression. Compression and data reduction are of course not mutually exclusive.

It was also noted that the coordinates are interpreted on the ellipsoid; no terrain correction is applied, which matters because of the parallax effect that affects this and other instruments.

The full algorithm and the encoding parameters are described in the Compact VIIRS SDR Product Format User Guide.pdf.

Way forward

A follow-up meeting is planned for next week. These topics will be pursued:

  • Should such data reduction techniques become part of the CF Convention, e.g. as grid mappings? It would be particularly interesting if it were possible to do this without affecting the CF Data Model.
  • How can we achieve this in a way that is sufficiently general that it can be applied to multiple instruments, yet still specific enough that it can be implemented? Many of the approaches discussed were fairly tailored to the instruments in question.


@oceandatalab
Author

* @oceandatalab (S) asked about the choice behind the notation of interpolation coefficients - they now use the formula_terms notation. @davidhassell considers this the cleanest approach here. It also has advantages in terms of implementation, because a program can obtain all coefficients from the same place without having to scrub the variable to find them all.

Just to provide more details about this part of the discussion, there were two questions:

  • whether we need both an :interpolation_coefficients and an :interpolation_configuration attribute: since they both use the formula_terms syntax they could be merged into a single :interpolation_configuration attribute. The terms in :interpolation_coefficients would only refer to variables containing numerical values, whereas the ones in :interpolation_configuration could also refer to variables that contain flags (or even text). In the end it is mostly a matter of preference (a minimal parsing sketch follows this list).

  • whether we should allow single numbers to be provided directly in :interpolation_configuration instead of using a reference to a variable which contains the number: we agreed that using a variable may seem overkill but is the best choice, because the :interpolation_configuration value is a string and using it to pass numbers would involve more parsing and more validation, whereas a variable already has mechanisms to make sure that the value is a number of the expected type.
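
A minimal sketch of the parsing implied by such a formula_terms-style attribute (the term and variable names are invented, and the simple case of one variable per term is assumed):

    def parse_formula_terms(value):
        """Turn a string such as "ce: ce_coefs ca: ca_coefs flags: interp_flags" into a dict."""
        tokens = value.split()
        # Tokens alternate between "term:" and the referenced variable name.
        return {term.rstrip(":"): var for term, var in zip(tokens[0::2], tokens[1::2])}

    mapping = parse_formula_terms("ce: ce_coefs ca: ca_coefs flags: interp_flags")
    # {'ce': 'ce_coefs', 'ca': 'ca_coefs', 'flags': 'interp_flags'}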

@AndersMS

AndersMS commented Dec 2, 2020

Hi @oceandatalab

Thank you for clarifying, I agree with your text.

Hi All,

Just to add to the minutes that interpolation of tie point bounds has been added to the concept. The fact that we remove the tie point offset has made it possible to add the bounds in a simple and elegant way.

@AndersMS

AndersMS commented Dec 15, 2020

Minutes of the 19th discussion on subsampled coordinates in CF - 15. December 2020

Present: @davidhassell @oceandatalab (Sylvain and Lucile) @ajelenak @AndersMS

@AndersMS had prepared the first draft elements of Appendix J, intended to document the details of the standard interpolation methods. It is available here as appj.adoc for review. As not everybody had had a chance to review it, @AndersMS provided a brief overview of the appendix, and it was agreed that everybody would review it before the next meeting in January 2021. Anders will notify everybody when a stable version is available for review.
In response to immediate comments, it was agreed to move the section describing the methods to the front of the appendix and the two tables describing the process of encoding and decoding to the end.

A number of detailed topics were brought up, helping to consolidate the overall approach:

The topic of missing coordinate data values was discussed. Missing data is addressed here in the CF conventions document. It was agreed to follow the principle that if any data required to reconstitute an (auxiliary) coordinate variable is marked as missing, then that coordinate variable will be marked as missing for the whole interpolation zone. (Post meeting: note that netCDF does not permit coordinate variables with missing data.)

It was agreed that all coordinate variables that are to be interpolated by the same interpolation variable, as defined in the tie_points attribute, must have identical ordering of their dimensions.

It was agreed that in the general case, no mapping between the interpolation dimensions and the internal dimensions of the interpolation method is provided, requiring the interpolation methods to 'be symmetric' and provide identical results for any ordering of the interpolation dimensions. Should there be a future need for interpolation methods not meeting this requirement, the specific method must state a particular ordering of the dimensions as a constraint on its usage.

Based on a review of Example 8.5 for VIIRS, it was noted that no mapping between the data variable interpolation dimension m_scan and the time tie point interpolation dimension time_scan is given. To address this, it was agreed to permit the tie_point_dimensions attribute to define multiple mappings for each data variable interpolation dimension, as in:

m_radiance:tie_point_dimensions = "m_track: tp_track zone_track m_scan: tp_scan zone_scan m_scan: time_scan"

where the correct mapping for m_scan would be clear from the context. This would permit having tie points at different resolutions, all mapping to the same target domain.

Actions from this time
@AndersMS to notify everyone when a stable version of the Appendix J is available
@everybody to review and comment on Appendix J before or at the next meeting in January 2021

@AndersMS

Dear All,

When updating the Example 8.5, I came to the conclusion that we need to permit multiple mappings for the same interpolation dimensions both for the tie_point_dimensions and the tie_point_indices attributes, like in

    m_radiance:tie_point_dimensions = "m_track: tp_track  zone_track  m_scan: tp_scan  zone_scan  m_scan: tp_time_scan"  ;
    m_radiance:tie_point_indices = "m_track: m_track_indices  m_scan:  m_scan_indices  m_scan: m_time_scan_indices" ;

Please let me know if you agree.

@erget
Member

erget commented Jan 12, 2021

Minutes of the 20th discussion on subsampled coordinates in CF

Present: @davidhassell @oceandatalab (Sylvain) @ajelenak @AndersMS @erget

Actions from last time

None.

State of play & way forward

Drafts for Appendix and chapter 8 are underway, conformance document changes are yet to be proposed. Chapter 8 will be ready for review within the next few days; all present have time and are willing to review before the next meeting in 2 weeks. The appendix and conformance document will be reviewed subsequently.

No input has been received from @TomLav regarding supersampling, so we will drop that approach from the proposal. This makes the text simpler to propose and to review. Supersampling could always be re-introduced as a future enhancement if needed.

@davidhassell and @ajelenak will collaborate on the drafting of the updates to the conformance document.

It was agreed not to include MODIS examples, as these are in HDF4 and thus not straightforward to port to netCDF (and hence CDL), and the format is not expected to change during the MODIS mission.

@AndersMS provided an explanation concerning the interpolation accuracy being a function of the number of tie point zones. It was agreed to provide a short text explaining that this is a decision on the part of the data producer.

@AndersMS

Minutes of the 21st discussion on subsampled coordinates in CF, 26 January 2021

Present:

@davidhassell
@oceandatalab (Sylvain)
@ajelenak
@AndersMS

The meeting had two agenda points:

  • Team review of the consolidated text prepared for inclusion as section 8.3 in the CF Conventions document.
  • Introduction to draft Appendix J under preparation to complement section 8.3 with descriptions of actual interpolation methods

Team review of the consolidated text prepared for inclusion as section 8.3 in the CF Conventions document.

The comments made in the Review Pull Request erget/cf-conventions#1 were discussed. The comments were generally agreed, with two points requiring clarification, see actions 1 and 2 below. It was confirmed that the proposed glossary would be in part additions to the CF Conventions section 1.3. Terminology and in part additions to Appendix A: Attributes.

The Review Pull Request remains open until next meeting in two weeks for further review and comments.

Introduction to draft Appendix J under preparation to complement section 8.3 with descriptions of actual interpolation methods

@AndersMS presented the current version of Appendix J, contained in the Review Pull Request referenced above. It was agreed that @AndersMS would notify the team in the next days when the draft is ready for a first review. It would then be discussed in more detail at the next meeting.

List of Actions:

  • 1 @davidhassell 2021-01-26: Discuss with the CF Conventions Committee the best way to manage line breaks and line length in AsciiDoc.
  • 2 @erget 2021-01-26: Explain in more detail the comment "Might this also be considered a requirement for interpolation variable?" made in the Review PR.
  • 3 @AndersMS 2021-01-26: Share articles on interpolation with team

@erget
Member

erget commented Jan 27, 2021

@AndersMS thanks for these minutes! I've resolved the comment referenced in action (2).

@erget
Member

erget commented Feb 23, 2021

Minutes of the 22nd discussion on subsampled coordinates in CF

Present: @davidhassell @oceandatalab (Sylvain) @ajelenak @AndersMS @erget

The progress on the Appendix was reviewed. We are nearing convergence on the document and can address outstanding (minor) issues at the next meeting. In the meantime work on the Conformance document will begin. All will review the updated Appendix draft being prepared by @AndersMS.

  • 1 @davidhassell 2021-03-09: Begin work on Conformance document.
  • 2 @AndersMS 2021-03-09: Continue Appendix updates

@AndersMS

AndersMS commented Mar 9, 2021

Minutes of the 23rd discussion on subsampled coordinates in CF, 9 March 2021

Present:

@davidhassell
@oceandatalab (Sylvain)
@ajelenak
@AndersMS

The meeting had three agenda points, all relating to erget/cf-conventions#1:

Review of improvements to the method descriptions in Appendix J introduced by @AndersMS
The improvements were reviewed and agreed, including the new figures illustrating the interpolation parameters for the quadratic_geo and bi_quadratic_geo methods. It was proposed to change the names quadratic_geo and bi_quadratic_geo to quadratic_remote_sensing and bi_quadratic_remote_sensing. It was agreed to refer to the alignment and expansion coefficients for storage in the product as the parametric representation of the coefficients.

Review of proposed conformance rules as distributed by @davidhassell
The conformance rules were introduced by @davidhassell and agreed. A proposed additional rule distributed via email prior to the meeting would be added. It was found that the rules would help identify incorrect implementations of the compression by coordinate subsampling, although not all aspects can be checked.

Identification of completed and outstanding tasks

Completed tasks:

  • Section on Lossy Compression by Coordinate Sampling for inclusion in chapter 8 in the CF Conventions document;
  • Mention of new section in chapter 8 introduction
  • First complete version of Appendix J on interpolation methods (improvements ongoing)
  • Inclusion of key attributes names in Appendix A
  • Preparation of section on Lossy Compression by Coordinate Sampling in CF Conformance Requirements and Recommendations

Outstanding tasks:

  • Further improvements to Appendix J on interpolation methods
  • Inclusion of figures, tables and examples in appropriate lists
  • Update of document history section
  • Inclusion of key terms in 1.3. Terminology

@AndersMS

AndersMS commented Apr 9, 2021

Minutes of the 25th discussion on subsampled coordinates in CF, 9 April 2021

Present:

@davidhassell
@oceandatalab (Sylvain)
@ajelenak
@AndersMS

The meeting had four agenda points:

Work is progressing with the verification of the Biquadratic Interpolation of geographic coordinates method: coding and checking the accuracy of the reconstituted coordinates. Results will be presented at the next meeting (@AndersMS)

Issue identified with sharing of tie points between grids of different spatial resolution, see description:
There was general preference for resolving the issue with the proposed option 2. @davidhassell would update Appendix A and the Conformance document accordingly. @ajelenak would update Chapter 8 and @AndersMS would update Appendix J.

Standard name for reflectance, see description:
There was uncertainty about the use of toa_bidirectional_reflectance for the VIIRS reflectance. @ajelenak would investigate. @AndersMS would ask the relevant expert at EUMETSAT, Jörg Ackermann.

Figure 3 in Chapter 8:
The figure will be extended to also cover the missing 4th case of (y_tie_point_interpolation, x_tie_point_interpolation) dimensions of the interpolation parameter variable.
The graphical quality of the figure will be improved to make the points circular. (@AndersMS)

@erget
Member

erget commented Apr 20, 2021

Minutes of 26th discussion on subsampled coordinates

Present: @davidhassell @ajelenak @oceandatalab @AndersMS @erget

Items from last meeting

  • Accuracy check of reconstituted coordinates @AndersMS
    • Results look pretty good, but there seems to be an error in (we hope) the code, which produces high errors in the first interpolation zone of an interpolation area. Otherwise the error is generally <<5 m. This is under investigation; we hypothesize that this is a bug.
    • Some discussion about omissions of middle points in interpolation zones. This will be concluded by experiment and will not have a major impact on the design or the schedule with which we propose the changes to the CF Community.
  • Updates to Appendix A @davidhassell - a new variable type could be considered prudent. We will leave this as it is so that we don't have to over-complicate the table; it might be requested by one of the reviewers. The Conformance document was also slightly rearranged to accommodate the new variables.
  • Updates to chapter 8 @ajelenak - reviewed, no issues or inconsistencies found. OK to go.
  • toa_bidirectional_reflectance @ajelenak @AndersMS
    • @ajelenak - "bidirectional" indicates that this is not an appropriate standard name, because this implies non-Lambertianness - we are looking at something more like albedo. However, we are not forced to use a standard name. This is something to consider implementing going forward as more remote sensing products come into CF. Agreement: For example we use long_name: reflectance in examples. Standard names are separate issue.

Addressed in this meeting

@AndersMS proposes restructuring chapter 8 to introduce tie point dimensions/indices terms at more appropriate points in the text. Revisions agreed.

Pending for next time

  • Bug analysis of interpolation errors
  • Experiment on omitted centre points
  • Updates to Appendix J @AndersMS
  • Figure 3 extension for 4th case @AndersMS
  • Profiling results to be reviewed at next session - it seems that 3d interpolation does not produce the expected performance hit, which is very intriguing

Intent: Next session send off first draft to CF Community!

@AndersMS

Dear All,

Just to update you on the bug analysis of interpolation errors discussed yesterday.

The problem was related to the handling of interpolation zones with an odd number of coordinate points. After changing the formula in Appendix J for Biquadratic Interpolation of geographic coordinates from
cv_z = fcv(vac, vbd, vz, 0.5)
to
cv_z = fcv(vac, vbd, vz, s(ia1, ib1, i1))
things now work correctly for interpolation zones with both odd and even numbers of coordinate points.
Anders

@AndersMS

AndersMS commented Apr 21, 2021

Dear All,

Just to update you on the Profiling results discussed yesterday.

It turned out that the code spent excessive time on memory allocation. After resolving this, the code now executes in

  • 15 ms when interpolating directly in latitude longitude
  • 90 ms when interpolating in cartesian coordinates

for a single VIIRS M-Band granule, when excluding all file read and write operations from the benchmarking.

So it is certainly worth keeping both methods.

A VIIRS M-Band granule contains around 85 s of observations or 2,457,600 coordinate points, so the execution time is quite impressive.

The VIIRS M-Band granule used for the benchmark contained observations between 26 and 36 degrees North. The mean position error in the reconstituted data was

  • 1.32 m when interpolating directly in latitude longitude
  • 1.34 m when interpolating in cartesian coordinates

which is very promising. I will do some further benchmarking for a data set covering the polar region.

Anders

@davidhassell
Collaborator

Hi @AndersMS,

Great news on your bug finding. I agree that a factor of 6 makes it worth keeping both methods (I can't remember - I need to check in appendix J - whether it's now clear that you can bypass some of the calculations if interpolating directly in latitude longitude. Have I got that right?).

I presume that an error of ~1m is well within the acceptable tolerance?

David

@AndersMS

Hi @davidhassell

Yes, the algorithm in Appendix J has branches for each of the two cases. Interpolating directly in latitude longitude requires no trigonometric function calls, apart from what is needed to initially convert the interpolation coefficients. That's the main reason for the performance difference, as the trigonometric function calls are expensive.

The ~1 m is well within the acceptable tolerance for applications of this type of data. For comparison, the pixel size is around 750 m x 750 m at the sub-satellite point, and the interpolation zone varies from around 12 km x 12 km at the sub-satellite point to around 25 km x 25 km at the swath edge.

Another reference point is the fact that latitude and longitude are stored as Float32 in both the original and the compact product files. Float32 has a 24-bit significand. At the scale of the Earth radius of 6,371 km, that gives a precision of about 0.4 m, so we are quite close to that limit. As an experiment, we could see what would happen if we did the compression and expansion calculations in Float64, still with the latitude and longitude stored as Float32 in both the original and the compact product files. That could bring a small improvement, but execution would likely be slower.

Anders
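
As a quick cross-check of the Float32 figure quoted above, a one-line sketch using NumPy's spacing of the nearest representable value:

    import numpy as np

    # Spacing between adjacent Float32 values at the scale of the Earth radius:
    earth_radius_m = np.float32(6371.0e3)
    print(np.spacing(earth_radius_m))   # 0.5 m, the same order as the ~0.4 m quoted above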

@erget
Member

erget commented Apr 27, 2021

Minutes of 27th discussion on subsampled coordinates

Present: @davidhassell @ajelenak @oceandatalab @AndersMS @erget
(preliminary)

Items from last meeting

  • Bug analysis of interpolation errors - the 1st interpolation zone was showing errors of ~500 m. This was due to interpolating in a zone with an odd number of coordinates in the target coordinate system. @AndersMS fixed this.
  • Experiment on omitted centre points - this looks too complicated to pursue; it is cleaner to leave it as-is for now, and we could pursue it if the need arises.
  • Updates to Appendix J @AndersMS - update provided, this is almost done.
  • Figure 3 extension for 4th case @AndersMS - figure looks OK for initial review.
  • Profiling results to be reviewed at next session - it seems that 3d interpolation does not produce the expected performance hit, which is very intriguing

Intent: Next session send off first draft to CF Community!

Addressed in this meeting

@davidhassell noted that the 1.9 release is impending.
@erget proposes moving forward with putting a draft before the community, then making the minor corrections during the review process.

Pending for next time

  • @davidhassell will lead putting together the proposal
  • @AndersMS make final adjustments for Appendix J after proposal has been put forward

@erget
Member

erget commented May 4, 2021

Minutes of 28th discussion on subsampled coordinates

Present: @davidhassell @AndersMS @erget @ajelenak

Items from last meeting

  • @davidhassell will lead putting together the proposal
  • @AndersMS make final adjustments for Appendix J after proposal has been put forward

Addressed in this meeting

  • @AndersMS showed some slides giving an overview of the state of the appendix. When updating the approach in order to change the order of interpolation along axes, a bug was discovered that has now been fixed. This precipitates a change to the text, which should take around half a day of work.
  • Small accuracy errors in the reconstruction of coordinates were discovered that might be improved by interpolating in double rather than single precision. @AndersMS is investigating the effects on performance and accuracy when using this data type. This dovetails well with ongoing work @ajelenak is doing with NASA.
  • @AndersMS also showed how the different interpolation algorithms (latitude-longitude vs. Cartesian ECEF) present significant trade-offs; Cartesian interpolation performs very well around the poles but is slower, while latitude-longitude interpolation can be used for >95% of the Earth. In practice this will be controlled by the interpolation parameters.

Pending for next time

@AndersMS

AndersMS commented May 4, 2021

Dear All,

I have updated the verification software to do all internal calculations in 64 bit floating point rather than 32 bit floating point, while maintaining all lat/lon in the netCDF files as 32 bit floating point.

The change made the structure and the error outliers that we saw during the meeting today disappear; see the error plot below. The errors are now in the range 0.22 m - 2.45 m, with an average value of 0.847 m.

The software execution time only went up marginally, on the order of 10%.

Anders

[error plot]

@davidhassell
Collaborator

Hi @AndersMS, that looks like good confirmation of what we hoped! Thanks.

@AndersMS

AndersMS commented May 4, 2021

Hi @davidhassell ,

Yes, I agree :-)

I would suggest that we specify in Appendix J that internal calculations must be carried out in 64 bit floating point, either generally or stated for each interpolation method where it would be required.

What do you think?

Anders

@davidhassell
Collaborator

Hi @AndersMS,

I'd be OK with stating that (i.e. internal calculations must be carried out in 64 bit floating point) in appendix J. I think a general statement at the top would suffice.

Since the precision of arithmetic is demonstrably of importance, and we explicitly say that the user should be able to recreate the coordinates as intended by the creator, we really have to say something about this precision.

An interpolation variable attribute-based solution that satisfactorily provides this so that the creator can set it to one of a number of possible values is nice in principle, but A) I don't think there is a use case for this flexibility and B) any such method I can think of has at least one drawback. So that avenue is probably best left alone at this time.

Thanks,
David

@AndersMS

AndersMS commented May 5, 2021

Hi @erget and @davidhassell ,

I have added the sentence:

All floating point calculations defined in this appendix must be carried out in 64 bit floating point, even if the related coordinates and interpolation parameters are stored in the netCDF files with a lower precision.

in the first section of Appendix J.

I fully agree with @erget that even the precision we obtain with internal calculations in 32 bit floating point is well above what is required for practical purposes. However, to meet the aim stated in the first section of chapter 8:

The metadata that define the interpolation formula and its inputs are complete, so that the results of the coordinate reconstitution process are well defined and of a predictable accuracy.

it is correct to have the requirement for 64 bit floating point.

On a related topic, after the meeting yesterday I exercised the algorithm with different sizes of interpolation zone. I have normally used 16x16 coordinate points, but now tried 4x4, 8x8, 16x32 and 16x64 as well. With internal calculations in 32 bit floating point, 4x4 and 8x8 failed with a division by zero due to rounding to 32 bit floating point. However, with internal calculations in 64 bit floating point it worked correctly and with high accuracy. Nevertheless, I will consider whether we can improve this behaviour: degraded accuracy might be acceptable, division by zero is not.

Cheers

Anders
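
The failure mode described above can be illustrated in isolation (this is not the Appendix J formula, just the rounding effect: two values that differ in Float64 can collapse to the same Float32, turning a small denominator into zero):

    import numpy as np

    a64, b64 = 1.00000001, 1.0
    print(a64 - b64)                          # ~1e-08 in Float64: safe to divide by
    a32, b32 = np.float32(a64), np.float32(b64)
    print(a32 - b32)                          # 0.0 in Float32: division by zero downstream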

@taylor13

taylor13 commented May 5, 2021

I'm not comfortable that CF should specify the precision of stored data or the precision used in manipulating that data. Isn't that a scientific judgement that depends on the application? In particular, would 64 bit arithmetic be sufficient under all circumstances? I confess I haven't been following the details of this proposal, but if the original data is 32 bit and rounding of some operation leads to a 0 difference, isn't 0 the correct answer (within the precision of the data)? If 0 is subsequently used as the divisor in an operation, rather than some (arbitrarily) small number yielded by a 64 bit operation, then I think there must be a flaw in the algorithm.
If this is completely off base (because of my ignorance of what is being discussed), please say so, and move on.

@davidhassell
Collaborator

davidhassell commented May 5, 2021

Hi Karl,

Not off base at all - your comment reflects some of the conversations we have been having off line, and it's good to air them here.

Where I am (and this may not be the view of others) is that, given that the compression is lossy, the extent of the loss should be able to be determined by the data creator, and that can only be the case if the precision of the arithmetic used in the interpolation algorithm is somehow prescribed. I see this as a similar situation to packing, where the accuracy is determined by the data type of the stored values and the scale_factor and add_offset attributes.

Ah! Writing about the data type packing attributes reminds me that I said an interpolation variable attribute approach to defining the precision would be ideal, but had sufficient drawbacks (hence my endorsement of the general statement on precision). However, I hadn't thought of using the data type of such an attribute's numeric value to set it, just as the data type of scale_factor is also used. This approach, I think, would remove the drawbacks. It is also machine readable, so library code could easily apply the correct decompression without human intervention.

In this case, if the precision is left undefined, the user uses what they like. If the creator has said "use double precision", the user should do so.

A precision attribute whose value was zero in the required data type could do it:

// interpolation variable
// With 'precision' attribute, so the creator is saying "you should use the precision I set"
char interp ;
    interp:interpolation_name = "bi_linear" ;
    interp:precision = 0D ;  // you should use double precision when uncompressing

// interpolation variable
// No 'precision' attribute, so the creator is saying "you can use whatever precision you like"
char interp ;
    interp:interpolation_name = "bi_linear" ;

@erget
Member

erget commented Jul 6, 2021

This issue has now resulted in a full proposal that is being pursued in cf-convention/cf-conventions#327

@erget
Member

erget commented Jul 6, 2021

@oceandatalab would you object to closing this issue in favour of cf-convention/cf-conventions#327 in order to avoid confusion?

@oceandatalab
Author

No objection, the discussion continues in cf-convention/cf-conventions#327
