Clarification of weighting in cell_methods #447

Open
JonathanGregory opened this issue Aug 22, 2023 · 7 comments
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@JonathanGregory
Contributor

JonathanGregory commented Aug 22, 2023

Initiated:

2022-11-22 by @taylor13 in #414, 1st point referred to in Karl's summary extracted into this issue by @JonathanGregory on 22 Aug 2023

Moderator:

@bnlawrence

Moderator Status Review [last updated: YYYY-MM-DD]

Requirement Summary

Provide metadata about weighting for cell methods

Technical Proposal Summary

Introduce default interpretations on how weighting of means and other statistics should be applied. Existing datasets may have relied on the vagueness of the convention (to this point) in accommodating a different weighting (i.e., no default specification of the weighting).

Benefits:

Those writing and reading CF-compliant data will have clearer guidance and more definitive rules for interpreting the cell_methods.

Associated pull request:

None yet.

Detailed Proposal

After the first two paragraphs of Sect 7.3, @taylor13 suggests inserting a paragraph while not modifying the paragraph immediately following:
[Image: screenshot dated 2022-11-22 showing the proposed paragraph; not reproduced here]
The inserted paragraph explains how by default the grid-cell values have been computed from the contributing samples. This greatly reduces the need to include the so-called "non-standardized information" regarding the cell_methods.

In the 3rd paragraph of Sect 7.3.1, @taylor13 proposes to define default weighting for 2-d (area) means and 3-d means, which would also apply to other statistics involving sums:

Without this default specification of weighting, data writers would have to provide parenthetical non-standardized information for most of the variables they write.

@JonathanGregory added the enhancement label Aug 22, 2023
@JonathanGregory
Contributor Author

Copied from #414 (comment)

Dear Karl @taylor13

At the moment, as you say, there is no information in cell_methods about what weighting should be assumed. Weighting is certainly an aspect of the statistical computations which are described by cell_methods, and it makes sense to include it. If we define default interpretations, as you suggest, new data and old data with the same cell_methods would have different interpretations, since the old data has undefined weights, but the new data has defined weights. Although this is not strictly a backwards incompatibility, it is a potential pitfall for data-users of the kind that principle 9 in Sect 1.2 says we should avoid:

  9. Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

Therefore, rather than defining defaults, I think we should introduce new syntax for indicating the weights explicitly. If no weighting was indicated, it would mean the same as now i.e. undefined. The syntax could be e.g. "name: method weighted_by keyword", or perhaps just by instead of weighted_by, where keyword would be chosen from a new list of possibilities, such as extent for the simple weighting by the size of the cell calculated as the difference between its bounds, unity if all cells have the same weight, or mass for mass-weighting.
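
As a rough illustration of the syntax suggested above, the following sketch parses a single cell_methods entry of the form "name: method weighted_by keyword". Nothing in it is part of CF: the weighted_by clause and the keywords extent, unity and mass come from this proposal, while the function name parse_weighted_entry and the regular expression are invented for illustration. It handles one entry only, with no where/over clauses and no parenthetical comment.

```python
import re

# Hypothetical keyword list taken from the suggestion above; not part of CF.
WEIGHT_KEYWORDS = {"extent", "unity", "mass"}

# One entry: one or more "name:" tokens, a method, then an optional
# "weighted_by keyword" clause (the proposed, not yet adopted, syntax).
ENTRY = re.compile(
    r"^\s*(?P<names>(?:\w+:\s+)+)(?P<method>\w+)"
    r"(?:\s+weighted_by\s+(?P<weight>\w+))?\s*$"
)

def parse_weighted_entry(entry):
    """Parse one simplified cell_methods entry into names, method and weight."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError(f"cannot parse cell_methods entry: {entry!r}")
    names = [name.rstrip(":") for name in match.group("names").split()]
    weight = match.group("weight")  # None means weighting is undefined, as now
    if weight is not None and weight not in WEIGHT_KEYWORDS:
        raise ValueError(f"unknown weighting keyword: {weight!r}")
    return {"names": names, "method": match.group("method"), "weight": weight}

with_weight = parse_weighted_entry("time: mean weighted_by extent")
without_weight = parse_weighted_entry("lat: lon: mean")
```

An entry without the clause parses with weight set to None, which mirrors the point that existing metadata would keep its current (undefined) interpretation.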

Best wishes

Jonathan

@JonathanGregory
Contributor Author

Copied from #414 (comment) by @taylor13

Thanks, Jonathan, for your input on how weighting might be included without violating principle 9. We would want to consider whether to include it within the parentheses (the way we include "interval:") or whether it would follow directly the "where" directive. Also, we need to think about what "key words" would be needed and the procedure for expanding the list if need be (e.g., "weighted_by mass" might not be specific enough; might need "weighted_by mass_of_snow", or "weighted_by mass_of_seaice", etc.)

You are right that it is clearly specifying the weights that is highest priority.

@JonathanGregory
Contributor Author

Dear Karl

I'm glad you sound comfortable with the suggestion of weighted_by. Actually I'd like to omit the underscore. Since we have within days etc. without underscore, for consistency weighted by would be preferable. It looks more natural too.

My preference is not to put this clause in parentheses. I'd put it as the last clause before the parenthesis (if there is one).

Regarding the definition of possible keywords, I suppose it depends on how many might be needed. If a very small number, they could be defined in sect 7. If a larger but fairly small number, they could be in a new appendix. If a large number, they could be given in a separate document, like the area_type table. The first two require a change to the convention for amendment, the latter is a controlled vocabulary and easier to amend. We have discussed two which do not refer to any quantity except a metric of the cell: extent and unity. I expect we might also want area and volume (for methods that affect both horizontal dimensions together, or all three spatial dimensions together). Four is a fairly small number, I'd say. What others might be needed for CMIP7, do you think?

For mass-weighting, a keyword is not adequate. We should probably indicate what substance's mass is being referred to. It might be obvious that we mean air for a vertical coordinate of air_pressure, but height doesn't mention air. If the height coordinate is negative, the weighting might refer to soil or sea water. To be precise, one possibility would be to use a standard name in the cell_methods e.g. weighted by air_density. For mass-weighting, the standard name must identify a quantity which gives mass (possibly intensive in other dimensions and multiplied by constants) when multiplied by the thickness of the cell in the coordinate concerned, like air_density multiplied by height, which gives kg m-2. For air pressure, mass-weighting is weighted by unity, because the thickness of the cell is in Pa = kg m-1 s-2 = kg m-2 * g.
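
The units argument above can be checked numerically. This is a minimal sketch, assuming hydrostatic balance (so that the pressure thickness of a layer is g times its mass per unit area) and using made-up layer values; the variable names and the helper weighted_mean are illustrative only.

```python
G = 9.80665  # standard gravity in m s-2

def weighted_mean(values, weights):
    """Weighted mean of values with the given (non-normalised) weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Made-up column of three layers; the numbers are illustrative only.
thickness_m = [500.0, 1000.0, 2000.0]   # layer thickness in the height coordinate
air_density = [1.2, 1.0, 0.7]           # kg m-3
values = [280.0, 270.0, 250.0]          # quantity being averaged

# On a height coordinate: mass per unit area = air_density * thickness (kg m-2).
mass_per_area = [rho * dz for rho, dz in zip(air_density, thickness_m)]

# On a pressure coordinate, assuming hydrostatic balance: thickness in Pa = g * mass per area.
pressure_thickness = [G * m for m in mass_per_area]

mean_by_mass = weighted_mean(values, mass_per_area)           # height axis, weighted by air_density
mean_by_pressure = weighted_mean(values, pressure_thickness)  # pressure axis, cell thickness only
mean_by_extent = weighted_mean(values, thickness_m)           # height axis, cell thickness only
```

The constant factor g cancels in the weighted mean, so mean_by_mass and mean_by_pressure agree, whereas weighting the height coordinate by cell thickness alone gives a different value.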

You made a proposal for area-weighting, that it should be applied by default if area is stated. That changes the meaning of area, which was introduced to indicate a method that applies to both horizontal dimensions without saying what they mean. For an area-weighted mean, for instance, I would instead suggest area: mean weighted by area. That way area: mean alone won't change its meaning, of a mean that applies to an area, with weighting undefined.
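
To show why the distinction matters, here is a small sketch comparing an unweighted mean with an area-weighted mean over three latitude bands on a sphere. The band values are invented, the helper names are hypothetical, and the weights use the standard result that the area of a latitude band is proportional to the difference of the sines of its bounds.

```python
import math

def band_area_weights(lat_bounds_deg):
    """Relative area of each latitude band on a sphere: sin(north) - sin(south)."""
    return [
        math.sin(math.radians(north)) - math.sin(math.radians(south))
        for south, north in lat_bounds_deg
    ]

def mean(values, weights=None):
    """Mean of values; equal weights are used when none are given."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Three bands covering one hemisphere; values are illustrative only.
lat_bounds = [(0.0, 30.0), (30.0, 60.0), (60.0, 90.0)]
values = [300.0, 280.0, 250.0]

weights = band_area_weights(lat_bounds)
unweighted = mean(values)              # every cell counts equally
area_weighted = mean(values, weights)  # what "weighted by area" would state explicitly
```

With area: mean alone, a reader cannot tell which of the two numbers was stored; the suggested area: mean weighted by area would identify the second.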

Best wishes

Jonathan

@larsbarring
Contributor

I am still trying to wrap my head around these matters. But one thing that strikes me with these additions is that the content of the cell_methods attribute is more and more turning into a formal language that is not only understandable by humans, but also can (and will) be parsed by software. Hence, I think that omitting the underscore, as @JonathanGregory suggests, is less helpful:

... suggestion of weighted_by. Actually I'd like to omit the underscore. Since we have within days etc. without underscore, for consistency weighted by would be preferable. It looks more natural too.

In the construction weighted_by X, "weighted_by" is a set phrase that is followed by a word X from a controlled vocabulary. This is indeed similar to within days, where "within" plays the same role as "weighted_by" and "days" plays the same role as "X" (currently: X ∈ {days, months, years}). If the underscore is dropped, this correspondence is lost and, essentially, there will be an independent "by" that can be understood only when/if directly related to the preceding word "weighted". Moreover, CF has since the beginning used the space as a delimiter between different entities, with the underscore used to indicate a space in the natural language equivalent representation of such entities. Hence, I suggest that the underscore is used, i.e. weighted_by.
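
The parsing point can be made concrete with a short sketch. It assumes software splits an entry on whitespace and looks each token up in a keyword set; the set below mixes keywords already used in cell_methods (within, over, where) with the hypothetical weighted_by, and the function name is invented.

```python
# "weighted_by" is the proposed keyword, not part of CF; the others already exist.
KEYWORDS = {"within", "over", "where", "weighted_by"}

def keyword_positions(entry, keywords=KEYWORDS):
    """Return the indices of whitespace-delimited tokens that are keywords."""
    return [i for i, token in enumerate(entry.split()) if token in keywords]

with_underscore = "height: mean weighted_by air_density"
without_underscore = "height: mean weighted by air_density"

tokens_with = with_underscore.split()        # "weighted_by" is one token
tokens_without = without_underscore.split()  # "weighted" and "by" are separate tokens

found_with = keyword_positions(with_underscore)
found_without = keyword_positions(without_underscore)
```

With the underscore, a single-token lookup finds the clause; without it, the same lookup finds nothing and the parser would need to examine pairs of adjacent tokens.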

@JonathanGregory
Contributor Author

Dear @larsbarring et al.

Perhaps it would be better and simpler not to make the syntax look so much like English. Instead of weighted_by we could just have weight or by e.g. height: mean by air_density. I wonder what you think, Karl @taylor13?

Cheers

Jonathan

@davidhassell
Contributor

Hello,

It might be useful to refer to cf-convention/discuss#173, which discusses, amongst other things, how to represent weights in cell methods, and what the existing defaults are (e.g. currently it is unspecified whether or not area-weighting was applied for area: mean, but area: mean where sea_ice is assumed to be area weighted).

My interpretation of the original text is that area: mean and lat: lon: mean are wholly equivalent, so I don't agree with bestowing them with different defaults for the weights used in the calculation.

I prefer the more explicit weighted_by <standard_name> option. I feel that "by" on its own is somewhat ambiguous. It reminds me of "pi by two", for which the "by" has a different meaning than intended here. With the new keyword, there is no need to change the existing (confusing, to me!) defaults, which means no danger of misinterpreting existing datasets.

Thanks,
David

@JonathanGregory
Contributor Author

I agree with @davidhassell on this:

My interpretation of the original text is that area: mean and lat: lon: mean are wholly equivalent, so I don't agree with bestowing them with different defaults for the weights used in the calculation.

I prefer the more explicit weighted_by <standard_name> option.

OK. How about just weight e.g. height: mean weight air_density? Given the previous discussion, I feel it's better to avoid having a keyword with an underscore, because the underscore might be forgotten through confusion with within years etc.
