Naming Conventions and periods (.) #256

Closed
sval-dev opened this issue Aug 1, 2023 · 13 comments
Labels
question Further information is requested or discussion invited

Comments

@sval-dev

sval-dev commented Aug 1, 2023

In section 2.3, "Naming Conventions" from CF Conventions 1.10, it states:

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows use of the hyphen character.

However, not mentioned above, is that the netCDF interface also allows the use of the period character (.) (as well as some others not mentioned by CF Conventions)

For our data products, we think the period is more meaningful than the alternative underscore representations.

In particular consider the following two variable names:
/PM_2.5_Total
/PM_10_Total

The alternative formulation of /PM_2_5_Total, in consultation with our user community, has been evaluated as less clear than the first construction and was not preferred.

Is there some reason why a period is not allowed in either CF Conventions or COARDS, from which CF Conventions are derived?

What obstacles might we face if we tried to get in a change request to allow periods in the naming conventions of the CF Conventions?

@sval-dev sval-dev added the question Further information is requested or discussion invited label Aug 1, 2023
@taylor13

taylor13 commented Aug 1, 2023

Many users who work with CF-compliant data like to adopt the variable names found in the netCDF files as the names for their variables in their computer codes (e.g., if specific humidity has the name "hus" in the file, they might use that in their codes when manipulating that variable). Some languages forbid the use of a period in variable names, so this would be one argument against allowing periods in the variable names of CF-compliant files (e.g., if specific humidity were allowed to be named "spec.hum", that name could not be used in their codes).
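
As an illustration of this point, here is a minimal sketch in Python (chosen only as an example language) that checks whether a netCDF variable name could be reused verbatim as a program variable. The names "hus" and "spec.hum" are the ones used above; `usable_as_identifier` is a hypothetical helper, not part of any netCDF or CF tool.

```python
import keyword


def usable_as_identifier(name: str) -> bool:
    """Return True if `name` could be used verbatim as a Python variable name."""
    # str.isidentifier() applies Python's identifier grammar;
    # reserved words such as "class" pass that test but still cannot be assigned to.
    return name.isidentifier() and not keyword.iskeyword(name)


for name in ("hus", "spec.hum", "PM_2.5_Total", "PM_2_5_Total"):
    print(f"{name!r}: {usable_as_identifier(name)}")
```

Names containing a period fail the check, which is the situation described above for languages that forbid periods in identifiers.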

@sval-dev
Author

sval-dev commented Aug 2, 2023

Thanks @taylor13!

That is an argument that hadn't come to mind, and may be why the dash (-) is also excluded though it is allowed in netCDF.

As a particular example, Python has the same naming conventions as those adopted by the CF community.

Any thoughts on whether adding periods to the allowed characters in identifiers has any chance of being accepted, given the above argument or any others not yet accounted for?

@larsbarring

This reminds me of a very similar discussion in cf-conventions/#237. In addition to @taylor13's argument against allowing . and some other characters, there were a couple of counter-arguments here and here.

In my 'previous life' as an avid Matlab user I, too, was parsing netCDF variable names into program variables, and had to invent my own rules for illegal characters like . and - (and more, as I was not confined to CF, netCDF or CMIP/CORDEX data). This was something of a hurdle, but manageable.

I think that the question is whether CF should maintain the current restriction, which in practice means that data producers will have to translate reasonably natural names to conform with CF restrictions (as indicated in the initial post), or if CF should relax the current restriction in allowed characters to align with netCDF rules.

Finally, as was concluded in cf-conventions/#237, it would be useful to clarify what a letter is in the sentence

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

Something for this year's workshop, ping @ethanrd, @davidhassell ?

@sval-dev
Author

sval-dev commented Aug 3, 2023

Thanks for that link to the earlier discussion!
That discussion also links to a related discussion that begins at https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2014/006929.html

It looks like the earlier conclusions were that relaxing the restriction might be possible, but people wanted to see actual use cases.
The PM2.5 case above is one such case (with an example file at https://asdc.larc.nasa.gov/data/MAIA/L4_GFPM_VSIM001/2018/01/MAIA_L4_GFPM_20180101T000000Z_FB_NOM_R01_USA-Boston_F01_VSIM01p01p01p01.nc), but I don't know if it'll be convincing.

I hadn't actually picked up on the "should" vs. "must/shall" usage in the naming conventions section which, based on my understanding of the terms, means products are technically already CF conforming even if variable or group names include periods.
This is useful information in itself!

@markusfiebig

I would in fact propose to relax the character restrictions for CF names considerably since these restrictions limit the usability of the convention. I will soon have to propose names for the concentrations of PCBs, so we are looking at names of the type

2,2',3,3',4,4',5,5'-octachlorobiphenyl mass concentration

The commas and quotation marks in this name are essential to denote the chemical, so they can't be replaced. PCBs and brominated flame retardants are clearly a relevant area of atmospheric research and need to have a place in the CF naming convention.

To limit the character set of a vocabulary to meet the needs of programming languages is rather outdated. Programming languages should serve the use cases, not limit them.

@JonathanGregory
Contributor

Dear Sebastian @sval-dev and all

Thanks for this useful discussion and the references to the previous explorations of the same issue.

  • I am one of those who shares the opinion Sebastian expected to encounter! The whole purpose of the CF convention is to provide metadata to describe the contents of variables. That being so, it should not matter what the variables themselves are called. The convention attributes no meaning to variable names, and the generality of a program is reduced if it depends on them. I understand that it's sensible to give names to variables which will help a reader of the data, whereas giving them deliberately meaningless names would be perverse. However, being helpful doesn't require giving a variable exactly the name one might like to give the quantity in human language. In the example of "PM 2.5", would e.g. "PM_2point5" be helpful?

  • I am glad to learn that the CF rule about variable names is the same one as in Python. That's an attractive consistency. I agree we should clarify what "letter" means. The Python statement spells it out. We could use their words.

  • Unfortunately the text of the convention is inconsistent in the words it uses to express requirements. This has often been commented on, and obviously it would be beneficial to standardise the language, but that requires someone to spend substantial time on the task. Is anyone willing to have a go? I believe that "should" and "must" both express requirements (but I may be wrong). Requirements are distinct from recommendations, which also use a variety of words, such as "recommended" and "deprecated" (recommended not to).

  • I wonder whether the comment of @markusfiebig is about characters allowed in standard names, rather than in netCDF variable names?

Best wishes

Jonathan

@DocOtak
Member

DocOtak commented Aug 15, 2023

When I see words like "should" and "must" I almost always assume that they are RFC2119, i.e. "should" means that you can ignore that particular requirement if there is a good reason and it is not spec breaking. CF itself doesn't have this sort of requirement level statement, though it might be a good idea for some future version. I'm willing to take a crack at starting this effort if desired.

My thoughts about the security implications of variable names still apply. We should absolutely not consider the valid symbol name restrictions in programming languages to be constraining for CF netCDF variable names.

@larsbarring

larsbarring commented Aug 16, 2023

A couple of comments/questions regarding rather different aspects of this discussion:

@sval-dev, @JonathanGregory: The link you both give to the rules for Python identifiers (also referred to as names) includes, for Python 3, a range of Unicode characters. For example both µ and π are valid variable names:

Python 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> µ = 0.000001
>>> π = 3.1415

And at least π (I did not try anything else) is a valid netCDF name when used with the ncgen utility that comes with the library version I have: "netcdf library version 4.9.1 of Feb 11 2023 02:11:40"
Perhaps you were referring to the following more restrictive sentence in the python statement?

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

Again, I think that it is useful to clarify what CF means with the word "letter". And if the sentence above is used then we need to make reference to Unicode to explain what the "U+0001" etc. means.
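
To show why the definition of "letter" matters, here is a small sketch comparing two readings of the rule. It assumes, purely for illustration, that CF's "letter" means ASCII A-Z and a-z; the CF text does not currently say so. `CF_NAME_ASCII` and `cf_ascii_ok` are hypothetical names.

```python
import re

# Assumption: "letter" in CF section 2.3 is read as ASCII A-Z / a-z only.
# The name must begin with a letter, then letters, digits, and underscores.
CF_NAME_ASCII = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def cf_ascii_ok(name: str) -> bool:
    """Check `name` against the ASCII-only reading of the CF naming rule."""
    return CF_NAME_ASCII.fullmatch(name) is not None


for name in ("hus", "_private", "π", "PM_2p5_Total", "PM_2.5_Total"):
    # str.isidentifier() follows the full Python 3 rule, including Unicode letters.
    print(f"{name!r}: python3={name.isidentifier()} cf_ascii={cf_ascii_ok(name)}")
```

Under these assumptions the two rules disagree in two places: Python 3 accepts Unicode letters such as π, and it accepts a leading underscore, which the CF sentence ("begin with a letter") does not.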

I concur with @DocOtak, and @sval-dev as it seems, in my own interpretation of "should" etc., mainly because RFC2119 is more and more becoming the de facto standard outside the original context of the "Internet Community". It has been referred to at least a couple of times before in CF conversations (here and here). And as @JonathanGregory writes, the language for expressing requirements has been discussed many times. I think that it would be absolutely brilliant if you, Andrew, would be willing to make a good start at introducing RFC2119 into CF. Then I hope others would join in to support; at least I am willing.

I also agree with @DocOtak, and @zklaus, (here and here) regarding the potentially very serious problems of eval-uating netCDF variable names into program variables.

@JonathanGregory
Contributor

Dear @DocOtak and @larsbarring

Thanks a lot for offering to have a go at making the language consistent for recommendations and requirements in the CF document, Andrew. That is very generous and helpful. Like @larsbarring, I would be willing to help. A definite source of guidance is the conformance document, which clearly distinguishes requirements and recommendations, where a requirement is something that must be done, and a recommendation is something which isn't compulsory but is advisable. (I understand a prohibition to be a requirement not to do something, and a deprecation as a recommendation not to do it.) Not all requirements and recommendations are stated there, because it includes only those which can be checked automatically, and in other cases it might be unclear whether the text means a requirement or a recommendation. We may have to discuss such cases to clarify them. I think we may also have things which are "strongly recommended" or other such words, but that doesn't imply a third category. They're just recommendations.

Thanks for clarifying the point about the Python characters, Lars. Yes, I was referring to the ASCII range. Sorry to be sloppy.

Best wishes

Jonathan

@larsbarring

@sval-dev in your initial post you mention that you would like to have variable names like /PM_2.5_Total, where the period . is currently illegal. But you also have the slash /, which in the previous issue cf-conventions/#237 was excluded. In addition to what @JonathanGregory suggests ("PM_2point5") there is in the standard name table the established construct "....pm2p5...." (all lowercase letters as is the case in standard names). Based on this, could variable names like "PM_2p5_Total" meet the needs of your user community?
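
A translation along these lines could be sketched as follows. This is only an illustration: `cf_safe_name` is a hypothetical helper, the "p" for a decimal point mirrors the "pm2p5" construct mentioned above, and replacing every other disallowed character with an underscore is an assumption, not a CF rule. It is meant for bare names without a group path and does not enforce the "begin with a letter" part of the rule.

```python
import re


def cf_safe_name(name: str) -> str:
    """Map a bare variable name to letters, digits, and underscores only."""
    # A period between two digits is treated as a decimal point: "2.5" -> "2p5".
    name = re.sub(r"(?<=\d)\.(?=\d)", "p", name)
    # Assumption: any other character outside [A-Za-z0-9_] becomes an underscore.
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


for original in ("PM_2.5_Total", "PM_10_Total", "spec.hum"):
    print(f"{original} -> {cf_safe_name(original)}")
```

Such a mapping is lossy: distinct names like "a.b" and "a_b" end up identical, so a data producer would still need to check for collisions.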

For the record, I am not a fan of assigning too detailed a meaning to variable names. As has been pointed out before, this should instead go into standard names and other metadata.

At the same time I think that we should seriously (and carefully) consider to allow selected additional characters. But could this discussion continue over in cf-conventions/#237 with the intention to clarify the character set allowed for variable names?

@sval-dev
Author

sval-dev commented Aug 16, 2023

Thanks @markusfiebig, @JonathanGregory, @DocOtak and @larsbarring for your thoughts!

On the comma-containing chemical names, it is useful to have another illustrative use case!

I think the goal of adding metadata to help product discoverability and understanding is certainly worthwhile.
However, it seems at least slightly contradictory to hold the view that it "should not matter what the variables themselves are called" but at the same time that it definitely matters if variables happen to have periods in the name and that we should avoid this.

In terms of driving the adoption of CF conventions, it also seems like it would be ideal if we could layer CF conventions onto existing legal netCDF4 products (e.g. like those that might have periods in their dataset names) while minimizing changes to those products (e.g. by forcing the modification of dataset names).

Just for reference, here is what netCDF4 has:

This specification extends the permitted characters in names to include multi-byte UTF-8 encoded Unicode and additional printing characters from the US-ASCII alphabet. The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_' (reserved for special names with meaning to implementations, such as the “_FillValue” attribute). Subsequent characters may also include printing special characters, except for '/' which is not allowed in names. Names that have trailing space characters are also not permitted.
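
Read literally, the quoted rule can be approximated with a short check. This is a rough sketch of the paragraph above, not the netCDF library's own validation; `netcdf4_name_ok` is a hypothetical name, and treating an interior space as an allowed printing character is an assumption.

```python
def netcdf4_name_ok(name: str) -> bool:
    """Rough check of `name` against the quoted netCDF-4 naming paragraph."""
    if not name:
        return False
    first = name[0]
    # First character: ASCII alphanumeric, '_', or a non-ASCII character
    # (any non-ASCII character is multi-byte when encoded as UTF-8).
    if not ((first.isascii() and first.isalnum()) or first == "_" or ord(first) > 0x7F):
        return False
    # Names with trailing space characters are not permitted.
    if name.endswith(" "):
        return False
    for ch in name[1:]:
        if ord(ch) > 0x7F:
            continue  # multi-byte UTF-8 character
        # Printable ASCII (space through '~'), except '/'.
        if ch == "/" or not (" " <= ch <= "~"):
            return False
    return True


for name in ("PM_2.5_Total", "/PM_2.5_Total", "_FillValue", "π", "trailing "):
    print(f"{name!r}: {netcdf4_name_ok(name)}")
```

Under this reading a period inside a name is accepted, while a name containing '/' is not, since '/' is reserved for separating groups in a path.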

I also agree with DocOtak and larsbarring that the common understanding of "should" and "must" from the RFC is a good thing to adopt if that isn't already the meaning imparted when those words are used here.
As part of that process, I would suggest that keeping "should" for variable naming may be worthwhile (and modifying the conformance document to make this a recommendation), or otherwise increasing the flexibility, as was the original motivation for this issue.
Continuing that discussion in #237 seems reasonable to me since this does seem to be a dupe.

On the python naming restriction, I was indeed referring to the more restrictive section outlined.

On the example naming with slashes, I included the slashes because that is how netCDF4 groups (starting from the root group at "/") are referenced in my software.
I believe this is also the convention adopted by some other software (e.g. h5py or Hyrax)

Here is an example of reading a file in Python with the netCDF4 Python package:

>>> from netCDF4 import Dataset
>>> ds = Dataset('MAIA_L4_GFPM_20180101T000000Z_FB_NOM_R01_USA-Boston_F01_VSIM01p01p01p01.nc')
>>> fields = {}
>>> fields["PM_2.5_Total_Mean"] = ds["/PM_2.5/PM_2.5_Total_Mean"]
>>> print(fields["PM_2.5_Total_Mean"])
<class 'netCDF4._netCDF4.Variable'>
float32 PM_2.5_Total_Mean(X_Dim, Y_Dim)
    _FillValue: 9.96921e+36
    units: µg / m^3
    description: Predicted PM 2.5 Total daily mean mass concentration at each L2 grid location
    grid_mapping: Albers_Equal_Area
path = /PM_2.5
unlimited dimensions: X_Dim, Y_Dim
current shape = (376, 490)
filling on
>>> print(fields["PM_2.5_Total_Mean"][100,100])
6.7899747

Note that the above purposely uses a dictionary, so that the variable names (really keys here) can still contain periods.

There is nothing that stops us from using 2p5 above instead, but that was not the preferred usage and so wasn't what ended up becoming adopted.
It is nice to know that the standard names using 2p5 for some of our fields already exist to aid tools in finding these datasets, and we should definitely add these into our templates.

Looking forward to further discussion in #237 !

@larsbarring

larsbarring commented Sep 29, 2023

I am closing this now:

@sval-dev
Author

The linked example MAIA_L4_GFPM_20180101T000000Z_FB_NOM_R01_USA-Boston_F01_VSIM01p01p01p01.nc file does indeed require a (free) NASA Earthdata login.
I'm attaching it here as well in case that's helpful (renamed with an appended .txt extension because of extension support limitations).
MAIA_L4_GFPM_20180101T000000Z_FB_NOM_R01_USA-Boston_F01_VSIM01p01p01p01.nc.txt

6 participants