Small update to text in section 2.3 regarding character sets #468

Closed
larsbarring opened this issue Nov 1, 2023 · 4 comments · Fixed by #470
Labels
defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors

Comments

@larsbarring
Contributor

larsbarring commented Nov 1, 2023

Title

Small update to text in section 2.3 character sets.

Moderator

not yet

Requirement Summary

In Section 2.3 the currently allowed character set must be described more precisely to avoid ambiguity across different languages, and the Unicode capabilities of netCDF should be mentioned.

Technical Proposal Summary

The first two sentences of Section 2.3 currently read as follows:

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows use of the hyphen character.

  1. The word letter has different meanings in different languages. CF should therefore be more precise about exactly what is meant.
  2. The netCDF library now allows almost all Unicode characters in variable names and attribute names. This needs to be clarified to keep the CF text up to date with the actual situation.

Benefits

Data producers and software developers will have an up-to-date and accurate description of the intention of the CF conventions.

Status Quo

The current imprecision may lead to errors and mistakes when writing data or software.

Associated pull request

#470

Detailed Proposal

I suggest changing the first paragraph of Section 2.3 as follows (bold is new text, struck-through is deleted text):

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. **By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z.** Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows ~~use of the hyphen character~~ **almost all Unicode characters encoded as multibyte UTF-8 characters (NUG Appendix B). The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use.**

Some of the suggested changes have already been discussed in #237, in particular in this comment and the subsequent ones.

Note that the inlined reference/link to the netCDF documentation goes directly to Appendix B because the local link within the NUG to Appendix B is dead.

EDIT 2023-11-8: Changed the link to NUG Appendix B as per email from @ethanrd.
EDIT 2023-11-11: Based on review comments, the suggested text has been updated; see this comment.

@larsbarring larsbarring added the defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors label Nov 1, 2023
@larsbarring larsbarring linked a pull request Nov 6, 2023 that will close this issue
@larsbarring
Contributor Author

larsbarring commented Nov 8, 2023

As noted in the PR description (above), the suggested change has already been discussed in this comment of #237 and the subsequent ones. The last comment was on November 1st, and support was voiced by @ChrisBarker-NOAA.

If there are no more comments, except regarding minor technical details, I suggest that the proposed changes be implemented via PR #470 on November 22.

@ethanrd
Member

ethanrd commented Nov 9, 2023

As Unicode contains many digits outside of 0-9, I think the term digit needs to be clarified in the same way as letter.

I have made a suggestion in the PR that simply adds a sentence after, and similar to, the sentence clarifying the meaning of letter.
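
To illustrate the point about digits (this is only a sketch, not part of the proposed CF text): Python's built-in `str.isdigit()` follows the Unicode definition, so it accepts digits well outside 0-9. The helper name `is_ascii_digit` below is hypothetical and only shows what the narrower ASCII reading would look like.

```python
import unicodedata


def is_ascii_digit(ch: str) -> bool:
    """Return True only for the ten ASCII digits 0 to 9 (hypothetical helper)."""
    return len(ch) == 1 and "0" <= ch <= "9"


# ASCII 7, ARABIC-INDIC DIGIT THREE, FULLWIDTH DIGIT SEVEN
samples = ["7", "\u0663", "\uff17"]

for ch in samples:
    # str.isdigit() uses Unicode properties; is_ascii_digit() does not.
    print(unicodedata.name(ch), ch.isdigit(), is_ascii_digit(ch))
```

All three samples satisfy `str.isdigit()`, but only the first one passes the ASCII-only check, which is the distinction the added sentence is meant to pin down.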

@ethanrd
Member

ethanrd commented Nov 9, 2023

OK. The ASCII "underscore" is the Unicode "low line". There is also a Unicode "combining low line" (and probably others).

So, perhaps to be really clear we should specify "underscore" as well.

By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z.
By the word digits we mean the standard ASCII digits 0 to 9.
By the word underscore we mean the standard ASCII underscore _.
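
As a quick illustration of the underscore ambiguity (a sketch only, using Python's standard `unicodedata` module; the helper name `describe` is hypothetical), the following looks up the Unicode names of a few underscore-like characters:

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, is-ASCII flag) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.isascii())


# ASCII underscore, combining low line, fullwidth low line
for ch in ("_", "\u0332", "\uff3f"):
    print(describe(ch))
```

Only the first of these is the ASCII underscore; the other two are distinct Unicode characters that merely look similar.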

@larsbarring
Contributor Author

Thanks Ethan!

In an attempt to avoid triple repetitions I suggest a small rewording which I hope does not change the meaning. The first paragraph is then:

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. **By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z.**
**By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _.**
Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows ~~use of the hyphen character~~ **almost all Unicode characters encoded as multibyte UTF-8 characters (NUG Appendix B). The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use.**

I will put this in the associated PR.
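
For software developers, the reworded rule could be checked with something like the following minimal sketch. The function name `is_cf_recommended_name` and the pattern are illustrative assumptions, not part of CF or of any netCDF library; the pattern simply encodes "begin with an ASCII letter, then ASCII letters, ASCII digits, or the ASCII underscore".

```python
import re

# Explicit ASCII ranges are used on purpose: \w and [0-9]-like classes such as \d
# would also match non-ASCII letters and digits in Python 3 str patterns.
CF_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_cf_recommended_name(name: str) -> bool:
    """Return True if name follows the character rule as reworded above (sketch)."""
    return CF_NAME.fullmatch(name) is not None


for candidate in ["air_temperature", "t2m", "_private", "2m_temp", "wind-speed", "temp\u00e9rature"]:
    print(candidate, is_cf_recommended_name(candidate))
```

Under this reading, names with a leading underscore, a leading digit, a hyphen, or a non-ASCII letter are rejected, even though the netCDF interface itself would accept most of them.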
