Add support for attributes of type string #141

Open
JimBiardCics opened this Issue Jul 23, 2018 · 51 comments


JimBiardCics commented Jul 23, 2018

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

  1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
  2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.
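To make the BOM point above concrete, here is a minimal sketch (not from any CF tool; the function name is hypothetical) of how a reader might strip a leading BOM from an attribute value that a library has already decoded to a native string:

```python
# Hypothetical helper: strip a leading UTF-8 byte order mark, if present.
# The BOM bytes EF BB BF decode to the single code point U+FEFF, so once
# the bytes are decoded it shows up as one extra leading character.
def strip_bom(value: str) -> str:
    """Return value without a leading U+FEFF byte order mark."""
    if value.startswith("\ufeff"):
        return value[1:]
    return value

assert strip_bom("\ufeffdegrees_north") == "degrees_north"
assert strip_bom("degrees_north") == "degrees_north"   # no-op when absent
```

Whether readers should be expected to tolerate a BOM at all is part of what this proposal needs to decide.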

To finalize the change to support string type attributes, we need to decide:

  1. Do we explicitly forbid string array attributes?
  2. Do we place any restrictions on the content of string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc) may use any UTF-8 character.
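As an illustration of this split between controlled vocabulary and free text, a hypothetical checker (the function name and example values are mine, not part of CF) might verify the ASCII constraint like this:

```python
# Hypothetical checker for the proposed rule: controlled-vocabulary
# attribute values must be pure ASCII, while free-text attributes may
# carry any UTF-8 character. (Python 3.7+ also offers str.isascii().)
def is_ascii(text: str) -> bool:
    """True if every character is in the ASCII range U+0000..U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)

assert is_ascii("air_temperature standard_error")   # controlled terms: OK
assert not is_ascii("Température de l'air")         # free text may be non-ASCII
```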

Trac ticket: #176


Dave-Allured commented Jul 23, 2018

I am generally in support of this string attributes proposal, including UTF-8 characters. However, for CF controlled attributes, I recommend an explicit preference for type char rather than string. This is for compatibility with large amounts of existing user code that access critical attributes directly, and would need to be reworked for type string.

I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF defined attributes.


ajelenak-thg commented Jul 24, 2018

How different is reading values from a string attribute compared to a string variable? If some software supports string variables shouldn't it support string attributes as well? If the CF is going to recommend char datatype for string-valued attributes, shouldn't the same be done for string-valued variables?

Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, afaik, it is not recommended.

Since what gets stored are always the bytes of one string in some encoding, assuming UTF-8 by default should take care of the ASCII character set, too. This could cause issues if someone used another one-byte encoding (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved.

Storing Unicode strings using the string datatype makes more sense, since the number of bytes for such strings in UTF-8 encoding is variable.
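The decoding ambiguity described here can be sketched as follows, assuming a reader that tries UTF-8 first and falls back to Latin-1 (a common heuristic, not a CF rule; Latin-1 decoding never fails, though it can silently produce the wrong characters for bytes that happen to form valid UTF-8):

```python
# Heuristic decoder (illustrative only): attempt UTF-8, then fall back
# to ISO 8859-1 (Latin-1), in which every byte sequence is valid.
def decode_text(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

assert decode_text("café".encode("utf-8")) == "café"
# The Latin-1 bytes for "café" (b"caf\xe9") are not valid UTF-8,
# so the fallback path is taken:
assert decode_text("café".encode("latin-1")) == "café"
```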


JimBiardCics commented Jul 24, 2018

This issue and issue #139 are intertwined. There may be overlapping discussion in both.


JimBiardCics commented Jul 24, 2018

@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.


JimBiardCics commented Jul 24, 2018

@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.


JimBiardCics commented Jul 24, 2018

@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies?
It is true that applications written in C or FORTRAN will require code changes to handle string, because the API and the returned data for string attributes and variables differ from those for char attributes and variables.
Would a warning about avoiding string for maximum compatibility be sufficient?


Dave-Allured commented Jul 24, 2018

@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies.

A warning about avoiding data type string is notification. An explicit preference is advocacy. I believe the compatibility issue is important enough that CF should adopt the explicit preference for type char for key attributes.


Dave-Allured commented Jul 24, 2018

The restriction that char attributes and variables should contain only ASCII characters is not warranted. The Netcdf-C library is agnostic about the character set of data stored within char attributes and char variables. UTF-8 and other character sets are easily embedded within strings stored as char data.

Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both char and string data types) as the ASCII/UTF-8 conflation.


DocOtak commented Jul 24, 2018

Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF8 vs other encodings:

  • UTF8 is backwards compatible with ASCII if the following are true: there is no byte order mark, and all code points are between U+0000 and U+007F.
  • UTF8 is not backwards compatible with Latin1 (ISO 8859-1), because code points above U+007F need two bytes to represent in UTF8.
  • There are multiple ways of representing the same grapheme; the netCDF classic format requires UTF8 to be in Normalization Form Canonical Composition (NFC).

My personal recommendation is that the only encoding for text in CF netCDF be UTF8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes".
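The NFC point can be illustrated with Python's standard unicodedata module (the example strings are mine): the same grapheme "é" can be stored as one composed code point or as a base letter plus a combining mark, and NFC picks the composed form.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e + combining acute accent (U+0065 U+0301)

# The raw strings differ byte-for-byte even though they render identically:
assert composed != decomposed

# NFC normalization produces the composed form:
assert unicodedata.normalize("NFC", decomposed) == composed
```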

Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2. Description of file contents), could probably be either string or char arrays.


Dave-Allured commented Jul 24, 2018

@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.


DocOtak commented Jul 24, 2018

@Dave-Allured yes, I reread the section; object names do appear to be what it is restricting. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF?

Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.


Dave-Allured commented Jul 24, 2018

@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.


DocOtak commented Jul 24, 2018

@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the Unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. The FAQ also states that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control chars excluded).


hrajagers commented Jul 25, 2018

@Dave-Allured and @DocOtak,

  1. Most of the character/string attributes in the CF conventions contain a concatenation of sub-strings selected from a standardized vocabulary, variable names, and some numbers and separator symbols. It seems that for those attributes the discussion about the encoding is not so relevant, as these sub-strings contain only a very basic set of characters (assuming that variable names are not allowed to contain extended characters). Even for flag_meanings the CF conventions state "Each word or phrase should consist of characters from the alphanumeric set and the following five: '_', '-', '.', '+', '@'." If the alphanumeric set doesn't include extended characters, this again doesn't create any problems for encoding. The only attributes that might contain extended characters (and thus be influenced by this encoding choice) are attributes like long_name, institution, title, history, ... However, CF inherits most of them from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here? In short, I'm not sure the encoding is important for string/character attributes at this moment.

  2. I initially raised the encoding topic in the related issue #139 because we want our model users to use local names for observation points and they will end up in label variables. In that context I would like to make sure that what I store is properly described.


JimBiardCics commented Jul 25, 2018

@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing string type.


ajelenak-thg commented Jul 25, 2018

I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold string values as well as char:

  • comment
  • external_variables
  • _FillValue
  • flag_meanings
  • flag_values
  • history
  • institution
  • long_name
  • references
  • source
  • title

All the other attributes should hold char values to maximize backward compatibility.


JimBiardCics commented Jul 25, 2018

@ajelenak-thg Are you suggesting the other attributes must always be of type char, or that they should only contain the ASCII subset of characters?


ajelenak-thg commented Jul 25, 2018

Based on the expressed concern so far for backward compatibility I suggested the former: always be of type char. Leave the character set and encoding unspecified since the values of those attributes are controlled by the convention.


ajelenak-thg commented Jul 25, 2018

On the string encoding issue, CF data can currently be stored in two file formats: netCDF Classic and HDF5. String encoding information cannot be directly stored in the netCDF Classic format, and the spec defines a special variable attribute _Encoding for that purpose in future implementations. The values of this attribute are not specified, so anything could be used.

In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings. This effectively limits what could be allowed values of the (future) _Encoding attribute for maximal data interoperability between the two file formats.


Dave-Allured commented Jul 26, 2018

@hrajagers said: However, CF inherits most of them [attributes] from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here?

Yes, NUG Appendix A literally allows only char type attributes. My sense is that proponents believe that string type is compatible with the intent of the NUG, and also strings have enough advantages to warrant departure from the NUG.

Personally I think string type attributes are fine within collaborations where everyone is ready for any needed code upgrades. For exchanged and published data, char type CF attributes should be preferred explicitly by CF.


Dave-Allured commented Jul 26, 2018

@ajelenak-thg said: In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings.

Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netcdf developers to support arbitrary character sets in netcdf-4 data type char, both attributes and variables. See netcdf issue 298. Therefore, data type char remains fully interoperable between netcdf-3 and netcdf-4 formats.

For example, this netcdf-4 file contains a char attribute and a char variable in an alternate character set. You will need an app or console window enabled for ISO-8859-1 to properly view the ncdump of this file.


JonathanGregory commented Jul 26, 2018

Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow string, but we should recommend the continued use of char, giving as the reason that char will maximise the usability of the data, because of the existence of software that isn't expecting string. Recommend means that the cf-checker will give a warning if string is used. However, it's not an error, and a given project could decide to use string.

For the attributes whose contents are standardised by CF e.g. coordinates, if string is used we should require a scalar string. This is because software will not expect arrays of strings. These attributes are often critical and so it's essential they can be interpreted. For CF attributes whose contents aren't standardised e.g. comment, is there a strong use-case for allowing arrays of strings?

I recall that at the meeting in Reading the point was made that arrays would be natural for flag_values and flag_meanings. I agree that the argument is stronger in that case because the words in those two attributes correspond one-to-one. Still, it would break existing software to permit it. Is there a strong need for arrays?

Best wishes

Jonathan


JimBiardCics commented Jul 26, 2018

@JonathanGregory I agree with you. I think it would be fine to leave string array attributes out of the running for now. I also prefer the recommendation route.

Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because they both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
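A short sketch of why ASCII delimiters keep things parseable (the attribute values below are invented examples): a blank-separated controlled attribute splits cleanly regardless of what other attributes contain.

```python
# Invented example values: a controlled-vocabulary attribute with ASCII
# terms and blank delimiters, alongside a free-text attribute that uses
# non-ASCII UTF-8 characters.
flag_meanings = "no_rain light_rain heavy_rain"   # controlled, ASCII only
long_name = "Niederschlagsintensität über 24 h"   # free text, UTF-8 is fine

# A parser only needs to find the ASCII space delimiters:
meanings = flag_meanings.split()
assert meanings == ["no_rain", "light_rain", "heavy_rain"]
```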


ethanrd commented Jul 26, 2018

@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the history attribute as a string array so that each element could contain the description of an individual processing step. I think easier machine readability was mentioned as a motivation.


JonathanGregory commented Jul 27, 2018

Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both char and string. For these attributes, we prescribe the possible values (they have controlled vocabulary) and so we don't need to make a rule in the convention about it for the sake of the users of the convention. If we put it in the convention, it would be as guidance for future authors of the convention. I don't have a view about whether we should do this. It would be worth noting to users that whitespace, which often appears in a "blank-separated list of words", should be ASCII space. I agree that UTF-8 is fine for contents which aren't standardised.

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, string attributes will not be expected by much existing software. Hence software has to be rewritten to support the use of strings in any case, and support for arrays of strings could be added at the same time, if it's really valuable. I don't see the particular value for the use of string arrays for comment - do other people? For flag_meanings, the argument was that it would allow a meaning to be a string which contained spaces (instead of being joined up with underscores, as is presently necessary); that is, it would be an enhancement to functionality.

Happy weekend - Jonathan


JonathanGregory commented Jul 27, 2018

I meant to write, I don't see the particular value for the use of string arrays for history, which Ethan reminded us of. Why would this be more machine-readable?


JimBiardCics commented Jul 27, 2018

@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize.

I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.


ajelenak-thg commented Jul 27, 2018

I would also add:

  1. source attribute when holding many filenames, or
  2. references attribute with more than one reference identifier.

One long concatenated string is not the most appropriate container for a collection of string-valued things.

As @Dave-Allured's recent post and example file illustrate, specifying string encoding for the char datatype is burdened by the past. Adding the string datatype provides us a chance to do it a little bit better by explicitly stating Unicode character set and UTF-8 encoding.

A larger issue lurking in the background is how to signal file content that breaks backward compatibility. This is something we discussed at the Reading workshop but no way forward was laid out. This proposal is not going to be the only backward-incompatible one. For example, group hierarchies are coming.


ethanrd commented Jul 27, 2018

@JimBiardCics said:

I'm curious to know if the NUG authors looked at this section in light of allowing string type.

No, the NUG has NOT been systematically reviewed with respect to the string type or other enhanced data model features. Clearly, the NUG should support the use of enhanced data model types and features (with appropriate cautions about backward compatibility and broad usability) and leave further restrictions to conventions. So, NUG Appendix A should probably clarify that the values of those attributes are strings that can be encoded in netCDF-3 as char arrays and in netCDF-4 as either char arrays or string type.

The Unidata netCDF group will work on updating the NUG (with user community input) in the fairly near-term.


JonathanGregory commented Aug 1, 2018

Dear Ken et al.

I think we should consider the case of each attribute individually, since the uses and arguments are different for each. Perhaps it would be simpler first of all to agree Jim's proposal to allow strings as equivalent to char arrays in attributes, without introducing arrays of strings. Once that is agreed, we can talk about whether to allow arrays in separate issues for various attributes.

Best wishes

Jonathan


Dave-Allured commented Aug 1, 2018

Here is a different compromise approach, in response to multiple requests for string arrays. If this were a new design, then both scalar and array string attributes would be natural. Also, string support of any flavor will require code upgrades. I would prefer to make code upgrades once rather than twice. Adding string array support is not much harder than string scalar support by itself. Therefore:

  • Allow string scalar and array attributes.
  • State that char attributes are preferred for backward compatibility.
  • Don't mix parsing rules between char and string attributes. Require that CF simple lists be stored only as string arrays, not string scalars with delimiters.

The no-mix rule should make it easy to make general purpose parsing functions for CF simple list attributes, such that they can blindly distinguish and process both data types.

This approach sacrifices round trip generic conversions between attributes of the two data types. You would need to either have CF-aware utilities, or else simply don't convert. This restriction is not a problem for me.


JonathanGregory commented Aug 2, 2018

Dear Dave

Maybe you'd do it like that if we were starting from scratch, but we aren't. We have to bear in mind the needs of users of the convention who write their own ad-hoc code. I would rather stick to our usual principle of not adding new possibilities in the convention unless there is a strong use-case, and even more so in situations, like this, when there is already an encoding that works fine. I'm sorry if that seems frustratingly conservative, but I believe it's a principle that has worked well for CF. There have been plenty of occasions when we've decided not to add a new way of doing something because we already have a satisfactory although less attractive way to do it.

Best wishes

Jonathan


Dave-Allured commented Aug 3, 2018

@JonathanGregory, above I did not mean to exclude the current encoding of simple lists in char attributes. I meant to say:

  • Don't mix parsing rules between char and string attributes. Require that CF simple list attributes be stored as either:

    • char attributes with delimiters, or
    • string arrays without delimiters,
      but not scalar strings with delimiters.

With that clarification, do you still find the option of string array attributes to be more objectionable than scalar strings with delimiters?


Dave-Allured commented Aug 3, 2018

I am unfamiliar with py-netCDF4 and python. In py-netCDF4, is there an existing function that parses CF simple list char attributes such as coordinates or flag_values into component strings? How common is it for application level code to parse these attributes directly, as opposed to using the library function?


DocOtak commented Aug 3, 2018

@Dave-Allured This is my own personal experience: I do most of my netCDF work using the python xarray library, which is a wrapper around a few netCDF libraries, including the one from Unidata. The char vs string distinction in attributes is abstracted away, such that I didn't even know that the netCDF "text" attributes weren't strings, as they are cast/coerced into native python strings. This is different from how string vs char is handled in variable data in xarray. I very rarely use the python-netCDF4 library directly.


Dave-Allured commented Aug 3, 2018

@DocOtak, "Abstractions are good." How does xarray handle coordinates and flag_values attributes?


Dave-Allured commented Aug 3, 2018

Oops, I meant, when these text attributes contain multiple values with delimiters in the input file, does xarray return them as the original single python string, or as an array of strings?


DocOtak commented Aug 3, 2018

@Dave-Allured coordinates are handled specially in that they are interpreted and kept around for various operations you might want to do with them. See http://xarray.pydata.org/en/stable/data-structures.html#coordinates

It will attempt to "decode CF" by default (http://xarray.pydata.org/en/stable/generated/xarray.decode_cf.html), but xarray is not a CF-specific library. It doesn't do anything special as far as I know with flag_values, or with ancillary_variables for that matter. The xarray maintainers have recommended iris if you want a fully CF-aware tool.


DocOtak commented Aug 3, 2018

@Dave-Allured I did some tests; a string attribute with multiple entries will be presented as an array of strings by xarray in python. I don't think it has any concept of delimiters within the string itself (e.g. break on whitespace).

As for the actual topic of adding string attributes to CF netcdf: are the CF version numbers meant to be semantic? (see https://semver.org/) If the answer is even close to "yes", then it would probably exclude adding the ability to have a string type represent any of the existing attributes currently defined in CF-1.x. Since all the more "complicated" values rely on some sort of character delimiter already, allowing them to exist in more than one data type is just added complexity without much benefit.


Dave-Allured commented Aug 3, 2018

@DocOtak, thanks for testing python xarray. You said "a string attribute with multiple entries". Please clarify. CF example 5.2 shows this attribute, which is data type char in common ncdump syntax:

T:coordinates = "lon lat" ;

Do you mean xarray currently presents this as a python array of two strings?


DocOtak commented Aug 3, 2018

@Dave-Allured coordinates are a bad example as xarray by default will remove the attribute and instead present a special coords property with a python dictionary (mapping data structure) with references to the actual data variables.

Assuming that it won't do the above, this is the behavior I've observed:

T:coordinates = "lon lat" ; will be a python string "lon lat"
string T:coordinates = "lon lat" ; will be a python string "lon lat"
string T:coordinates = "lon", "lat" ; will be a python list with strings ["lon", "lat"]

A python list with a single string ["lon lat"] appears to be encoded as a char array: T:coordinates = "lon lat" ;

I don't know how much of this is xarray doing magic, or the result of the python-netCDF4 library. I must admit that the last example would be very nice for the enumerated values (e.g. flag defs)

Do you or anyone else know what MATLAB does?


Dave-Allured commented Aug 6, 2018

@DocOtak, I agree the "coordinates" attribute in xarray is a bad example of simply reading a text attribute. But it is also a good example of a lower level library fully encapsulating that functionality, thereby hiding the details. Encapsulating functions are part of my thinking about allowing string arrays for CF simple lists.

I do not know how MATLAB handles character and string attributes. However, I found that NCL automatically converts char attributes to scalar strings. Because of this, and the lack of another inquiry function, there is no good way at the NCL user level to distinguish char and string file attributes. The same may be true with python-netCDF4 and some other programming languages.

This ability to distinguish would be essential for my string array proposal to work. I come from a Fortran perspective where the raw file data type is right up front. Making a library function to handle this distinction would be natural. I feel this could be done for CF simple list attributes, for all languages, without much trouble.

@JimBiardCics

JimBiardCics commented Aug 29, 2018

@Dave-Allured @DocOtak @JonathanGregory Chris Barker (I'm finally back at it.) Thanks for your thoughts and investigations! As far as I'm aware, most general-purpose packages don't parse scalar string or char attributes into string arrays or anything of that sort. I think it's a good point that Chris and Andrew made that many modern netCDF APIs actively hide the difference between string and char attributes, in some cases making it hard to create a char attribute.

So, given all that, I like something along the lines of Jonathan's suggestion. Allow scalar string attributes as interchangeable with char attributes. Don't mention array string attributes. Note that older software may not handle string attributes. (Panoply, python-netCDF4, IDL, and MatLab all handle string attributes well.) Leave the more "exotic" concepts (using arrays for multiple-element things like flag_meanings and Conventions) to CF 2.0.

@ChrisBarker-NOAA

ChrisBarker-NOAA commented Sep 1, 2018

+1 on @JimBiardCics's proposal.

@Dave-Allured

Dave-Allured commented Sep 11, 2018

@JimBiardCics et al, I think string arrays for simple list attributes are the best single choice for the long term. It is likely that CF2 and other conventions will favor string arrays in the future. If you choose scalar strings for CF1, this will probably commit CF to two different ways of handling string attributes later, in addition to the existing delimited character type. This is a messy future scenario that I want to avoid.

I assert without proof that the necessary upgrades to languages and user code for string arrays will be simple and straightforward. Add a function to detect the attribute's file data type, as needed. Use the data type, and nothing else, to decide when to parse on delimiters, and when to assume array. This way, there will be no future need to involve the convention version for this purpose.

There will be some short term inconveniences to adapt to string arrays. Code can be adapted gradually as string arrays are encountered in new data sets. Also as I said earlier, this entire process can be encapsulated in a CF aware function for the specific list attributes, to simplify user code upgrades.
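The dispatch rule above ("use the data type, and nothing else, to decide when to parse on delimiters") could be encapsulated as sketched below. This is a hypothetical helper, not part of any CF library; it assumes a Python reader where a char attribute arrives as bytes (or an already-decoded str) and a string-array attribute arrives as a sequence of str, as python-netCDF4 typically presents them.

```python
def cf_list_attribute(value):
    """Return a CF simple-list attribute (e.g. flag_meanings) as a list
    of strings, dispatching only on how the value arrives from the file.

    - A char attribute (bytes, or a decoded scalar str) is split on
      whitespace, the traditional CF delimiter.
    - A string-array attribute is already a sequence of strings and is
      returned unchanged, with no delimiter parsing.
    """
    if isinstance(value, bytes):        # raw char attribute
        return value.decode("ascii").split()
    if isinstance(value, str):          # scalar string, or decoded char
        return value.split()
    return list(value)                  # string array: no parsing needed

# Usage, mirroring the CDL examples earlier in the thread:
print(cf_list_attribute(b"lon lat"))         # ['lon', 'lat']
print(cf_list_attribute("lon lat"))          # ['lon', 'lat']
print(cf_list_attribute(["lon", "lat"]))     # ['lon', 'lat']
```

With a helper like this in each language's CF layer, user code would not need to know the convention version to read a list attribute correctly.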

My "vote" is I am abstaining from the consensus on this. Please take my comments as suggestions, and I leave the choice up to the rest of this capable group.

@JimBiardCics

JimBiardCics commented Oct 26, 2018

So, per @ChrisBarker-NOAA's comment on #139, I like the idea of stating that char attributes are constrained to ASCII (latin-1?), and that string attributes should be treated as utf-8. There's always the possibility of adding an encoding attribute at some later date if there is demand.
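A reader following that rule might look like the sketch below (a hypothetical helper, assuming a Python API that returns char attributes as bytes and string attributes as str). It also strips the optional byte order mark (U+FEFF) that some writers, such as IDL per the issue description, prepend to UTF-8 string values.

```python
def decode_attribute(raw):
    """Decode a netCDF attribute value under the proposed rules:
    char attributes are constrained to latin-1 (a superset of ASCII),
    string attributes are UTF-8 with an optional leading BOM.
    """
    if isinstance(raw, bytes):
        # char attribute: decode as latin-1.
        return raw.decode("latin-1")
    # string attribute: the library has already decoded the UTF-8;
    # drop a leading BOM (U+FEFF) if the writer included one.
    return raw.lstrip("\ufeff")

print(decode_attribute(b"degrees_north"))         # 'degrees_north'
print(decode_attribute("\ufeffAir Temperature"))  # 'Air Temperature'
```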

As much as I like @Dave-Allured's suggestion above, I think it's probably best to leave string array attributes to CF 2.0 - or at least until a later date. It's a pretty pervasive change. It's not hard from a technical standpoint (and my organizational brain loves the idea!), but I think it will be confusing to a number of 'less technical' scientists I encounter who find netCDF and CF terribly confusing already. There are also quite a few questions that would need to be resolved about how cell_methods and other attributes like it would be affected.

Thoughts?

@ChrisBarker-NOAA

ChrisBarker-NOAA commented Oct 26, 2018

+1

Honestly, I only recently learned that attributes could have types other than a single piece of text (leaving char vs. string out of it for now).

And there are existing use cases that use delimited text to capture a similar concept.

So not allowing string arrays for now seems pretty darn straightforward.

CF2.0 can take advantage of more of these nifty features.
