Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature supporting of different charsets #184

Merged
merged 12 commits into from
Mar 1, 2020
Merged

Feature supporting of different charsets #184

merged 12 commits into from
Mar 1, 2020

Conversation

V-F
Copy link
Contributor

@V-F V-F commented Jan 14, 2020

Added supporting of definition custom charset for reading netcdf files since there are tools that create netcdf-files without considering of usage UTF-8 charset and just use the local charset to write text.
The definition of the custom charset is made by sending charset as a message to the IOSP object. If charset was not set the default UTF-8 charset will be used as before.
Added logic for simplifying creation of new iosp/header for custom netcdf-3 file format: Uniplot CDH format use netcdf3 file format with "CDH" magic, little endian byte order and ISO 88591-1 encoding.
These changes partially duplicate the changes from this PR.

Fixed bug by creation of Dimension in H5headerNew::addDimension(definition of unlimited flag before definition of length. Otherwise for unlimited dimension with length == 0 IllegalArgumentException will be thrown).

Vladislav Fuks added 2 commits January 14, 2020 15:21
…charset.

MOD H5iosp, H5headerNew, H4iosp, H4header, N3iosp, N3header: added logic for defining charset for reading netcdf files. This definition is made by sending charset as a message to the IOSP object. If charset was not set the default UTF-8 charset will be used as before. The definition of charset is needed because there are tools that create netcdf-files without considering of usage of UTF-8 charset and just use the local charset to write text. Added logic for simple creation of new iosp/header for different netcdf-3 file format: Uniplot CDH format that is netcdf3 with "CDH" magic, little endian byte order and ISO 88591-1 encoding.
… of length. Otherwise for unlimited dimension with length == 0 IllegalArgumentException will be thrown.
@claassistantio
Copy link

claassistantio commented Jan 14, 2020

CLA assistant check
All committers have signed the CLA.

@lesserwhirls
Copy link
Collaborator

Thank you for your contribution @V-F!

I just want to make sure I have a good understanding before proceeding with feedback.

From what I can tell, this basically enables IOSPs to read strings from files that potentially contain non-UTF-8 charsets strings (default still UTF-8). Then, the HDF4, HFD5, and netCDF3 IOSPs are modified to allow for the presence of non-UTF-8 charset strings (with the default remaining UTF-8). In this case, it looks like consideration is only given to data and attribute values, but not object names. The control of the charset used is done via the sendIospMessage() method. All of this is to enable new IOSPs in support of netCDF- and HDF-based files where strings were not encoded using UTF-8. Overall I think this makes sense. Finally, as a bonus, a bug fix when using the new builder-based API (thank you!).

Does that sound correct?

@V-F
Copy link
Contributor Author

V-F commented Jan 25, 2020

Absolutely. Thank you for your correction.

Copy link
Collaborator

@lesserwhirls lesserwhirls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments (some repeated across the various classes). Have you had a chance to look at the CLA? It looks like github isn't able to link the email address in your commits with your github account, so signing electronically will take a few additional steps. Once these things are taken care of, we'll be able to merge this in.

@ethanrd
Copy link
Member

ethanrd commented Jan 28, 2020

Does the Charset defined here only apply to data values or does it also affect netCDF variable, dimension, attribute, or group names? NetCDF object names are restricted to ASCII and UTF-8 (with some additional restrictions on particular characters) as described in the NUG section on "Characters in NetCDF Names".

This caught my eye because there is a conversation about characters allowed in netCDF object names going on in the CF Conventions repo, CF Issue #237.

Vladislav Fuks added 2 commits January 29, 2020 07:50
…etValueCharset method.

MOD N3iospNew, H4iosp, H5iospNew: added @nullable annotation for the parameter of the new added setValueCharset method.
… renamed class member valueCharset to the charset since it is used by reading not only values, but for each string.
@V-F
Copy link
Contributor Author

V-F commented Jan 29, 2020

Does the Charset defined here only apply to data values or does it also affect netCDF variable, dimension, attribute, or group names?

Initially we implemented a minimally invasive solution considering the charset only for values but not for names since we found no example files with names containing special characters like umlauts. So we couldn't decide whether the generator used UTF-8 or ISO-8859-1 to encode names.
Finally we decided to apply the charset to all strings to have a consistent solution. So we've consequently renamed "valueCharset" to "charset".

@lesserwhirls
Copy link
Collaborator

Thank you for the update @V-F! I'm wondering, could there be a possible case where the library in charge of writing enforced encoding of object names (say, UTF-8), but the array values were encoded differently? It feels like we should give the flexibility for object names and values to have a different encoding.

@lesserwhirls
Copy link
Collaborator

This caught my eye because there is a conversation about characters allowed in netCDF object names going on in the CF Conventions repo, CF Issue #237.

Here is another relevant issue @ethanrd (Unidata/netcdf-c#402)

@JSchnabel
Copy link

Thank you for the update @V-F! I'm wondering, could there be a possible case where the library in charge of writing enforced encoding of object names (say, UTF-8), but the array values were encoded differently? It feels like we should give the flexibility for object names and values to have a different encoding.

@lesserwhirls It might help if one can be sure the object names are encoded in UTF-8 regardless the value encoding (e.g. if the reader doesn't support the value charset). In this case I think the configurable charset should be restricted to the values and @V-F should get back to the definition of a value charset which is only applied to values but not to object names. For the example files, we (@V-F and me) know, it doesn't matter if the names are encoded in UTF-8 or ISO-8859 but ISO-8859 is used for the values. Since we are not really familar with netCDF you might rather decide this issue?
Do you see a use case where it may help to specify differing encodings for value and names?

@ethanrd
Copy link
Member

ethanrd commented Jan 30, 2020

The specification for netCDF Classic format is pretty clear that netCDF dimension, variable, and attribute names should be UTF-8 encoded strings. (I say should rather than must because I'm not sure how strong the enforcement is in various implementations, though I believe the name strings go through some kind of Unicode Normalization before the dataset is written, and that process I would guess makes some UTF-8 assumptions.)

HDF5 also assumes ASCII or UTF-8 though it doesn't sound like it is necessarily enforced. Here is a comment in another CF discussion describing how HDF handles strings. The overall discussion is about variable and attribute values rather than their names so not actually sure if HDF5 treats variable names and such the same as the data values.

@ajelenak-thg Can you tell us if the string handling you describe in the CF comment linked above applies to variable names as well as values?

@ajelenak
Copy link

@ajelenak-thg Can you tell us if the string handling you describe in the CF comment linked above applies to variable names as well as values?

Yes, it does. By default, HDF5 library assumes ASCII for attribute and link names (HDF5 links are what gives names to HDF5 objects in a file). The other option is UTF-8, and that is set with the H5Pset_char_encoding() function. In both cases what is stored are just bytes. The H5Pget_char_encoding() function provides the information which character set to decode those bytes to.

V-F and others added 2 commits February 3, 2020 08:09
# Conflicts:
#	cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5headerNew.java
@V-F
Copy link
Contributor Author

V-F commented Feb 13, 2020

To summarize the discussion: I need to change the implementation and use the "charset" as a "valueCharset" i.e. only for values, as required by the specification netCDF dimension, variable and attribute names should be UTF-8 encoded strings. Right?

@lesserwhirls
Copy link
Collaborator

Hello @V-F! Yes, I believe that is where things stand. I'm thinking that would mean we would want to keep everything up to commit de4f6d0d69c6777d083266ee738c00cb36aeab62, rebase to pick up the changes from master, and run ./gradlew spotlessApply to catch any style issues. If that sounds correct, I'd be happy to do that on my end and push those changes to your branch for a final check.

@V-F
Copy link
Contributor Author

V-F commented Feb 13, 2020

Hello @lesserwhirls!
Unfortunately, I have to fix the code, since the commit 8c86aa3 only change the name of the variable "valueCharset" to the "charset". It is necessary to analyze each use of the readString() method and reload it for reading the values.

…ince it is applied only by reading attribute values.

MOD H4iosp, H4header: renamed "charset" to the "valueCharset" since it is applied only by reading attribute values, and also text of the TagText, TagAnnotate and TagTextN.
MOD H5iospNew, H5headerNew: renamed "charset" to the "valueCharset" since it is applied only by reading attribute values.
@V-F
Copy link
Contributor Author

V-F commented Feb 13, 2020

I've renamed member variable "charset" to the "valueCharset".
In the N3headerNew the defined charset is used only by reading attribute values.
In the H4header the defined charset used by reading attribute values, by reading text in the TagText, TagAnnotate and TagTextN (name, classname, fld_name is read with UTF_8). Sructured metadata (readStructMetadata) is also read with UTF_8.
In the H5headerNew the defined charset is used only by reading attribute values.

@V-F
Copy link
Contributor Author

V-F commented Feb 27, 2020

Hello @lesserwhirls!
Could you please check the changes that I made. Thanks in advance.

P.S. I have errors with 2 local tests: "testThreading1" and "testThreadingN":
java.lang.AssertionError
at ucar.httpservices.HTTPConnections.validate(HTTPConnections.java:98)
at ucar.httpservices.HTTPSession.validatestate(HTTPSession.java:1360)
at ucar.nc2.util.net.TestThreading.testThreading1(TestThreading.java:178) / at ucar.nc2.util.net.TestThreading.testThreadingN(TestThreading.java:157)
I think these errors aren't related to my changes.

@lesserwhirls
Copy link
Collaborator

Greetings @V-F! No, those errors are not related to your changes, and as I found when moving from Travis CI to GitHub Actions, are actually dependent on the order the tests are run (fixed in #214). I will give your PR a run on our Jenkins instance, which runs the full suite of tests (takes over an hour or so).

@lesserwhirls
Copy link
Collaborator

Jenkins looks good (no new failures), so I think we've got it. Thank you @V-F and @JSchnabel for seeing this through!

@lesserwhirls lesserwhirls merged commit 7c76c60 into Unidata:master Mar 1, 2020
@V-F V-F deleted the feature-supporting_of_different_charsets branch March 1, 2020 14:33
@V-F
Copy link
Contributor Author

V-F commented Mar 1, 2020

Thank you.

@JohnLCaron
Copy link
Collaborator

Sorry, Im just looking at this feature now.

  1. So there's no way to detect that a NetcdfFile uses a nonstandard charset?
  2. Do you assume all values use the same charset?
  3. What about adding a Convention for this going forward? One could wrap such a file in NcML to add the Convention if desired, which means standard (java) tooling would work. If the calling routine passes the Charset in as in this PR, perhaps a global attribute should be added, so a copy of the file has it.
  4. Did anything become of a Uniplot CDH iosp?

@JohnLCaron
Copy link
Collaborator

Also, how does the C library handle non standard charsets?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants