Feature supporting of different charsets #184

V-F · 2020-01-14T14:39:47Z

Added support for defining a custom charset when reading netCDF files, since some tools write text in the local charset instead of UTF-8.
The custom charset is set by sending it as a message to the IOSP object. If no charset is set, the default UTF-8 charset is used, as before.
Added logic to simplify creating a new IOSP/header for a custom netCDF-3 file format: the Uniplot CDH format uses the netCDF-3 file format with a "CDH" magic, little-endian byte order, and ISO 8859-1 encoding.
These changes partially duplicate the changes from this PR.
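
The charset message described above can be illustrated with a minimal, self-contained sketch. The class `CharsetAwareReaderSketch` and its `sendMessage` and `readString` methods are hypothetical stand-ins invented for this illustration; they are not the netcdf-java IOSP classes touched by this PR.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an IOSP; not a netcdf-java class.
final class CharsetAwareReaderSketch {
  // UTF-8 stays the default when no charset message has been sent.
  private Charset charset = StandardCharsets.UTF_8;

  // A Charset message selects the decoding; other messages are ignored.
  // Returns true if the message was understood.
  boolean sendMessage(Object message) {
    if (message instanceof Charset) {
      charset = (Charset) message;
      return true;
    }
    return false;
  }

  // Decodes raw bytes with the currently selected charset.
  String readString(byte[] raw) {
    return new String(raw, charset);
  }

  public static void main(String[] args) {
    // "Müller" as written by a tool that used ISO 8859-1 (0xFC is 'ü').
    byte[] raw = {'M', (byte) 0xFC, 'l', 'l', 'e', 'r'};
    CharsetAwareReaderSketch reader = new CharsetAwareReaderSketch();
    // With the UTF-8 default, the 0xFC byte cannot be decoded correctly.
    System.out.println(reader.readString(raw));
    reader.sendMessage(StandardCharsets.ISO_8859_1);
    System.out.println(reader.readString(raw));
  }
}
```

Keeping UTF-8 as the default means callers that never send a charset see the same behavior as before.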

Fixed a bug in the creation of a Dimension in H5headerNew::addDimension: the unlimited flag is now set before the length. Otherwise, an IllegalArgumentException is thrown for an unlimited dimension with length == 0.
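
Why the order matters can be seen in a small sketch. `DimensionBuilderSketch` is a hypothetical builder written for this illustration, assuming that the length check depends on the unlimited flag; the validation in the real netcdf-java builder may differ in detail.

```java
// Hypothetical builder; the real netcdf-java Dimension builder may differ.
final class DimensionBuilderSketch {
  private boolean unlimited;
  private int length = 1;

  DimensionBuilderSketch setIsUnlimited(boolean unlimited) {
    this.unlimited = unlimited;
    return this;
  }

  // Length 0 is accepted only if the dimension is already marked unlimited.
  DimensionBuilderSketch setLength(int length) {
    int min = unlimited ? 0 : 1;
    if (length < min) {
      throw new IllegalArgumentException("length must be >= " + min + ", was " + length);
    }
    this.length = length;
    return this;
  }

  int length() {
    return length;
  }

  public static void main(String[] args) {
    // Flag first, then length: an empty unlimited dimension is accepted.
    new DimensionBuilderSketch().setIsUnlimited(true).setLength(0);
    try {
      // Length first: the builder still treats the dimension as fixed-size.
      new DimensionBuilderSketch().setLength(0).setIsUnlimited(true);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```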

…charset. MOD H5iosp, H5headerNew, H4iosp, H4header, N3iosp, N3header: added logic for defining charset for reading netcdf files. This definition is made by sending charset as a message to the IOSP object. If charset was not set the default UTF-8 charset will be used as before. The definition of charset is needed because there are tools that create netcdf-files without considering of usage of UTF-8 charset and just use the local charset to write text. Added logic for simple creation of new iosp/header for different netcdf-3 file format: Uniplot CDH format that is netcdf3 with "CDH" magic, little endian byte order and ISO 88591-1 encoding.

… of length. Otherwise for unlimited dimension with length == 0 IllegalArgumentException will be thrown.

claassistantio · 2020-01-14T14:44:51Z

All committers have signed the CLA.

… fixed formatting of java-doc comments.

lesserwhirls · 2020-01-25T03:13:55Z

Thank you for your contribution @V-F!

I just want to make sure I have a good understanding before proceeding with feedback.

From what I can tell, this basically enables IOSPs to read strings from files that potentially contain non-UTF-8 strings (default still UTF-8). Then, the HDF4, HDF5, and netCDF3 IOSPs are modified to allow for the presence of non-UTF-8 strings (with the default remaining UTF-8). In this case, it looks like consideration is only given to data and attribute values, but not object names. The control of the charset used is done via the sendIospMessage() method. All of this is to enable new IOSPs in support of netCDF- and HDF-based files where strings were not encoded using UTF-8. Overall I think this makes sense. Finally, as a bonus, a bug fix when using the new builder-based API (thank you!).

Does that sound correct?

V-F · 2020-01-25T10:21:54Z

Absolutely. Thank you for your correction.

lesserwhirls

Just a few comments (some repeated across the various classes). Have you had a chance to look at the CLA? It looks like github isn't able to link the email address in your commits with your github account, so signing electronically will take a few additional steps. Once these things are taken care of, we'll be able to merge this in.

cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf4/H4header.java

cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf4/H4iosp.java

cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5headerNew.java

cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5iospNew.java

cdm/core/src/main/java/ucar/nc2/internal/iosp/netcdf3/N3iospNew.java

cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf4/H4iosp.java

ethanrd · 2020-01-28T22:43:19Z

Does the Charset defined here only apply to data values or does it also affect netCDF variable, dimension, attribute, or group names? NetCDF object names are restricted to ASCII and UTF-8 (with some additional restrictions on particular characters) as described in the NUG section on "Characters in NetCDF Names".

This caught my eye because there is a conversation about characters allowed in netCDF object names going on in the CF Conventions repo, CF Issue #237.

…etValueCharset method. MOD N3iospNew, H4iosp, H5iospNew: added @nullable annotation for the parameter of the new added setValueCharset method.

… renamed class member valueCharset to the charset since it is used by reading not only values, but for each string.

V-F · 2020-01-29T07:55:10Z

> Does the Charset defined here only apply to data values or does it also affect netCDF variable, dimension, attribute, or group names?

Initially we implemented a minimally invasive solution that applied the charset only to values, not to names, since we found no example files with names containing special characters such as umlauts. So we couldn't determine whether the generator used UTF-8 or ISO-8859-1 to encode names.
In the end, we decided to apply the charset to all strings for consistency, and renamed "valueCharset" to "charset" accordingly.

…ions (spotlessApply).

lesserwhirls · 2020-01-30T13:24:10Z

Thank you for the update @V-F! I'm wondering, could there be a possible case where the library in charge of writing enforced encoding of object names (say, UTF-8), but the array values were encoded differently? It feels like we should give the flexibility for object names and values to have a different encoding.

lesserwhirls · 2020-01-30T13:27:03Z

> This caught my eye because there is a conversation about characters allowed in netCDF object names going on in the CF Conventions repo, CF Issue #237.

Here is another relevant issue @ethanrd (Unidata/netcdf-c#402)

JSchnabel · 2020-01-30T15:24:16Z

> Thank you for the update @V-F! I'm wondering, could there be a possible case where the library in charge of writing enforced encoding of object names (say, UTF-8), but the array values were encoded differently? It feels like we should give the flexibility for object names and values to have a different encoding.

@lesserwhirls It might help if one can be sure the object names are encoded in UTF-8 regardless of the value encoding (e.g. if the reader doesn't support the value charset). In this case I think the configurable charset should be restricted to the values, and @V-F should go back to defining a value charset that is applied only to values and not to object names. For the example files that we (@V-F and I) know of, it doesn't matter whether the names are encoded in UTF-8 or ISO-8859, but ISO-8859 is used for the values. Since we are not really familiar with netCDF, perhaps you should decide this issue?
Do you see a use case where it may help to specify different encodings for values and names?

ethanrd · 2020-01-30T17:53:05Z

The specification for netCDF Classic format is pretty clear that netCDF dimension, variable, and attribute names should be UTF-8 encoded strings. (I say should rather than must because I'm not sure how strong the enforcement is in various implementations, though I believe the name strings go through some kind of Unicode Normalization before the dataset is written, and that process I would guess makes some UTF-8 assumptions.)

HDF5 also assumes ASCII or UTF-8 though it doesn't sound like it is necessarily enforced. Here is a comment in another CF discussion describing how HDF handles strings. The overall discussion is about variable and attribute values rather than their names so not actually sure if HDF5 treats variable names and such the same as the data values.

@ajelenak-thg Can you tell us if the string handling you describe in the CF comment linked above applies to variable names as well as values?

ajelenak · 2020-01-31T15:50:44Z

> @ajelenak-thg Can you tell us if the string handling you describe in the CF comment linked above applies to variable names as well as values?

Yes, it does. By default, the HDF5 library assumes ASCII for attribute and link names (HDF5 links are what give names to HDF5 objects in a file). The other option is UTF-8, and that is set with the H5Pset_char_encoding() function. In both cases what is stored are just bytes. The H5Pget_char_encoding() function tells which character set should be used to decode those bytes.

Pull request

# Conflicts: # cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5headerNew.java

V-F · 2020-02-13T08:22:11Z

To summarize the discussion: I need to change the implementation and use the "charset" as a "valueCharset", i.e. only for values, since according to the specification netCDF dimension, variable, and attribute names should be UTF-8 encoded strings. Right?

lesserwhirls · 2020-02-13T12:45:07Z

Hello @V-F! Yes, I believe that is where things stand. I'm thinking that would mean we would want to keep everything up to commit de4f6d0d69c6777d083266ee738c00cb36aeab62, rebase to pick up the changes from master, and run ./gradlew spotlessApply to catch any style issues. If that sounds correct, I'd be happy to do that on my end and push those changes to your branch for a final check.

V-F · 2020-02-13T13:13:12Z

Hello @lesserwhirls!
Unfortunately, I have to fix the code, since commit 8c86aa3 only changes the name of the variable "valueCharset" to "charset". It is necessary to analyze each use of the readString() method and overload it for reading the values.

…ince it is applied only by reading attribute values. MOD H4iosp, H4header: renamed "charset" to the "valueCharset" since it is applied only by reading attribute values, and also text of the TagText, TagAnnotate and TagTextN. MOD H5iospNew, H5headerNew: renamed "charset" to the "valueCharset" since it is applied only by reading attribute values.

V-F · 2020-02-13T14:50:39Z

I've renamed the member variable "charset" to "valueCharset".
In N3headerNew, the defined charset is used only when reading attribute values.
In H4header, the defined charset is used when reading attribute values and the text in TagText, TagAnnotate, and TagTextN (name, classname, and fld_name are read with UTF_8). Structured metadata (readStructMetadata) is also read with UTF_8.
In H5headerNew, the defined charset is used only when reading attribute values.
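
The split summarized above (names always UTF-8, attribute values in the configurable charset) can be sketched as follows. `ValueCharsetSketch`, `readName`, and `readAttributeValue` are hypothetical names chosen for this illustration and are not taken from the header classes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the N3headerNew, H4header, or H5headerNew code.
final class ValueCharsetSketch {
  private final Charset valueCharset;

  // A null charset falls back to the UTF-8 default.
  ValueCharsetSketch(Charset valueCharset) {
    this.valueCharset = valueCharset != null ? valueCharset : StandardCharsets.UTF_8;
  }

  // Object names are always decoded as UTF-8, whatever the value charset is.
  String readName(byte[] raw) {
    return new String(raw, StandardCharsets.UTF_8);
  }

  // Attribute values are decoded with the configurable value charset.
  String readAttributeValue(byte[] raw) {
    return new String(raw, valueCharset);
  }

  public static void main(String[] args) {
    ValueCharsetSketch reader = new ValueCharsetSketch(StandardCharsets.ISO_8859_1);
    byte[] name = "Gr\u00F6\u00DFe".getBytes(StandardCharsets.UTF_8);
    byte[] value = "K\u00F6ln".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(reader.readName(name) + " = " + reader.readAttributeValue(value));
  }
}
```

Keeping the two decoders separate lets a reader rely on UTF-8 names even when it is given a value charset that only matches the data.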

Merge

V-F · 2020-02-27T13:18:14Z

Hello @lesserwhirls!
Could you please check the changes that I made. Thanks in advance.

P.S. I have errors with 2 local tests: "testThreading1" and "testThreadingN":
java.lang.AssertionError
at ucar.httpservices.HTTPConnections.validate(HTTPConnections.java:98)
at ucar.httpservices.HTTPSession.validatestate(HTTPSession.java:1360)
at ucar.nc2.util.net.TestThreading.testThreading1(TestThreading.java:178) / at ucar.nc2.util.net.TestThreading.testThreadingN(TestThreading.java:157)
I think these errors aren't related to my changes.

lesserwhirls · 2020-02-29T14:08:59Z

Greetings @V-F! No, those errors are not related to your changes, and as I found when moving from Travis CI to GitHub Actions, are actually dependent on the order the tests are run (fixed in #214). I will give your PR a run on our Jenkins instance, which runs the full suite of tests (takes over an hour or so).

lesserwhirls · 2020-03-01T14:31:44Z

Jenkins looks good (no new failures), so I think we've got it. Thank you @V-F and @JSchnabel for seeing this through!

V-F · 2020-03-01T14:37:32Z

Thank you.

JohnLCaron · 2020-03-31T01:01:04Z

Sorry, I'm just looking at this feature now.

So there's no way to detect that a NetcdfFile uses a nonstandard charset?
Do you assume all values use the same charset?
What about adding a Convention for this going forward? One could wrap such a file in NcML to add the Convention if desired, which means standard (java) tooling would work. If the calling routine passes the Charset in as in this PR, perhaps a global attribute should be added, so a copy of the file has it.
Did anything become of a Uniplot CDH iosp?

JohnLCaron · 2020-03-31T01:10:05Z

Also, how does the C library handle nonstandard charsets?

Vladislav Fuks added 2 commits January 14, 2020 15:21

BUG H5headerNew: fixed definition of unlimited flag before definition…

8d27ca6

… of length. Otherwise for unlimited dimension with length == 0 IllegalArgumentException will be thrown.

BUG H5headerNew, H5iospNew, H4header, H4iosp, N3headerNew, N3iospNew:…

5d3a347

… fixed formatting of java-doc comments.

lesserwhirls reviewed Jan 28, 2020


Vladislav Fuks added 2 commits January 29, 2020 07:50

MOD N3headerNew, H4header, H5header: fixed java doc comment for the g…

de4f6d0

…etValueCharset method. MOD N3iospNew, H4iosp, H5iospNew: added @nullable annotation for the parameter of the new added setValueCharset method.

MOD N3headerNew, N3iospNew, H4header, H4iosp, H5headerNew, H5iospNew:…

8c86aa3

… renamed class member valueCharset to the charset since it is used by reading not only values, but for each string.

MOD N3iospNew, H4iosp, H5headerNew, H5iospNew: fixed some code violat…

504f695

…ions (spotlessApply).

V-F and others added 2 commits February 3, 2020 08:09

Merge pull request #1 from Unidata/master

3d31fe6

Pull request

Merge branch 'master' into feature-supporting_of_different_charsets

a4df0d2

# Conflicts: # cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5headerNew.java

V-F added 3 commits February 13, 2020 16:05

Merge pull request #2 from Unidata/master

8b9d74c

Merge

Merge pull request #3 from Unidata/master

d888bc5

Merge

Merge pull request #4 from V-F/master

7a1e506

Merge

lesserwhirls merged commit 7c76c60 into Unidata:master Mar 1, 2020

V-F deleted the feature-supporting_of_different_charsets branch March 1, 2020 14:33
