Interpretation of reserved _Unsigned attribute written by netCDF-Java #656

deeplycloudy · 2017-05-04T20:58:47Z

Unsigned integer data written by NetCDF-Java does not use an unsigned integer NetCDF type (e.g. uint16), but rather sets the reserved attribute _Unsigned on a (signed) int16 type. The result for the user of netcdf4-python is data which must be manually corrected after being read in.

It is my understanding that netCDF-C tries to "do no magic" so this is not a netcdf4-python bug per-se. #493 also notes the preference of netCDF4-python for leaving metadata interpretation to downstream applications. However, this is not a metadata convention in the CF sense, but rather a documented low level implementation decision within netCDF-Java: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/Netcdf4.html

Differences between netCDF-4 C and Java libraries for netCDF4 files
Unsigned types
The C library uses unsigned integer types: NC_UBYTE, NC_USHORT, NC_UINT, NC_UINT64.
The Java library does not have separate types for unsigned integers, but adds the reserved attribute _Unsigned = "true" when the variable is unsigned. One can check this with Variable.isUnsigned(), Attribute.isUnsigned(), and Array.isUnsigned(). Conversions done by the library are aware of this convention. Java does not have unsigned types, and we dont want to double the internal memory requirements by widening the data.

I am in favor of a feature to assist the users in correcting for signendess on read when this attribute is present. One possible route:

Add a method set_auto_signededness(True) to request the conversion be performed.
If the above is set, perform the conversion if requested within the read slice method.

I would be less in favor of adding a method to handle the conversion; it means every time a user reads data they must remember to call that method. But that would be more helpful than having every user write their own when encountering unsigned data written by netCDF-Java.

Thanks to @lesserwhirls for helping me understand some of the things above; any errors remain mine!

The text was updated successfully, but these errors were encountered:

jswhit · 2017-05-07T15:41:23Z

What exactly does the java library do when it reads data with _Unsigned=True? Does the C library recognize the this special attribute? - I don't see any mention of it in the NUG.

jswhit · 2017-05-07T15:52:10Z

I suppose we could return a view of the array with the appropriate unsigned dtype if the variable has _Unsigned=True. I don't like the idea of adding a set_auto_signededness - if the attribute is set there should be no ambiguity about what is expected.

On the other hand, it's not very hard for a user to just do data = data.view(np.uint32)...

I have to ask though - what is the point? Making a view or a copy to an unsigned type with the same number of bytes is not going the change the data at all.

deeplycloudy · 2017-05-08T04:19:39Z

The C library does not check for _Unsigned. That is in keeping with the "do no magic" principle relayed to me by @lesserwhirls. The string does not appear in code in the netcdf-c repository.

It looks like the java library checks for the _Unsigned attribute, and then calls either the signed or unsigned read functions as appropriate. See the code here. I believe these are calls to the (wrapped) C library functions. If you poke around the Java code, you find that thredds suppresses the _Unsigned attribute in some (all?) circumstances, perhaps because it's been dealt with by the IO code. I also found some further docs on the _Unsigned convention.

I agree that the data.view(np.uint32) operation is not tricky.

I suppose it comes down to the philosophy of this library:

Eliminate as many of the machine-interpretable steps as possible so as to deliver science-relevant numbers in an array.
Act as a mostly transparent wrapper around netcdf-c, serving as a byte delivery service.
I think this is a case for the former, because it corrects for a result of the machine encoding process.

Recent experience indicates that several experienced scientists who regularly use NetCDF but were dealing with data from a new instrument were surprised by the current behavior, to the level that we were worried we had corrupt data.

If a decision is made to automatically correct for signedness, it might be helpful for old user code (that already checks / corrects for the attribute) to have a way to toggle automatic interpretation. That would be the main purpose of set_auto_signedeness.

shoyer · 2017-05-08T05:51:30Z

@deeplycloudy Have you raised this issue with the maintainers of NetCDF-C? If this is a problem for Python users, it is certainly an issue for users of C-API as well, and the fixed would probably be more appropriately applied upstream And if the NetCDF-C maintainers don't think this is a good idea, we should consider why the Python interface should be any different. NetCDF-Python is not strictly a transparent wrapper around netCDF-C, but that's absolutely a nice property to preserve if possible.

Java does not have unsigned types, and we dont want to double the internal memory requirements by widening the data.

I don't know very much about Java, but this seems like a totally backwards solution to me. I can understand that this may have been a convenient choice, but the mere fact that Java does not have built-in unsigned types does not mean it cannot write unsigned data in the standard way -- it just makes it more difficult. By the same reasoning, Python cannot handle integers with fixed size, because the built-in int has unlimited size. Using an ad-hoc language specific convention is clearly a pretty terrible idea if you value cross-compatibility, and it isn't something that should be done silently. So conversely, have you brought this up with the netCDF-Java developers?

lesserwhirls · 2017-05-08T14:17:55Z

So I'm afraid I didn't give @deeplycloudy the whole picture on this (as I wasn't 100% clear myself). Much of this was done before my time, but here is the way I understand the situation.

The _Unsigned attribute came about when trying to represent other data models in netCDF-3 (for example, HDF5). This is a language agnostic question. In terms of reading, netCDF-Java decided to not widen unsigned data (one choice given the lack of native unsigned data types in Java), and has several methods that "do the right thing" to represent unsigned data, based on the presence of the _Unsigned attribute. When writing data in netCDF-3, the question became what do to with unsigned types, which are not directly supported by the netCDF-3 data model. The decision was made to keep the bit pattern of the unsigned type, write the data as a signed type, and add the _Unsigned attribute to indicate to the reader that the data are unsigned. Note that all of this predates netCDF-4 and CDF5, both of which support the unsigned types in their data model and can be read by netCDF-C. The _Unsigned convention is also used in the netcdf4 classic model. While the _Unsigned attribute is mentioned in the netCDF "Best Practices" guide (part of the netCDF-C docs), it is only one of a few choices. netCDF-C leaves it up to the downstream application as to how to handle unsigned data.

Fast-forward several years, and netCDF-Java bit the bullet and wrapped the c library to support writing netCDF-4 files (it became clear that the HDF C library would be the only way forward to write HDF5 files, for several reasons). Although Java does not have native unsigned types, netCDF-Java does write unsigned types properly as long as they are defined as such in the CDM. That is, when writing a netCDF-4 (extended data model) file (which is done through the C library), netCDF-Java will write unsigned types and does not need to use the _Unsigned convention.

It is possible, of course, that someone could use that convention when writing out a netCDF-4 extended data model file, but they certainly do not need to do so. If a developer is upgrading I/O code to write netCDF-4 extended model files, it's possible they overlooked the code for handling unsigned types that was needed in netCDF-3 (or netCDF-4 classic) and continue to write out unsigned data in the same way they used to.

In talks with the netCDF-C group about adding the ability to convert unsigned data in netCDF-3 (netCDF-4 classic) at the C level, we decided against it as the library would be changing the data type as encoded in the file - netCDF-C wants to represent as closely as possible what is actually encoded in the file, and leave the rest to downstream application (that is, do no magic or as little as possible).

So, bottom line (tl;dr): the _Unsigned attribute came about when trying to represent other data models in netCDF-3 (before netCDF-4, CDF5 existed), and was chosen over widening the data. Once netCDF-Java gained the ability to write data in a format that supported unsigned types, it did the correct thing (i.e. write unsigned types as unsigned without the need to use the _Unsigned attribute). Even though they are using data formats that support unsigned types, it appears some data providers continue to write unsigned data using the _Unsigned convention and @deeplycloudy is lucky enough to have some of those data.

The question at hand is what support, if any, should netCDF4-python give users in dealing with unsigned types written as signed types in netCDF-3 or netCDF-4 classic, as well as unsigned types encoded using the _Unsigned convention in netCDF-4 extended (even though it's completely unnecessary). To me, this sort of feels like the set_auto_* methods available on Dataset.

rsignell-usgs · 2017-05-08T15:09:38Z

@lesserwhirls, in the Writing NetCDF files: Best Practices section of the NetCDF Users Manual (NUG) it says:

A new proposed convention is to create a variable attribute _Unsigned = "true" to indicate that integer data should be treated as unsigned.

Does "new proposed convention" mean it's not actually an approved NUG convention yet?

jswhit · 2017-05-08T16:10:15Z

Does anyone have a sample file with _Unsigned=True so I can see what happens when netcdf4-python reads the data as is?

shoyer · 2017-05-08T16:11:44Z

OK, this makes more sense. Thanks @lesserwhirls for providing context here. NetCDF3 doesn't support unsigned integers, but NetCDF-Java does. For reference, if you try to save an unsigned integer to a NetCDF3 file with NetCDF4-Python or SciPy, you get an error message.

lesserwhirls · 2017-05-08T16:42:28Z

Here are two very simple netCDF files (netCDF-3 and netCDF-4 extended) with a byte and unsigned byte variable, as generated by netCDF-Java:

ubyte.nc3.zip
ubyte.nc4.zip

Here is the CDL for ubyte.nc4:

netcdf ubyte {
dimensions:
	d = 2 ;
variables:
	ubyte ub(d) ;
	byte sb(d) ;
data:

 ub = 0, 255 ;

 sb = -128, 127 ;
}

and for ubyte.nc3:

netcdf ubyte {
dimensions:
	d = 2 ;
variables:
	byte ub(d) ;
		ub:_Unsigned = "true" ;
	byte sb(d) ;
data:

 ub = 0, -1 ;

 sb = -128, 127 ;
}

jswhit · 2017-05-08T16:46:55Z

OK, it looks like creating a view of the data with dtype = uint8 does the right thing.

    >>> from netCDF4 import Dataset
    >>> f = Dataset('ubyte.nc3')
    >>> f
    <type 'netCDF4._netCDF4.Dataset'>
    root group (NETCDF3_CLASSIC data model, file format NETCDF3):
        dimensions(sizes): d(2)
        variables(dimensions): int8 ub(d), int8 sb(d)
        groups:
    >>> v = f['ub']
    >>> d = v[:]
    >>> d
    array([ 0, -1], dtype=int8)
    >>> import numpy as np
    >>> d.view(np.uint8)
    array([  0, 255], dtype=uint8)
    >>>

So, if the variable has an attribute _Unsigned=True and has a signed integer type, netcdf4-python could return a view of the data with the corresponding unsigned integer type.

I don't see any downsides to this right off the bat - surely this is what any user would want?

jswhit · 2017-05-08T19:14:11Z

went ahead and implemented this in pull request #658

deeplycloudy · 2017-05-08T19:47:26Z

I'll echo @shoyer's thanks for the context, @lesserwhirls. I agree with @jswhit that this is what any user would want. I'd be glad to test the PR with the dataset where I originally noticed this problem.

Is there a way to have conda use the travis builds locally so I can test this without setting up a build environment for the NetCDF stack?

dopplershift · 2017-05-08T21:03:09Z

@deeplycloudy sadly, no. It should be enough just to build netcdf4-python though, which isn't too bad IIRC.

deeplycloudy · 2017-05-08T21:04:43Z

@dopplershift Yeah, I was able to build it from source. Thanks for the reply!

jswhit · 2017-05-08T22:17:38Z

Two things are making me reconsider this pull request.

@shoyer pointed out that xarray wants the Variable dtype to be the same as the dtype of the data returned.
@deeplycloudy pointed out that if there is an interaction between the conversion to unsigned data and the application of scale and offset unpacking of integers.

Since it's so simple for the user to create a view with an unsigned dtype, perhaps it's better to avoid adding this 'magic' in the library. I'm sure there will be other unintended consequences.

deeplycloudy · 2017-05-09T02:45:54Z

Where there is documented, machine-readable, actionable metadata it is a delightful thing when software works to eliminate it for the end user. That has generally been my experience with netcdf4-python and xarray, so I'm hesitant to give up on this.

In cases where there is a scale factor and offset, it's then up to the user to both know to take an unsigned view of the data and apply the scaling. It's no longer a one-liner.

I noted a fix for the scale factor problem on the PR.

lesserwhirls · 2017-05-09T15:59:07Z

Not to hijack this issue, but...

Interestingly enough, the NUG (as well as CF) mentions that the attributes valid_min and valid_max can be used to indicate the signedness (signedness is a deprecated attribute for this). Also, under the best practices doc, although it's stated that

A new proposed convention is to create a variable attribute _Unsigned = "true" to indicate that integer data should be treated as unsigned.

the text goes on to say, later in the same document:

A conventional way to indicate whether a byte, short, or int variable is meant to be interpreted as unsigned, even for the netCDF-3 classic model that has no external unsigned integer type, is by providing the special variable attribute _Unsigned with value "true".

(this section also talks about what to do to packed data). Then again, _Unsigned isn't listed in the attribute conventions appendix. Not quite sure what to think.

@WardF, @DennisHeimbigner - what do you guys think?

deeplycloudy · 2017-05-09T21:32:39Z

@lesserwhirls the valid_min and valid_max issue has been mentioned in #493 as well.

recognize _Unsigned attribute and return unsigned integer data (issue #656)

jswhit · 2017-05-12T17:37:33Z

pull request merged - data is now returned as a view to an unsigned int by default if _Unsigned=True, can be disabled with set_autoscale(False).

jswhit added a commit that referenced this issue May 8, 2017

implement suggestion for dealing with _Unsigned attribute in issue #656

98e19ab

jswhit added a commit that referenced this issue May 8, 2017

add test for issue #656

8c030a1

jswhit added a commit that referenced this issue May 12, 2017

Merge pull request #658 from Unidata/issue656

aefa43e

recognize _Unsigned attribute and return unsigned integer data (issue #656)

jswhit closed this as completed May 12, 2017

deeplycloudy mentioned this issue Jun 6, 2017

Best practice when the _Unsigned attribute is present in NetCDF files pydata/xarray#1444

Closed

5tefan mentioned this issue Jun 6, 2017

Fallout of _Unsigned = "true": variable.dtype != variable[:].dtype #675

Closed

deeplycloudy mentioned this issue Jul 17, 2017

Automate interpretation of _Unsigned attribute pydata/xarray#1453

Merged

4 tasks

aleksandervines mentioned this issue Jan 11, 2018

Check CF-compliance of test files nansencenter/nansat#184

Closed

visr mentioned this issue Jun 27, 2021

use _Unsigned attribute Alexander-Barth/NCDatasets.jl#133

Open

Alexander-Barth mentioned this issue Jun 28, 2021

status of variable attribute _Unsigned = "true" as best practices Unidata/netcdf#40

Open

snowman2 mentioned this issue Sep 13, 2022

BUG: Support integer data types for netCDF _Unsigned convention OSGeo/gdal#6352

Closed

jswhit mentioned this issue Feb 12, 2023

Inconsistent Loading Results for netCDF4 version #1232

Closed

kmuehlbauer mentioned this issue Apr 27, 2023

xr.open_dataset() reading ubyte variables as float32 from DAP server pydata/xarray#7782

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interpretation of reserved _Unsigned attribute written by netCDF-Java #656

Interpretation of reserved _Unsigned attribute written by netCDF-Java #656

deeplycloudy commented May 4, 2017 •

edited

Loading

jswhit commented May 7, 2017

jswhit commented May 7, 2017 •

edited

Loading

deeplycloudy commented May 8, 2017

shoyer commented May 8, 2017

lesserwhirls commented May 8, 2017

rsignell-usgs commented May 8, 2017

jswhit commented May 8, 2017

shoyer commented May 8, 2017

lesserwhirls commented May 8, 2017

jswhit commented May 8, 2017 •

edited

Loading

jswhit commented May 8, 2017

deeplycloudy commented May 8, 2017

dopplershift commented May 8, 2017

deeplycloudy commented May 8, 2017

jswhit commented May 8, 2017

deeplycloudy commented May 9, 2017

lesserwhirls commented May 9, 2017

deeplycloudy commented May 9, 2017

jswhit commented May 12, 2017

Interpretation of reserved _Unsigned attribute written by netCDF-Java #656

Interpretation of reserved _Unsigned attribute written by netCDF-Java #656

Comments

deeplycloudy commented May 4, 2017 • edited Loading

jswhit commented May 7, 2017

jswhit commented May 7, 2017 • edited Loading

deeplycloudy commented May 8, 2017

shoyer commented May 8, 2017

lesserwhirls commented May 8, 2017

rsignell-usgs commented May 8, 2017

jswhit commented May 8, 2017

shoyer commented May 8, 2017

lesserwhirls commented May 8, 2017

jswhit commented May 8, 2017 • edited Loading

jswhit commented May 8, 2017

deeplycloudy commented May 8, 2017

dopplershift commented May 8, 2017

deeplycloudy commented May 8, 2017

jswhit commented May 8, 2017

deeplycloudy commented May 9, 2017

lesserwhirls commented May 9, 2017

deeplycloudy commented May 9, 2017

jswhit commented May 12, 2017

deeplycloudy commented May 4, 2017 •

edited

Loading

jswhit commented May 7, 2017 •

edited

Loading

jswhit commented May 8, 2017 •

edited

Loading