Interpretation of reserved _Unsigned attribute written by netCDF-Java #656
What exactly does the java library do when it reads data with |
I suppose we could return a view of the array with the appropriate unsigned dtype if the variable has On the other hand, it's not very hard for a user to just do I have to ask though - what is the point? Making a view or a copy to an unsigned type with the same number of bytes is not going to change the data at all. |
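To illustrate the point that a same-width reinterpretation leaves the stored bytes untouched, here is a minimal sketch using only Python's standard `struct` module (not netCDF4-python or numpy). The two values are made up for illustration.

```python
import struct

# Pack the signed 8-bit values 0 and -1 into raw bytes.
raw = struct.pack("2b", 0, -1)

# Interpret the very same bytes as signed and as unsigned 8-bit integers.
as_signed = struct.unpack("2b", raw)
as_unsigned = struct.unpack("2B", raw)

print(raw.hex())      # the stored bytes are identical either way
print(as_signed)      # signed interpretation
print(as_unsigned)    # unsigned interpretation
```

Only the interpretation differs; nothing about the bytes themselves changes.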
The C library does not check for It looks like the java library checks for the I agree that the I suppose it comes down to the philosophy of this library:
Recent experience indicates that several experienced scientists who regularly use NetCDF but were dealing with data from a new instrument were surprised by the current behavior, to the level that we were worried we had corrupt data. If a decision is made to automatically correct for signedness, it might be helpful for old user code (that already checks / corrects for the attribute) to have a way to toggle automatic interpretation. That would be the main purpose of |
@deeplycloudy Have you raised this issue with the maintainers of NetCDF-C? If this is a problem for Python users, it is certainly an issue for users of the C API as well, and the fix would probably be more appropriately applied upstream. And if the NetCDF-C maintainers don't think this is a good idea, we should consider why the Python interface should be any different. NetCDF-Python is not strictly a transparent wrapper around netCDF-C, but that's absolutely a nice property to preserve if possible.
I don't know very much about Java, but this seems like a totally backwards solution to me. I can understand that this may have been a convenient choice, but the mere fact that Java does not have built-in unsigned types does not mean it cannot write unsigned data in the standard way -- it just makes it more difficult. By the same reasoning, Python cannot handle integers with fixed size, because the built-in |
So I'm afraid I didn't give @deeplycloudy the whole picture on this (as I wasn't 100% clear myself). Much of this was done before my time, but here is the way I understand the situation. The Fast-forward several years, and netCDF-Java bit the bullet and wrapped the c library to support writing netCDF-4 files (it became clear that the HDF C library would be the only way forward to write HDF5 files, for several reasons). Although Java does not have native unsigned types, netCDF-Java does write unsigned types properly as long as they are defined as such in the CDM. That is, when writing a netCDF-4 (extended data model) file (which is done through the C library), netCDF-Java will write unsigned types and does not need to use the It is possible, of course, that someone could use that convention when writing out a netCDF-4 extended data model file, but they certainly do not need to do so. If a developer is upgrading I/O code to write netCDF-4 extended model files, it's possible they overlooked the code for handling unsigned types that was needed in netCDF-3 (or netCDF-4 classic) and continue to write out unsigned data in the same way they used to. In talks with the netCDF-C group about adding the ability to convert unsigned data in netCDF-3 (netCDF-4 classic) at the C level, we decided against it as the library would be changing the data type as encoded in the file - netCDF-C wants to represent as closely as possible what is actually encoded in the file, and leave the rest to downstream application (that is, do no magic or as little as possible). So, bottom line (tl;dr): the The question at hand is what support, if any, should netCDF4-python give users in dealing with unsigned types written as signed types in netCDF-3 or netCDF-4 classic, as well as unsigned types encoded using the |
@lesserwhirls, in the Writing NetCDF files: Best Practices section of the NetCDF Users Manual (NUG) it says:
Does "new proposed convention" mean it's not actually an approved NUG convention yet? |
Does anyone have a sample file with |
OK, this makes more sense. Thanks @lesserwhirls for providing context here. NetCDF3 doesn't support unsigned integers, but NetCDF-Java does. For reference, if you try to save an unsigned integer to a NetCDF3 file with NetCDF4-Python or SciPy, you get an error message. |
Here are two very simple netCDF files (netCDF-3 and netCDF-4 extended) with a byte and unsigned byte variable, as generated by netCDF-Java: Here is the CDL for ubyte.nc4:
and for ubyte.nc3:
|
OK, it looks like creating a view of the data with dtype = uint8 does the right thing.
>>> from netCDF4 import Dataset
>>> f = Dataset('ubyte.nc3')
>>> f
<type 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
dimensions(sizes): d(2)
variables(dimensions): int8 ub(d), int8 sb(d)
groups:
>>> v = f['ub']
>>> d = v[:]
>>> d
array([ 0, -1], dtype=int8)
>>> import numpy as np
>>> d.view(np.uint8)
array([ 0, 255], dtype=uint8)
>>>
So, if the variable has an attribute I don't see any downsides to this right off the bat - surely this is what any user would want? |
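A minimal sketch of the idea, assuming the variable's attributes are available as a plain dict and that the flag is stored as the string "true". The helper name `as_unsigned` and the `attrs` argument are hypothetical and are not part of netCDF4-python; this is not the code from the pull request.

```python
import numpy as np


def as_unsigned(data, attrs):
    """Return an unsigned view of `data` if `attrs` flags it as unsigned.

    Hypothetical helper: `attrs` stands in for a variable's attributes.
    """
    # Assumption: the attribute value is "true" (any case) when set.
    flag = str(attrs.get("_Unsigned", "false")).lower()
    if flag == "true" and data.dtype.kind == "i":
        # Keep byte order and item size; only switch signed -> unsigned.
        unsigned_dtype = np.dtype(data.dtype.str.replace("i", "u", 1))
        return data.view(unsigned_dtype)
    return data


ub = np.array([0, -1], dtype=np.int8)
print(as_unsigned(ub, {"_Unsigned": "true"}))  # reinterpreted as uint8
print(as_unsigned(ub, {}))                     # left as int8
```

Because `view` does not copy, the unsigned result shares memory with the array that was read.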
I went ahead and implemented this in pull request #658 |
I'll echo @shoyer's thanks for the context, @lesserwhirls. I agree with @jswhit that this is what any user would want. I'd be glad to test the PR with the dataset where I originally noticed this problem. Is there a way to have conda use the travis builds locally so I can test this without setting up a build environment for the NetCDF stack? |
@deeplycloudy sadly, no. It should be enough just to build netcdf4-python though, which isn't too bad IIRC. |
@dopplershift Yeah, I was able to build it from source. Thanks for the reply! |
Two things are making me reconsider this pull request.
Since it's so simple for the user to create a view with an unsigned dtype, perhaps it's better to avoid adding this 'magic' in the library. I'm sure there will be other unintended consequences. |
Where there is documented, machine-readable, actionable metadata it is a delightful thing when software works to eliminate it for the end user. That has generally been my experience with netcdf4-python and xarray, so I'm hesitant to give up on this. In cases where there is a scale factor and offset, it's then up to the user to both know to take an unsigned view of the data and apply the scaling. It's no longer a one-liner. I noted a fix for the scale factor problem on the PR. |
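To make the two manual steps concrete, here is a small sketch with made-up packed values and made-up `scale_factor` / `add_offset` numbers. It uses plain numpy arrays rather than a real file, and it is only an illustration of the ordering problem, not the fix noted on the PR.

```python
import numpy as np

# Made-up packed values and packing parameters, for illustration only.
raw = np.array([-1, 0, 1], dtype=np.int16)  # as read from the file (signed)
scale_factor = 0.5
add_offset = 10.0

# Step 1: reinterpret the stored bits as unsigned (no copy, same bytes).
unsigned = raw.view(np.uint16)

# Step 2: usual unpacking, unpacked = packed * scale_factor + add_offset.
unpacked = unsigned.astype(np.float64) * scale_factor + add_offset

# Skipping step 1 silently gives a different value for the first element.
wrong = raw.astype(np.float64) * scale_factor + add_offset

print(unpacked)
print(wrong)
```

If scaling is applied to the signed values first, the result is already floating point and an unsigned view can no longer recover the intended numbers, so the order of the two steps matters.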
Not to hijack this issue, but... Interestingly enough, the NUG (as well as CF) mentions that the attributes
the text goes on to say, later in the same document:
(this section also talks about what to do to packed data). Then again, @WardF, @DennisHeimbigner - what do you guys think? |
@lesserwhirls the valid_min and valid_max issue has been mentioned in #493 as well. |
recognize _Unsigned attribute and return unsigned integer data (issue #656)
pull request merged - data is now returned as a view to an unsigned int by default if _Unsigned=True, can be disabled with set_autoscale(False). |
Unsigned integer data written by NetCDF-Java does not use an unsigned integer NetCDF type (e.g. uint16), but rather sets the reserved attribute _Unsigned on a (signed) int16 type. The result for the user of netcdf4-python is data which must be manually corrected after being read in.
It is my understanding that netCDF-C tries to "do no magic" so this is not a netcdf4-python bug per se. #493 also notes the preference of netCDF4-python for leaving metadata interpretation to downstream applications. However, this is not a metadata convention in the CF sense, but rather a documented low level implementation decision within netCDF-Java: http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/Netcdf4.html
I am in favor of a feature to assist the users in correcting for signedness on read when this attribute is present. One possible route: set_auto_signededness(True) to request the conversion be performed.
I would be less in favor of adding a method to handle the conversion; it means every time a user reads data they must remember to call that method. But that would be more helpful than having every user write their own when encountering unsigned data written by netCDF-Java.
Thanks to @lesserwhirls for helping me understand some of the things above; any errors remain mine!