Problem interpreting short arrays #554
I just read through the xarray issue, and two things are not clear to me:
OK, it seems that it is a problem reading the raw short integer data. For some reason, the data returned by the netcdf C library looks different from the data returned by h5py. Can you let us know what versions of the netcdf-c and hdf5 libraries were used to create this dataset?
I think I understand the problem now.
So netcdf4-python reads the data into a big endian short integer array. I'm not sure yet whether this is a bug in the creation of the dataset, the Python interface, or the netcdf C library.
I now believe this is a bug in the dataset: the data is stored in HDF5 as big endian data, but it is actually little endian. ncview and h5py return the 'right' answer since they both assume the data is in the native endian format; netcdf4-python does not (it queries the file to determine the endianness).
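To illustrate the mismatch being described, here is a small stand-alone sketch (plain Python `struct`, not the actual netcdf4-python code) showing how the same raw bytes yield different short values depending on which byte order the reader assumes:

```python
import struct

# Three 16-bit shorts written in little endian order: 1, 2, 3.
raw = struct.pack('<3h', 1, 2, 3)

# Interpreted with the byte order they were actually written in:
print(struct.unpack('<3h', raw))  # (1, 2, 3)

# Interpreted as big endian, as a reader trusting a mislabeled
# file would be forced to do:
print(struct.unpack('>3h', raw))  # (256, 512, 768)
```

A client that assumes native (little endian) order gets the intended values; a client that honors the file's declared big endian order gets garbage, which matches the symptom in the images.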
Jeff, thanks so much for your time looking into this! I will report it to the data producers and let you know how the files were created.
I don't think the data is really little endian. Everything I see from the HDF5 side suggests that the data is indeed big endian. Here's what I see when I examine this file with
Note that there's no attribute. Likewise, h5py shows the data is big endian, and reads it into NumPy as a big endian array:
Strange! Still, the pattern in the images looks like a typical endianness problem.
The HDF5 variable was set to be big endian. However, neither netcdf-c nor HDF5 ensures that the byte order is big endian when you write the data to the variable. I suspect the data was written on a little endian machine, and the bytes were not swapped before writing to the variable. netcdf4-python does ensure that the byte order of the numpy array is consistent with the netcdf variable before writing to the file, but I doubt that many other netcdf clients do.
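The kind of consistency check being described can be sketched as follows (a hypothetical helper assuming numpy, not the actual netcdf4-python source):

```python
import sys
import numpy as np

def match_variable_order(arr, var_is_big_endian):
    """Hypothetical helper: byteswap a numpy array only when its byte
    order differs from the variable's declared on-disk order, so the
    bytes handed to the C library already match the file."""
    order = arr.dtype.byteorder
    arr_is_big = order == '>' or (order == '=' and sys.byteorder == 'big')
    if arr_is_big != var_is_big_endian:
        # Swap the bytes in memory and relabel the dtype to match.
        return arr.byteswap().view(arr.dtype.newbyteorder())
    return arr

data = np.array([1, 2, 3], dtype='<i2')        # native-order shorts
swapped = match_variable_order(data, var_is_big_endian=True)
print(swapped.tobytes().hex())  # 000100020003 -> big endian bytes
```

A client that skips this step on a little endian machine ends up storing little endian bytes in a variable labeled big endian, which is exactly the suspected state of the dataset here.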
The netcdf-c library does appear to do correct byte swapping; the problem appears to be that netCDF4-python attempts an additional byte swap on top. I have created a test file using:
Examining the output test.nc file I found one occurrence of the string
However, netcdf4-python does not read the file correctly:

>>> nc = netCDF4.Dataset('test.nc')
>>> nc.variables['little'][:]
66051
>>> nc.variables['big'][:]
50462976
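Note that the two values above are the same four bytes read in opposite orders, which can be verified with a quick stand-alone snippet (plain `struct`, independent of netCDF):

```python
import struct

# The four bytes 00 01 02 03, as stored in the test file.
raw = bytes([0x00, 0x01, 0x02, 0x03])

print(struct.unpack('>i', raw)[0])  # 66051     (0x00010203)
print(struct.unpack('<i', raw)[0])  # 50462976  (0x03020100)
```

So one of the two variables is coming back byte-swapped relative to what was written.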
OK - now I'm confused. The 'byteswap on write' was added when a user reported that this script did not work (a variant of this is included in test/tst_endian.py).
Using github master, this produces
If 'byteswap on write' is disabled, then you'll get
which is what h5py produces. Perhaps ncgen does the byteswapping, but a call to nc_put_var in the C library does not?
@oembury, your ncgen/ncdump example is consistent with the data being written little endian to both variables, regardless of the declared endianness. Perhaps we need one of the netcdf-c developers to chime in here. @WardF, can you confirm that neither the netcdf-c nor hdf5 library does any byte swapping? In other words, if you try to write to a big endian variable on a little endian machine, should the client swap the bytes before passing the data to nc_put_var?
netcdf-c always uses native endianness for variables in memory. If I write the two variables to separate files:
I find that little.nc contains the bytes: netcdf4-python should not have to do any byte-swapping when reading, as netcdf-c will convert the on-disk representation to native endianness. When writing, netcdf4-python (or the user) should ensure data is in native endianness before passing it to netcdf-c (i.e. byteswap if the numpy data is non-native). In both reading and writing, netcdf4-python should not need to know how the data is stored on disk.
This is exactly what netcdf4-python is doing: no byte swapping on read, and byte swapping on write if the byte order of the variable differs from the byte order of the numpy array. My hypothesis is that the OP is dealing with a file that was created without byte swapping the data from the native endian format to the on-disk format. Note that netcdf4-python does need to know the byte order of the variable when writing.
OK, I re-read your post more carefully. I see you are suggesting that netcdf-c does do the byte swapping from the native format to the on-disk format when writing. This was not my understanding. I will need to confirm this with the Unidata folks.
Looks good to me. One thing I would suggest (maybe this should be a separate issue, though): when reading, the default behaviour of netcdf4-python should be to return data in native byte order. I think this is the behaviour most netcdf users will expect, and it will result in better performance (and consistency with other netcdf libraries). Having the numpy array match the on-disk byte order could be an option, similar to mask/scale/fill etc.
This was done originally to enable round-trips (to and from a netcdf file). If we always return a numpy array with native byte order, then you can't use the numpy dtype to infer the data type of the netcdf variable it came from. There may be code out there that relies on this; I would be hesitant to change the behavior at this point. Note that this is also what h5py does.
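The round-trip argument can be sketched with numpy alone (here `'>i2'` stands in for a big endian short variable on disk; no actual file I/O):

```python
import numpy as np

# An array read back in on-disk (big endian) order keeps that order
# in its dtype, so the dtype alone records how the variable was stored.
from_file = np.array([1, 2, 3], dtype='>i2')
print(from_file.dtype.byteorder)  # '>'

# Converting to native order for computation discards that record:
native = from_file.astype(from_file.dtype.newbyteorder('='))
print(native.dtype.isnative)      # True
```

After the conversion, nothing in the array tells you the source variable was big endian, so a write-back would have to fetch the byte order from somewhere else.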
cc @WardF
I agree with @jswhit that we should return the data in whatever byte order it's saved on disk. This is also consistent with scipy.io.netcdf and h5py. If users need it in native order, they can copy it themselves. |
Here's an example showing that h5py and netcdf4-python have the same behavior with pull request #555 applied:
using the
As netCDF uses the HDF5 library for I/O, I suspect that netCDF also exhibits the same behavior observed in h5py and netcdf4-python with pull request #555 applied. The internal swapping I alluded to earlier happens when the data is printed via ncdump, etc., but not as part of the data storage mechanism.
Merging pull request #555 now.
Dear netCDF4 dev team,
This issue was originally filed as an xarray issue, but turned out to occur already at netCDF4 level. You can find the full description including test data and a Python notebook in xarray issue #822: value scaling wrong in special cases.
We hope this can be fixed easily, as our project depends on it. If you could indicate where to look in the code, we might be able and happy to contribute.
The problem might be related to #493, because the problematic variable specifies value_min and value_max attributes; however, both have the same signed short datatype as the variable.

Regards
-- Norman