
Problem interpreting short arrays #554

Closed

forman opened this issue Apr 11, 2016 · 22 comments


forman commented Apr 11, 2016

Dear netCDF4 dev team,

This issue was originally filed as an xarray issue, but it turned out to occur already at the netCDF4 level. You can find the full description, including test data and a Python notebook, in xarray issue #822: value scaling wrong in special cases.

We hope this can be fixed easily, as our project depends on it. If you could indicate where to look in the code, we might be able, and happy, to contribute.

The problem might be related to #493 because the problematic variable specifies valid_min and valid_max attributes; however, both have the same signed short datatype as the variable.

Regards
-- Norman


jswhit commented Apr 11, 2016

Just read through the xarray issue, and two things are not clear to me:

  1. Is the data read correctly if the auto-scaling is turned off (see http://unidata.github.io/netcdf4-python/#netCDF4.Dataset.set_auto_scale)? (See the sketch after this list.)

  2. netcdf4-python simply applies the scale_factor and add_offset to the non-missing values in the data. What other steps are necessary to re-scale the data correctly in this case? Should the short integer data first be clipped to valid_min/valid_max?
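A minimal sketch of the check in question 1, assuming the filename is a placeholder for the dataset attached to the xarray issue:

import netCDF4

nc = netCDF4.Dataset('analysed_sst.nc')  # placeholder filename for the dataset in question
sst = nc.variables['analysed_sst']
sst.set_auto_scale(False)                # disable automatic scale_factor/add_offset on read
raw = sst[:]                             # raw packed short integers
print(raw.dtype)
print(raw.min())
print(raw.max())
nc.close()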


jswhit commented Apr 11, 2016

OK, it seems that there is a problem reading the raw short integer data. For some reason, the data returned by the netcdf C library looks different from the data returned by h5py.
The data returned by netcdf-c looks like this (https://cloud.githubusercontent.com/assets/1217238/14413677/2a1a0b7a-ff36-11e5-9de2-e7ee1c236c18.png) whereas the data returned by h5py looks like this (https://cloud.githubusercontent.com/assets/10050469/14421721/790ba00c-ffd3-11e5-9ffd-32daa03849df.png).

Can you let us know what versions of the netcdf-c and hdf5 libraries were used to create this dataset?


jswhit commented Apr 11, 2016

I think I understand the problem now. ncdump -hs on the netcdf file shows

analysed_sst:_Endianness = "big"

so netcdf4-python reads the data into a big endian short integer array (>i2). However, it appears that the data is really little endian. If I byteswap the data returned by slicing the netcdf variable, using data = data.byteswap(), it looks fine.
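In numpy terms, the effect of that workaround can be shown with a tiny self-contained sketch (the value 3329 is just used here as an example packed value):

import numpy as np

# Stand-in for the raw short integer array obtained by slicing the variable:
# little endian bytes that are labelled big endian (>i2).
data = np.array([3329], dtype='<i2').view('>i2')
print(data[0])              # 269 - the bytes read in the wrong order
print(data.byteswap()[0])   # 3329 - the intended value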

Not sure yet whether this is a bug in the creation of the dataset, the python interface or the netcdf C library.


jswhit commented Apr 11, 2016

I now believe this is a bug in the dataset - the data is stored in HDF5 as big endian data, but it is actually little endian. ncview and h5py return the 'right' answer since they both assume that the data is in the native endian format; netcdf4-python does not (it queries the file to determine the endianness).


forman commented Apr 12, 2016

Jeff, thanks so much for your time looking into this! I will report it to the data producers and let you know how the files were created.


shoyer commented Apr 12, 2016

I don't think the data is really little endian. Everything I see from the HDF5 side suggests that the data is indeed big endian.

Here's what I see when I examine this file with h5dump -H:

   DATASET "analysed_sst" {
      DATATYPE  H5T_STD_I16BE
      DATASPACE  SIMPLE { ( 1, 3600, 7200 ) / ( H5S_UNLIMITED, 3600, 7200 ) }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      }
      ATTRIBUTE "_FillValue" {
         DATATYPE  H5T_STD_I16LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "add_offset" {
         DATATYPE  H5T_IEEE_F32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "comment" {
         DATATYPE  H5T_STRING {
            STRSIZE 88;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "depth" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "long_name" {
         DATATYPE  H5T_STRING {
            STRSIZE 32;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "scale_factor" {
         DATATYPE  H5T_IEEE_F32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "source" {
         DATATYPE  H5T_STRING {
            STRSIZE 118;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "standard_name" {
         DATATYPE  H5T_STRING {
            STRSIZE 21;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "units" {
         DATATYPE  H5T_STRING {
            STRSIZE 6;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "valid_max" {
         DATATYPE  H5T_STD_I16LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "valid_min" {
         DATATYPE  H5T_STD_I16LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
   }

Note that there's no attribute _Endianness = "big" on the original data and that it has dtype H5T_STD_I16BE.

Likewise, h5py shows the data is big endian, and reads it into NumPy as a big endian array:

In [20]: sst
Out[20]: <HDF5 dataset "analysed_sst": shape (1, 3600, 7200), type ">i2">

In [21]: np.asarray(sst).dtype
Out[21]: dtype('>i2')

In [22]: np.max(sst)  # this is the correct value, 33.29 K above add_offset
Out[22]: 3329


forman commented Apr 12, 2016

Strange! Still, the pattern in the images looks like a typical endianness problem.


jswhit commented Apr 12, 2016

The HDF5 variable was set to be big endian. However, neither netcdf-c nor HDF5 ensures that the byte order is big endian when you write the data to the variable. I suspect the data was written on a little endian machine, and the bytes were not swapped before writing to the variable.

netcdf4-python does ensure that the byte order of the numpy array is consistent with the netcdf variable before writing to the file, but I doubt that many other netcdf clients do.


oembury commented Apr 12, 2016

The netcdf-c library does appear to do the correct byte swapping; the problem appears to be that netCDF4-python is attempting to do an additional byte swap on top. I created a test file using:

ncgen -b << EOF
netcdf test {
  variables:
    int big ;
      big:_Endianness = "big" ;
    int little ;
      little:_Endianness = "little" ;
  data:
    big = 66051;
    little = 66051;
}
EOF

Examining the output test.nc file, I found one occurrence of the byte string \x00\x01\x02\x03 and one of \x03\x02\x01\x00, indicating that netcdf has correctly byte swapped the data when writing to the file. Both ncdump and h5dump read the file correctly:

ncdump -s test.nc 
netcdf test {
variables:
    int big ;
        big:_Endianness = "big" ;
    int little ;
        little:_Endianness = "little" ;

// global attributes:
        :_Format = "netCDF-4" ;
data:

 big = 66051 ;

 little = 66051 ;
}
h5dump test.nc 
HDF5 "test.nc" {
GROUP "/" {
   DATASET "big" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SCALAR
      DATA {
      (0): 66051
      }
   }
   DATASET "little" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 66051
      }
   }
}
}

However, netcdf4-python does not read the file correctly:

>>> nc=netCDF4.Dataset('test.nc')
>>> nc.variables['little'][:]
66051
>>> nc.variables['big'][:]
50462976
>>> 
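For reference, the byte-level check described above can be reproduced with a few lines of Python (a rough sketch, assuming the test.nc produced by the ncgen command above):

# Count the raw 4-byte patterns for 66051 in test.nc.
with open('test.nc', 'rb') as f:
    buf = f.read()
print(buf.count(b'\x00\x01\x02\x03'))   # big endian representation of 66051
print(buf.count(b'\x03\x02\x01\x00'))   # little endian representation of 66051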


jswhit commented Apr 12, 2016

OK - now I'm confused. The 'byteswap on write' was added when a user reported that this script did not work (a variant of this is included in test/tst_endian.py).

import netCDF4
import numpy as np

# this example should be run on a little endian machine.

dt = np.float32
xs = np.arange(10, dtype=dt) # little endian data
print xs, xs.dtype
nc = netCDF4.Dataset('test.nc', mode='w')
nc.createDimension('x', size=xs.size)
nc.createVariable('x', dt, ('x'), endian='big') # big endian variable
nc.variables['x'][:] = xs # data byteswapped internally before write
nc.close()

nc = netCDF4.Dataset('test.nc')
xsout = nc.variables['x'][:]
print xsout, xsout.dtype
nc.close()

using github master, this produces

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] float32
testendian.py:11: UserWarning: endian-ness of dtype and endian kwarg do not match, using     endian kwarg
  nc.createVariable('x', dt, ('x'), endian='big') # big endian variable
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4

if 'byteswap on write' is disabled then you'll get

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4
[  0.00000000e+00   4.60060299e-41   8.96831017e-44   2.30485571e-41
   4.60074312e-41   5.74868682e-41   6.89663052e-41   8.04457422e-41
   9.10844002e-44   5.83080291e-42] >f4

which is what h5py produces. Perhaps ncgen does the byteswapping, but a call to nc_put_var in the C library does not?


jswhit commented Apr 12, 2016

@oembury, your ncgen/ncdump example is consistent with the data being written little endian to both variables, regardless of the _Endianness attribute. Neither ncdump nor h5dump is flipping bytes, so what you see written on the screen is the little endian representation of the bytes in the file (assuming you are on a little endian machine). The fact that the 'correct' answer (66051) is seen confirms that the bytes are little endian in the hdf5 file, regardless of whether the datatype is big or little endian.
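As a quick check, the two values above (66051 and 50462976) are the same four bytes interpreted in opposite orders:

import struct

raw = b'\x00\x01\x02\x03'
print(struct.unpack('>i', raw)[0])   # 66051 - big endian interpretation
print(struct.unpack('<i', raw)[0])   # 50462976 - little endian interpretation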

Perhaps we need one of the netcdf-c developers to chime in here. @WardF, can you confirm that neither the netcdf-c nor the hdf5 library does any byte swapping? In other words, if you try to write to a big endian variable on a little endian machine, should the client swap the bytes before passing the data to nc_put_var?


oembury commented Apr 12, 2016

netcdf-c always uses native endianness for variables in memory. The _Endianness attribute only applies to the on-disk layout - and only if you're using a netcdf4 file (netcdf3 files are always big endian, as far as I'm aware).

If I write the two variables to separate files:

ncgen -b << EOF
netcdf big {
variables:
  int big ;
    big:_Endianness = "big" ;
data:
   big = 66051 ;
}
EOF

ncgen -b << EOF
netcdf little {
variables:
  int little ;
    little:_Endianness = "little" ;
data:
  little = 66051 ;
}
EOF

Then I find that little.nc contains the bytes \x03\x02\x01\x00 and big.nc contains \x00\x01\x02\x03, i.e. netcdf-c has performed the byte swapping from native to the endianness specified for on-disk storage.

netcdf4-python should not have to do any byte-swapping when reading, as netcdf-c will convert the on-disk representation to native endianness.

When writing, netcdf4-python (or the user) should ensure the data is in native endianness before passing it to netcdf-c (i.e. byteswap if the numpy data is non-native).
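A sketch of what that might look like on the numpy side (to_native is a hypothetical helper, not part of netcdf4-python):

import numpy as np

def to_native(arr):
    # Return arr in native byte order, converting only if its dtype is non-native.
    if arr.dtype.byteorder in ('=', '|'):                # already native, or byte order not applicable
        return arr
    native = '<' if np.little_endian else '>'
    if arr.dtype.byteorder == native:
        return arr
    return arr.astype(arr.dtype.newbyteorder('='))       # same values, native byte order

big = np.arange(10, dtype='>f4')                         # big endian data
print(to_native(big).dtype)                              # float32, in native byte order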

In both reading and writing, netcdf4-python should not need to know how the _Endianness attribute is set (like compression, shuffle, etc., it is only relevant to the HDF5 layer).


jswhit commented Apr 12, 2016

This is exactly what netcdf4-python is doing: no byte swapping on read, byte swapping on write if the byte order of the variable is different from the byte order of the numpy array. My hypothesis is that the OP is dealing with a file that was created without byte swapping the data from the native endian format to the on-disk format.

Note that netcdf4-python does need to know the _Endianness in order to make sure that the data is byte swapped on write if the on-disk byte order is different from the numpy array byte order.


jswhit commented Apr 12, 2016

OK, I re-read your post more carefully. I see you are suggesting that netcdf-c does do the byte swapping from the native format to the on-disk format when writing. This was not my understanding. I will need to confirm this with the Unidata folks.


jswhit commented Apr 12, 2016

Pull request #555 alters the behavior to be consistent with what @oembury suggests netcdf-c expects.

The behavior of netcdf4-python is now consistent with h5py, and the dataset posted by @forman now reads and plots correctly.

Still need to check with netcdf-c devs to see if I've got it right this time.


oembury commented Apr 13, 2016

Looks good to me. One thing I would suggest (maybe this should be a separate issue though?):

When reading, the default behaviour of netcdf4-python should be to return data in native byte order. I think this is the behaviour most netcdf users will expect, and it will result in better performance (and consistency with other netcdf libraries). Having the numpy array match the on-disk byte order could be an option, similar to mask/scale/fill etc.


jswhit commented Apr 13, 2016

This was done originally to enable round-trips (to and from a netcdf file). If we always return a numpy array with native byte order, then you can't use the numpy dtype to infer the data type of the netcdf variable it came from. There may be code out there that relies on this - I would be hesitant to change the behavior at this point.

Note that this is also what h5py does.

dopplershift commented

cc @WardF


shoyer commented Apr 13, 2016

I agree with @jswhit that we should return the data in whatever byte order it's saved on disk. This is also consistent with scipy.io.netcdf and h5py. If users need it in native order, they can copy it themselves.


jswhit commented Apr 13, 2016

Here's an example showing that h5py and netcdf4-python have the same behavior with pull request #555 applied:

import netCDF4, h5py, numpy
xs = numpy.arange(10, dtype='<f4') # little endian data
nc = netCDF4.Dataset('test.nc', mode='w')
nc.createDimension('x', size=xs.size)
nc.createVariable('x', '>f4', ('x'), endian='big') # big endian variable
nc.variables['x'][:] = xs
nc.close()
nc = netCDF4.Dataset('test.nc')
xsout = nc.variables['x'][:]
print 'netcdf4-python',xsout, xsout.dtype, repr(xsout[-1].tobytes())
nc.close()
f = h5py.File('test.nc')
xsout = f['x'][:]
print 'h5py',xsout, xsout.dtype, repr(xsout[-1].tobytes())
f.close()

using the byteswap branch, I get

netcdf4-python [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4 '\x00\x00\x10A'
h5py [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4 '\x00\x00\x10A'


WardF commented Apr 13, 2016

As netCDF uses the HDF5 library for I/O, I suspect that netCDF also exhibits the same behavior observed in h5py and netcdf4-python with pull request #555 applied. The internal swapping I alluded to earlier happens when the data is printed via ncdump, etc., but not as part of the data storage mechanism.


jswhit commented Apr 14, 2016

merging pull request #555 now.

jswhit closed this as completed Apr 14, 2016