
Problem interpreting short arrays #554

Closed

forman opened this issue Apr 11, 2016 · 22 comments


forman commented Apr 11, 2016

Dear netCDF4 dev team,

This issue was originally filed as an xarray issue, but it turned out to occur already at the netCDF4 level. You can find the full description, including test data and a Python notebook, in xarray issue #822: value scaling wrong in special cases.

We hope this can be fixed easily, as our project depends on it. If you could indicate where to look in the code, we might be able, and happy, to contribute.

The problem might be related to #493 because the problematic variable specifies valid_min and valid_max attributes; however, both have the same signed short datatype as the variable.

Regards
-- Norman


jswhit commented Apr 11, 2016

Just read through the xarray issue, and two things are not clear to me:

  1. Is the data read correctly if the auto-scaling is turned off (see http://unidata.github.io/netcdf4-python/#netCDF4.Dataset.set_auto_scale)? (See the sketch after this list.)

  2. netcdf4-python simply applies the scale_factor and add_offset to the non-missing values in the data. What other steps are necessary to re-scale the data correctly in this case? Should the short integer data first be clipped to valid_min/valid_max?
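A minimal sketch of the check in question 1, assuming the filename is a placeholder for the dataset attached to the xarray issue:

import netCDF4

nc = netCDF4.Dataset('analysed_sst.nc')  # placeholder filename for the dataset in question
sst = nc.variables['analysed_sst']
sst.set_auto_scale(False)                # disable automatic scale_factor/add_offset on read
raw = sst[:]                             # raw packed short integers
print(raw.dtype)
print(raw.min())
print(raw.max())
nc.close()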


jswhit commented Apr 11, 2016

OK, it seems that there is a problem reading the raw short integer data. For some reason, the data returned by the netcdf C library looks different from the data returned by h5py.
The data returned by netcdf-c looks like this (https://cloud.githubusercontent.com/assets/1217238/14413677/2a1a0b7a-ff36-11e5-9de2-e7ee1c236c18.png) whereas the data returned by h5py looks like this (https://cloud.githubusercontent.com/assets/10050469/14421721/790ba00c-ffd3-11e5-9ffd-32daa03849df.png).

Can you let us know what versions of the netcdf-c and hdf5 libraries were used to create this dataset?


jswhit commented Apr 11, 2016

I think I understand the problem now. ncdump -hs on the netcdf file shows

analysed_sst:_Endianness = "big"

so netcdf4-python reads the data into a big endian short integer array (>i2). However, it appears that the data is really little endian. If I byteswap the data returned by slicing the netcdf variable, using data = data.byteswap(), it looks fine.
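In numpy terms, the effect of that workaround can be shown with a tiny self-contained sketch (the value 3329 is just used here as an example packed value):

import numpy as np

# Stand-in for the raw short integer array obtained by slicing the variable:
# little endian bytes that are labelled big endian (>i2).
data = np.array([3329], dtype='<i2').view('>i2')
print(data[0])              # 269 - the bytes read in the wrong order
print(data.byteswap()[0])   # 3329 - the intended value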

Not sure yet whether this is a bug in the creation of the dataset, the python interface or the netcdf C library.


jswhit commented Apr 11, 2016

I now believe this is a bug in the dataset - the data is stored in HDF5 as big endian data, but it is actually little endian. ncview and h5py return the 'right' answer since they both assume that the data is in the native endian format; netcdf4-python does not (it queries the file to determine the endianness).


forman commented Apr 12, 2016

Jeff, thanks so much for your time looking into this! I will report it to the data producers and let you know how the files were created.


shoyer commented Apr 12, 2016

I don't think the data is really little endian. Everything I see from the HDF5 side suggests that the data is indeed big endian.

Here's what I see when I examine this file with h5dump -H:

   DATASET "analysed_sst" {
      DATATYPE  H5T_STD_I16BE
      DATASPACE  SIMPLE { ( 1, 3600, 7200 ) / ( H5S_UNLIMITED, 3600, 7200 ) }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      }
      ATTRIBUTE "_FillValue" {
         DATATYPE  H5T_STD_I16LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "add_offset" {
         DATATYPE  H5T_IEEE_F32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "comment" {
         DATATYPE  H5T_STRING {
            STRSIZE 88;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "depth" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "long_name" {
         DATATYPE  H5T_STRING {
            STRSIZE 32;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "scale_factor" {
         DATATYPE  H5T_IEEE_F32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "source" {
         DATATYPE  H5T_STRING {
            STRSIZE 118;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "standard_name" {
         DATATYPE  H5T_STRING {
            STRSIZE 21;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "units" {
         DATATYPE  H5T_STRING {
            STRSIZE 6;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "valid_max" {
         DATATYPE  H5T_STD_I16LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "valid_min" {
         DATATYPE  H5T_STD_I16LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
   }

Note that there's no attribute _Endianness = "big" on the original data and that it has dtype H5T_STD_I16BE.

Likewise, h5py shows the data is big endian, and reads it into NumPy as a big endian array:

In [20]: sst
Out[20]: <HDF5 dataset "analysed_sst": shape (1, 3600, 7200), type ">i2">

In [21]: np.asarray(sst).dtype
Out[21]: dtype('>i2')

In [22]: np.max(sst)  # this is the correct value, 33.29 K above add_offset
Out[22]: 3329


forman commented Apr 12, 2016

Strange! Still, the pattern in the images looks like a typical endianness problem.


jswhit commented Apr 12, 2016

The HDF5 variable was set to be big endian. However, neither netcdf-c nor HDF5 ensures that the byte order is big endian when you write the data to the variable. I suspect the data was written on a little endian machine, and the bytes were not swapped before writing to the variable.

netcdf4-python does ensure that the byte order of the numpy array is consistent with the netcdf variable before writing to the file, but I doubt that many other netcdf clients do.


oembury commented Apr 12, 2016

The netcdf-c library does appear to do the correct byte swapping; the problem appears to be that netCDF4-python is attempting to do an additional byte swap on top. I created a test file using:

ncgen -b << EOF
netcdf test {
  variables:
    int big ;
      big:_Endianness = "big" ;
    int little ;
      little:_Endianness = "little" ;
  data:
    big = 66051;
    little = 66051;
}
EOF

Examining the output test.nc file, I found one occurrence of the byte string \x00\x01\x02\x03 and one of \x03\x02\x01\x00, indicating that netcdf has correctly byte swapped the data when writing to the file. Both ncdump and h5dump read the file correctly:

ncdump -s test.nc 
netcdf test {
variables:
    int big ;
        big:_Endianness = "big" ;
    int little ;
        little:_Endianness = "little" ;

// global attributes:
        :_Format = "netCDF-4" ;
data:

 big = 66051 ;

 little = 66051 ;
}
h5dump test.nc 
HDF5 "test.nc" {
GROUP "/" {
   DATASET "big" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SCALAR
      DATA {
      (0): 66051
      }
   }
   DATASET "little" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 66051
      }
   }
}
}

However, netcdf4-python does not read the file correctly:

>>> nc=netCDF4.Dataset('test.nc')
>>> nc.variables['little'][:]
66051
>>> nc.variables['big'][:]
50462976
>>> 
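For reference, the byte-level check described above can be reproduced with a few lines of Python (a rough sketch, assuming the test.nc produced by the ncgen command above):

# Count the raw 4-byte patterns for 66051 in test.nc.
with open('test.nc', 'rb') as f:
    buf = f.read()
print(buf.count(b'\x00\x01\x02\x03'))   # big endian representation of 66051
print(buf.count(b'\x03\x02\x01\x00'))   # little endian representation of 66051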


jswhit commented Apr 12, 2016

OK - now I'm confused. The 'byteswap on write' was added when a user reported that this script did not work (a variant of this is included in test/tst_endian.py).

import netCDF4
import numpy as np

# this example should be run on a little endian machine.

dt = np.float32
xs = np.arange(10, dtype=dt) # little endian data
print xs, xs.dtype
nc = netCDF4.Dataset('test.nc', mode='w')
nc.createDimension('x', size=xs.size)
nc.createVariable('x', dt, ('x'), endian='big') # big endian variable
nc.variables['x'][:] = xs # data byteswapped internally before write
nc.close()

nc = netCDF4.Dataset('test.nc')
xsout = nc.variables['x'][:]
print xsout, xsout.dtype
nc.close()

using github master, this produces

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] float32
testendian.py:11: UserWarning: endian-ness of dtype and endian kwarg do not match, using     endian kwarg
  nc.createVariable('x', dt, ('x'), endian='big') # big endian variable
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4

if 'byteswap on write' is disabled then you'll get

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4
[  0.00000000e+00   4.60060299e-41   8.96831017e-44   2.30485571e-41
   4.60074312e-41   5.74868682e-41   6.89663052e-41   8.04457422e-41
   9.10844002e-44   5.83080291e-42] >f4

which is what h5py produces. Perhaps ncgen does the byteswapping, but a call to nc_put_var in the C library does not?


jswhit commented Apr 12, 2016

@oembury, your ncgen/ncdump example is consistent with the data being written little endian to both variables, regardless of the _Endianness attribute. Neither ncdump nor h5dump is flipping bytes, so what you see written on the screen is the little endian representation of the bytes in the file (assuming you are on a little endian machine). The fact that the 'correct' answer (66051) is seen confirms that the bytes are little endian in the hdf5 file, regardless of whether the datatype is big or little endian.
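As a quick check, the two values above (66051 and 50462976) are the same four bytes interpreted in opposite orders:

import struct

raw = b'\x00\x01\x02\x03'
print(struct.unpack('>i', raw)[0])   # 66051 - big endian interpretation
print(struct.unpack('<i', raw)[0])   # 50462976 - little endian interpretation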

Perhaps we need one of the netcdf-c developers to chime in here. @WardF, can you confirm that neither the netcdf-c nor the hdf5 library does any byte swapping? In other words, if you try to write to a big endian variable on a little endian machine, should the client swap the bytes before passing the data to nc_put_var?


oembury commented Apr 12, 2016

netcdf-c always uses native endianness for variables in memory. The _Endianness attribute only applies to the on-disk layout - and only if you're using a netcdf4 file (netcdf3 files are always big endian, as far as I'm aware).

If I write the two variables to separate files:

ncgen -b << EOF
netcdf big {
variables:
  int big ;
    big:_Endianness = "big" ;
data:
   big = 66051 ;
}
EOF

ncgen -b << EOF
netcdf little {
variables:
  int little ;
    little:_Endianness = "little" ;
data:
  little = 66051 ;
}
EOF

Then I find that little.nc contains the bytes \x03\x02\x01\x00 and big.nc contains \x00\x01\x02\x03, i.e. netcdf-c has performed the byte swapping from native to the endianness specified for on-disk storage.

netcdf4-python should not have to do any byte-swapping when reading, as netcdf-c will convert the on-disk representation to native endianness.

When writing, netcdf4-python (or the user) should ensure the data is in native endianness before passing it to netcdf-c (i.e. byteswap if the numpy data is non-native).
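A sketch of what that might look like on the numpy side (to_native is a hypothetical helper, not part of netcdf4-python):

import numpy as np

def to_native(arr):
    # Return arr in native byte order, converting only if its dtype is non-native.
    if arr.dtype.byteorder in ('=', '|'):                # already native, or byte order not applicable
        return arr
    native = '<' if np.little_endian else '>'
    if arr.dtype.byteorder == native:
        return arr
    return arr.astype(arr.dtype.newbyteorder('='))       # same values, native byte order

big = np.arange(10, dtype='>f4')                         # big endian data
print(to_native(big).dtype)                              # float32, in native byte order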

In both reading and writing, netcdf4-python should not need to know how the _Endianness attribute is set (like compression, shuffle, etc., it is only relevant to the HDF5 layer).


jswhit commented Apr 12, 2016

This is exactly what netcdf4-python is doing: no byte swapping on read, byte swapping on write if the byte order of the variable is different from the byte order of the numpy array. My hypothesis is that the OP is dealing with a file that was created without byte swapping the data from the native endian format to the on-disk format.

Note that netcdf4-python does need to know the _Endianness in order to make sure that the data is byte swapped on write if the on-disk byte order is different from the numpy array byte order.


jswhit commented Apr 12, 2016

OK, I re-read your post more carefully. I see you are suggesting that netcdf-c does do the byte swapping from the native format to the on-disk format when writing. This was not my understanding. I will need to confirm this with the Unidata folks.


jswhit commented Apr 12, 2016

Pull request #555 alters the behavior to be consistent with what @oembury suggests netcdf-c expects.

The behavior of netcdf4-python is now consistent with h5py, and the dataset posted by @forman now reads and plots correctly.

Still need to check with netcdf-c devs to see if I've got it right this time.


oembury commented Apr 13, 2016

Looks good to me. One thing I would suggest (maybe this should be a separate issue though?):

When reading, the default behaviour of netcdf4-python should be to return data in native byte order. I think this is the behaviour most netcdf users will expect, and it will result in better performance (and consistency with other netcdf libraries). Having the numpy array match the on-disk byte order could be an option, similar to mask/scale/fill etc.


jswhit commented Apr 13, 2016

This was done originally to enable round-trips (to and from a netcdf file). If we always return a numpy array with native byte order, then you can't use the numpy dtype to infer the data type of the netcdf variable it came from. There may be code out there that relies on this - I would be hesitant to change the behavior at this point.

Note that this is also what h5py does.

dopplershift commented

cc @WardF


shoyer commented Apr 13, 2016

I agree with @jswhit that we should return the data in whatever byte order it's saved on disk. This is also consistent with scipy.io.netcdf and h5py. If users need it in native order, they can copy it themselves.


jswhit commented Apr 13, 2016

Here's an example showing that h5py and netcdf4-python have the same behavior with pull request #555 applied:

import netCDF4, h5py, numpy
xs = numpy.arange(10, dtype='<f4') # little endian data
nc = netCDF4.Dataset('test.nc', mode='w')
nc.createDimension('x', size=xs.size)
nc.createVariable('x', '>f4', ('x'), endian='big') # big endian variable
nc.variables['x'][:] = xs
nc.close()
nc = netCDF4.Dataset('test.nc')
xsout = nc.variables['x'][:]
print 'netcdf4-python',xsout, xsout.dtype, repr(xsout[-1].tobytes())
nc.close()
f = h5py.File('test.nc')
xsout = f['x'][:]
print 'h5py',xsout, xsout.dtype, repr(xsout[-1].tobytes())
f.close()

using the byteswap branch, I get

netcdf4-python [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4 '\x00\x00\x10A'
h5py [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.] >f4 '\x00\x00\x10A'


WardF commented Apr 13, 2016

As netCDF uses the HDF5 library for I/O, I suspect that netCDF also exhibits the same behavior observed in h5py and netcdf4-python with pull request #555 applied. The internal swapping I alluded to earlier happens when the data is printed via ncdump, etc., but not as part of the data storage mechanism.


jswhit commented Apr 14, 2016

merging pull request #555 now.

jswhit closed this as completed Apr 14, 2016