New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recognize _Encoding attribute for char and string arrays #665

Merged
merged 29 commits into from May 19, 2017

Conversation

Projects
None yet
3 participants
@jswhit
Copy link
Collaborator

jswhit commented May 17, 2017

Add check for _Encoding attribute for NC_STRING variables, otherwise use 'utf-8'. 'utf-8' is used everywhere else, 'default_encoding' global module variable is no longer used. getncattr method now takes optional kwarg 'encoding' (default 'utf-8') so encoding of attributes can be specified if desired. If _Encoding is specified for an NC_CHAR ('S1') variable,the chartostring utility function is used to convert the array of characters to an array of strings with one less dimension (the last dimension is interpreted as the length of each string) when reading the data. When writing the data, stringtochar is used to convert a numpy array of fixed length strings to an array of characters with one more dimension. chartostring and stringtochar now also have an 'encoding' kwarg.

The _Encoding attribute convention is being discussed in Unidata/netcdf-c#402.

jswhit added some commits May 16, 2017

get rid of 'default_encoding' and 'unicode_error' module variables.
Use 'utf-8' and 'replace' for everything except NC_STRING variable data.
For NC_STRING variable data, look for _Encoding variable attribute,
otherwise use 'utf-8'.

jswhit added some commits May 17, 2017

@jswhit

This comment has been minimized.

Copy link
Collaborator

jswhit commented May 18, 2017

@shoyer, I'm wondering how this change would impact xarray - especially the auto-conversion of char arrays to string arrays with the last dimension collapsed. This would only happen if the _Encoding attribute is set - but even so, would you want a way to disable this extra 'magic' for xarray?

jswhit added some commits May 18, 2017

@shoyer

This comment has been minimized.

Copy link
Contributor

shoyer commented May 18, 2017

@jswhit thanks for the heads up. Yes, I think this implementation as-is would break xarray, where we do our own char -> string array conversion.

There are two ways to fix this:

  1. Add some way to turn this off.
  2. Make the netCDF4.Variable objects have a consistent interface with the converted arrays. That means adjusting ndim, shape and dtype along with array values.

I like this second option better.

@jswhit

This comment has been minimized.

Copy link
Collaborator

jswhit commented May 18, 2017

The second option would be nice, but quite difficult since ndim, shape and dtype are assumed internally to represent the variable as stored in the file.

How about adding a set_auto_chartostring Dataset andVariable method? The default could be False if we want to be conservative.

@dopplershift

This comment has been minimized.

Copy link
Member

dopplershift commented May 18, 2017

That would fit the existing API of the library, where any interpretation of attributes is configurable...

@jswhit

This comment has been minimized.

Copy link
Collaborator

jswhit commented May 18, 2017

I went ahead and added a set_auto_chartostring method - the default is True for now, just like the auto mask and scale attributes.

@shoyer

This comment has been minimized.

Copy link
Contributor

shoyer commented May 18, 2017

I'm okay with methods for this.

But going forward, this is probably a case for separate low level and high level interfaces, even if only the high level interface is exposed publicly. h5py uses this approach and it works quite well.

@jswhit

This comment has been minimized.

Copy link
Collaborator

jswhit commented May 19, 2017

OK, merging now. @shoyer, good idea about the low level interface. I'll create a separate ticket for that.

@jswhit jswhit merged commit f8e55d8 into master May 19, 2017

4 checks passed

continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@jswhit jswhit deleted the encoding branch May 19, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment