
Table unicode sandwich - make 'S' type useful in Python 3 #5700

Merged
16 commits merged on May 30, 2017

Conversation

taldcroft
Member

Currently the numpy bytestring (S) dtype is difficult to use in Python 3 because one cannot assign or compare with the natural str type. When dealing with FITS, HDF5, or most other binary tabular data formats, text data will be read into a table that has S-type columns.

This led to a workaround (#1974) around 3 years ago that makes it easy to convert S columns in a table to unicode. That lets you write natural code that works, but you still pay a big price because memory use inflates by a factor of 4 when going to the numpy U type.

Now taking inspiration from the work that @embray did at https://github.com/embray/PyFITS/blob/stringy/lib/pyfits/fitsrec.py#L999, this PR implements the idea of the unicode sandwich for bytestring Table columns (see http://nedbatchelder.com/text/unipain.html for discussion of the sandwich concept).

In Python 3 the values are always decoded as UTF-8 when being accessed, and always encoded as UTF-8 when being set. The upshot is that bytestring columns have the same behavior in Python 2 and 3 with respect to natural usage with the default str type.
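The sandwich behavior described above can be sketched with plain numpy; the helper names below are illustrative, not astropy's actual implementation:

```python
import numpy as np

def sandwich_get(item):
    """Decode bytes to str on access (hypothetical helper illustrating the idea)."""
    if isinstance(item, bytes):
        return item.decode('utf-8')
    if isinstance(item, np.ndarray) and item.dtype.kind == 'S':
        return np.char.decode(item, 'utf-8')
    return item

def sandwich_set(value):
    """Encode str to bytes before storing (the other slice of the sandwich)."""
    if isinstance(value, str):
        return value.encode('utf-8')
    return value

col = np.array([b'alpha', b'beta'])   # S-type column data
print(sandwich_get(col[0]))           # 'alpha' -- a str, not bytes
col[1] = sandwich_set('gamma')
print(sandwich_get(col[1]))           # 'gamma'
```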

One important feature is that this PR takes pains to not do anything differently for Python 2. Basically all the code only applies to Python 3.

To be clear, this will be a significant API change because if people have been using bytestring literals in their Python 3 code, then that will break. But that is definitely the wrong way to do it. Astropy 2.0 seems like a nice opportunity for this API change.

@taldcroft
Member Author

About breakage, what will now break is a comparison like t['col'][10] == b'value'. But mask = t['col'] == b'value' will work, as will assignment.
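The scalar case breaks because of a core Python 3 rule: bytes and str never compare equal, so once a value is decoded to str it no longer matches a bytes literal:

```python
# Python 3 semantics: bytes and str are distinct types that never compare equal.
print('value' == b'value')   # False -- this is what breaks t['col'][10] == b'value'
print('value' == 'value')    # True  -- comparing with a str literal works
```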

@taldcroft
Member Author

Did some more work on this:

  • Added testing for the Py3-only use case of actually setting with non-ASCII text. In other words you can properly store and work with unicode text in a bytestring Column. This only works on Py3. I played around with this on Py2 but everything broke, so I decided that Py2 is legacy and we shall move forward with Py3!
  • Improved testing to cover MaskedColumn.
  • Rebased from master with string truncation warnings. This gives a warning if you set to multi-character unicode that doesn't fit.

@mhvk
Contributor

mhvk commented Jan 14, 2017

I like the idea of having a well-defined ascii string column type, but have both a general question and a specific one:

The general one is whether, instead of overriding the meaning of a regular S-dtype Column (making it deviate from what ndarray does, possibly breaking code, and removing the option to have an actual byte-string column), it might be better to have a new AsciiColumn(Column) subclass that implements what you have here. It does mean adapting code in io, etc., to use this, but it might end up being cleaner (and possibly work for both Python 2 and 3).

The specific one is about encoding as utf8 -- this means that the byte entries can/will hold variable numbers of characters. If so, is this really wanted? Maybe it is better to use ascii for the encoding? (decoding can of course be utf-8).
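The variable-width point can be seen directly: UTF-8 spends between one and four bytes per character, so a fixed-width S column holds a varying number of characters once UTF-8 encoded:

```python
# UTF-8 uses a variable number of bytes per character, so the number of
# characters that fit in a fixed-width byte field depends on the text.
print(len('abc'.encode('utf-8')))    # 3 bytes for 3 ASCII characters
print(len('héllo'.encode('utf-8')))  # 6 bytes for 5 characters
print(len('日本語'.encode('utf-8')))  # 9 bytes for 3 characters
```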

Also, more generally, for actual utf-8 strings, we could perhaps consider making a new mixin class (based on list). Regular python is in many ways better at handling strings than numpy is (although it is quite a bit of work to define the operations, etc.). Anyway, that is relevant here only to the extent of whether this would be a final goal to have in mind.

@taldcroft
Member Author

@mhvk - I purposely started with the most simple implementation that has no knobs or configurability, expecting perhaps some pushback. But here is my reasoning:

  • On Python 2, numpy S-type arrays work just great.
  • Key binary formats like FITS and HDF5 only formally support ASCII string data, and the Python interfaces all read these into S-type arrays (for both Py2 and 3).
  • In Python 3, numpy S-type arrays are basically not useful. You have two options:
    • Start changing all your code to do the opposite of this PR: implement a bytestring sandwich with bytestring literals everywhere in order to talk to bytestring arrays. In practice this does not seem workable to me.
    • Convert to U type and suffer a factor of 4 memory bloat. For me personally this is a poor option.
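The factor-of-4 figure comes from numpy storing U-dtype strings as fixed-width UCS-4 (4 bytes per character) versus 1 byte per character for S:

```python
import numpy as np

# S stores 1 byte per character; U stores 4 (UCS-4), hence the 4x memory bloat.
s = np.array([b'spam', b'eggs'], dtype='S4')
u = s.astype('U4')
print(s.itemsize)   # 4 bytes per element
print(u.itemsize)   # 16 bytes per element -- a factor of 4 larger
```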

So that is why I think that the default in astropy Table should change. I would be surprised if anyone is using the first option because it is very fragile and turns into whack-a-bytes pretty quickly. (I actually tried this once for our production code). The second option would still work exactly the same with no breakage. But maybe I'm missing a way that people are using bytestring arrays in practice in Py3?

Another way to think about it is that this PR simply makes the S-type Column behave almost exactly the same in Py2 and Py3. That's pretty much what we want, since it enables a more seamless migration to Py3. Right now most Py2 code that deals with FITS or HDF5 tables with strings will not run on Py3 after going through 2to3.

We might consider turning your idea around and providing some kind of knob to retain the useless raw S-type Column in Py3. I would want to see genuine use cases / breakage scenarios before introducing code complexity to support something I think is already broken.

@taldcroft
Member Author

About the backend encoding, using ASCII is certainly negotiable. My main counterargument is that while ASCII is stricter and more proper, with UTF-8 you get added capability for free.

I'm a lot less worried about stuffing UTF-8 into bytestrings now that #5624 is merged: you can't do this without getting a warning. Basically I don't see a fundamental difference between truncation problems with UTF-8 and ASCII, but if you get a warning then it's not going unnoticed.
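A small illustration of why truncation is nastier for UTF-8 than for ASCII: numpy truncates silently on element assignment, and cutting a multi-byte sequence mid-character leaves bytes that cannot be decoded at all:

```python
import numpy as np

a = np.zeros(1, dtype='S3')
a[0] = 'éé'.encode('utf-8')   # 4 UTF-8 bytes squeezed into 3 -- numpy truncates silently
print(a[0])                    # b'\xc3\xa9\xc3' -- ends in the middle of a character
try:
    a[0].decode('utf-8')
except UnicodeDecodeError as exc:
    print('truncation split a multi-byte character:', exc)
```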

In the output layer (in pprint) we have always supported the possibility of UTF-8 encoded data in the bytestring, so why not make that more explicit? Again, this all only happens on Py3.

@taldcroft
Member Author

If people really want variable-length UTF-8 strings, then I think a numpy object array (of str) is the way to go. Basically all the numpy machinery is built in and it pretty much works AFAIK. We just need to provide a little front-end machinery, or even just documentation, to make that happen. Rolling our own would be a lot of work, as you noted.
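A quick sketch of that object-array approach: each element is a plain Python str, so there is no fixed width and no truncation:

```python
import numpy as np

# Variable-length strings via an object array: elements are ordinary str
# objects, so any length (and any unicode content) is preserved.
col = np.array(['a', 'much longer entry', '日本語'], dtype=object)
print(col[1])          # full string preserved, no fixed-width truncation
print(col == 'a')      # elementwise comparison still works: [ True False False]
```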

@taldcroft
Member Author

Another reason I'm not so excited about AsciiColumn(Column) is that we would also need MaskedAsciiColumn(MaskedColumn). Then you need some table-level attribute that says whether to create a normal or an Ascii column, and this needs to propagate out to every unified I/O table reader (unless we make new Table subclasses ATable and QATable, analogous to QTable, to control that). Maybe I'm missing a clean pathway, but that all just gets really ugly.

@mhvk
Contributor

mhvk commented Jan 14, 2017

@taldcroft - I think I'm coming around to your point of view. Since Column is already a subclass of ndarray, we can and should change behaviour to suit our needs. Inverting my logic of using a mixin, if one wants to actually have bytes, one can use the very simple ndarray mixin that you use in tests (and I think also have in the documentation).

As for utf-8 vs ascii, one option would be to leave the encoding as Column property (analogous to a unit perhaps). I do like the idea that by setting it to ascii, one would have python2 and python3 behave more similarly (always getting str back, etc.).

Anyway, next for me is probably to look more closely at the implementation!

@mhvk
Contributor

mhvk commented Jan 14, 2017

Pushing the idea of a bytes column having an encoding a little further: could this be None to indicate a pure byte column? Though perhaps this is just making things too complicated -- as you say, maybe it is best to first know whether anyone even uses S columns to store bytes...

@hamogu
Member

hamogu commented Jan 15, 2017

Does anyone know if there are plans in numpy to address this, e.g. a new dtype, or accepting an encoding as part of a dtype, or something else?
We might want to check with numpy-dev or look through their issue list to see if there are any discussions in this direction. We don't want to deviate from numpy now if they are planning to provide a way to deal with this in the foreseeable future.

@taldcroft
Member Author

In terms of numpy-dev discussion on the topic, what I am aware of is 2 to 3 years old. I was an active participant and pushed hard for this.

The upshot is that there was general agreement that a new numpy one-byte character dtype was a good idea. This dtype (tentatively called s) would be similar to U but using either ascii or latin1 encoding. Where this ended, unfortunately, is that I spent a day investigating how to actually implement this and found that digging into numpy internals is very daunting for a C newbie. So I dropped it and nobody else has stepped up AFAIK. Also there was some announcement from Travis that his team was starting some initiative to overhaul the dtype system. Don't know if anything happened on that.

However, the point is that this PR essentially implements what was agreed to in the numpy threads from the user perspective. Of course if there was a numpy s type then this PR would be much simpler, likely consisting of just automatically changing any Py3 S Column into s via astype. That could also (and likely would) happen closer to the metal, i.e. in io.fits and HDF5 Python libraries.

But in the meantime we are still supporting a 4-year-old numpy release (1.7), so even if numpy does add an s type we might want this shim to make astropy behave (mostly) consistently across numpy versions. I was thinking about proposing the project for numpy as a GSoC project.

@mhvk
Contributor

mhvk commented Feb 9, 2017

@taldcroft - I looked again at this as I saw your note about it: I do now think it is a very good idea. In principle, I still feel it would be even better if one could set the encoding and it seems this would need little more than move _encode_str_utf8(value) to a method _encode_str(self, value) (which gets its encoding from self.encoding or so). But it is something that can be done in follow-up as well, so this does not have to hold up the PR.
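A minimal sketch of the suggested refactor, under the assumption it applies cleanly: the hard-wired `_encode_str_utf8(value)` helper becomes a method that reads the encoding from an attribute, so a follow-up can make it configurable. All names here are illustrative, not astropy's actual implementation:

```python
class EncodedColumn:
    """Hypothetical column-like class showing the _encode_str method idea."""
    encoding = 'utf-8'   # could later be 'ascii', or None for a raw bytes column

    def _encode_str(self, value):
        # Encode str values using the instance's encoding; pass bytes through.
        if isinstance(value, str):
            return value.encode(self.encoding)
        return value

col = EncodedColumn()
print(col._encode_str('héllo'))   # b'h\xc3\xa9llo'
col.encoding = 'ascii'
print(col._encode_str('hello'))   # b'hello'
```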

@mhvk
Contributor

mhvk commented Feb 9, 2017

One detail: I just noticed numpy/numpy#8592, which removes __setslice__ and __getslice__ from numpy, since they should not be needed anymore for Python >=2.7. But here you noted that they still do get called. Any idea in which context this happens?

@eric-wieser

eric-wieser commented Feb 9, 2017

@mhvk: __setslice__ will still be called in 2.7 if it exists, I think.

@mhvk
Contributor

mhvk commented Feb 9, 2017

Ah, I see, so since MaskedArray defines it, we should override it. Makes sense. At least it can go for astropy 3.0!

@eric-wieser

Yep, and ditto for ndarray defining it, I think

@embray
Member

embray commented Mar 30, 2017

Ah, glad to see this finally happening. Anything I can do to help advance this PR?

@embray
Member

embray commented Mar 30, 2017

To be clear, this will be a significant API change because if people have been using bytestring literals in their Python 3 code, then that will break. But that is definitely the wrong way to do it. Astropy 2.0 seems like a nice opportunity for this API change.

Maybe this has already been suggested (I haven't read the full thread yet) but in theory we could actually make this type return a custom subclass of str that can handle comparison with bytes objects. There are many examples of this in the wild.
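One possible shape for such a subclass, as a hypothetical sketch (not an API that exists in astropy): equality falls back to comparing against the UTF-8 encoding when the other operand is bytes:

```python
class StrWithBytesEq(str):
    """str subclass that also compares equal to its UTF-8 encoded bytes."""
    def __eq__(self, other):
        if isinstance(other, bytes):
            return self.encode('utf-8') == other
        return str.__eq__(self, other)

    # Overriding __eq__ clears __hash__; restore it so the type stays hashable.
    __hash__ = str.__hash__

s = StrWithBytesEq('value')
print(s == b'value')   # True -- bytes comparison now works
print(s == 'value')    # True -- normal str comparison unchanged
```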

@taldcroft
Member Author

@embray - good to hear from you! As for moving this forward, I got sidetracked in a big way with my day job but I think I have time to get back to this now. I think that the main open action was just documentation. So once I do that then doing a review would be a great help. (Or code review even now would be super if you have some time).

As for the enhanced str object, I would like to try to get this merged in roughly its present form and then we could think about making it better.

@mhvk
Contributor

mhvk commented Mar 30, 2017

Great! My only comment was that I think _encode_str_utf8 should be a method _encode_str; for now, fine to always let the encoding be utf8.

@taldcroft
Member Author

Rebased again and added What's New:
[screenshot of the What's New entry]

@taldcroft
Member Author

I added a couple of follow-up issues to track some open thoughts from discussions here and allow this PR to be merged as-is.

@mhvk
Contributor

mhvk commented May 26, 2017

Maybe I'm overthinking, but looking at #6117, I still wonder if we're not better off giving some way to set the decoding/encoding, with the default being None for Py2 and lambda x: x.decode('utf-8') (and similarly for encode) for Py3. That can then include what to do with decoding errors. (But I don't like my concrete suggested implementation...)

@taldcroft
Member Author

I think you are over-designing it. I'm really in favor of starting with the simplest implementation and letting the users lead us to necessary API enhancements based on actual need. What you've suggested is something that might come up, but maybe nobody needs it? In any case I would really like to merge this PR as-is and take that incremental approach from here. Perhaps you can convince me post-5700 (and pre 2.0) of an improved interface, maybe with a PR? 😄

@mhvk
Contributor

mhvk commented May 30, 2017

@taldcroft - a bit slow to reply - yes, it's fine to go incrementally. One possibility then would be to note in the changelog that if comparison with str instead of bytes is a problem (and bytes really is more logical), users should raise an issue rather than necessarily change their code.

Anyway, I approved the PR already a while back, so fine also to just leave it as is.

@taldcroft
Member Author

OK, merged, with pending follow-ups #6121, #6122, #6138.

@astrofrog - I think I addressed your review issues so can you approve for the record?

@mhvk
Contributor

mhvk commented May 30, 2017

@taldcroft - you've probably seen it already, but the combination of this and #6117 seems to have led to a broken master. At least, the test failures for python3 only in #6137 seem unrelated to my to_value changes...

@saimn
Contributor

saimn commented May 31, 2017

Yes it's broken with the combination of the test added in #6117 and the decode added here in _column_mixins.pyx.

@taldcroft
Member Author

Oh dear. Will look at this later today.

@taldcroft
Member Author

taldcroft commented Jul 24, 2017

Epilogue: I just spent an hour chasing down a problem deep in a complicated analysis notebook where running in Py3 gave different answers from Py2. Answer: I was reading in a table from HDF5 and comparing a bytes column to a str literal was giving False, and I had not yet updated Astropy in that environment. Using Astropy 2.0 it just worked out of the box, duh!
