
Table unicode sandwich - make 'S' type useful in Python 3 #5700

Merged
16 commits merged on May 30, 2017

Conversation

taldcroft
Member

Currently the numpy bytestring (S) dtype is difficult to use in Python 3 because one cannot assign or compare with the natural str type. When dealing with FITS, HDF5, or most other binary tabular data formats, text data will be read into a table that has S-type columns.

This led to a workaround (#1974) around 3 years ago that makes it easy to convert S columns in a table to unicode. That lets you write natural code that works, but you still pay a big price because memory use inflates by a factor of 4 when going to the numpy U type.

Now taking inspiration from the work that @embray did at https://github.com/embray/PyFITS/blob/stringy/lib/pyfits/fitsrec.py#L999, this PR implements the idea of the unicode sandwich for bytestring Table columns (see http://nedbatchelder.com/text/unipain.html for discussion of the sandwich concept).

In Python 3 the values are always decoded as UTF-8 when being accessed, and always encoded as UTF-8 when being set. The upshot is that bytestring columns have the same behavior in Python 2 and 3 with respect to natural usage with the default str type.
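The sandwich behavior described above can be sketched with plain numpy; the helper names below are illustrative, not astropy's actual implementation:

```python
import numpy as np

def sandwich_get(item):
    """Decode bytes to str on access (hypothetical helper illustrating the idea)."""
    if isinstance(item, bytes):
        return item.decode('utf-8')
    if isinstance(item, np.ndarray) and item.dtype.kind == 'S':
        return np.char.decode(item, 'utf-8')
    return item

def sandwich_set(value):
    """Encode str to bytes before storing (the other slice of the sandwich)."""
    if isinstance(value, str):
        return value.encode('utf-8')
    return value

col = np.array([b'alpha', b'beta'])   # S-type column data
print(sandwich_get(col[0]))           # 'alpha' -- a str, not bytes
col[1] = sandwich_set('gamma')
print(sandwich_get(col[1]))           # 'gamma'
```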

One important feature is that this PR takes pains to not do anything differently for Python 2. Basically all the code only applies to Python 3.

To be clear, this will be a significant API change because if people have been using bytestring literals in their Python 3 code, then that will break. But that is definitely the wrong way to do it. Astropy 2.0 seems like a nice opportunity for this API change.

@taldcroft
Member Author

About breakage, what will now break is a comparison like t['col'][10] == b'value'. But mask = t['col'] == b'value' will work, as will assignment.
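The scalar case breaks because of a core Python 3 rule: bytes and str never compare equal, so once a value is decoded to str it no longer matches a bytes literal:

```python
# Python 3 semantics: bytes and str are distinct types that never compare equal.
print('value' == b'value')   # False -- this is what breaks t['col'][10] == b'value'
print('value' == 'value')    # True  -- comparing with a str literal works
```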

@taldcroft
Member Author

Did some more work on this:

  • Added testing for the Py3-only use case of actually setting with non-ASCII text. In other words you can properly store and work with unicode text in a bytestring Column. This only works on Py3. I played around with this on Py2 but everything broke, so I decided that Py2 is legacy and we shall move forward with Py3!
  • Improved testing to cover MaskedColumn.
  • Rebased from master with string truncation warnings. This gives a warning if you set to multi-character unicode that doesn't fit.

@mhvk
Contributor

mhvk commented Jan 14, 2017

I like the idea of having a well-defined ascii string column type, but have both a general question and a specific one:

The general one is whether, instead of overriding the meaning of a regular S-dtype Column (making it deviate from what ndarray does, possibly breaking code, and removing the option to have an actual byte-string column), it might be better to have a new AsciiColumn(Column) subclass that implements what you have here. It does mean adapting code in io, etc., to use this, but it might end up being cleaner (and possibly work for both Python 2 and 3).

The specific one is about encoding as utf8 -- this means that the byte entries can/will hold variable numbers of characters. If so, is this really wanted? Maybe it is better to use ascii for the encoding? (decoding can of course be utf-8).
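The variable-width point can be seen directly: UTF-8 spends between one and four bytes per character, so a fixed-width S column holds a varying number of characters once UTF-8 encoded:

```python
# UTF-8 uses a variable number of bytes per character, so the number of
# characters that fit in a fixed-width byte field depends on the text.
print(len('abc'.encode('utf-8')))    # 3 bytes for 3 ASCII characters
print(len('héllo'.encode('utf-8')))  # 6 bytes for 5 characters
print(len('日本語'.encode('utf-8')))  # 9 bytes for 3 characters
```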

Also, more generally, for actual utf-8 strings, we could perhaps consider making a new mixin class (based on list). Regular python is in many ways better at handling strings than numpy is (although it is quite a bit of work to define the operations, etc.). Anyway, that is relevant here only to the extent of whether this would be a final goal to have in mind.

@taldcroft
Member Author

@mhvk - I purposely started with the most simple implementation that has no knobs or configurability, expecting perhaps some pushback. But here is my reasoning:

  • On Python 2, numpy S-type arrays work just great.
  • Key binary formats like FITS and HDF5 only formally support ASCII string data, and the Python interfaces all read these into S-type arrays (for both Py2 and 3).
  • In Python 3, numpy S-type arrays are basically not useful. You have two options:
    • Start changing all your code to do the opposite of this PR: implement a bytestring sandwich with bytestring literals everywhere in order to talk to bytestring arrays. In practice this does not seem workable to me.
    • Convert to U type and suffer a factor of 4 memory bloat. For me personally this is a poor option.
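The factor-of-4 figure comes from numpy storing U-dtype strings as fixed-width UCS-4 (4 bytes per character) versus 1 byte per character for S:

```python
import numpy as np

# S stores 1 byte per character; U stores 4 (UCS-4), hence the 4x memory bloat.
s = np.array([b'spam', b'eggs'], dtype='S4')
u = s.astype('U4')
print(s.itemsize)   # 4 bytes per element
print(u.itemsize)   # 16 bytes per element -- a factor of 4 larger
```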

So that is why I think that the default in astropy Table should change. I would be surprised if anyone is using the first option because it is very fragile and turns into whack-a-bytes pretty quickly. (I actually tried this once for our production code). The second option would still work exactly the same with no breakage. But maybe I'm missing a way that people are using bytestring arrays in practice in Py3?

Another way to think about it is that this PR simply makes the S-type Column behave almost exactly the same in Py2 and Py3. That's pretty much what we want, since it enables a more seamless migration to Py3. Right now most Py2 code that deals with FITS or HDF5 tables with strings will not run on Py3 after going through 2to3.

We might consider turning your idea around and providing some kind of knob to retain the useless raw S-type Column in Py3. I would want to see genuine use cases / breakage scenarios before introducing code complexity to support something I think is already broken.

@taldcroft
Member Author

About the backend encoding, using ASCII is certainly negotiable. My main counterargument is that while ASCII is stricter and more proper, with UTF-8 you get added capability for free.

I'm a lot less worried about stuffing UTF-8 into bytestrings now that #5624 is merged: you can't do this without getting a warning. Basically I don't see a fundamental difference between truncation problems with UTF-8 and ASCII, but if you get a warning then it's not going unnoticed.
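A small illustration of why truncation is nastier for UTF-8 than for ASCII: numpy truncates silently on element assignment, and cutting a multi-byte sequence mid-character leaves bytes that cannot be decoded at all:

```python
import numpy as np

a = np.zeros(1, dtype='S3')
a[0] = 'éé'.encode('utf-8')   # 4 UTF-8 bytes squeezed into 3 -- numpy truncates silently
print(a[0])                    # b'\xc3\xa9\xc3' -- ends in the middle of a character
try:
    a[0].decode('utf-8')
except UnicodeDecodeError as exc:
    print('truncation split a multi-byte character:', exc)
```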

In the output layer (in pprint) we have always supported the possibility of UTF-8 encoded data in the bytestring, so why not make that more explicit? Again, this all only happens on Py3.

@taldcroft
Member Author

If people really want variable-length UTF-8 strings, then I think a numpy object array (of str) is the way to go. Basically all the numpy machinery is built in and it pretty much works AFAIK. We just need to provide a little front-end machinery, or even just documentation, to make that happen. Rolling our own would be a lot of work, as you noted.
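A quick sketch of that object-array approach: each element is a plain Python str, so there is no fixed width and no truncation:

```python
import numpy as np

# Variable-length strings via an object array: elements are ordinary str
# objects, so any length (and any unicode content) is preserved.
col = np.array(['a', 'much longer entry', '日本語'], dtype=object)
print(col[1])          # full string preserved, no fixed-width truncation
print(col == 'a')      # elementwise comparison still works: [ True False False]
```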

@taldcroft
Member Author

Another reason I'm not so excited about AsciiColumn(Column) is that we would also need MaskedAsciiColumn(MaskedColumn). Then you need some table-level attribute that says whether to create a normal or an Ascii column, and this needs to propagate out to every unified I/O table reader (unless we make new Table subclasses ATable and QATable, analogous to QTable, to control that). Maybe I'm missing a clean pathway, but that all just gets really ugly.

@mhvk
Contributor

mhvk commented Jan 14, 2017

@taldcroft - I think I'm coming around to your point of view. Since Column is already a subclass of ndarray, we can and should change behaviour to suit our needs. Inverting my logic of using a mixin, if one wants to actually have bytes, one can use the very simple ndarray mixin that you use in tests (and I think also have in the documentation).

As for utf-8 vs ascii, one option would be to leave the encoding as Column property (analogous to a unit perhaps). I do like the idea that by setting it to ascii, one would have python2 and python3 behave more similarly (always getting str back, etc.).

Anyway, next for me is probably to look more closely at the implementation!

@mhvk
Contributor

mhvk commented Jan 14, 2017

Pushing the idea of a bytes column having an encoding a little further: could this be None to indicate a pure byte column? Though perhaps this is just making things too complicated -- as you say, maybe it is best to first know whether anyone even uses S columns to store bytes...

@hamogu
Member

hamogu commented Jan 15, 2017

Does anyone know if there are plans in numpy to address this, e.g. a new dtype, or accepting an encoding as part of a dtype, or something else?
We might want to check with numpy-dev or look through their issue list to see if there are any discussions in this direction. We don't want to deviate from numpy now if they are planning to provide a way to deal with this in the foreseeable future.

@taldcroft
Member Author

In terms of numpy-dev discussion on the topic, what I am aware of is 2 to 3 years old. I was an active participant and pushed hard for this.

The upshot is that there was general agreement that a new numpy one-byte character dtype was a good idea. This dtype (tentatively called s) would be similar to U but using either ascii or latin1 encoding. Where this ended, unfortunately, is that I spent a day investigating how to actually implement this and found that digging into numpy internals is very daunting for a C newbie. So I dropped it and nobody else has stepped up AFAIK. Also there was some announcement from Travis that his team was starting some initiative to overhaul the dtype system. Don't know if anything happened on that.

However, the point is that this PR essentially implements what was agreed to in the numpy threads from the user perspective. Of course if there was a numpy s type then this PR would be much simpler, likely consisting of just automatically changing any Py3 S Column into s via astype. That could also (and likely would) happen closer to the metal, i.e. in io.fits and HDF5 Python libraries.

But in the meantime we are still supporting a 4-year-old numpy release (1.7), so even if numpy does add an s type we might want this shim to make astropy behave (mostly) consistently across numpy versions. I was thinking about proposing the project for numpy as a GSoC project.

@mhvk
Contributor

mhvk commented Feb 9, 2017

@taldcroft - I looked again at this as I saw your note about it: I do now think it is a very good idea. In principle, I still feel it would be even better if one could set the encoding and it seems this would need little more than move _encode_str_utf8(value) to a method _encode_str(self, value) (which gets its encoding from self.encoding or so). But it is something that can be done in follow-up as well, so this does not have to hold up the PR.
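A minimal sketch of the suggested refactor, under the assumption it applies cleanly: the hard-wired `_encode_str_utf8(value)` helper becomes a method that reads the encoding from an attribute, so a follow-up can make it configurable. All names here are illustrative, not astropy's actual implementation:

```python
class EncodedColumn:
    """Hypothetical column-like class showing the _encode_str method idea."""
    encoding = 'utf-8'   # could later be 'ascii', or None for a raw bytes column

    def _encode_str(self, value):
        # Encode str values using the instance's encoding; pass bytes through.
        if isinstance(value, str):
            return value.encode(self.encoding)
        return value

col = EncodedColumn()
print(col._encode_str('héllo'))   # b'h\xc3\xa9llo'
col.encoding = 'ascii'
print(col._encode_str('hello'))   # b'hello'
```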

@mhvk
Contributor

mhvk commented Feb 9, 2017

One detail: I just noticed numpy/numpy#8592, which removes __setslice__ and __getslice__ from numpy, since they should not be needed anymore for Python >=2.7. But here you noted that they still do get called. Any idea in which context this happens?

@eric-wieser

eric-wieser commented Feb 9, 2017

@mhvk: __setslice__ will still be called in 2.7 if it exists, I think.

@mhvk
Contributor

mhvk commented Feb 9, 2017

Ah, I see, so since MaskedArray defines it, we should override it. Makes sense. At least it can go for astropy 3.0!

@eric-wieser

Yep, and ditto for ndarray defining it, I think

@embray
Member

embray commented Mar 30, 2017

Ah, glad to see this finally happening. Anything I can do to help advance this PR?

@embray
Member

embray commented Mar 30, 2017

To be clear, this will be a significant API change because if people have been using bytestring literals in their Python 3 code, then that will break. But that is definitely the wrong way to do it. Astropy 2.0 seems like a nice opportunity for this API change.

Maybe this has already been suggested (I haven't read the full thread yet) but in theory we could actually make this type return a custom subclass of str that can handle comparison with bytes objects. There are many examples of this in the wild.
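One possible shape for such a subclass, as a hypothetical sketch (not an API that exists in astropy): equality falls back to comparing against the UTF-8 encoding when the other operand is bytes:

```python
class StrWithBytesEq(str):
    """str subclass that also compares equal to its UTF-8 encoded bytes."""
    def __eq__(self, other):
        if isinstance(other, bytes):
            return self.encode('utf-8') == other
        return str.__eq__(self, other)

    # Overriding __eq__ clears __hash__; restore it so the type stays hashable.
    __hash__ = str.__hash__

s = StrWithBytesEq('value')
print(s == b'value')   # True -- bytes comparison now works
print(s == 'value')    # True -- normal str comparison unchanged
```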

@taldcroft
Member Author

@embray - good to hear from you! As for moving this forward, I got sidetracked in a big way with my day job but I think I have time to get back to this now. I think that the main open action was just documentation. So once I do that then doing a review would be a great help. (Or code review even now would be super if you have some time).

As for the enhanced str object, I would like to try to get this merged in roughly its present form and then we could think about making it better.

@mhvk
Contributor

mhvk commented Mar 30, 2017

Great! My only comment was that I think _encode_str_utf8 should be a method _encode_str; for now, fine to always let the encoding be utf8.

@taldcroft
Member Author

Rebased again and added What's New:
[screenshot of the What's New entry]

@taldcroft
Member Author

I added a couple of follow-up issues to track some open thoughts from discussions here and allow this PR to be merged as-is.

@mhvk
Contributor

mhvk commented May 26, 2017

Maybe I'm overthinking, but looking at #6117, I still wonder if we're not better off giving some way to set the decoding/encoding, with the default being None for Py2 and lambda x: x.decode('utf-8') (and similarly for encode) for Py3. That can then include what to do with decoding errors. (But I don't like my concrete suggested implementation...)

@taldcroft
Member Author

I think you are over-designing it. I'm really in favor of starting with the simplest implementation and letting the users lead us to necessary API enhancements based on actual need. What you've suggested is something that might come up, but maybe nobody needs it? In any case I would really like to merge this PR as-is and take that incremental approach from here. Perhaps you can convince me post-5700 (and pre 2.0) of an improved interface, maybe with a PR? 😄

@mhvk
Contributor

mhvk commented May 30, 2017

@taldcroft - a bit slow to reply - yes, it's fine to go incrementally. One possibility then would be to note in the changelog that if comparison with str instead of bytes is a problem (and bytes really is more logical), users should raise an issue rather than necessarily change their code.

Anyway, I approved the PR already a while back, so fine also to just leave it as is.

@taldcroft
Member Author

OK, merged, with pending follow-ups #6121, #6122, #6138.

@astrofrog - I think I addressed your review issues so can you approve for the record?

@mhvk
Contributor

mhvk commented May 30, 2017

@taldcroft - you've probably seen it already, but the combination of this and #6117 seems to have led to a broken master. At least, the test failures for python3 only in #6137 seem unrelated to my to_value changes...

@saimn
Contributor

saimn commented May 31, 2017

Yes it's broken with the combination of the test added in #6117 and the decode added here in _column_mixins.pyx.

@taldcroft
Member Author

Oh dear. Will look at this later today.

@taldcroft
Member Author

taldcroft commented Jul 24, 2017

Epilogue: I just spent an hour chasing down a problem deep in a complicated analysis notebook where running in Py3 gave different answers from Py2. Answer: I was reading in a table from HDF5 and comparing a bytes column to a str literal was giving False, and I had not yet updated Astropy in that environment. Using Astropy 2.0 it just worked out of the box, duh!
