Table unicode sandwich - make 'S' type useful in Python 3 #5700
Conversation
About breakage, what will now break is a comparison of a column against a bytestring literal.
(force-pushed from e8e3bd7 to 3fe2528)
Did some more work on this.
I like the idea of having a well-defined ascii string column type, but have both a general question and a specific one. The general one is whether, instead of overriding the meaning of a regular `S` column, this should be a separate column type. The specific one is about the choice of encoding: should it be strict ASCII rather than UTF-8? Also, more generally, what about actual variable-length unicode strings?
@mhvk - I purposely started with the most simple implementation that has no knobs or configurability, expecting perhaps some pushback. But here is my reasoning: in Python 3 there are currently two ways to use bytestring columns — (1) use `bytes` literals everywhere when comparing or setting values, or (2) convert the columns to the numpy `U` type up front.
So that is why I think that the default in astropy Table should change. I would be surprised if anyone is using the first option because it is very fragile and turns into whack-a-bytes pretty quickly (I actually tried this once for our production code). The second option would still work exactly the same with no breakage. But maybe I'm missing a way that people are using bytestring arrays in practice in Py3? Another way to think about it is that this PR simply makes the `S` type in Python 3 behave the way it always has in Python 2. We might consider turning your idea around and providing some kind of knob to retain the useless raw `bytes` behavior for anyone who wants it.
About the backend encoding, using ASCII is certainly negotiable. My main complaint there is that while using ASCII is certainly more strict and proper, if you use UTF-8 then you get added capability for free. I'm a lot less worried about stuffing UTF-8 into bytestrings now that #5624 is merged: you can't do this without getting a warning. Basically I don't see a fundamental difference between truncation problems with UTF-8 and ASCII, but if you get a warning then it's not going unnoticed. In the output layer (in pprint) we have always supported the possibility of UTF-8 encoded data in the bytestring, so why not make that more explicit. Again, this all only happens on Py3.
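To illustrate the "added capability for free" point, here is a minimal sketch (plain Python, independent of astropy): UTF-8 round-trips non-ASCII text losslessly, while a strict ASCII codec rejects it outright, so the failure mode is explicit either way.

```python
s = 'café'                        # one non-ASCII character
b = s.encode('utf-8')             # 5 bytes for 4 characters
assert b.decode('utf-8') == s     # lossless round trip

try:
    s.encode('ascii')             # strict ASCII cannot represent 'é'
except UnicodeEncodeError:
    print('strict ASCII rejects non-ASCII text')
```

Either choice surfaces problems loudly; UTF-8 just additionally handles the non-ASCII case.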
If people really want variable-length UTF-8 strings, then I think an array of numpy objects (of str) is the way to go. Basically all the numpy machinery is built-in and it pretty much works AFAIK. We just need to provide a little front-end machinery or even documentation to make that happen. Rolling our own would be a lot of work as you noted.
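A minimal sketch of the object-array approach mentioned above (plain numpy, no astropy machinery; the values are illustrative):

```python
import numpy as np

# dtype=object stores references to Python str objects,
# so element lengths can vary freely with no fixed itemsize
col = np.array(['x', 'a much longer string'], dtype=object)
assert isinstance(col[0], str)

col[0] = 'no truncation here'    # nothing is cut off on assignment
assert col[0] == 'no truncation here'
```

The trade-off is that object arrays lose the contiguous fixed-width memory layout of `S`/`U` columns.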
Another reason I'm not so excited about that option.
@taldcroft - I think I'm coming around to your point of view. Anyway, next for me is probably to look more closely at the implementation!
Pushing the idea of a bytes column having an encoding a little further: could this be generalized so that the encoding is settable per column?
Does anyone know if there are plans in numpy to address this, e.g. a new dtype, accepting any encoding as a dtype, or something else?
In terms of numpy-dev discussion on the topic, what I am aware of is 2 to 3 years old. I was an active participant and was pushing hard for it.
The upshot is that there was general agreement that a new numpy one-byte character dtype was a good idea. However, the point is that this PR essentially implements what was agreed to in the numpy threads, from the user perspective. Of course, if numpy had such a dtype we could build on it. But in the meantime we are still supporting a 4-year-old numpy release (1.7), so even if numpy does add one, we could not rely on it for a long time.
@taldcroft - I looked again at this as I saw your note about it: I do now think it is a very good idea. In principle, I still feel it would be even better if one could set the encoding, and it seems this would need little more than moving the hard-coded 'utf-8' into an attribute.
One detail: I just noticed numpy/numpy#8592.
@mhvk:
Ah, I see, so since
Yep, and ditto for
Ah, glad to see this finally happening. Anything I can do to help advance this PR?
Maybe this has already been suggested (I haven't read the full thread yet) but in theory we could actually make this type return a custom subclass of `bytes` that carries its own encoding.
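One way such a subclass might look (purely illustrative; the name `EncodedBytes` and the behavior shown are assumptions, not an API proposed in this thread):

```python
class EncodedBytes(bytes):
    """Hypothetical bytes subclass that knows its own encoding."""

    encoding = 'utf-8'  # assumed default; the thread discusses making this settable

    def __eq__(self, other):
        # Compare transparently against str by decoding first
        if isinstance(other, str):
            return self.decode(self.encoding) == other
        return bytes.__eq__(self, other)

    # Overriding __eq__ disables inherited hashing; restore it.
    # Note hash(EncodedBytes(b)) != hash(str), so dict lookups by str
    # would still miss -- one reason to keep such a design simple.
    __hash__ = bytes.__hash__


assert EncodedBytes(b'caf\xc3\xa9') == 'café'   # str comparison now works
assert EncodedBytes(b'abc') == b'abc'           # bytes comparison unchanged
```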
@embray - good to hear from you! As for moving this forward, I got sidetracked in a big way with my day job but I think I have time to get back to this now. I think that the main open action was just documentation. So once I do that then doing a review would be a great help. (Or code review even now would be super if you have some time.) As for the enhanced `bytes` subclass, that's worth considering.
Great! My only comment was a minor one about the subclass idea.
(force-pushed from 0a3b1c1 to 1ef6f8f)
I added a couple of follow-up issues to track some open thoughts from discussions here and allow this PR to be merged as-is. |
Maybe I'm overthinking, but looking at #6117, I still wonder if we're not better off giving some way to set the decoding/encoding, with the default being utf-8.
I think you are over-designing it. I'm really in favor of starting with the simplest implementation and letting users lead us to necessary API enhancements based on actual need. What you've suggested is something that might come up, but maybe nobody needs it? In any case I would really like to merge this PR as-is and take the incremental approach from here. Perhaps you can convince me post-#5700 (and pre-2.0) of an improved interface, maybe with a PR? 😄
@taldcroft - a bit slow to reply - yes, it's fine to go incrementally. One possibility then would be to indicate in the changelog that if comparison with `str` instead of `bytes` is a problem (and `bytes` really is more logical in some cases), users should raise an issue rather than necessarily change their code. Anyway, I approved the PR already a while back, so it's fine to just leave it as is.
OK, merged, with pending follow-ups #6121, #6122, #6138. @astrofrog - I think I addressed your review issues, so can you approve for the record?
@taldcroft - you've probably seen it already, but the combination of this and #6117 seems to have led to a broken master. At least, the Python 3-only test failures in #6137 seem unrelated to my changes.
Yes, it's broken by the combination of the test added in #6117 and the changes in this PR.
Oh dear. Will look at this later today.
Epilogue: I just spent an hour chasing down a problem deep in a complicated analysis notebook where running in Py3 gave different answers from Py2. Answer: I was reading in a table from HDF5 and comparing a bytes column to a `str` value.
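The failure mode described here can be reproduced with plain numpy (the column values below are illustrative): comparing an `S` element against a `str` silently never matches in Python 3, while decoding first restores the expected behavior.

```python
import numpy as np

col = np.array([b'obs1', b'obs2'], dtype='S4')  # e.g. as read from HDF5

# In Python 3, bytes never compare equal to str, so this
# "matches nothing" with no error raised
assert col[0] != 'obs1'

# Decoding the column first makes the natural comparison work
decoded = np.char.decode(col, 'utf-8')
assert np.any(decoded == 'obs1')
```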
Currently the numpy bytestring (`S`) dtype is difficult to use in Python 3 because one cannot assign or compare to the natural `str` type. When dealing with most FITS or HDF5 or other binary tabular data formats, text data will be read into a table that has `S` type columns.

This led to one workaround, #1974, around 3 years ago, to make it easy to convert `S` columns in a table to unicode. So you can write natural code that works, but you still pay a big price because memory use inflates by a factor of 4 when going to the numpy `U` type.

Now, taking inspiration from the work that @embray did at https://github.com/embray/PyFITS/blob/stringy/lib/pyfits/fitsrec.py#L999, this PR implements the idea of the unicode sandwich for bytestring Table columns (see http://nedbatchelder.com/text/unipain.html for discussion of the sandwich concept).

In Python 3 the values are always decoded as UTF-8 when being accessed, and always encoded as UTF-8 when being set. The upshot is that bytestring columns have the same behavior in Python 2 and 3 with respect to natural usage with the default `str` type.

One important feature is that this PR takes pains to not do anything differently for Python 2. Basically all the code only applies to Python 3.
To be clear, this will be a significant API change because if people have been using bytestring literals in their Python 3 code, then that will break. But that is definitely the wrong way to do it. Astropy 2.0 seems like a nice opportunity for this API change.
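Both the memory cost of the `U` workaround and the sandwich behavior can be sketched with plain numpy (this is not the astropy implementation, just an illustration with made-up column values):

```python
import numpy as np

# The U-type workaround: same text, 4 bytes per character instead of 1
b = np.array([b'ra', b'dec'], dtype='S3')
u = b.astype('U3')
print(b.itemsize, u.itemsize)  # 3 vs 12: the 4x memory inflation

# The unicode sandwich: bytes at the storage edges, str in the middle
view = np.char.decode(b, 'utf-8')    # decode on access
assert view[1] == 'dec'              # natural str comparison now works
out = np.char.encode(view, 'utf-8')  # encode again when writing back
assert out.dtype.kind == 'S'
```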