Don't indicate byte order for UTF-16/32 with given BOM #73
This is fundamentally backwards incompatible. People may already be relying on just such a wrapper as you wrote, and any special casing besides the simple calls you've outlined may be lost on an else, or they may start receiving exceptions if their else raises an exception. I would rather document this workaround and the bug, and mark it to be fixed in a major release, than accept this and ship it in a release that semantic versioning doesn't mark as backwards incompatible.
Wow, this is news to me. It honestly doesn't make any sense to me that Python I mean, look at this:
Why in the world would you want
…ity with decode()
As for the backward-incompatible nature of this change, this might be something we should handle with the configuration object proposed in #71. There are really two fundamentally different use cases for chardet:
This feels like a Python bug to me, so I filed an issue.
@sigmavirus24: Almost all code I've seen so far, that uses And documenting that behavior (instead of fixing it), would not only leave everybody who doesn't adapt for that issue, with broken behavior. But also makes
@dan-blanchard: I think the way Python deals with endianness makes absolute sense. Note that this way Moreover, there is no reason to indicate the byte order in the protocols, if it's already given in the data. Therefore, regardless of compatibility with
Sorry if this comes across as snarky, but if nobody would mind explicitly stripping the BOM, then I don't see the point of your PR.
Okay, after looking through the Unicode FAQ (after being sent there by some Python core devs), it appears our current behavior is incorrect, and we should do things as in this PR, because the
Furthermore it goes on to say that:
@sigmavirus24, this is truly an error on our part, as the old behavior was reporting the wrong codec.
Also, the test failures are unrelated to the changes here, as we've still got a few lingering failures.
Interesting @dan-blanchard. I didn't know that to be very frank. Are we planning on doing a 2.0.0 release any time soon? I'd be happy to include this there.
Yes, a 3.0.0 release is definitely going to be the next one (we're currently at 2.3.0). I'm still trying to fix up everything that was in #52 in the |
Anyway, I'm going to merge this since our next release will be a major one.
Don't indicate byte order for UTF-16/32 with given BOM, since this is against the Unicode spec.
Thanks for the PR and bringing this to our attention @snoack!
If passed a string starting with \xff\xfe (little endian byte order mark) or \xfe\xff (big endian byte order mark), the encoding is detected as UTF-16LE, UTF-32LE, UTF-16BE or UTF-32BE respectively.

However, as the byte order mark is given in the string, the encoding should simply be UTF-16 or UTF-32. Otherwise bytes.decode() will fail or preserve the byte order mark.

Hence code that uses chardet to detect the encoding in order to decode data would need to wrap chardet.detect in an inconvenient and counter-intuitive way.

This PR changes the behavior to return simply UTF-16 or UTF-32 respectively when a byte order mark is found, so that the detected encoding can be passed unchanged to bytes.decode().
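
To make the difference concrete, here is a minimal sketch using only the standard library. It shows how bytes.decode() treats a leading byte order mark under an endian-specific codec versus the BOM-aware one, and what a caller-side workaround might look like. The helper name bom_aware_encoding is hypothetical: it is not part of chardet and is not the wrapper from this PR, just an illustration of the kind of special casing callers would otherwise need.

```python
import codecs


def bom_aware_encoding(data, detected):
    """Hypothetical helper (an assumption, not a chardet API).

    If `detected` is an endian-specific UTF-16/32 name and `data` starts
    with the matching family's BOM, return the BOM-aware codec name.
    Otherwise return `detected` unchanged.
    """
    name = (detected or "").upper()
    utf16_boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
    utf32_boms = (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
    if name in ("UTF-16LE", "UTF-16BE") and data.startswith(utf16_boms):
        return "UTF-16"
    if name in ("UTF-32LE", "UTF-32BE") and data.startswith(utf32_boms):
        return "UTF-32"
    return detected


# Sample input: a little endian UTF-16 BOM followed by "hi".
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Endian-specific codec: the BOM is decoded as a U+FEFF character and kept.
kept = data.decode("utf-16-le")

# BOM-aware codec: the BOM selects the byte order and is consumed.
stripped = data.decode("utf-16")

# The workaround: normalize the detected name before decoding.
text = data.decode(bom_aware_encoding(data, "UTF-16LE"))
```

With the behavior proposed in this PR, the normalization step becomes unnecessary, because the detected name could be handed to bytes.decode() directly.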