
Don't indicate byte order for UTF-16/32 with given BOM #73

Merged
merged 1 commit into from
Oct 6, 2015

Conversation

snoack
Contributor

@snoack snoack commented Oct 6, 2015

If passed a string starting with \xff\xfe (little-endian byte order mark) or \xfe\xff (big-endian byte order mark), the encoding is detected as UTF-16LE, UTF-32LE, UTF-16BE or UTF-32BE, respectively.

However, since the byte order mark is already given in the string, the detected encoding should simply be UTF-16 or UTF-32. Otherwise bytes.decode() will fail or preserve the byte order mark:

import codecs
import chardet

s = 'foo'.encode('UTF-16')
encoding = chardet.detect(s)['encoding']  # "UTF-16LE"
s.decode(encoding)                        # "\ufefffoo"

s = codecs.BOM_BE + 'foo'.encode('UTF-16BE')
encoding = chardet.detect(s)['encoding']  # "UTF-16BE"
s.decode(encoding)                        # "\ufefffoo"

Hence, code that uses chardet to determine the encoding before decoding data would need to wrap chardet.detect in the following inconvenient and counter-intuitive way:

encoding = chardet.detect(enc)['encoding']
if encoding in ('UTF-16LE', 'UTF-16BE'):
  dec = enc.decode('UTF-16')
elif encoding in ('UTF-32LE', 'UTF-32BE'):
  dec = enc.decode('UTF-32')
else:
  dec = enc.decode(encoding)

This PR changes the behavior to return simply UTF-16 or UTF-32, respectively, when a byte order mark is found, so that the detected encoding can be passed unchanged to bytes.decode().
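The change can be illustrated with a small stdlib-only sketch. The helper name bom_encoding is hypothetical (it is not chardet's actual API); it only mirrors the behavior this PR proposes:

```python
import codecs

def bom_encoding(data):
    """Return the generic codec name if data begins with a UTF-16/32 BOM.

    Hypothetical helper mirroring this PR's behavior, not chardet's API.
    The UTF-32-LE BOM (ff fe 00 00) begins with the UTF-16-LE BOM (ff fe),
    so the four-byte marks must be tested first.
    """
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'UTF-32'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'UTF-16'
    return None

# The generic name feeds straight into decode(), which strips the BOM:
data = 'foo'.encode('UTF-16')
assert data.decode(bom_encoding(data)) == 'foo'
```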

@snoack snoack changed the title Don't indicate byte order for UTF-16/32 with given BOM for compatibility with decode() Don't indicate byte order for UTF-16/32 with given BOM Oct 6, 2015
@sigmavirus24
Member

This is fundamentally backwards incompatible. People may already be relying on just such a wrapper as the one you wrote; any special casing beyond the simple calls you've outlined may be lost in an else branch, or they may start receiving exceptions if their else raises one.

I would rather document this workaround and the bug, and mark it to be fixed in a major release, than accept this change and ship it in a release that isn't supposed to be backwards incompatible under semantic versioning.

@dan-blanchard
Member

Wow, this is news to me. It honestly doesn't make any sense to me that Python bytes.decode and str.encode work this way with UTF-16 and UTF-32.

I mean, look at this:

In [7]: 'foo'.encode('UTF-16le')
Out[7]: 'f\x00o\x00o\x00'

In [8]: 'foo'.encode('UTF-16')
Out[8]: '\xff\xfef\x00o\x00o\x00'

In [9]: 'foo'.encode('UTF-16be')
Out[9]: '\x00f\x00o\x00o'

Why in the world would you want UTF-16 to add the BOM when encoding, but not UTF-16LE or UTF-16BE? This is especially odd when you consider that the BOM-specific codec for UTF-8, UTF-8-SIG, removes the BOM when decoding, whereas the basic version doesn't:

In [16]: (codecs.BOM_UTF8 + 'foo'.encode('UTF-8')).decode('utf-8-sig')
Out[16]: u'foo'

In [17]: (codecs.BOM_UTF8 + 'foo'.encode('UTF-8')).decode('utf-8')
Out[17]: u'\ufefffoo'
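For completeness, the asymmetry shows up on the encode side too; a quick stdlib check (Python 3 syntax):

```python
# 'utf-8-sig' prepends the UTF-8 BOM on encode and strips it on decode;
# plain 'utf-8' does neither.
assert 'foo'.encode('utf-8-sig') == b'\xef\xbb\xbffoo'
assert 'foo'.encode('utf-8') == b'foo'
assert b'\xef\xbb\xbffoo'.decode('utf-8-sig') == 'foo'
assert b'\xef\xbb\xbffoo'.decode('utf-8') == '\ufefffoo'
```

So for UTF-8 the BOM-aware codec is symmetric in exactly the way UTF-16/UTF-32 is, just under a different name.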

@dan-blanchard
Member

As for the backwards-incompatible nature of this change, this might be something we should handle with the configuration object proposed in #71.

There are really two fundamentally different use cases for chardet:

  1. Detecting Python-specific encodings so that you can immediately decode in Python.
  2. Detecting any encoding for processing with some external tool.

@dan-blanchard
Member

This feels like a Python bug to me, so I filed an issue.

@snoack
Contributor Author

snoack commented Oct 6, 2015

@sigmavirus24: Almost all code I've seen so far that uses chardet behaves incorrectly when dealing with UTF-16 and UTF-32, due to this bug. And I can hardly imagine reasonable use cases that rely on this bug in a way that fixing it would introduce worse issues. The only scenario that comes to mind is when people explicitly and exclusively handle UTF-{16,32}{LE,BE}. I have no idea why somebody would do that, but in any case it is certainly a less common scenario than those currently affected by this issue.

And documenting that behavior (instead of fixing it) would not only leave everybody who doesn't adapt to the issue with broken behavior, it would also make chardet rather inconvenient to use, and make it appear broken.

@dan-blanchard: I think the way Python deals with endianness makes perfect sense. Note that this way s.encode(e).decode(e) always equals s. This is important; otherwise things would be way more unexpected. Nobody would mind explicitly stripping the BOM when decoding UTF-16/32, but neither do I think this would be preferable.

Moreover, there is no reason to indicate the byte order in the protocol if it's already given in the data. Therefore, regardless of compatibility with bytes.decode(), I see no reason for chardet to indicate the byte order when a BOM is present. This information is even less relevant than the normalization form used by the Unicode characters. And if people really want or need to know the byte order, they can easily check the BOM themselves, e.g. with data.startswith(codecs.BOM_BE). The exception is when there is no BOM; then, and only then, are you supposed to use the UTF-{16,32}{LE,BE} encodings.
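Checking the BOM directly is a few lines of stdlib code; a sketch (the helper name byte_order is hypothetical), with the one pitfall that the four-byte UTF-32 marks must be tested before the two-byte UTF-16 marks:

```python
import codecs

def byte_order(data):
    """Read the byte order off a leading BOM (hypothetical helper).

    BOM_UTF32_LE (ff fe 00 00) begins with BOM_UTF16_LE (ff fe), so the
    UTF-32 marks have to be checked first or UTF-32-LE data would be
    misreported as UTF-16-LE.
    """
    for bom, name in ((codecs.BOM_UTF32_LE, 'UTF-32, little-endian'),
                      (codecs.BOM_UTF32_BE, 'UTF-32, big-endian'),
                      (codecs.BOM_UTF16_LE, 'UTF-16, little-endian'),
                      (codecs.BOM_UTF16_BE, 'UTF-16, big-endian')):
        if data.startswith(bom):
            return name
    return None  # no BOM: endianness must come from out-of-band metadata
```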

@dan-blanchard
Member

I think the way Python deals with endianness makes perfect sense. Note that this way s.encode(e).decode(e) always equals s. This is important; otherwise things would be way more unexpected. Nobody would mind explicitly stripping the BOM when decoding UTF-16/32, but neither do I think this would be preferable.

Sorry if this comes across as snarky, but if nobody would mind explicitly stripping the BOM, then I don't see the point of your PR.

s.encode(e).decode(e) == s is always true with the current setup, but it would also always be true if the BE and LE versions of the codecs added/removed the BOM as necessary.
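That round-trip property is easy to verify with the stdlib alone:

```python
# s.encode(e).decode(e) == s holds for every UTF-16/32 variant: the
# BOM-adding codecs ('UTF-16', 'UTF-32') also strip the BOM on decode,
# and the endian-specific codecs neither add nor strip one.
s = 'foo'
for e in ('UTF-16', 'UTF-16LE', 'UTF-16BE',
          'UTF-32', 'UTF-32LE', 'UTF-32BE'):
    assert s.encode(e).decode(e) == s
```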

@dan-blanchard
Member

Okay, after being pointed to the Unicode FAQ by some Python core devs and looking through it, it appears our current behavior is incorrect, and we should do things as in this PR, because the UTF-{16,32}{BE,LE} codecs are only meant to be used when there is no BOM! The FAQ says:

Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.

Furthermore, it goes on to say:

Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

@sigmavirus24, this is truly an error on our part, as the old behavior was reporting the wrong codec.

@dan-blanchard
Member

Also, the test failures are unrelated to the changes here, as we've still got a few lingering failures.

@sigmavirus24
Member

Interesting @dan-blanchard. I didn't know that to be very frank. Are we planning on doing a 2.0.0 release any time soon? I'd be happy to include this there.

@dan-blanchard
Member

Yes, a 3.0.0 release is definitely going to be the next one (we're currently at 2.3.0). I'm still trying to fix up everything that was in #52 in the abandoned_pr branch and get that in there. There were many good parts to that PR, and the ability to retrain models is a big contribution. That said, I don't think the models were trained on enough data, which is something I'm working on addressing.

@dan-blanchard
Member

Anyway, I'm going to merge this since our next release will be a major one.

dan-blanchard added a commit that referenced this pull request Oct 6, 2015
Don't indicate byte order for UTF-16/32 with given BOM, since this is against the Unicode spec.
@dan-blanchard dan-blanchard merged commit 26982c5 into chardet:master Oct 6, 2015
@dan-blanchard
Member

Thanks for the PR and bringing this to our attention @snoack!
