
Don't indicate byte order for UTF-16/32 with given BOM #73

Merged
merged 1 commit into from
Oct 6, 2015

Conversation

snoack
Contributor

@snoack snoack commented Oct 6, 2015

If passed a string starting with \xff\xfe (little-endian byte order mark) or \xfe\xff (big-endian byte order mark), the encoding is detected as UTF-16LE, UTF-32LE, UTF-16BE or UTF-32BE, respectively.

However, since the byte order mark is already given in the string, the detected encoding should simply be UTF-16 or UTF-32. Otherwise bytes.decode() will fail or preserve the byte order mark:

import codecs
import chardet

s = 'foo'.encode('UTF-16')
encoding = chardet.detect(s)['encoding']  # "UTF-16LE"
s.decode(encoding)                        # "\ufefffoo"

s = codecs.BOM_BE + 'foo'.encode('UTF-16BE')
encoding = chardet.detect(s)['encoding']  # "UTF-16BE"
s.decode(encoding)                        # "\ufefffoo"

Hence, code that uses chardet to determine the encoding before decoding data would need to wrap chardet.detect in the following inconvenient and counter-intuitive way:

encoding = chardet.detect(enc)['encoding']
if encoding in ('UTF-16LE', 'UTF-16BE'):
  dec = enc.decode('UTF-16')
elif encoding in ('UTF-32LE', 'UTF-32BE'):
  dec = enc.decode('UTF-32')
else:
  dec = enc.decode(encoding)

This PR changes the behavior to return simply UTF-16 or UTF-32, respectively, when a byte order mark is found, so that the detected encoding can be passed unchanged to bytes.decode().
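The change can be illustrated with a small stdlib-only sketch. The helper name bom_encoding is hypothetical (it is not chardet's actual API); it only mirrors the behavior this PR proposes:

```python
import codecs

def bom_encoding(data):
    """Return the generic codec name if data begins with a UTF-16/32 BOM.

    Hypothetical helper mirroring this PR's behavior, not chardet's API.
    The UTF-32-LE BOM (ff fe 00 00) begins with the UTF-16-LE BOM (ff fe),
    so the four-byte marks must be tested first.
    """
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'UTF-32'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'UTF-16'
    return None

# The generic name feeds straight into decode(), which strips the BOM:
data = 'foo'.encode('UTF-16')
assert data.decode(bom_encoding(data)) == 'foo'
```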

@snoack snoack changed the title Don't indicate byte order for UTF-16/32 with given BOM for compatibility with decode() Don't indicate byte order for UTF-16/32 with given BOM Oct 6, 2015
@sigmavirus24
Member

This is fundamentally backwards incompatible. People may already be relying on just such a wrapper as the one you wrote; any special casing beyond the simple calls you've outlined may be lost in an else branch, or they may start receiving exceptions if their else raises one.

I would rather document this workaround and the bug, and mark it to be fixed in a major release, than accept this change and ship it in a release that isn't supposed to be backwards incompatible under semantic versioning.

@dan-blanchard
Member

Wow, this is news to me. It honestly doesn't make any sense to me that Python bytes.decode and str.encode work this way with UTF-16 and UTF-32.

I mean, look at this:

In [7]: 'foo'.encode('UTF-16le')
Out[7]: 'f\x00o\x00o\x00'

In [8]: 'foo'.encode('UTF-16')
Out[8]: '\xff\xfef\x00o\x00o\x00'

In [9]: 'foo'.encode('UTF-16be')
Out[9]: '\x00f\x00o\x00o'

Why in the world would you want UTF-16 to add the BOM when encoding, but not UTF-16LE or UTF-16BE? This is especially odd when you consider that the BOM-specific codec for UTF-8, UTF-8-SIG, removes the BOM when decoding, whereas the basic version doesn't:

In [16]: (codecs.BOM_UTF8 + 'foo'.encode('UTF-8')).decode('utf-8-sig')
Out[16]: u'foo'

In [17]: (codecs.BOM_UTF8 + 'foo'.encode('UTF-8')).decode('utf-8')
Out[17]: u'\ufefffoo'
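For completeness, the asymmetry shows up on the encode side too; a quick stdlib check (Python 3 syntax):

```python
# 'utf-8-sig' prepends the UTF-8 BOM on encode and strips it on decode;
# plain 'utf-8' does neither.
assert 'foo'.encode('utf-8-sig') == b'\xef\xbb\xbffoo'
assert 'foo'.encode('utf-8') == b'foo'
assert b'\xef\xbb\xbffoo'.decode('utf-8-sig') == 'foo'
assert b'\xef\xbb\xbffoo'.decode('utf-8') == '\ufefffoo'
```

So for UTF-8 the BOM-aware codec is symmetric in exactly the way UTF-16/UTF-32 is, just under a different name.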

@dan-blanchard
Member

As for the backwards-incompatible nature of this change, this might be something we should handle with the configuration object proposed in #71.

There are really two fundamentally different use cases for chardet:

  1. Detecting Python-specific encodings so that you can immediately decode in Python.
  2. Detecting any encoding for processing with some external tool.

@dan-blanchard
Member

This feels like a Python bug to me, so I filed an issue.

@snoack
Contributor Author

snoack commented Oct 6, 2015

@sigmavirus24: Almost all code I've seen so far that uses chardet behaves incorrectly when dealing with UTF-16 and UTF-32, due to this bug. And I can hardly imagine reasonable use cases that rely on this bug in a way that fixing it would introduce worse issues. The only scenario that comes to mind is when people explicitly and exclusively handle UTF-{16,32}{LE,BE}. I have no idea why somebody would do that, but in any case it is certainly a less common scenario than those currently affected by this issue.

And documenting that behavior (instead of fixing it) would not only leave everybody who doesn't adapt to the issue with broken behavior, it would also make chardet rather inconvenient to use, and make it appear broken.

@dan-blanchard: I think the way Python deals with endianness makes perfect sense. Note that this way s.encode(e).decode(e) always equals s. This is important; otherwise things would be way more unexpected. Nobody would mind explicitly stripping the BOM when decoding UTF-16/32, but neither do I think this would be preferable.

Moreover, there is no reason to indicate the byte order in the protocol if it's already given in the data. Therefore, regardless of compatibility with bytes.decode(), I see no reason for chardet to indicate the byte order when a BOM is present. This information is even less relevant than the normalization form used by the Unicode characters. And if people really want or need to know the byte order, they can easily check the BOM themselves, e.g. with data.startswith(codecs.BOM_BE). The exception is when there is no BOM; then, and only then, are you supposed to use the UTF-{16,32}{LE,BE} encodings.
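Checking the BOM directly is a few lines of stdlib code; a sketch (the helper name byte_order is hypothetical), with the one pitfall that the four-byte UTF-32 marks must be tested before the two-byte UTF-16 marks:

```python
import codecs

def byte_order(data):
    """Read the byte order off a leading BOM (hypothetical helper).

    BOM_UTF32_LE (ff fe 00 00) begins with BOM_UTF16_LE (ff fe), so the
    UTF-32 marks have to be checked first or UTF-32-LE data would be
    misreported as UTF-16-LE.
    """
    for bom, name in ((codecs.BOM_UTF32_LE, 'UTF-32, little-endian'),
                      (codecs.BOM_UTF32_BE, 'UTF-32, big-endian'),
                      (codecs.BOM_UTF16_LE, 'UTF-16, little-endian'),
                      (codecs.BOM_UTF16_BE, 'UTF-16, big-endian')):
        if data.startswith(bom):
            return name
    return None  # no BOM: endianness must come from out-of-band metadata
```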

@dan-blanchard
Member

I think the way Python deals with endianness makes perfect sense. Note that this way s.encode(e).decode(e) always equals s. This is important; otherwise things would be way more unexpected. Nobody would mind explicitly stripping the BOM when decoding UTF-16/32, but neither do I think this would be preferable.

Sorry if this comes across as snarky, but if nobody would mind explicitly stripping the BOM, then I don't see the point of your PR.

s.encode(e).decode(e) == s is always true with the current setup, but it would also always be true if the BE and LE versions of the codecs added/removed the BOM as necessary.
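That round-trip property is easy to verify with the stdlib alone:

```python
# s.encode(e).decode(e) == s holds for every UTF-16/32 variant: the
# BOM-adding codecs ('UTF-16', 'UTF-32') also strip the BOM on decode,
# and the endian-specific codecs neither add nor strip one.
s = 'foo'
for e in ('UTF-16', 'UTF-16LE', 'UTF-16BE',
          'UTF-32', 'UTF-32LE', 'UTF-32BE'):
    assert s.encode(e).decode(e) == s
```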

@dan-blanchard
Member

Okay, after being pointed to the Unicode FAQ by some Python core devs and looking through it, it appears our current behavior is incorrect, and we should do things as in this PR, because the UTF-{16,32}{BE,LE} codecs are only meant to be used when there is no BOM! The FAQ says:

Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.

Furthermore, it goes on to say:

Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a ZWNBSP.

@sigmavirus24, this is truly an error on our part, as the old behavior was reporting the wrong codec.

@dan-blanchard
Member

Also, the test failures are unrelated to the changes here, as we've still got a few lingering failures.

@sigmavirus24
Member

Interesting @dan-blanchard. I didn't know that to be very frank. Are we planning on doing a 2.0.0 release any time soon? I'd be happy to include this there.

@dan-blanchard
Member

Yes, a 3.0.0 release is definitely going to be the next one (we're currently at 2.3.0). I'm still trying to fix up everything that was in #52 in the abandoned_pr branch and get that in there. There were many good parts to that PR, and the ability to retrain models is a big contribution. That said, I don't think the models were trained on enough data, which is something I'm working on addressing.

@dan-blanchard
Member

Anyway, I'm going to merge this since our next release will be a major one.

dan-blanchard added a commit that referenced this pull request Oct 6, 2015
Don't indicate byte order for UTF-16/32 with given BOM, since this is against the Unicode spec.
@dan-blanchard dan-blanchard merged commit 26982c5 into chardet:master Oct 6, 2015
@dan-blanchard
Member

Thanks for the PR and bringing this to our attention @snoack!
