Inconsistent behavior on small strings #37

mitya57 · 2014-11-14T10:40:12Z

UPD: Deleted python2.7 example because it was not working properly. See a comment below for a better test case.

This is all on Debian GNU/Linux unstable with the current master:

$ python3.4 -c "import chardet; print(chardet.detect(u'é'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python3.4 -c "import chardet; print(chardet.detect(u'éé'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}

The first result (the second line above) should be utf-8 as well, not windows-1252.

dan-blanchard · 2014-11-14T14:03:04Z

Detection for small strings in general is pretty flaky, as the algorithm is intended to work on larger samples. That said, I'm surprised to see the output is so different between major Python versions. Thanks for bringing that to our attention.

sigmavirus24 · 2014-11-14T14:23:20Z

Hey @mitya57, thanks for reporting this. I just want you to realize that @dan-blanchard and I are the maintainers, and we both have a lot of other responsibilities and very little time to devote to chardet. If you wait for us to attempt to fix this, it will probably be a long wait. If you have time to spare and care to investigate this, you would really speed up how quickly this gets fixed.

Cheers

mitya57 · 2014-11-14T16:07:44Z

Hmm, looks like Python 2.7 has some problems with interpreting unicode given inside the -c argument. Replacing é with its \xe9 equivalent gives me behaviour that is consistent between Python 2 and Python 3, but still inconsistent between one-character and two-character strings.

$ python2.7 -c "import chardet; print(chardet.detect(u'\xe9'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python3.4 -c "import chardet; print(chardet.detect(u'\xe9'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python2.7 -c "import chardet; print(chardet.detect(u'\xe9\xe9'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}
$ python3.4 -c "import chardet; print(chardet.detect(u'\xe9\xe9'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}
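
One plausible explanation for the Python 2.7 oddity, offered here as an assumption rather than something confirmed in this thread: without a source encoding declaration, Python 2 reads the bytes inside a u'...' literal as Latin-1, so a literal é typed in a UTF-8 terminal ends up double-encoded. The sketch below reproduces that effect in Python 3 using only the standard library.

```python
# Sketch (assumption): emulate how Python 2 might misread a literal 'é'
# passed via -c from a UTF-8 terminal.
typed_bytes = "\xe9".encode("utf-8")          # the bytes the shell hands over for 'é'

# Assumed Python 2 behaviour: each byte of the literal becomes one Latin-1 character.
misread_text = typed_bytes.decode("latin-1")  # two characters instead of one

# Encoding that text as UTF-8 then yields four bytes, not two.
double_encoded = misread_text.encode("utf-8")

print(typed_bytes)     # bytes of the escaped form u'\xe9'
print(double_encoded)  # bytes chardet would have received under this assumption
```

Using the \xe9 escape avoids the problem because the escape is resolved by the interpreter and does not depend on how the shell's bytes are decoded.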

dan-blanchard · 2014-11-14T16:28:48Z

@mitya57 That makes a lot more sense. I hadn't considered that you were calling it from the command line with -c.

Anyway, the difference in results between one and two character strings is to be expected. The more data you give it, the more accurate it will be, so it is wrong for one-character and correct for two characters.
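
To illustrate why such a short sample is ambiguous, here is a minimal sketch using only the standard library. It does not reproduce chardet's internal scoring; it only shows that the same bytes decode without error under both encodings, so a detector has very little evidence to choose between them. The helper name candidate_decodings is hypothetical.

```python
# Sketch: the UTF-8 bytes of 'é' are also valid windows-1252 text.
def candidate_decodings(data, encodings=("utf-8", "windows-1252")):
    """Return {encoding: decoded text, or None if decoding fails} (hypothetical helper)."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

one_char = candidate_decodings("\xe9".encode("utf-8"))
two_chars = candidate_decodings("\xe9\xe9".encode("utf-8"))

print(one_char)   # both encodings decode the two-byte sample successfully
print(two_chars)  # likewise for the four-byte sample
```

Both samples are valid in either encoding, so the guess has to rest on statistical evidence, and the longer sample simply provides more of it.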

mitya57 · 2014-11-14T16:53:37Z

OK, that's understandable.

Actually, I filed this issue because beautifulsoup4's tests currently fail with an error. I will now file a pull request against beautifulsoup4 to use a double \xe9 in their tests instead of a single one.

dan-blanchard added the bug label Nov 14, 2014

dan-blanchard changed the title ~~Inconsistent behavior on small strings~~ Inconsistent behavior on small strings between Python 2 and 3 Nov 14, 2014

mitya57 changed the title ~~Inconsistent behavior on small strings between Python 2 and 3~~ Inconsistent behavior on small strings Nov 14, 2014

dan-blanchard closed this as completed Nov 14, 2014

dan-blanchard added invalid and removed bug labels Nov 14, 2014

guillermogf mentioned this issue Feb 27, 2022

Non-ASCII characters not shown properly on text preview ranger/ranger#1948

Open
