Inconsistent behavior on small strings #37

Closed
mitya57 opened this issue Nov 14, 2014 · 5 comments

@mitya57

mitya57 commented Nov 14, 2014

UPD: Deleted python2.7 example because it was not working properly. See a comment below for a better test case.

This is all on Debian GNU/Linux unstable with the current master:

$ python3.4 -c "import chardet; print(chardet.detect(u'é'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python3.4 -c "import chardet; print(chardet.detect(u'éé'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}

The first result (for the single é) should be utf-8 as well, not windows-1252.

@dan-blanchard
Member

Detection for small strings in general is pretty flaky, as the algorithm is intended to work on larger samples. That said, I'm surprised to see the output is so different between major Python versions. Thanks for bringing that to our attention.

@dan-blanchard changed the title from "Inconsistent behavior on small strings" to "Inconsistent behavior on small strings between Python 2 and 3" on Nov 14, 2014
@sigmavirus24
Member

Hey @mitya57, thanks for reporting this. I just want you to realize that @dan-blanchard and I are the maintainers, and we both have a lot of other responsibilities and very little time to devote to chardet. If you wait for us to attempt to fix this, it will probably be a long wait. If you have time to spare and care to investigate this, you would really speed up how quickly it gets fixed.

Cheers

@mitya57
Author

mitya57 commented Nov 14, 2014

Hmm, looks like Python 2.7 has some problems interpreting unicode given inside the -c argument. Replacing é with its \xe9 equivalent gives me behaviour that is consistent between Python 2 and Python 3, but still inconsistent between one-character and two-character strings.

$ python2.7 -c "import chardet; print(chardet.detect(u'\xe9'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python3.4 -c "import chardet; print(chardet.detect(u'\xe9'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python2.7 -c "import chardet; print(chardet.detect(u'\xe9\xe9'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}
$ python3.4 -c "import chardet; print(chardet.detect(u'\xe9\xe9'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}
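
For anyone who wants to rerun this without going through the shell and -c at all, here is a minimal sketch of the same check as a script file. The helper names `sample_bytes` and `detect_samples` are made up for illustration; the only chardet call used is `chardet.detect`, as in the commands above. The exact confidences will depend on the chardet version installed.

```python
# Sketch of the reproduction as a script, so no shell quoting is involved.
# sample_bytes / detect_samples are hypothetical helper names.


def sample_bytes(count):
    """Return `count` copies of U+00E9 (e with acute accent) encoded as UTF-8."""
    return (u"\xe9" * count).encode("utf-8")


def detect_samples(counts=(1, 2)):
    """Run chardet.detect on each sample; return None if chardet is missing."""
    try:
        import chardet
    except ImportError:
        return None
    return [(n, chardet.detect(sample_bytes(n))) for n in counts]


if __name__ == "__main__":
    results = detect_samples()
    if results is None:
        print("chardet is not installed; nothing to detect")
    else:
        for count, guess in results:
            print(count, guess)
```

Using the \xe9 escape keeps the source file pure ASCII, so the result does not depend on how the terminal or interpreter decodes a literal é.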

@mitya57 changed the title from "Inconsistent behavior on small strings between Python 2 and 3" to "Inconsistent behavior on small strings" on Nov 14, 2014
@dan-blanchard
Member

@mitya57 That makes a lot more sense. I didn't think about the fact that you were calling it from the command-line with -c before.

Anyway, the difference in results between one- and two-character strings is to be expected. The more data you give it, the more accurate it will be, so the result is wrong for the one-character string and correct for the two-character one.
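
To illustrate why such a short input is genuinely ambiguous, here is a small standard-library sketch. It does not show chardet's internals; it only shows that the same bytes decode without error as both UTF-8 and windows-1252 (Python's `cp1252` codec), so a detector has nothing but statistics to choose between them. The `decodings` helper is a made-up name for this example.

```python
# Stdlib-only illustration: the UTF-8 bytes for U+00E9 are also valid cp1252.
# `decodings` is a hypothetical helper, not part of chardet.


def decodings(data, encodings=("utf-8", "cp1252")):
    """Try each encoding; map its name to the decoded text, or None on failure."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


one = u"\xe9".encode("utf-8")      # the two bytes C3 A9
two = u"\xe9\xe9".encode("utf-8")  # the four bytes C3 A9 C3 A9

for sample in (one, two):
    print(repr(sample), decodings(sample))
```

Both samples decode cleanly either way: as UTF-8 they are one or two accented letters, and as cp1252 they are the pairs U+00C3 U+00A9. Neither reading can be ruled out on validity alone, which is consistent with the guess only becoming reliable as the sample grows.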

@mitya57
Author

mitya57 commented Nov 14, 2014

OK, that's understandable.

Actually, I filed this issue because beautifulsoup4's tests currently fail with an error. I will now file a pull request against beautifulsoup4 to use a double \xe9 in their tests instead of a single one.
