UTF-8 encoding of Degree Symbol #50

Open
glennkitchellcaci opened this issue Sep 24, 2019 · 1 comment


@glennkitchellcaci

The issue I'm having involves the degree symbol (U+00B0), which is encoded in UTF-8 as \xc2\xb0:
http://www.fileformat.info/info/unicode/char/b0/index.htm

Below, I include the boiled-down calls. My actual test data is properly formatted XML, but in testing I found that adding more text does not change the confidence or the output of the "jschardet.detect()" call.

With 1, 2, or 3 degree symbols, it detects windows-1252. Since the input is actually UTF-8, decoding it as windows-1252 leaves an extra character for each \xc2 byte.
jschardet.detect('\xc2\xb0');

With 4 degree symbols, it detects as EUC-KR
jschardet.detect('\xc2\xb0\xc2\xb0\xc2\xb0\xc2\xb0');
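
To make the extra character in the windows-1252 case concrete, here is a minimal Node.js sketch that decodes the same two bytes both ways. It does not call jschardet at all. It relies only on Node's built-in Buffer, and it uses Node's 'latin1' encoding as a stand-in for windows-1252, which is safe for these particular bytes because the two encodings agree on 0xC2 and 0xB0. The variable names are just for illustration.

```javascript
// Bytes of one UTF-8 encoded degree sign (U+00B0), written as a binary
// string in the same style as the detect() calls above.
const binary = '\xc2\xb0';
const bytes = Buffer.from(binary, 'latin1'); // two bytes: c2 b0

// Decoded as UTF-8, the two bytes form a single character.
const asUtf8 = bytes.toString('utf8');

// Decoded with a single-byte encoding, each byte becomes its own character.
// 'latin1' stands in for windows-1252 here; both map 0xC2 and 0xB0 the same way.
const asSingleByte = bytes.toString('latin1');

console.log(JSON.stringify(asUtf8), asUtf8.length);             // "°" 1
console.log(JSON.stringify(asSingleByte), asSingleByte.length); // "Â°" 2
```

So a wrong windows-1252 guess turns every "°" into "Â°", which matches the extra \xc2 described above.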

@lingsamuel
Contributor

Fixed in #57 and #59, @aadsm.
