ISO-8859-1 detected incorrectly (Hebrew or Thai) ! #22
Comments
Actually, I realize that guessing the charset used in an array of bytes is a different task when you know the original string than when you don’t. Or at least it’s easier to be unhappy about the result! For example,
This very short array of bytes can be successfully decoded using a wide variety of Charsets, so you have no way of knowing which one was used in the first place. So you pick one. I don’t know why Hebrew or Thai are picked first, but they are legitimate. An improvement might be not to pick just one, but to return all Charsets that pass the decoding check. The task of picking one would then be left to the caller, not the callee... Your code is working as intended; it’s me who put too much hope in it and did not realize that what I hoped for was in fact impossible.
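To make the "return every Charset that passes the decoding check" idea concrete, here is a minimal sketch using only the JDK, not the detector library. The class and method names (`CandidateCharsets`, `decodesCleanly`, `candidates`) are made up for illustration, and the check only asks whether a strict decoder accepts the bytes, which is an assumption about what "passes" should mean.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CandidateCharsets {

    /** True if the bytes decode under the charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Collects every charset installed in this JVM whose strict decoder accepts the bytes. */
    static List<Charset> candidates(byte[] data) {
        List<Charset> result = new ArrayList<>();
        for (Charset charset : Charset.availableCharsets().values()) {
            if (decodesCleanly(data, charset)) {
                result.add(charset);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The four accented letters from the report, written as escapes
        // so the source file encoding does not matter.
        byte[] data = "\u00E0\u00E9\u00E8\u00C7".getBytes(StandardCharsets.ISO_8859_1);
        for (Charset charset : candidates(data)) {
            System.out.println(charset.name());
        }
    }
}
```

On a typical JVM this prints a long list of single-byte charsets for such a short input, which is the point: the bytes alone do not identify the charset, so the caller still needs some other knowledge to choose.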
Hi. Latin-1 (windows-1252 / ISO-8859-1) is detected by statistical analysis, and your sample confuses it because it contains too many accented characters. In short, the analyser decides the data cannot be Latin because the proportion of accented characters is too high, and it gives windows-1252 the minimum weight, 0.0, because of that.
Also, detecting the encoding of short data is harder than detecting it in a large file, so it's more error-prone.
Hello
I did a simple basic test, and I’m not very happy with the result. What am I doing wrong?
In pom.xml:
In source code:
Basically, I encode a String into an array of bytes with a given Charset. I then use UniversalDetector to guess the charset used. I’m lenient: I don’t expect the exact Charset, but at least I expect a Charset which can successfully encode and decode the initial string, giving back that string!
It fails this simple test, as "àéèÇ" encoded in iso-8859-1 is guessed as Hebrew (Windows-1255), and "aeaCàéèÇ" as Thai (TIS-620), none of those Charsets having those accented chars in them!
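The lenient criterion described here (the guessed Charset only has to give back the initial string) can be checked without the detector at all. The sketch below is not the original test code; `RoundTripCheck` and `roundTrips` are hypothetical names, and the charset names in `main` are simply the ones mentioned in this thread, looked up only if the running JVM provides them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {

    /** True if decoding the bytes with the guessed charset gives back the original string. */
    static boolean roundTrips(String original, byte[] encoded, Charset guessed) {
        return new String(encoded, guessed).equals(original);
    }

    public static void main(String[] args) {
        // The accented sample from the report, written as escapes
        // so the source file encoding does not matter.
        String original = "\u00E0\u00E9\u00E8\u00C7";
        byte[] encoded = original.getBytes(StandardCharsets.ISO_8859_1);

        // Only ISO-8859-1 is guaranteed by the Java platform; the others
        // depend on the JVM, so check availability before looking them up.
        String[] names = {"ISO-8859-1", "windows-1252", "windows-1255", "TIS-620"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            boolean ok = roundTrips(original, encoded, Charset.forName(name));
            System.out.println(name + ": round trip " + (ok ? "succeeds" : "fails"));
        }
    }
}
```

A check like this tells you whether a guess is acceptable for a string you already know. It cannot help the detector itself, which only ever sees the bytes.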