ISO-8859-1 detected incorrectly (Hebrew or Thai) ! #22

Closed

sxilderik opened this issue Jan 9, 2018 · 4 comments

sxilderik commented Jan 9, 2018

Hello
I did a simple basic test, and I’m not very happy with the result. What am I doing wrong?

In pom.xml:

	<dependencies>
		<!-- https://mvnrepository.com/artifact/com.github.albfernandez/juniversalchardet -->
		<dependency>
			<groupId>com.github.albfernandez</groupId>
			<artifactId>juniversalchardet</artifactId>
			<version>2.1.0</version>
		</dependency>
	</dependencies>

In source code:

Basically, I encode a String into an array of bytes with a given Charset. I then use UniversalDetector to guess the charset used. I’m lenient: I don’t expect the exact Charset, but at least I expect a Charset which can successfully encode and decode the initial string, giving back that string!
It fails this simple test, as "àéèÇ" encoded in ISO-8859-1 is guessed as Hebrew (Windows-1255), and "aeaCàéèÇ" as Thai (TIS-620), and neither of those Charsets even has those accented chars in it!

	@Test
	public void test_decodeBytes() {

		final String string = "aeaCàêäÇ";
		Charset s;
		byte[] bytes;

		bytes = string.getBytes(StandardCharsets.ISO_8859_1);
		s = this.guessCharset(bytes); // detected charset = TIS-620, Thai charset ???!!!
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // FAILS of course !

		bytes = string.getBytes(StandardCharsets.UTF_8);
		s = this.guessCharset(bytes);
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
	}

	private Charset guessCharset(final byte[] bytes) {

		final UniversalDetector detector = new UniversalDetector();
		detector.handleData(bytes, 0, bytes.length);
		detector.dataEnd();
		return Charset.forName(detector.getDetectedCharset());
	}

@sxilderik sxilderik changed the title ISO-8859-1 detected as Windows-1255 (Hebrew) ? ISO-8859-1 detected incorrectly (Hebrew or Thai) ! Jan 9, 2018
@albfernandez albfernandez self-assigned this Jan 9, 2018
@sxilderik (Author)

Actually, I realize that guessing the charset used to encode an array of bytes is a different task when you know the original string than when you don’t. Or at least it’s easier to be unhappy about the result!

For example, "àéè".getBytes("ISO-8859-1") gives the bytes [0xE0, 0xE9, 0xE8].
This array of bytes can very well be interpreted as Hebrew, giving "איט".

new String("àéè".getBytes("ISO-8859-1"), "Windows-1255");
	 (java.lang.String) איט

This very short array of bytes can be successfully decoded with a wide variety of Charsets, and you have no way of knowing which one was used in the first place.

So you pick one. I don’t know why Hebrew or Thai are picked first, but they are legit.

An improvement might be not to pick just one, but to return all Charsets that have successfully passed the decoding check, as in the sketch below. The task of picking one would be left to the caller, not the callee...
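
A rough caller-side sketch of that idea (this helper is hypothetical, not part of juniversalchardet, and needs java.util.ArrayList / java.util.Arrays / java.util.List imports): keep every available Charset whose decoding round-trips the original bytes.

	// Hypothetical helper, not juniversalchardet API: keep every charset that
	// can decode the bytes and re-encode them back to the exact same bytes.
	private List<Charset> plausibleCharsets(final byte[] bytes) {
		final List<Charset> candidates = new ArrayList<>();
		for (final Charset cs : Charset.availableCharsets().values()) {
			if (!cs.canEncode()) {
				continue; // decode-only charsets can never round-trip
			}
			final String decoded = new String(bytes, cs);
			if (Arrays.equals(bytes, decoded.getBytes(cs))) {
				candidates.add(cs);
			}
		}
		return candidates;
	}

Even so, for a handful of bytes this list would typically contain dozens of single-byte charsets, so it only moves the ambiguity to the caller.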

Your code is working as intended; it’s me who put too much hope in it and didn’t realize that what I hoped for was in fact impossible.

@albfernandez (Owner)

Hi

Latin-1 detection (windows-1252 / ISO-8859-1) is done by statistical analysis, so your sample confuses it: too many accented characters.
A simpler, more realistic example such as "Château" works fine.

In short, the analyser thinks the data cannot be Latin because it has too many accented chars... Because of that it gives windows-1252 the minimum weight, 0.0.
So your suggestion of returning a list of possible detected charsets would not work :(

@albfernandez (Owner)

	// Test case for https://github.com/albfernandez/juniversalchardet/issues/22
	// With fewer accented characters, detection improves
	@Test
	public void testDecodeBytesBetterStats() {

		final String string = "Château";
		Charset s;
		byte[] bytes;

		bytes = string.getBytes(StandardCharsets.UTF_8);
		s = this.guessCharset(bytes);
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS

		bytes = string.getBytes(StandardCharsets.ISO_8859_1);
		s = this.guessCharset(bytes); 
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
	}

	private Charset guessCharset(final byte[] bytes) {
		final UniversalDetector detector = new UniversalDetector();
		detector.handleData(bytes, 0, bytes.length);
		detector.dataEnd();
		return Charset.forName(detector.getDetectedCharset());
	}

@albfernandez (Owner)

Also, detecting the encoding of short data is harder than detecting it in a large file, so it is more error-prone.
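
For instance (a sketch only; the sample text is made up and the detected result is not guaranteed), a longer Latin-1 sample gives the frequency analysis much more to work with than the few bytes above:

	// Sketch: run the detector on a full Latin-1 sentence instead of a few bytes.
	@Test
	public void testDetectLongerLatin1Sample() {
		final String text = "Le château se trouve près de la rivière. "
				+ "Les élèves ont visité le musée après le déjeuner, "
				+ "puis ils sont rentrés à l'école en fin d'après-midi.";
		final byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
		final UniversalDetector detector = new UniversalDetector();
		detector.handleData(bytes, 0, bytes.length);
		detector.dataEnd();
		// Printed rather than asserted: the exact charset picked still depends
		// on the detector's statistics for this particular sample.
		System.out.println("Detected: " + detector.getDetectedCharset());
	}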
