ISO-8859-1 detected incorrectly (Hebrew or Thai) ! #22

Closed

sxilderik opened this issue Jan 9, 2018 · 4 comments

sxilderik commented Jan 9, 2018

Hello
I did a simple basic test, and I’m not very happy with the result. What am I doing wrong?

In pom.xml:

	<dependencies>
		<!-- https://mvnrepository.com/artifact/com.github.albfernandez/juniversalchardet -->
		<dependency>
			<groupId>com.github.albfernandez</groupId>
			<artifactId>juniversalchardet</artifactId>
			<version>2.1.0</version>
		</dependency>
	</dependencies>

In source code:

Basically, I encode a String into an array of bytes with a given Charset. I then use UniversalDetector to guess the charset used. I’m lenient: I don’t expect the exact Charset, but at least I expect a Charset which can successfully encode and decode the initial string, giving back that string!
It fails this simple test, as "àéèÇ" encoded in ISO-8859-1 is guessed as Hebrew (Windows-1255), and "aeaCàéèÇ" as Thai (TIS-620), and neither of those Charsets even has those accented chars in it!

	@Test
	public void test_decodeBytes() {

		final String string = "aeaCàêäÇ";
		Charset s;
		byte[] bytes;

		bytes = string.getBytes(StandardCharsets.ISO_8859_1);
		s = this.guessCharset(bytes); // detected charset = TIS-620, Thai charset ???!!!
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // FAILS of course !

		bytes = string.getBytes(StandardCharsets.UTF_8);
		s = this.guessCharset(bytes);
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
	}

	private Charset guessCharset(final byte[] bytes) {

		final UniversalDetector detector = new UniversalDetector();
		detector.handleData(bytes, 0, bytes.length);
		detector.dataEnd();
		return Charset.forName(detector.getDetectedCharset());
	}

@sxilderik sxilderik changed the title ISO-8859-1 detected as Windows-1255 (Hebrew) ? ISO-8859-1 detected incorrectly (Hebrew or Thai) ! Jan 9, 2018
@albfernandez albfernandez self-assigned this Jan 9, 2018
@sxilderik (Author)

Actually, I realize that guessing the charset used to encode an array of bytes is a different task when you know the original string than when you don’t. Or at least it’s easier to be unhappy about the result!

For example, "àéè".getBytes("ISO-8859-1") gives the bytes [0xE0, 0xE9, 0xE8].
This array of bytes can very well be interpreted as Hebrew, giving "איט".

new String("àéè".getBytes("ISO-8859-1"), "Windows-1255");
	 (java.lang.String) איט

This very short array of bytes can be successfully decoded with a wide variety of Charsets, and you have no way of knowing which one was used in the first place.

So you pick one. I don’t know why Hebrew or Thai are picked first, but they are legit.

An improvement might be not to pick just one, but to return all Charsets that have successfully passed the decoding check, as in the sketch below. The task of picking one would be left to the caller, not the callee...
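
A rough caller-side sketch of that idea (this helper is hypothetical, not part of juniversalchardet, and needs java.util.ArrayList / java.util.Arrays / java.util.List imports): keep every available Charset whose decoding round-trips the original bytes.

	// Hypothetical helper, not juniversalchardet API: keep every charset that
	// can decode the bytes and re-encode them back to the exact same bytes.
	private List<Charset> plausibleCharsets(final byte[] bytes) {
		final List<Charset> candidates = new ArrayList<>();
		for (final Charset cs : Charset.availableCharsets().values()) {
			if (!cs.canEncode()) {
				continue; // decode-only charsets can never round-trip
			}
			final String decoded = new String(bytes, cs);
			if (Arrays.equals(bytes, decoded.getBytes(cs))) {
				candidates.add(cs);
			}
		}
		return candidates;
	}

Even so, for a handful of bytes this list would typically contain dozens of single-byte charsets, so it only moves the ambiguity to the caller.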

Your code is working as intended; it’s me who put too much hope in it and didn’t realize that what I hoped for was in fact impossible.

@albfernandez (Owner)

Hi

Latin-1 detection (windows-1252 / ISO-8859-1) is done by statistical analysis, so your sample confuses it: too many accented characters.
A simpler, more realistic example such as "Château" works fine.

In short, the analyser thinks the data cannot be Latin because it has too many accented chars... Because of that it gives windows-1252 the minimum weight, 0.0.
So your suggestion of returning a list of possible detected charsets would not work :(

@albfernandez (Owner)

	// Test case for https://github.com/albfernandez/juniversalchardet/issues/22
	// With fewer accented characters, detection improves
	@Test
	public void testDecodeBytesBetterStats() {

		final String string = "Château";
		Charset s;
		byte[] bytes;

		bytes = string.getBytes(StandardCharsets.UTF_8);
		s = this.guessCharset(bytes);
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS

		bytes = string.getBytes(StandardCharsets.ISO_8859_1);
		s = this.guessCharset(bytes); 
		Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
	}

	private Charset guessCharset(final byte[] bytes) {
		final UniversalDetector detector = new UniversalDetector();
		detector.handleData(bytes, 0, bytes.length);
		detector.dataEnd();
		return Charset.forName(detector.getDetectedCharset());
	}

@albfernandez (Owner)

Also, detecting the encoding of short data is harder than detecting it in a large file, so it is more error-prone.
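
For instance (a sketch only; the sample text is made up and the detected result is not guaranteed), a longer Latin-1 sample gives the frequency analysis much more to work with than the few bytes above:

	// Sketch: run the detector on a full Latin-1 sentence instead of a few bytes.
	@Test
	public void testDetectLongerLatin1Sample() {
		final String text = "Le château se trouve près de la rivière. "
				+ "Les élèves ont visité le musée après le déjeuner, "
				+ "puis ils sont rentrés à l'école en fin d'après-midi.";
		final byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
		final UniversalDetector detector = new UniversalDetector();
		detector.handleData(bytes, 0, bytes.length);
		detector.dataEnd();
		// Printed rather than asserted: the exact charset picked still depends
		// on the detector's statistics for this particular sample.
		System.out.println("Detected: " + detector.getDetectedCharset());
	}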
