ISO-8859-2 should be detected #7

Closed
Jehan opened this issue Aug 5, 2015 · 3 comments

Comments

@Jehan
Collaborator

Jehan commented Aug 5, 2015

Your README does not list ISO-8859-2 as supported, yet I can find a model for it in src/LangHungarianModel.cpp. I tried it with an ISO-8859-2 file I built myself:
https://cloud.libreart.info/public.php?service=files&t=40140bd3fd105b2c03d7716dfe4b498a
Detection fails: the file is reported as "windows-1252".
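For anyone who wants to reproduce this without the linked file, here is a minimal sketch of how a comparable ISO-8859-2 test file could be built. The sample phrase, the file name, and the helper `write_latin2_sample` are my own assumptions for illustration; they are not the contents of the file linked above.

```python
import tempfile
from pathlib import Path

# Hypothetical sample text: a Hungarian phrase that uses the letters
# "ű" and "ő", which exist in ISO-8859-2 but not in ISO-8859-1.
SAMPLE = "Árvíztűrő tükörfúrógép\n"


def write_latin2_sample(path, text=SAMPLE):
    """Encode `text` as ISO-8859-2, write it to `path`, and return the raw bytes."""
    data = text.encode("iso-8859-2")
    Path(path).write_bytes(data)
    return data


# Write the sample into a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample-iso-8859-2.txt"  # hypothetical file name
    raw = write_latin2_sample(target)
    round_trip = target.read_bytes().decode("iso-8859-2")

print(len(raw), "bytes written")
```

The resulting file can then be passed to the detector under test. A sample this short may well be too small for reliable detection, so a longer text is probably a better test case.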

By contrast, python-chardet detects the ISO-8859-2 encoding correctly:

$ chardetect iso-8859-2.smi
iso-8859-2.smi: ISO-8859-2 with confidence 0.850807928898

Since both are supposed to be based on the same algorithm from Mozilla, and your code already mentions this encoding, it would be great if it were supported.

@Jehan
Collaborator Author

Jehan commented Nov 16, 2015

For reference, src/nsSBCSGroupProber.cpp shows that it has been disabled, on lines 83-84:

// disable latin2 before latin1 is available, otherwise all latin1
// will be detected as latin2 because of their similarity.
//mProbers[10] = new nsSingleByteCharSetProber(&Latin2HungarianModel);
//mProbers[11] = new nsSingleByteCharSetProber(&Win1250HungarianModel);

I'm unsure why we'd need latin1 (ISO 8859-1) support first. Latin1 files are already detected wrongly; with these probers enabled they would still be detected wrongly, just as a different encoding.
In any case, I'll have a closer look and investigate further if needed.
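To get a feel for the "similarity" that the source comment mentions, here is a rough sketch that counts how many byte values in the 0xA0-0xFF range decode to the same character in both encodings. It only uses Python's built-in codecs; the helper name `shared_positions` is made up for this example and has nothing to do with how the prober itself works.

```python
def shared_positions(enc_a, enc_b, start=0xA0, stop=0x100):
    """Return the byte values in [start, stop) that decode to the same
    character under both encodings."""
    return [
        b for b in range(start, stop)
        if bytes([b]).decode(enc_a) == bytes([b]).decode(enc_b)
    ]


same = shared_positions("iso-8859-1", "iso-8859-2")
total = 0x100 - 0xA0

print(f"{len(same)} of {total} high-byte positions are identical")
print("0xE9 shared:", 0xE9 in same)  # é in both tables
print("0xF5 shared:", 0xF5 in same)  # õ in latin1, ő in latin2
```

Every byte is also structurally valid in both encodings, so a detector cannot rule either one out by looking for invalid sequences; it has to rely on character statistics.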

@Jehan
Collaborator Author

Jehan commented Nov 17, 2015

I have added some ISO-8859-1 and ISO-8859-2 test files and now I understand the issue.
Both files are currently detected as WINDOWS-1252. If I activate ISO-8859-2, both are detected as ISO-8859-2.
Windows-1252 is a superset of ISO-8859-1, so reporting WINDOWS-1252 for an ISO-8859-1 file is not a completely wrong answer. Once ISO-8859-2 is activated, though, the answer for that file becomes completely wrong. So let's indeed wait for clear ISO-8859-1 support.
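A small sketch of why one wrong answer is harmless and the other is not, using only Python's built-in codecs. The French sample string and the helper `same_in_high_range` are assumptions picked for illustration; they are not taken from the test files I added.

```python
HIGH_RANGE = range(0xA0, 0x100)


def same_in_high_range(enc_a, enc_b):
    """True if every byte in 0xA0-0xFF decodes identically in both encodings."""
    return all(
        bytes([b]).decode(enc_a) == bytes([b]).decode(enc_b)
        for b in HIGH_RANGE
    )


# Hypothetical ISO-8859-1 content.
latin1_bytes = "déjà vu, très bien".encode("iso-8859-1")

# Decoding with the "wrong" WINDOWS-1252 label still yields the right text,
# because the two encodings agree on 0xA0-0xFF.
as_cp1252 = latin1_bytes.decode("cp1252")

# Decoding with ISO-8859-2 silently produces different letters.
as_latin2 = latin1_bytes.decode("iso-8859-2")

print(as_cp1252)  # déjà vu, très bien
print(as_latin2)  # déjŕ vu, trčs bien
print(same_in_high_range("iso-8859-1", "cp1252"))      # True
print(same_in_high_range("iso-8859-1", "iso-8859-2"))  # False
```

The two encodings only diverge in 0x80-0x9F, where Windows-1252 has printable characters such as "€" and ISO-8859-1 has control codes, which ordinary text files rarely contain.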

@Jehan
Collaborator Author

Jehan commented Dec 2, 2015

Fixed with commit 6832552.

Jehan closed this as completed Dec 2, 2015