Fixed latin1 detector in chardet #5

Open
ablegrape wants to merge 1 commit into dcramer:master from ablegrape:latin1fix

@ablegrape

Hi David,

Following up on our exchange from a few weeks ago. I've committed the fix for the integer math bug (it causes a file with even one "low confidence" character to get a confidence level of 0) and also updated the confidence multiplier for latin-1 to one that works better (the original multiplier causes many files to be incorrectly detected as iso-latin-2). Testing on both the problem case mentioned ( http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML ) and the character set tables from http://www.columbia.edu/kermit/csettables.html suggests that the fixed code performs better.
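To illustrate the integer math bug, here is a minimal sketch. It is not the actual chardet code: the function names and the simplified scoring (share of characters in the "very likely" bucket) are assumptions made for illustration, and `//` stands in for integer division between two ints.

```python
def confidence_truncating(freq_counter):
    """Hypothetical scoring that divides two ints with integer division."""
    total = sum(freq_counter)
    if total == 0:
        return 0.0
    # Integer division floors to 0 whenever the "very likely" count
    # (last bucket) is smaller than the total, i.e. as soon as a single
    # character falls into any other bucket.
    return float(freq_counter[-1] // total)


def confidence_true_division(freq_counter):
    """Same hypothetical scoring, forcing floating-point division."""
    total = sum(freq_counter)
    if total == 0:
        return 0.0
    return freq_counter[-1] / float(total)


# Assumed bucket layout: [illegal, very unlikely, normal, very likely]
counts = [0, 1, 0, 999]  # one low-confidence character in 1000
print(confidence_truncating(counts))     # 0.0
print(confidence_true_division(counts))  # 0.999
```

With integer division the score is either 0 or 1, so one stray character is enough to zero out an otherwise clean file.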

Hope this is useful - it's certainly helped fix a few problem documents in the application I'm working on.

Best,

Doug

puzzlet pushed a commit to puzzlet/python-chardet that referenced this pull request Jan 25, 2013
@rspeer

rspeer commented Aug 26, 2013

So that's why!

I would love to see a release of chardet that fixes this bug. As it is, chardet basically can't be used for latin-1, and that's the most common single-byte encoding. (Well, Windows-1252 is, but chardet doesn't really distinguish those.)
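For context on the latin-1 versus Windows-1252 remark, the sketch below uses only Python's built-in codecs (not chardet) to show where the two encodings agree and where they differ. The sample bytes are chosen purely for illustration.

```python
# Bytes 0xA0-0xFF decode identically in both encodings.
accented = b"\xe9"  # e with acute accent
same = accented.decode("latin-1") == accented.decode("cp1252")

# Bytes in 0x80-0x9F differ: latin-1 maps them to C1 control characters,
# while Windows-1252 maps most of them to printable punctuation.
quote = b"\x93"
as_latin1 = quote.decode("latin-1")  # U+0093, a control character
as_cp1252 = quote.decode("cp1252")   # U+201C, left double quotation mark

print(same)                  # True
print(hex(ord(as_latin1)))   # 0x93
print(hex(ord(as_cp1252)))   # 0x201c
```

Text that stays outside the 0x80-0x9F range decodes the same way under either label, which is one reason the two are often treated as interchangeable.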
