Try UnicodeDammit instead of using chardet library (or maybe combine the approaches). #42
You mean guess the correct encoding if the wrong encoding or no encoding was specified in the HTTP headers? The problem with the encoding specification in the HTTP header is that people often forget to set it. The default according to the HTTP specification is ISO-8859-1, which is rarely the actual encoding, especially when you’re dealing with non-English pages.
I’m using the requests library in just about all of my programs, which “correctly” assumes ISO-8859-1 in such cases. To actually decode the page, my programs check the actual Content-Type header to see whether the charset was set explicitly or merely inferred. If it was inferred, they ignore it. Then they call
Two problems remain, however. If none of the tried encodings could decode the response, I’m still in luck because the program actually knows that the decoding was unsuccessful; if, however, one of the incorrect encodings was coincidentally able to decode the page, no decoding errors are raised but the text is garbled. Barring comparisons of language-specific character n-gram models, I can’t think of any automatic method to detect such errors.
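The decoding step described above could look roughly like this. It is a minimal sketch using only the standard library, not the code from my programs; `explicit_charset` and `decode_strict` are hypothetical names, and the list of candidate encodings is assumed to be supplied by the caller.

```python
from email.message import Message


def explicit_charset(content_type):
    """Return the charset named in a Content-Type value, or None if absent."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the header has no charset parameter,
    # which is how an inferred default can be told apart from an explicit one.
    return msg.get_param("charset")


def decode_strict(body, candidates):
    """Try each candidate strictly; raise if none can decode the bytes."""
    for encoding in candidates:
        if not encoding:
            continue  # skip sources that declared nothing
        try:
            return body.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next one
    raise ValueError("none of the candidate encodings could decode the body")
```

The `ValueError` covers the lucky case where the failure is visible. The second problem stays undetected: `decode_strict("Grüße".encode("utf-8"), ["cp1252"])` returns garbled text without raising anything.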
After all the sources of explicit encoding specifications have been tried,
Here’s some similar code from a private project of mine: https://bitbucket.org/Telofy/resyndicator/src/b0fdce864919bbbf68561142442428e09fb26112/resyndicator/fetchers.py?at=master#cl-52.
I use this package to parse content in 20 languages, and had to write my own shim to ensure that I only feed unicode to the Document class.
I tried many ways to develop this shim, and finally found something that works across all tested languages.
Attempt 1: Using chardet: Chardet worked very well for European languages, but fell short on CJK encodings that have a superset. Also, the large amount of ASCII in the first part of HTML content throws it off.
Attempt 2: Reading headers and charset encoding declarations: I would look for these flags in the response headers and text, then decode. Unfortunately, people lie, especially CJK sites. Many Chinese/Korean sites would state that they use big5/utf-8 but not actually respond with that content. That, or the headers say 'utf-8' while the charset declaration in the markup says 'big5'.
Attempt 3: Using UnicodeDammit: Unicode Dammit is pretty cool. It's aware that HTML tags/cruft need to be stripped before guessing encoding. And it tries many encodings before giving up. Unfortunately, it is useless for Korean pages that often feature broken encodings.
Final Solution So Far: I used the code in encoding.py to strip tags using regex. Then I ran cchardet (it's a bit faster) and keep a list of superset encodings that I have encountered (for instance, 'gb18030' is a superset of 'gb2312') and that are commonly "lied about" in the CJK space.
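As a rough illustration of the strip-then-detect step, here is a minimal sketch. It is not the code from encoding.py: the regexes are my own simplification, and `strip_markup` and `guess_encoding` are hypothetical names. The detector is passed in as a callable, assuming a chardet-style interface (bytes in, dict with an "encoding" key out), which is what `chardet.detect` and `cchardet.detect` provide.

```python
import re

# Crude byte-level patterns; a sketch, not a full HTML parser.
_SCRIPT_STYLE = re.compile(rb"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_COMMENT = re.compile(rb"<!--.*?-->", re.S)
_TAG = re.compile(rb"<[^>]*>")


def strip_markup(raw):
    """Remove scripts, styles, comments, and tags so mostly page text remains."""
    raw = _SCRIPT_STYLE.sub(b" ", raw)
    raw = _COMMENT.sub(b" ", raw)
    return _TAG.sub(b" ", raw)


def guess_encoding(raw, detect):
    """Run a chardet-style detector on the stripped bytes.

    `detect` is any callable that takes bytes and returns a dict with an
    "encoding" key (assumed interface, as in chardet.detect).
    """
    result = detect(strip_markup(raw))
    return result.get("encoding")
```

Stripping happens on the raw bytes, before any decoding, so the detector sees less ASCII markup and proportionally more of the actual page text.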
Because of this, I think the existing encoding support is a sound algorithm. It seems as though a common problem is these "lookalike" encodings. A call to cchardet instead of chardet could save time, and the alternate encodings list could be used.
For instance, just replace detected encodings:
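A minimal sketch of such a replacement, not the original snippet. `SUPERSETS` and `widen` are hypothetical names. Only the gb2312 entry is taken from the discussion above; the other entries are assumptions based on Python's codec names and should be checked against real pages before use.

```python
# Hypothetical lookup table: detected or declared encoding -> wider encoding
# that should decode the same bytes. Only "gb2312" comes from the thread;
# the remaining entries are assumptions.
SUPERSETS = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
    "big5": "big5hkscs",
    "euc-kr": "cp949",
}


def widen(encoding):
    """Return the superset encoding for `encoding`, or `encoding` unchanged."""
    if not encoding:
        return encoding
    key = encoding.lower().replace("_", "-")
    return SUPERSETS.get(key, encoding)
```

The result of `widen` would then be passed to `bytes.decode` in place of the detected or declared name, for example `body.decode(widen("GB2312"))`.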
Thanks, that’s very interesting! I should run a few more tests with cchardet and preprocessed HTML (with tags stripped); that’ll probably restore my confidence in that method. I’ve only been working with English and German sites so far (apart from some that may have found their way in by accident), and the only kind of “lie” that I often encountered in that area was when no HTTP encoding was specified.
There was another problem with my approach, which forced me to reimplement some small parts of UnicodeDammit: https://bitbucket.org/Telofy/utilofies/src/0d8cdc3ae5a0a08e7fb5906d96f0d8e2284751d1/utilofies/bslib.py?at=master#cl-15.
The encoding problem was reported to me, which usually means that it must’ve occurred on a number of pages, and I vaguely remember that I already ran into it and solved it for the old UnicodeDammit sometime in 2011.
When a page is declared, say, UTF-8 consistently everywhere (or anywhere) but contains a single illegal byte sequence—for example in an HTML comment like this one—then the “correct” encoding, UTF-8, is discarded. Moreover, Windows-1252 is somehow able to decode it, so that all umlauts and ligatures are mucked up.
When all declared encodings fail, I now immediately fall back on forcing the first one of them. Only if no encodings at all were declared anywhere do I fall back on UTF-8. I hope this will alleviate the problem.
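In code, the fallback policy amounts to something like the following. This is a minimal standalone sketch rather than the bslib.py implementation; `decode_with_fallback` is a hypothetical name, and using `errors="replace"` to "force" an encoding is my assumption about what forcing means here.

```python
def decode_with_fallback(body, declared):
    """Decode `body` using the declared encodings, in order of preference.

    Returns (text, encoding, exact); `exact` is False when the encoding had
    to be forced and undecodable bytes were replaced with U+FFFD.
    """
    # Fall back on UTF-8 only if nothing was declared anywhere.
    candidates = [enc for enc in declared if enc] or ["utf-8"]
    for encoding in candidates:
        try:
            return body.decode(encoding), encoding, True
        except (UnicodeDecodeError, LookupError):
            continue
    # Every candidate failed strictly: force the first one instead of
    # guessing an undeclared encoding such as Windows-1252.
    forced = candidates[0]
    try:
        return body.decode(forced, errors="replace"), forced, False
    except LookupError:
        # The declared name is not a codec Python knows; use UTF-8.
        return body.decode("utf-8", errors="replace"), "utf-8", False
```

For a UTF-8 page with one stray byte, such as `"Grüße".encode("utf-8") + b" <!-- \xff -->"`, this keeps the umlauts intact and turns only the illegal byte into a replacement character.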