
Character encoding detection #2

Closed · ghost opened this issue Dec 25, 2013 · 5 comments

Comments

@ghost commented Dec 25, 2013

I noticed that to get unicode, you rely on the requests package's request.text attribute (in network.py -> get_html). To produce this, requests uses only the HTTP header encoding declaration (requests.utils.get_encoding_from_headers()) and falls back to ISO-8859-1 if it doesn't find one. This results in incorrectly decoded text in a lot of cases.

You can use another function from requests to get the encodings declared in the HTML itself: requests.utils.get_encodings_from_content(), which fills in the gaps. What I generally do is check the response's encoding first. If it's not ISO-8859-1, then an encoding was found in the headers, and I return request.text as-is. If it is ISO-8859-1, I call requests.utils.get_encodings_from_content(), which parses the content via regex and returns a list of suggested encodings to try; these are generally correct.
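
A minimal sketch of that strategy, assuming a requests version that still ships requests.utils.get_encodings_from_content (the helper name get_unicode_html is illustrative, not the actual network.py code):

```python
# Illustrative sketch only; get_encodings_from_content may be deprecated or
# removed in newer requests releases.
import requests


def get_unicode_html(url):
    response = requests.get(url)
    # Case 1: the HTTP headers declared a real charset, so trust requests.
    if response.encoding and response.encoding.lower() != 'iso-8859-1':
        return response.text
    # Case 2: no charset in the headers (requests defaulted to ISO-8859-1),
    # so look for <meta>/XML charset declarations inside the document itself.
    for declared in requests.utils.get_encodings_from_content(response.text):
        try:
            return response.content.decode(declared)
        except (LookupError, UnicodeDecodeError):
            continue  # malformed or wrong declaration, try the next one
    # Case 3: nothing usable was declared; keep requests' ISO-8859-1 default.
    return response.text
```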

In the final case, neither approach works; an example is this page: http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml

There is no HTTP header encoding, and the encoding declaration in the content is malformed: content="text/html; charset=uISO-8859-1". Here we could use chardet or fall back to the ISO-8859-1 default that requests already uses (which works in this case).
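
A sketch of that last-resort path, assuming chardet as an optional extra dependency (the helper name is illustrative):

```python
import chardet


def decode_with_chardet(response):
    """Guess the encoding from raw bytes when headers and <meta> tags both fail."""
    guess = chardet.detect(response.content)
    encoding = guess.get('encoding') or 'ISO-8859-1'
    try:
        return response.content.decode(encoding, errors='replace')
    except LookupError:
        # chardet suggested a codec Python doesn't recognize (compare the
        # malformed "uISO-8859-1" above): keep requests' default instead.
        return response.content.decode('ISO-8859-1', errors='replace')


# usage: html = decode_with_chardet(requests.get(url))
```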

I'd be happy to add this to the code if desired so you can pull it in. Would network.py be the most appropriate place for it?

Edit: Also, I have a large collection of special-snowflake links that present decoding difficulties and edge cases; we could add them to the test suite if necessary.

@codelucas (Owner)

Sorry for the late response, it's Christmas haha. This would be great, looking forward to the pull request! The edge cases for our test suite would be welcome as well.

@codelucas (Owner)

Hey, if you are still thinking about adding this change, can you hold off for two days? I'm about to make another large revamp of the library (internationalization and API).

But once that's done, this change is greatly needed. I have been seeing cases, especially for foreign-language pages, where relying on .text gives the wrong output.

@ghost (Author) commented Jan 8, 2014

Cool, I actually have been testing out my changes to the get_html function but haven't committed them yet. I'm still deciding how to deal with some weird cases.

@jeffnappi (Contributor)

BS4 and/or UnicodeDammit would be an excellent solution for this.
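
A minimal sketch of that suggestion (UnicodeDammit ships with BeautifulSoup4; it tries the declared encodings first and falls back to chardet/cchardet if they are installed):

```python
import requests
from bs4 import UnicodeDammit

# The problem page from the original report.
url = 'http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml'
response = requests.get(url)

dammit = UnicodeDammit(response.content)
html = dammit.unicode_markup        # the decoded document
print(dammit.original_encoding)     # the encoding UnicodeDammit settled on
```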

@codelucas (Owner)

Closing; refer to this related BeautifulSoup4 issue, which handles the problem for us:
#44

hartym added a commit to hartym/newspaper that referenced this issue Jan 3, 2017
Improper language parsing for non-meta-language set