
Character encoding detection #2

Closed · ghost opened this issue Dec 25, 2013 · 5 comments

Comments

@ghost commented Dec 25, 2013

I noticed that to get unicode, you rely on the requests package's request.text attribute (in network.py -> get_html). To produce this, requests uses only the HTTP header encoding declaration (requests.utils.get_encoding_from_headers()) and falls back to ISO-8859-1 if it doesn't find one. This results in incorrectly decoded text in a lot of cases.

You can use another function from requests to get the encodings declared in the HTML itself: requests.utils.get_encodings_from_content(), which fills in the gaps. What I generally do is check the response's encoding first. If it's not ISO-8859-1, then an encoding was found in the headers, and I return request.text as-is. If it is ISO-8859-1, I call requests.utils.get_encodings_from_content(), which parses the content via regex and returns a list of suggested encodings to try; these are generally correct.
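
A minimal sketch of that strategy, assuming a requests version that still ships requests.utils.get_encodings_from_content (the helper name get_unicode_html is illustrative, not the actual network.py code):

```python
# Illustrative sketch only; get_encodings_from_content may be deprecated or
# removed in newer requests releases.
import requests


def get_unicode_html(url):
    response = requests.get(url)
    # Case 1: the HTTP headers declared a real charset, so trust requests.
    if response.encoding and response.encoding.lower() != 'iso-8859-1':
        return response.text
    # Case 2: no charset in the headers (requests defaulted to ISO-8859-1),
    # so look for <meta>/XML charset declarations inside the document itself.
    for declared in requests.utils.get_encodings_from_content(response.text):
        try:
            return response.content.decode(declared)
        except (LookupError, UnicodeDecodeError):
            continue  # malformed or wrong declaration, try the next one
    # Case 3: nothing usable was declared; keep requests' ISO-8859-1 default.
    return response.text
```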

In the final case, neither approach works; an example is this page: http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml

There is no HTTP header encoding, and the encoding declaration in the content is malformed: content="text/html; charset=uISO-8859-1". Here we could use chardet or fall back to the ISO-8859-1 default that requests already uses (which works in this case).
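
A sketch of that last-resort path, assuming chardet as an optional extra dependency (the helper name is illustrative):

```python
import chardet


def decode_with_chardet(response):
    """Guess the encoding from raw bytes when headers and <meta> tags both fail."""
    guess = chardet.detect(response.content)
    encoding = guess.get('encoding') or 'ISO-8859-1'
    try:
        return response.content.decode(encoding, errors='replace')
    except LookupError:
        # chardet suggested a codec Python doesn't recognize (compare the
        # malformed "uISO-8859-1" above): keep requests' default instead.
        return response.content.decode('ISO-8859-1', errors='replace')


# usage: html = decode_with_chardet(requests.get(url))
```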

I'd be happy to add this to the code if desired so you can pull it in. Would network.py be the most appropriate place for it?

Edit: Also, I have a large collection of special-snowflake links that present decoding difficulties and edge cases; we could add them to the test suite if necessary.

@codelucas (Owner)

Sorry for the late response, it's Christmas haha. This would be great, looking forward to the pull request! The edge cases for our test suite would be welcome as well.

@codelucas (Owner)

Hey, if you are still thinking about adding this change, can you hold off for two days? I'm about to make another large revamp of the library (internationalization and API).

But once that's done, this change is greatly needed. I have been seeing cases, especially for foreign-language pages, where relying on .text gives the wrong output.

@ghost (Author) commented Jan 8, 2014

Cool, I actually have been testing out my changes to the get_html function but haven't committed them yet. I'm still deciding how to deal with some weird cases.

@jeffnappi (Contributor)

BS4 and/or UnicodeDammit would be an excellent solution for this.
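
A minimal sketch of that suggestion (UnicodeDammit ships with BeautifulSoup4; it tries the declared encodings first and falls back to chardet/cchardet if they are installed):

```python
import requests
from bs4 import UnicodeDammit

# The problem page from the original report.
url = 'http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml'
response = requests.get(url)

dammit = UnicodeDammit(response.content)
html = dammit.unicode_markup        # the decoded document
print(dammit.original_encoding)     # the encoding UnicodeDammit settled on
```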

@codelucas (Owner)

Closing; refer to this related BeautifulSoup4 issue, which handles the problem for us:
#44

hartym added a commit to hartym/newspaper that referenced this issue Jan 3, 2017
Improper language parsing for non-meta-language set