Character encoding detection #2
Comments
Sorry for the late response, it's Christmas haha. This would be great; looking forward to the pull request! The edge cases for our testing suite would also be welcome.
Hey, if you are still thinking about adding this change, can you hold off for 2 days? I'm about to make another large revamp of the library (internationalization and API), but after it is done, this change is greatly needed. I have been seeing cases, especially for foreign languages, where relying on .text gives wrong output.
Cool, I actually have been testing out my changes to the get_html function but haven't committed them yet. I'm still deciding how to deal with some weird cases.
BS4 and/or UnicodeDammit would be an excellent solution for this. |
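For illustration, a minimal sketch of the UnicodeDammit approach (the sample bytes below are made up; UnicodeDammit tries the suggested encodings in order and settles on the first one that decodes cleanly):

```python
from bs4 import UnicodeDammit

# Illustrative raw bytes fetched without a usable header encoding.
raw = "café".encode("iso-8859-1")

# Try UTF-8 first, then ISO-8859-1. The UTF-8 attempt fails on the
# lone 0xE9 byte, so UnicodeDammit falls back to ISO-8859-1.
dammit = UnicodeDammit(raw, ["utf-8", "iso-8859-1"])
text = dammit.unicode_markup        # the decoded unicode text
used = dammit.original_encoding     # the encoding it settled on
```

This would slot into get_html as a replacement for the bare request.text access: feed it response.content and let it pick the encoding.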
Closing; refer to this related BeautifulSoup4 issue, which handles this for us:
Improper language parsing for non-meta-language set
I noticed that to get unicode, you rely on the requests package's request.text attribute (in network.py -> get_html). To pick an encoding, requests only uses the HTTP header encoding declaration (requests.utils.get_encoding_from_headers()) and falls back to ISO-8859-1 if it doesn't find one. This results in incorrect character encoding in a lot of cases.

You can use another function from requests, requests.utils.get_encodings_from_content(), which returns the encodings listed in the HTML and will fill in the gaps. What I generally do is test the request object's encoding first. If it's not ISO-8859-1, then an encoding was passed in the headers, and I return the request.text unicode. If it is, then I call requests.utils.get_encodings_from_content(), which parses the content via regex and returns a list of suggested encodings to try; these are generally correct.

In the final case, neither approach will work. An example is this page: http://boaforma.abril.com.br/fitness/todos-os-treinos/bikes-eletricas-759925.shtml
There is no HTTP header encoding, and the encoding declaration in the content is incorrect: content="text/html; charset=uISO-8859-1". Here we could use chardet, or fall back to the original ISO-8859-1 encoding that requests defaults to (it works in this case).

I'd be happy to add this to the code if desired so you can pull it. Would it be most appropriate to put this into the network.py file?
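To make the proposal concrete, here is a self-contained sketch of the tiered strategy. This is not the library's actual code: the meta-tag regex is a stand-in for what requests.utils.get_encodings_from_content() does internally, so the example doesn't depend on a particular requests version, and the chardet step is optional.

```python
import re

# Rough stand-in for requests.utils.get_encodings_from_content():
# pull charset declarations out of the HTML with a regex.
META_CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(content, header_encoding):
    """Decode raw HTML bytes (response.content) into unicode.

    header_encoding is what requests derived from the HTTP headers;
    requests reports ISO-8859-1 when no charset was declared there.
    """
    # Tier 1: trust the header encoding unless it is the blind default.
    if header_encoding and header_encoding.lower() != "iso-8859-1":
        return content.decode(header_encoding, errors="replace")

    # Tier 2: try encodings declared inside the HTML itself.
    for candidate in META_CHARSET_RE.findall(content):
        try:
            return content.decode(candidate.decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            continue  # bogus declaration such as "uISO-8859-1"

    # Tier 3: last resort -- chardet if it is installed, else the
    # ISO-8859-1 default that requests would have used anyway.
    try:
        import chardet
        guess = chardet.detect(content)["encoding"]
        if guess:
            return content.decode(guess, errors="replace")
    except ImportError:
        pass
    return content.decode("iso-8859-1", errors="replace")
```

With a real response this would be called as decode_html(resp.content, resp.encoding), and in network.py -> get_html it could replace the bare request.text access.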
Edit: Also, I have a large collection of special snowflake links that provide decoding difficulties and edge cases that we could add to the test suite if necessary.