get_clean_html: lxml error #108
Comments
The |
And I think I've found the error. Line 9 of the file should actually be: return clean_attributes(tounicode(self._html(True))) so that it forces the [...]. Gonna roll with this change for now. |
get_clean_html is meant for cleaning up the summary, not for cleaning up the full document. |
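The two comments above can be read together as a lazy-initialization problem: a method reads a parsed-document attribute that is only filled in once something else has triggered parsing. The sketch below is a toy model of that pattern using only the standard library. `LazyDocument`, `get_clean_html_unsafe`, and `get_clean_html_forced` are hypothetical names made up for illustration, and the "parsing" and "cleaning" steps are placeholders; this is not the library's actual code.

```python
class LazyDocument:
    """Toy stand-in for a lazily parsed document (hypothetical, not the real class)."""

    def __init__(self, raw):
        self.raw = raw
        self.html = None  # parsed form; stays None until something triggers parsing

    def _html(self, force=False):
        # Parse on first use, or re-parse when force=True.
        if force or self.html is None:
            self.html = self.raw.strip()  # placeholder for real HTML parsing
        return self.html

    def summary(self):
        # Going through the accessor triggers parsing as a side effect.
        return self._html()

    def get_clean_html_unsafe(self):
        # Reads the attribute directly: fails if nothing has parsed yet.
        return self.html.replace(' class="x"', "")

    def get_clean_html_forced(self):
        # Goes through the accessor, so parsing always happens first.
        return self._html(True).replace(' class="x"', "")


doc = LazyDocument('  <p class="x">Hi</p>  ')
try:
    doc.get_clean_html_unsafe()
except AttributeError as exc:
    print("unsafe call before parsing failed:", exc)

print(doc.get_clean_html_forced())
```

In this toy model the unsafe variant works only after `summary()` has run, which matches the remark that the method is meant to be used on the summary; the forced variant works in either order.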
I get the same error even when I call get_clean_html only on requests.get(url).text, but only on one specific URL: https://start.lesechos.fr/travailler-a-letranger/actu-internationales/expatriation-les-pays-qui-chouchoutent-le-plus-les-talents-13988.php
import requests
from readability import Document

page_response = requests.get(page_link)  # page_link is the URL above
doc = Document(page_response.text)
doc.get_clean_html()
|