get_clean_html: lxml error #108

dufferzafar · 2018-09-30T14:50:05Z

>>> u = "https://www.geeksforgeeks.org/samsung-research-institute-bangalore-srib-intern/"
>>> import requests
>>> r = requests.get(u)
>>> from readability import Document
>>> doc = Document(r.content)
>>> doc.get_clean_html()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dufferzafar/.local/lib/python3.7/site-packages/readability/readability.py", line 167, in get_clean_html
    return clean_attributes(tounicode(self.html))
  File "src/lxml/etree.pyx", line 3397, in lxml.etree.tounicode
TypeError: Type '<class 'NoneType'>' cannot be serialized.
Type '<class 'NoneType'>' cannot be serialized.

dufferzafar · 2018-09-30T15:05:32Z

The doc.summary() method works, but it doesn't seem to have all the data that we want.

dufferzafar · 2018-09-30T15:07:45Z

And I think I've found the error.

Line 167 of readability.py (the line shown in the traceback) should actually be:

   return clean_attributes(tounicode(self._html(True)))

So that it forces the self.html attribute to be set.

Gonna roll with this change for now.
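
To illustrate why the change helps, here is a minimal sketch of the lazy-parse pattern the traceback points to. It does not use readability itself: `LazyDocument`, `serialize`, `get_clean_html_buggy` and `get_clean_html_fixed` are hypothetical stand-ins, and the "parse" step is a placeholder. The assumption is that `self.html` stays `None` until `_html()` runs.

```python
def serialize(tree):
    # Hypothetical stand-in for lxml's tounicode(): it refuses None.
    if tree is None:
        raise TypeError("Type '<class 'NoneType'>' cannot be serialized.")
    return str(tree)


class LazyDocument:
    """Hypothetical stand-in for Document, reduced to the lazy-parse pattern."""

    def __init__(self, raw):
        self.raw = raw
        self.html = None  # assumption: nothing is parsed at construction time

    def _html(self, force=False):
        # Parse on first use, or re-parse when force=True.
        if force or self.html is None:
            self.html = self.raw.strip()  # placeholder for the real parser
        return self.html

    def get_clean_html_buggy(self):
        # Reads the attribute directly, so it sees None if nothing parsed yet.
        return serialize(self.html)

    def get_clean_html_fixed(self):
        # Forces the parse first, so the attribute is always set.
        return serialize(self._html(True))


doc = LazyDocument("<p>hi</p>")
try:
    doc.get_clean_html_buggy()
except TypeError as exc:
    print("buggy:", exc)
print("fixed:", doc.get_clean_html_fixed())
```

In this sketch the direct attribute read only succeeds after some other method has triggered the parse, which would be consistent with doc.summary() working while get_clean_html() fails on a fresh Document.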

buriy · 2018-10-01T14:13:51Z

get_clean_html is meant for cleaning up the summary, not the full document.
You can clean the full document yourself; you don't need this library for that ;)
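
As a rough sketch of what cleaning the full document yourself could look like, the snippet below strips presentational attributes using only the standard library. It is not the library's own cleaning logic: `AttributeStripper`, `strip_attributes` and the `DROP_ATTRS` set are hypothetical names chosen for illustration, and comments in the input are dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which attributes count as "presentational" is up to you.
DROP_ATTRS = {"style", "class", "width", "height", "align", "bgcolor"}


class AttributeStripper(HTMLParser):
    """Re-emits the markup it is fed, minus the attributes in DROP_ATTRS."""

    def __init__(self):
        # Keep entity and character references as written in the input.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _render(self, tag, attrs, closing=""):
        kept = [(k, v) for k, v in attrs if k not in DROP_ATTRS]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        return f"<{tag}{rendered}{closing}>"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_attributes(html_text):
    parser = AttributeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(strip_attributes('<p style="color:red" id="a">Hi &amp; bye<br/></p>'))
```

For real pages a proper HTML library would likely be more robust; this only shows that the cleaning step does not depend on get_clean_html.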

MChrys · 2019-01-23T11:39:45Z

I get the same error even when I build the Document from requests.get(url).text and call get_clean_html, but only for one specific URL: https://start.lesechos.fr/travailler-a-letranger/actu-internationales/expatriation-les-pays-qui-chouchoutent-le-plus-les-talents-13988.php

page_response = requests.get(page_link)
doc  = Document(page_response.text)
doc.get_clean_html()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-f64b1934a8ea> in <module>
      1 page_response = requests.get(page_link)
      2 doc  = Document(page_response.text)
----> 3 doc.get_clean_html()

C:\ProgramData\Anaconda3\lib\site-packages\readability\readability.py in get_clean_html(self)
    165         to disable or to improve DOM-to-text conversion in .summary() method
    166         """
--> 167         return clean_attributes(tounicode(self.html))
    168 
    169     def summary(self, html_partial=False):

src/lxml/etree.pyx in lxml.etree.tounicode()

TypeError: Type '<class 'NoneType'>' cannot be serialized.

dufferzafar mentioned this issue Sep 30, 2018

Force html parsing in get_clean_html() #109

Closed

buriy closed this as completed Oct 1, 2018

buriy reopened this Feb 7, 2019
