error: get comments, not text #23
Hi, readability is based on heuristics and can be unreliable for a small fraction of articles. If you'd like to fix this issue, you could inspect the top page parts submitted to the best-candidate election and select the one you need manually (e.g., the first result among the top 10 that doesn't contain the text "commentmetadata"), or suggest alternative heuristics or a different weighting procedure for them. Please look at the output of Document(html).score_paragraphs().
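The manual selection suggested above could be sketched roughly as follows. This is only an illustration: `pick_candidate` is a hypothetical helper, and the list of `(score, html_text)` pairs is an assumed shape, not the actual return type of `score_paragraphs()`, so the real candidates would need to be converted into this form first.

```python
def pick_candidate(candidates, marker="commentmetadata", top_n=10):
    """Return the HTML of the best-scoring candidate that lacks `marker`.

    `candidates` is a list of (score, html_text) pairs. This shape is an
    assumption made for illustration, not the library's own data structure.
    """
    # Rank by score, highest first, and keep only the top_n entries.
    ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:top_n]
    for _score, html_text in ranked:
        # Skip blocks that textually look like a comment section.
        if marker not in html_text:
            return html_text
    # Every top candidate contained the marker: fall back to the best one.
    return ranked[0][1] if ranked else None


# Made-up example data: the comment block outscores the article body.
candidates = [
    (42.0, '<div class="commentmetadata">Posted by someone</div>'),
    (37.5, "<div><p>Article body text.</p></div>"),
    (5.0, "<div>footer</div>"),
]
print(pick_candidate(candidates))
```

With the made-up data above, the highest-scoring block is skipped because it contains the marker, and the second-ranked block is returned instead. A plain substring test is the simplest reading of "don't include commentmetadata textually"; a class-based check on the parsed element might be more robust.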
xnj, you might give the breadability library a shot. It appears to parse this page correctly:
xnj commented Jun 21, 2012
my code:
from readability.readability import Document
import urllib
url='http://www.womanclinics.ru/boli-vnizu-zhivota-u-zhenshhin.html'
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
print readable_article
"""
readable_article contains the page's comments, not the article text.
But when I use the original tool: http://www.readability.com/articles/erldraqk
it works correctly: we get the article text, not the comments.
"""