error: get comments, not text #23

Closed
xnj opened this Issue Jun 21, 2012 · 2 comments

xnj commented Jun 21, 2012

My code:

from readability.readability import Document
import urllib

url = 'http://www.womanclinics.ru/boli-vnizu-zhivota-u-zhenshhin.html'
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
print readable_article

"""
readable_article contains the comments, not the article text.
But the original tool (http://www.readability.com/articles/erldraqk)
works correctly: it returns the article text, not the comments.
"""

Owner

buriy commented Jun 21, 2012

Hi,

readability is based on heuristics and can be unreliable for a small fraction of articles.
This is not the only article where the comments score higher than the article itself ;-)

If you would like to fix this issue, you could inspect the top-scoring page parts submitted to the best-candidate selection and pick the one you need manually, e.g. the first of the top 10 results whose markup does not contain the text "commentmetadata". Alternatively, you could suggest a different heuristic or weighting procedure for them.

Please look at Document(html).score_paragraphs() output.
You could subclass Document and write your own select_best_candidate method.
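
A minimal sketch of the "first of the top 10 without commentmetadata" rule, written as a standalone function so it runs without readability installed. It assumes the candidates mapping has the shape {element: {"content_score": float, "elem": element}}; that shape, the helper name pick_candidate, and its parameters are assumptions for illustration, so check them against the score_paragraphs() output of your installed version.

```python
import xml.etree.ElementTree as ET


def pick_candidate(candidates, banned=("commentmetadata",), top_n=10, serialize=None):
    """Return the best-scoring candidate whose markup has no banned marker.

    `candidates` is assumed to map element -> {"content_score": ..., "elem": ...}.
    `serialize` turns an element into a str; the default uses the stdlib
    ElementTree serializer so this sketch is self-contained.
    """
    if serialize is None:
        serialize = lambda elem: ET.tostring(elem, encoding="unicode")
    if not candidates:
        return None
    # Highest content_score first, as in a plain "take the top one" selection.
    ranked = sorted(candidates.values(), key=lambda c: c["content_score"], reverse=True)
    for candidate in ranked[:top_n]:
        markup = serialize(candidate["elem"])
        if not any(marker in markup for marker in banned):
            return candidate
    # Every top candidate looked like a comment block: fall back to the top score.
    return ranked[0]


# Toy data standing in for real page parts (hypothetical markup).
comments = ET.fromstring(
    '<div class="commentlist"><p class="commentmetadata">June 21</p><p>Nice post</p></div>'
)
article = ET.fromstring('<div class="entry"><p>Article text</p></div>')
candidates = {
    comments: {"content_score": 42.0, "elem": comments},
    article: {"content_score": 30.0, "elem": article},
}
best = pick_candidate(candidates)
print(best["elem"].get("class"))  # prints "entry"
```

To try it with readability, one option would be a Document subclass whose select_best_candidate(self, candidates) returns pick_candidate(candidates, serialize=...), passing a serializer that returns a str for lxml elements. Treat that wiring as untested here.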

Collaborator

mitechie commented Jun 21, 2012

xnj, you might give the breadability library a shot.

It appears to parse this page correctly:
http://readable.bmark.us/view/http%3A%2F%2Fwww.womanclinics.ru%2Fboli-vnizu-zhivota-u-zhenshhin.html

@buriy buriy closed this Oct 9, 2013
