I can't extract content from this Chinese article #47
Comments
Hi Nathan,
I made a pull request with some specific handling of Chinese encodings (pages declared as gb2312 are often NOT actually gb2312). As for the content issue, poor web design on many Chinese sites makes it difficult to determine the content correctly. The example you posted uses tables to lay out content, and that is the kind of thing that makes this readability algorithm hard to apply to typical Chinese websites. The difficulty probably won't go away without large changes. At my company we scrape large amounts of Chinese text, and we have basically decided to live with the fact that many sites are unscrapable and to add special handling only for large or important sites.
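To illustrate the encoding point, here is a minimal sketch of one common workaround, not necessarily what the pull request does. It assumes that a page labelled gb2312 may contain characters outside GB2312 proper, so it decodes with the GB18030 superset instead. The function name `decode_declared` and the `SUPERSETS` mapping are hypothetical, introduced only for this example.

```python
# Hypothetical mapping: decode legacy Chinese labels with a superset codec.
# Assumption: GB18030 covers everything a "gb2312" or "gbk" page may contain.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}


def decode_declared(raw: bytes, declared: str) -> str:
    """Decode raw page bytes, widening gb2312/gbk labels to gb18030."""
    codec = SUPERSETS.get(declared.strip().lower(), declared)
    try:
        return raw.decode(codec)
    except (UnicodeDecodeError, LookupError):
        # Unknown label or undecodable bytes: fall back to lossy UTF-8
        # rather than returning nothing.
        return raw.decode("utf-8", errors="replace")


# The character 镕 is a well-known example that is outside GB2312 but inside GBK.
page_bytes = "朱镕基".encode("gb18030")
text = decode_declared(page_bytes, "GB2312")
```

Decoding first only fixes garbled text; it does not help when the extractor cannot find the content block at all, which is the separate table-layout problem described above.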
Thanks a lot, @mperdomo1. Have you had a chance to compare python-readability to the newspaper and python-goose projects yet?
I checked out newspaper when it was new, and it seemed to use python-goose to handle readability. At the time, python-readability was a lot more robust and better at picking out content, though this could definitely have changed since. I am actually thinking about writing a readability project dedicated to CJK pages, because their structure is so often broken.
@mperdomo1 Thanks, I've applied your patch, but the document Nathan mentioned still comes out blank. P.S. I'll also have a chance to test readability on millions of URLs from thousands of news sources over the next several months, so I'll add a lot of new tests and we'll have a completely refreshed version of the module by then.
The article in question works fine now, so I'm closing this issue.
When I use readability on this article, http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm, I am unable to extract any content. The encoding is gb2312, but I've converted it to Unicode and the summary is still empty. The HTML elements don't have informative ids/classes. Is there any way readability could handle documents like this?