I can't extract content from this Chinese article #47

Closed
nathanathan opened this issue May 12, 2014 · 6 comments

Comments

@nathanathan

When I use readability on this article http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm I am unable to extract any content. The encoding is gb2312, but I've converted it to Unicode and the summary is still empty. The HTML elements don't have informative ids/classes; is there any way readability could handle documents like this?
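
Roughly what I'm doing (a minimal sketch, assuming the `Document` API from this library and decoding the gb2312 bytes before parsing):

```python
import urllib.request

from readability import Document  # python-readability

url = "http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm"
raw = urllib.request.urlopen(url).read()

# The page declares gb2312, so decode to Unicode before handing it to readability.
html = raw.decode("gb2312", errors="replace")

doc = Document(html)
print(doc.short_title())
print(doc.summary())  # comes back essentially empty for this page
```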


buriy commented May 12, 2014

Hi Nathan,
At the moment you could set the encoding manually (just copy the `__main__` code from https://github.com/buriy/python-readability/blob/master/readability/readability.py#L588 into your own script if you use it from the command line), or write some code to extract the meta-charset information from the page (note that the declared charset is sometimes wrong).
I'll happily merge that functionality into this branch if you or someone else contributes it.
There's also the still-unsolved #42 about finding the best general solution to this problem.
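
A rough sketch of the meta-charset approach (this is not part of the library's API; the regex only covers the common `<meta charset=...>` and `http-equiv` forms, and, as noted, the declared charset is sometimes wrong):

```python
import re
import urllib.request

def decode_with_meta_charset(raw_bytes, fallback="utf-8"):
    """Best-effort decode: trust the page's declared meta charset, else fall back."""
    head = raw_bytes[:4096].decode("ascii", errors="ignore")
    match = re.search(r'charset=["\']?\s*([a-zA-Z0-9_.:-]+)', head, re.IGNORECASE)
    encoding = match.group(1) if match else fallback
    try:
        return raw_bytes.decode(encoding, errors="replace")
    except LookupError:  # unknown or misspelled charset name in the page
        return raw_bytes.decode(fallback, errors="replace")

raw = urllib.request.urlopen(
    "http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm"
).read()
html = decode_with_meta_charset(raw, fallback="gb2312")
```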


ghost commented May 13, 2014

I made a pull request with some specific handling of Chinese encodings (pages declared as gb2312 are often NOT actually gb2312).

For the content issue: poor Chinese web design in general makes it difficult to determine the content correctly. In the example you posted, the site uses tables to lay out the content. It's difficult to apply this readability algorithm to typical Chinese websites for that sort of reason, and the difficulty probably won't go away without large changes. At my company we scrape large amounts of Chinese text and have basically decided to live with the fact that many sites are unscrapable, adding special handling for large/important sites instead.
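
The general idea behind that kind of encoding handling (a sketch of the approach, not necessarily the exact patch) is that pages declaring gb2312 frequently contain characters that are only valid in GBK/GB18030, so decoding the whole GB family as gb18030 (a strict superset) avoids most decode errors:

```python
# Sketch, not the exact patch: map the narrower GB declarations to gb18030,
# which is a strict superset of gb2312 and gbk, before decoding.
GB_FAMILY = {"gb2312", "gb-2312", "gb_2312-80", "gbk", "cp936"}

def normalize_chinese_encoding(declared, default="utf-8"):
    enc = (declared or "").strip().lower()
    if enc in GB_FAMILY:
        return "gb18030"
    return enc or default

def decode_page(raw_bytes, declared_encoding):
    encoding = normalize_chinese_encoding(declared_encoding)
    return raw_bytes.decode(encoding, errors="replace")
```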


buriy commented May 13, 2014

Thanks a lot, @mperdomo1
I'll try to test it on some Chinese news sources tomorrow and will merge it if everything goes well.

Have you had a chance to compare python-readability to the newspaper and python-goose projects yet?


ghost commented May 14, 2014

I checked out newspaper when it was new, and it seemed to use python-goose for content extraction; at the time, python-readability was a lot more robust and better at picking out content. That could well have changed since, though. I'm actually thinking about writing a readability project dedicated to CJK pages, because their structure can be so broken.


buriy commented May 18, 2014

@mperdomo1 Thanks, I've applied your patch, but the document Nathan mentioned still comes back blank.
@nathanathan
I'm working on resolving your issue and also on fixing logging in the new version.
I found a serious HTML sanitizing bug related to an images block contained inside a (parent) content block -- the whole content block gets removed, when only the images block should be removed.

P.S. I'll also have a chance to test readability on millions of URLs from thousands of news sources over the next several months, so I'll add a lot of new tests and we'll have a completely refreshed version of the module by then.


buriy commented Jul 27, 2015

The article in question works fine now, so I'm closing this issue.
Thanks for your patience.

buriy closed this as completed Jul 27, 2015