I can't extract content from this Chinese article #47

Closed
nathanathan opened this issue May 12, 2014 · 6 comments

Comments

@nathanathan

When I use readability on this article http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm I am unable to extract any content. The encoding is gb2312, but I've converted it to Unicode and the summary is still empty. The HTML elements don't have informative ids/classes; is there any way readability could handle documents like this?
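
Roughly what I'm doing (a minimal sketch, assuming the `Document` API from this library and decoding the gb2312 bytes before parsing):

```python
import urllib.request

from readability import Document  # python-readability

url = "http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm"
raw = urllib.request.urlopen(url).read()

# The page declares gb2312, so decode to Unicode before handing it to readability.
html = raw.decode("gb2312", errors="replace")

doc = Document(html)
print(doc.short_title())
print(doc.summary())  # comes back essentially empty for this page
```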


buriy commented May 12, 2014

Hi Nathan,
At the moment you could set the encoding manually (just copy the `__main__` code from https://github.com/buriy/python-readability/blob/master/readability/readability.py#L588 into your own script if you use it from the command line), or write some code to extract the meta-charset information from the page (note that the declared charset is sometimes wrong).
I'll happily merge that functionality into this branch if you or someone else contributes it.
There's also the still-unsolved #42 about finding the best general solution to this problem.
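
A rough sketch of the meta-charset approach (this is not part of the library's API; the regex only covers the common `<meta charset=...>` and `http-equiv` forms, and, as noted, the declared charset is sometimes wrong):

```python
import re
import urllib.request

def decode_with_meta_charset(raw_bytes, fallback="utf-8"):
    """Best-effort decode: trust the page's declared meta charset, else fall back."""
    head = raw_bytes[:4096].decode("ascii", errors="ignore")
    match = re.search(r'charset=["\']?\s*([a-zA-Z0-9_.:-]+)', head, re.IGNORECASE)
    encoding = match.group(1) if match else fallback
    try:
        return raw_bytes.decode(encoding, errors="replace")
    except LookupError:  # unknown or misspelled charset name in the page
        return raw_bytes.decode(fallback, errors="replace")

raw = urllib.request.urlopen(
    "http://www.he.xinhuanet.com/news/2012-06/05/content_25346340.htm"
).read()
html = decode_with_meta_charset(raw, fallback="gb2312")
```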


ghost commented May 13, 2014

I made a pull request with some specific handling of Chinese encodings (pages declared as gb2312 are often NOT actually gb2312).

For the content issue: poor Chinese web design in general makes it difficult to determine the content correctly. In the example you posted, the site uses tables to lay out the content. It's difficult to apply this readability algorithm to typical Chinese websites for that sort of reason, and the difficulty probably won't go away without large changes. At my company we scrape large amounts of Chinese text and have basically decided to live with the fact that many sites are unscrapable, adding special handling for large/important sites instead.
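
The general idea behind that kind of encoding handling (a sketch of the approach, not necessarily the exact patch) is that pages declaring gb2312 frequently contain characters that are only valid in GBK/GB18030, so decoding the whole GB family as gb18030 (a strict superset) avoids most decode errors:

```python
# Sketch, not the exact patch: map the narrower GB declarations to gb18030,
# which is a strict superset of gb2312 and gbk, before decoding.
GB_FAMILY = {"gb2312", "gb-2312", "gb_2312-80", "gbk", "cp936"}

def normalize_chinese_encoding(declared, default="utf-8"):
    enc = (declared or "").strip().lower()
    if enc in GB_FAMILY:
        return "gb18030"
    return enc or default

def decode_page(raw_bytes, declared_encoding):
    encoding = normalize_chinese_encoding(declared_encoding)
    return raw_bytes.decode(encoding, errors="replace")
```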


buriy commented May 13, 2014

Thanks a lot, @mperdomo1
I'll try to test it on some Chinese news sources tomorrow and will merge it if everything goes well.

Have you had a chance to compare python-readability to the newspaper and python-goose projects yet?


ghost commented May 14, 2014

I checked out newspaper when it was new, and it seemed to use python-goose for content extraction; at the time, python-readability was a lot more robust and better at picking out content. That could well have changed since, though. I'm actually thinking about writing a readability project dedicated to CJK pages, because their structure can be so broken.


buriy commented May 18, 2014

@mperdomo1 Thanks, I've applied your patch, but the document Nathan mentioned still comes back blank.
@nathanathan
I'm working on resolving your issue and also on fixing logging in the new version.
I found a serious HTML sanitizing bug related to an images block contained inside a (parent) content block -- the whole content block gets removed, when only the images block should be removed.

P.S. I'll also have a chance to test readability on millions of URLs from thousands of news sources over the next several months, so I'll add a lot of new tests and we'll have a completely refreshed version of the module by then.


buriy commented Jul 27, 2015

The article in question works fine now, so I'm closing this issue.
Thanks for your patience.

buriy closed this as completed Jul 27, 2015