
Add article html extraction #11

Merged
codelucas merged 1 commit into codelucas:master

3 participants

@voidfiles

I think it could be useful to get the HTML of the extracted article as well. That lets you retain some of the semantic information in the HTML, and it also helps if you end up displaying the extracted article somewhere.

@codelucas
Owner

Thanks for the pull request! It looks good; I'll play with the code over the weekend and most likely merge it on Monday.

I think it makes sense to add a keep_article_html option in the configs so the user can choose whether to keep the extra HTML or not.

Do you think the default should be on or off?
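For reference, a minimal sketch of what such a flag could look like on the config object; the Configuration class shown here is only a stand-in, and just the keep_article_html name comes from the suggestion above:

# Hypothetical sketch only, not the real Configuration class.
class Configuration(object):
    def __init__(self):
        # When True, parse() would also keep the cleaned HTML of the
        # extracted article node alongside the plain text.
        self.keep_article_html = False  # default still under discussion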

@voidfiles
@codelucas
Owner

Oh wow, your JSON API is pretty cool, and I like the repository name :p (newspaper 0.0.5 is out with a lot of bug fixes, but unfortunately there are still tons of bugs, so consider updating your requirements.txt).

The only tradeoff to keeping the article HTML would be for people building multiple newspapers from different huge news sources: those generate so many articles that holding all of the extra article HTML can be heavy on RAM.

But it's still a very good feature, so I'll have to think about it.

@voidfiles

So if that's the case, I think defaulting to off is a good idea. It won't be hard for me to turn it on by default in my project.

@codelucas codelucas merged commit 8b12b5a into codelucas:master
@codelucas
Owner

I just merged it; thanks for the pull request. I also committed some changes immediately afterwards: I moved all of your lxml imports into the parser.py module. I'm not sure it's a good design choice yet, but I want all the lxml objects kept in that module for bookkeeping & sanity.

I also added a config option to turn your feature on/off (default off). It's a keep_article_html boolean in the configuration. Config objects are within the Source and Article classes.

An example usage (which has been tested and works) is:

from newspaper import Article

a = Article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html?hpt=hp_t1', keep_article_html=True)
a.download()
a.parse()

>>> print a.article_html[:600]
u'<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunday that the former NBA players who went to North Korea for a basketball diplomacy trip, led by Dennis Rodman, weren\'t paid by the repressive regime.</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph2">"Absolutely not. I think I am astute enough to understand the dynamics, especially collecting monetary dollars from North Korea. No, we did not get paid from North Korea at all," he told CNN in a lengthy exclusive interview on "New Day Sunday."</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph3">Smith, who retired from the NBA in 1997 after nine seasons, said an Irish online betting company and a documentary film crew paid expenses for the ex-players turned hoops ambassadors.</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph4">Last month the Irish company, Paddy Power, said it had removed its name from Rodman\'s project after the <a href="http://www.cnn.com/2013/12/31/world/asia/north-korea-kim-jong-un-speech/index.html">execution of Kim\'s uncle and top aide, Jang Song Thaek.</a> But it said it would honor its "contractual commitments" to the team.</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph5">Speaking by satellite from Beijing, Smith said it wasn\'t about the money. He saw it as an opportunity to go to a reclusive country and exchange cultural information with other athletes and citizens. But he didn\'t see it as a birthday present for North Korean leader Kim Jong Un.</p>\n \n \n<p class="cnn_storypgraphtxt cnn_storypgraph6">"That\'s the date that was set. I didn\'t know it was his birthday," he said in the half-hour interview. "And it didn\'t matter to me once I found out that it was his birthday."</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph7">'
...
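If you'd rather set the flag on a shared config object than per Article (the diff below shows Article.__init__ already takes config=None), something along these lines should work; the newspaper.configuration import path is an assumption on my part:

from newspaper import Article
from newspaper.configuration import Configuration  # import path assumed

config = Configuration()
config.keep_article_html = True  # the boolean described above; default is off

url = 'http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html?hpt=hp_t1'
a = Article(url, config=config)  # same CNN article as above
a.download()
a.parse()
print a.article_html[:600]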

Thanks once again for the pull!

@codelucas
Owner

I'll add this feature (and your name in the contribs) in the next 0.0.6 release!

@tbmoss3
@codelucas
Owner

You should try pip installing this library in a clean virtualenv; that's especially important for a library as big as newspaper.

The error you are getting is very common when there are multiple or conflicting lxml installations on your Python path; it ends up referencing the wrong etree.

virtualenv newspaper-env
cd newspaper-env
source bin/activate
pip install newspaper

(If you are on Ubuntu, run easy_install lxml first, then pip install newspaper.)
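A quick way to confirm which lxml the virtualenv actually picks up (just a diagnostic suggestion, not part of newspaper itself):

# Run inside the activated virtualenv.
import lxml.etree
print lxml.etree.__file__      # should point inside newspaper-env
print lxml.etree.LXML_VERSION  # the compiled lxml version tuple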
Commits on Jan 11, 2014
  1. Add article html extraction (authored by @voidfiles)
.gitignore (1 line changed)
@@ -39,3 +39,4 @@ nosetests.xml
 .mr.developer.cfg
 .project
 .pydevproject
+venv
newspaper/article.py (10 lines changed)
@@ -85,6 +85,9 @@ def __init__(self, url, title=u'', source_url=u'', config=None, **kwargs):
         # the article's unchanged and raw html
         self.html = u''
+        # The html of the main article node
+        self.article_html = u''
+
         # flags warning users in-case they forget to download() or parse()
         self.is_parsed = False
         self.is_downloaded = False
@@ -190,7 +193,8 @@ def parse(self):
         self.set_movies(video_extractor.get_videos())
         self.top_node = self.extractor.post_cleanup(self.top_node)
-        text = output_formatter.get_formatted_text(self)
+        text, article_html = output_formatter.get_formatted(self)
+        self.set_article_html(article_html)
         self.set_text(text)
         if self.raw_doc is not None:
@@ -370,6 +374,10 @@ def set_text(self, text):
         if text:
             self.text = text
+    def set_article_html(self, article_html):
+        if article_html:
+            self.article_html = article_html
+
     def set_top_img(self, src_url):
         """
         We want to provide 2 api's for images. One at
newspaper/outputformatters.py (20 lines changed)
@@ -5,6 +5,17 @@
 from HTMLParser import HTMLParser
 from .text import innerTrim
+import lxml
+from lxml.html.clean import Cleaner
+
+
+cleaner = Cleaner()
+cleaner.javascript = True
+cleaner.style = True
+cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b', 'em']
+cleaner.remove_unknown_tags = False
+
+
 class OutputFormatter(object):
     def __init__(self, config):
@@ -26,14 +37,16 @@ def get_language(self, article):
     def get_top_node(self):
         return self.top_node
-    def get_formatted_text(self, article):
+    def get_formatted(self, article):
         self.top_node = article.top_node
         self.remove_negativescores_nodes()
+        html = self.convert_to_html()
         self.links_to_text()
         self.add_newline_to_br()
         self.replace_with_text()
         self.remove_fewwords_paragraphs(article)
-        return self.convert_to_text()
+        text = self.convert_to_text()
+        return (text, html)
     def convert_to_text(self):
         txts = []
@@ -45,6 +58,9 @@ def convert_to_text(self):
         txts.extend(txt_lis)
         return '\n\n'.join(txts)
+    def convert_to_html(self):
+        return lxml.html.tostring(cleaner.clean_html(self.get_top_node()))
+
     def add_newline_to_br(self):
         for e in self.parser.getElementsByTag(self.top_node, tag='br'):
             e.text = r'\n'
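For reference, the Cleaner configured in this diff strips scripts and styles and whitelists only a handful of inline tags. A small standalone sketch with the same settings (the sample HTML fragment is made up):

import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b', 'em']
cleaner.remove_unknown_tags = False

# Made-up fragment just to show the effect of the whitelist.
doc = lxml.html.fromstring(
    '<div><script>track()</script>'
    '<p>Hello <b>world</b>, see <a href="/x">this</a>.</p>'
    '<table><tr><td>layout junk</td></tr></table></div>')

# Scripts are dropped, tags outside allow_tags are unwrapped (their text
# is kept), and the root element is reduced to a bare <div>, which is why
# the article_html example above starts with '<div>'.
print lxml.html.tostring(cleaner.clean_html(doc))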