Add article html extraction #11

Merged
merged 1 commit into from Jan 13, 2014

Conversation

voidfiles
Contributor

I think it could be useful to get the HTML of the extracted article as well. This allows you to retain some of the semantic information in the html. Also it will help if you end up displaying the extracted article somehow.

@codelucas
Owner

Thanks for the pull request! It looks good. I'll play with the code over this weekend and most likely merge it on Monday.

I think it makes sense to add an option keep_article_html in the configs to see if the user wants to keep the extra html or not.

Do you think the default should be on or off?

@voidfiles
Contributor Author

I am not sure what the default should be because I am not sure what most
people are using newspaper for. I just built a JSON API for newspaper,
https://github.com/voidfiles/newspaper-delivery, and I thought article html
extraction would be a good addition. So, for my use case, I'd like it to be
turned on by default.

On Sat, Jan 11, 2014 at 1:53 PM, Lucas Ou-Yang notifications@github.com wrote:

Thanks for the pull request! It looks good, will play with the code over
this weekend and most likely merge it on Monday.

I think it makes sense to add an option in the configs to see if the user
wants to keep the article or not.
Do you think the default should be on or off?



@codelucas
Owner

Oh wow your JSON API is pretty cool, I like the repository name :p (newspaper 0.0.5 is out with a lot of bug fixes but there are still tons of bugs unfortunately, so maybe consider updating your requirements.txt).

The only tradeoff of turning the article html on is memory: when people build multiple newspapers for several huge news sources, those sources generate so many articles that all of the extra article html can be heavy on RAM.

But that is a very good feature still, so I have to think about it.

@voidfiles
Contributor Author

So, if that is the case, then I would think that defaulting to off is a good idea. It won't be hard for me to default it to on for my project.

codelucas added a commit that referenced this pull request Jan 13, 2014
Add article html extraction
@codelucas codelucas merged commit 8b12b5a into codelucas:master Jan 13, 2014
@codelucas
Owner

I just merged it, thanks for the pull request. I also committed some changes immediately afterwards. I moved all of your lxml imports into the parsers.py module. I'm not sure if it's a good design choice yet, but I want all the lxml objects to be kept in that module for bookkeeping & sanity.

I also added a config option to turn your feature on/off (default off). It's a keep_article_html boolean in the configuration. Config objects are within the Source and Article classes.

An example usage (which has been tested and works) is:

from newspaper import Article

a = Article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html?hpt=hp_t1', keep_article_html=True)
a.download()
a.parse()

>>> print a.article_html[:600]
u'<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunday that the former NBA players who went to North Korea for a basketball diplomacy trip, led by Dennis Rodman, weren\'t paid by the repressive regime.</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph2">"Absolutely not. I think I am astute enough to understand the dynamics, especially collecting monetary dollars from North Korea. No, we did not get paid from North Korea at all," he told CNN in a lengthy exclusive interview on "New Day Sunday."</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph3">Smith, who retired from the NBA in 1997 after nine seasons, said an Irish online betting company and a documentary film crew paid expenses for the ex-players turned hoops ambassadors.</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph4">Last month the Irish company, Paddy Power, said it had removed its name from Rodman\'s project after the <a href="http://www.cnn.com/2013/12/31/world/asia/north-korea-kim-jong-un-speech/index.html">execution of Kim\'s uncle and top aide, Jang Song Thaek.</a> But it said it would honor its "contractual commitments" to the team.</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph5">Speaking by satellite from Beijing, Smith said it wasn\'t about the money. He saw it as an opportunity to go to a reclusive country and exchange cultural information with other athletes and citizens. But he didn\'t see it as a birthday present for North Korean leader Kim Jong Un.</p>\n \n \n<p class="cnn_storypgraphtxt cnn_storypgraph6">"That\'s the date that was set. I didn\'t know it was his birthday," he said in the half-hour interview. "And it didn\'t matter to me once I found out that it was his birthday."</p>\n<p class="cnn_storypgraphtxt cnn_storypgraph7">'
...
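
The on/off behaviour described above can be illustrated without newspaper itself. The following is only a minimal sketch: `SketchArticle` and its `store` method are hypothetical stand-ins invented for illustration, not newspaper's real classes or internals. It shows the idea of an opt-in flag that defaults to off, so the extra HTML is only held in memory when the caller asks for it.

```python
class SketchArticle(object):
    """Hypothetical stand-in for an article object (not newspaper's class)."""

    def __init__(self, url, keep_article_html=False):
        self.url = url
        # Default off: bulk crawls do not pay the memory cost of extra HTML.
        self.keep_article_html = keep_article_html
        self.text = ""
        self.article_html = ""

    def store(self, extracted_text, extracted_html):
        # The plain text is always kept.
        self.text = extracted_text
        # The HTML of the extracted article is kept only when opted in.
        if self.keep_article_html:
            self.article_html = extracted_html


plain = SketchArticle("http://example.com/story")
plain.store("Hello", "<p>Hello</p>")

rich = SketchArticle("http://example.com/story", keep_article_html=True)
rich.store("Hello", "<p>Hello</p>")
```

With the flag left at its default, `plain.article_html` stays empty; with the flag set, `rich.article_html` holds the markup.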

Thanks once again for the pull!

@codelucas
Owner

I'll add this feature (and your name in the contribs) in the next 0.0.6 release!

@tbmoss3

tbmoss3 commented Jan 14, 2014

Hi everyone.

Quick question. When I try the current package on Github and run pip
install on it, and then run setup, it throws an error:

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import newspaper
  File "C:\Python27\lib\site-packages\newspaper\__init__.py", line 12, in <module>
    from .article import Article, ArticleException
  File "C:\Python27\lib\site-packages\newspaper\article.py", line 17, in <module>
    from . import network
  File "C:\Python27\lib\site-packages\newspaper\network.py", line 8, in <module>
    from .configuration import Configuration
  File "C:\Python27\lib\site-packages\newspaper\configuration.py", line 14, in <module>
    from .parsers import Parser, ParserSoup
  File "C:\Python27\lib\site-packages\newspaper\parsers.py", line 9, in <module>
    import lxml.html
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 42, in <module>
    from lxml import etree
ImportError: cannot import name etree

Any known reason why? I haven't been keeping up too much with the thread, but
I am extremely interested in this package!

Thanks,

Benton

On Sat, Jan 11, 2014 at 4:41 PM, Alex Kessinger notifications@github.com wrote:

I think it could be useful to get the HTML of the extracted article as
well. This allows you to retain some of the semantic information in the
html. Also it will help if you end up displaying the extracted article
somehow.


@codelucas
Owner

You should try to pip install this library on a clean virtualenv. It's especially important for a library as big as newspaper.

The error you are getting is very common when there are multiple or conflicting lxml installations on your Python path. The import is probably resolving to an etree that is incorrect.

virtualenv newspaper-env
cd newspaper-env
source bin/activate
pip install newspaper

(If you are using Ubuntu, run easy_install lxml first, then pip install newspaper.)
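
One way to check for the "multiple lxml's" situation before rebuilding the environment is to look for more than one copy of the package on the import path. The following is a rough sketch, not part of newspaper or lxml: `find_package_copies` is a hypothetical helper name, and it assumes the copies are ordinary package directories sitting directly under entries of `sys.path` (it does not look inside zip archives).

```python
import os
import sys


def find_package_copies(name, search_paths=None):
    """Return each directory named `name` found directly under the search paths.

    Hypothetical diagnostic helper: more than one result suggests that
    several installations of the package are visible to the interpreter.
    """
    if search_paths is None:
        search_paths = sys.path
    found = []
    seen = set()
    for entry in search_paths:
        # An empty sys.path entry means the current working directory.
        base = os.path.abspath(entry or os.getcwd())
        if base in seen:
            continue  # skip duplicate path entries
        seen.add(base)
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            found.append(candidate)
    return found
```

Calling `find_package_copies("lxml")` and getting back more than one path would be consistent with the diagnosis above. The first path listed is the one a plain `import lxml` would normally pick up.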
