Add article html extraction #11
|
Thanks for the pull request! It looks good, will play with the code over this weekend and most likely merge it on Monday. I think it makes sense to add an option Do you think the default should be |
|
I am not sure what the default should be because I am not sure what most On Sat, Jan 11, 2014 at 1:53 PM, Lucas Ou-Yang notifications@github.com wrote:
|
|
Oh wow your JSON API is pretty cool, I like the repository name :p (newspaper 0.0.5 is out with a lot of bug fixes but there are still tons of bugs unfortunately, so maybe consider updating your requirements.txt). The only tradeoff for turning the article html on would be when people are building multiple newspapers for different huge news sources which generate so many articles that it can be heavy on RAM for all of that extra article html. But that is a very good feature still, so I have to think about it. |
|
So, if that is the case, then I would think that defaulting to off is a good idea. It won't be hard for me to default it to on for my project. |
|
I just merged it, thanks for the pull request. I also committed some changes immediately afterwards. I moved all of your lxml imports into the parser.py module; not sure if it's a good design choice yet, but I want all the lxml objects to be kept in that module for bookkeeping & sanity. I also added a config option to turn your feature on/off (default off). It's a An example usage (which has been tested and works) is: Thanks once again for the pull! |
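For readers following along, here is a minimal sketch of the "off by default" pattern described above. It is not the code that was merged: `ExtractionConfig`, `ExtractedArticle`, `keep_html`, and `build_article` are hypothetical names made up for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionConfig:
    # Hypothetical flag. Off by default so that large crawls
    # do not keep every article's HTML in memory.
    keep_html: bool = False


@dataclass
class ExtractedArticle:
    text: str
    html: Optional[str] = None


def build_article(text, html, config=None):
    """Keep the article HTML only when the caller opted in."""
    config = config or ExtractionConfig()
    return ExtractedArticle(text=text, html=html if config.keep_html else None)
```

With the default config the `html` field stays `None`; a project that wants the markup passes `ExtractionConfig(keep_html=True)`.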
|
I'll add this feature (and your name in the contribs) in the next 0.0.6 release! |
|
Hi everyone. Quick question. When I try the current package on GitHub and run pip Traceback (most recent call last): Any known reason why? Haven't been keeping up too much with the thread, but Thanks, Benton On Sat, Jan 11, 2014 at 4:41 PM, Alex Kessinger notifications@github.com wrote:
|
|
You should try to pip install this library on a clean virtualenv. It's especially important for a library as big as newspaper. The error you are getting is very common if there are multiple lxml's or different lxml's on your python path. It is trying to reference an etree that probably is incorrect. |
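If you want to check for duplicate installs before rebuilding the virtualenv, something like the following rough sketch may help. It only looks for plain `lxml` directories on the search path (it will not see zipped eggs or other packaging layouts), and `find_package_copies` is a helper name invented here, not part of any library.

```python
import os
import sys


def find_package_copies(name, search_paths=None):
    """Return every directory called `name` found on the given search paths."""
    paths = sys.path if search_paths is None else search_paths
    found = []
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        candidate = os.path.join(entry or os.getcwd(), name)
        if os.path.isdir(candidate) and candidate not in found:
            found.append(candidate)
    return found


if __name__ == "__main__":
    for location in find_package_copies("lxml"):
        print(location)
```

More than one line of output suggests that several copies are visible to the interpreter, which would be consistent with the etree mix-up described above.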
I think it could be useful to get the HTML of the extracted article as well. This allows you to retain some of the semantic information in the html. Also it will help if you end up displaying the extracted article somehow.
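To illustrate the idea, here is a small sketch of returning both the plain text and the markup of an extracted node. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml and assumes well-formed input; `extract_text_and_html` and the `article_id` lookup are illustrative assumptions, not the code in this pull request.

```python
import xml.etree.ElementTree as ET


def extract_text_and_html(page_source, article_id):
    """Return (text, html) for the element whose id is `article_id`."""
    root = ET.fromstring(page_source)
    node = root.find(f".//*[@id='{article_id}']")
    if node is None:
        return None, None

    # Plain text: join all text nodes and collapse runs of whitespace.
    text = " ".join("".join(node.itertext()).split())

    # HTML: serialize the subtree, temporarily dropping the tail text
    # that follows the element so it is not included in the output.
    tail, node.tail = node.tail, None
    try:
        html = ET.tostring(node, encoding="unicode", method="html")
    finally:
        node.tail = tail
    return text, html
```

The text version loses tags such as `<em>` or `<a>`, while the HTML version keeps them, which is the semantic information mentioned above.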