
Multithread & gevent framework built into newspaper #4

Closed
codelucas opened this issue Dec 29, 2013 · 5 comments

@codelucas
Owner

I will add this feature tonight or tomorrow. Opening an issue for it because it is so important. Multithreading has always existed in newspaper but there hasn't been a public API for it.

Downloading multiple articles concurrently is super useful, and newspaper already has an effective setup for it.

@ghost ghost assigned codelucas Dec 29, 2013
@codelucas
Owner Author

Okay, I added a public API for multithreading article downloads (while also respecting news source domains).

Instead of going news source by source and spamming each one with X threads, we spread 1-2 threads across each desired news source and download all of their articles concurrently, so it's a win-win.

Check it out in the updated readme!

>>> import newspaper
>>> from newspaper import news_pool

>>> slate_paper = newspaper.build('http://slate.com')
>>> tc_paper = newspaper.build('http://techcrunch.com')
>>> espn_paper = newspaper.build('http://espn.com')

>>> papers = [slate_paper, tc_paper, espn_paper]
>>> news_pool.set(papers, threads_per_source=2) # (3*2) = 6 threads total
>>> news_pool.join()

At this point, you can safely assume that download() has been
called on every single article for all 3 sources.

>>> print slate_paper.articles[10].html
u'<html> ...' 

This is still a very rough implementation; I'm going to need a few more commits to clean it up fully. Ideally, users should be able to customize how many threads they want to allocate per news source.
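
For anyone curious how the "couple of threads per source" allocation could look in principle, here is a minimal sketch using only the standard library's concurrent.futures. This is not newspaper's actual internals; the download_all helper is hypothetical and simply assumes each source exposes an articles list whose items have a download() method, like newspaper's Source objects do.

# Hypothetical sketch of the "N threads per source" idea, using only the
# standard library. Not newspaper's internal implementation.
from concurrent.futures import ThreadPoolExecutor

def download_all(sources, threads_per_source=2):
    # Give each source its own small thread pool so no single domain gets
    # hammered, while all sources still download concurrently.
    executors = [ThreadPoolExecutor(max_workers=threads_per_source)
                 for _ in sources]
    futures = []
    for source, executor in zip(sources, executors):
        for article in source.articles:
            # article.download() fetches that article's HTML over the network.
            futures.append(executor.submit(article.download))
    # Block until every download has finished, similar in spirit to
    # news_pool.join().
    for future in futures:
        future.result()
    for executor in executors:
        executor.shutdown()

With the newspaper objects from the example above, that would be download_all([slate_paper, tc_paper, espn_paper], threads_per_source=2).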

I'm also aware that you can route requests through Privoxy to avoid rate limiting. Not sure if we need to build that in.
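
If anyone wants to experiment with the Privoxy route in the meantime, the usual pattern is just to point HTTP traffic at the local Privoxy listener (127.0.0.1:8118 by default). A rough illustration with plain requests follows; it is not a newspaper API, only the generic proxy setup.

# Illustration only: fetch a page through a locally running Privoxy
# instance (its default listener is 127.0.0.1:8118). Plain requests,
# not a newspaper API.
import requests

proxies = {
    "http": "http://127.0.0.1:8118",
    "https": "http://127.0.0.1:8118",
}

resp = requests.get("http://slate.com", proxies=proxies, timeout=10)
print(resp.status_code)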

@dangayle

dangayle commented Jan 3, 2014

Very cool.

@codelucas
Owner Author

Thanks, man! Feel free to open issues or send pull requests :D Hopefully this project stays active.

(P.S. You're from Spokane? Good to see another Washingtonian here, lol. I'm from Issaquah.)

@dangayle

dangayle commented Jan 4, 2014

Not only do I live in Washington, I work at the Spokesman-Review newspaper in Spokane :) We're a Django/Python shop, and I'm always looking for cool new toys.

@codelucas
Owner Author

That's pretty cool, man. I'm a huge Django fanatic too! Go Seahawks :D

hartym added a commit to hartym/newspaper that referenced this issue Jan 3, 2017
updating with better extractors for mismatched languages (i.e. french…