Fixes OutOfMemory error for large sites #31

Open
wants to merge 4 commits into
from

2 participants

pokey909 commented Sep 4, 2011

  • Added support for external queues via the :large_scale_crawl option (requires read/write permission for the working directory).
  • Improved thread handling: all threads now properly start working on the crawl.

pokey909 added some commits Aug 31, 2011

Temporary fix for OutOfMemory error.
Occurs when crawling large sites.

Issue: link_queue grows faster than threads consume links.

Fix: Wait until the threads have consumed enough links, then continue adding more to the queue.
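
The "wait until the threads have consumed enough links" idea is the classic bounded-queue (backpressure) pattern. Here is a minimal sketch of it using Ruby's standard SizedQueue. This is not the code from this commit; MAX_PENDING, the worker count, and the example URLs are made-up values for illustration.

```ruby
require 'thread'

# Hypothetical limit on how many links may wait in the queue at once.
MAX_PENDING = 100

link_queue = SizedQueue.new(MAX_PENDING) # push blocks while the queue is full
processed  = Queue.new                   # thread-safe collector for results

# Consumers: each worker pops links until it receives the nil sentinel.
workers = 4.times.map do
  Thread.new do
    while (link = link_queue.pop)
      processed << link # a real crawler would fetch the page here
    end
  end
end

# Producer: whenever MAX_PENDING links are already waiting, << blocks,
# so the queue cannot outgrow the consumers.
1_000.times { |i| link_queue << "http://example.com/page/#{i}" }

# One sentinel per worker tells it to stop.
workers.size.times { link_queue << nil }
workers.each(&:join)

puts processed.size
```

The producer never holds more than MAX_PENDING links in memory, at the cost of stalling link discovery while the workers catch up.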
Fixed issues:
- OutOfMemory caused by large link/page queues. Added a thread-safe ExtQueue class which swaps to disk when too much memory is consumed.
- Improved threading. Most worker threads sat idle when launched simultaneously.

Signed-off-by: Alexander Lenhardt <alenhard@techfak.uni-bielefeld.de>
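
For readers who want to see what "swaps to disk" can look like, here is a rough sketch of a thread-safe FIFO queue that keeps a bounded number of items in memory and spills the rest to a temp file. It is not the ExtQueue from this pull request; the class name SpillQueue, its methods, and the Marshal-based file format are all assumptions made for illustration.

```ruby
require 'tempfile'
require 'thread'

# Hypothetical sketch: at most max_in_memory items stay in RAM,
# any overflow is serialized to a temp file and read back in FIFO order.
class SpillQueue
  def initialize(max_in_memory)
    @max      = max_in_memory
    @memory   = []
    @file     = Tempfile.new('spill_queue')
    @file.binmode
    @read_pos = 0 # file offset of the next unread record
    @on_disk  = 0 # number of records still waiting on disk
    @mutex    = Mutex.new
  end

  def push(item)
    @mutex.synchronize do
      # Once anything is on disk, new items must follow it to keep FIFO order.
      if @on_disk.zero? && @memory.size < @max
        @memory << item
      else
        @file.seek(0, IO::SEEK_END)
        Marshal.dump(item, @file)
        @file.flush
        @on_disk += 1
      end
    end
    self
  end
  alias << push

  # Non-blocking: returns nil when the queue is empty.
  def pop
    @mutex.synchronize do
      item = @memory.shift
      refill if @memory.empty? && @on_disk > 0
      item
    end
  end

  def size
    @mutex.synchronize { @memory.size + @on_disk }
  end

  private

  # Load the next batch of records from disk into memory.
  def refill
    @file.seek(@read_pos)
    [@max, @on_disk].min.times do
      @memory << Marshal.load(@file)
      @on_disk -= 1
    end
    @read_pos = @file.pos
  end
end

queue = SpillQueue.new(3)
10.times { |i| queue << "link-#{i}" }

drained = []
while (item = queue.pop)
  drained << item
end
puts drained.inspect
```

In this sketch the temp file only grows and is never compacted, and nil cannot be stored as an item because pop uses it to signal an empty queue. A production queue would need to handle both.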
Added some documentation and code cleanup
External queue storage can be activated via the new option :large_scale_crawl

Signed-off-by: Alexander Lenhardt <alenhard@techfak.uni-bielefeld.de>

Do you have a 0.7.1 version of this pull request?

Using your new option raised an error in my project:

"deadlock detected
/usr/local/rvm/gems/ruby-1.9.2-p136/gems/anemone-0.7.1/lib/anemone/ext_queue.rb:77:in `sleep'"

Have you ever encountered this kind of issue?

Owner

pokey909 replied Apr 11, 2012

Sorry for my already-deleted comment. I thought this was the old version, which tried to slow down the producer/consumer with locks. I haven't used the crawler for quite a while since I'm busy with my new job right now, but I'll try to have a look.

Owner

pokey909 replied Apr 11, 2012

But afaik there is a proper external queue implemented in chris' most recent branch, right? If so, there is no reason to use my crappy one.

Did not see it!?
