Changes to reduce RAM usage #51

Open · wants to merge 5 commits · 3 participants

@wordtracker

Hi,

We found that with large, responsive sites (e.g. Wikipedia), the page_queue could grow quickly and continuously until we ran out of RAM.

I believe this is because the thread that processes the crawled pages doesn't get adequate time to run when crawling a responsive site -- little idle time is spent waiting for HTTP responses. We could use the 'sleep' option, but that unnecessarily slows crawling on smaller or less responsive sites.

Changing the PageStore option has no effect on this, as the page_queue does not live there.
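The idea in the patch can be sketched with Ruby's core `SizedQueue`, whose `push` blocks once the capacity is reached, so producers are throttled to the consumer's pace instead of exhausting RAM. This is an illustrative sketch of the approach, not the patch's actual code; the cap of 100 mirrors the option's default mentioned in the commit below.

```ruby
# Sketch: a bounded page queue. Producers ("tentacles") block on push
# when the queue is full, so it can never outgrow the consumer.
QUEUE_CAP = 100 # hypothetical default, mirroring the patch's option

page_queue = SizedQueue.new(QUEUE_CAP)

consumer = Thread.new do
  # Pop until a nil sentinel arrives; each pop frees a slot,
  # unblocking any producer waiting on push.
  while (page = page_queue.pop)
    # per-page processing would happen here
  end
end

500.times { |i| page_queue.push("page-#{i}") } # blocks whenever full
page_queue.push(nil) # sentinel to stop the consumer
consumer.join
```

The trade-off, as the commit message below notes, is that the crawl itself slows down once the queue fills, since fetching threads wait on the processing thread.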

Secondly, we run multiple concurrent crawls, so we can't use any of the PageStore alternatives, which assume one crawl at a time. I therefore added an option not to retain the processed data, since it was using RAM for a feature we don't currently need (after_crawl).
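A minimal sketch of what a "discard page data" option could look like, assuming a simplified `Page` class (the real Anemone `Page` holds more state, and `discard_data!` is a hypothetical name): once the user's callbacks have seen a page, its body is dropped so retained pages cost almost nothing.

```ruby
# Simplified stand-in for Anemone's Page; discard_data! is hypothetical.
class Page
  attr_accessor :url, :body

  def initialize(url, body)
    @url  = url
    @body = body
  end

  # Free the fetched HTML once callbacks have run, keeping only
  # the URL and any lightweight metadata.
  def discard_data!
    @body = nil
  end
end

page = Page.new("http://example.com/", "<html>...</html>")
# ... user's on_every_page callbacks would run here ...
page.discard_data!
```

With the data discarded, features that replay full page contents after the crawl (such as after_crawl) would no longer work, which is why this needs to be opt-in.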

I'm happy to split the changes up if that would make them more acceptable. I realise the way I implemented the second change is not ideal.

Thanks,
Jamie

and others added some commits Jan 19, 2012
@chriskite Merge branch 'next' 4b378d5
@chriskite Merge branch 'next' 531d771
Jamie Cobbett page_queue is constrained by a size supplied in options, default 100.
This alleviates a problem we experienced with very responsive sites (wikipedia)
and a moderate per-page processing time. The page_queue would grow much faster
than it could be drained, using more and more RAM.

This change means that the queue grows until full, at which point the
Tentacles will block until the queue shrinks.

This does have the impact of slowing the crawl in some cases.
487601b
Jamie Cobbett Add option 'discard_page_data' to allow user to further decrease storage use.
2166b99
Jamie Cobbett Revert "Add option 'discard_page_data' to allow user to further decrease storage use."

This reverts commit 2166b99.
8b51d29
@leehambley

Any movement here?

I haven't fully diagnosed it yet, but on a site with 28,000,000 pages indexed by Google, I'm expecting a memory problem: I've watched a 20-thread process grow from nominal memory usage at the start of the run to more than 1 GB in 1.5 hours (having crawled 626,419 pages, according to `echo 'KEYS anemone:pages:*' | redis-cli | wc -l`).

I'm thinking about trying out this patch to rein in the memory usage, as I also don't need the after_crawl feature. I would have expected Anemone to store that page list in Redis and retrieve it after the crawl via the backend store, rather than persisting it in memory.
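The idea above can be sketched as iterating page keys straight out of Redis after the crawl instead of holding them in memory. This is a hedged sketch, not Anemone code: `each_crawled_page` and the key pattern are assumptions, and it uses the redis-rb gem's cursor-based `scan` rather than `KEYS` so large crawls don't block the server.

```ruby
# require 'redis'   # redis-rb gem; not needed for the sketch itself

# Iterate page keys without loading the whole set into memory.
# `redis` is any object responding to scan(cursor, match:, count:)
# and returning [next_cursor, keys], as redis-rb does.
def each_crawled_page(redis, pattern = 'anemone:pages:*')
  cursor = '0'
  loop do
    cursor, keys = redis.scan(cursor, match: pattern, count: 1000)
    keys.each { |key| yield key }
    break if cursor == '0' # SCAN signals completion with cursor "0"
  end
end

# Usage against a live server (assumption: default connection):
#   each_crawled_page(Redis.new) { |key| puts key }
```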
