add the ability to stop the crawl #61


@gnapse
gnapse commented Jul 24, 2012

I needed to stop the crawl at will, so I added this new feature.

There's a new method that client code can call from within a page block; it makes the crawl stop after the current page is processed. The crawl is stopped gracefully, not by forcefully killing the tentacle threads, but by clearing the queues so that the natural ending process occurs.
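The mechanism can be illustrated with a minimal, self-contained sketch (the names `stop_requested` and the loop shape are mine for illustration, not the PR's actual code): clearing the queue means the loop's normal "queue empty" exit condition fires, rather than breaking out of the loop.

```ruby
# Illustration only, not Anemone's code: stopping works by clearing the
# work queue, so the crawl loop ends through its natural exit condition.
queue = %w[http://a http://b http://c http://d]
processed = []
stop_requested = false

until queue.empty?
  page = queue.shift
  processed << page
  # client code inside a page block may request the stop at any point
  stop_requested = true if processed.size == 2
  queue.clear if stop_requested   # crawl ends after the current page
end

processed  # => ["http://a", "http://b"]
```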

Hope this helps someone out there.

@sdball
sdball commented Jul 25, 2012

I needed the same functionality, this patch worked great.

@efrat-safanov

I needed this too, thanks!

@nickpearson

I've been needing this as well. It would be great if we could get this pulled into the official gem. Thanks for implementing this!

@gnapse
gnapse commented May 28, 2013

I'd love to get this into the official gem, but unfortunately there has been no response from the repo owner 😞

@nickpearson

Agreed -- official support for this would be great. By the way @gnapse, I added a method to your implementation that allows one to determine whether the crawl was stopped early, which is something my app needed. (Thanks again for handling the hard part!)

In case it helps, feel free to add this to your pull request: nickpearson@84bccf5
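For illustration, a stopped-early check along those lines might look like this (class and method names here are hypothetical, not taken from the linked commit):

```ruby
# Hypothetical sketch of a "was the crawl stopped early?" flag on the
# crawler core; names are illustrative, not from the linked commit.
class Core
  def initialize
    @stop_crawl = false
  end

  # called by client code from within a page block
  def stop_crawl
    @stop_crawl = true
  end

  # lets the caller distinguish an early stop from a natural finish
  def stopped_early?
    @stop_crawl
  end
end

core = Core.new
core.stop_crawl
core.stopped_early?  # => true
```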

@brutuscat
Contributor

@gnapse my fork accepts pull requests ;) The only difference from the official gem is that I implemented it using OpenURI instead of Net::HTTP. That solved several issues with encoding, broken pipe errors, etc.

Take a look at PR #34

@brutuscat brutuscat commented on the diff Dec 14, 2014
lib/anemone/core.rb
@@ -175,12 +188,17 @@ def run
@pages[page.url] = page
+ if @stop_crawl
+ page_queue.clear
@brutuscat
brutuscat Dec 14, 2014 Contributor

@gnapse why do you clear the queue? What if you want to restart the processing? Would you submit this PR to the fork called Medusa?

@gnapse
gnapse Dec 14, 2014

Wow, I did this a long time ago, and I no longer have enough context to answer this with certainty.

I recall that I clear the queue because of the way the stop_crawl method works: it empties the queue so that the crawl ends naturally, i.e. the loop processing the queue finds nothing else queued and exits on its own, instead of ending via a break out of the loop, if you know what I mean.
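That "ends naturally" idea can be sketched with worker threads in plain Ruby (an illustration under my own naming, not Anemone's actual tentacle code): a stop request clears the queue and enqueues one end marker per worker, so every thread finishes its loop on its own instead of being killed.

```ruby
# Illustration only: "tentacle"-style workers pop work from a shared
# queue and exit when they see an end marker. Stopping clears the queue
# and enqueues markers; no Thread#kill is ever needed.
WORKERS = 3
queue = Queue.new
10.times { |i| queue << "page-#{i}" }

request_stop = lambda do
  queue.clear                       # drop all pending work
  WORKERS.times { queue << :end }   # let each worker drain out naturally
end

processed = Queue.new
threads = Array.new(WORKERS) do
  Thread.new do
    loop do
      item = queue.pop
      break if item == :end         # natural exit, not a forced kill
      processed << item
      request_stop.call if item == "page-0"  # simulate a client stop request
    end
  end
end
threads.each(&:join)                # all workers return cleanly
```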

Anyway, it's good to know that there's a fork keeping Anemone alive. I liked that gem and it was sad to see it discontinued. I'll try to work on adapting the PR for the fork.
