add the ability to stop the crawl #61

Closed · wants to merge 1 commit
20 changes: 19 additions & 1 deletion lib/anemone/core.rb
@@ -79,6 +79,7 @@ def initialize(urls, opts = {})
@skip_link_patterns = []
@after_crawl_blocks = []
@opts = opts
@stop_crawl = false

yield self if block_given?
end
@@ -142,6 +143,18 @@ def focus_crawl(&block)
self
end

#
# Signals the crawler that it should stop the crawl before visiting the
# next page.
#
# This method is expected to be called within a page block, and it signals
# the crawler that it must stop after the current page is completely
# processed. All pages and links currently on the queue are discarded.
#
def stop_crawl
@stop_crawl = true
end
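The mechanism can be sketched in plain Ruby (a minimal sketch with illustrative names such as `MiniCrawler`, not the actual Anemone internals): setting the flag inside a page callback lets the current page finish, then empties the queue so the loop exits on its own.

```ruby
# Minimal sketch of the flag-based stop: a crawl-like loop drains a queue
# and, once the stop flag is set from inside a page callback, discards
# whatever is still queued.
class MiniCrawler
  attr_reader :visited

  def initialize
    @queue = Queue.new
    @stop_crawl = false
    @visited = []
  end

  def enqueue(*urls)
    urls.each { |u| @queue << u }
  end

  # Same contract as the PR: signal the crawler to stop after the
  # current page is completely processed.
  def stop_crawl
    @stop_crawl = true
  end

  def run
    until @queue.empty?
      page = @queue.pop
      @visited << page
      yield page, self if block_given?
      @queue.clear if @stop_crawl # discard remaining pages and links
    end
    @visited
  end
end

crawler = MiniCrawler.new
crawler.enqueue('/a', '/b', '/c')
crawler.run { |page, c| c.stop_crawl if page == '/b' }
# crawler.visited == ['/a', '/b']; '/c' was discarded
```

Note the loop ends because the queue is empty, not via `break` — which is the design point debated in the comments below.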

#
# Perform the crawl
#
@@ -175,12 +188,17 @@ def run

@pages[page.url] = page

if @stop_crawl
page_queue.clear
Contributor:

@gnapse why did you clear the queue? What if you want to restart the processing? Would you submit this PR in the fork called Medusa?

Author:

Wow, I did this a long time ago, and I'm not currently in context to answer this correctly.

I recall that I clear the queue because of the way stop_crawl works: it empties the queue so that the crawl ends naturally (the loop processing the queue finds nothing else queued, so it exits on its own rather than via an abrupt break).

Anyway, good to know that there's a fork keeping anemone alive. I liked that gem and it was sad to see it discontinued. I'll try to work on adapting the PR for the fork.

Ernesto (@gnapse)


link_queue.clear
end

# if we are done with the crawl, tell the threads to end
if link_queue.empty? and page_queue.empty?
until link_queue.num_waiting == @tentacles.size
Thread.pass
end
if page_queue.empty?
if page_queue.empty? || @stop_crawl
@tentacles.size.times { link_queue << :END }
break
end
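The surrounding run loop shuts its worker threads ("tentacles") down with `:END` sentinels once every worker is blocked waiting on the queue. The handshake can be sketched with the stdlib `Queue` and `Thread` (a minimal sketch with illustrative names, not the actual tentacle code):

```ruby
# Sketch of the shutdown handshake in Core#run: the main loop waits until
# every worker is blocked on the queue (num_waiting), then pushes one :END
# sentinel per worker so each exits its pop loop.
link_queue = Queue.new
results = Queue.new

tentacles = 3.times.map do
  Thread.new do
    loop do
      link = link_queue.pop
      break if link == :END  # sentinel: this worker is done
      results << link.upcase # stand-in for fetching and parsing the page
    end
  end
end

%w[a b c].each { |l| link_queue << l }

# Wait for the workers to drain the queue and block on pop again.
Thread.pass until link_queue.num_waiting == tentacles.size

tentacles.size.times { link_queue << :END } # one sentinel per worker
tentacles.each(&:join)
```

With the PR applied, the same sentinel path is taken when `@stop_crawl` is set: the queues are cleared, `page_queue.empty? || @stop_crawl` holds, and each tentacle receives its `:END`.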