respect greater of delay option or robots.txt crawl delay... #56

Open
wants to merge 1 commit

2 participants

@operand

... when obeying robots.txt

I'd like to obey webmasters' wishes when crawling their sites. If their robots.txt specifies a 10 second crawl delay, I've specified only 5, and I've asked Anemone to obey their robots.txt, then 10 should be the delay.

In order to make this change I also had to make threads default to 1 whenever 'obey_robots_txt' is set. I just didn't see any quick way to allow this without otherwise making significant changes. I figure if you want to obey robots.txt, you're probably okay with erring on the side of caution and running a single thread.

Maybe not acceptable for some. Your call. Hope this helps. And thanks for Anemone! It rocks!
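
For illustration, here's roughly how a crawl using this behavior might be configured (a minimal sketch; example.com is a placeholder, and :delay / :obey_robots_txt are existing Anemone crawl options):

    require 'anemone'

    # With this change, each request sleeps for the larger of the :delay option
    # (5 seconds here) and the site's robots.txt Crawl-delay, and the crawl runs
    # on a single thread because :obey_robots_txt forces :threads to 1.
    Anemone.crawl("http://example.com/", :delay => 5, :obey_robots_txt => true) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end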

@moezzie

Great initiative.

The single-thread approach feels a little awkward, but there don't seem to be a whole lot of other options.

Keep up the good work. :)

Commits on Jun 5, 2012
  1. respect greater of delay option or robots.txt crawl delay when obeying robots.txt
     Dan Rodriguez authored
Showing 2 changed files with 7 additions and 6 deletions.
  1. +2 −2 lib/anemone/core.rb
  2. +5 −4 lib/anemone/tentacle.rb
4 lib/anemone/core.rb
@@ -155,7 +155,7 @@ def run
      page_queue = Queue.new
      @opts[:threads].times do
-       @tentacles << Thread.new { Tentacle.new(link_queue, page_queue, @opts).run }
+       @tentacles << Thread.new { Tentacle.new(link_queue, page_queue, @robots, @opts).run }
      end
      @urls.each{ |url| link_queue.enq(url) }
@@ -196,7 +196,7 @@ def run
    def process_options
      @opts = DEFAULT_OPTS.merge @opts
-     @opts[:threads] = 1 if @opts[:delay] > 0
+     @opts[:threads] = 1 if @opts[:delay] > 0 || @opts[:obey_robots_txt]
      storage = Anemone::Storage::Base.new(@opts[:storage] || Anemone::Storage.Hash)
      @pages = PageStore.new(storage)
      @robots = Robotex.new(@opts[:user_agent]) if @opts[:obey_robots_txt]
9 lib/anemone/tentacle.rb
@@ -6,10 +6,11 @@ class Tentacle
    #
    # Create a new Tentacle
    #
-   def initialize(link_queue, page_queue, opts = {})
+   def initialize(link_queue, page_queue, robots, opts = {})
      @link_queue = link_queue
      @page_queue = page_queue
      @http = Anemone::HTTP.new(opts)
+     @robots = robots
      @opts = opts
    end
@@ -25,14 +26,14 @@ def run
        @http.fetch_pages(link, referer, depth).each { |page| @page_queue << page }
-       delay
+       delay(link)
      end
    end
    private
-   def delay
-     sleep @opts[:delay] if @opts[:delay] > 0
+   def delay(link)
+     sleep [(@robots && @robots.delay(link.to_s)) || 0, @opts[:delay], 0].max
    end
  end
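
For reference, here is the new sleep calculation in isolation (a sketch only, not part of the patch; it assumes that Robotex#delay returns a site's robots.txt Crawl-delay in seconds, or nil when none is declared):

    # Effective delay = max(robots.txt Crawl-delay, configured :delay), with a
    # missing Crawl-delay treated as 0; this is the same rule as the sleep call above.
    def effective_delay(robots, link, opts)
      robots_delay = (robots && robots.delay(link.to_s)) || 0
      [robots_delay, opts[:delay], 0].max
    end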