Anemone

Anemone is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.

See anemone.rubyforge.org for more information.
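
For reference, basic single-process usage of Anemone looks like this (a minimal sketch; the URL is just a placeholder):

require 'anemone'

# Crawl a site and print the URL of every page visited.
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end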

Distributed Fork

This is a fork of the official repo that provides a distributed (multiprocess) version of Anemone.

It requires an external queue, and at this time only supports RabbitMQ.

Starting a distributed crawl is a two-part process. First, you initialize the crawl, which sets up the queues and page store. Then you start up Anemone processes, telling them to use the queues and page store that were previously initialized.

require 'anemone'

# Initialize the crawl: :init => true creates the queues and the page store,
# and the :tag is appended to their names so the worker processes can find them.
# host and port should point at your RabbitMQ server.
anemone = Anemone.crawl(ARGV[0], :threads => 1, :init => true, :tag => "test123") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB(nil, "pages")
  anemone.queue :rabbit_mq, :host => host, :port => port
  anemone.on_every_page do |page|
    puts page.url
  end
end

That will set up two queues, page_queue-test123 and link_queue-test123, and create a pages-test123 collection in MongoDB.

Next, you launch multiple processes that run the exact same code block, but with :init => false instead.
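
For example, each worker process runs the same block with only the :init flag changed (host and port again being your RabbitMQ server):

# Worker: identical to the init process above, except :init => false.
Anemone.crawl(ARGV[0], :threads => 1, :init => false, :tag => "test123") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB(nil, "pages")
  anemone.queue :rabbit_mq, :host => host, :port => port
  anemone.on_every_page do |page|
    puts page.url
  end
end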

The crawl is finished when all processes exit. Each process will exit when the link_queue and page_queue have been empty for finish_timeout seconds (an Anemone option unique to this fork). There is a caveat to this approach: if your queues are consumed faster than they are produced, a process could be starved and exit prematurely. In practice, the queues will probably be filled much faster than they are consumed.
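
Where finish_timeout is supplied is not shown above; assuming it is passed like the other crawl options, a worker that exits once both queues have been idle for 60 seconds might be started like this:

# Assumption: :finish_timeout is supplied alongside the other crawl options.
Anemone.crawl(ARGV[0], :threads => 1, :init => false, :tag => "test123",
              :finish_timeout => 60) do |anemone|
  # ... same storage, queue, and page callbacks as above ...
end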

At this time, there is no facility for cleaning up the queues and page store once the crawl is finished. You should do this manually.
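
One way to do the cleanup manually, sketched with the bunny and mongo gems and assuming a local RabbitMQ and MongoDB (and that the page store lives in Anemone's default anemone database):

require 'bunny'
require 'mongo'

tag = "test123"

# Delete the two RabbitMQ queues created for this crawl.
conn = Bunny.new
conn.start
channel = conn.create_channel
channel.queue_delete("page_queue-#{tag}")
channel.queue_delete("link_queue-#{tag}")
conn.close

# Drop the MongoDB collection holding the crawled pages.
client = Mongo::Client.new("mongodb://localhost:27017/anemone")
client["pages-#{tag}"].drop
client.close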

Features

  • Multi-threaded design for high performance

  • Tracks 301 HTTP redirects

  • Built-in BFS algorithm for determining page depth

  • Allows exclusion of URLs based on regular expressions

  • Choose the links to follow on each page with focus_crawl()

  • HTTPS support

  • Records response time for each page

  • CLI program can list all pages in a domain, calculate page depths, and more

  • Obeys robots.txt

  • In-memory or persistent storage of pages during crawl, using TokyoCabinet, MongoDB, or Redis
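
Most of these features map to crawl options and block callbacks. A rough sketch of the standard single-process API (option and method names per the upstream gem; treat them as illustrative):

require 'anemone'

Anemone.crawl("http://www.example.com/",
              :threads         => 4,     # multi-threaded fetching
              :depth_limit     => 2,     # BFS page depth limit
              :obey_robots_txt => true) do |anemone|
  # Exclude URLs matching these regular expressions
  anemone.skip_links_like(/\.pdf$/i, /\/login/)

  # Choose which links to follow on each page
  anemone.focus_crawl do |page|
    page.links.select { |link| link.host == page.url.host }
  end

  # Persist pages in Redis instead of memory (MongoDB and TokyoCabinet
  # are set up the same way via Anemone::Storage)
  anemone.storage = Anemone::Storage.Redis

  anemone.on_every_page do |page|
    puts "#{page.code} #{page.url} (#{page.response_time} ms)"
  end
end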

Examples

See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.
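
In the same spirit, a small custom task using the standard callbacks might count the pages found at each depth:

require 'anemone'

# Crawl the URL given on the command line and report pages per depth.
Anemone.crawl(ARGV[0], :depth_limit => 3) do |anemone|
  anemone.after_crawl do |pages|
    counts = Hash.new(0)
    pages.each_value { |page| counts[page.depth] += 1 }
    counts.sort.each { |depth, count| puts "depth #{depth}: #{count} pages" }
  end
end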

Requirements

  • nokogiri

  • robots

Development

To test and develop this gem, you will also need:

  • rspec

  • fakeweb

  • tokyocabinet

  • mongo

  • redis

You will need Tokyo Cabinet, MongoDB, and Redis installed and running on your system.