
Klepto

A mean little DSL'd capybara (poltergeist) based web scraper that structures data into ActiveRecord or wherever(TM).

Features

  • CSS or XPath Syntax
  • Full JavaScript processing via PhantomJS / Poltergeist
  • All the fun of Capybara
  • Scrape multiple pages with a single bot
  • Pretty nifty DSL
  • Test coverage!

Installing

You need at least PhantomJS 1.8.1. There are no other external dependencies (you don't need Qt, or a running X server, etc.)

Mac

  • Homebrew: brew install phantomjs
  • MacPorts: sudo port install phantomjs
  • Manual install: download the binary from the PhantomJS site

Linux

  • Download the 32 bit or 64 bit binary.
  • Extract the tarball and copy bin/phantomjs into your PATH

Windows

Manual compilation

Do this as a last resort if the binaries don't work for you. It will take quite a long time as it has to build WebKit.

(See also the PhantomJS building guide.)

Then add klepto to your Gemfile.

gem 'klepto', '>= 0.2.5'

Usage (All your content are belong to us)

Say you want a bunch of Bieb tweets! How is there not profit in that?

# Fetch one web site or several. Bot.new takes a *splat of URLs!
@bot = Klepto::Bot.new("https://twitter.com/justinbieber"){
  # By default, it uses CSS selectors
  name      'h1.fullname'

  # If you love C# or you are over 40, XPath is an option!
  username "//span[contains(concat(' ',normalize-space(@class),' '),' screen-name ')]", :syntax => :xpath
  
  # By default Klepto calls #text on the matched node. You can pass an :attr to read instead,
  #   or a block that will receive the Capybara node or result set.
  tweet_ids 'li.stream-item', :match => :all, :attr => 'data-item-id'
  
  # Want to match all the nodes for the selector? Pass :match => :all
  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  # Nested structures? Let klepto know this is a resource
  last_tweet 'li.stream-item', :as => :resource do
    twitter_id do |node|
      node['data-item-id']
    end
    content '.content p'
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :attr => :href
  end      

  # Multiple nested structures? Let klepto know this is a collection of resources.
  # Does Bieber tweet too much? Maybe. Let's only get the new stuff kids crave.
  tweets    'li.stream-item', :as => :collection, :limit => 10 do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end     

  # Set some headers, why not.
  config.headers({
    'Referer'     => 'http://www.twitter.com'
  })  

  # on_http_status can take a splat of statuses or status patterns ('4xx', '5xx').
  #   You can also have multiple handlers on a status.
  #   Note: Capybara automatically follows redirects, so 3xx statuses
  #   are never present. If you want to watch for a redirect, pass :redirect (see below).
  config.on_http_status(:redirect){
    puts "Something redirected..."
  }
  config.on_http_status(200){
    puts "Expected this, NBD."
  }

  config.on_http_status('5xx','4xx'){
    puts "HOLY CRAP!"
  }

  config.after(:get) do |page|
    # This is fired after each HTTP GET. It receives a Capybara::Node
  end  

  # If you want to do something with each resource, like stick it in AR
  #   go for it here...
  config.after do |resource|
    @user = User.new
    @user.name = resource[:name]
    @user.username = resource[:username]
    @user.save

    resource[:tweets].each do |tweet|
      Tweet.create(tweet)
    end
  end #=> Profit!
}
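
The on_http_status handlers above accept exact codes (200), class patterns ('4xx', '5xx') and the :redirect symbol. As a rough illustration of how that kind of matching could work, here is a plain-Ruby sketch. It is not Klepto's implementation: the StatusHandlers class, its dispatch method and the redirected: flag are hypothetical names made up for this example.

```ruby
# Hypothetical sketch of status matching; NOT Klepto's internals.
class StatusHandlers
  def initialize
    @handlers = []
  end

  # Register a block for one or more patterns (Integer, 'Nxx' String, or :redirect).
  def on_http_status(*patterns, &block)
    @handlers << [patterns, block]
    self
  end

  # Call every handler with a matching pattern, in registration order,
  # and return the handlers' results.
  def dispatch(status, redirected: false)
    @handlers
      .select { |patterns, _| patterns.any? { |p| matches?(p, status, redirected) } }
      .map { |_, block| block.call(status) }
  end

  private

  def matches?(pattern, status, redirected)
    case pattern
    when Integer then pattern == status
    when :redirect then redirected
    when /\A(\d)xx\z/i then status / 100 == $1.to_i # '4xx' matches 400..499
    else false
    end
  end
end

handlers = StatusHandlers.new
handlers.on_http_status(200) { |s| "ok #{s}" }
handlers.on_http_status('4xx', '5xx') { |s| "failed with #{s}" }
handlers.on_http_status(:redirect) { |_s| "redirected" }

puts handlers.dispatch(404).inspect
```

Because several handlers can match the same response, dispatch returns one result per matching handler rather than stopping at the first.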

# You can get an array of hashes(resources), so if you wanted to do something else 
# you could do it here...
@bot.resources.each do |resource|
  pp resource
end
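
Since each resource is a plain hash, post-processing needs nothing Klepto-specific. The sketch below assumes the hash layout implied by the bot definition above (:name, :username and a :tweets collection). The sample values and the tweet_rows helper are made up for illustration.

```ruby
# Hypothetical sample data shaped like the resources the bot above would return.
resources = [
  {
    name: 'Example User',
    username: '@example',
    tweets: [
      { twitter_id: '1', tweet: 'first',  timestamp: '1360000000', permalink: '/example/status/1' },
      { twitter_id: '2', tweet: 'second', timestamp: '1360000500', permalink: '/example/status/2' }
    ]
  }
]

# Flatten every resource into one row per tweet, tagged with its author.
# Timestamps arrive as attribute strings, so convert them to integers here.
def tweet_rows(resources)
  resources.flat_map do |resource|
    Array(resource[:tweets]).map do |tweet|
      tweet.merge(username: resource[:username], timestamp: Integer(tweet[:timestamp]))
    end
  end
end

rows = tweet_rows(resources)
rows.each { |row| puts "#{row[:username]} #{row[:twitter_id]} #{row[:tweet]}" }
```

Rows in this shape could be handed to something like Tweet.create, as in the config.after block above.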

Got a string of HTML you don't need to crawl first?

@html = Capybara::Node::Simple.new(@html_string)
@structure = Klepto::Structure.build(@html){
  # inside the build method, everything works the same as Bot.new
  name      'h1.fullname'
  username  'span.screen-name'

  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  tweets    'li.stream-item', :as => :collection do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end       
}

Configuration Options

  • config.headers - Hash; sets request headers
  • config.url - String; sets the URL to structure
  • config.abort_on_failure - Boolean (default: true); whether structuring is aborted on a 4xx or 5xx response

Callbacks & Processing

  • before
    • :get (browser, url)
  • after
    • :structure (Hash) - receives the structure from the page
    • :get (browser, url) - called after each HTTP GET
    • :abort (browser, hash(details)) - called after a 4xx or 5xx if config.abort_on_failure is true (default)
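
To make the ordering of these hooks concrete, here is a small plain-Ruby sketch of a before/after callback registry. It only mimics the shape of the list above. The Hooks class and its run method are hypothetical, and treating a bare after as after(:structure) is an assumption based on the usage example, not a statement about how Klepto stores or invokes callbacks.

```ruby
# Hypothetical callback registry; NOT Klepto's internals.
class Hooks
  def initialize
    # Each [phase, event] pair maps to a list of blocks.
    @callbacks = Hash.new { |hash, key| hash[key] = [] }
  end

  def before(event, &block)
    @callbacks[[:before, event]] << block
  end

  # Assumption: a bare `after` means after(:structure).
  def after(event = :structure, &block)
    @callbacks[[:after, event]] << block
  end

  # Invoke every block registered for the phase/event pair, in order.
  def run(phase, event, *args)
    @callbacks[[phase, event]].each { |callback| callback.call(*args) }
  end
end

log = []
hooks = Hooks.new
hooks.before(:get) { |_browser, url| log << "fetching #{url}" }
hooks.after(:get)  { |_browser, url| log << "fetched #{url}" }
hooks.after        { |structure| log << "got #{structure.keys.join(',')}" }

browser = Object.new # stand-in for the real browser object
hooks.run(:before, :get, browser, 'http://example.com')
hooks.run(:after,  :get, browser, 'http://example.com')
hooks.run(:after,  :structure, { name: 'Example' })

puts log
```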

Stuff I'm going to add.

  • Ensure after(:each) works at the resource/collection level as well
  • Add after(:all)
  • :if and :unless for :as => (:collection|:resource) too. The context should be the captured node that the block is run against
  • Access to hash from within a block (for bulk assignment of other attributes) ?
  • config.allow_rescue_in_block #should exceptions in blocks be auto rescued with nil as the return value
  • :default should be able to take a proc

Async

-> https://github.com/igrigorik/em-synchrony

Cookie Stuffing

cookies({
  'Has Fun' => true
})  

Pre-req Steps

prepare [
  [:GET, 'http://example.com'],
  [:POST, 'http://example.com/login', {username: 'cory', password: '123456'}],
]

Page Assertions

assertions do
  #presence and value assertions...
end
on_assertion_failure{ |response, bot| }

Structure :if, :unless

unless: lambda{|node| node[:class].to_s.include?("newsflash")}