Klepto

A mean little DSL'd Capybara (Poltergeist) based web scraper that structures data into ActiveRecord or wherever(TM).

Features

CSS or XPath Syntax
Full JavaScript processing via PhantomJS / Poltergeist
All the fun of Capybara
Scrape multiple pages with a single bot
Pretty nifty DSL
Test coverage!

Installing

You need at least PhantomJS 1.8.1. There are no other external dependencies (you don't need Qt, or a running X server, etc.)

Mac

Homebrew: brew install phantomjs
MacPorts: sudo port install phantomjs
Manual install: Download this

Linux

Download the 32-bit or 64-bit binary.
Extract the tarball and copy bin/phantomjs into your PATH.

Windows

Download the precompiled binary for Windows

Manual compilation

Do this as a last resort if the binaries don't work for you. It will take quite a long time as it has to build WebKit.

Download the source tarball
Extract and cd in
./build.sh

(See also the PhantomJS building guide.)
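Once a binary is on your PATH, a quick check along these lines can confirm it meets the 1.8.1 minimum. This is only a sketch: the helper names (phantomjs_new_enough?, installed_phantomjs_version) are made up for illustration and are not part of klepto; it relies on `phantomjs --version` printing a plain version string.

```ruby
require 'rubygems'
require 'open3'

# Minimum PhantomJS version named in the install notes above.
MINIMUM_PHANTOMJS = Gem::Version.new('1.8.1')

# Hypothetical helper: true when version_string (e.g. "1.9.2") is at least
# the minimum. Malformed strings count as "not new enough" instead of raising.
def phantomjs_new_enough?(version_string, minimum = MINIMUM_PHANTOMJS)
  Gem::Version.new(version_string.to_s.strip) >= minimum
rescue ArgumentError
  false
end

# Hypothetical helper: asks the binary on PATH for its version.
# Returns nil when the binary is missing or exits with an error.
def installed_phantomjs_version
  output, status = Open3.capture2('phantomjs', '--version')
  status.success? ? output.strip : nil
rescue SystemCallError
  nil
end

version = installed_phantomjs_version
if version.nil?
  puts 'phantomjs was not found on PATH'
elsif phantomjs_new_enough?(version)
  puts "phantomjs #{version} looks fine"
else
  puts "phantomjs #{version} is older than #{MINIMUM_PHANTOMJS}"
end
```

Comparing with Gem::Version rather than plain strings avoids the usual trap where "1.10.0" sorts before "1.8.1".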

Then put klepto in your Gemfile.

gem 'klepto', '>= 0.2.5'

Usage (All your content are belong to us)

Say you want a bunch of Bieb tweets! How is there not profit in that?

# Fetch one website or several. Bot.new takes a *splat of URLs!
@bot = Klepto::Bot.new("https://twitter.com/justinbieber"){
  # By default, it uses CSS selectors
  name      'h1.fullname'

  # If you love C# or you are over 40, XPath is an option!
  username "//span[contains(concat(' ',normalize-space(@class),' '),' screen-name ')]", :syntax => :xpath
  
  # By default Klepto uses the #text method, you can pass an :attr to use instead...
  #   or a block that will receive the Capybara Node or Result set.
  tweet_ids 'li.stream-item', :match => :all, :attr => 'data-item-id'
  
  # Want to match all the nodes for the selector? Pass :match => :all
  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  # Nested structures? Let klepto know this is a resource
  last_tweet 'li.stream-item', :as => :resource do
    twitter_id do |node|
      node['data-item-id']
    end
    content '.content p'
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :attr => :href
  end      

  # Multiple nested structures? Let klepto know this is a collection of resources
  # Does Bieber tweet too much? Maybe. Let's only get the new stuff kids crave.
  tweets    'li.stream-item', :as => :collection, :limit => 10 do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end     

  # Set some headers, why not.
  config.headers({
    'Referer'     => 'http://www.twitter.com'
  })  

  # on_http_status can take a splat of statuses or status patterns ('4xx', '5xx').
  #   You can also have multiple handlers on a status.
  #   Note: Capybara automatically follows redirects, so 3xx statuses
  #   are never present. If you want to watch for a redirect, pass :redirect (see below).
  config.on_http_status(:redirect){
    puts "Something redirected..."
  }
  config.on_http_status(200){
    puts "Expected this, NBD."
  }

  config.on_http_status('5xx','4xx'){
    puts "HOLY CRAP!"
  }

  config.after(:get) do |page|
    # This is fired after each HTTP GET. It receives a Capybara::Node
  end  

  # If you want to do something with each resource, like stick it in AR
  #   go for it here...
  config.after do |resource|
    @user = User.new
    @user.name = resource[:name]
    @user.username = resource[:username]
    @user.save

    resource[:tweets].each do |tweet|
      Tweet.create(tweet)
    end
  end #=> Profit!
}

# You can also get an array of hashes (the resources), so if you want to do
# something else with them, you can do it here...
@bot.resources.each do |resource|
  pp resource
end
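The config.after block above hands each tweet hash straight to Tweet.create. If you would rather whitelist and convert fields first, something like the following could sit in between. It is a sketch under assumptions: that each tweet is a plain Hash with symbol keys named after the DSL fields, and that data-time holds epoch seconds as a string. tweet_attributes and TWEET_FIELDS are hypothetical names, and the sample values are placeholders, not real scraped data.

```ruby
# Fields declared inside the `tweets` collection block above.
TWEET_FIELDS = %i[twitter_id tweet timestamp permalink].freeze

# Hypothetical helper: keeps only the known fields and turns the
# timestamp (assumed to be epoch seconds) into a UTC Time.
def tweet_attributes(tweet)
  attrs = tweet.slice(*TWEET_FIELDS)
  if attrs[:timestamp]
    attrs[:timestamp] = Time.at(Integer(attrs[:timestamp].to_s, 10)).utc
  end
  attrs
end

# Placeholder data standing in for one entry of resource[:tweets].
sample = {
  twitter_id: '1',
  tweet: 'placeholder text',
  timestamp: '1356998400',
  permalink: '/example/status/1',
  extra: 'dropped by the whitelist'
}

p tweet_attributes(sample)
```

Inside config.after you would then call Tweet.create(tweet_attributes(tweet)) instead of Tweet.create(tweet).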

Got a string of HTML you don't need to crawl first?

@html = Capybara::Node::Simple.new(@html_string)
@structure = Klepto::Structure.build(@html){
  # inside the build method, everything works the same as Bot.new
  name      'h1.fullname'
  username  'span.screen-name'

  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  tweets    'li.stream-item', :as => :collection do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end       
}

Configuration Options

config.headers - Hash; sets the request headers
config.url - String; sets the URL to structure
config.abort_on_failure - Boolean (default: true); whether structuring is aborted on a 4xx or 5xx response

Callbacks & Processing

before
- :get (browser, url)
after
- :structure (Hash) - receives the structure from the page
- :get (browser, url) - called after each HTTP GET
- :abort (browser, hash(details)) - called after a 4xx or 5xx if config.abort_on_failure is true (default)
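To make the '4xx' / '5xx' notation concrete, here is one way such a pattern could be compared against a numeric status. This is an illustration only, not klepto's actual matching code; status_matches? and failure_status? are hypothetical names.

```ruby
# Hypothetical matcher: pattern may be an exact status (200 or '200') or a
# wildcard such as '4xx', where each 'x' stands for any single digit.
def status_matches?(pattern, status)
  pat  = pattern.to_s.downcase
  code = status.to_s
  return false unless pat.length == 3 && code.length == 3

  pat.chars.zip(code.chars).all? { |p, c| p == 'x' || p == c }
end

# Hypothetical helper mirroring the abort_on_failure description:
# a status counts as a failure when it is a 4xx or a 5xx.
def failure_status?(status)
  %w[4xx 5xx].any? { |pattern| status_matches?(pattern, status) }
end

[200, 404, 503].each do |status|
  puts "#{status}: failure? #{failure_status?(status)}"
end
```

The :redirect symbol is left out on purpose: as noted above, Capybara follows redirects, so a redirect is not visible as a 3xx status.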

Stuff I'm going to add.

Ensure after(:each) works at the resource/collection level as well
Add after(:all)
:if, :unless for as: (:collection|:resource) too. The context should be the captured node that the block is run against
Access to the hash from within a block (for bulk assignment of other attributes)?
config.allow_rescue_in_block # should exceptions in blocks be auto-rescued, with nil as the return value?
:default should be able to take a proc

Async

-> https://github.com/igrigorik/em-synchrony

Cookie Stuffing

cookies({
  'Has Fun' => true
})

Pre-req Steps

prepare [
  [:GET, 'http://example.com'],
  [:POST, 'http://example.com/login', {username: 'cory', password: '123456'}],
]

Page Assertions

assertions do
  #presence and value assertions...
end
on_assertion_failure{ |response, bot| }

Structure :if / :unless

unless: lambda { |node| node[:class].to_s.include?("newsflash") }

