GitHub - feldpost/scrapi: scrAPI is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.

ScrAPI toolkit for Ruby

A framework for writing scrapers using CSS selectors and simple select => extract => store processing rules.

Here’s an example that scrapes auctions from eBay:

ebay_auction = Scraper.define do
  process "h3.ens>a", :description=>:text,
                      :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"

  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions

  process "table.ebItemlist tr.single",
          :auctions => ebay_auction

  result :auctions
end

And using the scraper:

auctions = ebay.scrape(html)

# No. of auctions found
puts auctions.size

# First auction:
auction = auctions[0]
puts auction.description
puts auction.url
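
Each entry in auctions exposes the fields named in the result line as accessors. As a rough sketch of working with them, the snippet below uses a plain Struct as a stand-in for the scraped results, so it runs without scrAPI or a live page. The Struct, the sample values and the summarize and with_images helpers are all hypothetical and not part of scrAPI.

```ruby
# Stand-in for the objects returned by ebay.scrape(html). This Struct and the
# sample values further down are illustrative assumptions, not scraped data.
Auction = Struct.new(:description, :url, :price, :image)

# Hypothetical helper: one line of text per auction.
def summarize(auction)
  "#{auction.description} (#{auction.price}): #{auction.url}"
end

# Hypothetical helper: keep only auctions whose image field is set.
# In this stand-in, a field that was never filled in is nil.
def with_images(auctions)
  auctions.select { |auction| auction.image }
end

auctions = [
  Auction.new("Vintage camera", "http://example.com/item/1", "$25.00",
              "http://example.com/1.jpg"),
  Auction.new("Film rolls", "http://example.com/item/2", "$7.50", nil)
]

auctions.each { |auction| puts summarize(auction) }
puts "#{with_images(auctions).size} of #{auctions.size} auctions have an image"
```

With a real scraper, the array returned by ebay.scrape(html) would take the place of the hand-built auctions array.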

To get the latest source code with regular updates:

svn co labnotes.org/svn/public/ruby/scrapi

Version of Ruby

ScrAPI currently does not run on Ruby 1.9.2, but it does run on the development versions of Ruby 1.9.3. The cause is a bug in Ruby’s visibility context handling (see changelog #29578 and bug #3406 on the official Ruby page). RVM (rvm.beginrescueend.com/) makes it easy to install the most recent development version of Ruby.

Using TIDY

By default scrAPI uses Tidy (actually Tidy-FFI) to clean up the HTML.

You need to install the Tidy Gem for Ruby:

gem install tidy_ffi

And the Tidy binary libraries, available here:

http://tidy.sourceforge.net/

By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy, so that is one place to put the Tidy library.

Alternatively, just point Tidy to the library with:

TidyFFI.library_path = "...."

On Linux this would probably be:

TidyFFI.library_path = "/usr/local/lib/libtidy.so"

On OS X this would probably be:

TidyFFI.library_path = "/usr/lib/libtidy.dylib"
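
The platform-specific paths above could also be chosen at run time. This is a minimal sketch, assuming the library sits at one of the two locations suggested in this section. tidy_library_path is a hypothetical helper, not part of scrAPI or Tidy-FFI, and the TidyFFI assignment is left commented out so the snippet runs without the gem installed.

```ruby
require "rbconfig"

# Hypothetical helper: pick a Tidy library path for the given host OS string.
# Only the Linux and OS X paths mentioned above are covered; any other
# platform returns nil and needs a manually configured path.
def tidy_library_path(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /darwin/i then "/usr/lib/libtidy.dylib"
  when /linux/i  then "/usr/local/lib/libtidy.so"
  end
end

path = tidy_library_path
if path && File.exist?(path)
  # TidyFFI.library_path = path   # uncomment once tidy_ffi has been required
  puts "Using Tidy library at #{path}"
else
  puts "Tidy library not found; set TidyFFI.library_path manually"
end
```

Checking File.exist? before assigning the path keeps a wrong guess from surfacing later as a harder-to-read load error.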

For testing purposes, you can also use the built-in HTML parser. It’s useful for testing and getting to grips with scrAPI, but it doesn’t deal well with broken HTML. So for testing only:

Scraper::Base.parser :html_parser

License

Developed for co.mments.com

Code and documentation: labnotes.org

HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license. www.jin.gr.jp/~nahi/Ruby/html-parser/README.html

Porting to Ruby 1.9.x by Christoph Lupprich, lupprich.info
