== ScrAPI toolkit for Ruby

A framework for writing scrapers using CSS selectors and simple
select => extract => store processing rules.

Here’s an example that scrapes auctions from eBay:

  ebay_auction = Scraper.define do
    process "h3.ens>a", :description=>:text,
                        :url=>"@href"
    process "td.ebcPr>span", :price=>:text
    process "div.ebPicture >a>img", :image=>"@src"

    result :description, :url, :price, :image
  end

  ebay = Scraper.define do
    array :auctions

    process "table.ebItemlist tr.single",
            :auctions => ebay_auction

    result :auctions
  end

And using the scraper:

  auctions = ebay.scrape(html)

  # No. of auctions found
  puts auctions.size

  # First auction:
  auction = auctions[0]
  puts auction.description
  puts auction.url


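Because the scraped auctions are indexed and expose reader methods, the usual Enumerable calls should apply to them. A minimal sketch, assuming the result is array-like as the calls to size and [] suggest; the Auction Struct and its sample values are hypothetical stand-ins so the snippet runs without scrAPI or a live eBay page:

```ruby
# Hypothetical stand-in for the objects returned by ebay.scrape(html).
Auction = Struct.new(:description, :url, :price, :image)

auctions = [
  Auction.new("Vintage lamp", "http://example.com/1", "$10.00", "http://example.com/1.jpg"),
  Auction.new("Desk clock", "http://example.com/2", "$25.50", nil)
]

# Build one summary line per auction.
summaries = auctions.map do |auction|
  "#{auction.description} (#{auction.price}): #{auction.url}"
end
puts summaries

# Keep only the auctions for which an image URL was extracted.
with_image = auctions.reject { |auction| auction.image.nil? }
puts with_image.size
```

With a real scraper, replace the hand-built array with the value returned by ebay.scrape(html).
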
To get the latest source code with regular updates:

  svn co http://labnotes.org/svn/public/ruby/scrapi

== Version of Ruby

ScrAPI 1.2.x was tested with Ruby 1.8.6 and 1.8.7, but will not work on Ruby 1.9.x.

ScrAPI 2.0.x switches to TidyFFI to run on Ruby 1.9.2 and newer.

Due to a bug in Ruby's visibility context handling (see changelog #29578 and bug
#3406 on the official Ruby page), you need to declare all result attributes
explicitly, using the result method or attr_reader/attr_accessor.

== Using TIDY

By default scrAPI uses Tidy (actually Tidy-FFI) to clean up the HTML.

You need to install the Tidy gem for Ruby:

  gem install tidy_ffi

And the Tidy binary libraries, available here:

  http://tidy.sourceforge.net/

By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy, so that is one place to put the Tidy library.

Alternatively, just point Tidy to the library with:

  TidyFFI.library_path = "...."

On Linux this would probably be:

  TidyFFI.library_path = "/usr/local/lib/libtidy.so"

On OS X this would probably be:

  TidyFFI.library_path = "/usr/lib/libtidy.dylib"

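If the same code has to run on both platforms, the path can be chosen at load time. A minimal sketch, assuming the two locations above are where the library actually lives on your machines; tidy_library_path_for is a hypothetical helper, not part of scrAPI or Tidy-FFI:

```ruby
require "rbconfig"

# Hypothetical helper: maps a host OS string to the library locations
# suggested above. Returns nil when no default is known for that OS.
def tidy_library_path_for(host_os)
  case host_os
  when /linux/i  then "/usr/local/lib/libtidy.so"
  when /darwin/i then "/usr/lib/libtidy.dylib"
  end
end

path = tidy_library_path_for(RbConfig::CONFIG["host_os"])

# Only touch TidyFFI when it is loaded and the candidate file exists.
if defined?(TidyFFI) && path && File.exist?(path)
  TidyFFI.library_path = path
end
```

Leaving the setting untouched when no file is found keeps the default lookup in lib/tidy in effect.
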
You can also use the built-in HTML parser. It's useful for testing and for getting to grips with scrAPI, but it doesn't deal well with broken HTML. So for testing only:

  Scraper::Base.parser :html_parser


== License

Copyright (c) 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License

Developed for http://co.mments.com

Code and documentation: http://labnotes.org

HTML cleanup and good hygiene by Tidy, Copyright (c) 1998-2003 World Wide Web Consortium.
License at http://tidy.sourceforge.net/license.html

HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Under MIT license.

HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license.
http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html

Porting to Ruby 1.9.x by Christoph Lupprich, http://lupprich.info