Wombat is a simple Ruby DSL for scraping webpages, built on top of the cool Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching the page and parsing it into your own data structure, which can be a decent amount of work depending on what you need.
With Wombat, you can simply call Wombat.crawl to get started. You can also define a class that includes the Wombat::Crawler module and declares which information to retrieve and where to find it.
Basically you can name stuff however you want and specify how you want to find that information in the page. For example:
class MyScraper
  include Wombat::Crawler

  base_url "http://domain_to_be_scraped.com"
  path "/path-to/page-with/the-data"

  some_data css: "div.elemClass .anchor"
  another_info xpath: "//my/xpath[@style='selector']"
end
Wombat.crawl do
  base_url "http://domain_to_be_scraped.com"
  path "/path-to/page-with/the-data"

  some_data css: "div.elemClass .anchor"
  another_info xpath: "//my/xpath[@style='selector']"
end
Both examples above are equivalent and produce the same results. You can either use the class-based approach (the first one) or simply call Wombat.crawl (the second one). Behind the scenes, Wombat.crawl is just a shortcut so you don't need to create a class if it doesn't make sense for you.
Let's see what is going on here. First you create your class and include the Wombat::Crawler module. Simple. Then you start naming the properties you want to retrieve, in a free format. The only restrictions are those that apply to naming a Ruby method, because each property is in fact a Ruby method call :)
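To illustrate why property names follow Ruby method-naming rules, here is a minimal sketch of how a DSL can accept free-form names through method_missing. TinyDsl and its describe method are hypothetical names made up for this illustration; this is not Wombat's actual implementation, only one common way to build this kind of DSL.

```ruby
# Hypothetical stand-in for a crawler DSL; NOT Wombat's real code.
class TinyDsl
  attr_reader :properties

  def initialize
    @properties = {}
  end

  # Any unknown method called with arguments is treated as a property
  # declaration: the method name becomes the key, the first argument
  # (the selector hash) becomes the value.
  def method_missing(name, *args)
    return super if args.empty?

    @properties[name] = args.first
  end

  # Evaluates the block with a fresh TinyDsl as self, so bare names such
  # as `some_data` are dispatched as method calls on that object.
  def self.describe(&block)
    dsl = new
    dsl.instance_eval(&block)
    dsl.properties
  end
end

DECLARED = TinyDsl.describe do
  some_data css: "div.elemClass .anchor"
  another_info xpath: "//my/xpath[@style='selector']"
end
```

Because each declaration is dispatched as a method call, a name that Ruby cannot parse as a method name (one starting with a digit, for example) cannot be used as a property name in a DSL built this way.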
The lines that say base_url and path are reserved words that tell Wombat where to get the information from. The base_url field should include only the scheme (http), host and port. The path field should include only the path portion of the URL to be scraped, always starting with a forward slash.
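If you start from a full URL, the split described above can be sketched with Ruby's standard URI library. The helper name split_url and the example.com URL are assumptions made for this illustration; they are not part of Wombat.

```ruby
require "uri"

# Splits a full URL into a base (scheme, host, port) and a path that
# always starts with a forward slash. Hypothetical helper, not Wombat API.
def split_url(full_url)
  uri = URI.parse(full_url)
  base = "#{uri.scheme}://#{uri.host}"
  # Spell out the port only when it differs from the scheme's default.
  base += ":#{uri.port}" unless uri.port == uri.default_port
  path = uri.path.empty? ? "/" : uri.path
  [base, path]
end

EXAMPLE_BASE, EXAMPLE_PATH =
  split_url("http://example.com:8080/path-to/page-with/the-data")
```

Here EXAMPLE_BASE would be the value to pass to base_url and EXAMPLE_PATH the value to pass to path.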
To run this scraper, just instantiate it and call the crawl method:
my_cool_scraper = MyScraper.new
my_cool_scraper.crawl
This will request and parse the page according to the filters you specified and return a hash with that same structure. We'll get into that detail later. Another important detail is that selectors can be specified as CSS or XPath. As the examples above show, a CSS selector is passed with the css: key and an XPath selector with the xpath: key.
By default, properties will return the text of the first matching element for the provided selector.
Again, very important: Each property will return only the first matching element for that selector.
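The difference between "first match" and "all matches" can be seen in a small standalone sketch. It uses REXML, which ships with Ruby, rather than the Nokogiri parser that Wombat builds on, and the markup and constant names are made up for the illustration. It only demonstrates the general idea, not how Wombat evaluates selectors internally.

```ruby
require "rexml/document"

# Made-up markup with three elements matching the same selector.
SNIPPET = <<~XML
  <ul>
    <li class="item">first</li>
    <li class="item">second</li>
    <li class="item">third</li>
  </ul>
XML

DOC = REXML::Document.new(SNIPPET)
SELECTOR = "//li[@class='item']"

# Text of the first matching element only (the default behaviour
# described above for a property).
FIRST_ONLY = REXML::XPath.first(DOC, SELECTOR).text

# Text of every matching element, collected into an array.
ALL_MATCHES = REXML::XPath.match(DOC, SELECTOR).map(&:text)
```

With three matching elements on the page, a first-match lookup silently ignores the second and third, which is why the distinction matters when a selector is broader than expected.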
If you want to retrieve all the matching elements, use the option
As of version 2.2, you can also supply a pre-fetched Mechanize page in the following format:
Wombat.crawl do
  m = Mechanize.new
  mp = m.get 'http://www.google.com'
  page mp
  # ...
end
If you use the method above, the pre-fetched page takes precedence over any base_url and path attributes that are also specified in the crawler.