Parsley is a simple language for extracting structured data from web pages. It consists of a powerful Selector Language wrapped in a JSON Structure that can represent page-wide formatting.
Check out A Simple Tutorial to extract a CSV file of beers by brewery and rating in only a few lines of code.
Parsley has a Command-line Interface, Ruby Bindings, Python Bindings, and a C Interface, and can output to JSON, CSV, and XML.
The following parselet parses a Yelp business listing (no endorsement implied).
"user_name": ".user-name a",
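Only one key/selector pair of the parselet survives in the listing above. As a rough sketch of the idea (a JSON object whose keys name output fields and whose values are selectors), the Python below writes that single pair to a file. The file name `user_names.let` and the helper `write_parselet` are hypothetical, and it assumes a flat key-to-selector object is an acceptable parselet; the full Yelp parselet has more fields that are not reproduced here.

```python
import json

# The single key/selector pair visible in the listing above.
# Key: name of the output field. Value: selector that locates it.
parselet = {"user_name": ".user-name a"}


def write_parselet(mapping, path):
    """Serialize a key-to-selector mapping as JSON (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2)
        handle.write("\n")


# Example: write_parselet(parselet, "user_names.let")
```

Generating the file from a dictionary keeps the JSON well formed, which is easy to get wrong when editing selectors by hand.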
You can get JSON out by typing:
sh$: parsley businesses.let http://www.yelp.com/biz/amnesia-san-francisco
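If you want to consume that JSON from a script, one option is to shell out to the command above and decode its output. This is a minimal sketch, not the Python Bindings: it assumes the `parsley` executable is on your `PATH` and prints a single JSON document to standard output. The names `build_command`, `run_parselet`, and the `runner` parameter are made up for illustration.

```python
import json
import subprocess


def build_command(parselet_path, url):
    """Return the argument list for the CLI call shown above."""
    return ["parsley", parselet_path, url]


def run_parselet(parselet_path, url, runner=subprocess.run):
    """Run the parsley CLI and decode its stdout as JSON.

    Assumption: parsley writes one JSON document to stdout.
    `runner` is a hypothetical hook so the call can be stubbed in tests.
    """
    result = runner(
        build_command(parselet_path, url),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)


# Example (requires parsley to be installed):
# data = run_parselet("businesses.let",
#                     "http://www.yelp.com/biz/amnesia-san-francisco")
# print(json.dumps(data, indent=2))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.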
To run a site-wide crawl that writes a businesses.csv and a reviews.csv (with a foreign key to businesses), run:
sh$: csvget --parselet=businesses.let http://www.yelp.com/biz/amnesia-san-francisco
It’s that easy. Get started with Installation Instructions.
Sites are for example purposes. Please obey robots.txt.
Last edited by eugeneotto, July 20, 2013