eugeneotto edited this page Jul 20, 2013 · 11 revisions

Parsley is a simple language for extracting structured data from web pages. It consists of a powerful Selector Language wrapped in a JSON Structure that describes the layout of an entire page.

Check out A Simple Tutorial to extract a CSV file of beers by brewery and rating in only a few lines of code.

Parsley has a Command-line Interface, Ruby Bindings, Python Bindings, and a C Interface, and can output to JSON, CSV, and XML.

The following parselet parses a Yelp business listing (no endorsement implied).

{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.review)": [
    {
      "date": ".date",
      "user_name": ".user-name a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
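Because a parselet is plain JSON, it can also be generated from a script rather than written by hand. The following is a minimal sketch using only the Python standard library; it does not use the Parsley bindings, and the helper name `dump_parselet` is hypothetical. It serializes the parselet above so it can be saved to a file such as businesses.let.

```python
import json

# The Yelp parselet from above, expressed as a Python dict.
# Keys are output field names; values are selectors.
parselet = {
    "name": "h1",
    "phone": "#bizPhone",
    "address": "address",
    "reviews(.review)": [
        {
            "date": ".date",
            "user_name": ".user-name a",
            "comment": "with-newlines(.review_comment)",
        }
    ],
}


def dump_parselet(spec):
    """Return the parselet as pretty-printed JSON text (hypothetical helper)."""
    return json.dumps(spec, indent=2) + "\n"


if __name__ == "__main__":
    # Write the parselet to disk so the command-line tools can read it.
    with open("businesses.let", "w", encoding="utf-8") as fh:
        fh.write(dump_parselet(parselet))
```

Running the script writes businesses.let to the current directory; the file name is only a convention taken from the examples on this page.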

You can get JSON out by typing:

sh$: parsley businesses.let http://www.yelp.com/biz/amnesia-san-francisco

To run a site-wide crawl that writes a businesses.csv and a reviews.csv (with a foreign key to businesses), run:

sh$: csvget --parselet=businesses.let http://www.yelp.com/biz/amnesia-san-francisco
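To make the "foreign key" idea concrete, here is a rough sketch of how nested results could be split into two tables. This is not csvget's implementation. It assumes the parsed output mirrors the parselet with a `reviews` list under each business, and the `id` and `business_id` column names, the helper names, and the sample record are all hypothetical.

```python
import csv
import io


def split_tables(businesses):
    """Split nested business records into business rows and review rows.

    Each review row carries a business_id that points back at its parent
    business row (an assumed key scheme, for illustration only).
    """
    business_rows, review_rows = [], []
    for business_id, biz in enumerate(businesses, start=1):
        business_rows.append({
            "id": business_id,
            "name": biz.get("name"),
            "phone": biz.get("phone"),
            "address": biz.get("address"),
        })
        for review in biz.get("reviews", []):
            review_rows.append({
                "business_id": business_id,
                "date": review.get("date"),
                "user_name": review.get("user_name"),
                "comment": review.get("comment"),
            })
    return business_rows, review_rows


def to_csv(rows, fieldnames):
    """Render a list of dict rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder record, not real scraped data.
    sample = [{
        "name": "Example Bar",
        "phone": "555-0100",
        "address": "1 Example St",
        "reviews": [
            {"date": "1/1/2013", "user_name": "A.", "comment": "Nice."},
        ],
    }]
    businesses, reviews = split_tables(sample)
    print(to_csv(businesses, ["id", "name", "phone", "address"]))
    print(to_csv(reviews, ["business_id", "date", "user_name", "comment"]))
```

The point of the split is that each review row stays small and can be joined back to its business through `business_id`, rather than repeating the business fields on every review.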

It’s that easy. Get started with Installation Instructions.

The sites shown are for example purposes only. Please obey robots.txt.
