gnoireaux edited this page Sep 13, 2010 · 5 revisions

This page and the Selector Language page together define the semantics of the Parsley language.

The JSON structure of a Parsley script is mirrored in its output, except that arrays in the output may contain more elements.

{"foo":"some selector"} will always produce output like {"foo":"some extracted content"}. The extracted content is the inner text of the first selector match.

{"foo":["some selector"]} will always produce output like {"foo":["some extracted content", "some more content"]}. The length of the extracted JSON array is equal to the number of selector matches on the target page.
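
The difference between the scalar and array forms can be modeled with a short Python sketch. This is not Parsley's implementation: the `extract` function and the `matches` list (standing in for the inner text of every node a selector matched) are assumptions made for illustration.

```python
def extract(value, matches):
    """Shape the output for one parselet value.

    `matches` is a hypothetical stand-in for the inner text of every node
    the selector matched, in document order.
    """
    if isinstance(value, list):
        # Array form: one output element per selector match.
        return list(matches)
    # Scalar form: only the first match is kept.
    return matches[0]


matches = ["some extracted content", "some more content"]
scalar_out = {"foo": extract("some selector", matches)}
array_out = {"foo": extract(["some selector"], matches)}
```

Here `scalar_out` holds a single string and `array_out` holds a list with one entry per match. The case where nothing matches is not handled in this sketch.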

Operators

Keys can also contain operators, which do not appear in the output. ? is an operator that marks a key as optional. {"foo?":"some selector"} will produce output like {"foo":"some extracted content"}. If the selector matches nothing in the target document, the result is {} rather than an exception.

“Broken” Parselets

A non-optional key that doesn’t match any content is considered broken. An object that has broken keys or values is also considered broken.

{"foo": {"bar": "selector that doesn't match anything"}} will raise an exception: Key not found: /foo/bar.

{"foo": {"bar?": "selector that doesn't match anything"}} will return {"foo": {}}.

{"foo?": {"bar": "selector that doesn't match anything"}} will return {}.
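
The three cases above can be reproduced with a small model of how brokenness propagates. This is a sketch under assumptions, not Parsley's code: `Broken`, `evaluate`, and the `matches` dictionary (mapping a selector to its extracted text, with unmatched selectors simply absent) are hypothetical names, and only string and object values are handled.

```python
class Broken(Exception):
    """Raised when a non-optional key matches no content."""


def evaluate(parselet, matches, path=""):
    """Evaluate an object of string/object values against `matches`."""
    out = {}
    for key, value in parselet.items():
        optional = key.endswith("?")
        name = key[:-1] if optional else key
        child_path = f"{path}/{name}"
        try:
            if isinstance(value, dict):
                # A broken child object makes this key broken too.
                out[name] = evaluate(value, matches, child_path)
            else:
                text = matches.get(value)
                if text is None:
                    raise Broken(f"Key not found: {child_path}")
                out[name] = text
        except Broken:
            if not optional:
                raise
            # An optional key absorbs the failure and is left out.
    return out
```

Brokenness travels upward until it reaches an optional key, which is why `"foo?"` swallows a failure inside `"bar"` while a plain `"foo"` lets it escape as an exception.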

Broken parselets in Arrays will be addressed after we introduce arrays of objects.

Restrictions on structure

The naming (i.e. pre-operator) portion of a key is expected to be alphanumeric, plus spaces, underscores, and dashes. The idiomatic separation character is the underscore.

Keys and values should be Strings, Objects, or Arrays. Integers and other types are not supported at this time.

Arrays in the input should contain exactly one child, which should not itself be an Array.
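
These structural rules can be checked mechanically. The following is a rough validator written for illustration; the `validate` function and its error messages are assumptions, and it covers only the value and array rules, not key naming.

```python
def validate(value, path=""):
    """Raise ValueError if `value` breaks the structural rules above."""
    if isinstance(value, str):
        return
    if isinstance(value, dict):
        for key, child in value.items():
            validate(child, f"{path}/{key}")
        return
    if isinstance(value, list):
        if len(value) != 1:
            raise ValueError(f"{path}: array must contain exactly one child")
        if isinstance(value[0], list):
            raise ValueError(f"{path}: array child must not be an array")
        validate(value[0], f"{path}[]")
        return
    # Integers, booleans, null, and so on are not supported.
    raise ValueError(f"{path}: unsupported type {type(value).__name__}")
```

Calling `validate({"foo": ["a"]})` returns without error, while `validate({"foo": [["a"]]})` and `validate({"foo": 5})` both raise `ValueError`.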

Arrays of Objects

Arrays of objects are structured in one of two ways, which express the major web idioms for recordsets in page structure. “Scoped arrays” handle the idiom of encapsulation, where each record is contained in e.g. an <li> element. “Bucketed arrays” handle the idiom where groups have similar structure, but no single encapsulating element per group.

Scoped arrays

First, we need to introduce the scoping operator.

Parentheses – Scoping operator

Parentheses are a scoping operator that can be used in any key. {"foo(div)":"a"} will return the same result as {"foo": "div a"}. This is sometimes useful to prevent repetition (e.g. in {"link(li div.foo a)": {"name": ".", "link": "@href"}}). More often, it is used to express scoped arrays.
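
As an illustration of how a key breaks down into its parts, here is a small Python sketch. The key layout it assumes (a name, then an optional `?` or `!` operator, then an optional parenthesised scope) is inferred from the examples on this page rather than taken from Parsley's source, and `parse_key` and `expand_scope` are hypothetical helpers.

```python
import re

# Assumed layout: name, optional "?" or "!" operator, optional "(scope)".
KEY_PATTERN = re.compile(r"([A-Za-z0-9 _-]+)([?!]?)(?:\((.*)\))?")


def parse_key(key):
    """Split a key into (name, operator, scope); scope is None if absent."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"unrecognised key: {key!r}")
    return match.groups()


def expand_scope(key, selector):
    """Rewrite a scoped key with a plain descendant selector as an unscoped one."""
    name, operator, scope = parse_key(key)
    if scope is None:
        return name + operator, selector
    # Only valid for simple selectors such as "a", not for "." or "@href".
    return name + operator, f"{scope} {selector}"
```

With these helpers, `expand_scope("foo(div)", "a")` yields the key `foo` with the selector `div a`, matching the equivalence described above.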

A real-world example

Yelp’s business pages (e.g. http://www.yelp.com/biz/tourist-club-mill-valley ) have reviews that are each contained in their own <li class="nonfavoriteReview">. The following parselet gets the date, user_name, and comment body from each.

{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.nonfavoriteReview)": [
    {
      "date": ".ieSucks .smaller",
      "user_name": ".reviewer_info a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}

Broken parselets in Arrays

When an Array contains broken Objects, the default behavior is to drop each broken Object from the Array. The Array itself is declared broken if and only if it ends up empty.

In the previous Yelp example, if there were 3 reviews, but one of them had a missing date, then the returned reviews Array would contain only two elements, e.g.:

{
  "name": "Joe's",
  "phone": "555-5555",
  "address": "12345 Main St",
  "reviews": [
    {
      "date": "December 19th",
      "user_name": "jacko",
      "comment": "Hi world."
    },
    {
      "date": "December 21st",
      "user_name": "billy",
      "comment": "Yo, dawg."
    }
  ]
}

If, however, the date key was marked optional, then the result would be something like:

{
  "name": "Joe's",
  "phone": "555-5555",
  "address": "12345 Main St",
  "reviews": [
    {
      "date": "December 19th",
      "user_name": "jacko",
      "comment": "Hi world."
    },
    {
      "user_name": "suzy",
      "comment": "I'm commenting without the date somehow."
    },
    {
      "date": "December 21st",
      "user_name": "billy",
      "comment": "Yo, dawg."
    }
  ]
}
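
The two outcomes above can be reproduced with a rough model of the default array behavior. Everything here is an assumption made for illustration: `extract_records`, `BrokenArray`, and the `scopes` list, where each entry maps a selector to the text found inside one matched scope element.

```python
class BrokenArray(Exception):
    """Raised when every object in an array was dropped as broken."""


def extract_records(template, scopes):
    """Build one object per scope, dropping objects with a missing required key."""
    results = []
    for scope in scopes:
        record = {}
        broken = False
        for key, selector in template.items():
            optional = key.endswith("?")
            name = key[:-1] if optional else key
            text = scope.get(selector)
            if text is not None:
                record[name] = text
            elif not optional:
                broken = True  # a required key is missing: drop this object
                break
        if not broken:
            results.append(record)
    if not results:
        raise BrokenArray("array is empty after dropping broken objects")
    return results


# Hypothetical per-review matches; suzy's review has no date.
scopes = [
    {".date": "December 19th", ".user": "jacko"},
    {".user": "suzy"},
    {".date": "December 21st", ".user": "billy"},
]
required = extract_records({"date": ".date", "user_name": ".user"}, scopes)
optional = extract_records({"date?": ".date", "user_name": ".user"}, scopes)
```

`required` contains two reviews because suzy's is dropped, while `optional` contains all three, with suzy's review lacking a `date` key.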

Requiring complete Arrays with the ! operator

Extending the previous example, imagine that you want the parse to fail if any of the dates are missing. You can add ! to the reviews key, like so:

{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews!(.nonfavoriteReview)": [
    {
      "date": ".ieSucks .smaller",
      "user_name": ".reviewer_info a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}

This changes the behavior of the reviews array. A broken object in the reviews array now causes the entire reviews array to be broken. Suzy’s comment will now cause this parselet to raise an exception: Key not found: /reviews/date.
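
The effect of ! can be sketched as a flag on the step that collects array elements. The `collect` function, its `complete` flag, and the `path` default are hypothetical, and the records are assumed to be already extracted, with missing keys simply absent.

```python
def collect(records, required, complete=False, path="/reviews"):
    """Keep complete records; with complete=True, fail on the first broken one."""
    kept = []
    for record in records:
        missing = [key for key in required if key not in record]
        if not missing:
            kept.append(record)
        elif complete:
            # "!" behavior: one broken object breaks the whole array.
            raise LookupError(f"Key not found: {path}/{missing[0]}")
        # Default behavior: silently drop the broken object.
    return kept


reviews = [
    {"date": "December 19th", "user_name": "jacko"},
    {"user_name": "suzy"},
    {"date": "December 21st", "user_name": "billy"},
]
default_result = collect(reviews, ["date", "user_name"])
```

With the default flag, suzy's review is dropped and two records remain. Passing `complete=True` instead raises an error whose message is `Key not found: /reviews/date`.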

Bucketed Arrays

Bucketed arrays are used when there is no structure around a group. Imagine trying to parse the following HTML to grab the list of invitees for each event:

<!-- ... -->
<h2>My BBQ Invitees</h2>
<p>Joe</p>
<p>Jeff</p>
<p>Suzy</p>
<h2>My Dinner Invitees</h2>
<p>Dylan</p>
<p>Hobbes</p>
<!-- ... -->

It turns out that in Parsley, this is easy. First the code, then the explanation!

{
  "events": [{
    "name": "h2",
    "attendees": ["p"]
  }]
}

It returns, of course:

{
  "events": [{
    "name": "My BBQ Invitees",
    "attendees": ["Joe", "Jeff", "Suzy"]
  },
  {
    "name": "My Dinner Invitees",
    "attendees": ["Dylan", "Hobbes"]
  }]
}

But why does this work?

First, the events key realizes that its value is an array, and that it has no scope, and therefore goes into bucketed mode. It inserts into itself an Object, with two “buckets”, one for name, and one for the attendees.

name is scalar, so its bucket can only fit one element. attendees is an array, so it can fit unbounded elements.

When parsing the page, the parser encounters “My BBQ Invitees”, and puts it in the name bucket. It encounters “Joe”, “Jeff”, and “Suzy”, and puts each of them in the attendees bucket.

Next, it encounters “My Dinner Invitees”. We cannot insert this element into the already full name bucket, so a new event Object is created. Then the parse continues, with the remaining elements put into the second event.

Overflow of a singular key within a bucketed array triggers the creation of a new child Object.
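
The bucketing walk described above can be sketched in a few lines of Python. This is a simplified model, not Parsley's implementation: the page is reduced to an ordered list of hypothetical `(tag, text)` pairs, selectors are treated as bare tag names, and `bucket` is an invented helper.

```python
def bucket(template, nodes):
    """Group document-ordered nodes into objects, one bucket per template key."""
    objects = []
    current = None
    for tag, text in nodes:
        for key, selector in template.items():
            is_array = isinstance(selector, list)
            wanted = selector[0] if is_array else selector
            if tag != wanted:
                continue
            # A scalar bucket that is already full overflows into a new object.
            if current is None or (not is_array and key in current):
                current = {}
                objects.append(current)
            if is_array:
                current.setdefault(key, []).append(text)
            else:
                current[key] = text
    return objects


nodes = [
    ("h2", "My BBQ Invitees"), ("p", "Joe"), ("p", "Jeff"), ("p", "Suzy"),
    ("h2", "My Dinner Invitees"), ("p", "Dylan"), ("p", "Hobbes"),
]
events = bucket({"name": "h2", "attendees": ["p"]}, nodes)
```

The second `h2` finds the `name` bucket already filled, so a new object is started and the remaining `p` elements land in its `attendees` list.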