The language engine and tool for web parsing

pársz

- A tool for parsing the web

Usage

Install globally from npm/yarn

$ npm install -g parsz

View options from help menu

$ parsz --help

Use a "parselet" as a recipe/filter to parse a website.

The structure of the parselet is JSON.

Here is an example of a parselet for grabbing business data from a Yelp page:

{
  "name": "h1|trim",
  "phone": ".biz-phone|trim",
  "address": "address|trim",
  "reviews(.review)": [{
    "date": "meta[itemprop=datePublished] @content",
    "name": ".user-name a",
    "comment": ".review-content p"
  }]
}

As a module

You can also use parsz as a module:

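import parsz from 'parsz';

// Placeholder inputs: substitute your own parselet and target URL.
// The parselet is assumed here to be passed as a plain object.
const parselet = { title: 'h1|trim' };
const url = 'https://example.com';

parsz(parselet, url).then(data => {
  // Do something with the data
});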
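
When several pages share one layout, the same parselet can be reused for each URL. The sketch below relies only on what the snippet above shows, namely that `parsz(parselet, url)` returns a promise. `parseAll` and its `parse` parameter are hypothetical names, and the parser is passed in as an argument so the helper can be tried with a stub before pointing it at real pages.

```javascript
// Hypothetical helper: run one parselet against several URLs.
// `parse` is expected to behave like parsz(parselet, url) and return a promise.
async function parseAll(parse, parselet, urls) {
  const results = await Promise.all(
    urls.map(async (url) => {
      try {
        return { url, data: await parse(parselet, url) };
      } catch (error) {
        // Keep going if one page fails; record the failure instead.
        return { url, error: error.message };
      }
    })
  );
  return results;
}
```

Calling `parseAll(parsz, parselet, urls)` would then resolve to one entry per URL, each holding either the parsed `data` or an `error` message.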

Tips

parsz is a general-purpose, flexible tool. Here are some tips for getting started.

Grabbing a list of data

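Put a reference selector in parentheses after the key name and use an Array containing a single object as the value. The selectors inside that object are applied to each element matched by the reference selector.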

{
  "users(.user)": [{
    "name": ".name",
    "age": ".age"
  }]
}
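
To make the key syntax concrete, here is a minimal sketch of how a key such as `users(.user)` could be split into a property name and a reference selector. `parseKey` is a hypothetical name used for illustration; this is not parsz's actual parser, and it handles only the list form shown above.

```javascript
// Illustrative only: split a parselet key into its name and optional
// reference selector, e.g. "users(.user)" -> name "users", selector ".user".
function parseKey(key) {
  const match = /^([^(]+)\((.+)\)$/.exec(key);
  if (!match) {
    // A plain key such as "name" has no reference selector.
    return { name: key, selector: null };
  }
  return { name: match[1].trim(), selector: match[2].trim() };
}
```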

Use transformation functions on data

Add a pipe (|) and the transformation name after the data selector.

{
  "user": {
    "name": ".name|trim",
    "age": ".age|parseInt",
    "worth": ".age|parseFloat",
    "someNumber": ".age|floor"
  }
}
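
The pipe syntax can be pictured as a lookup table of named functions applied to the selected text. The sketch below is an assumption about how such a pipeline might look, not parsz's own code: the `transforms` table and `applyTransforms` are hypothetical, and chaining more than one transformation is shown only as a possibility.

```javascript
// Illustrative transformation table; the names mirror the ones used above,
// but the implementations are assumptions.
const transforms = {
  trim: (value) => String(value).trim(),
  parseInt: (value) => Number.parseInt(value, 10),
  parseFloat: (value) => Number.parseFloat(value),
  floor: (value) => Math.floor(Number(value)),
};

// Split "selector|a|b" into the selector and its transformation names,
// then apply the transformations left to right to the raw text.
function applyTransforms(spec, rawText) {
  const [selector, ...names] = spec.split('|').map((part) => part.trim());
  const value = names.reduce((current, name) => {
    if (!Object.prototype.hasOwnProperty.call(transforms, name)) {
      throw new Error(`Unknown transformation: ${name}`);
    }
    return transforms[name](current);
  }, rawText);
  return { selector, value };
}
```

For instance, `applyTransforms('.age|parseInt', ' 42 ')` would pair the selector `.age` with the number `42` rather than the raw string.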

If you would like to see a particular transformation function added, please open an issue.

Grabbing an attribute

Use an at sign (@) after the selector, followed by the attribute name, to grab an attribute instead of the element's text.

{
  "user": {
    "name": ".name",
    "nickname": ".name@data-nickname"
  }
}
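
As an illustration of the attribute syntax, the sketch below separates a value such as `.name@data-nickname` into its selector and attribute name. `splitAttribute` is a hypothetical helper, not part of parsz. It also accepts the spaced form `meta[itemprop=datePublished] @content` used in the Yelp example, and it simplifies by treating the last `@` as the separator.

```javascript
// Illustrative only: separate "selector@attribute" into its two parts.
// Simplification: the last "@" in the string is taken as the separator.
function splitAttribute(spec) {
  const at = spec.lastIndexOf('@');
  if (at === -1) {
    // No attribute requested; the element's text would be used instead.
    return { selector: spec.trim(), attribute: null };
  }
  return {
    selector: spec.slice(0, at).trim(),
    attribute: spec.slice(at + 1).trim(),
  };
}
```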

Grabbing remote data

Use a tilde (~) followed by a link selector in parentheses after the key name to reference external content. The mapping (value) is evaluated relative to that new external scope, that is, the page the link points to.

{
  "user": {
    "name": ".name",
    "company~(a.company)": {
      "name": ".company-name",
      "address": ".company-address"
    }
  }
}
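
Following a link found on a page generally means resolving its `href` against the URL of the page it came from, since links are often relative. The sketch below shows that step with the standard `URL` class. `resolveLink` is a hypothetical name, and whether parsz resolves links exactly this way is an assumption.

```javascript
// Resolve an href taken from a matched link against the URL of the page
// it was found on. URL is a global in Node.js and in browsers.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}
```

With this helper, an `href` of `/companies/acme` found on `https://example.com/users/1` resolves to `https://example.com/companies/acme`, while an absolute `href` is returned unchanged.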

Have fun!

Related projects
