Simple crawler pattern match.
CSS JavaScript
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
bin
docs
lib
test
.gitignore
README.md
package.json

README.md

Suck

Get the tree of anchors based into an pattern.

Installation

[sudo] npm install -g suck

Usage

Target

suck -t [host] // the address of place you get extract

List of Patterns

suck -t [host] -l [pattern] // list of patterns, who you need to extract

Output

suck -t [host] -l [pattern] -o [filename] // the json has a array structure

Errors

// erro.json has the information of the process `[todo:improve-errors]`

Limit

suck -t [host] -l [pattern] -o [filename] -e [filename] -e [number] // number of times to turn back

Verbose

suck -t [host] -l [pattern] -o [filename] -e [filename] -e [number] -v // verbose mode, show the progress information

Example of usage

[Verbose output]

suck -t http://pt.wikipedia.org/wiki/El_Chavo_del_Ocho -l imdb -v -e 1 

[File output]

suck -t http://pt.wikipedia.org/wiki/Charlie_Chaplin -l imdb -e 10 -o chaplin-reference.json

Next

- Create the documentation of API

- Add RegExp support to match patterns

- Add ignore patterns support

- Cluster crawling

- Add Jug support

- Trainning algorithm to capture URL's with patterns:

    - Semantic URL

    - Schema information

    - Meta information

    - Content

    - Sentiment

    - analysis sentimental


- Move the methods of `bin/index.js` to `lib/index.js`, to make more reusable. [ OK ]

Author and the license

Kaique da Silva <kaique.developer@gmail.com>, under <BSD-license>