Getting Started

chriso edited this page · 32 revisions

Data scraping and processing code is organised into modular and extendable jobs written in JavaScript or CoffeeScript. A typical job takes some input, processes or reduces it in some way, and then outputs the emitted results, although no step is compulsory; some scraping jobs, for example, don't require any input.

Running a job

Jobs can be run from the command line or through a web interface. To run a job from the command line (extension can be omitted), run

$ myjob

To run jobs through the web interface, copy your jobs to ~/.node_modules and run

$ -p 8080

The web interface can be accessed at http://localhost:8080/

Running a job from within another script

Use nodeio.start(job, options, callback, capture_output).

A job usually defines its own output method, but if you need to capture the output and return it to the callback, set capture_output to true. Note that callback takes (err, output) or (err) if not capturing output.

Debugging a job

Sometimes a job may display incorrect behavior. To find out why and see what's going on under the hood, use the -g or --debug switch

$ --debug myjob

The anatomy of a job

Example 1: Hello World!


var nodeio = require('');
exports.job = new nodeio.Job({
    input: false,
    run: function () {
        this.emit('Hello World!');
    }
});

nodeio = require ''
class Hello extends nodeio.JobClass
    input: false
    run: (num) -> @emit 'Hello World!'

@class = Hello
@job = new Hello()

To run the example

$ -s hello
     => Hello World!

Note: the -s switch omits status messages from the output, which is the same as appending 2> /dev/null

Example 2: Double each element of input


var nodeio = require('');
exports.job = new nodeio.Job({
    input: [0,1,2],
    run: function (num) {
        this.emit(num * 2);
    }
});

nodeio = require ''
class Double extends nodeio.JobClass
    input: [0,1,2]
    run: (num) -> @emit num * 2

@class = Double
@job = new Double()

Example 3: Inheritance


var Double = require('./double').job;

exports.job = Double.extend({
    run: function (num) {
        //Call the parent job's run() with the doubled value
        Double.run.call(this, num * 2);
        //Same as: this.emit(num * 4)
    }
});

Note: CoffeeScript inheritance with multiple files is temporarily broken in the latest release. A fix is coming soon! Classes that are defined in the same file are fine:

nodeio = require ''
class Double extends nodeio.JobClass
    input: [0,1,2]
    run: (num) -> @emit num * 2

class Quad extends Double
    run: (num) -> super num * 2

@class = Quad
@job = new Quad()

Basic concepts

Job options

Options allow you to easily incorporate common or complex behavior. A full list of options can be found in the API.

Options are specified as an object containing key/value pairs

var nodeio = require('');
var options = {
    timeout: 10,    //Timeout after 10 seconds
    max: 20,        //Run 20 threads concurrently (when run() is async)
    retries: 3      //Threads can retry 3 times before failing
};
//methods is the object containing the job's methods, e.g. run()
exports.job = new nodeio.Job(options, methods);

Determining when a job is complete

Because jobs run asynchronously, the framework needs to be able to determine when each thread (a call to run()) is complete, and when the entire job is complete.

A thread is complete after:

  • emit(), fail(), retry() or skip() has been called - any subsequent calls in the same thread are ignored
  • An option, such as timeout, causes the thread to automatically call one of the methods above
  • run() returns something other than null - in this case, the return value is emitted

** Important: if one of the above conditions is not met, the thread will hang indefinitely **
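
The completion rules above can be modelled in a few lines. The sketch below is not the library's implementation: createThread and runThread are hypothetical names, and treating an undefined return value the same as null is an assumption. It only illustrates that the first completion call wins, that later calls are ignored, and that a non-null return value is emitted.

```javascript
// Hypothetical model of a single thread; names are illustrative only.
function createThread(onComplete) {
    var done = false;
    function complete(kind, value) {
        if (done) return false; // subsequent calls in the same thread are ignored
        done = true;
        onComplete(kind, value);
        return true;
    }
    return {
        emit: function (value) { return complete('emit', value); },
        fail: function (reason) { return complete('fail', reason); },
        retry: function () { return complete('retry'); },
        skip: function () { return complete('skip'); }
    };
}

// Runs `run` once against `input` and reports how the thread completed.
// A kind of null means no completion condition was met (the thread would hang).
function runThread(run, input) {
    var result = { kind: null, value: undefined };
    var thread = createThread(function (kind, value) {
        result.kind = kind;
        result.value = value;
    });
    var returned = run.call(thread, input);
    // Assumption: undefined is treated like null, so only real values are emitted
    if (returned !== null && returned !== undefined) {
        thread.emit(returned);
    }
    return result;
}

// Only the first call counts: the skip() after emit() is ignored
var first = runThread(function (num) { this.emit(num * 2); this.skip(); }, 2);
console.log(first.kind, first.value);

// Returning a value is equivalent to emitting it
var second = runThread(function (num) { return num + 1; }, 1);
console.log(second.kind, second.value);
```

In this model a run() that neither calls a completion method nor returns a value leaves kind as null, which corresponds to the hanging thread described in the warning above.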

The job is complete when:

  • All of the input has been consumed, or in the case of input: false, when one thread has completed
  • exit() is called
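
As a rough illustration of the job-level rules, the sketch below models a job synchronously. runJob is a hypothetical helper, not part of the library, and it ignores concurrency, retries and failures; it only shows that a job ends once the input is consumed, after a single thread when input is false, or as soon as exit() is called.

```javascript
// Hypothetical synchronous model of job completion; not the library's code.
function runJob(input, run) {
    var output = [];
    var exited = false;
    var context = {
        emit: function (value) { output.push(value); },
        skip: function () {},
        exit: function () { exited = true; }
    };
    // input: false means a single thread that receives no input
    var items = input === false ? [undefined] : input;
    for (var i = 0; i < items.length && !exited; i++) {
        run.call(context, items[i]);
    }
    return output; // the job is complete at this point
}

// All input consumed: one thread per element
console.log(runJob([0, 1, 2], function (num) { this.emit(num * 2); }));

// input: false: the job completes after one thread
console.log(runJob(false, function () { this.emit('Hello World!'); }));

// exit() ends the job before the remaining input is processed
console.log(runJob([0, 1, 2], function (num) {
    if (num === 1) {
        this.exit();
    } else {
        this.emit(num);
    }
}));
```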

Passing arguments to jobs

Jobs can accept arguments from the command line, e.g.

$ myjob arg1 arg2 arg3

Arguments can be accessed through this.options.args, e.g.

run: function() {
    console.log(this.options.args[0]); //"arg1"
}

Retrying, skipping or failing a thread

To retry or skip a thread, use the retry() or skip() methods (no arguments required), e.g. to remove empty lines


var nodeio = require('');
exports.job = new nodeio.Job({
    run: function(line) {
        if (line.trim() === '') {
            this.skip();     //Drop empty lines
        } else {
            this.emit(line); //Pass everything else through
        }
    }
});

Some job options (timeout, retries, redirects) cause fail() to be called automatically when the corresponding limit is exceeded, e.g.

var nodeio = require('');
exports.job = new nodeio.Job({timeout: 5}, {
    run: function(input) {
        //There are no conditions that would cause this thread to be marked as complete, so it will time out after 5 seconds
    },
    fail: function (input, status) {
        //status = "timeout"
        this.emit('Thread failed'); //You still need to complete the thread with an emit or skip, etc.
    }
});
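
The retries option follows the same pattern as timeout. The sketch below is a simplified, synchronous model rather than the library's behaviour: runWithRetries is a hypothetical helper, the 'retry' status string is a placeholder, and it assumes that retries: N allows run() to be attempted at most N + 1 times before fail() is called.

```javascript
// Hypothetical model of the retries option; names and status are placeholders.
function runWithRetries(methods, input, retries) {
    var action = null; // first completion call made during the current attempt
    var context = {
        emit: function (value) { if (!action) action = { kind: 'emit', value: value }; },
        skip: function () { if (!action) action = { kind: 'skip' }; },
        retry: function () { if (!action) action = { kind: 'retry' }; }
    };
    var attempts = 0;
    while (true) {
        action = null;
        attempts++;
        methods.run.call(context, input);
        if (action && action.kind === 'retry') {
            if (attempts <= retries) continue; // retries remain: run again
            // Retries exhausted: fail() is called and must still complete the thread
            action = null;
            methods.fail.call(context, input, 'retry');
        }
        return { attempts: attempts, result: action };
    }
}

// A thread that always retries ends up in fail() after 1 + 3 attempts
var outcome = runWithRetries({
    run: function () { this.retry(); },
    fail: function (input, status) { this.emit('Thread failed: ' + status); }
}, 'some input', 3);
console.log(outcome.attempts);     // 4
console.log(outcome.result.value); // Thread failed: retry
```

As in the timeout example, the fail handler has to finish the thread itself; in this model a fail() that does nothing leaves result as null.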

Go to part 2: Working with input / output

Go to part 3: Scraping data from the web

Go to part 4: Data validation and sanitization
