To install node.io, use npm:

$ npm install node.io

For usage details, run

$ node.io --help

What is node.io?

node.io is a framework for scraping and processing data. A node.io job typically consists of a) taking some input, b) using or transforming it, and c) outputting something.
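The three stages can be pictured with a minimal sketch in plain JavaScript. This is not node.io's API: `runJob` and the `input` / `run` / `output` property names are hypothetical, chosen only to mirror steps a), b) and c) above.

```javascript
// Hypothetical sketch, NOT node.io's API: a job is input -> run -> output.
function runJob(job) {
  const results = [];
  for (const item of job.input) {
    // b) use or transform each input item; returning undefined skips it
    const out = job.run(item);
    if (out !== undefined) results.push(out);
  }
  // c) hand the collected results to the output stage
  job.output(results);
  return results;
}

const lines = [];
const job = {
  // a) some input: here, a hard-coded list of raw strings
  input: ['  Alice ', '', 'bob'],
  run: (line) => {
    const trimmed = line.trim();
    return trimmed === '' ? undefined : trimmed.toLowerCase();
  },
  // collect output in memory; a real job might write to a file or STDOUT
  output: (results) => results.forEach((r) => lines.push(r)),
};

runJob(job);
```

After `runJob(job)` returns, `lines` holds the cleaned, non-empty entries.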

node.io can simplify the process of:

  • Filtering / sanitizing a list
  • MapReduce
  • Loading a list of URLs and scraping and saving some data from each
  • Parsing log files
  • Transforming data from one format to another, e.g. from CSV to a database
  • Recursively loading all files in a directory and its subdirectories and executing a command on each file
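As an illustration of the MapReduce item in the list above, here is a small in-memory sketch in plain JavaScript. It does not use node.io; the `mapReduce` helper and its signature are assumptions made for this example only.

```javascript
// Hypothetical helper, NOT part of node.io: a tiny in-memory map/reduce.
// map(input) returns an array of [key, value] pairs;
// reduce(key, values) folds all values emitted for one key into a result.
function mapReduce(inputs, map, reduce) {
  const groups = new Map();
  for (const input of inputs) {
    for (const [key, value] of map(input)) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduce(key, values);
  }
  return result;
}

// Word count: emit [word, 1] for each word, then sum the ones per word.
const counts = mapReduce(
  ['a b a', 'b c'],
  (line) => line.split(/\s+/).filter(Boolean).map((word) => [word, 1]),
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);
```

Here `counts` maps each distinct word to the number of times it occurs across all input lines.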

Why node.io?

  • Create modular and extensible jobs for scraping and processing data
  • Written in Node.js and JavaScript - jobs are concise, asynchronous and FAST
  • Speed up execution by distributing work among child processes and other servers (soon)
  • Easily handle a variety of input / output situations
    • Reading / writing lines to and from files
    • Reading all files in a directory (and optionally recursing)
    • Reading / writing rows to and from a database
    • STDIN / STDOUT
    • Piping between other node.io jobs
    • Any combination of the above, or completely custom IO
  • Includes a robust framework for scraping and selecting web data
  • Support for a variety of proxies when making requests
  • Includes a data validation and sanitization framework
  • Provides support for retries, timeouts, dynamically adding input, etc.
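To make the retry and dynamic-input ideas from the list above concrete, here is a simplified, synchronous sketch in plain JavaScript. It is not how node.io implements them: `processQueue`, the `add` callback and the `maxRetries` parameter are hypothetical names, and timeouts are left out to keep the example short.

```javascript
// Hypothetical sketch, NOT node.io's implementation: a work queue where
// a handler may add new input, and failed items are retried a few times.
function processQueue(initial, handler, maxRetries) {
  const queue = initial.map((item) => ({ item, attempts: 0 }));
  const done = [];
  const failed = [];
  while (queue.length > 0) {
    const entry = queue.shift();
    try {
      // add() lets the handler append new input while the job is running
      const add = (item) => queue.push({ item, attempts: 0 });
      done.push(handler(entry.item, add));
    } catch (err) {
      entry.attempts += 1;
      if (entry.attempts <= maxRetries) {
        queue.push(entry); // retry later, behind the remaining items
      } else {
        failed.push(entry.item); // give up after maxRetries retries
      }
    }
  }
  return { done, failed };
}

let flakyCalls = 0;
const result = processQueue(['/a', '/bad'], (path, add) => {
  if (path === '/a') add('/a/child'); // dynamically added input
  if (path === '/bad') throw new Error('always fails');
  if (path === '/a/child' && flakyCalls++ === 0) throw new Error('fails once');
  return path.toUpperCase();
}, 2);
```

In this run `/a/child` succeeds on its second attempt, while `/bad` ends up in `result.failed` once its two retries are used up.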

Documentation

Initial documentation is available here.

Better documentation will be available once I have time to write it. See http://node.io/ for updates.

Examples

See ./examples

Roadmap

  • Automatically handle HTTP codes, e.g. redirect on 3xx or call fail() on 4xx/5xx
  • Nested requests inherit referrer / cookies if to the same domain
  • Add more DOM selector / traversal methods
  • Test proxy callbacks
  • Add distributed processing
  • Installation without npm (install.sh)
  • Refactoring

Credits

node.io uses the following awesome libraries:

License

MIT License
