Skip to content
Web scraping made simple.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
__test__
.editorconfig
.gitattributes
.gitignore
LICENSE
README.md
circle.yml
index.js
package.json
yarn.lock

README.md

tokio

NPM version NPM downloads CircleCI donate chat

Web scraping made simple.

Features

  • Built on the top of jsdom.
  • It runs inline and external scripts on the page.
  • You can add resource filter to not load certain external resources.
  • Simple and fast, only 100 SLOC and it does not require Electron or Chromium.

Install

yarn add tokio

Table of Contents

Usage

const Tokio = require('tokio')

const tokio = new Tokio({
  url: 'https://some-website.com'
})

tokio.fetch().then(html => {
  console.log(html) //=> string

  // Query HTML with cheerio (server-side jQuery)
  // https://github.com/cheeriojs/cheerio
  const $ = tokio.query(html)
})

API

new Tokio(options)

options

options.url
  • Type: string
  • Required: required

The URL to fetch.

options.wait
  • Type: number string
  • Default: 50

Wait for certain time (in milliseconds) or dom element to show up.

options.manually
  • Type: boolean string

Instead of using options.wait, you can manually call window.__tokio_ready__() in your website to tell us that it's ready to be captured.

It can also be a string like i_am_ready so that you can call window.i_am_ready() instead.

options.resourceFilter
  • Type: resource => boolean

Whether to load certain resource. Check out the resource type.

options.requestOptions
  • proxy: string A URL for a HTTP proxy to use for the requests.
  • agent: http(s).Agent instance to use.
  • agentOptions: The agent options; defaults to { keepAlive: true, keepAliveMsecs: 115000 }, see http api for more details.
  • strictSSL: If true, requires SSL certificates be valid; defaults to true, see request module for more details.
  • userAgent: The user agent string used in requests; defaults to Node.js (#process.platform#; U; rv:#process.version#)
  • headers: An object giving any headers that will be used while loading the HTML from options.url, if applicable.
options.variables

Inject variables to the global scope window.

tokio.fetch()

  • Type: () => Promise<string>

Fetch URL and return corresponding HTML. (JavaScript on this page will be evaluated.)

tokio.query(html, opts)

This is basically cheerio.load(html, opts).

Contributing

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

Author

tokio © egoist, Released under the MIT License.
Authored and maintained by egoist with help from contributors (list).

github.com/egoist · GitHub @egoist · Twitter @_egoistlily

You can’t perform that action at this time.