HTML-Play

Fetch and parse web pages with Node.js like a boss 🕶

Features

- Intuitive APIs for extracting useful contents like links and images.
- CSS selectors.
- Mocked user-agent (like a real web browser).
- Full JavaScript support.
  ```
  await htmlPlay(url, { browser: true })
  ```
  Uses Chromium under the hood by default, thanks to Playwright.

Recipes

Grab a list of all links and images on the page.

```
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('https://nodejs.org')
// Will print all link URLs on the page
console.log(dom.links)
// Will print all image URLs on the page
console.log(dom.images)
```

Select an element with a CSS selector.

```
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('https://nodejs.org')
const intro = dom.find('#home-intro', { containing: 'Node' })
// Will print: 'Node.js® is an open-source, cross-platform...'
console.log(intro.text)
```

Let's grab some wallpapers from Unsplash.

```
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('https://unsplash.com/t/wallpapers')
const elements = dom.findAll('img[itemprop=thumbnailUrl]')
const images = elements.map(({ image }) => image)
// Will print something like
// ['https://images.unsplash.com/photo-1705834008920-b08bf6a05223', ...]
console.log(images)
```

Let's load the latest stories from Hacker News.

```
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('https://news.ycombinator.com')
const titles = dom.findAll('.titleline')
const news = titles.map(({ text, link }) => [text, link])
// Will print something like
// [['news 1', 'http://xxx.com'], ['news 2', 'http://yyy.com'], ...]
console.log(news)
```

Load a dynamic website, that is, one whose content is generated by JavaScript.

```
// Search for images of "flower" with Google
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('https://www.google.com/search?&q=flower&tbm=isch', { browser: true })
// Filtering is still needed if you want this to work...
console.log(dom.images)
```
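The filtering mentioned in that comment could look something like the sketch below. `keepRemoteImages` is a hypothetical helper, not part of html-play. It only assumes that `dom.images` is an array of URL strings, as documented under DOMElement, and it keeps unique http(s) URLs while dropping anything else (for example inline `data:` URIs, if the page contains any).

```javascript
// Hypothetical helper (not part of html-play): keep unique http(s) image URLs.
function keepRemoteImages(urls) {
  const seen = new Set()
  const result = []
  for (const url of urls) {
    let parsed
    try {
      parsed = new URL(url)
    } catch {
      continue // skip strings that are not absolute URLs
    }
    const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:'
    if (isHttp && !seen.has(parsed.href)) {
      seen.add(parsed.href)
      result.push(parsed.href)
    }
  }
  return result
}

// Usage with the recipe above (assumes the `dom` returned by htmlPlay):
// console.log(keepRemoteImages(dom.images))
```

Which URLs count as "useful" depends on the page, so treat the protocol check as a starting point rather than a complete filter.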

Send requests with custom cookies.

```
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('https://httpbin.org/cookies', {
  fetch: { fetchInit: { headers: { Cookie: 'a=1; b=2;' } } },
})
// Will print { "cookies": { "a": "1", "b": "2" } }
console.log(dom.text)
```
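If the cookies live in an object rather than a hand-written string, a small serializer can build the `Cookie` header value. This is only a sketch: `toCookieHeader` is a hypothetical helper, not part of html-play, and URL-encoding the values is a common convention that assumes the server decodes them.

```javascript
// Hypothetical helper (not part of html-play): serialize an object of
// cookie names and values into a single Cookie header value.
function toCookieHeader(cookies) {
  return Object.entries(cookies)
    // Assumption: values are URL-encoded so that ';' or spaces cannot break the header.
    .map(([name, value]) => `${name}=${encodeURIComponent(String(value))}`)
    .join('; ')
}

// Usage with the recipe above:
// const { dom } = await htmlPlay('https://httpbin.org/cookies', {
//   fetch: { fetchInit: { headers: { Cookie: toCookieHeader({ a: 1, b: 2 }) } } },
// })
```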

Installation

```
npm i html-play
```

If you want to use a browser to "run" the page before parsing, you'll need to install Chromium with Playwright.

```
npm i playwright
npx playwright install chromium
```

APIs

Methods

htmlPlay

Fetch a certain URL and return its response with the parsed DOM.

Example:
```
import { htmlPlay } from 'html-play'

const { dom } = await htmlPlay('http://example.com')
```
Parameters:
- url
  
  Type: string
  
  The URL to fetch.
- options (Optional)
  
  Type: object
  
  Default: { fetch: true }
  - fetch (Optional)
    
    Type: boolean | object
    
    Default: true
    
    If set to true, we will use the Fetch API to load the requested URL. You can also specify the options for the Fetch API by passing an object here.
    - fetcher (Optional)
      
      Type: function
      
      The fetch function to use. You can pass a polyfill here.
    - fetchInit (Optional)
      
      Type: object
      
      The fetch parameters passed to the fetch function. See fetch#options. You can set HTTP headers or cookies here.
  - browser (Optional)
    
    Type: boolean | object
    
    Default: false
    
    If set to true, we will use Playwright to load the requested URL. You can also specify the options for Playwright by passing an object here.
    - browser (Optional)
      
      Type: object
      
      The Playwright Browser instance to use.
    - page (Optional)
      
      Type: object
      
      The Playwright Page instance to use.
    - launchOptions (Optional)
      
      The launchOptions passed to Playwright when launching the browser. See BrowserType#browser-type-launch.
    - beforeNavigate (Optional)
      
      A custom hook function that will be called before the page is loaded. page and browser can be accessed here as the properties of its first parameter to interact with the page.
    - afterNavigate (Optional)
      
      A custom hook function that will be called after the page is loaded. page and browser can be accessed here as the properties of its first parameter to interact with the page.

Returns:

A Promise of the Response instance (see below).
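
As a rough illustration of the `fetcher` and `afterNavigate` options described above, the sketch below builds one options object for fetch mode and one for browser mode. `withDefaultHeaders` and `waitFor` are hypothetical helpers, not part of html-play. It is an assumption that html-play calls `fetcher` with the usual `(input, init)` fetch arguments and that headers are passed as a plain object; `page.waitForSelector` is the Playwright Page method.

```javascript
// Hypothetical wrapper (not part of html-play): a fetch-compatible function that
// merges default headers into every request before delegating to a base fetch.
// Assumes init.headers, when present, is a plain object.
function withDefaultHeaders(baseFetch, defaultHeaders) {
  return (input, init = {}) => {
    const headers = { ...defaultHeaders, ...(init.headers || {}) }
    return baseFetch(input, { ...init, headers })
  }
}

// Hypothetical hook factory: returns a hook that waits for an element to appear.
// `page` is the Playwright Page found on the hook's first parameter.
function waitFor(selector) {
  return async ({ page }) => {
    await page.waitForSelector(selector)
  }
}

// Options for fetch mode and for browser mode, shaped as documented above.
const fetchModeOptions = {
  fetch: { fetcher: withDefaultHeaders(globalThis.fetch, { 'Accept-Language': 'en' }) },
}
const browserModeOptions = {
  browser: { afterNavigate: waitFor('img') },
}

// Usage:
// const { dom } = await htmlPlay('https://example.com', browserModeOptions)
```

Keeping hooks as small named functions makes them easy to reuse across calls and to test with a stub `page` object.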

Classes

Response

Properties
- url
  
  Type: string
  
  The URL of the response. If the response is redirected from another URL, the value will be the final redirected URL.
- status
  
  Type: number
  
  The HTTP status code of the response.
- content
  
  Type: string
  
  The response content as a plain string.
- dom
  
  Type: object
  
  The parsed root DOM. See DOMElement.
- json
  
  Type: object | undefined
  
  The parsed response JSON. If the response is not valid JSON, it will be undefined.
- rawBrowserResponse
  
  Type: object
  
  The raw response object returned by Playwright.
- rawFetchResponse
  
  Type: object
  
  The raw response object returned by the Fetch API.
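
The documented behaviour of `json` (parsed value, or undefined when the content is not valid JSON) can be pictured as a guarded `JSON.parse` over `content`. The function below is a sketch of that idea under that assumption, not html-play's actual implementation, and `parseJsonOrUndefined` is a hypothetical name.

```javascript
// Sketch only: html-play's real implementation may differ.
// Returns the parsed value, or undefined when `content` is not valid JSON.
function parseJsonOrUndefined(content) {
  try {
    return JSON.parse(content)
  } catch {
    return undefined
  }
}

// Usage idea: because `json` may be undefined, guard before reading from it.
// const { json } = await htmlPlay('https://httpbin.org/cookies')
// console.log(json?.cookies)
```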

DOMElement

Properties
- html
  
  Type: string
  
  The "outerHTML" of this element.
- link
  
  Type: string
  
  If the element is an anchor element, this will be the absolute URL of the element's link. Otherwise it will be an empty string.
- links
  
  Type: string[]
  
  The link URLs of all the anchor elements inside this element.
- text
  
  Type: string
  
  The text of the element with whitespace and line breaks stripped.
- rawText
  
  Type: string
  
  The original text of the element.
- image
  
  Type: string
  
  If the element is an image element, this will be the absolute URL of the element's image. Otherwise it will be an empty string.
- images
  
  Type: string[]
  
  All the image URLs inside this element.
- backgroundImage
  
  Type: string
  
  The background image source extracted from the element's inline style.
- element
  
  Type: object
  
  The corresponding JSDOM element object.
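
To make the `link`, `image` and `backgroundImage` descriptions above more concrete, here is a sketch of how such values could be derived. Both functions are hypothetical illustrations, not html-play's code: `toAbsoluteUrl` resolves a possibly relative URL against the page URL with the standard `URL` constructor, and `backgroundImageFromStyle` pulls the first `url(...)` out of an inline style string, returning it as written.

```javascript
// Resolve a possibly relative href against the page URL.
// Returns '' when there is nothing to resolve or resolution fails.
function toAbsoluteUrl(href, pageUrl) {
  if (!href) return ''
  try {
    return new URL(href, pageUrl).href
  } catch {
    return ''
  }
}

// Extract the first url(...) from an inline style string such as
// "background-image: url('/bg.png')". Returns '' if none is found.
function backgroundImageFromStyle(style) {
  const match = /url\(\s*(['"]?)(.*?)\1\s*\)/.exec(style || '')
  return match ? match[2] : ''
}

console.log(toAbsoluteUrl('/about', 'https://nodejs.org/en'))
console.log(backgroundImageFromStyle("color: red; background-image: url('/bg.png')"))
```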

Methods
- find
  
  Find the first matched child DOMElement inside this element.
  
  Parameters
  - selector
    
    Type: string
    
    The CSS selector to use.
  - options (Optional)
    
    Type: object
    - containing (Optional)
      
      Type: string
      
      Only match an element that contains the specified substring.
- findAll
  
  Find all matched child DOMElements inside this element.
  
  Parameters
  - selector
    
    Type: string
    
    The CSS selector to use.
  - options (Optional)
    
    Type: object
    - containing (Optional)
      
      Type: string
      
      Only match elements that contain the specified substring.
- getAttribute
  
  Returns the value of the element's first attribute whose qualified name is qualifiedName, or undefined if there is no such attribute.
  
  Parameters
  - qualifiedName
    
    Type: string
    
    The qualified name of the attribute to look up.
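
The difference between `find` and `findAll` with the `containing` option can be sketched over plain objects. This is an illustration only: the helper names are hypothetical, and it is an assumption that the substring is matched against each element's `text`.

```javascript
// Sketch of the `containing` option over plain objects with a `text` property.
// Assumption: the substring is matched against the element's text.
function filterContaining(elements, containing) {
  if (containing === undefined) return elements
  return elements.filter(({ text }) => text.includes(containing))
}

// findAll-like: every candidate that passes the filter.
function allContaining(elements, containing) {
  return filterContaining(elements, containing)
}

// find-like: only the first candidate that passes the filter, or undefined.
function firstContaining(elements, containing) {
  return filterContaining(elements, containing)[0]
}

// Stand-ins for elements already matched by a CSS selector.
const candidates = [{ text: 'Download' }, { text: 'Node.js docs' }, { text: 'About Node.js' }]
console.log(allContaining(candidates, 'Node').length)
console.log(firstContaining(candidates, 'Node').text)
```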

Credits

This project is heavily inspired by Requests-HTML, another fabulous library for Python.

License

MIT

arianrhodsandlot/html-play
