Fetch and parse web pages with Node.js like a boss 🕶
- Intuitive APIs for extracting useful content such as links and images.
- CSS selectors.
- Mocked user-agent (like a real web browser).
- Full JavaScript support. Pages can be rendered in a real browser, using Chromium under the hood by default, thanks to Playwright.

      await htmlPlay(url, { browser: true })
- Grab a list of all links and images on the page.

      import { htmlPlay } from 'html-play'

      const { dom } = await htmlPlay('https://nodejs.org')

      // Will print all link URLs on the page
      console.log(dom.links)
      // Will print all image URLs on the page
      console.log(dom.images)
- Select an element with a CSS selector.

      import { htmlPlay } from 'html-play'

      const { dom } = await htmlPlay('https://nodejs.org')
      const intro = dom.find('#home-intro', { containing: 'Node' })

      // Will print: 'Node.js® is an open-source, cross-platform...'
      console.log(intro.text)
More recipes:
- Let's grab some wallpapers from Unsplash.

      import { htmlPlay } from 'html-play'

      const { dom } = await htmlPlay('https://unsplash.com/t/wallpapers')
      const elements = dom.findAll('img[itemprop=thumbnailUrl]')
      const images = elements.map(({ image }) => image)

      // Will print something like
      // ['https://images.unsplash.com/photo-1705834008920-b08bf6a05223', ...]
      console.log(images)
- Let's load the latest stories from Hacker News.

      import { htmlPlay } from 'html-play'

      const { dom } = await htmlPlay('https://news.ycombinator.com')
      const titles = dom.findAll('.titleline')
      const news = titles.map(({ text, link }) => [text, link])

      // Will print something like
      // [['news 1', 'http://xxx.com'], ['news 2', 'http://yyy.com'], ...]
      console.log(news)
- Load a dynamic website, that is, one whose content is generated by JavaScript.

      // Search for images of "flower" with Google
      import { htmlPlay } from 'html-play'

      const { dom } = await htmlPlay('https://www.google.com/search?&q=flower&tbm=isch', { browser: true })

      // Filtering is still needed if you want this to work...
      console.log(dom.images)
- Send requests with custom cookies.

      import { htmlPlay } from 'html-play'

      const { dom } = await htmlPlay('https://httpbin.org/cookies', {
        fetch: { fetchInit: { headers: { Cookie: 'a=1; b=2;' } } },
      })

      // Will print { "cookies": { "a": "1", "b": "2" } }
      console.log(dom.text)
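The Cookie header in the recipe above is written by hand. As a minimal sketch, a small helper could build it from a plain object instead. Note that toCookieHeader is a hypothetical name used for illustration and is not part of html-play; percent-encoding the values is also an assumption about what the target server expects.

```javascript
// Hypothetical helper (not part of html-play): serialize a plain object
// into a Cookie request header value such as 'a=1; b=2'.
function toCookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${encodeURIComponent(String(value))}`)
    .join('; ')
}

// The result could be passed as fetchInit.headers.Cookie, as in the recipe above:
// htmlPlay(url, { fetch: { fetchInit: { headers: { Cookie: toCookieHeader({ a: 1, b: 2 }) } } } })
console.log(toCookieHeader({ a: 1, b: 2 })) // prints: a=1; b=2
```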
Install the package:

    npm i html-play

If you want to use a browser to "run" the page before parsing, you'll also need to install Playwright and its Chromium build.

    npm i playwright
    npx playwright install chromium
htmlPlay(url, options)

Fetch a URL and return its response along with the parsed DOM.

    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('http://example.com')
- url (Type: string): The URL to fetch.
- options (Optional. Type: object. Default: { fetch: true })
  - fetch (Optional. Type: boolean | object. Default: true): If set to true, the Fetch API is used to load the requested URL. You can also pass an object here to specify options for the Fetch API:
    - fetcher (Optional. Type: function): The fetch function to use. A polyfill can be passed here.
    - fetchInit (Optional. Type: object): The init parameters passed to the fetch function. See fetch#options. You can set HTTP headers or cookies here.
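To illustrate how fetcher and fetchInit might work together, here is a minimal sketch of a wrapper that adds default headers to every request. withDefaultHeaders is a hypothetical helper, not part of html-play, and the sketch assumes that the fetcher is called with the same (input, init) arguments as the standard fetch function and that headers are given as plain objects.

```javascript
// Hypothetical wrapper (not part of html-play): returns a fetch-compatible
// function that merges default headers into every request.
// Assumption: the fetcher option is called like fetch(input, init).
function withDefaultHeaders(fetcher, defaults) {
  return (input, init = {}) =>
    fetcher(input, {
      ...init,
      // Per-request headers win over the defaults.
      headers: { ...defaults, ...(init.headers || {}) },
    })
}

// Sketch of an options object that could be passed to htmlPlay(url, options).
const options = {
  fetch: {
    fetcher: withDefaultHeaders(globalThis.fetch, { 'Accept-Language': 'en' }),
    fetchInit: { headers: { Cookie: 'a=1' } },
  },
}
```

Wrapping the fetch function keeps cross-cutting concerns such as default headers in one place, while fetchInit stays focused on per-call settings.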
  - browser (Optional. Type: boolean | object. Default: false): If set to true, Playwright is used to load the requested URL. You can also pass an object here to specify options for Playwright:
    - browser (Optional. Type: object): The Playwright Browser instance to use.
    - page (Optional. Type: object): The Playwright Page instance to use.
    - launchOptions (Optional): The launchOptions passed to Playwright when launching the browser. See BrowserType#browser-type-launch.
    - beforeNavigate (Optional): A custom hook function that is called before the page is loaded. page and browser are available as properties of its first parameter, so the hook can interact with the page.
    - afterNavigate (Optional): A custom hook function that is called after the page is loaded. page and browser are available as properties of its first parameter, so the hook can interact with the page.
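The two hooks can be sketched as follows. This is only an illustration under stated assumptions: browserOptions is a hypothetical helper name, the '#results' selector in the usage comment is made up, and the sketch assumes html-play waits for a hook that returns a promise. The only Playwright calls used are page.route() and page.waitForSelector().

```javascript
// Hypothetical helper (not part of html-play): build the object passed as
// the browser option. Assumption: async hooks are awaited by html-play.
function browserOptions(readySelector) {
  return {
    launchOptions: { headless: true },
    // Before navigation: abort image requests to speed up loading.
    beforeNavigate: async ({ page }) => {
      await page.route('**/*.{png,jpg,jpeg,gif,webp}', (route) => route.abort())
    },
    // After navigation: wait until the JavaScript-rendered content exists.
    afterNavigate: async ({ page }) => {
      await page.waitForSelector(readySelector)
    },
  }
}

// Usage sketch: htmlPlay(url, { browser: browserOptions('#results') })
```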
Returns: A Promise of the Response instance (see below).
Response

- url (Type: string): The URL of the response. If the response was redirected from another URL, this is the final URL after redirection.
- status (Type: number): The HTTP status code of the response.
- content (Type: string): The response content as a plain string.
- dom (Type: object): The parsed root DOM. See DOMElement.
- json (Type: object | undefined): The parsed response JSON. If the response is not valid JSON, it is undefined.
- rawBrowserResponse (Type: object): The raw response object returned by Playwright.
- rawFetchResponse (Type: object): The raw response object returned by the Fetch API.
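As a minimal sketch of how these properties might be combined, the helper below condenses a response into a small summary, preferring the parsed JSON over the raw text. summarize is a hypothetical name and not part of html-play; it only reads the url, status, json, and content properties described above.

```javascript
// Hypothetical helper (not part of html-play): turn a response into a
// small summary, preferring parsed JSON over the raw text.
function summarize(response) {
  const ok = response.status >= 200 && response.status < 300
  return {
    url: response.url, // final URL after redirects
    ok,
    // json is undefined when the body is not valid JSON
    body: response.json !== undefined ? response.json : response.content,
  }
}

// Usage sketch: const summary = summarize(await htmlPlay('https://httpbin.org/cookies'))
```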
DOMElement

- html (Type: string): The "outerHTML" of this element.
- link (Type: string): If the element is an anchor element, this is the absolute URL of the element's link; otherwise it is an empty string.
- links (Type: string[]): The URLs of all the anchor elements inside this element.
- text (Type: string): The text of the element with whitespace and line breaks stripped.
- rawText (Type: string): The original text of the element.
- image (Type: string): If the element is an image embed element, this is the absolute URL of the element's image; otherwise it is an empty string.
- images (Type: string[]): All the image URLs inside this element.
- backgroundImage (Type: string): The background image source extracted from the element's inline style.
- element (Type: object): The corresponding JSDOM element object.
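Because links is a plain array of URL strings, it can be post-processed with ordinary JavaScript. The sketch below keeps only unique links that share the page's origin. sameOriginLinks is a hypothetical helper, not part of html-play, and it assumes the links are absolute URLs (anything that fails to parse is skipped).

```javascript
// Hypothetical helper (not part of html-play): keep unique links that
// point to the same origin as the page itself.
function sameOriginLinks(links, pageUrl) {
  const origin = new URL(pageUrl).origin
  const unique = new Set()
  for (const link of links) {
    try {
      if (new URL(link).origin === origin) unique.add(link)
    } catch {
      // Ignore values that are not absolute URLs.
    }
  }
  return [...unique]
}

// Usage sketch: sameOriginLinks(dom.links, response.url)
```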
- find: Find the first matched child DOMElement inside this element.
  - selector (Type: string): The CSS selector to use.
  - options (Optional. Type: object)
    - containing (Optional. Type: string): Only match elements that contain the specified substring.
- findAll: Find all matched child DOMElements inside this element.
  - selector (Type: string): The CSS selector to use.
  - options (Optional. Type: object)
    - containing (Optional. Type: string): Only match elements that contain the specified substring.
- getAttribute: Returns the element's first attribute whose qualified name is qualifiedName, or undefined if there is no such attribute.
  - qualifiedName (Type: string)
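The three methods can be combined in a few lines. The following is a minimal sketch rather than a recommended pattern: collectEntries is a hypothetical helper, not part of html-play, and it works with any object that offers findAll and whose results offer text and getAttribute as described above.

```javascript
// Hypothetical helper (not part of html-play): collect the text and href of
// every element matching a selector, optionally filtered by a substring.
function collectEntries(dom, selector, keyword) {
  const options = keyword ? { containing: keyword } : undefined
  return dom
    .findAll(selector, options)
    .map((el) => ({ text: el.text, href: el.getAttribute('href') }))
}

// Usage sketch: collectEntries(dom, 'a', 'Node')
```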
This project is heavily inspired by Requests-HTML, another fabulous library for Python.