arianrhodsandlot/html-play

HTML-Play

Fetch and parse web pages with Node.js like a boss 🕶

Features

  • Intuitive APIs for extracting useful content like links and images.
  • CSS selectors.
  • Mocked user-agent (like a real web browser).
  • Full JavaScript support.
    await htmlPlay(url, { browser: true })
    Uses Chromium under the hood by default, thanks to Playwright.

Recipes

  • Grab a list of all links and images on the page.

    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('https://nodejs.org')
    // Will print all link URLs on the page
    console.log(dom.links)
    // Will print all image URLs on the page
    console.log(dom.images)
  • Select an element with a CSS selector.

    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('https://nodejs.org')
    const intro = dom.find('#home-intro', { containing: 'Node' })
    // Will print: 'Node.js® is an open-source, cross-platform...'
    console.log(intro.text)
  • Let's grab some wallpapers from unsplash.
    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('https://unsplash.com/t/wallpapers')
    const elements = dom.findAll('img[itemprop=thumbnailUrl]')
    const images = elements.map(({ image }) => image)
    // Will print something like
    // ['https://images.unsplash.com/photo-1705834008920-b08bf6a05223', ...]
    console.log(images)
  • Let's load some news from Hacker News.
    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('https://news.ycombinator.com')
    const titles = dom.findAll('.titleline')
    const news = titles.map(({ text, link }) => [text, link])
    // Will print something like
    // [['news 1', 'http://xxx.com'], ['news 2', 'http://yyy.com'], ...]
    console.log(news)
  • Load a dynamic website, that is, one whose content is generated by JavaScript.
    // Search for images of "flower" with Google
    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('https://www.google.com/search?&q=flower&tbm=isch', { browser: true })
    // Filtering is still needed if you want this to work...
    console.log(dom.images)
  • Send requests with custom cookies.
    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('https://httpbin.org/cookies', {
      fetch: { fetchInit: { headers: { Cookie: 'a=1; b=2;' } } },
    })
    // Will print { "cookies": { "a": "1", "b": "2" } }
    console.log(dom.text)

Installation

npm i html-play

If you want to use a browser to "run" the page before parsing, you'll need to install Chromium with Playwright.

npm i playwright
npx playwright install chromium

APIs

  • Methods

    htmlPlay

    Fetch a certain URL and return its response with the parsed DOM.

    Example:
    import { htmlPlay } from 'html-play'
    
    const { dom } = await htmlPlay('http://example.com')
    Parameters:
    • url

      Type: string

      The URL to fetch.

    • options (Optional)

      Type: object

      Default: { fetch: true }

      • fetch (Optional)

        Type: boolean | object

        Default: true

        If set to true, we will use the Fetch API to load the requested URL. You can also specify the options for the Fetch API by passing an object here.

        • fetcher (Optional)

          Type: function

          The fetch function we are going to use. We can pass a polyfill here.

        • fetchInit (Optional)

          Type: object

          The fetch parameters passed to the fetch function. See fetch#options. You can set HTTP headers or cookies here.

      • browser (Optional)

        Type: boolean | object

        Default: false

        If set to true, we will use Playwright to load the requested URL. You can also specify the options for Playwright by passing an object here.

        • browser (Optional)

          Type: object

          The Playwright Browser instance to use.

        • page (Optional)

          Type: object

          The Playwright Page instance to use.

        • launchOptions (Optional)

          The launchOptions passed to Playwright when we are launching the browser. See BrowserType#browser-type-launch

        • beforeNavigate (Optional)

          A custom hook function that will be called before the page is loaded. page and browser can be accessed here as the properties of its first parameter to interact with the page.

        • afterNavigate (Optional)

          A custom hook function that will be called after the page is loaded. page and browser can be accessed here as the properties of its first parameter to interact with the page.
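The fetcher option described above accepts any fetch-compatible function, so one way to use it is to wrap an existing fetch. The following is a minimal sketch, not part of html-play: createLoggingFetcher and log are hypothetical names, and it assumes html-play calls the fetcher with the same (input, init) arguments as the standard fetch.

```javascript
// Hypothetical helper: wrap a fetch-compatible function so that every
// request is logged before it is sent.
function createLoggingFetcher(baseFetch, log = console.log) {
  return async function loggingFetch(input, init) {
    // Fall back to GET when no method is given, as the standard fetch does.
    const method = (init && init.method) || 'GET'
    // String(input) yields the URL for strings and URL objects.
    log(`${method} ${String(input)}`)
    // Forward the arguments unchanged (assumed calling convention).
    return baseFetch(input, init)
  }
}
```

Under that assumption, it could be passed as `htmlPlay(url, { fetch: { fetcher: createLoggingFetcher(fetch) } })`.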

    Returns:

    A Promise of the Response instance (see below).
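The afterNavigate hook described in the parameters above receives an object exposing page and browser, which makes it a natural place to wait for dynamic content. Here is a minimal sketch: waitForSelectorHook is a hypothetical helper name, page.waitForSelector is Playwright's Page method, and the sketch assumes html-play awaits the promise returned by the hook before parsing the page.

```javascript
// Hypothetical helper: build an afterNavigate hook that waits until an
// element matching `selector` appears on the page.
function waitForSelectorHook(selector, timeout = 10000) {
  return async ({ page }) => {
    // Rejects if nothing matches within `timeout` milliseconds.
    await page.waitForSelector(selector, { timeout })
  }
}
```

It could then be used as `htmlPlay(url, { browser: { afterNavigate: waitForSelectorHook('img') } })`.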

  • Classes

    Response

    Properties
    • url

      Type: string

      The URL of the response. If the response is redirected from another URL, the value will be the final redirected URL.

    • status

      Type: number

      The HTTP status code of the response.

    • content

      Type: string

      The response content as a plain string.

    • dom

      Type: object

      The parsed root DOM. See DOMElement.

    • json

      Type: object | undefined

      The parsed response JSON. If the response is not valid JSON, it will be undefined.

    • rawBrowserResponse

      Type: object

      The raw response object returned by Playwright.

    • rawFetchResponse

      Type: object

      The raw response object returned by the Fetch API.
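The Response properties listed above can be combined to handle JSON and HTML endpoints uniformly. This is a small sketch under stated assumptions: readBody is a hypothetical helper, and it assumes htmlPlay resolves (rather than rejects) for HTTP error statuses, so the status check is done by hand.

```javascript
// Hypothetical helper: return the parsed JSON when available, otherwise the
// raw content string. Throws for HTTP error statuses.
function readBody(response) {
  if (response.status >= 400) {
    throw new Error(`Request to ${response.url} failed with status ${response.status}`)
  }
  // `json` is documented as undefined when the body is not valid JSON.
  return response.json !== undefined ? response.json : response.content
}
```

For example, `readBody(await htmlPlay('https://httpbin.org/cookies'))` would yield the parsed object if the body is valid JSON.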

    DOMElement

    Properties
    • html

      Type: string

      The "outerHTML" of this element.

    • link

      Type: string

      If the element is an anchor element, this will be the absolute URL of the element's link; otherwise, it will be an empty string.

    • links

      Type: string[]

      The URLs of all the anchor elements inside this element.

    • text

      Type: string

      The text of the element with whitespace and line breaks stripped.

    • rawText

      Type: string

      The original text of the element.

    • image

      Type: string

      If the element is an image embed element, this will be the absolute URL of the element's image; otherwise, it will be an empty string.

    • images

      Type: string[]

      All the image URLs inside this element.

    • backgroundImage

      Type: string

      The background image source extracted from the element's inline style.

    • element

      Type: object

      The corresponding JSDOM element object.
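Since links and images are plain string arrays, they can be post-processed with ordinary array code. The sketch below filters a list such as dom.links down to unique URLs on one host. linksForHost is a hypothetical helper, not part of html-play, and it relies only on the standard URL class.

```javascript
// Hypothetical helper: keep unique URLs whose hostname matches `hostname`.
function linksForHost(links, hostname) {
  const unique = new Set()
  for (const link of links) {
    let url
    try {
      url = new URL(link)
    } catch {
      // Skip anything that is not a parseable absolute URL.
      continue
    }
    if (url.hostname === hostname) unique.add(url.href)
  }
  return [...unique]
}
```

A call such as `linksForHost(dom.links, 'nodejs.org')` would drop duplicates and off-site links.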

    Methods
    • find

      Find the first matched child DOMElement inside this element.

      Parameters
      • selector

        Type: string

        The CSS selector to use.

      • options (Optional)

        Type: object

        • containing (Optional)

          Type: string

          Check if the element contains the specified substring.


    • findAll

      Find all matched child DOMElements inside this element.

      Parameters
      • selector

        Type: string

        The CSS selector to use.

      • options (Optional)

        Type: object

        • containing (Optional)

          Type: string

          Check if the element contains the specified substring.
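The find and findAll methods above can be wrapped in small helpers that return text directly. This is a sketch with hypothetical names (firstText, allTexts). What find returns when nothing matches is not documented here, so the sketch assumes null or undefined and guards with optional chaining.

```javascript
// Only pass `containing` when a substring was actually given.
function buildOptions(containing) {
  return containing === undefined ? {} : { containing }
}

// Hypothetical helper: text of the first match, or undefined without a match.
function firstText(root, selector, containing) {
  const match = root.find(selector, buildOptions(containing))
  return match?.text
}

// Hypothetical helper: texts of all matches, in document order.
function allTexts(root, selector, containing) {
  return root.findAll(selector, buildOptions(containing)).map((element) => element.text)
}
```

For instance, `allTexts(dom, '.titleline', 'Node')` would collect the titles that mention "Node".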


    • getAttribute

      Parameters
      • qualifiedName

        Type: string

        Returns the element's first attribute whose qualified name is qualifiedName, or undefined if there is no such attribute.
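Because getAttribute reports a missing attribute as undefined, collecting several attributes at once needs a small guard. A minimal sketch follows. pickAttributes is a hypothetical helper, and it also skips null in case the underlying DOM value leaks through.

```javascript
// Hypothetical helper: collect the attributes in `names` that are present.
function pickAttributes(element, names) {
  const picked = {}
  for (const name of names) {
    const value = element.getAttribute(name)
    // Skip missing attributes (undefined per the docs; null as a precaution).
    if (value !== undefined && value !== null) picked[name] = value
  }
  return picked
}
```

For example, `pickAttributes(dom.find('img'), ['src', 'alt', 'width'])` would return only the attributes that the image actually has.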

Credits

This project is heavily inspired by Requests-HTML, another fabulous library for Python.

License

MIT