This repository was archived by the owner on Oct 2, 2020. It is now read-only.

Document parsing and returning content (scraping) using the HTMLRewriter #658

@kristianfreeman

in the workers video that wes bos put out yesterday, he was looking for a tool to parse some HTML coming in via response, and return a subsection of it as the new response inside of the worker. he ultimately did some manual string parsing, but it turns out this would have been a perfect use-case for htmlrewriter's parser!

i synced up with wes and gave him my example code, which sets up an HTMLRewriter that looks for an img with a specific id and stores the img's src on the handler instance for that request. if an image was found, the worker fetches and returns it directly; otherwise, it falls back to returning the fetched page:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

const BASE = 'wes.io'

function buildURL(url) {
  const urlObject = new URL(url)
  const newURL = urlObject.href.replace(urlObject.host, BASE)
  return newURL
}

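class ImageHandler {
  // called by HTMLRewriter for each element matching the selector
  element(element) {
    // getAttribute returns null when the attribute is missing
    this.foundImage = element.getAttribute('src')
  }
}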

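async function handleRequest(request) {
  // e.g. 'https://host/abc/content' splits into ['https:', '', 'host', 'abc', 'content']
  const parts = request.url.split('/')

  // anything that isn't a '<drop>/content' style URL is proxied through untouched
  if (parts.length !== 5 || !parts[4].includes('content')) {
    return fetch(buildURL(request.url))
  }

  // strip the trailing segment to get the HTML page for the drop
  const dropURL = parts.slice(0, 4).join('/')
  const resp = await fetch(buildURL(dropURL))
  const handler = new ImageHandler()
  // handlers only run as the body streams through the rewriter, so read it fully
  const body = await new HTMLRewriter()
    .on('img#image-item', handler)
    .transform(resp)
    .text()

  if (handler.foundImage) {
    // an image was parsed out of the page: fetch and return it directly
    return fetch(handler.foundImage)
  } else {
    // no image found: return the page itself
    return new Response(body, resp)
  }
}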

this is a super useful feature of HTMLRewriter that we don't have any docs for! i synced up with @harrishancock to make sure there wasn't a better way to do this. long-term, maybe we can propose an API improvement to handle this use-case, but for now, this should probably be added as a template and covered more generally in the reference part of the docs ("Parsing and returning data from a response using HTMLRewriter")
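
for the template, a more general version of the same pattern might look something like the sketch below. the names `AttributeCollector` and `scrapeAttribute` are made up for illustration and aren't part of any existing API; the only HTMLRewriter calls used are `on`, `transform`, and `getAttribute`, same as in the example above. it assumes the caller just wants the attribute values and doesn't care about the transformed body:

```javascript
// Hypothetical helper (not an existing API): collects one attribute
// from every element that the selector matches.
class AttributeCollector {
  constructor(attribute) {
    this.attribute = attribute
    this.values = []
  }

  // HTMLRewriter calls element() once per matching element.
  element(element) {
    const value = element.getAttribute(this.attribute)
    // getAttribute returns null when the attribute is absent; skip those.
    if (value !== null) {
      this.values.push(value)
    }
  }
}

// Hypothetical helper: runs the rewriter only for its parsing side effects.
// HTMLRewriter is a Workers runtime global, so this function only works
// inside a worker.
async function scrapeAttribute(response, selector, attribute) {
  const collector = new AttributeCollector(attribute)
  // Handlers fire while the body streams through the rewriter, so the
  // transformed body has to be consumed even though it is discarded here.
  await new HTMLRewriter()
    .on(selector, collector)
    .transform(response)
    .arrayBuffer()
  return collector.values
}
```

inside a fetch handler this could be called as `await scrapeAttribute(await fetch(url), 'img#image-item', 'src')`, which resolves to an array that's empty when nothing matched. note that the values come back exactly as written in the HTML, so a relative `src` would still need resolving against the page URL before being fetched.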
