Document parsing and returning content (scraping) using the HTMLRewriter

in the [workers video that wes bos put out yesterday](https://www.youtube.com/watch?v=48NWaLkDcME), he was looking for a tool to parse some HTML coming in via a response and return a subsection of it as the new response from the worker. he ultimately did some manual string parsing, but it turns out this would have been a perfect use case for HTMLRewriter's parser!

i synced up with wes and gave him my example code. it sets up an HTMLRewriter that looks for an `img` with a specific id and stores that image's `src` on the handler instance, which acts as per-request state inside the worker. if a `src` was captured (an image has been parsed), the worker fetches that image and returns it directly; otherwise, it falls back to returning the page response:

```js
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

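// Host that incoming requests are proxied to
const BASE = 'wes.io'

// Swap the host of the incoming URL for BASE, keeping the path and query
function buildURL(url) {
  const urlObject = new URL(url)
  const newURL = urlObject.href.replace(urlObject.host, BASE)
  return newURL
}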

// Element handler that remembers the src of the matched <img>
class ImageHandler {
  element(element) {
    this.foundImage = element.getAttribute('src')
  }
}

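async function handleRequest(request) {
  // 'https://host/<id>/content'.split('/') gives ['https:', '', 'host', '<id>', 'content']
  const parts = request.url.split('/')

  // Only URLs with exactly two path segments, the last containing 'content',
  // are scraped; everything else is proxied through untouched
  if (parts.length !== 5 || !parts[4].includes('content')) {
    return fetch(buildURL(request.url))
  }

  // Drop the trailing segment to get the page that embeds the image
  const dropURL = parts.slice(0, 4).join('/')
  const resp = await fetch(buildURL(dropURL))
  const handler = new ImageHandler()
  // Handlers only run while the transformed body is being read, so consume it
  // with .text() before checking handler.foundImage
  const body = await new HTMLRewriter()
    .on('img#image-item', handler)
    .transform(resp)
    .text()

  if (handler.foundImage) {
    return fetch(handler.foundImage)
  }
  // No matching image: return the page itself
  return new Response(body, resp)
}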
```
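one thing the example above glosses over: `fetch(handler.foundImage)` only works if the `src` is an absolute URL. if the page uses a relative `src`, it would need to be resolved against the page URL first. a minimal sketch, assuming a hypothetical `resolveImageURL` helper (not part of the code i gave wes):

```js
// Hypothetical helper: resolve a scraped src against the URL of the page it
// came from. Returns null if the result is not a valid URL.
function resolveImageURL(src, pageURL) {
  try {
    // The second argument is only used when src is relative
    return new URL(src, pageURL).href
  } catch (err) {
    return null
  }
}
```

in the worker this could be used as `fetch(resolveImageURL(handler.foundImage, resp.url))`, assuming `resp.url` holds the URL of the fetched page.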

this is a super useful feature of HTMLRewriter that we don't have any docs for! i synced up with @harrishancock to make sure there wasn't a better way to do this. long-term we could maybe propose an API improvement to handle this use case, but for now, this should probably be added as a template and covered more generally in the reference part of the docs ("Parsing and returning data from a response using HTMLRewriter").
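as a rough idea of what a more general template could look like, here is a sketch that collects an attribute from every matched element and returns the results as JSON instead of fetching a single image. `AttributeCollector` and `scrapeAttribute` are hypothetical names for illustration, and the JSON shape is just an assumption:

```js
// Hypothetical handler: collects one attribute from every matched element
class AttributeCollector {
  constructor(attribute) {
    this.attribute = attribute
    this.values = []
  }

  element(element) {
    // getAttribute returns null when the attribute is missing
    const value = element.getAttribute(this.attribute)
    if (value !== null) this.values.push(value)
  }
}

// Hypothetical helper: run the collector over a response and return JSON.
// HTMLRewriter is only available inside the Workers runtime.
async function scrapeAttribute(resp, selector, attribute) {
  const collector = new AttributeCollector(attribute)
  // Handlers only run as the transformed body is read, so read it to the end
  // and discard the output
  await new HTMLRewriter()
    .on(selector, collector)
    .transform(resp)
    .arrayBuffer()

  return new Response(JSON.stringify({ values: collector.values }), {
    headers: { 'content-type': 'application/json' },
  })
}
```

for example, `scrapeAttribute(await fetch(url), 'img', 'src')` would respond with the `src` of every image on the page. the same caveat applies as in the original example: nothing is collected until the transformed body has been consumed.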
