What
Extract structured metadata from HTML pages and return it in FetchResponse. Agents currently have to re-parse markdown to get basic page info.
Fields to add
- title: from <title> or <meta property="og:title">
- description: from <meta name="description"> or <meta property="og:description">
- language: from <html lang="...">
- canonical_url: from <link rel="canonical">
- published_date / modified_date: from <meta> tags, <time> elements, or JSON-LD
- author: from <meta name="author"> or JSON-LD
- links: extracted list of [text, href] pairs
- headings: page outline/TOC as structured data
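
A rough sketch of what the shape could look like, assuming the crate is Rust. The names PageMetadata, Link, Heading and first_non_empty are placeholders for illustration, not existing types. Dates are kept as plain strings here only to avoid assuming a date library, and the order in which fallback sources are tried is an assumption as well.

```rust
/// Hypothetical link entry: anchor text plus the href as written in the page.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Link {
    pub text: String,
    pub href: String,
}

/// Hypothetical outline entry: heading level (1 to 6) plus its text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Heading {
    pub level: u8,
    pub text: String,
}

/// Hypothetical metadata container. Every field is optional, so a page
/// with no usable metadata yields `PageMetadata::default()`.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct PageMetadata {
    pub title: Option<String>,
    pub description: Option<String>,
    pub language: Option<String>,
    pub canonical_url: Option<String>,
    // Assumption: raw strings as found in the page, not parsed dates.
    pub published_date: Option<String>,
    pub modified_date: Option<String>,
    pub author: Option<String>,
    pub links: Option<Vec<Link>>,
    pub headings: Option<Vec<Heading>>,
}

/// Returns the first candidate that is non-empty after trimming.
/// Candidates are tried in the order given, so the caller decides
/// which source wins (e.g. <title> before og:title).
pub fn first_non_empty(candidates: &[Option<&str>]) -> Option<String> {
    candidates
        .iter()
        .flatten()
        .map(|s| s.trim())
        .find(|s| !s.is_empty())
        .map(|s| s.to_string())
}
```

With something like this, title could be filled as first_non_empty(&[title_tag, og_title]), where both arguments are whatever the HTML pass already collected, so no second parse or fetch is needed.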
Why
Biggest bang for buck for agentic use — agents need this metadata universally and shouldn't have to re-parse converted content to get it.
Acceptance criteria
- New metadata field on FetchResponse with the subfields listed above
- Extracted during HTML processing (no extra fetch)
- All fields optional
- Tests with wiremock covering extraction from realistic HTML