feat: structured metadata extraction (title, description, language, canonical URL, author, links, headings) #71

@chaliy

What

Extract structured metadata from HTML pages and return it in FetchResponse. Agents currently have to re-parse markdown to get basic page info.

Fields to add

  • title — from <title> or <meta property="og:title">
  • description — from <meta name="description"> or og:description
  • language — from <html lang="...">
  • canonical_url — from <link rel="canonical">
  • published_date / modified_date — from <meta> or <time> elements, JSON-LD
  • author — from <meta name="author"> or JSON-LD
  • links — extracted list of [text, href] pairs
  • headings — outline/TOC as structured data
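
As a rough illustration of how several of these fields could be collected in one pass over the HTML, here is a minimal sketch using Python's standard `html.parser`. It is written in Python purely for illustration; the project's own language and HTML parser may differ. The names `MetadataParser` and `extract_metadata` are hypothetical, as is the choice to prefer `<title>` and `<meta name="description">` over their Open Graph counterparts. Dates and JSON-LD are left out, and a link nested inside a heading is not recorded separately.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical single-pass collector for a subset of the proposed fields."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.meta = {"title": None, "description": None, "language": None,
                     "canonical_url": None, "author": None,
                     "links": [], "headings": []}
        self.og = {}           # Open Graph values, used only as fallbacks
        self._capture = None   # tag whose text is currently being collected
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html":
            self.meta["language"] = a.get("lang")
        elif tag == "meta":
            key = a.get("name") or a.get("property")
            if key in ("description", "author"):
                self.meta[key] = a.get("content")
            elif key in ("og:title", "og:description"):
                self.og[key] = a.get("content")
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.meta["canonical_url"] = a.get("href")
        elif self._capture is None and (
            tag == "title" or tag in self.HEADING_TAGS
            or (tag == "a" and "href" in a)
        ):
            # Start collecting text until the matching end tag.
            self._capture, self._buffer = tag, []
            self._href = a.get("href")

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.meta["title"] = text
        elif tag == "a":
            self.meta["links"].append((text, self._href))
        else:
            self.meta["headings"].append({"level": int(tag[1]), "text": text})
        self._capture = None


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    # Assumed precedence: page-level tags first, Open Graph as fallback.
    meta["title"] = meta["title"] or parser.og.get("og:title")
    meta["description"] = meta["description"] or parser.og.get("og:description")
    return meta


page = """<html lang="en"><head>
<title>Example Page</title>
<meta name="description" content="A short summary.">
<meta name="author" content="Jane Doe">
<link rel="canonical" href="https://example.com/page">
</head><body>
<h1>Intro</h1>
<p>See <a href="/docs">the docs</a>.</p>
<h2>Details</h2>
</body></html>"""
result = extract_metadata(page)
```

Because everything is gathered while the document is parsed, no second fetch or re-parse of the converted markdown is needed, which is the point of the request.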

Why

Biggest bang for the buck for agentic use: agents universally need this metadata and shouldn't have to re-parse converted content to get it.

Acceptance criteria

  • New metadata field on FetchResponse with the subfields listed above
  • Extracted during HTML processing (no extra fetch)
  • All fields optional
  • Tests with wiremock covering extraction from realistic HTML
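
One possible shape for the response that would satisfy the "all fields optional" criterion is sketched below with Python dataclasses, again only as an illustration and not the project's actual types. `PageMetadata` and `Heading` are hypothetical names, the `url` and `content` fields on `FetchResponse` are placeholders for whatever the real response already carries, and representing dates as strings is an assumption.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class Heading:
    level: int   # 1 for <h1>, 2 for <h2>, ...
    text: str


@dataclass
class PageMetadata:
    # Every field defaults to None so that missing data is never an error.
    title: Optional[str] = None
    description: Optional[str] = None
    language: Optional[str] = None
    canonical_url: Optional[str] = None
    published_date: Optional[str] = None   # assumed: kept as the raw string
    modified_date: Optional[str] = None
    author: Optional[str] = None
    links: Optional[List[Tuple[str, str]]] = None   # (text, href) pairs
    headings: Optional[List[Heading]] = None


@dataclass
class FetchResponse:
    url: str       # placeholder for existing response fields
    content: str   # placeholder for existing response fields
    metadata: Optional[PageMetadata] = None   # absent for non-HTML responses


response = FetchResponse(
    url="https://example.com/page",
    content="# Example Page",
    metadata=PageMetadata(title="Example Page", language="en"),
)
as_dict = asdict(response)
```

Keeping `metadata` itself optional, in addition to its subfields, lets non-HTML responses omit it entirely.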

Labels: enhancement
