
feat(convert): add structured metadata extraction from HTML pages #77

Merged
chaliy merged 1 commit into main from claude/issue-71-metadata-extraction
Mar 27, 2026


chaliy commented Mar 27, 2026

What

Add a PageMetadata struct and extract it from HTML pages during fetch. The structured metadata is returned alongside the converted content in FetchResponse.metadata.

Why

Agents currently have to re-parse the converted markdown to get basic page info such as title, description, and links. This is the single biggest improvement for agentic use, because agents need this metadata for essentially every page they fetch.

How

  • New PageMetadata struct with: title, description, language, canonical_url, author, published_date, modified_date, links (Vec), headings outline
  • extract_metadata() — single-pass HTML parser for meta tags, title, links, language, canonical URL
  • extract_headings() — separate pass for heading outline extraction
  • Both are integrated into DefaultFetcher; metadata is populated when HTML content is detected
  • OG tags override basic HTML tags (og:title > title, og:description > meta description)
  • DoS limits: max 500 links, max 200 headings per page
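
For readers who want a concrete picture of the precedence and limit rules listed above, here is a minimal sketch. It is not the code from this PR: the field types, the Heading and RawFields types, the constant names, and build_metadata() are all assumptions made for illustration. Only the field names, the OG-over-HTML precedence, and the 500/200 caps come from the description.

```rust
// Illustrative sketch only; types and helper names are assumptions, not the PR's code.

/// Caps taken from the description above (constant names are hypothetical).
const MAX_LINKS: usize = 500;
const MAX_HEADINGS: usize = 200;

/// Hypothetical shape for one entry of the headings outline.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Heading {
    pub level: u8,
    pub text: String,
}

/// Field names follow the PR description; the concrete types are assumed.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PageMetadata {
    pub title: Option<String>,
    pub description: Option<String>,
    pub language: Option<String>,
    pub canonical_url: Option<String>,
    pub author: Option<String>,
    pub published_date: Option<String>,
    pub modified_date: Option<String>,
    pub links: Vec<String>, // element type is an assumption
    pub headings: Vec<Heading>,
}

/// Hypothetical intermediate values a parser pass might have collected.
#[derive(Debug, Default)]
pub struct RawFields {
    pub title_tag: Option<String>,
    pub og_title: Option<String>,
    pub meta_description: Option<String>,
    pub og_description: Option<String>,
    pub links: Vec<String>,
    pub headings: Vec<Heading>,
}

/// Applies the two rules from the list: OG values win over plain HTML values,
/// and links/headings are truncated to the stated caps.
pub fn build_metadata(raw: RawFields) -> PageMetadata {
    PageMetadata {
        // og:title > <title>
        title: raw.og_title.or(raw.title_tag),
        // og:description > <meta name="description">
        description: raw.og_description.or(raw.meta_description),
        // Keep at most MAX_LINKS / MAX_HEADINGS entries, dropping the rest.
        links: raw.links.into_iter().take(MAX_LINKS).collect(),
        headings: raw.headings.into_iter().take(MAX_HEADINGS).collect(),
        ..Default::default()
    }
}
```

In this sketch the caps are applied after collection for brevity; a parser that enforces them while scanning would avoid buffering oversized pages in the first place.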

Risk

  • Low — additive change, existing behavior unchanged
  • New optional field on FetchResponse, backward-compatible
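
As a rough illustration of what an optional field means for callers, the sketch below shows one way a consumer might read the metadata without assuming it is present. The url and content fields, the trimmed-down PageMetadata, and the page_title() helper are hypothetical; only the FetchResponse.metadata field and its optional nature come from the description.

```rust
// Illustrative sketch only; everything except `metadata` being optional is assumed.

/// Trimmed-down stand-in for the PR's PageMetadata (only one field shown).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PageMetadata {
    pub title: Option<String>,
}

/// Hypothetical response shape; `url` and `content` are placeholders.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FetchResponse {
    pub url: String,
    pub content: String,
    /// Assumed to be None when the fetched content is not HTML.
    pub metadata: Option<PageMetadata>,
}

/// Returns the page title if metadata was extracted and a title was found.
pub fn page_title(resp: &FetchResponse) -> Option<&str> {
    resp.metadata.as_ref()?.title.as_deref()
}
```

A caller that never looks at metadata is unaffected by the field, and a caller that does look at it handles the non-HTML case through the Option rather than a sentinel value.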

Checklist

  • Unit tests passed (16 tests covering all metadata fields)
  • Clippy clean
  • Docs build clean
  • Specs are up to date

Closes #71

Add PageMetadata struct with title, description, language, canonical_url,
author, published/modified dates, links, and headings outline. Metadata
is extracted during HTML processing in DefaultFetcher and returned in
FetchResponse.metadata field.

Closes #71
chaliy merged commit 26f1347 into main Mar 27, 2026
10 checks passed
chaliy deleted the claude/issue-71-metadata-extraction branch March 27, 2026 02:55

Linked issue: feat: structured metadata extraction (title, description, language, canonical URL, author, links, headings)