feat: content-focused extraction — strip nav/footer/sidebar boilerplate #72

@chaliy

What

Add boilerplate stripping to reduce token waste for agents. On most web pages, the bulk of the markup (often around 80%) is navigation, footers, sidebars, and ads rather than content.

Approach

  • Strip <nav>, <footer>, and <aside>, plus elements with role="navigation", role="banner", or role="contentinfo"
  • Prioritize <main>, <article>, [role="main"] content when present
  • Add content_focus option to FetchRequest: "main" (strip boilerplate) vs "full" (current behavior, default)
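
The approach above can be sketched in Python using only the standard library's html.parser. This is an illustration under stated assumptions, not the project's implementation: the helper names (`_RegionFilter`, `_filter`, `focus_content`) are hypothetical, regions are tracked by counting nested tags of the same name, and comments and doctype declarations are dropped on re-emit.

```python
from html import escape
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"nav", "footer", "aside"}
BOILERPLATE_ROLES = {"navigation", "banner", "contentinfo"}
MAIN_TAGS = {"main", "article"}
MAIN_ROLES = {"main"}


class _RegionFilter(HTMLParser):
    """Re-emit HTML, either dropping matched regions or keeping only them."""

    def __init__(self, tags, roles, keep_matched):
        super().__init__(convert_charrefs=True)
        self.tags, self.roles, self.keep_matched = tags, roles, keep_matched
        self.out = []
        self.region_tag = None  # tag name that opened the current matched region
        self.depth = 0          # nesting depth of that tag name inside the region
        self.matched = False    # True once any region has matched

    def _emitting(self):
        inside = self.region_tag is not None
        return inside if self.keep_matched else not inside

    def _matches(self, tag, attrs):
        role = (dict(attrs).get("role") or "").lower()
        return tag in self.tags or role in self.roles

    def handle_starttag(self, tag, attrs):
        if self.region_tag is None and self._matches(tag, attrs):
            self.region_tag, self.depth, self.matched = tag, 1, True
            if self.keep_matched:
                self.out.append(self.get_starttag_text())
            return
        if self.region_tag == tag:
            self.depth += 1
        if self._emitting():
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a region; pass them through if emitting.
        if self._emitting():
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.region_tag == tag:
            self.depth -= 1
            if self.depth == 0:
                if self.keep_matched:
                    self.out.append(f"</{tag}>")
                self.region_tag = None
                return
        if self._emitting():
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._emitting():
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def _filter(html, tags, roles, keep_matched):
    parser = _RegionFilter(tags, roles, keep_matched)
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.matched


def focus_content(html: str, content_focus: str = "full") -> str:
    """Apply the content_focus option to raw HTML before conversion."""
    if content_focus == "full":
        return html  # current behavior: leave the page untouched
    if content_focus != "main":
        raise ValueError(f"unknown content_focus: {content_focus!r}")
    # Prefer <main>, <article>, or [role="main"] when the page has one.
    focused, found = _filter(html, MAIN_TAGS, MAIN_ROLES, keep_matched=True)
    if found:
        html = focused
    # Then drop boilerplate elements from whatever remains.
    stripped, _ = _filter(html, BOILERPLATE_TAGS, BOILERPLATE_ROLES, keep_matched=False)
    return stripped
```

Running the strip pass after the main-content pass also removes an <aside> nested inside <main>. Pages without semantic containers skip the first pass and only lose the boilerplate elements.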

Why

Stripping boilerplate yields large token savings for agents. A typical news article page is 80%+ boilerplate, which wastes LLM context.

Acceptance criteria

  • New content_focus field on FetchRequest
  • When "main": strip nav/footer/aside/boilerplate elements before conversion
  • When "full" or omitted: current behavior unchanged
  • Tests covering pages with and without semantic HTML structure
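
As a rough sketch of how the first three criteria might look on the request type, here is a Python dataclass. Only `FetchRequest` and `content_focus` come from this issue; the `url` field, the `ContentFocus` alias, and the `should_strip_boilerplate` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Allowed values for the new option (alias name is hypothetical).
ContentFocus = Literal["main", "full"]


@dataclass(frozen=True)
class FetchRequest:
    url: str  # assumed field; the real request type may differ
    content_focus: ContentFocus = "full"  # omitted means current behavior

    def __post_init__(self):
        # Reject values other than "main" or "full" early.
        if self.content_focus not in get_args(ContentFocus):
            raise ValueError(f"unknown content_focus: {self.content_focus!r}")


def should_strip_boilerplate(request: FetchRequest) -> bool:
    """Boilerplate is stripped only when the caller asks for "main"."""
    return request.content_focus == "main"
```

Defaulting to "full" keeps existing callers unchanged, which matches the third criterion.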

Labels: enhancement (New feature or request)
