chore(tools): Convert HTML to markdown in web_fetch#475

Merged
JeanMertz merged 6 commits into main from prr46 on Mar 26, 2026

@JeanMertz (Collaborator)

Previously, `web_fetch` returned raw HTML to the LLM. Raw HTML is verbose and hard to process: it is full of tags, scripts, and styles that consume tokens without adding useful information.

The tool now converts HTML responses to Markdown using `htmd`, stripping `<script>`, `<style>`, `<noscript>`, `<svg>`, and `<iframe>` tags in the process. The page `<title>` is extracted and prepended as an H1 heading, HTML entities in the title are decoded, and consecutive blank lines are collapsed to at most two to keep output compact.
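The post-processing steps described above (prepending the title and collapsing blank lines) could look roughly like the following. This is a minimal std-only sketch, not the code from this PR: `prepend_title` and `collapse_blank_lines` are hypothetical names, and the HTML-to-Markdown conversion itself (done by `htmd` in the PR) and entity decoding are left out.

```rust
/// Prepend the page title as an H1 heading, if there is one.
/// Hypothetical helper; the PR's actual function names are not shown.
fn prepend_title(title: Option<&str>, body: &str) -> String {
    match title.map(str::trim).filter(|t| !t.is_empty()) {
        Some(t) => format!("# {t}\n\n{body}"),
        None => body.to_string(),
    }
}

/// Collapse runs of blank lines so that at most `max_blank` remain in a row.
/// Whitespace-only lines count as blank, and the output always ends with a
/// newline after the last line.
fn collapse_blank_lines(input: &str, max_blank: usize) -> String {
    let mut out = String::with_capacity(input.len());
    let mut blank_run = 0;
    for line in input.lines() {
        if line.trim().is_empty() {
            blank_run += 1;
            if blank_run > max_blank {
                continue; // drop blank lines beyond the limit
            }
            out.push('\n');
        } else {
            blank_run = 0;
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}
```

Calling `collapse_blank_lines(&markdown, 2)` matches the "at most two" limit from the description. A real implementation would likely leave fenced code blocks untouched, which this sketch does not attempt.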

Binary content types (images, audio, video, PDFs, etc.) are now rejected early with a descriptive error rather than returning garbled bytes.
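An early rejection of this kind might be sketched as below. The function names, the exact list of rejected media types, and the error wording are all assumptions for illustration; the PR's actual list is not shown here.

```rust
/// Return true if the `Content-Type` header value looks like binary content.
/// The set of prefixes and types is an assumed example list.
fn is_binary_content_type(content_type: &str) -> bool {
    // Drop parameters such as "; charset=utf-8" and normalize case.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    const BINARY_PREFIXES: [&str; 3] = ["image/", "audio/", "video/"];
    const BINARY_TYPES: [&str; 3] = [
        "application/pdf",
        "application/octet-stream",
        "application/zip",
    ];

    BINARY_PREFIXES.iter().any(|p| mime.starts_with(p))
        || BINARY_TYPES.contains(&mime.as_str())
}

/// Reject binary responses before reading the body.
fn check_content_type(content_type: &str) -> Result<(), String> {
    if is_binary_content_type(content_type) {
        Err(format!(
            "cannot fetch binary content (Content-Type: {content_type})"
        ))
    } else {
        Ok(())
    }
}
```

Note that a plain prefix check also rejects `image/svg+xml`, which is text; whether that is desirable depends on how the tool treats SVG documents.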

For large pages exceeding 200 KB, the tool first attempts to summarize the content via Claude Haiku using the `ANTHROPIC_API_KEY` environment variable. If the key is unavailable or summarization fails, it falls back to hard truncation with a note indicating how many bytes were cut.
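The summarize-then-truncate fallback could be structured as in the sketch below. It is illustrative only: `shrink` and `truncate_with_note` are hypothetical names, the summarizer is passed in as a closure instead of calling any real API, and the wording of the truncation note is an assumption.

```rust
/// Cut `content` to at most `max_bytes` bytes and append a note saying how
/// many bytes were removed. The cut is moved back to a UTF-8 character
/// boundary so the slice never splits a multi-byte character.
fn truncate_with_note(content: &str, max_bytes: usize) -> String {
    if content.len() <= max_bytes {
        return content.to_string();
    }
    let mut end = max_bytes;
    while !content.is_char_boundary(end) {
        end -= 1; // index 0 is always a boundary, so this terminates
    }
    let cut = content.len() - end;
    format!("{}\n\n[Content truncated: {cut} bytes omitted]", &content[..end])
}

/// Return small content unchanged; for large content, try `summarize` first
/// and fall back to hard truncation if it returns `None`.
fn shrink<F>(content: &str, max_bytes: usize, summarize: F) -> String
where
    F: FnOnce(&str) -> Option<String>,
{
    if content.len() <= max_bytes {
        return content.to_string();
    }
    summarize(content).unwrap_or_else(|| truncate_with_note(content, max_bytes))
}
```

A caller might use `shrink(&markdown, 200 * 1024, |text| summarize_with_llm(text))`, where `summarize_with_llm` is a placeholder that returns `None` when no API key is set or the request fails. Whether "200 KB" means 200 × 1024 or 200,000 bytes is not stated in the description.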

The `fs_create_file` tool previously printed raw markdown code fences in
its confirmation message. Now it passes the fence through
`jp_md::Formatter` to render terminal-colored syntax highlighting,
falling back to the raw fence if formatting fails.

Also adds a test module for `create_file` covering both the
format-arguments path (checking for ANSI codes and no raw fences) and
the run path (verifying the file is actually written). The `jp_md`
buffer test suite is extended to verify that blank lines inside fenced
code blocks are preserved as `FencedCodeLine` events.

Signed-off-by: Jean Mertz <git@jeanmertz.com>
…tput

Signed-off-by: Jean Mertz <git@jeanmertz.com>
Base automatically changed from prr44 to main March 25, 2026 23:23
@JeanMertz JeanMertz merged commit d94a9aa into main Mar 26, 2026
12 checks passed
@JeanMertz JeanMertz deleted the prr46 branch March 26, 2026 19:40