chore(tools): Convert HTML to markdown in `web_fetch` by JeanMertz · Pull Request #475 · dcdpr/jp

JeanMertz · 2026-03-25T23:14:36Z

Previously, `web_fetch` returned raw HTML to the LLM. Raw HTML is verbose and hard to process: it is full of tags, scripts, and styles that consume tokens without adding useful information.

The tool now converts HTML responses to Markdown using `htmd`, stripping `<script>`, `<style>`, `<noscript>`, `<svg>`, and `<iframe>` tags in the process. The page `<title>` is extracted and prepended as an H1 heading, HTML entities in the title are decoded, and consecutive blank lines are collapsed to at most two to keep output compact.
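The post-processing steps described above (prepending the title and collapsing blank lines) can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the function names `collapse_blank_lines` and `prepend_title` are hypothetical, the conversion via `htmd` is left out, and "at most two" is read here as at most two consecutive blank lines.

```rust
/// Collapse runs of blank lines so that at most `max_blank` remain in a row.
/// Whitespace-only lines count as blank and are emitted as empty lines.
/// (Hypothetical helper; the PR's actual implementation may differ.)
fn collapse_blank_lines(input: &str, max_blank: usize) -> String {
    let mut out = String::with_capacity(input.len());
    let mut blank_run = 0usize;
    for line in input.lines() {
        if line.trim().is_empty() {
            blank_run += 1;
            if blank_run > max_blank {
                continue; // drop the surplus blank line
            }
            out.push('\n');
        } else {
            blank_run = 0;
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}

/// Prepend the page title as an H1 heading when a non-empty title was found.
/// (Hypothetical helper; entity decoding is assumed to have happened already.)
fn prepend_title(title: Option<&str>, body: &str) -> String {
    match title.map(str::trim).filter(|t| !t.is_empty()) {
        Some(t) => format!("# {t}\n\n{body}"),
        None => body.to_string(),
    }
}
```

Calling `prepend_title(title, &collapse_blank_lines(&markdown, 2))` would combine the two steps on the converted Markdown.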

Binary content types (images, audio, video, PDFs, etc.) are now rejected early with a descriptive error rather than returning garbled bytes.
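An early rejection based on the `Content-Type` header might look like the following sketch. The function names and the exact lists of prefixes and media types are assumptions made for illustration; the PR does not spell out which types it treats as binary beyond images, audio, video, and PDFs.

```rust
/// Returns true when the media type looks like binary content.
/// The prefix and exact-match lists are illustrative assumptions.
fn is_binary_content_type(content_type: &str) -> bool {
    // Drop parameters such as `; charset=utf-8` and normalise case.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    const BINARY_PREFIXES: [&str; 3] = ["image/", "audio/", "video/"];
    const BINARY_TYPES: [&str; 2] = ["application/pdf", "application/octet-stream"];

    BINARY_PREFIXES.iter().any(|p| mime.starts_with(*p))
        || BINARY_TYPES.iter().any(|t| mime == *t)
}

/// Fail early with a descriptive error instead of returning garbled bytes.
fn check_content_type(content_type: &str) -> Result<(), String> {
    if is_binary_content_type(content_type) {
        Err(format!(
            "cannot fetch binary content type `{content_type}`"
        ))
    } else {
        Ok(())
    }
}
```

Checking the header before reading the body means that no bytes need to be decoded for responses that will be rejected anyway.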

For large pages exceeding 200 KB, the tool first attempts to summarize the content via Claude Haiku using the `ANTHROPIC_API_KEY` environment variable. If summarization is unavailable or fails, it falls back to hard truncation with a note indicating how many bytes were cut.
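The summarize-then-truncate fallback can be sketched as below. This is an illustration under assumptions: the summarizer is modelled as a closure returning `Option<String>` (standing in for the Claude Haiku call, which is not shown), "200 KB" is taken as `200 * 1024` bytes, and the wording of the truncation note is invented.

```rust
/// Size limit; reading "200 KB" as 200 * 1024 bytes is an assumption.
const MAX_BYTES: usize = 200 * 1024;

/// Hard-truncate `content` to at most `limit` bytes and append a note
/// stating how many bytes were cut. (Note format is hypothetical.)
fn truncate_with_note(content: &str, limit: usize) -> String {
    if content.len() <= limit {
        return content.to_string();
    }
    // Back up to a char boundary so slicing cannot panic on multi-byte UTF-8.
    let mut end = limit;
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    let cut = content.len() - end;
    format!("{}\n\n[truncated: {cut} bytes omitted]", &content[..end])
}

/// Return small content unchanged; otherwise try `summarize` first and
/// fall back to hard truncation when it yields nothing.
fn shrink<F>(content: &str, limit: usize, summarize: F) -> String
where
    F: FnOnce(&str) -> Option<String>,
{
    if content.len() <= limit {
        return content.to_string();
    }
    summarize(content).unwrap_or_else(|| truncate_with_note(content, limit))
}
```

A caller would pass `MAX_BYTES` as the limit and a closure that returns `None` when `ANTHROPIC_API_KEY` is unset or the request fails, so that both failure modes take the truncation path.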

The `fs_create_file` tool previously printed raw markdown code fences in its confirmation message. Now it passes the fence through `jp_md::Formatter` to render terminal-colored syntax highlighting, falling back to the raw fence if formatting fails.

Also adds a test module for `create_file` covering both the format-arguments path (checking for ANSI codes and no raw fences) and the run path (verifying the file is actually written). The `jp_md` buffer test suite is extended to verify that blank lines inside fenced code blocks are preserved as `FencedCodeLine` events.

Signed-off-by: Jean Mertz <git@jeanmertz.com>
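The "format, else fall back to the raw fence" behaviour can be sketched generically. The API of `jp_md::Formatter` is not shown in this PR, so the formatter is represented here by an arbitrary closure; `render_or_raw` and `contains_ansi` are hypothetical names used only for this illustration.

```rust
/// Try to render `raw_fence` with `format`; on any error, return the raw
/// fence unchanged. `format` stands in for the real formatter call.
fn render_or_raw<F, E>(raw_fence: &str, format: F) -> String
where
    F: FnOnce(&str) -> Result<String, E>,
{
    format(raw_fence).unwrap_or_else(|_| raw_fence.to_string())
}

/// Rough check for ANSI escape sequences (ESC followed by `[`), similar in
/// spirit to what a test of the highlighted output might assert.
fn contains_ansi(s: &str) -> bool {
    s.contains("\u{1b}[")
}
```

Keeping the fallback in one place means the confirmation message is always printed, whether or not highlighting succeeds.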

Signed-off-by: Jean Mertz <git@jeanmertz.com>

JeanMertz added 4 commits March 26, 2026 00:05

fixup! chore(tools, md): Highlight code blocks in fs_create_file ou…

33943d8

…tput Signed-off-by: Jean Mertz <git@jeanmertz.com>

Merge branch 'prr44' into prr46

32ca47a

Base automatically changed from prr44 to main March 25, 2026 23:23

JeanMertz added 2 commits March 26, 2026 10:24

fixup! Merge branch 'prr44' into prr46

444fcaa

Signed-off-by: Jean Mertz <git@jeanmertz.com>

fixup! fixup! Merge branch 'prr44' into prr46

035d7c8

Signed-off-by: Jean Mertz <git@jeanmertz.com>

JeanMertz merged commit d94a9aa into main Mar 26, 2026
12 checks passed

JeanMertz deleted the prr46 branch March 26, 2026 19:40

JeanMertz merged 6 commits into main from prr46
