Skip to content

[ENG-3021] Render and index <img> as first-class inline content#738

Merged
danhdoan merged 1 commit into
proj/byterover-tool-modefrom
feat/ENG-3021-img-first-class
May 29, 2026
Merged

[ENG-3021] Render and index <img> as first-class inline content#738
danhdoan merged 1 commit into
proj/byterover-tool-modefrom
feat/ENG-3021-img-first-class

Conversation

@danhdoan
Copy link
Copy Markdown
Collaborator

Summary

Closes the silent-strip bug where curating a topic with <img src alt/> embedded inside a <bv-*> element succeeds on disk but vanishes on brv read and is missed by brv query. Worst-class UX — no error fires; the calling agent thinks the curate succeeded and the content is safe.

Empirically reproduced 2026-05-29:

  1. Curate a <bv-decision> containing <img src="..." alt="System architecture showing service mesh"/>.
  2. On-disk file: ✓ tag intact.
  3. brv read: ✗ rendered output shows a gap where the image was.
  4. brv query "service mesh": ✗ 0 matches.
  5. brv read --raw: ✓ source bytes intact — data was always safe on disk.

Root cause: both html-renderer.ts and the BM25 indexer's bodyText extraction rely on getInnerText(), which walks text-node descendants only. Void elements like <img> contribute nothing.

Approach

Treat <img> as first-class inline content — promote from "silently dropped" to "standard CommonMark + searchable".

Renderer (html-renderer.ts)

  • New private getInlineMarkdown(node) walks like getInnerText but translates <img> to CommonMark ![alt](src).
  • renderChild uses it for inline content; a top-level <img> case is added for the rare bv-sibling shape.
  • Defensive on malformed inputs:
    • missing src → empty string (no broken ![alt]() syntax)
    • missing alt![](src) (valid CommonMark click target)
    • ] in alt → collapsed to space (only ] closes the alt span; [ is harmless)
    • ) in src → CommonMark autolink form <src> (parens-tolerant)

Indexer (html-reader.ts + search-knowledge-service.ts)

  • New exported extractImageContent(elements) aggregates every <img>'s alt + src into a space-joined string.
  • Surfaced via the new HtmlTopicRead.imageContent field — separate focused helper; does NOT mutate getInnerText (avoiding shared-infrastructure blast-radius).
  • Indexer concatenates parsed.imageContent into the existing [bodyText, summary, tags, keywords, related] array.
  • URLs go in verbatim; the BM25 tokenizer's whitespace/punctuation split decomposes them into useful tokens (host, path segments, filename, extension).

Schema-version bump

INDEX_SCHEMA_VERSION bumped 6 → 7 so cached indexes built pre-fix invalidate on next daemon start. Previously-curated <img> content becomes searchable retroactively without a manual brv index rebuild.

Documentation

system-prompt.yml updated in two places to mention <img> is supported alongside the existing inline-HTML allowlist.

Test plan

  • Typecheck clean
  • Lint clean (0 errors). One pre-existing complexity warning on renderChild was 31 pre-fix, is 32 now — +1 unavoidable for the new top-level <img> branch.
  • 16 new unit tests + 6 new integration cases:
    • html-renderer.test.ts (7): inline <img> in <bv-decision>; ![](src) for missing alt; silent drop for missing src; ] escape; ) autolink fallback; top-level <img>; URL tokens in rendered output.
    • html-reader.test.ts (6): empty topic → empty imageContent; single image aggregated; multiple images preserve document order; empty attrs don't produce double spaces; readHtmlTopicSync surfaces imageContent on parsed result.
    • test/integration/scenarios/img-roundtrip.test.ts (6): full read + index pipeline with MiniSearch configured like production. Curate-shaped HTML on tmp disk → renderer shows markdown image syntax; queries on alt phrase, URL host token, URL path segment, and surrounding prose all match. Plus a regression guard for zero-image topics.
  • 42/42 affected-surface tests green.
  • 54/54 search-knowledge regression tests pass — no regression on the indexer's existing behavior.
  • Empirical bug reproduction (5-step) from the bug report now passes end-to-end at the test level.

Reviewer notes

  • The renderer's getInlineMarkdown is private and scoped to the <img> case — <a href> translation is a follow-up task in the same inline-html-support milestone (same fix shape, but different bug surface: anchor text is already preserved via inner text; only href is dropped).
  • extractImageContent is exported for the indexer to consume; designed so future inline-HTML elements with attribute-only content slot in by extending its tag-filter without rewriting the surface.
  • The INDEX_SCHEMA_VERSION bump is the standard self-healing pattern — no user-facing migration step. First daemon start after merge regenerates the index.
  • system-prompt.yml change is documentation for cipher-agent users (legacy curate/query paths). Tool-mode users will discover <img> works through trial — the writer already accepts it. Broader prompt-builder documentation for tool-mode is a possible follow-up if discovery is slow in practice.

Out of scope (follow-ups)

  • <a href> first-class rendering + indexing — same shape; tracked separately in the inline-html-support milestone.
  • Image-URL validation / host allowlist — image URLs pass through unchanged.
  • Alt-text accessibility validation — minimum length, presence required — separate decision.
  • Tool-mode kickoff-prompt documentation of <img> support — current update is to the cipher-agent system prompt only.

Related

  • Bug report: in-conversation 2026-05-29 (5-step empirical reproduction).
  • Milestone: features/html-memory-conversion/milestones/02-inline-html-support/ (research repo).
  • Affected paths:
    • src/server/infra/render/reader/html-renderer.ts — strip point fixed
    • src/server/infra/render/reader/html-reader.ts — new extractImageContent + imageContent field
    • src/agent/infra/tools/implementations/search-knowledge-service.ts — indexer wire + schema bump
    • src/agent/resources/prompts/system-prompt.yml — documentation

Closes the silent-strip bug where curating a topic with `<img src alt/>`
embedded inside a `<bv-*>` element succeeded on disk but vanished on
`brv read` and was missed by `brv query`. Empirically reproduced
2026-05-29: writer kept the tag intact; `brv read` showed a gap where
the image was; query for the alt text returned zero matches.

Cause: `html-renderer.ts` and the BM25 indexer's `bodyText` extraction
both rely on `getInnerText()`, which walks text-node descendants only.
Void elements like `<img>` have attribute data (`src`, `alt`) but no
text children → contribute nothing. Worst-class UX (write succeeds,
content disappears).

Approach: treat `<img>` as first-class inline content.

- `html-renderer.ts` — new private `getInlineMarkdown(node)` that walks
  like `getInnerText` but translates `<img>` to CommonMark `![alt](src)`.
  `renderChild` uses it for inline content; a top-level `<img>` case is
  added for the rare bv-sibling shape. Defensive on malformed input:
    * missing `src` → empty string (no broken `![alt]()` syntax)
    * missing `alt` → `![](src)` (valid CommonMark click target)
    * `]` in alt → collapsed to space (only `]` closes the alt span)
    * `)` in src → CommonMark autolink form `<src>` (parens-tolerant)
- `html-reader.ts` — new exported `extractImageContent(elements)` that
  aggregates every `<img>`'s alt + src into a space-joined string.
  Surfaced via the new `HtmlTopicRead.imageContent` field. Does NOT
  mutate `getInnerText` — separate focused helper with no surprise
  blast-radius on shared infrastructure.
- `search-knowledge-service.ts` — concatenate `parsed.imageContent`
  into the indexed-content array alongside bodyText / summary / tags /
  keywords / related. URLs go in verbatim; the BM25 tokenizer's
  whitespace/punctuation split decomposes them into useful tokens
  (host, path segments, filename, extension).
- `INDEX_SCHEMA_VERSION` bumped 6 → 7 so cached indexes built pre-fix
  invalidate on next daemon start. Previously-curated `<img>` content
  becomes searchable retroactively without a manual `brv index rebuild`.
- `system-prompt.yml` — extend the inline-HTML allowlist note to
  document `<img>` is supported.

Tests (16 new + 6 integration cases):

- `html-renderer.test.ts` (7 cases): canonical `<img>` rendering inside
  `<bv-decision>`; `![](src)` for missing alt; silent drop for missing
  src; `]` escape in alt; `)` autolink fallback in src; top-level
  `<img>` sibling rendering; URL tokens present in rendered output for
  BM25 friendliness.
- `html-reader.test.ts` (6 cases): empty topic → empty imageContent;
  single image alt + src aggregated; multiple images preserve document
  order; empty attrs don't produce double spaces; `readHtmlTopicSync`
  surfaces `imageContent` on the parsed result.
- `test/integration/scenarios/img-roundtrip.test.ts` (6 cases): full
  read + index pipeline. Curate-shaped HTML on tmp disk → `readHtmlTopic`
  → renderer shows markdown image syntax; indexer (MiniSearch with the
  same option shape as production) finds the topic for queries on alt
  phrase, URL host token, URL path segment, surrounding prose. Plus a
  regression guard for topics with zero images (no double spaces in
  the indexed content).

42/42 affected-surface tests green; 54/54 search-knowledge regression
tests pass. Typecheck + lint clean; the `renderChild` complexity
warning was 31 pre-fix and is 32 now — +1 unavoidable for the new
top-level `<img>` branch.

This is the first task in the post-merge inline-html-support
milestone (`features/html-memory-conversion/milestones/02-...`).
The matching fix for `<a href>` (same shape, different bug surface)
is tracked as a follow-up task.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Claude finished @danhdoan's task in 3m 39s —— View job


Review Summary

Overall score: 4.5 / 5

Tight, well-scoped fix for a genuinely worst-class UX bug (silent data loss with success signal). The Outside-In framing is right: the renderer + indexer + prompt docs land together so the behavior is consistent across read, query, and authoring surfaces. Test coverage is solid (16 unit + 6 integration), and the regression-guard for zero-image topics gives confidence the indexed-content shape is byte-identical for the non-image case.

What this PR does well

  • Schema-version bump (INDEX_SCHEMA_VERSION 6→7) is the right self-healing call — pre-fix indexes invalidate on next daemon start, so previously-curated <img> content becomes searchable retroactively without a manual rebuild step.
  • extractImageContent separation — the new helper is a focused, exported function that doesn't mutate getInnerText. Good blast-radius discipline; the BM25 input gains image content without touching the body-text contract.
  • Defensive renderImgMarkdown — missing src drops silently rather than emitting broken ![](); missing alt produces valid ![](src); ] in alt and ) in src have documented fallbacks.
  • System-prompt update — surfacing "alt is the primary search handle" is high-leverage. Without it, models can ship empty alt="" (renders fine in browsers) and silently degrade BM25 recall.
  • Integration test — proves the fix end-to-end by indexing a single topic with MiniSearch configured like production and querying for the alt phrase that was the original repro.

Findings (inline)

All non-blocking. None of these would justify holding the merge given the severity of the underlying bug.

  1. renderer:242 — Per CommonMark, unmatched [ in alt can ambiguate the parser. The doc-comment calls it "harmless" but it isn't strictly so. Cheap to scrub the same way ] is scrubbed.
  2. renderer:246 — When src contains ) and the <src> autolink fallback is used, alt is silently dropped from the rendered output. The indexer still has it, but the LLM reader loses the semantic signal. Suggested alt <src> form preserves both.
  3. renderer:209.join(' ') carries forward an existing trailing-space-before-punctuation artifact from getInnerText. Not a regression, but more visible now because the image markdown is substantive. Worth a follow-up that fixes both paths.
  4. renderer:221collapseInlineWhitespace is identical to collapseWhitespace in html-parser.ts. Worth deduplicating to keep the whitespace contract in lockstep.
  5. html-reader:113extractImageContent's parameter type is structural rather than readonly ElementNode[]. Cosmetic — tighten for documentation value.

Out-of-scope items (correctly deferred)

  • <a href> first-class rendering + indexing — same shape, different bug surface. Tracked in the inline-html-support milestone per the PR description.
  • Image-URL validation / host allowlist — separate concern.
  • Alt-text accessibility validation — separate concern.

Not flagged

  • TDD ordering — can't verify from the diff alone whether tests were written first, but the test artifacts (16 unit + 6 integration, including the bug-repro guard) are commensurate with the strict TDD requirement in CLAUDE.md.
  • Typecheck / lint / test runs — relying on the PR description's "42/42 affected-surface tests green" + "54/54 search-knowledge regression" claim. Did not re-run locally in this review.

  • Read changed files
  • Analyze html-renderer.ts changes
  • Analyze html-reader.ts changes
  • Analyze search-knowledge-service.ts changes
  • Analyze system-prompt.yml changes
  • Analyze test coverage
  • Post inline comments for any issues
  • Post final review summary
    · Branch: feat/ENG-3021-img-first-class

Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/server/infra/render/reader/html-reader.ts
Comment thread src/agent/infra/tools/implementations/search-knowledge-service.ts
Comment thread src/agent/infra/tools/implementations/search-knowledge-service.ts
Comment thread src/agent/resources/prompts/system-prompt.yml
@danhdoan danhdoan merged commit 6f2ac2e into proj/byterover-tool-mode May 29, 2026
6 checks passed
@danhdoan danhdoan deleted the feat/ENG-3021-img-first-class branch May 29, 2026 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants