[ENG-3021] Render and index <img> as first-class inline content #738
Merged
Closes the silent-strip bug where curating a topic with `<img src alt/>`
embedded inside a `<bv-*>` element succeeded on disk but vanished on
`brv read` and was missed by `brv query`. Empirically reproduced
2026-05-29: writer kept the tag intact; `brv read` showed a gap where
the image was; query for the alt text returned zero matches.
Cause: `html-renderer.ts` and the BM25 indexer's `bodyText` extraction
both rely on `getInnerText()`, which walks text-node descendants only.
Void elements like `<img>` have attribute data (`src`, `alt`) but no
text children → contribute nothing. Worst-class UX (write succeeds,
content disappears).
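The strip can be illustrated with a minimal sketch of a text-only walk. The node model and the example tree below are hypothetical stand-ins (the real parser types are not shown in this PR); only the behavior, text nodes in and attributes ignored, is taken from the description above.

```typescript
// Hypothetical minimal node model; the real parser types are not shown in this PR.
type TextNode = { kind: "text"; value: string };
type ElementNode = {
  kind: "element";
  tag: string;
  attrs: Record<string, string>;
  children: HtmlNode[];
};
type HtmlNode = TextNode | ElementNode;

// Text-only walk: concatenates text-node descendants and never reads attributes.
function getInnerText(node: HtmlNode): string {
  if (node.kind === "text") return node.value;
  return node.children.map(getInnerText).join("");
}

const decision: ElementNode = {
  kind: "element",
  tag: "bv-decision",
  attrs: {},
  children: [
    { kind: "text", value: "See diagram: " },
    {
      kind: "element",
      tag: "img",
      attrs: { src: "https://example.com/arch.png", alt: "service mesh" },
      children: [], // void element: no text children to collect
    },
    { kind: "text", value: " for details." },
  ],
};

const innerText = getInnerText(decision);
console.log(JSON.stringify(innerText)); // alt and src are absent from the result
```

Because the walk never looks at `attrs`, both the renderer and the indexer see a gap where the image was.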
Approach: treat `<img>` as first-class inline content.
- `html-renderer.ts` — new private `getInlineMarkdown(node)` that walks
like `getInnerText` but translates `<img>` to CommonMark `![alt](src)`.
`renderChild` uses it for inline content; a top-level `<img>` case is
added for the rare bv-sibling shape. Defensive on malformed input:
* missing `src` → empty string (no broken `![alt]()` syntax)
* missing `alt` → `![](src)` (valid CommonMark click target)
* `]` in alt → collapsed to space (only `]` closes the alt span)
* `)` in src → CommonMark autolink form `<src>` (parens-tolerant)
- `html-reader.ts` — new exported `extractImageContent(elements)` that
aggregates every `<img>`'s alt + src into a space-joined string.
Surfaced via the new `HtmlTopicRead.imageContent` field. Does NOT
mutate `getInnerText` — separate focused helper with no surprise
blast-radius on shared infrastructure.
- `search-knowledge-service.ts` — concatenate `parsed.imageContent`
into the indexed-content array alongside bodyText / summary / tags /
keywords / related. URLs go in verbatim; the BM25 tokenizer's
whitespace/punctuation split decomposes them into useful tokens
(host, path segments, filename, extension).
- `INDEX_SCHEMA_VERSION` bumped 6 → 7 so cached indexes built pre-fix
invalidate on next daemon start. Previously-curated `<img>` content
becomes searchable retroactively without a manual `brv index rebuild`.
- `system-prompt.yml` — extend the inline-HTML allowlist note to
document `<img>` is supported.
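The four defensive rules for the renderer can be sketched as a single translation function. This is a sketch under stated assumptions, not the code in `html-renderer.ts`: `renderImage` and `ImgAttrs` are hypothetical names, and wrapping the `<src>` form inside the image's parentheses is one reading of the `)` rule.

```typescript
// Hypothetical attribute bag; the real node type in html-renderer.ts is not shown here.
type ImgAttrs = { src?: string; alt?: string };

// Translates one <img> into CommonMark, applying the four defensive rules.
function renderImage(attrs: ImgAttrs): string {
  const src = (attrs.src ?? "").trim();
  // Missing src: emit nothing rather than a broken ![alt]().
  if (src === "") return "";

  // Only "]" can close the alt span, so collapse it to a space.
  const alt = (attrs.alt ?? "").replace(/\]/g, " ");

  // Assumption: a ")" in src switches to the <src> form, which tolerates parentheses.
  const destination = src.includes(")") ? `<${src}>` : src;

  return `![${alt}](${destination})`;
}

console.log(renderImage({ src: "https://example.com/arch.png", alt: "service mesh" }));
console.log(renderImage({ src: "https://example.com/arch_(v2).png", alt: "mesh [v2]" }));
```

Returning an empty string for a missing `src` keeps the surrounding prose intact instead of leaving unparseable image syntax in the rendered output.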
Tests (16 new + 6 integration cases):
- `html-renderer.test.ts` (7 cases): canonical `<img>` rendering inside
`<bv-decision>`; `![](src)` for missing alt; silent drop for missing
src; `]` escape in alt; `)` autolink fallback in src; top-level
`<img>` sibling rendering; URL tokens present in rendered output for
BM25 friendliness.
- `html-reader.test.ts` (6 cases): empty topic → empty imageContent;
single image alt + src aggregated; multiple images preserve document
order; empty attrs don't produce double spaces; `readHtmlTopicSync`
surfaces `imageContent` on the parsed result.
- `test/integration/scenarios/img-roundtrip.test.ts` (6 cases): full
read + index pipeline. Curate-shaped HTML on tmp disk → `readHtmlTopic`
→ renderer shows markdown image syntax; indexer (MiniSearch with the
same option shape as production) finds the topic for queries on alt
phrase, URL host token, URL path segment, surrounding prose. Plus a
regression guard for topics with zero images (no double spaces in
the indexed content).
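The reader cases and the integration queries above rest on two behaviors: alt and src aggregate in document order without stray spaces, and a whitespace/punctuation split turns a URL into separate tokens. A minimal sketch follows, assuming a hypothetical element shape `El` and a simplified `tokenize` stand-in rather than the production tokenizer.

```typescript
// Hypothetical element shape; the real types in html-reader.ts are not shown here.
type El = { tag: string; attrs: Record<string, string>; children: El[] };

// Collects alt + src of every <img> in document order, skipping empty values
// so that missing attributes never produce double spaces.
function extractImageContent(elements: El[]): string {
  const parts: string[] = [];
  const visit = (el: El): void => {
    if (el.tag === "img") {
      for (const value of [el.attrs["alt"], el.attrs["src"]]) {
        const trimmed = (value ?? "").trim();
        if (trimmed !== "") parts.push(trimmed);
      }
    }
    el.children.forEach(visit);
  };
  elements.forEach(visit);
  return parts.join(" ");
}

// Simplified stand-in for a whitespace/punctuation tokenizer (an assumption,
// not the tokenizer used in production).
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

const topic: El[] = [
  {
    tag: "bv-decision",
    attrs: {},
    children: [
      {
        tag: "img",
        attrs: { src: "https://cdn.example.com/diagrams/mesh.png", alt: "Service mesh" },
        children: [],
      },
      { tag: "img", attrs: { src: "b.png" }, children: [] }, // no alt attribute
    ],
  },
];

const imageContent = extractImageContent(topic);
const tokens = tokenize(imageContent);
console.log(imageContent);
console.log(tokens);
```

With this split, the host (`cdn`, `example`, `com`) and the path segment (`diagrams`) become individual tokens, which is what lets a query on a URL fragment find the topic.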
42/42 affected-surface tests green; 54/54 search-knowledge regression
tests pass. Typecheck + lint clean; the `renderChild` complexity
warning was 31 pre-fix and is 32 now — +1 unavoidable for the new
top-level `<img>` branch.
This is the first task in the post-merge inline-html-support
milestone (`features/html-memory-conversion/milestones/02-...`).
The matching fix for `<a href>` (same shape, different bug surface)
is tracked as a follow-up task.
Claude finished @danhdoan's task in 3m 39s

Review Summary
Overall score: 4.5 / 5

Tight, well-scoped fix for a genuinely worst-class UX bug (silent data loss with success signal). The Outside-In framing is right: the renderer + indexer + prompt docs land together so the behavior is consistent across read, query, and authoring surfaces. Test coverage is solid (16 unit + 6 integration), and the regression-guard for zero-image topics gives confidence the indexed-content shape is byte-identical for the non-image case.

Findings (inline)
All non-blocking. None of these would justify holding the merge given the severity of the underlying bug.
RyanNg1403
approved these changes
May 29, 2026
Summary
Closes the silent-strip bug where curating a topic with `<img src alt/>` embedded inside a `<bv-*>` element succeeds on disk but vanishes on `brv read` and is missed by `brv query`. Worst-class UX: no error fires; the calling agent thinks the curate succeeded and the content is safe.

Empirically reproduced 2026-05-29:
- Curated a `<bv-decision>` containing `<img src="..." alt="System architecture showing service mesh"/>`.
- `brv read`: ✗ rendered output shows a gap where the image was.
- `brv query "service mesh"`: ✗ 0 matches.
- `brv read --raw`: ✓ source bytes intact; data was always safe on disk.

Root cause: both `html-renderer.ts` and the BM25 indexer's `bodyText` extraction rely on `getInnerText()`, which walks text-node descendants only. Void elements like `<img>` contribute nothing.

Approach
Treat `<img>` as first-class inline content: promote it from "silently dropped" to "standard CommonMark + searchable".

Renderer (`html-renderer.ts`)
- `getInlineMarkdown(node)` walks like `getInnerText` but translates `<img>` to CommonMark `![alt](src)`.
- `renderChild` uses it for inline content; a top-level `<img>` case is added for the rare bv-sibling shape.
- Missing `src` → empty string (no broken `![alt]()` syntax)
- Missing `alt` → `![](src)` (valid CommonMark click target)
- `]` in alt → collapsed to space (only `]` closes the alt span; `[` is harmless)
- `)` in src → CommonMark autolink form `<src>` (parens-tolerant)

Indexer (`html-reader.ts` + `search-knowledge-service.ts`)
- `extractImageContent(elements)` aggregates every `<img>`'s alt + src into a space-joined string.
- Surfaced via the new `HtmlTopicRead.imageContent` field: a separate focused helper that does NOT mutate `getInnerText` (avoiding shared-infrastructure blast-radius).
- Concatenates `parsed.imageContent` into the existing `[bodyText, summary, tags, keywords, related]` array.

Schema-version bump
`INDEX_SCHEMA_VERSION` bumped 6 → 7 so cached indexes built pre-fix invalidate on next daemon start. Previously-curated `<img>` content becomes searchable retroactively without a manual `brv index rebuild`.

Documentation
`system-prompt.yml` updated in two places to mention `<img>` is supported alongside the existing inline-HTML allowlist.

Test plan
- `renderChild` complexity warning was 31 pre-fix, is 32 now; the +1 is unavoidable for the new top-level `<img>` branch.
- `html-renderer.test.ts` (7): inline `<img>` in `<bv-decision>`; `![](src)` for missing alt; silent drop for missing src; `]` escape; `)` autolink fallback; top-level `<img>`; URL tokens in rendered output.
- `html-reader.test.ts` (6): empty topic → empty imageContent; single image aggregated; multiple images preserve document order; empty attrs don't produce double spaces; `readHtmlTopicSync` surfaces `imageContent` on parsed result.
- `test/integration/scenarios/img-roundtrip.test.ts` (6): full read + index pipeline with MiniSearch configured like production. Curate-shaped HTML on tmp disk → renderer shows markdown image syntax; queries on alt phrase, URL host token, URL path segment, and surrounding prose all match. Plus a regression guard for zero-image topics.

Reviewer notes
- `getInlineMarkdown` is private and scoped to the `<img>` case. `<a href>` translation is a follow-up task in the same inline-html-support milestone (same fix shape, but different bug surface: anchor text is already preserved via inner text; only `href` is dropped).
- `extractImageContent` is exported for the indexer to consume; designed so future inline-HTML elements with attribute-only content slot in by extending its tag-filter without rewriting the surface.
- `INDEX_SCHEMA_VERSION` bump is the standard self-healing pattern: no user-facing migration step. First daemon start after merge regenerates the index.
- `system-prompt.yml` change is documentation for cipher-agent users (legacy `curate`/`query` paths). Tool-mode users will discover `<img>` works through trial, since the writer already accepts it. Broader prompt-builder documentation for tool-mode is a possible follow-up if discovery is slow in practice.

Out of scope (follow-ups)
- `<a href>` first-class rendering + indexing: same shape; tracked separately in the inline-html-support milestone.
- Tool-mode documentation of `<img>` support: current update is to the cipher-agent system prompt only.

Related
- `features/html-memory-conversion/milestones/02-inline-html-support/` (research repo).
- `src/server/infra/render/reader/html-renderer.ts`: strip point fixed
- `src/server/infra/render/reader/html-reader.ts`: new `extractImageContent` + `imageContent` field
- `src/agent/infra/tools/implementations/search-knowledge-service.ts`: indexer wire + schema bump
- `src/agent/resources/prompts/system-prompt.yml`: documentation
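
The self-healing effect of the schema bump can be pictured as a version check when the cached index is loaded. This is only a sketch: `CachedIndex` and `loadUsableIndex` are hypothetical names, and the real cache format in `search-knowledge-service.ts` is not shown in this PR.

```typescript
const INDEX_SCHEMA_VERSION = 7; // value after this PR (was 6)

// Hypothetical cache envelope; the real on-disk format is not shown in this PR.
type CachedIndex = { schemaVersion: number; payload: string };

// Returns the cached payload only when it was built with the current schema.
// Returning undefined signals that the caller should rebuild from source topics.
function loadUsableIndex(cached: CachedIndex | undefined): string | undefined {
  if (cached === undefined) return undefined;
  if (cached.schemaVersion !== INDEX_SCHEMA_VERSION) return undefined;
  return cached.payload;
}

const stale = loadUsableIndex({ schemaVersion: 6, payload: "pre-fix index" });
const fresh = loadUsableIndex({ schemaVersion: 7, payload: "post-fix index" });
console.log(stale === undefined ? "rebuild" : "reuse");
console.log(fresh === undefined ? "rebuild" : "reuse");
```

An index written before the fix carries version 6, fails the check, and is rebuilt with `imageContent` included, which is why no manual rebuild step is needed.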