[ENG-2738] HTML-emission contract in curate.txt + fluency check by danhdoan · Pull Request #628 · campfirein/byterover-cli

danhdoan · 2026-05-10T07:58:02Z

Summary

Rewrite src/agent/resources/tools/curate.txt (the canonical curate tool description loaded by ToolDescriptionLoader) from structured JSON args (rawConcept/narrative/facts) to the M1 HTML-emission contract using the closed <bv-*> vocabulary from T1.
Cover all 5 elements with per-element schemas, value enums, allowed-children semantics, pairing conventions, output contract, and three worked examples.
Add a sanity unit test that asserts every ELEMENT_NAME, every per-element enum value, and every output-contract directive is present in the prompt — guards against drift when M2 expands the vocabulary.

Type

feat (M1 T2 — authoring fluency check)

Scope

src/agent/resources/tools/curate.txt — wholesale rewrite to HTML emission.
test/unit/server/infra/render/curate-prompt.test.ts — sanity drift guards (11 cases).

Linked issues

ENG-2738 (M1 T2 — authoring fluency sanity check)
Builds on ENG-2737 (PR feat: [ENG-2737] HTML render layer foundation — element registry + parser #627) — registry + parser foundation.

Test plan

All 121 render unit tests pass (11 new on the prompt sanity, 110 inherited from T1).
Typecheck clean (0 errors).
Build succeeds.
Authoring fluency check executed against the new prompt — full report in research repo (separate harness at local-auto-test/curate-fluency/, intentionally not in this repo per spec — "this is a spike, not production code").

Evidence

Fluency gate cleared on first run, after one prompt-tuning iteration.

Metric	Value
Pass rate	20 / 20 (100%) — gate threshold ≥ 18 / 20
Model	`claude-sonnet-4-5-20250929`
Per-category	bug-fix 8/8, rule-decision 6/6, general 6/6
Cost	$0.31

Per-fixture verdicts and a non-gate-blocking observation about Sonnet's tendency to fence-wrap structured output (50% of outputs; recommendation is to strip in T3's response parser, not to iterate the prompt further) are captured in the formal report:
research/features/html-memory-conversion/milestones/01-experiment/fluency-report.md

Checklist

Code passes lint + typecheck.
Unit tests added for the new prompt's invariants.
Fluency report published with honest reporting (no result-tuning).
No Co-Authored-By Claude.

Risks

The live curate flow on proj/html-mem-conversion will misbehave from this PR until T3 lands — the agent will emit HTML rather than calling tools.curate() with structured args, and the executor's parseCurationStatus() will not recognize the response shape. This is intentional and bounded to the project branch. T3 (ENG-273x) wires the executor + writer to consume the HTML output. Production (main) is unaffected.

Rewrite the canonical curate tool description (loaded by ToolDescriptionLoader and presented to the agent on every curate operation) from the structured JSON-args shape (rawConcept/narrative/facts/etc.) to the M1 HTML-emission contract using the closed `<bv-*>` vocabulary defined by T1. Coverage: - All 5 elements (bv-topic, bv-rule, bv-decision, bv-bug, bv-fix) with full per-element attribute schemas: required/optional, value enums, format constraints (importance integer 0-100, recency numeric 0-1, ISO-8601 updatedat, severity enums per element, etc.). - Allowed-children semantics (any/inline/block) + pairing convention (bug+fix as siblings). - Path format (slash-separated snake_case) + domain guidelines. - Output contract: HTML only, exactly one bv-topic root, no preamble/fences/ commentary, lowercase attribute names per HTML5 spec, closed vocabulary, no invented attributes, no clarifying questions. - Three worked examples (bug+fix runbook, rule+decision pair, general topic) to anchor the model's element-pairing and id-naming conventions. Add a sanity test (test/unit/server/infra/render/curate-prompt.test.ts) that loads curate.txt and asserts every registered ELEMENT_NAME, every per-element enum value, and every output-contract directive is present. Guards against silent drift when M2 expands the vocabulary or refactors attribute schemas. Authoring fluency check (M1 T2 spike, separate harness in local-auto-test/curate-fluency/, not in this repo): 20/20 generated outputs validated against the M1 element registry on Sonnet 4.5. Gate cleared on first full run after one prompt-tuning iteration. Decision: proceed to T3. Full report: research repo at features/html-memory-conversion/milestones/01-experiment/fluency-report.md

github-actions · 2026-05-10T07:58:19Z

Claude finished @danhdoan's task in 3m 58s —— View job

Claude PR Review

Overall: 4 / 5 — solid M1 contract rewrite. The HTML emission spec is well-scoped, internally near-consistent, and the registry/prompt alignment is tight. One real ambiguity in the prompt (<bv-rule> inline-vs-block contradiction) and a weaker-than-advertised drift test prevent a 5; both are cheap to fix in this PR rather than punting to T3.

What's good (praise)

Outside-In framing is correct. The prompt is written for the consumer (the LLM emitting curate output); element semantics, examples, and pairing conventions are all derived from concrete need rather than designed in isolation. Matches CLAUDE.md's Feature Development principle.
Registry / prompt alignment is exact. Every ELEMENT_NAMES entry, every per-element optional/required attribute, and every Zod enum (maturity, severity × 2) is faithfully mirrored in the prompt. The "single source of truth" claim in registry.ts:10–20 survives this PR.
Output contract is unambiguous on the load-bearing rules — first-char/last-chars, no fences, no preamble, exactly one <bv-topic> root, lowercase attributes (with the HTML5 normalization rationale inline), no clarifying questions. These are the rules T3's parser needs to assume.
Examples cover the three intended shapes (bug+fix runbook, rule+decision pair, general project context) — good fluency-budget choice for first-run prompt tuning.
Honest evidence reporting. Fluency report is filed with verdicts, the 50%-fence-wrap observation is surfaced rather than buried, and the "fix in T3 not in prompt" recommendation is reasoned (even if I'd push back — see below).

Issues (inline)

<bv-rule> inline-vs-block contradiction (curate.txt:53–81) — bv-rule declares inline children, but the "Standard HTML" section blanket-permits block tags inside any <bv-*>. Inline comment with the proposed split.
Drift-guard tests don't bind enums to elements (curate-prompt.test.ts:51–70) — it('lists severity enum values for bv-rule') passes if "must" exists anywhere in the prompt; it doesn't verify the value is in the <bv-rule> section. A swap between bv-bug and bv-decision severity wouldn't fail. Inline comment proposes section-anchored regex + a stronger test that parses the prompt's example blocks through the registry validators.
Code fences in examples may be a fluency-rate cause (curate.txt:99–145) — non-blocking, but worth a one-line caveat or switching to indented blocks before relying on T3's parser to clean up.
Path-format prompt vs schema (curate.txt:28) — nit; flagging only to confirm the prompt-tighter-than-validator gap is intentional.

Why 4 not 5

Issue 1 is a real ambiguity that an M2 strict validator will reject in one direction or the other — better to settle it now while the prompt is fresh than to retune after T3 lands.
Issue 2 is the one that makes me hesitate most: the test file's preamble explicitly promises drift-failure, but a future PR moving severity values across elements wouldn't trip the guard.

What I didn't review

Did not run the test suite or build locally (review-only context).
Did not read the fluency-report harness in the research repo (out-of-scope per spec — "spike, not production code").
Did not validate the live curate flow on proj/html-mem-conversion — the PR description correctly flags this is intentionally broken until T3.

Todos

Read changed files (curate.txt, curate-prompt.test.ts)
Compare against base proj/html-mem-conversion
Cross-reference registry / Zod schemas / parser
Post inline comments (4)
Post summary with score
· Branch

Field-by-field review of curate.txt against the actual rendered .md context files (frontmatter + ## Reason + ## Raw Concept + ## Narrative + ## Facts) surfaced two issues with the initial T2 draft: 1. <bv-topic> carried runtime-signal attributes (importance, maturity, recency, updatedat) that were explicitly migrated to a sidecar store under research/features/runtime-signals/ (per-user, per-machine state that caused noisy `vc status` diffs and team-share conflicts). Drop them from the schema; replace with the actual frontmatter fields the markdown writer renders: title (required), summary, tags, keywords, related. 2. The 5-element vocabulary covered Rules + three M1 net-new elements (Decision/Bug/Fix) but left every other rendered MD section as free-form HTML — losing the structural mapping the writer needs. Expand M1 vocabulary to 16 elements (added 11), each mapping to a specific section in the rendered .md output: - bv-reason → ## Reason - bv-task, → ## Raw Concept (Task / Changes / Files / Flow) bv-changes, bv-files, bv-flow - bv-structure, → ## Narrative (Structure / Dependencies / bv-dependencies, Highlights / Examples / Diagrams); bv-highlights, bv-rule already covered Rules bv-examples, bv-diagram (with type + title attrs for verbatim diagram preservation) - bv-fact (subject/category/value attrs) → ## Facts list Each element follows the existing makeAttributeValidator pattern; the data-driven registry stays the single source of truth. Tests: - Update bv-topic test to drop runtime-signal cases and assert the new required-attributes (path + title); add a regression case asserting passthrough still tolerates legacy importance/maturity/ recency without enforcing them. - Add registry test asserting (a) bv-topic exposes the new optionalAttributes, (b) runtime signals are NOT in the registry metadata, (c) every registered validator accepts its own empty node + rejects mismatched-tag. - Consolidate the 9 attribute-free new elements into one shared text-only-elements.test.ts (same trivial schema; per-element duplication wouldn't add coverage). - Add dedicated bv-diagram.test.ts (type enum) and bv-fact.test.ts (category enum, subject/value pass-through). - Update sample-topic.html fixture to exercise every new element + the new frontmatter attributes; assert no runtime signals leak in. - Update curate-prompt.test.ts to assert the expanded vocabulary, category/type enums for bv-fact/bv-diagram, and that the prompt explicitly excludes runtime signals from bv-topic attributes. Re-run the M1 T2 fluency check on the same 20 fixtures with the new vocabulary: 20 / 20 valid (gate cleared); 100% cohort-appropriate placement (bug+fix → bv-bug+bv-fix; rule+decision → bv-rule+bv-decision; general → bv-reason); 122 bv-fact instances extracted across the run (~6 per fixture). Fluency report updated: research/features/html-memory-conversion/milestones/01-experiment/fluency-report.md Open contract violation persists: 18/20 outputs (90%) wrap in code fences despite explicit prompt instruction. Recommendation for T3 unchanged — strip in the response parser, don't iterate the prompt.

Second writer audit (markdown-writer.ts:270-354) found 3 rawConcept sub-fields without an HTML home: rawConcept.timestamp, rawConcept.author, rawConcept.patterns. Add the missing elements so the M1 vocabulary covers every section the writer renders. Vocabulary additions (16 → 19 elements): - <bv-timestamp> — text-only; renders as `**Timestamp:**` under `## Raw Concept`. Distinct from frontmatter createdAt/updatedAt (system-set) — this is the date the concept's data represents. - <bv-author> — text-only; renders as `**Author:**`. - <bv-pattern> — structured (flags + description as attrs, pattern as text content). Multiple siblings inside <bv-topic> collected into the `**Patterns:**` bullet list. Mirrors how <bv-fact> works. Sanitiser: - stripCodeFenceWrapper(html) added to html-parser.ts. Strips a single outer ``` ... ``` (any language tag) wrapper from the LLM response. Inner fences (e.g., <pre><code> blocks inside <bv-diagram>) survive. T3's curate executor calls this before parseHtml. Defensive sanitisation generalises better than chasing the model's quirk via prompt iteration — observed fence-wrap rate is 70% on Sonnet 4.5 even with explicit prompt instruction to the contrary. Tests: - bv-pattern dedicated test (bv-pattern.test.ts) for the structured attrs. - text-only-elements.test.ts extended for bv-timestamp + bv-author (same trivial schema as the other text-only elements). - html-parser.test.ts: 7 new cases for stripCodeFenceWrapper covering ```html, ```xml, no-language ```, leading/trailing whitespace, unwrapped passthrough, mismatched-fence passthrough, inner-fence preservation, and parse-after-strip end-to-end. - sample-topic-roundtrip.test.ts: renderable-MD coverage assertion extended to require all 15 dedicated body-section elements (was 12). - sample-topic.html fixture exercises the 3 new elements. Re-ran the M1 T2 fluency check on the same 20 fixtures with the 19-element vocabulary. Results: - 20 / 20 valid (gate cleared) - 100% cohort-appropriate placement (unchanged) - bv-timestamp 60% adoption (used where input has a clear date) - bv-author 10% (used only on inputs that name an owner) - bv-pattern 0% — model correctly did not over-emit; no fixture contained regex content. Restraint is good behaviour. - Cost: $0.61 (vs. $0.60 for the 16-element draft; negligible) Updated fluency-report.md in research repo with final coverage matrix and the 19-element → markdown-writer mapping for T3.

github-actions · 2026-05-10T09:03:32Z

Claude finished @danhdoan's task in 4m 28s —— View job

Claude PR Review (round 2)

Overall: 4 / 5 — the two new commits (1be1dbd vocabulary expansion, eb470b5 writer-section coverage + fence sanitiser) substantially strengthen the M1 contract: 14 net-new bv-* elements with per-element schemas, validators, fixtures, and tests; a defensive fence sanitiser that quietly buys T3 a much cleaner parsing surface. Strong outside-in execution.

The score doesn't move from 4 because the original review's two load-bearing issues are still open — and one of them (the inline-vs-block contradiction) has actually expanded from 1 element to 7 with the vocabulary growth. Both are still cheap to fix in this PR.

What's new and good

Vocabulary now covers the rendered MD structure 1:1. Frontmatter (bv-topic) + Reason + Raw Concept (task / changes / files / flow / timestamp / author / pattern) + Narrative (structure / dependencies / highlights / rules / examples / diagrams) + Facts + Decision/Bug/Fix. The "writer doesn't need a switch on element name" promise (registry.ts:30–34) holds — every section maps to a registry entry.
stripCodeFenceWrapper is the right move. Defensive parsing > fighting Sonnet's quirk through prompt iteration. The implementation only strips a single outer fence, preserves inner <pre><code> (html-parser.test.ts:258–267), and the regex is conservative enough that a partial / unmatched fence is left alone (:253–256). Eight test cases — solid coverage.
Fixture exercises every M1 element at least once (sample-topic.html), and the round-trip test enforces it (sample-topic-roundtrip.test.ts:45–51). This is the right place to catch silent vocabulary breakage.
Schemas for new structured elements are tight where they should be: bv-fact.category (7-value enum), bv-diagram.type (6-value enum), bv-rule.severity and bv-bug.severity keep their distinct enums without cross-contamination.
Net-new test files for the structured-attribute elements (bv-fact.test.ts, bv-pattern.test.ts, bv-diagram.test.ts) plus a clean parameterised file for the 11 text-only elements (text-only-elements.test.ts) — good test ergonomics.

Issues still open from the prior review

1. `<bv-rule>` inline-vs-block contradiction — now expanded to 7 elements (was 1)

The original Issue 1 was not addressed and the vocabulary expansion has made it worse. The registry now declares allowedChildren: 'inline' for seven elements: bv-rule, bv-flow, bv-pattern, bv-fact, bv-task, bv-author, bv-timestamp (registry.ts:42, 113, 145, 165, 188, 208, 218). But the prompt's "Standard HTML inside <bv-*> elements" clause (curate.txt:146–150) still blanket-permits block tags (p, ul, ol, li, h1–h6, pre) inside any <bv-*>.

Worse, the new "Detail-preservation" section actively contradicts the inline declaration:

curate.txt:160–161 — "preserve original numbering in <ol> inside <bv-flow> or <bv-rule>" — but both elements are 'inline' in the registry.

Examples in the prompt and fixture currently honor inline (line 15 of fixture: bare text in <bv-flow>; lines 24–26: bare text in <bv-rule>). M1 passthrough validators don't reject the alternative form, so the LLM picks unpredictably; ADR-007 §13 strict validation in M2 will reject one of the two and you'll have to re-tune.

Fix now — make the standard-HTML clause per-element-aware. Either:

Split the clause into two: "Inline content (for bv-rule, bv-flow, bv-fact, bv-pattern, bv-task, bv-author, bv-timestamp): code, strong, em. Block content (for bv-decision, bv-bug, bv-fix, bv-changes, bv-files, bv-reason, bv-structure, bv-dependencies, bv-highlights, bv-examples, bv-diagram, bv-topic): the full list above plus inline."
And remove "or <bv-rule>" / "<bv-flow>" from the <ol> guidance at line 161 (or move them to the block list, but then update the registry too).

Fix this →

2. Drift-guard tests are still loose string-match

The lines flagged in the prior review (curate-prompt.test.ts:60–86) are unchanged. Each enum check still passes if the value exists anywhere in the prompt, not anchored to the relevant element's section. The new declares each registered element somewhere with an explanatory blurb test (:88–103) is a step forward — it verifies context exists near each element name — but it doesn't bind enum values to elements.

A future PR that swaps severity="critical" from <bv-bug> to <bv-rule>, or moves category="convention" off <bv-fact> to a new element, still passes all current tests. The file's preamble explicitly promises this drift class will fail loudly (:6–9) — and currently it doesn't.

The strongest fix (still the suggestion from the previous round): parse the prompt's three example HTML blocks with parseHtml from src/server/infra/render/reader/html-parser.ts and run each element through its registry validator. That catches both vocabulary drift AND drift in the examples themselves (e.g. severity="hihg" typo). Bonus value now: with the fixture covering every element, the same approach works directly on test/fixtures/render/sample-topic.html and would catch any registry/prompt drift.

Fix this →

New issues observed this round

3. Stale test description: "16 entries" should be "19 entries"

test/unit/server/infra/render/elements/registry.test.ts:23 says it('contains exactly the M1 vocabulary (16 entries)', () => { but ELEMENT_NAMES now has 19 entries. The assertion itself is fine (uses ELEMENT_NAMES.length), but the description is stale and will mislead future readers.

4. Comment / PR-description mismatch on fence-wrap rate

html-parser.ts:32 says "observed at ~90% on M1's authoring fluency check"; the PR description says "50% of outputs". Either the fluency report has been re-run with a higher rate (worth noting in the PR body), or one of these numbers is wrong. The 90% figure is what's now baked into the source code and will shape future maintainers' read of the sanitiser's importance.

5. `bv-rule` severity enum ordering is inconsistent

Prompt (curate.txt:108): "info", "should", "must" (logical severity ascending)
Schema (bv-rule/schema.ts): ['info', 'must', 'should'] (alphabetical)
Registry description (registry.ts:191): (info|must|should) (alphabetical)

Both are accepted by Zod, so this is non-blocking, but keeping them consistent (preferably the logical-severity ordering, which reads better in docs) avoids reader confusion.

6. `stripCodeFenceWrapper` is exported but not yet wired (intentional, just confirming)

parseHtml does not call stripCodeFenceWrapper. The PR description's risk note ("T3 wires the executor + writer to consume the HTML output") plus the function's design (separate utility, conservative regex) suggests the wiring is deferred to T3 — fine, but worth ensuring there's a tracked todo/issue so the sanitiser is actually invoked before parseHtml in the response path.

Issues from prior round — status

Issue 1 — bv-rule inline vs block — ❌ NOT RESOLVED, expanded to 7 elements (see issue Feat/init #1 above)
Issue 2 — drift-guard test efficacy — ❌ NOT RESOLVED (see issue Feat/refactor structure #2 above)
Issue 3 — code fences in examples — ✅ MITIGATED via stripCodeFenceWrapper in parser. The prompt's examples are still fenced; if the team wants belt-and-suspenders, also strip fences from the prompt's example blocks. Non-blocking.
Issue 4 — path-format prompt vs schema — ❌ NOT addressed (nit, intentional)

Why 4 not 5

The same reason as round 1: Issue #1 is a real ambiguity that an M2 strict validator will reject, and Issue #2 is a test that publicly promises drift-failure but doesn't deliver. Both are 10-15 minute fixes. The vocabulary expansion + fence sanitiser are excellent additions and worth a 5 in isolation, but they don't replace fixing the open issues — they make Issue #1 worse by widening its scope.

Todos

Read PR diff vs origin/proj/html-mem-conversion
Re-check the four issues from the prior review
Read new vocabulary additions (registry, schemas, validators)
Read fence-sanitiser change (parser)
Validate per-element schemas + tests
Post review
· Branch

danhdoan requested review from RyanNg1403, bao-byterover, cuongdo-byterover, leehpham and ngduyanhece as code owners May 10, 2026 07:58

cuongdo-byterover previously approved these changes May 10, 2026

View reviewed changes

github-actions Bot reviewed May 10, 2026

View reviewed changes

Comment thread src/agent/resources/tools/curate.txt Outdated

github-actions Bot reviewed May 10, 2026

View reviewed changes

Comment thread src/agent/resources/tools/curate.txt

github-actions Bot reviewed May 10, 2026

View reviewed changes

Comment thread test/unit/server/infra/render/curate-prompt.test.ts

github-actions Bot reviewed May 10, 2026

View reviewed changes

Comment thread src/agent/resources/tools/curate.txt

danhdoan added 2 commits May 10, 2026 15:41

danhdoan dismissed cuongdo-byterover’s stale review via eb470b5 May 10, 2026 09:03

RyanNg1403 approved these changes May 10, 2026

View reviewed changes

danhdoan merged commit 3bdefe5 into proj/html-mem-conversion May 10, 2026
6 checks passed

danhdoan deleted the feat/ENG-2738 branch May 10, 2026 09:44

danhdoan mentioned this pull request May 10, 2026

[ENG-2739] curate path — HTML writer + format detector + executor wiring #630

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENG-2738] HTML-emission contract in curate.txt + fluency check#628

[ENG-2738] HTML-emission contract in curate.txt + fluency check#628
danhdoan merged 3 commits into
proj/html-mem-conversionfrom
feat/ENG-2738

danhdoan commented May 10, 2026

Uh oh!

github-actions Bot commented May 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

danhdoan commented May 10, 2026

Summary

Type

Scope

Linked issues

Test plan

Evidence

Checklist

Risks

Uh oh!

github-actions Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude PR Review

What's good (praise)

Issues (inline)

Why 4 not 5

What I didn't review

Todos

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude PR Review (round 2)

What's new and good

Issues still open from the prior review

1. <bv-rule> inline-vs-block contradiction — now expanded to 7 elements (was 1)

2. Drift-guard tests are still loose string-match

New issues observed this round

3. Stale test description: "16 entries" should be "19 entries"

4. Comment / PR-description mismatch on fence-wrap rate

5. bv-rule severity enum ordering is inconsistent

6. stripCodeFenceWrapper is exported but not yet wired (intentional, just confirming)

Issues from prior round — status

Why 4 not 5

Todos

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 10, 2026 •

edited

Loading

github-actions Bot commented May 10, 2026 •

edited

Loading

1. `<bv-rule>` inline-vs-block contradiction — now expanded to 7 elements (was 1)

5. `bv-rule` severity enum ordering is inconsistent

6. `stripCodeFenceWrapper` is exported but not yet wired (intentional, just confirming)