Skip to content

[ENG-2738] HTML-emission contract in curate.txt + fluency check#628

Merged
danhdoan merged 3 commits into
proj/html-mem-conversionfrom
feat/ENG-2738
May 10, 2026
Merged

[ENG-2738] HTML-emission contract in curate.txt + fluency check#628
danhdoan merged 3 commits into
proj/html-mem-conversionfrom
feat/ENG-2738

Conversation

@danhdoan
Copy link
Copy Markdown
Collaborator

Summary

  • Rewrite src/agent/resources/tools/curate.txt (the canonical curate tool description loaded by ToolDescriptionLoader) from structured JSON args (rawConcept/narrative/facts) to the M1 HTML-emission contract using the closed <bv-*> vocabulary from T1.
  • Cover all 5 elements with per-element schemas, value enums, allowed-children semantics, pairing conventions, output contract, and three worked examples.
  • Add a sanity unit test that asserts every ELEMENT_NAME, every per-element enum value, and every output-contract directive is present in the prompt — guards against drift when M2 expands the vocabulary.

Type

feat (M1 T2 — authoring fluency check)

Scope

  • src/agent/resources/tools/curate.txt — wholesale rewrite to HTML emission.
  • test/unit/server/infra/render/curate-prompt.test.ts — sanity drift guards (11 cases).

Linked issues

Test plan

  • All 121 render unit tests pass (11 new on the prompt sanity, 110 inherited from T1).
  • Typecheck clean (0 errors).
  • Build succeeds.
  • Authoring fluency check executed against the new prompt — full report in research repo (separate harness at local-auto-test/curate-fluency/, intentionally not in this repo per spec — "this is a spike, not production code").

Evidence

Fluency gate cleared on first run, after one prompt-tuning iteration.

Metric Value
Pass rate 20 / 20 (100%) — gate threshold ≥ 18 / 20
Model claude-sonnet-4-5-20250929
Per-category bug-fix 8/8, rule-decision 6/6, general 6/6
Cost $0.31

Per-fixture verdicts and a non-gate-blocking observation about Sonnet's tendency to fence-wrap structured output (50% of outputs; recommendation is to strip in T3's response parser, not to iterate the prompt further) are captured in the formal report:
research/features/html-memory-conversion/milestones/01-experiment/fluency-report.md

Checklist

  • Code passes lint + typecheck.
  • Unit tests added for the new prompt's invariants.
  • Fluency report published with honest reporting (no result-tuning).
  • No Co-Authored-By Claude.

Risks

The live curate flow on proj/html-mem-conversion will misbehave from this PR until T3 lands — the agent will emit HTML rather than calling tools.curate() with structured args, and the executor's parseCurationStatus() will not recognize the response shape. This is intentional and bounded to the project branch. T3 (ENG-273x) wires the executor + writer to consume the HTML output. Production (main) is unaffected.

Rewrite the canonical curate tool description (loaded by ToolDescriptionLoader
and presented to the agent on every curate operation) from the structured
JSON-args shape (rawConcept/narrative/facts/etc.) to the M1 HTML-emission
contract using the closed `<bv-*>` vocabulary defined by T1.

Coverage:
- All 5 elements (bv-topic, bv-rule, bv-decision, bv-bug, bv-fix) with full
  per-element attribute schemas: required/optional, value enums, format
  constraints (importance integer 0-100, recency numeric 0-1, ISO-8601
  updatedat, severity enums per element, etc.).
- Allowed-children semantics (any/inline/block) + pairing convention
  (bug+fix as siblings).
- Path format (slash-separated snake_case) + domain guidelines.
- Output contract: HTML only, exactly one bv-topic root, no preamble/fences/
  commentary, lowercase attribute names per HTML5 spec, closed vocabulary,
  no invented attributes, no clarifying questions.
- Three worked examples (bug+fix runbook, rule+decision pair, general topic)
  to anchor the model's element-pairing and id-naming conventions.

Add a sanity test (test/unit/server/infra/render/curate-prompt.test.ts) that
loads curate.txt and asserts every registered ELEMENT_NAME, every per-element
enum value, and every output-contract directive is present. Guards against
silent drift when M2 expands the vocabulary or refactors attribute schemas.

Authoring fluency check (M1 T2 spike, separate harness in
local-auto-test/curate-fluency/, not in this repo): 20/20 generated outputs
validated against the M1 element registry on Sonnet 4.5. Gate cleared on
first full run after one prompt-tuning iteration. Decision: proceed to T3.
Full report: research repo at features/html-memory-conversion/milestones/01-experiment/fluency-report.md
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 10, 2026

Claude finished @danhdoan's task in 3m 58s —— View job


Claude PR Review

Overall: 4 / 5 — solid M1 contract rewrite. The HTML emission spec is well-scoped, internally near-consistent, and the registry/prompt alignment is tight. One real ambiguity in the prompt (<bv-rule> inline-vs-block contradiction) and a weaker-than-advertised drift test prevent a 5; both are cheap to fix in this PR rather than punting to T3.

What's good (praise)

  • Outside-In framing is correct. The prompt is written for the consumer (the LLM emitting curate output); element semantics, examples, and pairing conventions are all derived from concrete need rather than designed in isolation. Matches CLAUDE.md's Feature Development principle.
  • Registry / prompt alignment is exact. Every ELEMENT_NAMES entry, every per-element optional/required attribute, and every Zod enum (maturity, severity × 2) is faithfully mirrored in the prompt. The "single source of truth" claim in registry.ts:10–20 survives this PR.
  • Output contract is unambiguous on the load-bearing rules — first-char/last-chars, no fences, no preamble, exactly one <bv-topic> root, lowercase attributes (with the HTML5 normalization rationale inline), no clarifying questions. These are the rules T3's parser needs to assume.
  • Examples cover the three intended shapes (bug+fix runbook, rule+decision pair, general project context) — good fluency-budget choice for first-run prompt tuning.
  • Honest evidence reporting. Fluency report is filed with verdicts, the 50%-fence-wrap observation is surfaced rather than buried, and the "fix in T3 not in prompt" recommendation is reasoned (even if I'd push back — see below).

Issues (inline)

  1. <bv-rule> inline-vs-block contradiction (curate.txt:53–81) — bv-rule declares inline children, but the "Standard HTML" section blanket-permits block tags inside any <bv-*>. Inline comment with the proposed split.
  2. Drift-guard tests don't bind enums to elements (curate-prompt.test.ts:51–70) — it('lists severity enum values for bv-rule') passes if "must" exists anywhere in the prompt; it doesn't verify the value is in the <bv-rule> section. A swap between bv-bug and bv-decision severity wouldn't fail. Inline comment proposes section-anchored regex + a stronger test that parses the prompt's example blocks through the registry validators.
  3. Code fences in examples may be a fluency-rate cause (curate.txt:99–145) — non-blocking, but worth a one-line caveat or switching to indented blocks before relying on T3's parser to clean up.
  4. Path-format prompt vs schema (curate.txt:28) — nit; flagging only to confirm the prompt-tighter-than-validator gap is intentional.

Why 4 not 5

  • Issue 1 is a real ambiguity that an M2 strict validator will reject in one direction or the other — better to settle it now while the prompt is fresh than to retune after T3 lands.
  • Issue 2 is the one that makes me hesitate most: the test file's preamble explicitly promises drift-failure, but a future PR moving severity values across elements wouldn't trip the guard.

What I didn't review

  • Did not run the test suite or build locally (review-only context).
  • Did not read the fluency-report harness in the research repo (out-of-scope per spec — "spike, not production code").
  • Did not validate the live curate flow on proj/html-mem-conversion — the PR description correctly flags this is intentionally broken until T3.

Todos

  • Read changed files (curate.txt, curate-prompt.test.ts)
  • Compare against base proj/html-mem-conversion
  • Cross-reference registry / Zod schemas / parser
  • Post inline comments (4)
  • Post summary with score
    · Branch

Comment thread src/agent/resources/tools/curate.txt Outdated
Comment thread src/agent/resources/tools/curate.txt
Comment thread test/unit/server/infra/render/curate-prompt.test.ts
Comment thread src/agent/resources/tools/curate.txt
danhdoan added 2 commits May 10, 2026 15:41
Field-by-field review of curate.txt against the actual rendered .md
context files (frontmatter + ## Reason + ## Raw Concept + ## Narrative
+ ## Facts) surfaced two issues with the initial T2 draft:

1. <bv-topic> carried runtime-signal attributes (importance, maturity,
   recency, updatedat) that were explicitly migrated to a sidecar store
   under research/features/runtime-signals/ (per-user, per-machine state
   that caused noisy `vc status` diffs and team-share conflicts). Drop
   them from the schema; replace with the actual frontmatter fields the
   markdown writer renders: title (required), summary, tags, keywords,
   related.

2. The 5-element vocabulary covered Rules + three M1 net-new elements
   (Decision/Bug/Fix) but left every other rendered MD section as
   free-form HTML — losing the structural mapping the writer needs.

Expand M1 vocabulary to 16 elements (added 11), each mapping to a
specific section in the rendered .md output:

  - bv-reason       → ## Reason
  - bv-task,        → ## Raw Concept (Task / Changes / Files / Flow)
    bv-changes,
    bv-files,
    bv-flow
  - bv-structure,   → ## Narrative (Structure / Dependencies /
    bv-dependencies,                Highlights / Examples / Diagrams);
    bv-highlights,                  bv-rule already covered Rules
    bv-examples,
    bv-diagram (with type + title attrs for verbatim diagram preservation)
  - bv-fact (subject/category/value attrs) → ## Facts list

Each element follows the existing makeAttributeValidator pattern; the
data-driven registry stays the single source of truth.

Tests:
- Update bv-topic test to drop runtime-signal cases and assert the
  new required-attributes (path + title); add a regression case
  asserting passthrough still tolerates legacy importance/maturity/
  recency without enforcing them.
- Add registry test asserting (a) bv-topic exposes the new
  optionalAttributes, (b) runtime signals are NOT in the registry
  metadata, (c) every registered validator accepts its own empty node
  + rejects mismatched-tag.
- Consolidate the 9 attribute-free new elements into one shared
  text-only-elements.test.ts (same trivial schema; per-element
  duplication wouldn't add coverage).
- Add dedicated bv-diagram.test.ts (type enum) and bv-fact.test.ts
  (category enum, subject/value pass-through).
- Update sample-topic.html fixture to exercise every new element +
  the new frontmatter attributes; assert no runtime signals leak in.
- Update curate-prompt.test.ts to assert the expanded vocabulary,
  category/type enums for bv-fact/bv-diagram, and that the prompt
  explicitly excludes runtime signals from bv-topic attributes.

Re-run the M1 T2 fluency check on the same 20 fixtures with the new
vocabulary: 20 / 20 valid (gate cleared); 100% cohort-appropriate
placement (bug+fix → bv-bug+bv-fix; rule+decision → bv-rule+bv-decision;
general → bv-reason); 122 bv-fact instances extracted across the run
(~6 per fixture). Fluency report updated:
research/features/html-memory-conversion/milestones/01-experiment/fluency-report.md

Open contract violation persists: 18/20 outputs (90%) wrap in code
fences despite explicit prompt instruction. Recommendation for T3
unchanged — strip in the response parser, don't iterate the prompt.
Second writer audit (markdown-writer.ts:270-354) found 3 rawConcept
sub-fields without an HTML home: rawConcept.timestamp, rawConcept.author,
rawConcept.patterns. Add the missing elements so the M1 vocabulary
covers every section the writer renders.

Vocabulary additions (16 → 19 elements):

- <bv-timestamp> — text-only; renders as `**Timestamp:**` under
  `## Raw Concept`. Distinct from frontmatter createdAt/updatedAt
  (system-set) — this is the date the concept's data represents.
- <bv-author> — text-only; renders as `**Author:**`.
- <bv-pattern> — structured (flags + description as attrs, pattern
  as text content). Multiple siblings inside <bv-topic> collected
  into the `**Patterns:**` bullet list. Mirrors how <bv-fact> works.

Sanitiser:

- stripCodeFenceWrapper(html) added to html-parser.ts. Strips a single
  outer ``` ... ``` (any language tag) wrapper from the LLM response.
  Inner fences (e.g., <pre><code> blocks inside <bv-diagram>) survive.
  T3's curate executor calls this before parseHtml. Defensive
  sanitisation generalises better than chasing the model's quirk via
  prompt iteration — observed fence-wrap rate is 70% on Sonnet 4.5
  even with explicit prompt instruction to the contrary.

Tests:

- bv-pattern dedicated test (bv-pattern.test.ts) for the structured
  attrs.
- text-only-elements.test.ts extended for bv-timestamp + bv-author
  (same trivial schema as the other text-only elements).
- html-parser.test.ts: 7 new cases for stripCodeFenceWrapper covering
  ```html, ```xml, no-language ```, leading/trailing whitespace,
  unwrapped passthrough, mismatched-fence passthrough, inner-fence
  preservation, and parse-after-strip end-to-end.
- sample-topic-roundtrip.test.ts: renderable-MD coverage assertion
  extended to require all 15 dedicated body-section elements (was 12).
- sample-topic.html fixture exercises the 3 new elements.

Re-ran the M1 T2 fluency check on the same 20 fixtures with the
19-element vocabulary. Results:
- 20 / 20 valid (gate cleared)
- 100% cohort-appropriate placement (unchanged)
- bv-timestamp 60% adoption (used where input has a clear date)
- bv-author 10% (used only on inputs that name an owner)
- bv-pattern 0% — model correctly did not over-emit; no fixture
  contained regex content. Restraint is good behaviour.
- Cost: $0.61 (vs. $0.60 for the 16-element draft; negligible)

Updated fluency-report.md in research repo with final coverage matrix
and the 19-element → markdown-writer mapping for T3.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 10, 2026

Claude finished @danhdoan's task in 4m 28s —— View job


Claude PR Review (round 2)

Overall: 4 / 5 — the two new commits (1be1dbd vocabulary expansion, eb470b5 writer-section coverage + fence sanitiser) substantially strengthen the M1 contract: 14 net-new bv-* elements with per-element schemas, validators, fixtures, and tests; a defensive fence sanitiser that quietly buys T3 a much cleaner parsing surface. Strong outside-in execution.

The score doesn't move from 4 because the original review's two load-bearing issues are still open — and one of them (the inline-vs-block contradiction) has actually expanded from 1 element to 7 with the vocabulary growth. Both are still cheap to fix in this PR.

What's new and good

  • Vocabulary now covers the rendered MD structure 1:1. Frontmatter (bv-topic) + Reason + Raw Concept (task / changes / files / flow / timestamp / author / pattern) + Narrative (structure / dependencies / highlights / rules / examples / diagrams) + Facts + Decision/Bug/Fix. The "writer doesn't need a switch on element name" promise (registry.ts:30–34) holds — every section maps to a registry entry.
  • stripCodeFenceWrapper is the right move. Defensive parsing > fighting Sonnet's quirk through prompt iteration. The implementation only strips a single outer fence, preserves inner <pre><code> (html-parser.test.ts:258–267), and the regex is conservative enough that a partial / unmatched fence is left alone (:253–256). Eight test cases — solid coverage.
  • Fixture exercises every M1 element at least once (sample-topic.html), and the round-trip test enforces it (sample-topic-roundtrip.test.ts:45–51). This is the right place to catch silent vocabulary breakage.
  • Schemas for new structured elements are tight where they should be: bv-fact.category (7-value enum), bv-diagram.type (6-value enum), bv-rule.severity and bv-bug.severity keep their distinct enums without cross-contamination.
  • Net-new test files for the structured-attribute elements (bv-fact.test.ts, bv-pattern.test.ts, bv-diagram.test.ts) plus a clean parameterised file for the 11 text-only elements (text-only-elements.test.ts) — good test ergonomics.

Issues still open from the prior review

1. <bv-rule> inline-vs-block contradiction — now expanded to 7 elements (was 1)

The original Issue 1 was not addressed and the vocabulary expansion has made it worse. The registry now declares allowedChildren: 'inline' for seven elements: bv-rule, bv-flow, bv-pattern, bv-fact, bv-task, bv-author, bv-timestamp (registry.ts:42, 113, 145, 165, 188, 208, 218). But the prompt's "Standard HTML inside <bv-*> elements" clause (curate.txt:146–150) still blanket-permits block tags (p, ul, ol, li, h1h6, pre) inside any <bv-*>.

Worse, the new "Detail-preservation" section actively contradicts the inline declaration:

  • curate.txt:160–161 — "preserve original numbering in <ol> inside <bv-flow> or <bv-rule>" — but both elements are 'inline' in the registry.

Examples in the prompt and fixture currently honor inline (line 15 of fixture: bare text in <bv-flow>; lines 24–26: bare text in <bv-rule>). M1 passthrough validators don't reject the alternative form, so the LLM picks unpredictably; ADR-007 §13 strict validation in M2 will reject one of the two and you'll have to re-tune.

Fix now — make the standard-HTML clause per-element-aware. Either:

  • Split the clause into two: "Inline content (for bv-rule, bv-flow, bv-fact, bv-pattern, bv-task, bv-author, bv-timestamp): code, strong, em. Block content (for bv-decision, bv-bug, bv-fix, bv-changes, bv-files, bv-reason, bv-structure, bv-dependencies, bv-highlights, bv-examples, bv-diagram, bv-topic): the full list above plus inline."
  • And remove "or <bv-rule>" / "<bv-flow>" from the <ol> guidance at line 161 (or move them to the block list, but then update the registry too).

Fix this →

2. Drift-guard tests are still loose string-match

The lines flagged in the prior review (curate-prompt.test.ts:60–86) are unchanged. Each enum check still passes if the value exists anywhere in the prompt, not anchored to the relevant element's section. The new declares each registered element somewhere with an explanatory blurb test (:88–103) is a step forward — it verifies context exists near each element name — but it doesn't bind enum values to elements.

A future PR that swaps severity="critical" from <bv-bug> to <bv-rule>, or moves category="convention" off <bv-fact> to a new element, still passes all current tests. The file's preamble explicitly promises this drift class will fail loudly (:6–9) — and currently it doesn't.

The strongest fix (still the suggestion from the previous round): parse the prompt's three example HTML blocks with parseHtml from src/server/infra/render/reader/html-parser.ts and run each element through its registry validator. That catches both vocabulary drift AND drift in the examples themselves (e.g. severity="hihg" typo). Bonus value now: with the fixture covering every element, the same approach works directly on test/fixtures/render/sample-topic.html and would catch any registry/prompt drift.

Fix this →

New issues observed this round

3. Stale test description: "16 entries" should be "19 entries"

test/unit/server/infra/render/elements/registry.test.ts:23 says it('contains exactly the M1 vocabulary (16 entries)', () => { but ELEMENT_NAMES now has 19 entries. The assertion itself is fine (uses ELEMENT_NAMES.length), but the description is stale and will mislead future readers.

4. Comment / PR-description mismatch on fence-wrap rate

html-parser.ts:32 says "observed at ~90% on M1's authoring fluency check"; the PR description says "50% of outputs". Either the fluency report has been re-run with a higher rate (worth noting in the PR body), or one of these numbers is wrong. The 90% figure is what's now baked into the source code and will shape future maintainers' read of the sanitiser's importance.

5. bv-rule severity enum ordering is inconsistent
  • Prompt (curate.txt:108): "info", "should", "must" (logical severity ascending)
  • Schema (bv-rule/schema.ts): ['info', 'must', 'should'] (alphabetical)
  • Registry description (registry.ts:191): (info|must|should) (alphabetical)

Both are accepted by Zod, so this is non-blocking, but keeping them consistent (preferably the logical-severity ordering, which reads better in docs) avoids reader confusion.

6. stripCodeFenceWrapper is exported but not yet wired (intentional, just confirming)

parseHtml does not call stripCodeFenceWrapper. The PR description's risk note ("T3 wires the executor + writer to consume the HTML output") plus the function's design (separate utility, conservative regex) suggests the wiring is deferred to T3 — fine, but worth ensuring there's a tracked todo/issue so the sanitiser is actually invoked before parseHtml in the response path.

Issues from prior round — status

  • Issue 1 — bv-rule inline vs block — ❌ NOT RESOLVED, expanded to 7 elements (see issue Feat/init #1 above)
  • Issue 2 — drift-guard test efficacy — ❌ NOT RESOLVED (see issue Feat/refactor structure #2 above)
  • Issue 3 — code fences in examples — ✅ MITIGATED via stripCodeFenceWrapper in parser. The prompt's examples are still fenced; if the team wants belt-and-suspenders, also strip fences from the prompt's example blocks. Non-blocking.
  • Issue 4 — path-format prompt vs schema — ❌ NOT addressed (nit, intentional)

Why 4 not 5

The same reason as round 1: Issue #1 is a real ambiguity that an M2 strict validator will reject, and Issue #2 is a test that publicly promises drift-failure but doesn't deliver. Both are 10-15 minute fixes. The vocabulary expansion + fence sanitiser are excellent additions and worth a 5 in isolation, but they don't replace fixing the open issues — they make Issue #1 worse by widening its scope.

Todos

  • Read PR diff vs origin/proj/html-mem-conversion
  • Re-check the four issues from the prior review
  • Read new vocabulary additions (registry, schemas, validators)
  • Read fence-sanitiser change (parser)
  • Validate per-element schemas + tests
  • Post review
    · Branch

@danhdoan danhdoan merged commit 3bdefe5 into proj/html-mem-conversion May 10, 2026
6 checks passed
@danhdoan danhdoan deleted the feat/ENG-2738 branch May 10, 2026 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants