fix(docx-core): harden heading detection (#157 Phase 1)#178
Conversation
Three concrete bugs in `read_file(format="json")` heading signals, verified against `tests/test_documents/nvca-regression/source.docx`: 1. False negative on the SPA's center-aligned ALL-CAPS bold title (`SERIES […] PREFERRED STOCK PURCHASE AGREEMENT`). It produced no heading signal because none of the existing detectors targeted centered-title formatting. 2. False positives in `extractHeaderInfo()`: the long-paragraph regex branch returned `title_bare` for body prose that began with a capitalized word, surfacing the leading word as `header_text` (e.g. `Termination` from `Termination of Section 2.2(d)(i) shall not affect …`). 3. False positives in `detectRunInHeader()`: paragraphs whose visible runs were entirely bold/underline (signature labels, defined-term lead-in sentences) were classified as `run_in_header` because the detector never required a real header-prefix → body transition. Why this scope (Phase 1 only — no `is_heading` schema yet): peer review (gemini + codex) flagged that the existing signals were too noisy to bless with a top-level verdict field. Across the 367-paragraph NVCA SPA, the `run_in_header` count drops from 12 (all false positives, each carrying a body-prose `header_text`) to 0 with the transition guard. With the signals now correct, derived `is_heading` / `heading_level` fields can be designed against a stable foundation in a follow-up issue. Changes: - `extractHeaderInfo()`: long-paragraph regex branch returns null when no explicit `.` or `:` terminator was matched. Short-paragraph branch (≤ 50 chars) and the explicit-terminator paths are unchanged — the existing `Governing Law and Venue: …` test still passes. - `detectRunInHeader()`: adds a formatting-transition invariant (`headerCharCount < totalVisibleChars`). Whole-paragraph bold/underline blocks are no longer surfaced as run-in headers; a real bold-prefix → non-bold-body transition still works. - New `detectTitleCapsCentered()`: final fallback for centered, ALL-CAPS, bold standalone titles. Strict gates (no lowercase, no list label, not in a table cell, ≤ 60 chars). Emits `header_style: 'title_caps_centered'`. - Documents the canonical heading-detection precedence rule in `skills/docx-editing/SKILL.md` and adds a brief pointer in `packages/docx-mcp/README.md`. - Regression tests: 5 new branch-coverage tests in `document_view.branch.test.ts` (synthetic XML) and 1 end-to-end JSON-shape test in `nvca_spa_regression.test.ts` against the actual SPA fixture. The Google Docs renderer (`packages/google-docs-core`) is structurally cast to `DocumentViewNode`; its schema is unchanged. Phase 2 design (the deferred `is_heading` / `heading_level` fields) will be tracked in a follow-up issue. Ref: #157
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Peer review of #178 (gemini + codex CLI) and a smoke against ~/Downloads docx fixtures surfaced four false-positive / false-negative cases that slip past the Phase 1 gates. Address all four in this commit: 1. detectTitleCapsCentered: add a content gate so single-token bracketed placeholders ([COMPANY], [___], <NAME>) and underscore-only signature lines no longer classify as titles. Require >= 3 ASCII letters AND >= 2 whitespace-separated word-tokens. (Both Codex and the smoke independently reported real fixture hits.) 2. detectTitleCapsCentered: split the length cap into its own constant MAX_CENTERED_TITLE_LENGTH = 120. Long corporate titles like `AMENDED AND RESTATED CERTIFICATE OF INCORPORATION OF FOO INC.` were rejected by the 60-char body-prose cap, producing real false negatives on the bundled NVCA COI fixture. 3. buildNodesForDocumentView: run detectTitleCapsCentered BEFORE extractHeaderInfo so short ALL-CAPS centered titles like `SCHEDULE OF PURCHASERS` surface as title_caps_centered (matching the documented precedence in SKILL.md) instead of as title_bare. No production consumers depend on the old emission ordering. 4. detectRunInHeader: replace the headerCharCount >= totalVisibleChars guard with an explicit body-text accumulator so trailing-whitespace-only "body" runs no longer defeat the whole-paragraph-bold rejection. Real false positive observed in Downloads/Co-Founder Agreement.docx (`Compensation & Benefits.` + trailing space-only run, all bold). Add five regression tests covering each fix. Update SKILL.md and the mcp README to reflect the new gates and 120-char cap. All tests still pass: docx-core 1147 (1146 + 1 skip), docx-mcp 675.
|
Pushed
Five new branch tests added covering each gate; SKILL.md and the MCP README updated to describe the new gates and the 120-char cap. Test status: docx-core 1146 pass + 1 skip, docx-mcp 675 pass, lint clean (1 pre-existing warning unchanged). Smoke distribution after the fix (sampled from
Confirmed conclusions kept: Review artifacts:
|
Post-merge smoke passedMerged: Steps
Log: |
…#187) (#188) * fix(docx-core): suppress non-sectional false-positive headings (#187) Two cross-fixture residual error classes after PR #178: 1. **Table cells**: `extractHeaderInfo()` ran on every paragraph regardless of `tableContext`, even though `detectTitleCapsCentered()` already explicitly skipped table cells. Bonterms NDA was 13/13 `title_bare` and 10/10 `title_with_colon` table-cell false positives (100%); Mutual NDA was 17/18; Common Paper NDA was 16/17. One-line fix adds the `&& !tableContext` gate to mirror the existing one on the centered-title path. 2. **Signature/field-label clusters**: NVCA SPA paragraphs 312–321 (`COMPANY:` / `By:` / `Name:` / `Title:` / `Address:` / `Email:` / `PURCHASER:` / `By:` / `Name:` / `Title:`) are body-level (not in tables) and were each individually classified as `title_with_colon` or `title_bare`. New `suppressSignatureClusters()` post-pass scans the assembled `nodes` array for runs of ≥4 consecutive paragraphs where ≥75% match a signature-label regex (mixed-case `Name:` form OR all-caps `COMPANY: …` form), then nulls `header_text` / `header_style` / `header_formatting` on every node in the cluster. Why these two specifically: a second peer review (Gemini + Codex CLI) measured 80/251 `title_bare` (32%) and 22/56 `title_with_colon` (39%) in-table false positives across 8 fixtures. None of PR #178's four fixes added a `!tableContext` guard to `extractHeaderInfo`, and the signature blocks live outside tables so the table fix alone wouldn't catch them. Cluster heuristic notes: - No "near document end" anchor (Gemini: MSAs and amended docs have signature blocks throughout). - Doesn't require all to be colon-suffixed (Codex: NVCA's `COMPANY: [Insert Name]` lacks colon-only suffix). - ≥4-paragraph minimum prevents single-label false suppression. Cross-fixture verification (8 fixtures): - Bonterms NDA: 13/13 + 10/10 → 0/0 + 0/0 - Mutual NDA: 16/15 + 0/0 → 1/0 + 0/0 - Common Paper NDA: 15/14 + 0/0 → 1/0 + 0/0 - NVCA SPA signature block (paragraphs 312–321): all `header_style: null` - ILPA LPAs (×2): unchanged at 20/0 + 0/0 — real section headings preserved Tests: 4 new (2 branch-coverage with synthetic XML, 2 end-to-end MCP JSON-shape against actual fixtures). Suites green: docx-core 1220 pass + 1 skip; docx-mcp 726 pass. Out of scope (deferred): - Placeholder/bracket rule for `title_bare` path (waits for footnote marker normalization first). - Confidence tiering. Peer review (Gemini + Codex) pending — see Step 8. Closes #187. * fix(docx-core): preserve adjacent headings + extend table gate to run-in detection Two issues caught by Codex peer review on PR #188: 1. **HIGH: cluster post-pass erased real headings adjacent to signature blocks.** suppressSignatureClusters() suppressed every node covered by any qualifying ≥4-paragraph window with ≥75% match density, including nodes that themselves did NOT match a signature-label regex. Codex reproduced: 4 signature labels followed by a Heading1 `Product` paragraph — the window [0..4] has 4/5 = 80% density, so the Heading1 was silently nulled. This is a production-shape failure because signature pages are routinely followed by the next section heading. Fix: only suppress nodes that themselves match a signature-label regex. The window is the density signal that authorizes suppression; adjacent non-matching nodes are preserved. 2. **NORMAL: detectRunInHeader still ran inside table cells**, despite SKILL.md saying heading detection is suppressed there. A table cell containing bold "Purpose:" plus body text would still emit `header_style: "run_in_header"`. Mirror the existing !tableContext gates already present on detectTitleCapsCentered and extractHeaderInfo. Two new regression tests: - All-caps `COMPANY:` / `PURCHASER:` cluster (NVCA-style mixed prefix + label rows, 9 paragraphs) all suppressed. - Boundary-preservation: 4 signature labels + adjacent Heading1 paragraph — labels suppressed, Heading1 preserved. Suites green: docx-core 1222 pass + 1 skip (was 1220, +2 new tests), docx-mcp 726 unchanged. Ref: #187 * fix(docx-core): suppress dense signature blocks; preserve only Word-styled headings adjacent Codex re-review found the previous fix was too lax. After narrowing suppression to "only nodes that match the signature-label regex," common signature-block shapes like: ACME CORP By: Name: Title: still leaked `ACME CORP` as `header_style: "title_bare"` (heuristic short-paragraph fallback fires; the regex doesn't match all-caps company names without a colon terminator). The right rule is in between the two extremes: - The original "any covering window" suppressed too much (erased adjacent Heading1 paragraphs). - "Only matching nodes" suppressed too little (leaked company names). Refined: inside a cluster window, suppress every node EXCEPT real Word-styled headings (`paragraph_style_id` matching `/^Heading[1-6]$/`). The density signal authorizes blanket suppression; only an explicit Word heading style is strong enough to overrule it. This handles both reproductions: - `By:`/`Name:`/`Title:`/`Address:`/`Product[Heading1]` → labels suppressed, Heading1 preserved (Codex's first repro). - `ACME CORP`/`By:`/`Name:`/`Title:` → ACME CORP suppressed (no Word style; only heuristic title_bare) (Codex's second repro). New regression test asserts the company-name lead-in case. Suites green: docx-core 1223 pass + 1 skip (was 1222, +1 test); docx-mcp 726 unchanged. Out of scope (deferred to a follow-up issue): - Broader regex coverage for `INVESTOR 1:`, `PURCHASER(S):`, `BORROWER[S]:` (digits, parens, brackets). Both reviewers flagged these as nice-to-have; they don't block this fix. Ref: #187 * fix(docx-core): narrow signature-cluster suppression to label-matching nodes Address Gemini + Codex peer review of PR #188 (issue #187 follow-up). Codex empirically reproduced the over-suppression bug: a window meeting the 4-paragraph / 75%-label density gate also covers non-label neighbors, and the previous code cleared `header_*` on every paragraph in the window (saving only `/^Heading[1-6]$/`-styled paragraphs). Empirically confirmed false negatives: - `[By:, Name:, Title:, Address:, Product]` — `Product` was cleared even though it doesn't match any signature-label regex. - `Heading7`, `Title`, `Subtitle`, `Article5L1` (all present in the fixture corpus) classified as `title_bare` when alone but were cleared when adjacent to a 4-label cluster. Fix: in the final clear loop, only mutate paragraphs that themselves satisfy `isSignatureClusterLabel`. The density gate still authorizes suppression of labels inside the window; non-label neighbors keep their detected heading metadata regardless of paragraph style. This obsoletes the `Heading1-6` exact-id escape hatch — real headings are preserved by virtue of not matching the label predicate. Tests: - Rewrite the "ACME CORP lead-in" test to assert the corrected behavior: the three label paragraphs are suppressed; ACME CORP keeps its heuristic `title_bare` classification (it's not a label). - Add a regression test for non-label body neighbor (`Product` after four labels). - Add a regression test for custom-style headings (`Heading7`, `Title`, `Subtitle`, `Article5L1`) adjacent to a signature cluster. - Add a regression test locking in correct WHEREAS-recital handling (no false suppression). Verification: - `npm -w @usejunior/docx-core test`: 1226 passed + 1 skipped (was 1220 + 1 skipped — 6 net new test assertions). - `npm -w @usejunior/docx-mcp test`: 726 passed (unchanged). - `npm run lint:workspaces`: 0 errors (1 pre-existing warning). - Cross-fixture spot check: in-table residual count remains 0 across all 7 fixtures. Some legitimate non-cluster headings now correctly preserved (Letter of Intent went from 6 → 11 residuals, all `Heading1`/`Heading2`-styled real section headings). Refs: peer-review feedback at #188 (comment)
#179) (#190) * feat(docx-core): add derived heading object to DocumentViewNode (#179) Phase 2 of #157: expose a sparse top-level `heading` field on every paragraph node, locking the precedence rule into the wire schema so consumers stop re-implementing it. Why now (after #157 / #178 / #188 detector hardening): the underlying signals are now stable enough to bless. Without this, every consumer re-implements the 5-step precedence ladder from `paragraph_style_id` + `list_metadata.header_style`, and every new heuristic added to the detector silently breaks every consumer's hardcoded `if/else` (as just happened with `title_caps_centered` in #178). Centralizing the verdict in the schema makes future detector changes non-breaking for downstream code. Schema (sparse — field omitted entirely for body paragraphs): heading?: { text: string; source: 'word_style' | 'run_in_header' | 'title_with_period' | 'title_with_colon' | 'title_caps_centered' | 'title_bare'; level: number | null; // 1..6 only when source === 'word_style' } Derivation rules (locked in #179): - `level` is set to 1..6 ONLY when `paragraph_style_id` matches /^Heading([1-6])$/ exactly. Style inheritance does NOT count — `HeadingPara1` / `HeadingPara2` styles in the NVCA fixture inherit from heading styles but behave like body paragraphs (`next: Normal`), so they MUST NOT produce a level. - `level` is `null` for all heuristic sources. - Word built-in heading wins over heuristics when both signals fire. - Field is OMITTED (not `null`) for body paragraphs — keeps payload small and gives a clean jq idiom: `.[] | select(.heading)`. Cross-renderer: the Google Docs renderer (packages/google-docs-core/src/document-view.ts) is structurally cast to DocumentViewNode at packages/docx-mcp/src/tools/gdocs/read_file.ts. Mirrored the derivation there but kept it Word-style-only (Google Docs has built-in heading styles; heuristic detectors do not run on the GDocs path). Bonus typing improvement: narrowed `header_style` from `string | null` to `HeuristicHeadingSource | null` (an exported type union) so consumers get autocomplete on the enum values. Tests: 8 new tests (7 in `document_view.branch.test.ts`, 1 e2e in `nvca_spa_regression.test.ts`): - Heading1 → `{source: 'word_style', level: 1}` - Heading6 → `{source: 'word_style', level: 6}` - HeadingPara1 (basedOn Heading1) → no `heading` key - title_caps_centered → `{source: 'title_caps_centered', level: null}` - run_in_header → `{source: 'run_in_header', level: null}` - Body paragraph → no `heading` key - Precedence: Heading2 + title_with_colon → `{source: 'word_style', level: 2}` - E2E: NVCA SPA title → `{source: 'title_caps_centered', level: null}` Suites: docx-core 1224 pass + 1 skip (was 1220, +4 net new tests after the helper-typing test churn). docx-mcp 733 pass; 1 pre-existing failure in `read_file_comments.test.ts:1028` unchanged from main — unrelated to this PR (introduced by PR #180's comments-namespace work). Documentation: SKILL.md and packages/docx-mcp/README.md updated to recommend `node.heading != null` as the canonical "is this a heading" check; raw `header_style` becomes the per-detector explanation layer. GDocs asymmetry documented explicitly. Peer review (Gemini + Codex) pending — see Step 8. Closes #179. * fix(docx-core): suppress heuristic heading promotion in table cells (peer review #190) Both Gemini and Codex independently flagged that heuristic detectors (run_in_header, title_with_period, title_with_colon, title_bare) still fired inside table cells, so the new canonical predicate `node.heading != null` was promoting ordinary label/value cell content ("Notice Address:", "Closing.", "Indemnification. ...") into structural headings. word_style still fires inside cells (legitimate Word Heading1..6 inside a table is still a heading). Other peer-review follow-ups: - Document HeadingValue.text semantic split (word_style = whole para, heuristic sources = extracted prefix) in JSDoc. - Export HeuristicHeadingSource so external consumers can name the narrowed ListMetadata.header_style union. - Fix CachedParagraph.paragraphId JSDoc — it stores namedStyleType (HEADING_1, NORMAL_TEXT), not the Google native paragraph ID. - Add GDocs renderer tests covering HEADING_n normalization, NORMAL_TEXT/TITLE/SUBTITLE rejection, and table-cell word_style. - Add docx-core regression tests for title_with_period, title_bare, table-cell suppression, and word_style-in-cell still firing. Test deltas: docx-core 1224 → 1227, google-docs-core 113 → 116. Pre-existing read_file_comments.test.ts:1028 failure unchanged.
Summary
Phase 1 of #157: tighten the existing heading-detection signals in
read_file(format=\"json\"), document the canonical precedence rule, and add a centered-title detector. Phase 2 (derivedis_heading/heading_levelschema fields) is deferred to a follow-up issue.Three concrete bugs fixed against
tests/test_documents/nvca-regression/source.docx:SERIES […] PREFERRED STOCK PURCHASE AGREEMENT) now surfaces asheader_style: 'title_caps_centered'instead of producing a null heading signal.extractHeaderInfo(): the long-paragraph regex branch no longer returnstitle_barewhen no explicit.or:terminator is matched — body prose likeTermination of Section 2.2(d)(i) shall not affect …andExcept as described in Section 2.2(d)(i), …no longer surface a leading-wordheader_text.detectRunInHeader(): a formatting-transition invariant (headerCharCount < totalVisibleChars) ensures the bold/underline prefix is followed by non-header body. Whole-paragraph bold/underline blocks (signature labels, defined-term lead-in sentences) no longer get classified as run-in headers.Smoke results across the 367-paragraph NVCA SPA
run_in_headerheader_text= whole body prose)title_with_periodtitle_with_colonrun_in_headermis-classification)title_baretitle_caps_centeredWhy Phase 1 only (not Phase 2)
Peer-reviewed with both Gemini and Codex CLI before implementation. Codex flagged that:
run_in_headerentries on main are exactly that noise).headerstring field is wired into toon header-stripping (document_view.ts:363) and grep locators (packages/docx-mcp/src/tools/grep.ts:56) — repurposing it would be more breaking than adding new fields later.DocumentViewNodeatpackages/docx-mcp/src/tools/gdocs/read_file.ts:21, so any new top-level field is immediately a cross-renderer contract.Phase 1 lands the cleaner signals and the documented precedence rule first; the Phase 2 schema (
heading: null | { text, source, level }per Codex's preference, vs. nestedheading: { is_heading, tier, level, source, source_detail }per Gemini's) will be designed against stable signals in a follow-up issue.Documentation
skills/docx-editing/SKILL.md— new "Heading Detection (read_file JSON output)" subsection with the canonical precedence table.packages/docx-mcp/README.md— brief paragraph documentinglist_metadata.header_text/header_style(previously undocumented), pointing to SKILL.md.Out of scope (tracked in follow-up)
is_heading/heading_levelschema work.Heading 1–Heading 6style detection (e.g., theHeadingPara1/HeadingPara2footgun in the NVCA fixture where styles inherit from heading styles but behave like body).Test plan
npm -w @usejunior/docx-core test— 1137 pass, 1 skipped, 0 fail (5 new branch tests added).npm -w @usejunior/docx-mcp test— 675 pass, 0 fail (1 new end-to-end JSON-shape test against the SPA fixture).npm run lint:workspaces— 0 errors (1 pre-existing warning inedit.test.tsunchanged).title_caps_centered,Termination of Section …andExcept as described in Section …no longer surface a non-nullheader_text, all 12 main-branchrun_in_headerfalse positives suppressed.Governing Law and Venue: …style inline headers still detected astitle_with_colon(no regression indocument_view.branch.test.ts:150).run_in_headerstill fires on legitimate cases.Ref: #157