Skip to content

fix(docx-core): harden heading detection (#157 Phase 1)#178

Merged
stevenobiajulu merged 2 commits into
mainfrom
157-heading-detection-20260505
May 10, 2026
Merged

fix(docx-core): harden heading detection (#157 Phase 1)#178
stevenobiajulu merged 2 commits into
mainfrom
157-heading-detection-20260505

Conversation

@stevenobiajulu
Copy link
Copy Markdown
Member

Summary

Phase 1 of #157: tighten the existing heading-detection signals in read_file(format=\"json\"), document the canonical precedence rule, and add a centered-title detector. Phase 2 (derived is_heading / heading_level schema fields) is deferred to a follow-up issue.

Three concrete bugs fixed against tests/test_documents/nvca-regression/source.docx:

  1. False negative: the SPA's center-aligned, ALL-CAPS, bold title (SERIES […] PREFERRED STOCK PURCHASE AGREEMENT) now surfaces as header_style: 'title_caps_centered' instead of producing a null heading signal.
  2. False positives in extractHeaderInfo(): the long-paragraph regex branch no longer returns title_bare when no explicit . or : terminator is matched — body prose like Termination of Section 2.2(d)(i) shall not affect … and Except as described in Section 2.2(d)(i), … no longer surface a leading-word header_text.
  3. False positives in detectRunInHeader(): a formatting-transition invariant (headerCharCount < totalVisibleChars) ensures the bold/underline prefix is followed by non-header body. Whole-paragraph bold/underline blocks (signature labels, defined-term lead-in sentences) no longer get classified as run-in headers.

Smoke results across the 367-paragraph NVCA SPA

Header style Main This PR Net
run_in_header 12 (all false positives, header_text = whole body prose) 0 -12 false positives
title_with_period 12 12 unchanged
title_with_colon 9 12 +3 (recovered from former run_in_header mis-classification)
title_bare 90 88 -2 (Termination/Except now correctly null)
title_caps_centered 0 3 +3 NEW true positives (SPA title + 2 others)

Why Phase 1 only (not Phase 2)

Peer-reviewed with both Gemini and Codex CLI before implementation. Codex flagged that:

  • The existing signals are too noisy to bless with a top-level verdict field today (the 12 false-positive run_in_header entries on main are exactly that noise).
  • The existing header string field is wired into toon header-stripping (document_view.ts:363) and grep locators (packages/docx-mcp/src/tools/grep.ts:56) — repurposing it would be more breaking than adding new fields later.
  • The Google Docs renderer is structurally cast to DocumentViewNode at packages/docx-mcp/src/tools/gdocs/read_file.ts:21, so any new top-level field is immediately a cross-renderer contract.

Phase 1 lands the cleaner signals and the documented precedence rule first; the Phase 2 schema (heading: null | { text, source, level } per Codex's preference, vs. nested heading: { is_heading, tier, level, source, source_detail } per Gemini's) will be designed against stable signals in a follow-up issue.

Documentation

  • skills/docx-editing/SKILL.md — new "Heading Detection (read_file JSON output)" subsection with the canonical precedence table.
  • packages/docx-mcp/README.md — brief paragraph documenting list_metadata.header_text / header_style (previously undocumented), pointing to SKILL.md.

Out of scope (tracked in follow-up)

  • Top-level is_heading / heading_level schema work.
  • Refining Heading 1Heading 6 style detection (e.g., the HeadingPara1 / HeadingPara2 footgun in the NVCA fixture where styles inherit from heading styles but behave like body).
  • Mirroring on the Google Docs renderer (its structural cast already auto-mirrors any new top-level fields, so no Phase 1 changes break it).

Test plan

  • npm -w @usejunior/docx-core test — 1137 pass, 1 skipped, 0 fail (5 new branch tests added).
  • npm -w @usejunior/docx-mcp test — 675 pass, 0 fail (1 new end-to-end JSON-shape test against the SPA fixture).
  • npm run lint:workspaces — 0 errors (1 pre-existing warning in edit.test.ts unchanged).
  • Manual smoke against the actual NVCA SPA fixture: SPA title detected as title_caps_centered, Termination of Section … and Except as described in Section … no longer surface a non-null header_text, all 12 main-branch run_in_header false positives suppressed.
  • Governing Law and Venue: … style inline headers still detected as title_with_colon (no regression in document_view.branch.test.ts:150).
  • New regression test for a real bold-prefix + body-suffix paragraph confirms run_in_header still fires on legitimate cases.

Ref: #157

Three concrete bugs in `read_file(format="json")` heading signals,
verified against `tests/test_documents/nvca-regression/source.docx`:

1. False negative on the SPA's center-aligned ALL-CAPS bold title
   (`SERIES […] PREFERRED STOCK PURCHASE AGREEMENT`). It produced
   no heading signal because none of the existing detectors targeted
   centered-title formatting.
2. False positives in `extractHeaderInfo()`: the long-paragraph regex
   branch returned `title_bare` for body prose that began with a
   capitalized word, surfacing the leading word as `header_text`
   (e.g. `Termination` from `Termination of Section 2.2(d)(i) shall not
   affect …`).
3. False positives in `detectRunInHeader()`: paragraphs whose visible
   runs were entirely bold/underline (signature labels, defined-term
   lead-in sentences) were classified as `run_in_header` because the
   detector never required a real header-prefix → body transition.

Why this scope (Phase 1 only — no `is_heading` schema yet): peer
review (gemini + codex) flagged that the existing signals were too
noisy to bless with a top-level verdict field. Across the 367-paragraph
NVCA SPA, the `run_in_header` count drops from 12 (all false positives,
each carrying a body-prose `header_text`) to 0 with the transition
guard. With the signals now correct, derived `is_heading` /
`heading_level` fields can be designed against a stable foundation in
a follow-up issue.

Changes:
- `extractHeaderInfo()`: long-paragraph regex branch returns null
  when no explicit `.` or `:` terminator was matched. Short-paragraph
  branch (≤ 50 chars) and the explicit-terminator paths are unchanged
  — the existing `Governing Law and Venue: …` test still passes.
- `detectRunInHeader()`: adds a formatting-transition invariant
  (`headerCharCount < totalVisibleChars`). Whole-paragraph
  bold/underline blocks are no longer surfaced as run-in headers; a
  real bold-prefix → non-bold-body transition still works.
- New `detectTitleCapsCentered()`: final fallback for centered,
  ALL-CAPS, bold standalone titles. Strict gates (no lowercase, no
  list label, not in a table cell, ≤ 60 chars). Emits
  `header_style: 'title_caps_centered'`.
- Documents the canonical heading-detection precedence rule in
  `skills/docx-editing/SKILL.md` and adds a brief pointer in
  `packages/docx-mcp/README.md`.
- Regression tests: 5 new branch-coverage tests in
  `document_view.branch.test.ts` (synthetic XML) and 1 end-to-end
  JSON-shape test in `nvca_spa_regression.test.ts` against the actual
  SPA fixture.

The Google Docs renderer (`packages/google-docs-core`) is structurally
cast to `DocumentViewNode`; its schema is unchanged. Phase 2 design
(the deferred `is_heading` / `heading_level` fields) will be tracked
in a follow-up issue.

Ref: #157
@vercel
Copy link
Copy Markdown

vercel Bot commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
site Ready Ready Preview, Comment May 10, 2026 8:13pm

Request Review

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 86.66667% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
packages/docx-core/src/primitives/document_view.ts 86.66% 6 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

Peer review of #178 (gemini + codex CLI) and a smoke against ~/Downloads
docx fixtures surfaced four false-positive / false-negative cases that
slip past the Phase 1 gates. Address all four in this commit:

1. detectTitleCapsCentered: add a content gate so single-token bracketed
   placeholders ([COMPANY], [___], <NAME>) and underscore-only signature
   lines no longer classify as titles. Require >= 3 ASCII letters AND
   >= 2 whitespace-separated word-tokens. (Both Codex and the smoke
   independently reported real fixture hits.)

2. detectTitleCapsCentered: split the length cap into its own constant
   MAX_CENTERED_TITLE_LENGTH = 120. Long corporate titles like
   `AMENDED AND RESTATED CERTIFICATE OF INCORPORATION OF FOO INC.` were
   rejected by the 60-char body-prose cap, producing real false negatives
   on the bundled NVCA COI fixture.

3. buildNodesForDocumentView: run detectTitleCapsCentered BEFORE
   extractHeaderInfo so short ALL-CAPS centered titles like
   `SCHEDULE OF PURCHASERS` surface as title_caps_centered (matching
   the documented precedence in SKILL.md) instead of as title_bare.
   No production consumers depend on the old emission ordering.

4. detectRunInHeader: replace the headerCharCount >= totalVisibleChars
   guard with an explicit body-text accumulator so trailing-whitespace-only
   "body" runs no longer defeat the whole-paragraph-bold rejection.
   Real false positive observed in Downloads/Co-Founder Agreement.docx
   (`Compensation & Benefits.` + trailing space-only run, all bold).

Add five regression tests covering each fix. Update SKILL.md and the
mcp README to reflect the new gates and 120-char cap.

All tests still pass: docx-core 1147 (1146 + 1 skip), docx-mcp 675.
@stevenobiajulu
Copy link
Copy Markdown
Member Author

Pushed 1672f55 addressing all four findings from the gemini + codex peer review and the local smoke against ~/Downloads:

  1. detectTitleCapsCentered content gate — require ≥ 3 ASCII letters AND ≥ 2 word-tokens. Rejects [COMPANY] placeholders (smoke) and underscore signature lines like ____________________________________________ (Codex flagged on bundled ILPA fixtures).
  2. MAX_CENTERED_TITLE_LENGTH = 120 (split from MAX_HEADER_TEXT_LENGTH). Lets long titles like AMENDED AND RESTATED CERTIFICATE OF INCORPORATION OF FOO HOLDINGS INC. classify (Codex flagged the NVCA COI fixture being silently rejected at 60).
  3. Reorder: detectTitleCapsCentered before extractHeaderInfo — short ALL-CAPS centered titles like SCHEDULE OF PURCHASERS now surface as title_caps_centered instead of title_bare, matching the documented precedence in SKILL.md. No production consumers depend on the old emission ordering (only the emitter itself references the enum values).
  4. detectRunInHeader whole-paragraph guard, take 2 — replaced headerCharCount >= totalVisibleChars with an explicit body-text accumulator that requires non-whitespace body. The old guard failed open when trailing whitespace lived in a separate non-bold run (real false positive observed in Co-Founder Agreement.docx for Compensation & Benefits. — fully bold + trailing space-only run).

Five new branch tests added covering each gate; SKILL.md and the MCP README updated to describe the new gates and the 120-char cap.

Test status: docx-core 1146 pass + 1 skip, docx-mcp 675 pass, lint clean (1 pre-existing warning unchanged).

Smoke distribution after the fix (sampled from ~/Downloads):

File run_in_header title_caps_centered
Co-Founder Agreement 5 → 4 1 ([COMPANY]) → 1 (TERMS OF EMPLOYMENT)
City Office REIT LPA 0 0 → 4 (real titles)
Blackstone PE LPA 0 0 → 1 (TABLE OF CONTENTS)

Confirmed conclusions kept: headerCharCount >= totalVisibleChars was the right invariant direction (just needed the body-text refinement); removing the long-paragraph bare-title return path is fine; Phase 2 (is_heading / heading_level) deferral still right.

Review artifacts:

  • prompt: /tmp/peer-review-pr178-1762720000.md
  • gemini: /tmp/gemini-pr178-review.txt
  • codex: /tmp/codex-pr178-review.txt

@stevenobiajulu stevenobiajulu merged commit 725f44d into main May 10, 2026
23 checks passed
@stevenobiajulu
Copy link
Copy Markdown
Member Author

Post-merge smoke passed

Merged: 725f44d
Built from: main @ 725f44d (origin/main HEAD after pull)
Smoke: npm install && npm run build && npm run lint:workspaces && npm test plus a feature-specific run of scripts/smoke_pr178.mjs against ~/Downloads fixtures.

Steps

  • install — ok
  • build — ok (workspaces: docx-core, docx-mcp, google-docs-core, safe-docx, safe-docx-mcpb)
  • lint — ok (1 pre-existing warning in edit.test.ts, unchanged)
  • tests — ok across all workspaces:
    • docx-core: 1218 passed + 1 skipped
    • docx-mcp: 724 passed
    • google-docs-core: 113 passed + 35 skipped
    • safe-docx-mcpb: 3 passed
  • feature smoke — ok. Re-ran heading-detection smoke against 6 ~/Downloads fixtures from main HEAD. Confirmed post-merge behavior matches pre-push:
    • [COMPANY] placeholder no longer classifies as title_caps_centered
    • Compensation & Benefits. (whole-bold + trailing space) no longer mis-classifies as run_in_header (count 5 → 4)
    • City Office REIT LPA picks up 4 real centered titles (AMENDED AND RESTATED, AGREEMENT OF LIMITED PARTNERSHIP, CITY OFFICE REIT OPERATING PARTNERSHIP, L.P., TABLE OF CONTENTS) previously missed
    • Blackstone PE LPA picks up TABLE OF CONTENTS previously missed

Log: /tmp/automerge-smoke-178-1778444682.log (local).

stevenobiajulu added a commit that referenced this pull request May 10, 2026
…#187) (#188)

* fix(docx-core): suppress non-sectional false-positive headings (#187)

Two cross-fixture residual error classes after PR #178:

1. **Table cells**: `extractHeaderInfo()` ran on every paragraph
   regardless of `tableContext`, even though `detectTitleCapsCentered()`
   already explicitly skipped table cells. Bonterms NDA was 13/13
   `title_bare` and 10/10 `title_with_colon` table-cell false positives
   (100%); Mutual NDA was 17/18; Common Paper NDA was 16/17. One-line
   fix adds the `&& !tableContext` gate to mirror the existing one on
   the centered-title path.

2. **Signature/field-label clusters**: NVCA SPA paragraphs 312–321
   (`COMPANY:` / `By:` / `Name:` / `Title:` / `Address:` / `Email:` /
   `PURCHASER:` / `By:` / `Name:` / `Title:`) are body-level (not in
   tables) and were each individually classified as `title_with_colon`
   or `title_bare`. New `suppressSignatureClusters()` post-pass scans
   the assembled `nodes` array for runs of ≥4 consecutive paragraphs
   where ≥75% match a signature-label regex (mixed-case `Name:` form
   OR all-caps `COMPANY: …` form), then nulls `header_text` /
   `header_style` / `header_formatting` on every node in the cluster.

Why these two specifically: a second peer review (Gemini + Codex CLI)
measured 80/251 `title_bare` (32%) and 22/56 `title_with_colon` (39%)
in-table false positives across 8 fixtures. None of PR #178's four
fixes added a `!tableContext` guard to `extractHeaderInfo`, and the
signature blocks live outside tables so the table fix alone wouldn't
catch them.

Cluster heuristic notes:
- No "near document end" anchor (Gemini: MSAs and amended docs have
  signature blocks throughout).
- Doesn't require all to be colon-suffixed (Codex: NVCA's
  `COMPANY: [Insert Name]` lacks colon-only suffix).
- ≥4-paragraph minimum prevents single-label false suppression.

Cross-fixture verification (8 fixtures):
- Bonterms NDA: 13/13 + 10/10 → 0/0 + 0/0
- Mutual NDA: 16/15 + 0/0 → 1/0 + 0/0
- Common Paper NDA: 15/14 + 0/0 → 1/0 + 0/0
- NVCA SPA signature block (paragraphs 312–321): all `header_style: null`
- ILPA LPAs (×2): unchanged at 20/0 + 0/0 — real section headings preserved

Tests: 4 new (2 branch-coverage with synthetic XML, 2 end-to-end MCP
JSON-shape against actual fixtures). Suites green: docx-core 1220 pass
+ 1 skip; docx-mcp 726 pass.

Out of scope (deferred):
- Placeholder/bracket rule for `title_bare` path (waits for footnote
  marker normalization first).
- Confidence tiering.

Peer review (Gemini + Codex) pending — see Step 8.

Closes #187.

* fix(docx-core): preserve adjacent headings + extend table gate to run-in detection

Two issues caught by Codex peer review on PR #188:

1. **HIGH: cluster post-pass erased real headings adjacent to signature
   blocks.** suppressSignatureClusters() suppressed every node covered
   by any qualifying ≥4-paragraph window with ≥75% match density,
   including nodes that themselves did NOT match a signature-label
   regex. Codex reproduced: 4 signature labels followed by a Heading1
   `Product` paragraph — the window [0..4] has 4/5 = 80% density, so
   the Heading1 was silently nulled. This is a production-shape failure
   because signature pages are routinely followed by the next section
   heading.

   Fix: only suppress nodes that themselves match a signature-label
   regex. The window is the density signal that authorizes suppression;
   adjacent non-matching nodes are preserved.

2. **NORMAL: detectRunInHeader still ran inside table cells**, despite
   SKILL.md saying heading detection is suppressed there. A table cell
   containing bold "Purpose:" plus body text would still emit
   `header_style: "run_in_header"`. Mirror the existing !tableContext
   gates already present on detectTitleCapsCentered and
   extractHeaderInfo.

Two new regression tests:
- All-caps `COMPANY:` / `PURCHASER:` cluster (NVCA-style mixed
  prefix + label rows, 9 paragraphs) all suppressed.
- Boundary-preservation: 4 signature labels + adjacent Heading1
  paragraph — labels suppressed, Heading1 preserved.

Suites green: docx-core 1222 pass + 1 skip (was 1220, +2 new tests),
docx-mcp 726 unchanged.

Ref: #187

* fix(docx-core): suppress dense signature blocks; preserve only Word-styled headings adjacent

Codex re-review found the previous fix was too lax. After narrowing
suppression to "only nodes that match the signature-label regex,"
common signature-block shapes like:

  ACME CORP
  By:
  Name:
  Title:

still leaked `ACME CORP` as `header_style: "title_bare"` (heuristic
short-paragraph fallback fires; the regex doesn't match all-caps
company names without a colon terminator).

The right rule is in between the two extremes:
- The original "any covering window" suppressed too much (erased
  adjacent Heading1 paragraphs).
- "Only matching nodes" suppressed too little (leaked company names).

Refined: inside a cluster window, suppress every node EXCEPT real
Word-styled headings (`paragraph_style_id` matching `/^Heading[1-6]$/`).
The density signal authorizes blanket suppression; only an explicit
Word heading style is strong enough to overrule it.

This handles both reproductions:
- `By:`/`Name:`/`Title:`/`Address:`/`Product[Heading1]` → labels
  suppressed, Heading1 preserved (Codex's first repro).
- `ACME CORP`/`By:`/`Name:`/`Title:` → ACME CORP suppressed (no Word
  style; only heuristic title_bare) (Codex's second repro).

New regression test asserts the company-name lead-in case.

Suites green: docx-core 1223 pass + 1 skip (was 1222, +1 test);
docx-mcp 726 unchanged.

Out of scope (deferred to a follow-up issue):
- Broader regex coverage for `INVESTOR 1:`, `PURCHASER(S):`,
  `BORROWER[S]:` (digits, parens, brackets). Both reviewers flagged
  these as nice-to-have; they don't block this fix.

Ref: #187

* fix(docx-core): narrow signature-cluster suppression to label-matching nodes

Address Gemini + Codex peer review of PR #188 (issue #187 follow-up).

Codex empirically reproduced the over-suppression bug: a window meeting
the 4-paragraph / 75%-label density gate also covers non-label
neighbors, and the previous code cleared `header_*` on every paragraph
in the window (saving only `/^Heading[1-6]$/`-styled paragraphs).
Empirically confirmed false negatives:
  - `[By:, Name:, Title:, Address:, Product]` — `Product` was cleared
    even though it doesn't match any signature-label regex.
  - `Heading7`, `Title`, `Subtitle`, `Article5L1` (all present in the
    fixture corpus) classified as `title_bare` when alone but were
    cleared when adjacent to a 4-label cluster.

Fix: in the final clear loop, only mutate paragraphs that themselves
satisfy `isSignatureClusterLabel`. The density gate still authorizes
suppression of labels inside the window; non-label neighbors keep
their detected heading metadata regardless of paragraph style. This
obsoletes the `Heading1-6` exact-id escape hatch — real headings are
preserved by virtue of not matching the label predicate.

Tests:
  - Rewrite the "ACME CORP lead-in" test to assert the corrected
    behavior: the three label paragraphs are suppressed; ACME CORP
    keeps its heuristic `title_bare` classification (it's not a label).
  - Add a regression test for non-label body neighbor (`Product` after
    four labels).
  - Add a regression test for custom-style headings (`Heading7`,
    `Title`, `Subtitle`, `Article5L1`) adjacent to a signature cluster.
  - Add a regression test locking in correct WHEREAS-recital handling
    (no false suppression).

Verification:
  - `npm -w @usejunior/docx-core test`: 1226 passed + 1 skipped (was
    1220 + 1 skipped — 6 net new test assertions).
  - `npm -w @usejunior/docx-mcp test`: 726 passed (unchanged).
  - `npm run lint:workspaces`: 0 errors (1 pre-existing warning).
  - Cross-fixture spot check: in-table residual count remains 0 across
    all 7 fixtures. Some legitimate non-cluster headings now correctly
    preserved (Letter of Intent went from 6 → 11 residuals, all
    `Heading1`/`Heading2`-styled real section headings).

Refs: peer-review feedback at
  #188 (comment)
stevenobiajulu added a commit that referenced this pull request May 11, 2026
#179) (#190)

* feat(docx-core): add derived heading object to DocumentViewNode (#179)

Phase 2 of #157: expose a sparse top-level `heading` field on every
paragraph node, locking the precedence rule into the wire schema so
consumers stop re-implementing it.

Why now (after #157 / #178 / #188 detector hardening): the underlying
signals are now stable enough to bless. Without this, every consumer
re-implements the 5-step precedence ladder from `paragraph_style_id` +
`list_metadata.header_style`, and every new heuristic added to the
detector silently breaks every consumer's hardcoded `if/else` (as just
happened with `title_caps_centered` in #178). Centralizing the verdict
in the schema makes future detector changes non-breaking for downstream
code.

Schema (sparse — field omitted entirely for body paragraphs):

  heading?: {
    text: string;
    source: 'word_style'
          | 'run_in_header'
          | 'title_with_period'
          | 'title_with_colon'
          | 'title_caps_centered'
          | 'title_bare';
    level: number | null;   // 1..6 only when source === 'word_style'
  }

Derivation rules (locked in #179):
- `level` is set to 1..6 ONLY when `paragraph_style_id` matches
  /^Heading([1-6])$/ exactly. Style inheritance does NOT count —
  `HeadingPara1` / `HeadingPara2` styles in the NVCA fixture inherit
  from heading styles but behave like body paragraphs (`next: Normal`),
  so they MUST NOT produce a level.
- `level` is `null` for all heuristic sources.
- Word built-in heading wins over heuristics when both signals fire.
- Field is OMITTED (not `null`) for body paragraphs — keeps payload
  small and gives a clean jq idiom: `.[] | select(.heading)`.

Cross-renderer: the Google Docs renderer
(packages/google-docs-core/src/document-view.ts) is structurally cast
to DocumentViewNode at packages/docx-mcp/src/tools/gdocs/read_file.ts.
Mirrored the derivation there but kept it Word-style-only (Google Docs
has built-in heading styles; heuristic detectors do not run on the
GDocs path).

Bonus typing improvement: narrowed `header_style` from `string | null`
to `HeuristicHeadingSource | null` (an exported type union) so consumers
get autocomplete on the enum values.

Tests: 8 new tests (7 in `document_view.branch.test.ts`, 1 e2e in
`nvca_spa_regression.test.ts`):
- Heading1 → `{source: 'word_style', level: 1}`
- Heading6 → `{source: 'word_style', level: 6}`
- HeadingPara1 (basedOn Heading1) → no `heading` key
- title_caps_centered → `{source: 'title_caps_centered', level: null}`
- run_in_header → `{source: 'run_in_header', level: null}`
- Body paragraph → no `heading` key
- Precedence: Heading2 + title_with_colon → `{source: 'word_style', level: 2}`
- E2E: NVCA SPA title → `{source: 'title_caps_centered', level: null}`

Suites: docx-core 1224 pass + 1 skip (was 1220, +4 net new tests after
the helper-typing test churn). docx-mcp 733 pass; 1 pre-existing
failure in `read_file_comments.test.ts:1028` unchanged from main —
unrelated to this PR (introduced by PR #180's comments-namespace work).

Documentation: SKILL.md and packages/docx-mcp/README.md updated to
recommend `node.heading != null` as the canonical "is this a heading"
check; raw `header_style` becomes the per-detector explanation layer.
GDocs asymmetry documented explicitly.

Peer review (Gemini + Codex) pending — see Step 8.

Closes #179.

* fix(docx-core): suppress heuristic heading promotion in table cells (peer review #190)

Both Gemini and Codex independently flagged that heuristic detectors
(run_in_header, title_with_period, title_with_colon, title_bare) still
fired inside table cells, so the new canonical predicate
`node.heading != null` was promoting ordinary label/value cell content
("Notice Address:", "Closing.", "Indemnification. ...") into structural
headings. word_style still fires inside cells (legitimate Word
Heading1..6 inside a table is still a heading).

Other peer-review follow-ups:
- Document HeadingValue.text semantic split (word_style = whole para,
  heuristic sources = extracted prefix) in JSDoc.
- Export HeuristicHeadingSource so external consumers can name the
  narrowed ListMetadata.header_style union.
- Fix CachedParagraph.paragraphId JSDoc — it stores namedStyleType
  (HEADING_1, NORMAL_TEXT), not the Google native paragraph ID.
- Add GDocs renderer tests covering HEADING_n normalization,
  NORMAL_TEXT/TITLE/SUBTITLE rejection, and table-cell word_style.
- Add docx-core regression tests for title_with_period, title_bare,
  table-cell suppression, and word_style-in-cell still firing.

Test deltas: docx-core 1224 → 1227, google-docs-core 113 → 116.
Pre-existing read_file_comments.test.ts:1028 failure unchanged.
@stevenobiajulu stevenobiajulu deleted the 157-heading-detection-20260505 branch May 27, 2026 17:07
This was referenced Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant