fix: fix markdown parser edge cases in page_index_md.py by octo-patch · Pull Request #249 · VectifyAI/PageIndex

octo-patch · 2026-04-26T03:19:41Z

Fixes #245

Problem

The extract_nodes_from_markdown function in pageindex/page_index_md.py has three categories of bugs that cause incorrect or missing content when parsing common Markdown patterns:

ATX closing hashes included in title — ## My Section ## yields title "My Section ##" instead of "My Section".
Tilde code fences and mismatched fence lengths not handled — ~~~python blocks are not recognized as code fences, so any # comment inside a tilde block is falsely detected as a header. Similarly, a 3-backtick sequence incorrectly closes a 4-backtick fence.
Headerless documents produce zero nodes — if a Markdown file has no ATX headers, the entire document is silently discarded.

Solution

1. ATX closing hashes

Updated the header regex from r'^(#{1,6})\s+(.+)$' to r'^(#{1,6})\s+(.+?)(?:\s+#+\s*)?$' to strip optional trailing hash sequences.

2. Tilde fences + mismatched lengths

Replaced the simple backtick toggle with proper fence tracking:

Extended the pattern to r'^({3,}|~{3,})'to match both `` ``` `` and~~~` with 3+ characters.
Records the opening fence character (\`` or ~`) and length; a closing fence must use the same character and be at least as long as the opener.

3. Headerless documents

In md_to_tree, after extract_node_text_content returns an empty list, a single root node is created containing the full document text, preventing silent data loss.

Testing

All three cases verified manually:

Case 5 (Tilde fences): PASS  ['Header', 'Real Header']
Case 6 (Mismatched fence lengths): PASS  ['Before', 'After']
Case 7 (ATX closing hashes): PASS  My Section
Case 2 (Headerless file): PASS  1 node returned
Standard ATX headers: PASS
Mixed fence types (~~~, ```): PASS

Note: This PR fixes the same cases described in #245 but targets page_index_md.py in the main branch. PR #246 addresses the same issues in the new pageindex/parser/markdown.py module on the dev branch — these two fixes are complementary and non-conflicting.

…yAI#245) Three robustness fixes for extract_nodes_from_markdown and md_to_tree: 1. ATX closing hashes: update header regex to strip trailing ## markers so "## My Section ##" yields title "My Section" instead of "My Section ##" 2. Tilde + mismatched fence lengths: replace the simple backtick-only toggle with proper fence tracking that records the opening fence character ('`' or '~') and length. A closing fence must use the same character and be at least as long as the opening fence, so: - ~~~python...~~~ correctly suppresses headers inside tilde blocks - ````...``` does NOT close a 4-backtick block (3 < 4) 3. Headerless documents: when no headers are found, md_to_tree now creates a single root node containing the full document text instead of returning an empty structure, preventing silent data loss. Co-Authored-By: Octopus <liyuan851277048@icloud.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fix markdown parser edge cases in page_index_md.py#249

fix: fix markdown parser edge cases in page_index_md.py#249
octo-patch wants to merge 1 commit intoVectifyAI:mainfrom
octo-patch:fix/issue-245-markdown-parser-edge-cases

octo-patch commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

octo-patch commented Apr 26, 2026

Problem

Solution

1. ATX closing hashes

2. Tilde fences + mismatched lengths

3. Headerless documents

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant