Skip to content

fix: fix markdown parser edge cases in page_index_md.py#249

Open
octo-patch wants to merge 1 commit intoVectifyAI:mainfrom
octo-patch:fix/issue-245-markdown-parser-edge-cases
Open

fix: fix markdown parser edge cases in page_index_md.py#249
octo-patch wants to merge 1 commit intoVectifyAI:mainfrom
octo-patch:fix/issue-245-markdown-parser-edge-cases

Conversation

@octo-patch
Copy link
Copy Markdown

Fixes #245

Problem

The extract_nodes_from_markdown function in pageindex/page_index_md.py has three categories of bugs that cause incorrect or missing content when parsing common Markdown patterns:

  1. ATX closing hashes included in title## My Section ## yields title "My Section ##" instead of "My Section".
  2. Tilde code fences and mismatched fence lengths not handled~~~python blocks are not recognized as code fences, so any # comment inside a tilde block is falsely detected as a header. Similarly, a 3-backtick sequence incorrectly closes a 4-backtick fence.
  3. Headerless documents produce zero nodes — if a Markdown file has no ATX headers, the entire document is silently discarded.

Solution

1. ATX closing hashes

Updated the header regex from r'^(#{1,6})\s+(.+)$' to r'^(#{1,6})\s+(.+?)(?:\s+#+\s*)?$' to strip optional trailing hash sequences.

2. Tilde fences + mismatched lengths

Replaced the simple backtick toggle with proper fence tracking:

  • Extended the pattern to r'^({3,}|~{3,})'to match both `` ``` `` and~~~` with 3+ characters.
  • Records the opening fence character (\`` or ~`) and length; a closing fence must use the same character and be at least as long as the opener.

3. Headerless documents

In md_to_tree, after extract_node_text_content returns an empty list, a single root node is created containing the full document text, preventing silent data loss.

Testing

All three cases verified manually:

Case 5 (Tilde fences): PASS  ['Header', 'Real Header']
Case 6 (Mismatched fence lengths): PASS  ['Before', 'After']
Case 7 (ATX closing hashes): PASS  My Section
Case 2 (Headerless file): PASS  1 node returned
Standard ATX headers: PASS
Mixed fence types (~~~, ```): PASS

Note: This PR fixes the same cases described in #245 but targets page_index_md.py in the main branch. PR #246 addresses the same issues in the new pageindex/parser/markdown.py module on the dev branch — these two fixes are complementary and non-conflicting.

…yAI#245)

Three robustness fixes for extract_nodes_from_markdown and md_to_tree:

1. ATX closing hashes: update header regex to strip trailing ## markers
   so "## My Section ##" yields title "My Section" instead of "My Section ##"

2. Tilde + mismatched fence lengths: replace the simple backtick-only
   toggle with proper fence tracking that records the opening fence
   character ('`' or '~') and length. A closing fence must use the same
   character and be at least as long as the opening fence, so:
   - ~~~python...~~~ correctly suppresses headers inside tilde blocks
   - ````...``` does NOT close a 4-backtick block (3 < 4)

3. Headerless documents: when no headers are found, md_to_tree now
   creates a single root node containing the full document text instead
   of returning an empty structure, preventing silent data loss.

Co-Authored-By: Octopus <liyuan851277048@icloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Markdown parser fails on common edge cases

1 participant