fix: make markdown parser robust by saccharin98 · Pull Request #246 · VectifyAI/PageIndex

saccharin98 · 2026-04-24T06:59:35Z

This PR closes #245

saccharin98 · 2026-04-24T07:15:14Z

Review notes after testing the branch locally (69 passed, 2 skipped):

The switch to markdown-it-py fixes the main CommonMark cases from #245: preamble/headerless docs, setext headings, tilde/long fences, and ATX closing hashes all look covered by tests. I found two follow-up issues worth fixing before merge:

pageindex/parser/markdown.py:27 strips frontmatter before extracting headings, but the parser never carries the removed line offset forward. That changes ContentNode.index / structure line_num from original-file line numbers to line numbers in the stripped buffer. For example, a file with 3 lines of frontmatter, a blank line, then # Intro now returns index=2 instead of the original line 5. The old parser exposed original Markdown line numbers, and downstream storage uses n.index as the page/line identifier, so this is a user-visible regression. Please preserve the frontmatter offset and add it back when assigning heading indexes.
_strip_frontmatter() treats any file that starts with --- and later has another --- line as YAML frontmatter. That can silently delete valid Markdown content/thematic-break based documents, e.g. ---\n# Intended heading\n---\n... becomes metadata and the heading disappears. Since pyyaml is already a dependency, it would be safer to validate the fenced block as frontmatter and add a regression test for a real Markdown document that starts with a thematic break.

Once those two are addressed, the overall direction looks good to me.

rejojer · 2026-04-24T09:05:48Z

Code review

Found 2 issues:

_strip_frontmatter returns lines[i + 1:] without carrying the stripped-line offset forward. _extract_headers then enumerates the trimmed buffer starting at line_num=1, and _build_nodes assigns that to ContentNode.index. So for a file with N lines of frontmatter, every downstream node.index is shifted by N relative to the original file. This is user-visible: pageindex/index/pipeline.py build_tree_from_levels maps node.index to "line_num", and get_md_page_content uses node.index for range-based content lookup — both now return positions in the stripped buffer, not the real file. (The PR author already flagged this same point in an earlier self-review note.)

PageIndex/pageindex/parser/markdown.py

Lines 36 to 52 in 50aab66

    
               @staticmethod 
        
               def _strip_frontmatter(lines: list[str]) -> tuple[list[str], dict | None]: 
        
                   """Strip YAML frontmatter (--- delimited) from the beginning of the file. 
        
                   Returns the remaining lines and raw frontmatter as metadata. 
        
                   """ 
        
                   if not lines or not _FRONTMATTER_FENCE.match(lines[0]): 
        
                       return lines, None 
        
                   for i in range(1, len(lines)): 
        
                       if _FRONTMATTER_FENCE.match(lines[i]): 
        
                           raw = "\n".join(lines[1:i]) 
        
                           remaining = lines[i + 1:] 
        
                           return remaining, {"frontmatter": raw} 
        
                   # No closing fence found — not valid frontmatter, return as-is 
        
                   return lines, None

Suggested fix: have _strip_frontmatter also return the stripped line count, and add it to each header["line_num"] (or to ContentNode.index) before returning.

Setext header detection accepts any non-empty, non-fence, non-ATX previous line as a valid title. It does not reject list items (- item, * item, 1. item), blockquotes (> ...), or table rows. Per CommonMark, setext headers require a paragraph preceding the underline. So a list like - item\n---\n will be promoted to an H2 header with title - item, silently reshaping the document tree.

PageIndex/pageindex/parser/markdown.py

Lines 91 to 109 in 50aab66

    
           # Setext headers: underline on next line detected by looking back 
        
           # We check if *this* line is an underline and the previous line is text 
        
           if line_num >= 2: 
        
               prev = lines[line_num - 2].strip()  # previous line (0-indexed) 
        
               if prev and not _FENCE_OPEN.match(prev) and not _ATX_HEADER.match(prev): 
        
                   if _SETEXT_H1.match(stripped): 
        
                       headers.append({ 
        
                           "title": prev, 
        
                           "level": 1, 
        
                           "line_num": line_num - 1, 
        
                       }) 
        
                       continue 
        
                   if _SETEXT_H2.match(stripped): 
        
                       headers.append({ 
        
                           "title": prev, 
        
                           "level": 2, 
        
                           "line_num": line_num - 1, 
        
                       }) 
        
                       continue

Suggested fix: additionally reject prev lines that start with list markers (-, *, +, <digits>., <digits>)) or blockquote (>), or verify prev looks like a paragraph.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

saccharin98 · 2026-04-24T09:30:31Z

Review notes after testing the latest commit (a349946) locally: 75 passed, 2 skipped.

The prior line-offset and setext non-paragraph issues look addressed. I found two remaining Markdown parsing gaps worth fixing before merge:

pageindex/parser/markdown.py:76-88 treats any line starting with the fence marker as a closing fence while already inside a fence. In CommonMark, a closing fence cannot have non-space trailing characters. Repro:

# Before

```
```not a closing fence
# Still code
```

# After

The parser currently returns titles Before and Still code; After is swallowed into the fake section. This is close to the original fence robustness issue in #245, just with an info-string-like line inside the code block. Please require the closing fence line to be only the marker plus spaces/tabs, and add this regression test.

pageindex/parser/markdown.py:10-11 only recognizes setext underlines with three or more =/- characters. Valid Markdown allows one or more marker characters, so Title\n-\n and Title\n=\n currently become a headerless content node instead of an H2/H1 section. Please relax those regexes and cover the one-character cases.

The rest of the PR direction looks good.

saccharin98 · 2026-04-24T09:45:31Z

Follow-up review after the latest commit (c244249): local tests pass with 80 passed, 2 skipped.

No blocking issues found in the latest changes. I rechecked the indented-code and fence cases:

4-space/tab-indented # ... lines are now ignored as code.
3-space-indented headings still parse.
3-space-indented fences still suppress fake headings inside.
Fence closing with trailing spaces works.
Fence-like lines with trailing text stay inside the code block.

One low-priority compatibility gap remains: valid empty ATX headings like # or ## are still treated as plain/headerless content instead of heading nodes. That is a CommonMark edge case, but I would not consider it blocking for this PR.

fix: make markdown parser robust

50aab66

saccharin98 force-pushed the markdown_fix branch from 91b1b25 to 50aab66 Compare April 24, 2026 08:50

fix: line index offset issue and setext non-header prefix

a349946

saccharin98 added 2 commits April 24, 2026 17:38

fix: handle markdown fence and setext edge cases

3dbc5bb

fix: ignore indented markdown code headings

c244249

octo-patch mentioned this pull request Apr 26, 2026

fix: fix markdown parser edge cases in page_index_md.py #249

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make markdown parser robust#246

fix: make markdown parser robust#246
saccharin98 wants to merge 4 commits intoVectifyAI:devfrom
saccharin98:markdown_fix

saccharin98 commented Apr 24, 2026

Uh oh!

saccharin98 commented Apr 24, 2026

Uh oh!

rejojer commented Apr 24, 2026 •

edited

Loading

Uh oh!

saccharin98 commented Apr 24, 2026

Uh oh!

saccharin98 commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

saccharin98 commented Apr 24, 2026

Uh oh!

saccharin98 commented Apr 24, 2026

Uh oh!

rejojer commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

saccharin98 commented Apr 24, 2026

Uh oh!

saccharin98 commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rejojer commented Apr 24, 2026 •

edited

Loading