Skip to content

fix: make markdown parser robust#246

Open
saccharin98 wants to merge 4 commits intoVectifyAI:devfrom
saccharin98:markdown_fix
Open

fix: make markdown parser robust#246
saccharin98 wants to merge 4 commits intoVectifyAI:devfrom
saccharin98:markdown_fix

Conversation

@saccharin98
Copy link
Copy Markdown
Collaborator

This PR closes #245

@saccharin98
Copy link
Copy Markdown
Collaborator Author

Review notes after testing the branch locally (69 passed, 2 skipped):

The switch to markdown-it-py fixes the main CommonMark cases from #245: preamble/headerless docs, setext headings, tilde/long fences, and ATX closing hashes all look covered by tests. I found two follow-up issues worth fixing before merge:

  1. pageindex/parser/markdown.py:27 strips frontmatter before extracting headings, but the parser never carries the removed line offset forward. That changes ContentNode.index / structure line_num from original-file line numbers to line numbers in the stripped buffer. For example, a file with 3 lines of frontmatter, a blank line, then # Intro now returns index=2 instead of the original line 5. The old parser exposed original Markdown line numbers, and downstream storage uses n.index as the page/line identifier, so this is a user-visible regression. Please preserve the frontmatter offset and add it back when assigning heading indexes.

  2. _strip_frontmatter() treats any file that starts with --- and later has another --- line as YAML frontmatter. That can silently delete valid Markdown content/thematic-break based documents, e.g. ---\n# Intended heading\n---\n... becomes metadata and the heading disappears. Since pyyaml is already a dependency, it would be safer to validate the fenced block as frontmatter and add a regression test for a real Markdown document that starts with a thematic break.

Once those two are addressed, the overall direction looks good to me.

@rejojer
Copy link
Copy Markdown
Member

rejojer commented Apr 24, 2026

Code review

Found 2 issues:

  1. _strip_frontmatter returns lines[i + 1:] without carrying the stripped-line offset forward. _extract_headers then enumerates the trimmed buffer starting at line_num=1, and _build_nodes assigns that to ContentNode.index. So for a file with N lines of frontmatter, every downstream node.index is shifted by N relative to the original file. This is user-visible: pageindex/index/pipeline.py build_tree_from_levels maps node.index to "line_num", and get_md_page_content uses node.index for range-based content lookup — both now return positions in the stripped buffer, not the real file. (The PR author already flagged this same point in an earlier self-review note.)

@staticmethod
def _strip_frontmatter(lines: list[str]) -> tuple[list[str], dict | None]:
"""Strip YAML frontmatter (--- delimited) from the beginning of the file.
Returns the remaining lines and raw frontmatter as metadata.
"""
if not lines or not _FRONTMATTER_FENCE.match(lines[0]):
return lines, None
for i in range(1, len(lines)):
if _FRONTMATTER_FENCE.match(lines[i]):
raw = "\n".join(lines[1:i])
remaining = lines[i + 1:]
return remaining, {"frontmatter": raw}
# No closing fence found — not valid frontmatter, return as-is
return lines, None

Suggested fix: have _strip_frontmatter also return the stripped line count, and add it to each header["line_num"] (or to ContentNode.index) before returning.

  1. Setext header detection accepts any non-empty, non-fence, non-ATX previous line as a valid title. It does not reject list items (- item, * item, 1. item), blockquotes (> ...), or table rows. Per CommonMark, setext headers require a paragraph preceding the underline. So a list like - item\n---\n will be promoted to an H2 header with title - item, silently reshaping the document tree.

# Setext headers: underline on next line detected by looking back
# We check if *this* line is an underline and the previous line is text
if line_num >= 2:
prev = lines[line_num - 2].strip() # previous line (0-indexed)
if prev and not _FENCE_OPEN.match(prev) and not _ATX_HEADER.match(prev):
if _SETEXT_H1.match(stripped):
headers.append({
"title": prev,
"level": 1,
"line_num": line_num - 1,
})
continue
if _SETEXT_H2.match(stripped):
headers.append({
"title": prev,
"level": 2,
"line_num": line_num - 1,
})
continue

Suggested fix: additionally reject prev lines that start with list markers (-, *, +, <digits>., <digits>)) or blockquote (>), or verify prev looks like a paragraph.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@saccharin98
Copy link
Copy Markdown
Collaborator Author

Review notes after testing the latest commit (a349946) locally: 75 passed, 2 skipped.

The prior line-offset and setext non-paragraph issues look addressed. I found two remaining Markdown parsing gaps worth fixing before merge:

  1. pageindex/parser/markdown.py:76-88 treats any line starting with the fence marker as a closing fence while already inside a fence. In CommonMark, a closing fence cannot have non-space trailing characters. Repro:
# Before

```
```not a closing fence
# Still code
```

# After

The parser currently returns titles Before and Still code; After is swallowed into the fake section. This is close to the original fence robustness issue in #245, just with an info-string-like line inside the code block. Please require the closing fence line to be only the marker plus spaces/tabs, and add this regression test.

  1. pageindex/parser/markdown.py:10-11 only recognizes setext underlines with three or more =/- characters. Valid Markdown allows one or more marker characters, so Title\n-\n and Title\n=\n currently become a headerless content node instead of an H2/H1 section. Please relax those regexes and cover the one-character cases.

The rest of the PR direction looks good.

@saccharin98
Copy link
Copy Markdown
Collaborator Author

Follow-up review after the latest commit (c244249): local tests pass with 80 passed, 2 skipped.

No blocking issues found in the latest changes. I rechecked the indented-code and fence cases:

  • 4-space/tab-indented # ... lines are now ignored as code.
  • 3-space-indented headings still parse.
  • 3-space-indented fences still suppress fake headings inside.
  • Fence closing with trailing spaces works.
  • Fence-like lines with trailing text stay inside the code block.

One low-priority compatibility gap remains: valid empty ATX headings like # or ## are still treated as plain/headerless content instead of heading nodes. That is a CommonMark edge case, but I would not consider it blocking for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants