## Description

The current `MarkdownParser` in `pageindex/parser/markdown.py` has several robustness issues. It silently drops content or produces incorrect results on valid Markdown inputs that are common in real-world documents.
All cases below have been verified against the old parser — 6 out of 7 confirmed failing.
## Failing Test Cases
### 1. Content before the first header is silently dropped

```
This is an introduction paragraph.
Some important context here.
# Chapter 1
Body text.
```

Expected: The preamble text should be captured as a node.
Actual: `_build_nodes` only iterates over headers, so everything before `# Chapter 1` is lost. (1 node returned, preamble missing)
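One way to close this gap (a sketch only, not the shipped fix; `collect_preamble` is a hypothetical helper) is to capture everything that precedes the first ATX header as its own untitled node:

```python
import re

# Three or fewer leading spaces stripped by the caller; 1-6 hashes then a space
ATX_HEADER = re.compile(r"^#{1,6}\s+\S")

def collect_preamble(lines):
    """Return the text preceding the first ATX header, or None if the
    document opens with a header (hypothetical helper, for illustration)."""
    for i, line in enumerate(lines):
        if ATX_HEADER.match(line.strip()):
            preamble = "\n".join(lines[:i]).strip()
            return preamble or None
    # No header anywhere: the whole document counts as preamble
    return "\n".join(lines).strip() or None
```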
### 2. Headerless Markdown files produce zero nodes

```
Just some plain text document
with multiple lines of content.
No headers at all.
```

Expected: At least one node containing the full content.
Actual: `_extract_headers` returns `[]`, `_build_nodes` returns `[]` — the entire document is silently discarded.
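A defensive guard for this case (sketch only; the node shape here is a simplified dict, not the real `ContentNode`) is to emit a single whole-document node whenever header extraction comes back empty:

```python
def build_nodes_safe(headers, lines):
    """Never return [] for a non-empty document: fall back to one
    untitled node holding the full text (illustrative sketch)."""
    if not headers:
        text = "\n".join(lines).strip()
        return [{"title": None, "level": None, "content": text}] if text else []
    # Normal path: one node per header (content slicing omitted here)
    return [{"title": h["title"], "level": h["level"], "content": ""} for h in headers]
```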
### 3. YAML frontmatter — not a separate bug

```
---
title: My Document
author: Alice
---
# Introduction
Hello world.
```

Verified: The frontmatter does NOT leak into node content — it is silently dropped as part of the preamble (same root cause as Case 1). However, the new parser now properly extracts frontmatter into `metadata`, which is the correct behavior.
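Extracting the frontmatter rather than discarding it can be as simple as splitting on the delimiter lines before header detection runs. A minimal sketch (a real implementation would hand the extracted block to a YAML parser):

```python
def split_frontmatter(text):
    """Split leading '---'-delimited frontmatter from a Markdown document.
    Returns (frontmatter_or_None, body). Illustrative sketch only."""
    lines = text.split("\n")
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return None, text
```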
### 4. Setext-style headers are not recognized

```
Main Title
==========
Some content here.
Sub Section
-----------
More content.
```

Expected: Two nodes — H1 "Main Title" and H2 "Sub Section".
Actual: Zero headers detected, zero nodes returned. Entire document lost.
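Recognizing setext headers needs a one-line lookbehind: an `=` or `-` underline directly below a non-blank text line promotes that line to H1 or H2. A minimal sketch (the function name is illustrative, and it ignores edge cases such as `---` doubling as a thematic break):

```python
import re

def setext_level(prev_line, underline):
    """Return 1 for an '===' underline under text, 2 for '---', else None.
    prev_line must be non-blank for the underline to count (sketch)."""
    if not prev_line.strip():
        return None
    u = underline.strip()
    if re.fullmatch(r"=+", u):
        return 1
    if re.fullmatch(r"-+", u):
        return 2
    return None
```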
### 5. Tilde code fences are not tracked

```
# Header
~~~python
# This is a comment, not a header
def foo():
    pass
~~~
# Real Header
```

Expected: 2 nodes: "Header" and "Real Header".
Actual: 3 nodes — the comment `# This is a comment, not a header` inside the tilde fence is incorrectly detected as a header.
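The root cause is a fence pattern that only matches backtick fences. Widening the test to cover both CommonMark fence characters is a one-line regex (sketch; `is_fence_line` is an illustrative name):

```python
import re

# CommonMark code fences open with three or more backticks OR tildes
FENCE_LINE = re.compile(r"^(`{3,}|~{3,})")

def is_fence_line(line):
    """True for lines that open or close a backtick or tilde code fence."""
    return bool(FENCE_LINE.match(line.strip()))
```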
### 6. Mismatched fence lengths cause incorrect header detection

Example input (a 4-backtick fence containing 3-backtick lines):

`````
# Before
````python
# Inside 4-backtick fence
```
# Still inside — 3 backticks can't close a 4-backtick fence
```
````
# After
`````

Expected: 2 headers: "Before" and "After".
Actual: 3 headers detected: "Before", "Still inside", "After". The parser uses a simple toggle on any `` ``` `` line, so it incorrectly "closes" the 4-backtick fence at the first 3-backtick line.
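Fixing this requires remembering the opening fence's character and length instead of toggling a boolean: per CommonMark, a closing fence must use the same character and be at least as long as the opener. A sketch of that state machine (names are illustrative):

```python
import re

FENCE = re.compile(r"^(`{3,}|~{3,})")

def mark_fenced_lines(lines):
    """Yield (line, inside_fence) pairs. A fence only closes on a run of
    the SAME character at least as long as the one that opened it."""
    open_run = None  # e.g. "````" while inside a 4-backtick fence
    for line in lines:
        m = FENCE.match(line.strip())
        if m:
            run = m.group(1)
            if open_run is None:
                open_run = run  # opening fence
                yield line, True
                continue
            if run[0] == open_run[0] and len(run) >= len(open_run):
                open_run = None  # valid closing fence
                yield line, True
                continue
            # shorter or mismatched run: plain content inside the fence
        yield line, open_run is not None
```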
### 7. ATX closing hashes are included in title

```
## My Section ##
Content here.
```

Expected: Title is "My Section".
Actual: Title is "My Section ##" — the trailing `##` is not stripped.
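Stripping the optional closing sequence is a small regex applied to the captured title (sketch; per CommonMark, the closing hashes only count as a closing sequence when preceded by whitespace, so hashes like the one in "C#" are kept):

```python
import re

def strip_closing_hashes(title):
    """'My Section ##' -> 'My Section'. Hashes not preceded by a space
    are part of the title and are left alone (illustrative sketch)."""
    return re.sub(r"\s+#+\s*$", "", title).rstrip()
```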
## Reproduction Script

Self-contained script that inlines the old parser and runs all 7 cases. Save as `test_old_parser.py` and run with `python test_old_parser.py`.
"""Test the OLD markdown parser against the 7 edge cases from issue #245."""
import re
import sys
import tempfile
from pathlib import Path
from dataclasses import dataclass
# --- Inline the old parser so we don't need to mess with imports ---
@dataclass
class ContentNode:
content: str
tokens: int
title: str | None = None
index: int | None = None
level: int | None = None
images: list | None = None
@dataclass
class ParsedDocument:
doc_name: str
nodes: list
metadata: dict | None = None
def count_tokens(text, model=None):
return len(text.split()) # cheap approximation for testing
class OldMarkdownParser:
"""Exact copy of the parser BEFORE the fix (commit 6d29886)."""
def supported_extensions(self):
return [".md", ".markdown"]
def parse(self, file_path, **kwargs):
path = Path(file_path)
model = kwargs.get("model")
with open(path, "r", encoding="utf-8") as f:
content = f.read()
lines = content.split("\n")
headers = self._extract_headers(lines)
nodes = self._build_nodes(headers, lines, model)
return ParsedDocument(doc_name=path.stem, nodes=nodes)
def _extract_headers(self, lines):
header_pattern = r"^(#{1,6})\s+(.+)$"
code_block_pattern = r"^```"
headers = []
in_code_block = False
for line_num, line in enumerate(lines, 1):
stripped = line.strip()
if re.match(code_block_pattern, stripped):
in_code_block = not in_code_block
continue
if not in_code_block and stripped:
match = re.match(header_pattern, stripped)
if match:
headers.append({
"title": match.group(2).strip(),
"level": len(match.group(1)),
"line_num": line_num,
})
return headers
def _build_nodes(self, headers, lines, model):
nodes = []
for i, header in enumerate(headers):
start = header["line_num"] - 1
end = headers[i + 1]["line_num"] - 1 if i + 1 < len(headers) else len(lines)
text = "\n".join(lines[start:end]).strip()
tokens = count_tokens(text, model=model)
nodes.append(ContentNode(
content=text,
tokens=tokens,
title=header["title"],
index=header["line_num"],
level=header["level"],
))
return nodes
# --- Helper ---
def parse_string(parser, content, name="test"):
with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as f:
f.write(content)
f.flush()
return parser.parse(f.name)
# --- Tests ---
parser = OldMarkdownParser()
failures = []
passes = []
# Case 1: Preamble before first header
result = parse_string(parser, """This is an introduction paragraph.
Some important context here.
# Chapter 1
Body text.
""")
preamble_found = any("introduction" in n.content for n in result.nodes)
if preamble_found:
passes.append("Case 1: Preamble")
else:
failures.append(
"Case 1: Preamble before first header -> DROPPED "
"(got {} nodes, none contain 'introduction')".format(len(result.nodes))
)
# Case 2: Headerless file
result = parse_string(parser, """Just some plain text document
with multiple lines of content.
No headers at all.
""")
if len(result.nodes) > 0:
passes.append("Case 2: Headerless file")
else:
failures.append("Case 2: Headerless file -> 0 nodes returned, entire document LOST")
# Case 3: YAML frontmatter
result = parse_string(parser, """---
title: My Document
author: Alice
tags: [python, markdown]
---
# Introduction
Hello world.
""")
frontmatter_in_content = any("title: My Document" in n.content for n in result.nodes)
if not frontmatter_in_content:
passes.append("Case 3: Frontmatter (not a separate bug, dropped with preamble)")
else:
failures.append("Case 3: YAML frontmatter -> leaked into node content")
# Case 4: Setext headers
result = parse_string(parser, """Main Title
==========
Some content here.
Sub Section
-----------
More content.
""")
titles = [n.title for n in result.nodes if n.title]
if "Main Title" in titles and "Sub Section" in titles:
passes.append("Case 4: Setext headers")
else:
failures.append(
"Case 4: Setext headers -> not recognized "
"(got titles: {}, {} nodes)".format(titles, len(result.nodes))
)
# Case 5: Tilde fences
result = parse_string(parser, """# Header
~~~python
# This is a comment, not a header
def foo():
pass
~~~
# Real Header
""")
titles = [n.title for n in result.nodes if n.title]
if "This is a comment, not a header" not in titles:
passes.append("Case 5: Tilde fences")
else:
failures.append(
"Case 5: Tilde fences -> comment inside ~~~ detected as header "
"(got titles: {})".format(titles)
)
# Case 6: Mismatched fence lengths
content = """# Before
````python
# Inside 4-backtick fence
```
# Still inside
```
````
# After
"""
result = parse_string(parser, content)
titles = [n.title for n in result.nodes if n.title]
has_fake = "Inside 4-backtick fence" in titles or "Still inside" in titles
if not has_fake:
passes.append("Case 6: Mismatched fence lengths")
else:
failures.append(
"Case 6: Mismatched fence lengths -> fake headers detected "
"inside code block (got titles: {})".format(titles)
)
# Case 7: ATX closing hashes
result = parse_string(parser, """## My Section ##
Content here.
""")
if result.nodes and result.nodes[0].title == "My Section":
passes.append("Case 7: ATX closing hashes")
else:
actual_title = result.nodes[0].title if result.nodes else "N/A"
failures.append(
"Case 7: ATX closing hashes -> title is '{}' "
"instead of 'My Section'".format(actual_title)
)
# --- Report ---
print("=" * 60)
print("OLD PARSER EDGE CASE VERIFICATION")
print("=" * 60)
print()
for p in passes:
print(f" PASS {p}")
for f in failures:
print(f" FAIL {f}")
print()
print(f"Result: {len(failures)} failed, {len(passes)} passed out of 7")
sys.exit(1 if failures else 0)
Output:
```
============================================================
OLD PARSER EDGE CASE VERIFICATION
============================================================

  PASS Case 3: Frontmatter (not a separate bug, dropped with preamble)
  FAIL Case 1: Preamble before first header -> DROPPED (got 1 nodes, none contain 'introduction')
  FAIL Case 2: Headerless file -> 0 nodes returned, entire document LOST
  FAIL Case 4: Setext headers -> not recognized (got titles: [], 0 nodes)
  FAIL Case 5: Tilde fences -> comment inside ~~~ detected as header (got titles: ['Header', 'This is a comment, not a header', 'Real Header'])
  FAIL Case 6: Mismatched fence lengths -> fake headers detected inside code block (got titles: ['Before', 'Still inside', 'After'])
  FAIL Case 7: ATX closing hashes -> title is 'My Section ##' instead of 'My Section'

Result: 6 failed, 1 passed out of 7
```
## Impact

These issues affect real-world Markdown documents:

- README files and documentation commonly have preamble text before the first header
- Technical docs use setext headers and various code fence styles
- Some Markdown editors add closing hashes to ATX headers
- A headerless `.md` file being completely discarded is a data loss bug
## Environment

- `pageindex` version: 0.3.0.dev1
- Python: 3.11