Skip to content

Markdown parser fails on common edge cases #245

@saccharin98

Description

@saccharin98

Description

The current MarkdownParser in pageindex/parser/markdown.py has several robustness issues. It silently drops content or produces incorrect results on valid Markdown inputs that are common in real-world documents.

All cases below have been verified against the old parser — 6 out of 7 confirmed failing.

Failing Test Cases

1. Content before the first header is silently dropped

This is an introduction paragraph.
Some important context here.

# Chapter 1
Body text.

Expected: The preamble text should be captured as a node.
Actual: _build_nodes only iterates over headers, so everything before # Chapter 1 is lost. (1 node returned, preamble missing)

2. Headerless Markdown files produce zero nodes

Just some plain text document
with multiple lines of content.
No headers at all.

Expected: At least one node containing the full content.
Actual: _extract_headers returns [], _build_nodes returns [] — the entire document is silently discarded.

3. YAML frontmatter — not a separate bug

---
title: My Document
author: Alice
---

# Introduction
Hello world.

Verified: The frontmatter does NOT leak into node content — it is silently dropped as part of the preamble (same root cause as Case 1). However, the new parser now properly extracts frontmatter into metadata, which is the correct behavior.

4. Setext-style headers are not recognized

Main Title
==========

Some content here.

Sub Section
-----------

More content.

Expected: Two nodes — H1 "Main Title" and H2 "Sub Section".
Actual: Zero headers detected, zero nodes returned. Entire document lost.

5. Tilde code fences are not tracked

# Header

~~~python
# This is a comment, not a header
def foo():
    pass
~~~

# Real Header

Expected: 2 nodes: "Header" and "Real Header".
Actual: 3 nodes — the comment # This is a comment, not a header inside the tilde fence is incorrectly detected as a header.

6. Mismatched fence lengths cause incorrect header detection

Example input (a 4-backtick fence containing 3-backtick lines):

# Before

````python
# Inside 4-backtick fence
```
# Still inside — 3 backticks can't close a 4-backtick fence
```
````

# After

Expected: 2 headers: "Before" and "After".
Actual: 3 headers detected: "Before", "Still inside", "After". The parser uses a simple toggle on any ``` line, so it incorrectly "closes" the 4-backtick fence at the first 3-backtick line.

7. ATX closing hashes are included in title

## My Section ##

Content here.

Expected: Title is "My Section".
Actual: Title is "My Section ##" — the trailing ## is not stripped.

Reproduction Script

Self-contained script that inlines the old parser and runs all 7 cases. Save as test_old_parser.py and run with python test_old_parser.py.

Click to expand test script
"""Test the OLD markdown parser against the 7 edge cases from issue #245."""
import re
import sys
import tempfile
from pathlib import Path
from dataclasses import dataclass


# --- Inline the old parser so we don't need to mess with imports ---

@dataclass
class ContentNode:
    content: str
    tokens: int
    title: str | None = None
    index: int | None = None
    level: int | None = None
    images: list | None = None

@dataclass
class ParsedDocument:
    doc_name: str
    nodes: list
    metadata: dict | None = None

def count_tokens(text, model=None):
    return len(text.split())  # cheap approximation for testing

class OldMarkdownParser:
    """Exact copy of the parser BEFORE the fix (commit 6d29886)."""
    def supported_extensions(self):
        return [".md", ".markdown"]

    def parse(self, file_path, **kwargs):
        path = Path(file_path)
        model = kwargs.get("model")
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        lines = content.split("\n")
        headers = self._extract_headers(lines)
        nodes = self._build_nodes(headers, lines, model)
        return ParsedDocument(doc_name=path.stem, nodes=nodes)

    def _extract_headers(self, lines):
        header_pattern = r"^(#{1,6})\s+(.+)$"
        code_block_pattern = r"^```"
        headers = []
        in_code_block = False
        for line_num, line in enumerate(lines, 1):
            stripped = line.strip()
            if re.match(code_block_pattern, stripped):
                in_code_block = not in_code_block
                continue
            if not in_code_block and stripped:
                match = re.match(header_pattern, stripped)
                if match:
                    headers.append({
                        "title": match.group(2).strip(),
                        "level": len(match.group(1)),
                        "line_num": line_num,
                    })
        return headers

    def _build_nodes(self, headers, lines, model):
        nodes = []
        for i, header in enumerate(headers):
            start = header["line_num"] - 1
            end = headers[i + 1]["line_num"] - 1 if i + 1 < len(headers) else len(lines)
            text = "\n".join(lines[start:end]).strip()
            tokens = count_tokens(text, model=model)
            nodes.append(ContentNode(
                content=text,
                tokens=tokens,
                title=header["title"],
                index=header["line_num"],
                level=header["level"],
            ))
        return nodes


# --- Helper ---
def parse_string(parser, content, name="test"):
    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as f:
        f.write(content)
        f.flush()
        return parser.parse(f.name)


# --- Tests ---
parser = OldMarkdownParser()
failures = []
passes = []

# Case 1: Preamble before first header
result = parse_string(parser, """This is an introduction paragraph.
Some important context here.

# Chapter 1
Body text.
""")
preamble_found = any("introduction" in n.content for n in result.nodes)
if preamble_found:
    passes.append("Case 1: Preamble")
else:
    failures.append(
        "Case 1: Preamble before first header -> DROPPED "
        "(got {} nodes, none contain 'introduction')".format(len(result.nodes))
    )

# Case 2: Headerless file
result = parse_string(parser, """Just some plain text document
with multiple lines of content.
No headers at all.
""")
if len(result.nodes) > 0:
    passes.append("Case 2: Headerless file")
else:
    failures.append("Case 2: Headerless file -> 0 nodes returned, entire document LOST")

# Case 3: YAML frontmatter
result = parse_string(parser, """---
title: My Document
author: Alice
tags: [python, markdown]
---

# Introduction
Hello world.
""")
frontmatter_in_content = any("title: My Document" in n.content for n in result.nodes)
if not frontmatter_in_content:
    passes.append("Case 3: Frontmatter (not a separate bug, dropped with preamble)")
else:
    failures.append("Case 3: YAML frontmatter -> leaked into node content")

# Case 4: Setext headers
result = parse_string(parser, """Main Title
==========

Some content here.

Sub Section
-----------

More content.
""")
titles = [n.title for n in result.nodes if n.title]
if "Main Title" in titles and "Sub Section" in titles:
    passes.append("Case 4: Setext headers")
else:
    failures.append(
        "Case 4: Setext headers -> not recognized "
        "(got titles: {}, {} nodes)".format(titles, len(result.nodes))
    )

# Case 5: Tilde fences
result = parse_string(parser, """# Header

~~~python
# This is a comment, not a header
def foo():
    pass
~~~

# Real Header
""")
titles = [n.title for n in result.nodes if n.title]
if "This is a comment, not a header" not in titles:
    passes.append("Case 5: Tilde fences")
else:
    failures.append(
        "Case 5: Tilde fences -> comment inside ~~~ detected as header "
        "(got titles: {})".format(titles)
    )

# Case 6: Mismatched fence lengths
content = """# Before

````python
# Inside 4-backtick fence
```
# Still inside
```
````

# After
"""
result = parse_string(parser, content)
titles = [n.title for n in result.nodes if n.title]
has_fake = "Inside 4-backtick fence" in titles or "Still inside" in titles
if not has_fake:
    passes.append("Case 6: Mismatched fence lengths")
else:
    failures.append(
        "Case 6: Mismatched fence lengths -> fake headers detected "
        "inside code block (got titles: {})".format(titles)
    )

# Case 7: ATX closing hashes
result = parse_string(parser, """## My Section ##

Content here.
""")
if result.nodes and result.nodes[0].title == "My Section":
    passes.append("Case 7: ATX closing hashes")
else:
    actual_title = result.nodes[0].title if result.nodes else "N/A"
    failures.append(
        "Case 7: ATX closing hashes -> title is '{}' "
        "instead of 'My Section'".format(actual_title)
    )

# --- Report ---
print("=" * 60)
print("OLD PARSER EDGE CASE VERIFICATION")
print("=" * 60)
print()
for p in passes:
    print(f"  PASS  {p}")
for f in failures:
    print(f"  FAIL  {f}")
print()
print(f"Result: {len(failures)} failed, {len(passes)} passed out of 7")
sys.exit(1 if failures else 0)

Output:

============================================================
OLD PARSER EDGE CASE VERIFICATION
============================================================

  PASS  Case 3: Frontmatter (not a separate bug, dropped with preamble)
  FAIL  Case 1: Preamble before first header -> DROPPED (got 1 nodes, none contain 'introduction')
  FAIL  Case 2: Headerless file -> 0 nodes returned, entire document LOST
  FAIL  Case 4: Setext headers -> not recognized (got titles: [], 0 nodes)
  FAIL  Case 5: Tilde fences -> comment inside ~~~ detected as header (got titles: ['Header', 'This is a comment, not a header', 'Real Header'])
  FAIL  Case 6: Mismatched fence lengths -> fake headers detected inside code block (got titles: ['Before', 'Still inside', 'After'])
  FAIL  Case 7: ATX closing hashes -> title is 'My Section ##' instead of 'My Section'

Result: 6 failed, 1 passed out of 7

Impact

These issues affect real-world Markdown documents:

  • README files and documentation commonly have preamble text before the first header
  • Technical docs use setext headers and various code fence styles
  • Some Markdown editors add closing hashes to ATX headers
  • A headerless .md file being completely discarded is a data loss bug

Environment

  • pageindex version: 0.3.0.dev1
  • Python: 3.11

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions