
feat(convert): improve HTML-to-Markdown conversion quality #79

Merged
chaliy merged 1 commit into main from claude/issue-73-improve-conversion
Mar 27, 2026

Conversation

Contributor

@chaliy chaliy commented Mar 27, 2026

What

Major improvements to HTML→Markdown conversion — fix broken links, add tables, images, ordered lists, definition lists, and expand entity support.

Why

Current conversion quality is too low for agents that need to understand page structure. Broken links and missing tables lose critical information.

How

  • Links: Track link text position in output buffer; on </a> wrap collected text in [text](href). Empty text uses autolink <href> format.
  • Tables: Collect cells into rows, render as markdown table with | separators and header separator row.
  • Images: Emit ![alt](src) from <img> tags.
  • Ordered lists: Use stack of (is_ordered, counter) tuples. Ordered items get 1., 2., etc.
  • Definition lists: <dt> becomes **term**, <dd> becomes : definition
  • Entities: Expanded from ~10 to 40+ named entities (trade, bull, hellip, smart quotes, currency, arrows, fractions)
  • Whitespace: clean_whitespace() now preserves indentation after newlines for proper nested list rendering.
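
The link and ordered-list items above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Converter` type, its method names, and the two-space nesting indent are all assumptions made for the example.

```rust
/// Hypothetical converter state; all names here are illustrative
/// and are not taken from the PR's actual implementation.
struct Converter {
    out: String,
    /// (byte offset in `out` where the link text starts, href) for the open <a>.
    link: Option<(usize, String)>,
    /// One (is_ordered, counter) entry per currently open list.
    lists: Vec<(bool, usize)>,
}

impl Converter {
    fn new() -> Self {
        Converter { out: String::new(), link: None, lists: Vec::new() }
    }

    /// Append plain text to the output buffer.
    fn text(&mut self, s: &str) {
        self.out.push_str(s);
    }

    /// On <a href=...>: remember where the link text will begin.
    fn open_link(&mut self, href: &str) {
        self.link = Some((self.out.len(), href.to_string()));
    }

    /// On </a>: cut the collected text back out of the buffer and wrap it.
    fn close_link(&mut self) {
        if let Some((start, href)) = self.link.take() {
            let collected = self.out.split_off(start);
            let text = collected.trim();
            if text.is_empty() {
                // Empty link text falls back to the autolink form.
                self.out.push_str(&format!("<{}>", href));
            } else {
                self.out.push_str(&format!("[{}]({})", text, href));
            }
        }
    }

    /// On <ol> or <ul>: push a fresh counter for the new list.
    fn open_list(&mut self, ordered: bool) {
        self.lists.push((ordered, 0));
    }

    /// On </ol> or </ul>: drop the innermost list.
    fn close_list(&mut self) {
        self.lists.pop();
    }

    /// On <li>: emit a numbered or bulleted marker for the innermost list.
    fn list_item(&mut self, s: &str) {
        let depth = self.lists.len();
        if let Some((ordered, counter)) = self.lists.last_mut() {
            *counter += 1;
            let marker = if *ordered {
                format!("{}.", *counter)
            } else {
                "-".to_string()
            };
            // Assumed indent: two spaces per nesting level.
            let indent = "  ".repeat(depth - 1);
            self.out.push_str(&format!("{}{} {}\n", indent, marker, s));
        }
    }
}
```

Recording the buffer offset at the opening tag avoids a separate text accumulator for links, and keeping a counter per stack entry lets an outer ordered list resume its numbering after a nested list closes.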

No new external dependencies — all custom implementation.
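
To give a sense of how little is needed without external crates, here is a hedged sketch of entity decoding and table rendering using only the standard library. The function names `decode_named_entity` and `render_table` are hypothetical, the entity list is a small sample rather than the PR's full 40+ set, and the exact cell padding is an assumption.

```rust
/// Map a named entity (without the leading `&` and trailing `;`) to its text.
/// Only a small sample of entities is listed; names are illustrative.
fn decode_named_entity(name: &str) -> Option<&'static str> {
    Some(match name {
        "amp" => "&",
        "lt" => "<",
        "gt" => ">",
        "quot" => "\"",
        "nbsp" => "\u{00A0}",
        "trade" => "\u{2122}",
        "bull" => "\u{2022}",
        "hellip" => "\u{2026}",
        "ldquo" => "\u{201C}",
        "rdquo" => "\u{201D}",
        "euro" => "\u{20AC}",
        "rarr" => "\u{2192}",
        "frac12" => "\u{00BD}",
        // Unknown names are left for the caller to pass through unchanged.
        _ => return None,
    })
}

/// Render collected rows as a Markdown table, treating the first row as the header.
fn render_table(rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        out.push_str("| ");
        out.push_str(&row.join(" | "));
        out.push_str(" |\n");
        if i == 0 {
            // Header separator row: one `---` per header cell.
            let sep = vec!["---"; row.len()];
            out.push_str("| ");
            out.push_str(&sep.join(" | "));
            out.push_str(" |\n");
        }
    }
    out
}
```

A plain `match` keeps the entity lookup allocation-free, and returning `None` for unknown names lets the caller decide whether to emit the raw `&name;` text.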

Risk

  • Medium — changes core conversion behavior
  • All existing tests updated and passing
  • 12 new tests for links, tables, images, ordered lists, entities, definition lists

Checklist

  • Unit tests passed (all 229)
  • Clippy clean
  • Docs build clean
  • No new dependencies

Closes #73

Fix critical conversion issues:
- Links: proper [text](href) format instead of broken ](href)
- Tables: convert to markdown tables with header separator
- Images: emit ![alt](src) instead of discarding
- Ordered lists: use 1. 2. 3. numbering instead of all bullets
- Definition lists: <dl>/<dt>/<dd> support
- Entities: expand from ~10 to 40+ named entities (trade, bull,
  hellip, smart quotes, currency symbols, arrows, etc.)
- Whitespace: preserve indentation for nested list rendering

Closes #73
@chaliy chaliy force-pushed the claude/issue-73-improve-conversion branch from 233d552 to 673b1e5 March 27, 2026 03:11
@chaliy chaliy merged commit e74c328 into main Mar 27, 2026
10 checks passed
@chaliy chaliy deleted the claude/issue-73-improve-conversion branch March 27, 2026 03:12

Development

Successfully merging this pull request may close these issues.

feat: improve HTML→Markdown conversion (links, tables, images, ordered lists, entities)