feat(convert): improve HTML-to-Markdown conversion quality #79
Merged
Fix critical conversion issues:
- Links: proper [text](href) format instead of broken ](href)
- Tables: convert to markdown tables with header separator
- Images: emit ![alt](src) instead of discarding
- Ordered lists: use 1. 2. 3. numbering instead of all bullets
- Definition lists: <dl>/<dt>/<dd> support
- Entities: expand from ~10 to 40+ named entities (trade, bull, hellip, smart quotes, currency symbols, arrows, etc.)
- Whitespace: preserve indentation for nested list rendering

Closes #73
What
Major improvements to HTML→Markdown conversion: fix broken links; add tables, images, ordered lists, and definition lists; and expand entity support.
Why
Current conversion quality is too low for agents that need to understand page structure. Broken links and missing tables lose critical information.
How
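- Links: on `</a>`, wrap collected text in `[text](href)`. Empty text uses autolink `<href>` format.
- Tables: convert to markdown tables with `|` separators and a header separator row.
- Images: emit `![alt](src)` from `<img>` tags.
- Ordered lists: list nesting is tracked with `(is_ordered, counter)` tuples. Ordered items get `1.`, `2.`, etc.
- Definition lists: `<dt>` → `**term**`, `<dd>` → `: definition`
- Whitespace: `clean_whitespace()` now preserves indentation after newlines for proper nested list rendering.

No new external dependencies; all custom implementation.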
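For reviewers who want a concrete picture of the link and ordered-list handling described above, here is a minimal sketch. It is not the code from this PR: the implementation language is not shown here, so Python is an assumption, and `SketchConverter` and `to_markdown` are hypothetical names. It also leans on the standard-library `html.parser` for tokenizing, whereas the PR describes a fully custom implementation. It covers only `<a>`, `<ol>`, `<ul>`, and `<li>`.

```python
import re
from html.parser import HTMLParser


class SketchConverter(HTMLParser):
    """Hypothetical illustration; handles only <a>, <ol>, <ul>, <li>."""

    def __init__(self):
        super().__init__()
        self.out = []     # finished output fragments
        self.link = None  # (href, [text parts]) while inside <a>
        self.lists = []   # stack of [is_ordered, counter] entries

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            # Start collecting link text; nothing is emitted until </a>.
            self.link = (attrs.get("href") or "", [])
        elif tag in ("ol", "ul"):
            self.lists.append([tag == "ol", 0])
        elif tag == "li" and self.lists:
            top = self.lists[-1]
            top[1] += 1
            indent = "  " * (len(self.lists) - 1)
            marker = f"{top[1]}." if top[0] else "-"
            self.out.append(f"\n{indent}{marker} ")

    def handle_endtag(self, tag):
        if tag == "a" and self.link is not None:
            href, parts = self.link
            text = "".join(parts).strip()
            # Empty link text falls back to the autolink form <href>.
            self.out.append(f"[{text}]({href})" if text else f"<{href}>")
            self.link = None
        elif tag in ("ol", "ul") and self.lists:
            self.lists.pop()

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        if self.link is not None:
            self.link[1].append(text)
        elif text.strip():
            self.out.append(text)


def to_markdown(html: str) -> str:
    parser = SketchConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

Emitting the link only at `</a>` is what avoids the broken `](href)` output: the text and the target are written together or not at all. Keeping a per-list counter on a stack lets a nested `<ol>` restart at `1.` without disturbing the numbering or bullets of its parent list.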
Risk
Checklist
Closes #73