extract2md is all about “HTML in → Markdown out.” You can start from a live
URL, a file on disk, or an already-loaded HTML string.
It can be used from the CLI or as a Python library.
pip install extract2md

Prerequisites:
- Python 3.10+ runtime
- Node.js (recommended for best results; powers Readability.js content extraction)

extract2md https://www.iana.org/help/example-domains
extract2md https://www.iana.org/help/example-domains > sample-output.md
# convert file
extract2md sample-page.html
# or from stdin
cat sample-page.html | extract2md -

Usage: extract2md [OPTIONS] SOURCE
source: HTTP(S) URL, filesystem path, or - when reading HTML from stdin.
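
To make the three accepted SOURCE forms concrete, here is a minimal sketch of how such an argument could be told apart. The `classify_source` helper is hypothetical and written for illustration only; it is not taken from extract2md and may differ from what the CLI does internally.

```python
from urllib.parse import urlsplit


def classify_source(source: str) -> str:
    """Return 'stdin', 'url', or 'file' for a SOURCE argument (illustrative only)."""
    if source == "-":
        # A lone dash conventionally means "read from standard input".
        return "stdin"
    if urlsplit(source).scheme in ("http", "https"):
        # urlsplit lowercases the scheme, so HTTPS:// also matches.
        return "url"
    # Anything else is treated as a filesystem path.
    return "file"


for example in ("-", "https://www.iana.org/help/example-domains", "sample-page.html"):
    print(example, "->", classify_source(example))
```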
- --ignore-robots: skip robots.txt validation (use sparingly).
- --proxy URL: HTTP(S) proxy forwarded to httpx.
- --timeout SECONDS: request timeout (default 30 seconds).
- --user-agent STRING: override the default identifier.
- --rewrite-relative-urls/--no-rewrite-relative-urls: enable or disable rewriting relative href/src attributes to absolute links (default on).
- --base-url URL: optional base URL for rewriting relative URLs (default: source).
- --converter NAME: choose the HTML conversion backend. Defaults to trafilatura; readability (requires Node.js) is also available.
- EXTRACT2MD_NODE_PATH: set this environment variable to the Node.js binary (or its directory) if Readability.js cannot find node on your PATH.
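
The relative-URL rewriting controlled by the options above amounts to resolving each href/src value against a base URL. The sketch below shows that idea with the standard library's `urljoin`; the `absolutize` helper is a hypothetical name, and this is not extract2md's own implementation.

```python
from urllib.parse import urljoin


def absolutize(link: str, base_url: str) -> str:
    """Resolve a possibly-relative href/src value against base_url."""
    return urljoin(base_url, link)


base = "https://example.com/docs/"
print(absolutize("guide.html", base))               # https://example.com/docs/guide.html
print(absolutize("../img/logo.png", base))          # https://example.com/img/logo.png
print(absolutize("/about", base))                   # https://example.com/about
print(absolutize("https://other.example/x", base))  # already absolute, left unchanged
```

Note that the trailing slash on the base URL matters: without it, `urljoin` treats the last path segment as a file name and replaces it.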
extract2md can also be used as a Python library.
from extract2md import fetch_to_markdown
markdown = fetch_to_markdown("https://www.iana.org/help/example-domains")

from extract2md import file_to_markdown
markdown_from_file = file_to_markdown("sample-page.html")

from extract2md import html_to_markdown
html = "<html><body><h1>Offline HTML</h1></body></html>"
markdown_from_html = html_to_markdown(html)
# Optionally disable replacing relative links with absolute URLs
markdown_custom = html_to_markdown(
    html,
    rewrite_relative_urls=False,
)
# Or replace relative links with a custom base URL
# (rewriting must stay enabled for base_url to take effect)
markdown_custom = html_to_markdown(
    html,
    rewrite_relative_urls=True,
    base_url="https://example.com/docs/",
)
# Pick an alternate conversion backend (e.g., Readability)
markdown_readability = html_to_markdown(html, converter="readability")

Need to store markup or run your own converter? Use fetch and skip the Markdown
step entirely:
from extract2md import fetch
raw_html, content_type = fetch("https://example.com/docs")

- The CLI and library both fetch live webpages from URLs; network availability and site rate limits apply.
- Inspired by the Fetch MCP Server.
- Thanks go to these libraries for the heavy lifting:
  - ReadabiliPy with Mozilla's Readability.js Node.js package
  - Markdownify
  - Trafilatura
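
As a follow-up to the fetch-only route described earlier, here is a rough sketch of what "running your own converter" on raw HTML might look like. It uses only the standard library's `html.parser` and works on a hard-coded string so it runs offline. The `HeadingOutline` class and `outline` function are hypothetical names for illustration, not part of extract2md.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect h1-h6 headings as Markdown heading lines."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self._level = 0  # 0 means "not currently inside a heading"
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level and tag == f"h{self._level}":
            # Collapse internal whitespace before emitting the heading.
            text = " ".join("".join(self._buffer).split())
            self.lines.append("#" * self._level + " " + text)
            self._level = 0


def outline(raw_html: str) -> str:
    """Return a Markdown outline built from the headings in raw_html."""
    parser = HeadingOutline()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.lines)


# Stand-in for the raw_html value that fetch returns in the example above.
raw_html = "<html><body><h1>Docs</h1><p>Intro</p><h2>Install <em>now</em></h2></body></html>"
print(outline(raw_html))
```

With the sample string, this prints a two-line outline (`# Docs` and `## Install now`); body paragraphs are deliberately ignored to keep the sketch short.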