Convert any URL to clean markdown for AI agents.
```sh
npx rdrr https://react.dev/learn
```

- Fast: no headless browser, lightweight
- Smart: 20+ site-specific extractors (Wiki, Reddit, X, Hacker News, GitHub, ChatGPT, Claude, Substack, ...)
- LLM-ready: strips ads, navigation, footers; keeps code blocks, tables, math
- Versatile: webpages, GitHub issues/PRs, PDFs, X profiles, YouTube transcripts
```sh
npm install rdrr
```

```sh
# Webpage
rdrr https://react.dev/learn

# YouTube transcript
rdrr https://www.youtube.com/watch?v=dQw4w9WgXcQ

# GitHub issue with comments
rdrr https://github.com/mozilla/readability/issues/1

# X timeline
rdrr https://x.com/discotune -n 10

# Save to file
rdrr https://example.com -o article.md

# JSON with metadata
rdrr https://example.com --json
```

```ts
import { parse } from "rdrr"

const result = await parse("https://en.wikipedia.org/wiki/TypeScript")
result.title // "TypeScript"
result.content // clean markdown
result.wordCount // 2847
result.siteName // "Wikipedia"
```

| Flag | Description |
|---|---|
| `-o, --output <file>` | Save to file instead of stdout |
| `-j, --json` | Full JSON with metadata |
| `-p, --property <name>` | Extract a single field (`title`, `content`, ...) |
| `-l, --language <code>` | Preferred language (BCP 47) |
| `-n, --limit <n>` | Max items for aggregate URLs (default: 10) |
| `--order <order>` | `newest` (default) or `oldest` |
| `--check` | Probe whether the URL is readable (exit 0/1) |
| `--llms` | Append the site's `/llms.txt` |
| `--debug` | Pipeline diagnostics to stderr |
```ts
import { parse } from "rdrr"

const result = await parse(url, {
  language: "en",
  includeLlmsTxt: true,
})
```

Returns a `ParseResult` with `type`, `title`, `author`, `content`, `description`, `domain`, `siteName`, `published`, `wordCount`, `readTime`, and more. The result is narrowed by `type`: `"webpage"`, `"youtube"`, `"github"`, `"pdf"`, or `"x-profile"`.
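The `type`-based narrowing works like a TypeScript discriminated union. A minimal self-contained sketch of the pattern (the shapes below are illustrative; only the `type` discriminant mirrors rdrr's actual `ParseResult`):

```typescript
// Illustrative sketch: a discriminated union narrowed on `type`,
// the same pattern ParseResult uses. Fields other than `type` and
// `title` are hypothetical, not rdrr's real types.
type SketchResult =
  | { type: "webpage"; title: string }
  | { type: "youtube"; title: string; videoId: string }
  | { type: "pdf"; title: string; pageCount: number }

function describe(result: SketchResult): string {
  switch (result.type) {
    case "youtube":
      // In this branch TypeScript knows `videoId` exists.
      return `${result.title} (video ${result.videoId})`
    case "pdf":
      return `${result.title} (${result.pageCount} pages)`
    default:
      return result.title
  }
}
```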
Run the extraction engine on raw HTML: useful for saved pages or pipelines where you already have the bytes.
```ts
import { parseHtml } from "rdrr"

const result = await parseHtml(html, {
  url: "https://example.com/article",
})
```

Lightweight pre-check: will this URL yield a meaningful article? Useful for routing in AI agents.
```ts
import { isProbablyReaderable } from "rdrr"

await isProbablyReaderable("https://example.com") // true | false
```

Also available as direct imports: `parseWeb`, `parseYouTube`, `parseGitHub`, `parsePdf`, `detectUrlType`, `extractVideoId`, `normalizeUrl`.
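As an illustration of what a helper like `extractVideoId` does, here is a hypothetical reimplementation covering the two most common YouTube URL shapes (rdrr's actual helper may handle more):

```typescript
// Hypothetical sketch, not rdrr's real extractVideoId: pull the
// video ID out of watch?v=... and youtu.be/... URLs.
function sketchExtractVideoId(raw: string): string | null {
  const u = new URL(raw)
  if (u.hostname === "youtu.be") return u.pathname.slice(1) || null
  if (u.hostname.endsWith("youtube.com")) return u.searchParams.get("v")
  return null
}
```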
| Type | What it handles |
|---|---|
| Webpages | Any HTML page with 20+ site-specific extractors |
| YouTube | Transcripts with chapters, speakers, timestamps |
| GitHub | Issues, PRs (with comments), raw files |
| PDFs | Any public `.pdf` (requires the optional `unpdf` package) |
| X/Twitter | Single posts and full profile timelines |
| llms.txt | Appended on demand via --llms or includeLlmsTxt |
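The routing this table implies can be pictured as a URL classifier in the spirit of `detectUrlType`. The sketch below is illustrative only; rdrr's real detection is more thorough:

```typescript
// Illustrative-only URL classifier; not rdrr's actual detectUrlType.
type UrlType = "webpage" | "youtube" | "github" | "pdf" | "x-profile"

function sketchDetectUrlType(raw: string): UrlType {
  const u = new URL(raw)
  const host = u.hostname.replace(/^www\./, "")
  if (u.pathname.toLowerCase().endsWith(".pdf")) return "pdf"
  if (host === "youtube.com" || host === "youtu.be") return "youtube"
  if (host === "github.com") return "github"
  // Treat an x.com path with no /status/ segment as a profile timeline.
  if ((host === "x.com" || host === "twitter.com") && !u.pathname.includes("/status/")) return "x-profile"
  return "webpage"
}
```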
- Discussion, questions, site-extractor requests: GitHub Discussions
- Bugs: GitHub Issues
- Security: see SECURITY.md
Contributions welcome! See CONTRIBUTING.md.
Want to add a site extractor? Check out `src/extract/sites/`: each extractor is a self-contained file.
MIT