markfetch is a zero-config MCP server that fetches web pages and returns clean, token-efficient Markdown with structured YAML frontmatter — no API keys, no external services required.
When an LLM reads a webpage, it receives everything: cookie banners, nav links, footer noise, related-article widgets. That's hundreds of wasted tokens per page — and the actual content gets buried.
The official mcp-server-fetch is buggy and insecure, and paid services like Firecrawl require API keys. There's no good free option that just works.
markfetch runs a multi-layer extraction pipeline (Defuddle → Mozilla Readability → Turndown → plain text) that strips noise and returns tight, readable Markdown. Large pages auto-condense to a heading outline. PDFs are extracted to structured text. Multiple URLs fetch in parallel. Search works out of the box. All with SSRF protection, robots.txt compliance, and zero configuration.
Prerequisites: Node.js 18+
For Claude Code:

```bash
claude mcp add markfetch -- npx -y markfetch
```

Or add manually to `~/.claude/settings.json`:
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

Restart Claude Code — the tools are available immediately.
Other clients (Claude Desktop, Cursor, Windsurf, Codex)
Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json` on Mac, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):

```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

Cursor (`.cursor/mcp.json` per project, or `~/.cursor/mcp.json` globally):
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

Windsurf (`~/.codeium/windsurf/mcp_config.json`):
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

Codex (`~/.codex/config.json`):
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}
```

| Tool | Description |
|---|---|
| `web_fetch` | Fetch a URL → clean Markdown with YAML frontmatter. Auto-condenses large pages. |
| `web_fetch_batch` | Fetch multiple URLs in parallel with per-domain concurrency limits. |
| `web_fetch_links` | Extract and classify all links from a page (internal vs external). |
| `web_fetch_raw` | Return raw HTML/text without markdown conversion. |
| `web_search` | Search the web — works out of the box, no API keys needed. |
| `web_diff` | Compare current page content against a prior snapshot as a unified diff. |
Fetches a URL and runs it through the extraction pipeline. Returns Markdown with YAML frontmatter (title, author, word count, extraction method). Large pages (>12K chars) auto-condense to a heading outline unless `max_chars` is set.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | — | URL to fetch |
| `extract_main` | boolean | no | `true` | Extract main article content (strips nav/ads). Set `false` for full page. |
| `start_index` | number | no | `0` | Byte offset for pagination. Use `nextIndex` from the prior response. |
| `max_chars` | number | no | auto | Max characters to return (500–200,000). Omit for auto-condensing. |
| `headers` | object | no | `{}` | Custom HTTP headers. Bypasses the shared cache. |
| `respect_robots` | boolean | no | `true` | Check robots.txt before fetching. Set `false` for internal/authorized URLs. |
Example response:
```markdown
---
title: "Page Title"
url: "https://example.com/article"
author: "Author Name"
published_date: "2026-01-15"
site_name: "Example"
schema_type: "Article"
canonical_url: "https://example.com/article"
language: "en"
word_count: 1523
extraction_method: "defuddle"
---

# Page Title

Page content here...
```
Additional frontmatter fields appear when relevant: `page_count` for PDFs, `rendered: true` when Playwright was used, `blocked: true` when bot detection is encountered.
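Since the frontmatter is flat `key: value` YAML, a client can recover the metadata without a YAML library. A minimal sketch, assuming only the flat fields shown in the example above (a real consumer should use a proper YAML parser for quoted or nested values):

```typescript
// Split a web_fetch response into frontmatter fields and markdown body.
// Handles only flat `key: value` lines.
function parseResponse(text: string): { meta: Record<string, string>; body: string } {
  const match = text.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: text };
  const meta: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    // Strip the surrounding quotes around string values.
    const value = line.slice(idx + 1).trim().replace(/^"|"$/g, "");
    meta[key] = value;
  }
  return { meta, body: text.slice(match[0].length) };
}

const sample = `---\ntitle: "Page Title"\nword_count: 1523\n---\n# Page Title\n`;
const { meta, body } = parseResponse(sample);
```

Note that all values come back as strings (`word_count` is `"1523"`); convert numerics as needed.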
Fetches multiple URLs in parallel using `Promise.allSettled` — one failure does not block others. Default concurrency: 5 total, 3 per domain.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | yes | — | Array of URLs to fetch (max 20) |
| `extract_main` | boolean | no | `true` | Extract main article content for all URLs. |
| `headers` | object | no | `{}` | Custom headers applied to every request. |
| `respect_robots` | boolean | no | `true` | Check robots.txt before fetching. |
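The scheduling rule (global cap of 5 in flight, 3 per domain) can be illustrated as an admission check. This is a sketch of the policy only, not markfetch's actual implementation, which queues work with `p-limit`:

```typescript
// Decide which of a batch's URLs may start immediately under the
// default limits: 5 in flight total, 3 in flight per domain.
function admissibleNow(urls: string[], total = 5, perDomain = 3): string[] {
  const admitted: string[] = [];
  const perHost = new Map<string, number>();
  for (const url of urls) {
    if (admitted.length >= total) break; // global cap reached
    const host = new URL(url).hostname;
    const used = perHost.get(host) ?? 0;
    if (used >= perDomain) continue; // domain cap reached; queued for later
    perHost.set(host, used + 1);
    admitted.push(url);
  }
  return admitted;
}

const batch = [
  "https://a.example/1", "https://a.example/2", "https://a.example/3",
  "https://a.example/4", "https://b.example/1", "https://b.example/2",
];
const first = admissibleNow(batch);
// a.example/4 is held back by the per-domain cap of 3.
```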
Extracts all links from a page, deduplicating by URL and classifying as internal or external.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | — | URL to extract links from |
| `filter_external` | boolean | no | `false` | Only return external links. |
| `filter_internal` | boolean | no | `false` | Only return internal links. |
| `respect_robots` | boolean | no | `true` | Check robots.txt before fetching. |
Returns the raw HTML or text response without conversion. Useful when markdown conversion loses structure you need.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | — | URL to fetch |
| `max_chars` | number | no | `50000` | Max characters to return (100–100,000). |
| `respect_robots` | boolean | no | `true` | Check robots.txt before fetching. |
Searches the web and returns a Markdown table of results. Works out of the box using DDGS (Dux Distributed Global Search) — a metasearch library that aggregates results from Bing, Brave, DuckDuckGo, Google, Mojeek, Yandex, Yahoo, and Wikipedia. Auto-installed into a Python venv on first use via `python3` (or `uvx` fallback). No API keys required. Set `SEARXNG_URL` for a self-hosted SearXNG backend instead.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | yes | — | Search query |
| `count` | number | no | `10` | Number of results (1–20) |
| `language` | string | no | — | ISO 639 language code (e.g., `en`, `de`) |
| `time_range` | string | no | — | `day`, `week`, `month`, or `year` |
| `safesearch` | number | no | — | `0` (off), `1` (moderate), `2` (strict) |
| `categories` | string | no | — | Comma-separated SearXNG categories |
Compares current page content against the most recent snapshot and returns a unified diff. First call on a new URL saves a baseline snapshot; subsequent calls show changes. YAML frontmatter is stripped before diffing.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | — | URL to compare |
| `extract_main` | boolean | no | `true` | Extract main article content. |
| `headers` | object | no | `{}` | Custom HTTP headers. |
| `respect_robots` | boolean | no | `true` | Check robots.txt before fetching. |
Snapshots are stored at `$SNAPSHOT_DIR` (default `~/.markfetch/snapshots/`) and pruned after `SNAPSHOT_MAX_AGE_DAYS` days (default 30).
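The snapshot flow can be sketched as follows. This is a simplified illustration (in-memory store, naive line-by-line comparison); markfetch persists snapshots to disk and produces a full unified diff via the `diff` package:

```typescript
// In-memory stand-in for the on-disk snapshot store.
const snapshots = new Map<string, string>();

// Frontmatter is stripped before diffing, as web_diff does.
function stripFrontmatter(text: string): string {
  return text.replace(/^---\n[\s\S]*?\n---\n?/, "");
}

// Returns null on the first call (baseline saved), or the changed
// lines on subsequent calls.
function diffAgainstSnapshot(url: string, markdown: string): string[] | null {
  const body = stripFrontmatter(markdown);
  const prior = snapshots.get(url);
  snapshots.set(url, body);
  if (prior === undefined) return null; // baseline saved
  const before = prior.split("\n");
  const after = body.split("\n");
  const changes: string[] = [];
  for (let i = 0; i < Math.max(before.length, after.length); i++) {
    if (before[i] !== after[i]) {
      if (before[i] !== undefined) changes.push(`- ${before[i]}`);
      if (after[i] !== undefined) changes.push(`+ ${after[i]}`);
    }
  }
  return changes;
}

const url = "https://example.com/page";
const first = diffAgainstSnapshot(url, '---\ntitle: "T"\n---\nHello');
const second = diffAgainstSnapshot(url, '---\ntitle: "T"\n---\nGoodbye');
```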
| Client | Status |
|---|---|
| Claude Code | ✅ Supported |
| Claude Desktop | ✅ Supported |
| Cursor | ✅ Supported |
| Windsurf | ✅ Supported |
| OpenAI Codex | ✅ Supported |
| Any MCP client | ✅ Supported |
Pass environment variables via the `env` key in your MCP config:
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "CACHE_TTL_MS": "600000",
        "FETCH_TIMEOUT_MS": "30000",
        "RATE_LIMIT_PER_DOMAIN": "20"
      }
    }
  }
}
```

| Variable | Default | Description |
|---|---|---|
| `CACHE_TTL_MS` | `300000` (5 min) | In-memory cache TTL for fetched pages |
| `FETCH_TIMEOUT_MS` | `15000` (15 sec) | Per-request fetch timeout |
| `RATE_LIMIT_PER_DOMAIN` | `10` (req/min) | Requests per domain per minute. `0` to disable. |
| `BATCH_CONCURRENCY` | `5` | Max simultaneous URLs in `web_fetch_batch` |
| `BATCH_PER_DOMAIN` | `3` | Max simultaneous URLs per domain in batch |
| `SNAPSHOT_DIR` | `~/.markfetch/snapshots/` | Snapshot storage for `web_diff` |
| `SNAPSHOT_MAX_AGE_DAYS` | `30` | Days to retain snapshots |
| `SNAPSHOTS_ENABLED` | `true` | Set `false` to disable auto-snapshotting |
| Variable | Default | Description |
|---|---|---|
| `PLAYWRIGHT_DISABLED` | `false` | Set `true` to disable Playwright fallback for JS-rendered pages |
| `PLAYWRIGHT_TIMEOUT_MS` | `30000` (30 sec) | Timeout for Playwright page render |
| `SEARXNG_URL` | (unset) | SearXNG instance URL for `web_search`. Without it, DDGS is used. |
| `SEARXNG_AUTH_TOKEN` | (unset) | Bearer token for auth-gated SearXNG instances |
| `SEARCH_BACKEND` | `auto` | Force backend: `auto`, `searxng`, or `ddgs` |
| `COMPLIANCE_MODE` | `permissive` | `strict`, `balanced`, or `permissive` (see Compliance) |
| `HONEST_UA` | `false` | Send `markfetch/<version>` as User-Agent. Auto-enabled in strict mode. |
| Variable | Default | Description |
|---|---|---|
| `PORT` | `3030` | HTTP port. Setting this also activates HTTP transport (no `--http` flag needed). |
By default, markfetch runs as a stdio MCP server. For remote or multi-client deployments, start in HTTP mode:
```bash
node dist/index.js --http
node dist/index.js --http --cors --port 3030
```

| Flag | Default | Description |
|---|---|---|
| `--http` | off | Enable Streamable HTTP transport |
| `--port <n>` | `3030` | HTTP port |
| `--host <addr>` | `127.0.0.1` | Bind address |
| `--cors` | off | Enable CORS (`Access-Control-Allow-Origin: *`) |
A `/health` endpoint returns `{ "status": "ok", "transport": "streamable-http" }` for liveness checks. The MCP endpoint is `POST /mcp`.
| Package | Purpose |
|---|---|
| `@modelcontextprotocol/sdk` | MCP server framework (stdio + Streamable HTTP transport) |
| `defuddle` | Primary content extractor — produces markdown with metadata |
| `@mozilla/readability` | Secondary content extractor (fallback when Defuddle fails) |
| `turndown` | HTML-to-Markdown converter (fallback after Readability) |
| `turndown-plugin-gfm` | GFM tables and strikethrough support for Turndown |
| `jsdom` | DOM implementation for Node.js — used by Readability and link extraction |
| `zod` | Schema validation for MCP tool input parameters |
| `pdf-parse` | PDF text extraction to page-structured markdown |
| `diff` | Unified diff generation for `web_diff` snapshots |
| `express` | HTTP server for Streamable HTTP transport mode |
| `p-limit` | Concurrency limiter for batch fetch and search |
| `robots-parser` | robots.txt parsing for compliance checks |
| Package | Purpose |
|---|---|
| `playwright` | Headless browser for JS-rendered pages (optional peer dependency) |
| `ddgs` (Python) | DDGS metasearch library — auto-installed into a venv on first `web_search` call |
markfetch uses a 4-layer extraction cascade — each layer catches errors silently and falls back to the next:
- Defuddle — produces markdown natively with rich metadata (title, author, schema.org data)
- Mozilla Readability — extracts main article content from the DOM
- Turndown — converts full HTML to markdown with GFM table/strikethrough support
- Plain text — strips all tags as a last resort
A 50-character minimum threshold determines whether an extraction is successful. If static extraction yields too little content and Playwright is available, the page is re-fetched via a headless browser.
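The cascade and its 50-character success threshold can be sketched with stub extractors. This is illustrative only; the real layers are Defuddle, Readability, Turndown, and plain text, as listed above:

```typescript
type Extractor = { name: string; extract: (html: string) => string };

const MIN_CONTENT_CHARS = 50; // below this, the layer is considered to have failed

// Run each layer in order; swallow errors and fall through, as the
// pipeline does, returning the first result that clears the threshold.
function extractCascade(html: string, layers: Extractor[]): { method: string; content: string } {
  for (const layer of layers) {
    try {
      const content = layer.extract(html);
      if (content.trim().length >= MIN_CONTENT_CHARS) {
        return { method: layer.name, content };
      }
    } catch {
      // silent failure: fall through to the next layer
    }
  }
  return { method: "none", content: "" };
}

// Stub layers: the first throws, the second returns too little,
// the third succeeds.
const layers: Extractor[] = [
  { name: "defuddle", extract: () => { throw new Error("parse error"); } },
  { name: "readability", extract: () => "too short" },
  { name: "plaintext", extract: (html) => html.replace(/<[^>]+>/g, "") },
];

const page = "<p>" + "A long enough paragraph of article text. ".repeat(3) + "</p>";
const result = extractCascade(page, layers);
```

In a real run, too little content at this point is also the trigger for the Playwright re-fetch described below.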
Additional pipeline features:
- SSRF protection — DNS resolution checks before fetch AND after redirect, blocking private IP ranges
- Block detection — identifies Cloudflare, PerimeterX, Akamai, DataDome, and CAPTCHA pages
- Domain cooldown — backs off after repeated block signals (5 min, extended to 30 min)
- robots.txt / agents.txt — compliance checking with configurable enforcement
- PDF extraction — `pdf-parse` converts PDFs to page-structured markdown
- Smart chunking — splits at paragraph/heading boundaries, never inside code blocks or tables
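Boundary-aware chunking of the kind described can be sketched as: break only at blank lines, and treat an open code fence as unbreakable. An illustrative sketch, not markfetch's exact algorithm:

```typescript
// Split markdown into chunks of roughly maxChars, breaking only at
// paragraph boundaries and never while inside a fenced code block.
function chunkMarkdown(md: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  let block = "";
  let inFence = false;
  const flushBlock = () => {
    // Start a new chunk if adding this block would overflow the limit.
    if (current && current.length + block.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current += block;
    block = "";
  };
  for (const line of md.split("\n")) {
    if (line.startsWith("```")) inFence = !inFence;
    block += line + "\n";
    // A blank line outside a fence closes the current paragraph.
    if (!inFence && line.trim() === "") flushBlock();
  }
  flushBlock();
  if (current) chunks.push(current);
  return chunks;
}

const doc = "para one\n\npara two\n\n```\ncode line\n\nstill code\n```\n\npara three\n";
const chunks = chunkMarkdown(doc, 20);
// The fenced block stays intact in a single chunk even though it
// contains a blank line.
```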
Playwright is auto-detected at startup. If installed as a peer dependency, markfetch automatically retries via a headless browser when static extraction yields low content. No configuration required.
To install Playwright:
```bash
npm install playwright
npx playwright install chromium
```

To disable Playwright even when installed:
```json
{
  "env": { "PLAYWRIGHT_DISABLED": "true" }
}
```

When Playwright is used, the response frontmatter includes `rendered: true`.
| Mode | Behavior |
|---|---|
| `permissive` (default) | v2-identical behavior — no enforcement beyond robots.txt opt-in |
| `balanced` | agents.txt advisory warnings, block detection surfaced, no fetch gating |
| `strict` | Full enforcement: robots.txt, agents.txt, honest UA, domain cooldown |
```json
{
  "env": { "COMPLIANCE_MODE": "strict" }
}
```

No breaking changes — all new behaviors default OFF.
| What changed | Default | Action required |
|---|---|---|
| `web_search` works without `SEARXNG_URL` | DDGS auto-activates via `python3` or `uvx` | None |
| Block detection metadata in frontmatter | Always surfaced on detection | None — new `blocked` / `reason` fields |
| Search responses include `backend` field | Always present | None — new field in output |
| Compliance mode | `permissive` (v2-identical) | Set `COMPLIANCE_MODE=strict` for full enforcement |
| Playwright auto-detection | Auto-detects when installed | Use `PLAYWRIGHT_DISABLED=true` to opt out |
```bash
git clone https://github.com/thoaud/llm-markdown-proxy
cd llm-markdown-proxy
npm install              # builds via prepare script
npm run dev              # live TypeScript reload
npm test                 # unit tests (no network)
npm run test:integration # requires internet
```

Point a client at your local build:
```json
{
  "mcpServers": {
    "markfetch": {
      "command": "node",
      "args": ["/absolute/path/to/dist/index.js"]
    }
  }
}
```

- SSRF protection on every code path — pre-fetch DNS check and post-redirect hostname verification block private IP ranges (127.x, 10.x, 192.168.x, 172.16-31.x, 169.254.x, IPv6 loopback/link-local/ULA)
- robots.txt compliance with per-domain caching (1-hour TTL)
- Per-domain rate limiting prevents accidental abuse (default 10 req/min)
- Block detection with automatic domain cooldown
- No outbound data — markfetch only reads pages, never sends user data externally
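The private IPv4 ranges listed above map to a straightforward check. A simplified IPv4-only sketch (markfetch also covers the IPv6 loopback, link-local, and ULA ranges):

```typescript
// True if a dotted-quad IPv4 address falls in a range that SSRF
// protection must block: loopback, RFC 1918 private, and link-local.
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return true; // not a valid IPv4 address: refuse rather than guess
  }
  const [a, b] = parts;
  if (a === 127) return true;                       // 127.0.0.0/8 loopback
  if (a === 10) return true;                        // 10.0.0.0/8
  if (a === 192 && b === 168) return true;          // 192.168.0.0/16
  if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12
  if (a === 169 && b === 254) return true;          // 169.254.0.0/16 link-local
  return false;
}
```

Checking resolved addresses (not hostnames) both before the fetch and after each redirect is what closes the DNS-rebinding and open-redirect variants of SSRF.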
Tools don't appear after adding the config
- Verify Node.js 18+: `node --version`
- Test the server starts: `npx -y markfetch` (should hang waiting for stdin — that's correct)
- Check your config file is valid JSON
- Restart the client completely (not just reload)
Fetch fails for internal/private URLs
markfetch blocks requests to private IP ranges (127.x, 10.x, 192.168.x, etc.) to prevent SSRF attacks. It is designed for public URLs only.
Response is truncated or shows a heading outline
Pages over 12K characters auto-condense to a heading outline. Pass `max_chars` (e.g., `max_chars: 50000`) to get full content, or paginate with `start_index` using the `nextIndex` value from the previous response.
SearXNG URL blocked by SSRF guard
Because markfetch blocks private IP ranges, `SEARXNG_URL=http://localhost:8888` will be blocked. For local SearXNG, use a non-loopback hostname (e.g., `http://host.docker.internal:8888` or a LAN address).
Playwright fallback error: "Chromium is not installed"
Run `npx playwright install chromium` to install the browser binary, then retry. To disable the Playwright fallback entirely, set `PLAYWRIGHT_DISABLED=true`.