
markfetch


markfetch is a zero-config MCP server that fetches web pages and returns clean, token-efficient Markdown with structured YAML frontmatter — no API keys, no external services required.


Why this exists

When an LLM reads a webpage, it receives everything: cookie banners, nav links, footer noise, related-article widgets. That's hundreds of wasted tokens per page — and the actual content gets buried.

The official mcp-server-fetch is buggy and insecure, and paid services like Firecrawl require API keys. There's no good free option that just works.

markfetch runs a multi-layer extraction pipeline (Defuddle → Mozilla Readability → Turndown → plain text) that strips noise and returns tight, readable Markdown. Large pages auto-condense to a heading outline. PDFs are extracted to structured text. Multiple URLs fetch in parallel. Search works out of the box. All with SSRF protection, robots.txt compliance, and zero configuration.


Quick start

Prerequisites: Node.js 18+

Claude Code

claude mcp add markfetch -- npx -y markfetch

Or add manually to ~/.claude/settings.json:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Restart Claude Code — the tools are available immediately.

Other clients (Claude Desktop, Cursor, Windsurf, Codex)

Claude Desktop

Config file location:

  • Mac: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Cursor

.cursor/mcp.json (project) or ~/.cursor/mcp.json (global):

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Windsurf

~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

OpenAI Codex

~/.codex/config.toml (Codex CLI uses TOML, not JSON):

[mcp_servers.markfetch]
command = "npx"
args = ["-y", "markfetch"]

Tools

| Tool | Description |
| --- | --- |
| web_fetch | Fetch a URL → clean Markdown with YAML frontmatter. Auto-condenses large pages. |
| web_fetch_batch | Fetch multiple URLs in parallel with per-domain concurrency limits. |
| web_fetch_links | Extract and classify all links from a page (internal vs external). |
| web_fetch_raw | Return raw HTML/text without markdown conversion. |
| web_search | Search the web — works out of the box, no API keys needed. |
| web_diff | Compare current page content against a prior snapshot as a unified diff. |

web_fetch

Fetches a URL and runs it through the extraction pipeline. Returns Markdown with YAML frontmatter (title, author, word count, extraction method). Large pages (>12K chars) auto-condense to a heading outline unless max_chars is set.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to fetch |
| extract_main | boolean | no | true | Extract main article content (strips nav/ads). Set false for full page. |
| start_index | number | no | 0 | Byte offset for pagination. Use nextIndex from the prior response. |
| max_chars | number | no | auto | Max characters to return (500–200,000). Omit for auto-condensing. |
| headers | object | no | {} | Custom HTTP headers. Bypasses the shared cache. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. Set false for internal/authorized URLs. |

Example response:

---
title: "Page Title"
url: "https://example.com/article"
author: "Author Name"
published_date: "2026-01-15"
site_name: "Example"
schema_type: "Article"
canonical_url: "https://example.com/article"
language: "en"
word_count: 1523
extraction_method: "defuddle"
---

# Page Title

Page content here...

Additional frontmatter fields appear when relevant: page_count for PDFs, rendered: true when Playwright was used, blocked: true when bot detection is encountered.

web_fetch_batch

Fetches multiple URLs in parallel using Promise.allSettled — one failure does not block others. Default concurrency: 5 total, 3 per domain.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | yes | | Array of URLs to fetch (max 20) |
| extract_main | boolean | no | true | Extract main article content for all URLs. |
| headers | object | no | {} | Custom headers applied to every request. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |
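The failure-isolation behavior can be sketched as follows. This is a simplified illustration of the Promise.allSettled pattern the README describes, not the actual markfetch source: fetchOne is a stand-in for the real fetch pipeline, and per-domain concurrency limiting is omitted.

```typescript
// Sketch: fetch many URLs in parallel; one rejection never fails the batch.
async function fetchBatch(
  urls: string[],
  fetchOne: (url: string) => Promise<string>
): Promise<{ url: string; ok: boolean; value?: string; error?: string }[]> {
  // allSettled resolves once every promise has either fulfilled or rejected
  const settled = await Promise.allSettled(urls.map((u) => fetchOne(u)));
  return settled.map((s, i) =>
    s.status === "fulfilled"
      ? { url: urls[i], ok: true, value: s.value }
      : { url: urls[i], ok: false, error: String(s.reason) }
  );
}
```

Each result carries its own ok flag, so a client can render successes immediately and report failures per-URL.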

web_fetch_links

Extracts all links from a page, deduplicating by URL and classifying as internal or external.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to extract links from |
| filter_external | boolean | no | false | Only return external links. |
| filter_internal | boolean | no | false | Only return internal links. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |

web_fetch_raw

Returns the raw HTML or text response without conversion. Useful when markdown conversion loses structure you need.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to fetch |
| max_chars | number | no | 50000 | Max characters to return (100–100,000). |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |

web_search

Searches the web and returns a Markdown table of results. Works out of the box using DDGS (Dux Distributed Global Search) — a metasearch library that aggregates results from Bing, Brave, DuckDuckGo, Google, Mojeek, Yandex, Yahoo, and Wikipedia. Auto-installed into a Python venv on first use via python3 (or uvx fallback). No API keys required. Set SEARXNG_URL for a self-hosted SearXNG backend instead.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | string | yes | | Search query |
| count | number | no | 10 | Number of results (1–20) |
| language | string | no | | ISO 639 language code (e.g., en, de) |
| time_range | string | no | | day, week, month, or year |
| safesearch | number | no | | 0 (off), 1 (moderate), 2 (strict) |
| categories | string | no | | Comma-separated SearXNG categories |

web_diff

Compares current page content against the most recent snapshot and returns a unified diff. First call on a new URL saves a baseline snapshot; subsequent calls show changes. YAML frontmatter is stripped before diffing.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to compare |
| extract_main | boolean | no | true | Extract main article content. |
| headers | object | no | {} | Custom HTTP headers. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |

Snapshots are stored at $SNAPSHOT_DIR (default ~/.markfetch/snapshots/) and pruned after SNAPSHOT_MAX_AGE_DAYS days (default 30).
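Age-based pruning of that snapshot directory might look roughly like the sketch below. This is a hypothetical helper, not markfetch's actual code; the real file layout and pruning trigger may differ.

```typescript
import * as fs from "fs";
import * as path from "path";

// Sketch: delete snapshot files whose modification time is older than the
// retention window (SNAPSHOT_MAX_AGE_DAYS in the README). Returns the names
// of the files that were removed.
function pruneSnapshots(dir: string, maxAgeDays: number, now = Date.now()): string[] {
  const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000;
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    const file = path.join(dir, name);
    if (fs.statSync(file).mtimeMs < cutoff) {
      fs.unlinkSync(file);
      removed.push(name);
    }
  }
  return removed;
}
```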


Compatibility

| Client | Status |
| --- | --- |
| Claude Code | ✅ Supported |
| Claude Desktop | ✅ Supported |
| Cursor | ✅ Supported |
| Windsurf | ✅ Supported |
| OpenAI Codex | ✅ Supported |
| Any MCP client | ✅ Supported |

Configuration

Pass environment variables via the env key in your MCP config:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "CACHE_TTL_MS": "600000",
        "FETCH_TIMEOUT_MS": "30000",
        "RATE_LIMIT_PER_DOMAIN": "20"
      }
    }
  }
}

Core (always active)

| Variable | Default | Description |
| --- | --- | --- |
| CACHE_TTL_MS | 300000 (5 min) | In-memory cache TTL for fetched pages |
| FETCH_TIMEOUT_MS | 15000 (15 sec) | Per-request fetch timeout |
| RATE_LIMIT_PER_DOMAIN | 10 (req/min) | Requests per domain per minute. 0 to disable. |
| BATCH_CONCURRENCY | 5 | Max simultaneous URLs in web_fetch_batch |
| BATCH_PER_DOMAIN | 3 | Max simultaneous URLs per domain in batch |
| SNAPSHOT_DIR | ~/.markfetch/snapshots/ | Snapshot storage for web_diff |
| SNAPSHOT_MAX_AGE_DAYS | 30 | Days to retain snapshots |
| SNAPSHOTS_ENABLED | true | Set false to disable auto-snapshotting |

Opt-in

| Variable | Default | Description |
| --- | --- | --- |
| PLAYWRIGHT_DISABLED | false | Set true to disable Playwright fallback for JS-rendered pages |
| PLAYWRIGHT_TIMEOUT_MS | 30000 (30 sec) | Timeout for Playwright page render |
| SEARXNG_URL | (unset) | SearXNG instance URL for web_search. Without it, DDGS is used. |
| SEARXNG_AUTH_TOKEN | (unset) | Bearer token for auth-gated SearXNG instances |
| SEARCH_BACKEND | auto | Force backend: auto, searxng, or ddgs |
| COMPLIANCE_MODE | permissive | strict, balanced, or permissive (see Compliance modes) |
| HONEST_UA | false | Send markfetch/<version> as User-Agent. Auto-enabled in strict mode. |

HTTP transport

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 3030 | HTTP port. Setting this also activates HTTP transport (no --http flag needed). |

HTTP transport (remote mode)

By default, markfetch runs as a stdio MCP server. For remote or multi-client deployments, start in HTTP mode:

node dist/index.js --http
node dist/index.js --http --cors --port 3030

| Flag | Default | Description |
| --- | --- | --- |
| --http | off | Enable Streamable HTTP transport |
| --port <n> | 3030 | HTTP port |
| --host <addr> | 127.0.0.1 | Bind address |
| --cors | off | Enable CORS (Access-Control-Allow-Origin: *) |

A /health endpoint returns { "status": "ok", "transport": "streamable-http" } for liveness checks. The MCP endpoint is POST /mcp.


Dependencies

Runtime

| Package | Purpose |
| --- | --- |
| @modelcontextprotocol/sdk | MCP server framework (stdio + Streamable HTTP transport) |
| defuddle | Primary content extractor — produces markdown with metadata |
| @mozilla/readability | Secondary content extractor (fallback when Defuddle fails) |
| turndown | HTML-to-Markdown converter (fallback after Readability) |
| turndown-plugin-gfm | GFM tables and strikethrough support for Turndown |
| jsdom | DOM implementation for Node.js — used by Readability and link extraction |
| zod | Schema validation for MCP tool input parameters |
| pdf-parse | PDF text extraction to page-structured markdown |
| diff | Unified diff generation for web_diff snapshots |
| express | HTTP server for Streamable HTTP transport mode |
| p-limit | Concurrency limiter for batch fetch and search |
| robots-parser | robots.txt parsing for compliance checks |

Optional

| Package | Purpose |
| --- | --- |
| playwright | Headless browser for JS-rendered pages (optional peer dependency) |
| ddgs (Python) | DDGS metasearch library — auto-installed into a venv on first web_search call |

How it works

markfetch uses a 4-layer extraction cascade — each layer catches errors silently and falls back to the next:

  1. Defuddle — produces markdown natively with rich metadata (title, author, schema.org data)
  2. Mozilla Readability — extracts main article content from the DOM
  3. Turndown — converts full HTML to markdown with GFM table/strikethrough support
  4. Plain text — strips all tags as a last resort

A 50-character minimum threshold determines whether an extraction is successful. If static extraction yields too little content and Playwright is available, the page is re-fetched via a headless browser.
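The cascade and its success threshold can be sketched roughly as below. This is an illustration of the pattern, not the actual markfetch source; the extractor functions in the example are throwaway stand-ins, not real Defuddle or Readability integrations.

```typescript
// Sketch: try extractors in order; an extractor "fails" if it throws or
// returns fewer than MIN_CONTENT_CHARS characters. The last resort strips tags.
type Extractor = (html: string) => string;

const MIN_CONTENT_CHARS = 50; // below this, the result counts as a failure

function runCascade(html: string, layers: [string, Extractor][]): { method: string; text: string } {
  for (const [name, extract] of layers) {
    try {
      const text = extract(html);
      if (text.trim().length >= MIN_CONTENT_CHARS) {
        return { method: name, text };
      }
    } catch {
      // swallow the error and fall through to the next layer
    }
  }
  // last resort: crude tag stripping to plain text
  return { method: "plain-text", text: html.replace(/<[^>]+>/g, "").trim() };
}

// Stand-in layers: a strict extractor that throws, then a permissive one.
const layers: [string, Extractor][] = [
  ["defuddle", () => { throw new Error("could not parse"); }],
  ["readability", (h) =>
    h.includes("<article>")
      ? "A long enough article body extracted from the page markup here."
      : ""],
];
```

The returned method name is what surfaces as extraction_method in the YAML frontmatter shown earlier.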

Additional pipeline features:

  • SSRF protection — DNS resolution checks before fetch AND after redirect, blocking private IP ranges
  • Block detection — identifies Cloudflare, PerimeterX, Akamai, DataDome, and CAPTCHA pages
  • Domain cooldown — backs off after repeated block signals (5 min, extended to 30 min)
  • robots.txt / agents.txt — compliance checking with configurable enforcement
  • PDF extraction — pdf-parse converts PDFs to page-structured markdown
  • Smart chunking — splits at paragraph/heading boundaries, never inside code blocks or tables
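Boundary-aware chunking might be sketched as follows. This is a simplified illustration, not markfetch's actual splitter (which also respects headings and tables): it splits only at blank lines and treats a fenced code block as atomic.

```typescript
// Avoid a literal triple-backtick sequence inside this example:
const FENCE = "`".repeat(3);

// Sketch: split markdown into blocks at blank lines (never inside a code
// fence), then greedily pack whole blocks into chunks of at most maxChars.
function chunkMarkdown(md: string, maxChars: number): string[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of md.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    if (line.trim() === "" && !inFence) {
      if (current.length) blocks.push(current.join("\n"));
      current = [];
    } else {
      current.push(line);
    }
  }
  if (current.length) blocks.push(current.join("\n"));

  const chunks: string[] = [];
  let buf = "";
  for (const b of blocks) {
    if (buf && buf.length + b.length + 2 > maxChars) {
      chunks.push(buf);
      buf = b;
    } else {
      buf = buf ? buf + "\n\n" + b : b;
    }
  }
  if (buf) chunks.push(buf);
  return chunks;
}
```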

Playwright fallback

Playwright is auto-detected at startup. If installed as a peer dependency, markfetch automatically retries via a headless browser when static extraction yields low content. No configuration required.

To install Playwright:

npm install playwright
npx playwright install chromium

To disable Playwright even when installed:

{
  "env": { "PLAYWRIGHT_DISABLED": "true" }
}

When Playwright is used, the response frontmatter includes rendered: true.


Compliance modes

| Mode | Behavior |
| --- | --- |
| permissive (default) | v2-identical behavior — no enforcement beyond robots.txt opt-in |
| balanced | agents.txt advisory warnings, block detection surfaced, no fetch gating |
| strict | Full enforcement: robots.txt, agents.txt, honest UA, domain cooldown |

To enable strict mode:

{
  "env": { "COMPLIANCE_MODE": "strict" }
}

Migration guide (v2 → v3)

No breaking changes — all new behaviors default OFF.

| What changed | Default | Action required |
| --- | --- | --- |
| web_search works without SEARXNG_URL | DDGS auto-activates via python3 or uvx | None |
| Block detection metadata in frontmatter | Always surfaced on detection | None — new blocked / reason fields |
| Search responses include backend field | Always present | None — new field in output |
| Compliance mode | permissive (v2-identical) | Set COMPLIANCE_MODE=strict for full enforcement |
| Playwright auto-detection | Auto-detects when installed | Use PLAYWRIGHT_DISABLED=true to opt out |

Local development

git clone https://github.com/thoaud/llm-markdown-proxy
cd llm-markdown-proxy
npm install          # builds via prepare script
npm run dev          # live TypeScript reload
npm test             # unit tests (no network)
npm run test:integration  # requires internet

Point a client at your local build:

{
  "mcpServers": {
    "markfetch": {
      "command": "node",
      "args": ["/absolute/path/to/dist/index.js"]
    }
  }
}

Security

  • SSRF protection on every code path — pre-fetch DNS check and post-redirect hostname verification block private IP ranges (127.x, 10.x, 192.168.x, 172.16-31.x, 169.254.x, IPv6 loopback/link-local/ULA)
  • robots.txt compliance with per-domain caching (1-hour TTL)
  • Per-domain rate limiting prevents accidental abuse (default 10 req/min)
  • Block detection with automatic domain cooldown
  • No outbound data — markfetch only reads pages, never sends user data externally
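The IPv4 portion of such a guard can be sketched as below. This is illustrative only: markfetch's actual check runs against resolved DNS results (before fetch and after redirects) and also covers the IPv6 loopback, link-local, and ULA ranges.

```typescript
// Sketch: return true for IPv4 literals in the private/loopback/link-local
// ranges listed above. Non-IPv4 input returns false (a real SSRF guard
// would resolve hostnames first rather than pass them through).
function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((n) => Number.isNaN(n) || n < 0 || n > 255)) {
    return false; // not a well-formed IPv4 literal
  }
  const [a, b] = parts;
  return (
    a === 127 ||                         // loopback (127.x)
    a === 10 ||                          // RFC 1918 (10.x)
    (a === 192 && b === 168) ||          // RFC 1918 (192.168.x)
    (a === 172 && b >= 16 && b <= 31) || // RFC 1918 (172.16-31.x)
    (a === 169 && b === 254)             // link-local (169.254.x)
  );
}
```

Checking after redirects matters because a public URL can 302 to an internal address; checking the resolved IP (not the hostname) defeats DNS-based tricks.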

Troubleshooting

Tools don't appear after adding the config
  1. Verify Node.js 18+: node --version
  2. Test the server starts: npx -y markfetch (should hang waiting for stdin — that's correct)
  3. Check your config file is valid JSON
  4. Restart the client completely (not just reload)

Fetch fails for internal/private URLs

markfetch blocks requests to private IP ranges (127.x, 10.x, 192.168.x, etc.) to prevent SSRF attacks. It is designed for public URLs only.

Response is truncated or shows a heading outline

Pages over 12K characters auto-condense to a heading outline. Pass max_chars (e.g., max_chars: 50000) to get full content, or paginate with start_index using the nextIndex value from the previous response.
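For example, a follow-up call requesting the full content of a long page might pass arguments like these (illustrative values; parameter names as documented under web_fetch):

```json
{
  "url": "https://example.com/long-article",
  "max_chars": 50000,
  "start_index": 12000
}
```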

SearXNG URL blocked by SSRF guard

Because markfetch blocks private IP ranges, SEARXNG_URL=http://localhost:8888 will be blocked. For local SearXNG, use a non-loopback hostname (e.g., http://host.docker.internal:8888 or a LAN address).

Playwright fallback error: "Chromium is not installed"

Run npx playwright install chromium to install the browser binary, then retry. To disable Playwright fallback entirely, set PLAYWRIGHT_DISABLED=true.


License

MIT
