
markfetch


markfetch is a zero-config MCP server that fetches web pages and returns clean, token-efficient Markdown with structured YAML frontmatter — no API keys, no external services required.


Why this exists

When an LLM reads a webpage, it receives everything: cookie banners, nav links, footer noise, related-article widgets. That's hundreds of wasted tokens per page — and the actual content gets buried.

The official mcp-server-fetch is buggy and insecure, and paid services like Firecrawl require API keys. There's no good free option that just works.

markfetch runs a multi-layer extraction pipeline (Defuddle → Mozilla Readability → Turndown → plain text) that strips noise and returns tight, readable Markdown. Large pages auto-condense to a heading outline. PDFs are extracted to structured text. Multiple URLs fetch in parallel. Search works out of the box. All with SSRF protection, robots.txt compliance, and zero configuration.


Quick start

Prerequisites: Node.js 18+

Claude Code

claude mcp add markfetch -- npx -y markfetch

Or add manually to ~/.claude/settings.json:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Restart Claude Code — the tools are available immediately.

Other clients (Claude Desktop, Cursor, Windsurf, Codex)

Claude Desktop

Config file location:

  • Mac: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Cursor

.cursor/mcp.json (project) or ~/.cursor/mcp.json (global):

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

Windsurf

~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"]
    }
  }
}

OpenAI Codex

~/.codex/config.toml (Codex CLI uses TOML, not JSON):

[mcp_servers.markfetch]
command = "npx"
args = ["-y", "markfetch"]

Tools

| Tool | Description |
| --- | --- |
| web_fetch | Fetch a URL → clean Markdown with YAML frontmatter. Auto-condenses large pages. |
| web_fetch_batch | Fetch multiple URLs in parallel with per-domain concurrency limits. |
| web_fetch_links | Extract and classify all links from a page (internal vs external). |
| web_fetch_raw | Return raw HTML/text without markdown conversion. |
| web_search | Search the web — works out of the box, no API keys needed. |
| web_diff | Compare current page content against a prior snapshot as a unified diff. |

web_fetch

Fetches a URL and runs it through the extraction pipeline. Returns Markdown with YAML frontmatter (title, author, word count, extraction method). Large pages (>12K chars) auto-condense to a heading outline unless max_chars is set.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to fetch |
| extract_main | boolean | no | true | Extract main article content (strips nav/ads). Set false for full page. |
| start_index | number | no | 0 | Byte offset for pagination. Use nextIndex from the prior response. |
| max_chars | number | no | auto | Max characters to return (500–200,000). Omit for auto-condensing. |
| headers | object | no | {} | Custom HTTP headers. Bypasses the shared cache. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. Set false for internal/authorized URLs. |

Example response:

---
title: "Page Title"
url: "https://example.com/article"
author: "Author Name"
published_date: "2026-01-15"
site_name: "Example"
schema_type: "Article"
canonical_url: "https://example.com/article"
language: "en"
word_count: 1523
extraction_method: "defuddle"
---

# Page Title

Page content here...

Additional frontmatter fields appear when relevant: page_count for PDFs, rendered: true when Playwright was used, blocked: true when bot detection is encountered.

web_fetch_batch

Fetches multiple URLs in parallel using Promise.allSettled — one failure does not block others. Default concurrency: 5 total, 3 per domain.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | yes | | Array of URLs to fetch (max 20) |
| extract_main | boolean | no | true | Extract main article content for all URLs. |
| headers | object | no | {} | Custom headers applied to every request. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |
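The failure-isolation behavior can be sketched as follows. This is a simplified illustration of the Promise.allSettled pattern the README describes, not the actual markfetch source: fetchOne is a stand-in for the real fetch pipeline, and per-domain concurrency limiting is omitted.

```typescript
// Sketch: fetch many URLs in parallel; one rejection never fails the batch.
async function fetchBatch(
  urls: string[],
  fetchOne: (url: string) => Promise<string>
): Promise<{ url: string; ok: boolean; value?: string; error?: string }[]> {
  // allSettled resolves once every promise has either fulfilled or rejected
  const settled = await Promise.allSettled(urls.map((u) => fetchOne(u)));
  return settled.map((s, i) =>
    s.status === "fulfilled"
      ? { url: urls[i], ok: true, value: s.value }
      : { url: urls[i], ok: false, error: String(s.reason) }
  );
}
```

Each result carries its own ok flag, so a client can render successes immediately and report failures per-URL.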

web_fetch_links

Extracts all links from a page, deduplicating by URL and classifying as internal or external.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to extract links from |
| filter_external | boolean | no | false | Only return external links. |
| filter_internal | boolean | no | false | Only return internal links. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |

web_fetch_raw

Returns the raw HTML or text response without conversion. Useful when markdown conversion loses structure you need.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to fetch |
| max_chars | number | no | 50000 | Max characters to return (100–100,000). |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |

web_search

Searches the web and returns a Markdown table of results. Works out of the box using DDGS (Dux Distributed Global Search) — a metasearch library that aggregates results from Bing, Brave, DuckDuckGo, Google, Mojeek, Yandex, Yahoo, and Wikipedia. Auto-installed into a Python venv on first use via python3 (or uvx fallback). No API keys required. Set SEARXNG_URL for a self-hosted SearXNG backend instead.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | string | yes | | Search query |
| count | number | no | 10 | Number of results (1–20) |
| language | string | no | | ISO 639 language code (e.g., en, de) |
| time_range | string | no | | day, week, month, or year |
| safesearch | number | no | | 0 (off), 1 (moderate), 2 (strict) |
| categories | string | no | | Comma-separated SearXNG categories |

web_diff

Compares current page content against the most recent snapshot and returns a unified diff. First call on a new URL saves a baseline snapshot; subsequent calls show changes. YAML frontmatter is stripped before diffing.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | yes | | URL to compare |
| extract_main | boolean | no | true | Extract main article content. |
| headers | object | no | {} | Custom HTTP headers. |
| respect_robots | boolean | no | true | Check robots.txt before fetching. |

Snapshots are stored at $SNAPSHOT_DIR (default ~/.markfetch/snapshots/) and pruned after SNAPSHOT_MAX_AGE_DAYS days (default 30).
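Age-based pruning of that snapshot directory might look roughly like the sketch below. This is a hypothetical helper, not markfetch's actual code; the real file layout and pruning trigger may differ.

```typescript
import * as fs from "fs";
import * as path from "path";

// Sketch: delete snapshot files whose modification time is older than the
// retention window (SNAPSHOT_MAX_AGE_DAYS in the README). Returns the names
// of the files that were removed.
function pruneSnapshots(dir: string, maxAgeDays: number, now = Date.now()): string[] {
  const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000;
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    const file = path.join(dir, name);
    if (fs.statSync(file).mtimeMs < cutoff) {
      fs.unlinkSync(file);
      removed.push(name);
    }
  }
  return removed;
}
```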


Compatibility

| Client | Status |
| --- | --- |
| Claude Code | ✅ Supported |
| Claude Desktop | ✅ Supported |
| Cursor | ✅ Supported |
| Windsurf | ✅ Supported |
| OpenAI Codex | ✅ Supported |
| Any MCP client | ✅ Supported |

Configuration

Pass environment variables via the env key in your MCP config:

{
  "mcpServers": {
    "markfetch": {
      "command": "npx",
      "args": ["-y", "markfetch"],
      "env": {
        "CACHE_TTL_MS": "600000",
        "FETCH_TIMEOUT_MS": "30000",
        "RATE_LIMIT_PER_DOMAIN": "20"
      }
    }
  }
}

Core (always active)

| Variable | Default | Description |
| --- | --- | --- |
| CACHE_TTL_MS | 300000 (5 min) | In-memory cache TTL for fetched pages |
| FETCH_TIMEOUT_MS | 15000 (15 sec) | Per-request fetch timeout |
| RATE_LIMIT_PER_DOMAIN | 10 (req/min) | Requests per domain per minute. 0 to disable. |
| BATCH_CONCURRENCY | 5 | Max simultaneous URLs in web_fetch_batch |
| BATCH_PER_DOMAIN | 3 | Max simultaneous URLs per domain in batch |
| SNAPSHOT_DIR | ~/.markfetch/snapshots/ | Snapshot storage for web_diff |
| SNAPSHOT_MAX_AGE_DAYS | 30 | Days to retain snapshots |
| SNAPSHOTS_ENABLED | true | Set false to disable auto-snapshotting |

Opt-in

| Variable | Default | Description |
| --- | --- | --- |
| PLAYWRIGHT_DISABLED | false | Set true to disable Playwright fallback for JS-rendered pages |
| PLAYWRIGHT_TIMEOUT_MS | 30000 (30 sec) | Timeout for Playwright page render |
| SEARXNG_URL | (unset) | SearXNG instance URL for web_search. Without it, DDGS is used. |
| SEARXNG_AUTH_TOKEN | (unset) | Bearer token for auth-gated SearXNG instances |
| SEARCH_BACKEND | auto | Force backend: auto, searxng, or ddgs |
| COMPLIANCE_MODE | permissive | strict, balanced, or permissive (see Compliance modes) |
| HONEST_UA | false | Send markfetch/<version> as User-Agent. Auto-enabled in strict mode. |

HTTP transport

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 3030 | HTTP port. Setting this also activates HTTP transport (no --http flag needed). |

HTTP transport (remote mode)

By default, markfetch runs as a stdio MCP server. For remote or multi-client deployments, start in HTTP mode:

node dist/index.js --http
node dist/index.js --http --cors --port 3030

| Flag | Default | Description |
| --- | --- | --- |
| --http | off | Enable Streamable HTTP transport |
| --port <n> | 3030 | HTTP port |
| --host <addr> | 127.0.0.1 | Bind address |
| --cors | off | Enable CORS (Access-Control-Allow-Origin: *) |

A /health endpoint returns { "status": "ok", "transport": "streamable-http" } for liveness checks. The MCP endpoint is POST /mcp.


Dependencies

Runtime

| Package | Purpose |
| --- | --- |
| @modelcontextprotocol/sdk | MCP server framework (stdio + Streamable HTTP transport) |
| defuddle | Primary content extractor — produces markdown with metadata |
| @mozilla/readability | Secondary content extractor (fallback when Defuddle fails) |
| turndown | HTML-to-Markdown converter (fallback after Readability) |
| turndown-plugin-gfm | GFM tables and strikethrough support for Turndown |
| jsdom | DOM implementation for Node.js — used by Readability and link extraction |
| zod | Schema validation for MCP tool input parameters |
| pdf-parse | PDF text extraction to page-structured markdown |
| diff | Unified diff generation for web_diff snapshots |
| express | HTTP server for Streamable HTTP transport mode |
| p-limit | Concurrency limiter for batch fetch and search |
| robots-parser | robots.txt parsing for compliance checks |

Optional

| Package | Purpose |
| --- | --- |
| playwright | Headless browser for JS-rendered pages (optional peer dependency) |
| ddgs (Python) | DDGS metasearch library — auto-installed into a venv on first web_search call |

How it works

markfetch uses a 4-layer extraction cascade — each layer catches errors silently and falls back to the next:

  1. Defuddle — produces markdown natively with rich metadata (title, author, schema.org data)
  2. Mozilla Readability — extracts main article content from the DOM
  3. Turndown — converts full HTML to markdown with GFM table/strikethrough support
  4. Plain text — strips all tags as a last resort

A 50-character minimum threshold determines whether an extraction is successful. If static extraction yields too little content and Playwright is available, the page is re-fetched via a headless browser.
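The cascade and its success threshold can be sketched roughly as below. This is an illustration of the pattern, not the actual markfetch source; the extractor functions in the example are throwaway stand-ins, not real Defuddle or Readability integrations.

```typescript
// Sketch: try extractors in order; an extractor "fails" if it throws or
// returns fewer than MIN_CONTENT_CHARS characters. The last resort strips tags.
type Extractor = (html: string) => string;

const MIN_CONTENT_CHARS = 50; // below this, the result counts as a failure

function runCascade(html: string, layers: [string, Extractor][]): { method: string; text: string } {
  for (const [name, extract] of layers) {
    try {
      const text = extract(html);
      if (text.trim().length >= MIN_CONTENT_CHARS) {
        return { method: name, text };
      }
    } catch {
      // swallow the error and fall through to the next layer
    }
  }
  // last resort: crude tag stripping to plain text
  return { method: "plain-text", text: html.replace(/<[^>]+>/g, "").trim() };
}

// Stand-in layers: a strict extractor that throws, then a permissive one.
const layers: [string, Extractor][] = [
  ["defuddle", () => { throw new Error("could not parse"); }],
  ["readability", (h) =>
    h.includes("<article>")
      ? "A long enough article body extracted from the page markup here."
      : ""],
];
```

The returned method name is what surfaces as extraction_method in the YAML frontmatter shown earlier.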

Additional pipeline features:

  • SSRF protection — DNS resolution checks before fetch AND after redirect, blocking private IP ranges
  • Block detection — identifies Cloudflare, PerimeterX, Akamai, DataDome, and CAPTCHA pages
  • Domain cooldown — backs off after repeated block signals (5 min, extended to 30 min)
  • robots.txt / agents.txt — compliance checking with configurable enforcement
  • PDF extraction — pdf-parse converts PDFs to page-structured markdown
  • Smart chunking — splits at paragraph/heading boundaries, never inside code blocks or tables
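Boundary-aware chunking might be sketched as follows. This is a simplified illustration, not markfetch's actual splitter (which also respects headings and tables): it splits only at blank lines and treats a fenced code block as atomic.

```typescript
// Avoid a literal triple-backtick sequence inside this example:
const FENCE = "`".repeat(3);

// Sketch: split markdown into blocks at blank lines (never inside a code
// fence), then greedily pack whole blocks into chunks of at most maxChars.
function chunkMarkdown(md: string, maxChars: number): string[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;
  for (const line of md.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    if (line.trim() === "" && !inFence) {
      if (current.length) blocks.push(current.join("\n"));
      current = [];
    } else {
      current.push(line);
    }
  }
  if (current.length) blocks.push(current.join("\n"));

  const chunks: string[] = [];
  let buf = "";
  for (const b of blocks) {
    if (buf && buf.length + b.length + 2 > maxChars) {
      chunks.push(buf);
      buf = b;
    } else {
      buf = buf ? buf + "\n\n" + b : b;
    }
  }
  if (buf) chunks.push(buf);
  return chunks;
}
```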

Playwright fallback

Playwright is auto-detected at startup. If installed as a peer dependency, markfetch automatically retries via a headless browser when static extraction yields low content. No configuration required.

To install Playwright:

npm install playwright
npx playwright install chromium

To disable Playwright even when installed:

{
  "env": { "PLAYWRIGHT_DISABLED": "true" }
}

When Playwright is used, the response frontmatter includes rendered: true.


Compliance modes

| Mode | Behavior |
| --- | --- |
| permissive (default) | v2-identical behavior — no enforcement beyond robots.txt opt-in |
| balanced | agents.txt advisory warnings, block detection surfaced, no fetch gating |
| strict | Full enforcement: robots.txt, agents.txt, honest UA, domain cooldown |

To enable strict mode:

{
  "env": { "COMPLIANCE_MODE": "strict" }
}

Migration guide (v2 → v3)

No breaking changes — all new behaviors default OFF.

| What changed | Default | Action required |
| --- | --- | --- |
| web_search works without SEARXNG_URL | DDGS auto-activates via python3 or uvx | None |
| Block detection metadata in frontmatter | Always surfaced on detection | None — new blocked / reason fields |
| Search responses include backend field | Always present | None — new field in output |
| Compliance mode | permissive (v2-identical) | Set COMPLIANCE_MODE=strict for full enforcement |
| Playwright auto-detection | Auto-detects when installed | Use PLAYWRIGHT_DISABLED=true to opt out |

Local development

git clone https://github.com/thoaud/llm-markdown-proxy
cd llm-markdown-proxy
npm install          # builds via prepare script
npm run dev          # live TypeScript reload
npm test             # unit tests (no network)
npm run test:integration  # requires internet

Point a client at your local build:

{
  "mcpServers": {
    "markfetch": {
      "command": "node",
      "args": ["/absolute/path/to/dist/index.js"]
    }
  }
}

Security

  • SSRF protection on every code path — pre-fetch DNS check and post-redirect hostname verification block private IP ranges (127.x, 10.x, 192.168.x, 172.16-31.x, 169.254.x, IPv6 loopback/link-local/ULA)
  • robots.txt compliance with per-domain caching (1-hour TTL)
  • Per-domain rate limiting prevents accidental abuse (default 10 req/min)
  • Block detection with automatic domain cooldown
  • No outbound data — markfetch only reads pages, never sends user data externally
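The IPv4 portion of such a guard can be sketched as below. This is illustrative only: markfetch's actual check runs against resolved DNS results (before fetch and after redirects) and also covers the IPv6 loopback, link-local, and ULA ranges.

```typescript
// Sketch: return true for IPv4 literals in the private/loopback/link-local
// ranges listed above. Non-IPv4 input returns false (a real SSRF guard
// would resolve hostnames first rather than pass them through).
function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((n) => Number.isNaN(n) || n < 0 || n > 255)) {
    return false; // not a well-formed IPv4 literal
  }
  const [a, b] = parts;
  return (
    a === 127 ||                         // loopback (127.x)
    a === 10 ||                          // RFC 1918 (10.x)
    (a === 192 && b === 168) ||          // RFC 1918 (192.168.x)
    (a === 172 && b >= 16 && b <= 31) || // RFC 1918 (172.16-31.x)
    (a === 169 && b === 254)             // link-local (169.254.x)
  );
}
```

Checking after redirects matters because a public URL can 302 to an internal address; checking the resolved IP (not the hostname) defeats DNS-based tricks.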

Troubleshooting

Tools don't appear after adding the config
  1. Verify Node.js 18+: node --version
  2. Test the server starts: npx -y markfetch (should hang waiting for stdin — that's correct)
  3. Check your config file is valid JSON
  4. Restart the client completely (not just reload)

Fetch fails for internal/private URLs

markfetch blocks requests to private IP ranges (127.x, 10.x, 192.168.x, etc.) to prevent SSRF attacks. It is designed for public URLs only.

Response is truncated or shows a heading outline

Pages over 12K characters auto-condense to a heading outline. Pass max_chars (e.g., max_chars: 50000) to get full content, or paginate with start_index using the nextIndex value from the previous response.
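For example, a follow-up call requesting the full content of a long page might pass arguments like these (illustrative values; parameter names as documented under web_fetch):

```json
{
  "url": "https://example.com/long-article",
  "max_chars": 50000,
  "start_index": 12000
}
```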

SearXNG URL blocked by SSRF guard

Because markfetch blocks private IP ranges, SEARXNG_URL=http://localhost:8888 will be blocked. For local SearXNG, use a non-loopback hostname (e.g., http://host.docker.internal:8888 or a LAN address).

Playwright fallback error: "Chromium is not installed"

Run npx playwright install chromium to install the browser binary, then retry. To disable Playwright fallback entirely, set PLAYWRIGHT_DISABLED=true.


License

MIT
