Structured web extraction for AI agents. Config over code.
Wraith sits between your AI agent and the web. Instead of dumping raw HTML into an LLM and burning tokens, Wraith extracts exactly what your agent needs and returns clean, structured JSON.
The problem: AI agents using headless browsers today open a page, download everything including ads and trackers, dump 15,000 tokens of raw HTML to the LLM, and hope it finds the three paragraphs that matter.
Wraith's approach: Define what you need in a TOML config. Wraith blocks the junk, extracts the signal, and returns 200 tokens of structured data your agent can reason over immediately.
| Page | Raw HTML | Wraith Output | Savings |
|---|---|---|---|
| books.toscrape.com | 12,818 tokens | 119 tokens | 99.1% |
| quotes.toscrape.com | 2,755 tokens | 38 tokens | 98.6% |
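The savings column follows directly from the token counts. As a sanity check, the arithmetic can be reproduced in a few lines (the function name here is illustrative, not part of Wraith's API):

```python
def savings_pct(raw_tokens: int, output_tokens: int) -> float:
    """Percentage of tokens avoided by sending extracted JSON
    instead of raw HTML to the LLM."""
    return round(100 * (raw_tokens - output_tokens) / raw_tokens, 1)

print(savings_pct(12818, 119))  # books.toscrape.com row -> 99.1
print(savings_pct(2755, 38))    # quotes.toscrape.com row -> 98.6
```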
```
pip install wraith-agent
playwright install chromium
```

```
python -m wraith.extract.pipeline \
  --url "https://example.com/article" \
  --config extractors/article.toml \
  --summary
```

Returns:
```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "author": "Jane Smith",
  "date": "2026-01-15",
  "body": "Clean article text...",
  "meta_description": "...",
  "jsonld_publisher": "Example News",
  "_token_stats": {
    "raw_html_tokens": 12818,
    "wraith_output_tokens": 119,
    "tokens_saved": 12699,
    "savings_pct": 99.1
  }
}
```

Register Wraith as an MCP server and Claude can browse the web through it:
```
claude mcp add --transport stdio wraith-agent -- \
  python -m wraith.mcp
```

Then ask Claude things like:
- "Use wraith to extract the article from https://example.com"
- "What extractors does wraith have available?"
- "Extract all links from https://books.toscrape.com"
Claude calls Wraith's tools autonomously and gets structured data back — not raw HTML.
5 MCP tools:

| Tool | What it does |
|---|---|
| `wraith_extract` | Extract structured data using a named config (article, product, search, docs, job) |
| `wraith_read` | Read a page as clean markdown, with nav, ads, and boilerplate stripped |
| `wraith_links` | Extract all links from a page |
| `wraith_document` | Extract structured data from documents (PDF, Word, Excel, CSV, PowerPoint) |
| `wraith_configs` | List available extractor configurations |
```
wraith-server
```

Your agent connects to http://localhost:8420 and calls:

| Endpoint | What it does |
|---|---|
| `POST /extract` | Extract structured data with a named config |
| `POST /navigate` | Get markdown + metadata from any page |
| `POST /links` | Extract all links from a page |
| `POST /extract/batch` | Extract from multiple URLs concurrently |
| `GET /extractors` | List available extractor configs |
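A minimal client sketch, using only the Python standard library. It assumes the request body takes `url` and `config` fields; that shape is an illustration, not the documented schema:

```python
import json
from urllib import request

WRAITH_URL = "http://localhost:8420"  # default port from `wraith server`

def build_extract_request(url: str, config: str) -> request.Request:
    """Build a POST /extract request. The body field names
    ("url", "config") are assumptions, not a documented contract."""
    body = json.dumps({"url": url, "config": config}).encode()
    return request.Request(
        f"{WRAITH_URL}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_extract_request("https://example.com/article", "article")
# Once the server is running, send it with request.urlopen(req);
# here we only inspect what would be sent.
print(req.full_url)                     # http://localhost:8420/extract
print(json.loads(req.data)["config"])   # article
```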
```
python -m wraith.workflow.runner \
  --workflow workflows/research_loop.toml \
  --params '{"query": "AI agent infrastructure 2026"}'
```

```
# Extract with token stats
wraith extract --url "https://example.com" --config extractors/article.toml --summary

# List configs
wraith list extractors
wraith list workflows

# Run a workflow
wraith workflow --file workflows/research_loop.toml --params '{"query": "AI startups"}'

# Extract from a document (PDF, Word, Excel, CSV, PowerPoint)
wraith document --path report.pdf --summary

# Start the server
wraith server --port 8420
```

Extractors are TOML files, not code. To extract from a new type of page, create a config:
```toml
[meta]
name = "my_site"
description = "Extracts product data from my-shop.com"
engine = "playwright"
wait_for = ".price"

[fields.name]
selector = "h1.product-title"
type = "text"
required = true

[fields.price]
selector = ".price"
type = "number"

[fields.description]
selector = ".product-details"
type = "text"
max_length = 2000
```

Save it as `extractors/my_site.toml` and use it immediately:

```
wraith extract --url "https://my-shop.com/widget" --config extractors/my_site.toml
```

| Type | Returns | Use for |
|---|---|---|
| `text` | String | Titles, body text, descriptions |
| `number` | Float | Prices, ratings, counts |
| `link` | URL string | Single link (e.g. apply button) |
| `links` | List of URLs | Multiple links |
| `list` | List of strings | Repeated elements (results, items) |
| `date` | Date string | Publication dates, timestamps |
| `html` | Raw HTML | When you need the markup |
| `attribute` | Attribute value | Image src, data attributes |
Wraith also extracts structured data from documents — same principle as web extraction. Send a file, get clean JSON back.
```
wraith document --path quarterly_report.pdf --summary
```

Supported formats:

| Format | What you get |
|---|---|
| PDF (`.pdf`) | Text per page, tables, metadata (title, author, dates) |
| Word (`.docx`) | Sections grouped by heading, tables, properties |
| Excel (`.xlsx`) | Each sheet as a table with headers and rows |
| CSV (`.csv`) | Rows as dicts keyed by column headers |
| PowerPoint (`.pptx`) | Text per slide, speaker notes |
Works from the CLI, the MCP server (`wraith_document` tool), or programmatically:

```python
from wraith.extract.documents import extract_document

result = await extract_document("report.pdf")
print(result.content)      # Full text
print(result.tables)       # Structured tables
print(result.token_stats)  # Token savings vs raw file
```

Also accepts URLs; Wraith downloads the file first, then extracts.
```
Your Agent (Claude Code, custom agent, etc.)
    |
    v
Wraith MCP Server (stdio or FastAPI)
    |
    v
Extraction Pipeline
    |-- CSS extractor (your TOML fields)
    |-- Metadata extractor (OG, meta, title)
    |-- JSON-LD extractor (schema.org)
    |-- Markdown converter (clean text)
    |-- Token counter (raw vs extracted)
    |
    v
Browser Engine (swappable)
    |-- Playwright + Chromium (default, JS-capable)
    |-- httpx lightweight (10x faster for static pages)
    |
    v
Request Filter (blocks images, trackers, ads)
    |
    v
The Web
```
The browser engine is an abstraction. Today it's Playwright driving Chromium. When Lightpanda or another engine matures, swap it in one line of config. Your extractors and workflows don't change.
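Under that abstraction, swapping engines would plausibly be a one-line edit in an extractor's `[meta]` block. The `httpx` engine name below is taken from the diagram above and is an assumption, not a documented config value:

```toml
[meta]
engine = "httpx"  # was "playwright"; skips the browser for static pages
```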
| Config | Use for |
|---|---|
| `article.toml` | News articles, blog posts |
| `product.toml` | Ecommerce product pages |
| `search.toml` | Search results pages |
| `docs.toml` | Documentation and reference sites |
| `job.toml` | Job listings |
Workflows chain multiple pages into a single operation, defined as a DAG in TOML:
```toml
[meta]
name = "research_loop"

[steps.search]
action = "navigate"
url_template = "https://www.google.com/search?q={query}"
extractor = "search"

[steps.read_articles]
action = "for_each"
depends_on = "search"
source_field = "search.result_links"
max_items = 5
parallel = true

[steps.read_articles.sub_step]
action = "navigate"
url_from = "item"
extractor = "article"
```

Steps execute in dependency order. Independent steps run in parallel. Supports `navigate`, `click`, `fill`, `wait`, `for_each`, and `collect` actions.
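The dependency-ordered scheduling described above can be sketched with the standard library's `graphlib`. The step table mirrors the workflow example; the executor itself is illustrative, not Wraith's actual runner:

```python
from graphlib import TopologicalSorter

# Hypothetical step table mirroring the workflow above:
# each step maps to the set of steps it depends on.
steps = {
    "search": set(),
    "read_articles": {"search"},
}

def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Resolve execution order. Everything returned by one
    get_ready() batch has no pending dependencies, so those
    steps could be dispatched in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    order = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # all runnable concurrently
        order.extend(ready)
        ts.done(*ready)
    return order

print(run_order(steps))  # ['search', 'read_articles']
```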
```
git clone https://github.com/farazfookeer/wraith-agent.git
cd wraith-agent
pip install -e ".[dev]"
playwright install chromium
pytest tests/ -v  # 134 tests
```

MIT
Built by Faraz Fookeer with Claude Opus 4.6 via Claude Code.