# Wraith

*Structured web extraction for AI agents. Config over code.*

Wraith sits between your AI agent and the web. Instead of dumping raw HTML into an LLM and burning tokens, Wraith extracts exactly what your agent needs and returns clean, structured JSON.

**The problem:** AI agents using headless browsers today open a page, download everything including ads and trackers, dump 15,000 tokens of raw HTML to the LLM, and hope it finds the three paragraphs that matter.

**Wraith's approach:** Define what you need in a TOML config. Wraith blocks the junk, extracts the signal, and returns 200 tokens of structured data your agent can reason over immediately.

## Token savings (real numbers)

| Page | Raw HTML | Wraith output | Savings |
|---|---|---|---|
| books.toscrape.com | 12,818 tokens | 119 tokens | 99.1% |
| quotes.toscrape.com | 2,755 tokens | 38 tokens | 98.6% |
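The savings column follows directly from the token counts; a quick check of the arithmetic:

```python
# Per-page (raw HTML tokens, Wraith output tokens), from the table above.
pages = {
    "books.toscrape.com": (12_818, 119),
    "quotes.toscrape.com": (2_755, 38),
}

def savings_pct(raw: int, extracted: int) -> float:
    """Percentage of tokens avoided by extracting instead of sending raw HTML."""
    return round(100 * (1 - extracted / raw), 1)

for page, (raw, out) in pages.items():
    print(page, savings_pct(raw, out))  # 99.1 and 98.6
```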

## Quick Start

```bash
pip install wraith-agent
playwright install chromium
```

### Extract from a single page

```bash
python -m wraith.extract.pipeline \
  --url "https://example.com/article" \
  --config extractors/article.toml \
  --summary
```

Returns:

```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "author": "Jane Smith",
  "date": "2026-01-15",
  "body": "Clean article text...",
  "meta_description": "...",
  "jsonld_publisher": "Example News",
  "_token_stats": {
    "raw_html_tokens": 12818,
    "wraith_output_tokens": 119,
    "tokens_saved": 12699,
    "savings_pct": 99.1
  }
}
```

### Use with Claude Code (MCP)

Register Wraith as an MCP server and Claude can browse the web through it:

```bash
claude mcp add --transport stdio wraith-agent -- \
  python -m wraith.mcp
```

Claude then calls Wraith's tools autonomously as it browses, getting structured data back rather than raw HTML.

Wraith exposes five MCP tools:

| Tool | What it does |
|---|---|
| `wraith_extract` | Extract structured data using a named config (article, product, search, docs, job) |
| `wraith_read` | Read a page as clean markdown, with no nav, ads, or boilerplate |
| `wraith_links` | Extract all links from a page |
| `wraith_document` | Extract structured data from documents (PDF, Word, Excel, CSV, PowerPoint) |
| `wraith_configs` | List available extractor configurations |

### Run the HTTP server

```bash
wraith-server
```

Your agent connects to `http://localhost:8420` and calls:

- `POST /extract`: extract structured data with a named config
- `POST /navigate`: get markdown + metadata from any page
- `POST /links`: extract all links from a page
- `POST /extract/batch`: extract from multiple URLs concurrently
- `GET /extractors`: list available extractor configs

### Run a workflow

```bash
python -m wraith.workflow.runner \
  --workflow workflows/research_loop.toml \
  --params '{"query": "AI agent infrastructure 2026"}'
```

## CLI

```bash
# Extract with token stats
wraith extract --url "https://example.com" --config extractors/article.toml --summary

# List configs
wraith list extractors
wraith list workflows

# Run a workflow
wraith workflow --file workflows/research_loop.toml --params '{"query": "AI startups"}'

# Extract from a document (PDF, Word, Excel, CSV, PowerPoint)
wraith document --path report.pdf --summary

# Start the server
wraith server --port 8420
```

## Writing Extractors

Extractors are TOML files, not code. To extract from a new type of page, create a config:

```toml
[meta]
name = "my_site"
description = "Extracts product data from my-shop.com"
engine = "playwright"
wait_for = ".price"

[fields.name]
selector = "h1.product-title"
type = "text"
required = true

[fields.price]
selector = ".price"
type = "number"

[fields.description]
selector = ".product-details"
type = "text"
max_length = 2000
```

Save it as `extractors/my_site.toml` and use it immediately:

```bash
wraith extract --url "https://my-shop.com/widget" --config extractors/my_site.toml
```

### Field types

| Type | Returns | Use for |
|---|---|---|
| `text` | String | Titles, body text, descriptions |
| `number` | Float | Prices, ratings, counts |
| `link` | URL string | Single link (e.g. apply button) |
| `links` | List of URLs | Multiple links |
| `list` | List of strings | Repeated elements (results, items) |
| `date` | Date string | Publication dates, timestamps |
| `html` | Raw HTML | When you need the markup |
| `attribute` | Attribute value | Image `src`, data attributes |
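To illustrate how a declared type shapes the returned value, here is a minimal coercion sketch. This is illustrative only, not Wraith's actual implementation:

```python
def coerce(raw: str, field_type: str):
    """Map an extracted string to a declared field type (illustrative sketch)."""
    if field_type == "number":
        # Strip currency symbols and other non-numeric characters before parsing.
        cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
        return float(cleaned)
    if field_type == "text":
        # Collapse runs of whitespace left over from HTML layout.
        return " ".join(raw.split())
    return raw  # other types pass through in this sketch

print(coerce("£51.77", "number"))          # 51.77
print(coerce("  Clean   text\n", "text"))  # Clean text
```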

## Document Extraction

Wraith also extracts structured data from documents — same principle as web extraction. Send a file, get clean JSON back.

```bash
wraith document --path quarterly_report.pdf --summary
```

Supported formats:

| Format | What you get |
|---|---|
| PDF (.pdf) | Text per page, tables, metadata (title, author, dates) |
| Word (.docx) | Sections grouped by heading, tables, properties |
| Excel (.xlsx) | Each sheet as a table with headers and rows |
| CSV (.csv) | Rows as dicts keyed by column headers |
| PowerPoint (.pptx) | Text per slide, speaker notes |
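The CSV behaviour ("rows as dicts keyed by column headers") matches the shape the standard library's `csv.DictReader` produces; a quick sketch of that shape (not Wraith's internals):

```python
import csv
import io

raw = "name,price\nwidget,9.99\ngadget,24.50\n"

# Each row becomes a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'name': 'widget', 'price': '9.99'}
```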

Works from the CLI, the MCP server (`wraith_document` tool), or programmatically:

```python
from wraith.extract.documents import extract_document

result = await extract_document("report.pdf")
print(result.content)       # Full text
print(result.tables)        # Structured tables
print(result.token_stats)   # Token savings vs raw file
```

`extract_document` also accepts URLs; Wraith downloads the file first, then extracts.

## Architecture

```
Your Agent (Claude Code, custom agent, etc.)
    |
    v
Wraith MCP Server (stdio or FastAPI)
    |
    v
Extraction Pipeline
    |-- CSS extractor (your TOML fields)
    |-- Metadata extractor (OG, meta, title)
    |-- JSON-LD extractor (schema.org)
    |-- Markdown converter (clean text)
    |-- Token counter (raw vs extracted)
    |
    v
Browser Engine (swappable)
    |-- Playwright + Chromium (default, JS-capable)
    |-- httpx lightweight (10x faster for static pages)
    |
    v
Request Filter (blocks images, trackers, ads)
    |
    v
The Web
```

The browser engine is an abstraction. Today it's Playwright driving Chromium. When Lightpanda or another engine matures, swap it in one line of config. Your extractors and workflows don't change.
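One way to picture that abstraction is a small interface every engine implements, so the pipeline above the engine never changes. This is a hypothetical sketch of the idea, not Wraith's actual class or method names:

```python
import asyncio
from typing import Protocol

class BrowserEngine(Protocol):
    """Hypothetical engine interface: fetch a URL, return rendered HTML."""
    async def fetch(self, url: str) -> str: ...

class StaticEngine:
    """Stand-in for the lightweight httpx path: fetch without running JavaScript."""
    async def fetch(self, url: str) -> str:
        return f"<html><title>{url}</title></html>"  # canned response for the sketch

async def run_pipeline(engine: BrowserEngine, url: str) -> str:
    # The extraction pipeline depends only on the interface, so engines swap freely.
    return await engine.fetch(url)

html = asyncio.run(run_pipeline(StaticEngine(), "https://example.com"))
print(html)
```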

## Included Extractors

| Config | Use for |
|---|---|
| `article.toml` | News articles, blog posts |
| `product.toml` | Ecommerce product pages |
| `search.toml` | Search results pages |
| `docs.toml` | Documentation and reference sites |
| `job.toml` | Job listings |

## Workflows

Workflows chain multiple pages into a single operation, defined as a DAG in TOML:

```toml
[meta]
name = "research_loop"

[steps.search]
action = "navigate"
url_template = "https://www.google.com/search?q={query}"
extractor = "search"

[steps.read_articles]
action = "for_each"
depends_on = "search"
source_field = "search.result_links"
max_items = 5
parallel = true

[steps.read_articles.sub_step]
action = "navigate"
url_from = "item"
extractor = "article"
```
Steps execute in dependency order, and independent steps run in parallel. Supported actions: `navigate`, `click`, `fill`, `wait`, `for_each`, and `collect`.
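"Dependency order" here means a topological sort of the step graph. A minimal sketch of that scheduling idea using the standard library (the step names beyond the TOML above are hypothetical, and this is not Wraith's runner):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Step -> set of steps it depends on, mirroring depends_on in the TOML.
steps = {
    "search": set(),
    "read_articles": {"search"},
    "summarise": {"read_articles"},  # hypothetical extra step
}

ts = TopologicalSorter(steps)
ts.prepare()
order = []
while ts.is_active():
    ready = list(ts.get_ready())  # steps with all dependencies satisfied...
    order.append(ready)           # ...surface together and could run in parallel
    ts.done(*ready)
print(order)  # [['search'], ['read_articles'], ['summarise']]
```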

## Development

```bash
git clone https://github.com/farazfookeer/wraith-agent.git
cd wraith-agent
pip install -e ".[dev]"
playwright install chromium
pytest tests/ -v  # 134 tests
```

## Licence

MIT


Built by Faraz Fookeer with Claude Opus 4.6 via Claude Code.

## About

AI agent orchestration layer for web browsing. Structured extraction, workflow DAGs, and MCP server on top of headless browsers.
