# Wraith

*Structured web extraction for AI agents. Config over code.*

Wraith sits between your AI agent and the web. Instead of dumping raw HTML into an LLM and burning tokens, Wraith extracts exactly what your agent needs and returns clean, structured JSON.

**The problem:** AI agents using headless browsers today open a page, download everything including ads and trackers, dump 15,000 tokens of raw HTML to the LLM, and hope it finds the three paragraphs that matter.

**Wraith's approach:** Define what you need in a TOML config. Wraith blocks the junk, extracts the signal, and returns 200 tokens of structured data your agent can reason over immediately.

## Token savings (real numbers)

| Page | Raw HTML | Wraith output | Savings |
|---|---|---|---|
| books.toscrape.com | 12,818 tokens | 119 tokens | 99.1% |
| quotes.toscrape.com | 2,755 tokens | 38 tokens | 98.6% |
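The savings column follows directly from the token counts; a quick check of the arithmetic:

```python
# Per-page (raw HTML tokens, Wraith output tokens), from the table above.
pages = {
    "books.toscrape.com": (12_818, 119),
    "quotes.toscrape.com": (2_755, 38),
}

def savings_pct(raw: int, extracted: int) -> float:
    """Percentage of tokens avoided by extracting instead of sending raw HTML."""
    return round(100 * (1 - extracted / raw), 1)

for page, (raw, out) in pages.items():
    print(page, savings_pct(raw, out))  # 99.1 and 98.6
```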

## Quick Start

```bash
pip install wraith-agent
playwright install chromium
```

### Extract from a single page

```bash
python -m wraith.extract.pipeline \
  --url "https://example.com/article" \
  --config extractors/article.toml \
  --summary
```

Returns:

```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "author": "Jane Smith",
  "date": "2026-01-15",
  "body": "Clean article text...",
  "meta_description": "...",
  "jsonld_publisher": "Example News",
  "_token_stats": {
    "raw_html_tokens": 12818,
    "wraith_output_tokens": 119,
    "tokens_saved": 12699,
    "savings_pct": 99.1
  }
}
```

### Use with Claude Code (MCP)

Register Wraith as an MCP server and Claude can browse the web through it:

```bash
claude mcp add --transport stdio wraith-agent -- \
  python -m wraith.mcp
```

Claude then calls Wraith's tools autonomously as it browses, getting structured data back rather than raw HTML.

Wraith exposes five MCP tools:

| Tool | What it does |
|---|---|
| `wraith_extract` | Extract structured data using a named config (article, product, search, docs, job) |
| `wraith_read` | Read a page as clean markdown, with no nav, ads, or boilerplate |
| `wraith_links` | Extract all links from a page |
| `wraith_document` | Extract structured data from documents (PDF, Word, Excel, CSV, PowerPoint) |
| `wraith_configs` | List available extractor configurations |

### Run the HTTP server

```bash
wraith-server
```

Your agent connects to `http://localhost:8420` and calls:

- `POST /extract`: extract structured data with a named config
- `POST /navigate`: get markdown + metadata from any page
- `POST /links`: extract all links from a page
- `POST /extract/batch`: extract from multiple URLs concurrently
- `GET /extractors`: list available extractor configs

### Run a workflow

```bash
python -m wraith.workflow.runner \
  --workflow workflows/research_loop.toml \
  --params '{"query": "AI agent infrastructure 2026"}'
```

## CLI

```bash
# Extract with token stats
wraith extract --url "https://example.com" --config extractors/article.toml --summary

# List configs
wraith list extractors
wraith list workflows

# Run a workflow
wraith workflow --file workflows/research_loop.toml --params '{"query": "AI startups"}'

# Extract from a document (PDF, Word, Excel, CSV, PowerPoint)
wraith document --path report.pdf --summary

# Start the server
wraith server --port 8420
```

## Writing Extractors

Extractors are TOML files, not code. To extract from a new type of page, create a config:

```toml
[meta]
name = "my_site"
description = "Extracts product data from my-shop.com"
engine = "playwright"
wait_for = ".price"

[fields.name]
selector = "h1.product-title"
type = "text"
required = true

[fields.price]
selector = ".price"
type = "number"

[fields.description]
selector = ".product-details"
type = "text"
max_length = 2000
```

Save it as `extractors/my_site.toml` and use it immediately:

```bash
wraith extract --url "https://my-shop.com/widget" --config extractors/my_site.toml
```

### Field types

| Type | Returns | Use for |
|---|---|---|
| `text` | String | Titles, body text, descriptions |
| `number` | Float | Prices, ratings, counts |
| `link` | URL string | Single link (e.g. apply button) |
| `links` | List of URLs | Multiple links |
| `list` | List of strings | Repeated elements (results, items) |
| `date` | Date string | Publication dates, timestamps |
| `html` | Raw HTML | When you need the markup |
| `attribute` | Attribute value | Image `src`, data attributes |
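To illustrate how a declared type shapes the returned value, here is a minimal coercion sketch. This is illustrative only, not Wraith's actual implementation:

```python
def coerce(raw: str, field_type: str):
    """Map an extracted string to a declared field type (illustrative sketch)."""
    if field_type == "number":
        # Strip currency symbols and other non-numeric characters before parsing.
        cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
        return float(cleaned)
    if field_type == "text":
        # Collapse runs of whitespace left over from HTML layout.
        return " ".join(raw.split())
    return raw  # other types pass through in this sketch

print(coerce("£51.77", "number"))          # 51.77
print(coerce("  Clean   text\n", "text"))  # Clean text
```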

## Document Extraction

Wraith also extracts structured data from documents — same principle as web extraction. Send a file, get clean JSON back.

```bash
wraith document --path quarterly_report.pdf --summary
```

Supported formats:

| Format | What you get |
|---|---|
| PDF (.pdf) | Text per page, tables, metadata (title, author, dates) |
| Word (.docx) | Sections grouped by heading, tables, properties |
| Excel (.xlsx) | Each sheet as a table with headers and rows |
| CSV (.csv) | Rows as dicts keyed by column headers |
| PowerPoint (.pptx) | Text per slide, speaker notes |
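The CSV behaviour ("rows as dicts keyed by column headers") matches the shape the standard library's `csv.DictReader` produces; a quick sketch of that shape (not Wraith's internals):

```python
import csv
import io

raw = "name,price\nwidget,9.99\ngadget,24.50\n"

# Each row becomes a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'name': 'widget', 'price': '9.99'}
```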

Works from the CLI, the MCP server (`wraith_document` tool), or programmatically:

```python
from wraith.extract.documents import extract_document

result = await extract_document("report.pdf")
print(result.content)       # Full text
print(result.tables)        # Structured tables
print(result.token_stats)   # Token savings vs raw file
```

`extract_document` also accepts URLs; Wraith downloads the file first, then extracts.

## Architecture

```
Your Agent (Claude Code, custom agent, etc.)
    |
    v
Wraith MCP Server (stdio or FastAPI)
    |
    v
Extraction Pipeline
    |-- CSS extractor (your TOML fields)
    |-- Metadata extractor (OG, meta, title)
    |-- JSON-LD extractor (schema.org)
    |-- Markdown converter (clean text)
    |-- Token counter (raw vs extracted)
    |
    v
Browser Engine (swappable)
    |-- Playwright + Chromium (default, JS-capable)
    |-- httpx lightweight (10x faster for static pages)
    |
    v
Request Filter (blocks images, trackers, ads)
    |
    v
The Web
```

The browser engine is an abstraction. Today it's Playwright driving Chromium. When Lightpanda or another engine matures, swap it in one line of config. Your extractors and workflows don't change.
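One way to picture that abstraction is a small interface every engine implements, so the pipeline above the engine never changes. This is a hypothetical sketch of the idea, not Wraith's actual class or method names:

```python
import asyncio
from typing import Protocol

class BrowserEngine(Protocol):
    """Hypothetical engine interface: fetch a URL, return rendered HTML."""
    async def fetch(self, url: str) -> str: ...

class StaticEngine:
    """Stand-in for the lightweight httpx path: fetch without running JavaScript."""
    async def fetch(self, url: str) -> str:
        return f"<html><title>{url}</title></html>"  # canned response for the sketch

async def run_pipeline(engine: BrowserEngine, url: str) -> str:
    # The extraction pipeline depends only on the interface, so engines swap freely.
    return await engine.fetch(url)

html = asyncio.run(run_pipeline(StaticEngine(), "https://example.com"))
print(html)
```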

## Included Extractors

| Config | Use for |
|---|---|
| `article.toml` | News articles, blog posts |
| `product.toml` | Ecommerce product pages |
| `search.toml` | Search results pages |
| `docs.toml` | Documentation and reference sites |
| `job.toml` | Job listings |

## Workflows

Workflows chain multiple pages into a single operation, defined as a DAG in TOML:

```toml
[meta]
name = "research_loop"

[steps.search]
action = "navigate"
url_template = "https://www.google.com/search?q={query}"
extractor = "search"

[steps.read_articles]
action = "for_each"
depends_on = "search"
source_field = "search.result_links"
max_items = 5
parallel = true

[steps.read_articles.sub_step]
action = "navigate"
url_from = "item"
extractor = "article"
```
Steps execute in dependency order, and independent steps run in parallel. Supported actions: `navigate`, `click`, `fill`, `wait`, `for_each`, and `collect`.
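"Dependency order" here means a topological sort of the step graph. A minimal sketch of that scheduling idea using the standard library (the step names beyond the TOML above are hypothetical, and this is not Wraith's runner):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Step -> set of steps it depends on, mirroring depends_on in the TOML.
steps = {
    "search": set(),
    "read_articles": {"search"},
    "summarise": {"read_articles"},  # hypothetical extra step
}

ts = TopologicalSorter(steps)
ts.prepare()
order = []
while ts.is_active():
    ready = list(ts.get_ready())  # steps with all dependencies satisfied...
    order.append(ready)           # ...surface together and could run in parallel
    ts.done(*ready)
print(order)  # [['search'], ['read_articles'], ['summarise']]
```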

## Development

```bash
git clone https://github.com/farazfookeer/wraith-agent.git
cd wraith-agent
pip install -e ".[dev]"
playwright install chromium
pytest tests/ -v  # 134 tests
```

## Licence

MIT


Built by Faraz Fookeer with Claude Opus 4.6 via Claude Code.

## About

AI agent orchestration layer for web browsing. Structured extraction, workflow DAGs, and MCP server on top of headless browsers.
