Curated AI search for agents. Bring your own sources, your own LLM. The self-hosted, agent-ready alternative to Tavily and Perplexity for trusted-source research.
vouch is a Python library that lets you register a list of trusted sites with metadata (category, description, tags), and then runs intelligent searches across them using any LLM — including local models via Ollama. The AI dynamically figures out how to search each site (no manual selectors), caches what it learns, and returns structured results ready to feed an agent.
from vouch import SearchEngine, Site
engine = SearchEngine(llm="ollama/qwen2.5:14b")
engine.add(Site("cvm.gov.br", category="regulação financeira BR"))
engine.add(Site("arxiv.org", category="papers acadêmicos", tags=["ml", "ai"]))
results = engine.search("LLM agents in 2026", depth=2)
for r in results.chunks:
    print(f"[{r.source_url}] {r.title}\n{r.snippet}\n")

- Why vouch
- How it compares
- Installation
- Quick start
- Three ways to use
- Core concepts
- The Site model
- Search depth
- Smart Site Routing
- Selector cache
- Profile registry
- LLM configuration (BYOK)
- Using as an agent tool
- Optional features
- Configuration reference
- Architecture
- Cost and performance
- Limits and what's out of scope
- Roadmap
- Contributing
- References and prior art
- CSS selector pinning — the AI now examines DOM context and emits reusable `{container, title, url, snippet, date, author}` selectors. Cached in SQLite; subsequent calls replay via `lxml.cssselect` — no LLM, milliseconds.
- Auto-escalation chain — `http → browser → stealth`. The engine learns the cheapest tier each site needs and persists it.
- Browser pool — one shared Chromium across all searches in a session. Eliminates the per-call launch overhead and the Windows-on-fresh-fork instability.
- Plugin model — `pip install vouch-adapter-arxiv` (or anything matching `vouch-adapter-*`, `vouch-router-*`, `vouch-profiles-*`) and the next `SearchEngine` picks it up automatically via Python entry_points. See docs/plugins.md.
- Profile registry + auto-update — 24 curated profiles ship in the box (arxiv, github, huggingface, gov.uk, irs.gov, sede.agenciatributaria, jusbrasil, MDN, pypi, …). `vouch profiles update` pulls the latest community registry.
- Auto-DNS resolution — fixes wrong-host catalog entries automatically (`agenciatributaria.es` → `sede.agenciatributaria.gob.es`).
- Async API — `await engine.asearch(query, ...)` alongside the sync API.
- Result formatting — `result.to_markdown()`, `to_json()`.
- Smarter quality detector — flags megamenu shells, recursive search-page links, and no-results sentinels in PT/ES/EN.
- Probe crawl on `add()` — opt-in: the site profile is learned the moment you add it, not on the first user query.
- Package renamed from `curio` to `vouch` (the `curio` name on PyPI has been taken by David Beazley's async library since 2016).
See CHANGELOG.md for the full list.
The agentic AI ecosystem has two kinds of search tools:
- Open-web search (Tavily, Exa, Perplexica, Firecrawl `/search`): great for broad questions, but you can't tell them "only use these sources I trust."
- Browser agents (Stagehand, Skyvern, browser-use): great at navigating one site at a time, but they don't know which site to navigate to and don't keep a registry of your trusted sources.
vouch fills the gap in between: a persistent catalog of sites you trust, with an LLM that reads your descriptions to route each query to the right sources, a browser layer that learns how to search each site once and caches the selectors, and a clean tool interface for CrewAI / LangChain / PydanticAI / MCP.
It's what you'd build if you wanted Tavily, but for your own list of sources, using your own LLM, without paying anyone.
| | vouch | Tavily / Exa | Perplexica | browser-use | Stagehand | Kagi Lenses |
|---|---|---|---|---|---|---|
| Curated source list with metadata | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| LLM routing by description | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Dynamic search-bar discovery | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ |
| Cached selectors after first visit | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Adjustable depth (0–3) | ✅ | ❌ | ❌ | ❌ | | |
| BYOK + local LLM (Ollama) | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| Self-hosted Python `pip install` | ✅ | ❌ | ❌ (TS) | ✅ | ❌ (TS) | ❌ |
| First-class agent tool | ✅ | ✅ | ❌ | ❌ | | |
Minimal install (HTTP-only, no browser):
pip install vouch

Most users want the browser layer too:
pip install "vouch[browser]"
vouch install-browser   # downloads Chromium for Playwright

Optional extras:
pip install "vouch[browser,stealth]" # patchright for harder sites
pip install "vouch[browser,ocr]" # Tesseract CAPTCHA tier (CPU, no GPU)
pip install "vouch[browser,vision]" # image handling for vision-LLM CAPTCHA
pip install "vouch[browser,monitor]" # APScheduler for change tracking
pip install "vouch[browser,server]" # FastAPI dashboard
pip install "vouch[mcp]" # MCP server (Claude Desktop, Cursor)
pip install "vouch[all]" # everything above
pip install "vouch[crewai]" # framework integrations (separate)
pip install "vouch[langchain]"
pip install "vouch[pydantic-ai]"

Requirements: Python 3.10+. Linux, macOS, or Windows. ~500MB disk after install-browser.
# 1. Make sure Ollama is running with a capable model
ollama pull qwen2.5:14b
# 2. Install vouch
pip install "vouch[browser]"
vouch install-browser
# 3. One-liner search
python -c "from vouch import search; print(search('arxiv attention transformers', sites=['arxiv.org']))"

Or with a cloud LLM:

export ANTHROPIC_API_KEY=sk-ant-...
pip install "vouch[browser]"
vouch install-browser
python -c "from vouch import search; print(search('FDA drug approvals 2025', sites=['fda.gov'], llm='anthropic/claude-haiku-4-5'))"

vouch has three progressive levels of API. Pick the one that fits your case — they all coexist and use the same engine underneath.
For quick scripts or trying it out:
from vouch import search
results = search(
query="regulamentação fundos 2026",
sites=["cvm.gov.br", "valor.globo.com"],
llm="ollama/qwen2.5:14b",
depth=1,
)

No catalog, no metadata, no persistence. Ephemeral. Good for `python -c "..."`.
The typical workflow. Build a catalog in code:
from vouch import SearchEngine, Site
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
cache_dir="~/.vouch", # selectors and metadata persist here
)
engine.add(Site(
url="cvm.gov.br",
category="regulação financeira BR",
description="Comissão de Valores Mobiliários — regulamentos, deliberações, consultas públicas",
tags=["gov", "financeiro", "regulação"],
))
engine.add(Site(
url="arxiv.org",
category="papers acadêmicos",
description="Preprints em CS, math, physics, biology",
tags=["research", "ml", "ai"],
))
# All metadata fields are optional. The LLM router uses whatever is provided.
engine.add(Site("news.ycombinator.com"))
# Search with automatic routing
results = engine.search("transformer architectures", depth=2)

The catalog persists in SQLite (`~/.vouch/catalog.db`). You can `engine.list_sites()`, `engine.remove("arxiv.org")`, `engine.update("arxiv.org", tags=[...])`. (`engine.list()` still works as a deprecated alias through v0.x.)
For projects you want to version-control and share:
# sites.yaml
defaults:
  llm: ollama/qwen2.5:14b
  depth: 1
  humanize: true

sites:
  - url: cvm.gov.br
    category: regulação financeira BR
    description: Comissão de Valores Mobiliários
    tags: [gov, financeiro]

  - url: arxiv.org
    category: papers acadêmicos
    description: Preprints em CS, math, physics
    tags: [research, ml, ai]

  - url: news.ycombinator.com
    # no metadata — that's fine, the LLM will infer from the URL

  - url: site-protegido.com
    behavior: stealth       # use patchright for this site
    rate_limit: "1/30s"     # be extra polite

from vouch import SearchEngine
engine = SearchEngine.from_yaml("sites.yaml")
results = engine.search("LLM regulation")Same engine, same SQLite cache, just a different way to declare the catalog. Good for CI, sharing with teammates, or templating per-project.
vouch is built around four primitives. Understanding them helps you customize what you need:
┌──────────────────────────────────────────────────────────────┐
│ SearchEngine │
│ the orchestrator — you mostly talk to this │
└──────────────────────────────────────────────────────────────┘
│
├─── Catalog ──────── persistent registry of Sites
│
├─── Router ────────── picks relevant Sites for a query
│ (uses Site metadata + LLM)
│
├─── SiteAdapter ───── per-site search executor
│ ├─ discovers selectors (LLM, first time)
│ ├─ caches selectors (SQLite, every time after)
│ └─ executes search + extraction
│
└─── Aggregator ────── ranks, dedups, returns SearchResult
Each is a Protocol — you can replace any of them. Want a custom router that uses a vector DB instead of LLM? Implement Router. Want to plug a paid scraping service? Implement SiteAdapter.
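For instance, a custom router might look like the sketch below. This is a standalone illustration, not vouch's actual interface — the `route(query, sites, top_k)` shape and the simplified `Site` stand-in are assumptions; check `vouch/router/base.py` for the real Protocol.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Site:  # simplified stand-in for vouch's Site model
    url: str
    tags: list[str] = field(default_factory=list)

class Router(Protocol):
    def route(self, query: str, sites: list[Site], top_k: int) -> list[Site]: ...

class KeywordRouter:
    """Ranks sites by how many of their tags appear as words in the query."""
    def route(self, query: str, sites: list[Site], top_k: int) -> list[Site]:
        words = set(query.lower().split())
        ranked = sorted(sites, key=lambda s: -len(words & set(s.tags)))
        return ranked[:top_k]

sites = [Site("arxiv.org", tags=["ml", "papers"]), Site("cvm.gov.br", tags=["gov"])]
print(KeywordRouter().route("new ml papers on agents", sites, top_k=1)[0].url)
# → arxiv.org
```

Because `Router` is structural (a Protocol), `KeywordRouter` needs no inheritance — any object with a matching `route` method plugs in.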
from vouch import Site
Site(
url: str, # required: "cvm.gov.br" or "https://cvm.gov.br"
category: str | None = None, # optional: "regulação financeira"
description: str | None = None, # optional: free-text, fed to router LLM
tags: list[str] | None = None, # optional: ["gov", "financeiro"]
# Behavioral overrides (all optional)
behavior: str = "natural", # "natural" | "stealth" | "external"
rate_limit: str | None = None, # "10/min", "1/30s", etc
requires_login: bool = False, # tells engine to handle auth flow
search_url_template: str | None = None, # bypass discovery: "/search?q={query}"
# Custom hooks (advanced)
pre_search: Callable | None = None,
post_extract: Callable | None = None,
)

- Category and description are concatenated and fed to the router LLM when there are too many sites to search them all. The LLM reads your descriptions and picks the most relevant subset for the query. With no description, the router falls back to the URL and any cached info.
- Tags can be used as filters (`engine.search(query, only_tags=["ml"])`) and as routing hints.
- Behavior controls the browser strategy (see Optional features).
- `search_url_template` is the escape hatch: if you already know how the site's search works (`https://github.com/search?q={query}`), provide it and vouch skips discovery entirely.
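Conceptually, filling a template amounts to URL-encoding the query and substituting the `{query}` slot. A minimal sketch (how vouch does this internally is not documented here; the encoding step is an assumption):

```python
from urllib.parse import quote_plus

def build_search_url(domain: str, template: str, query: str) -> str:
    # Assumption: the query is URL-encoded before filling the {query} slot
    return f"https://{domain}{template.format(query=quote_plus(query))}"

print(build_search_url("github.com", "/search?q={query}", "vector db rust"))
# → https://github.com/search?q=vector+db+rust
```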
The depth parameter is the central knob for cost vs. quality:
| Depth | Behavior | Latency | Cost (typical) |
|---|---|---|---|
| 0 | Fetch homepage + extract recent content. No search interaction. | ~2s/site | ~$0 |
| 1 | Discover & execute search, return result snippets. | ~5–10s/site | $0.005–0.02 |
| 2 | Enter top-N results, extract full content. | ~20–40s/site | $0.05–0.15 |
| 3 | Follow internal links from results, deep crawl. | ~1–3min/site | $0.20–0.60 |
# Quick check — just see what's on the homepages
engine.search("breaking news", depth=0)
# Standard — get search results
engine.search("LLM benchmarks", depth=1)
# Research mode — full content extraction
engine.search("GPT-5 architecture details", depth=2)
# Deep dive — agent navigates internal links
engine.search("complete history of CVM rule 175", depth=3)

Costs above assume Claude Haiku or Gemini Flash. With Ollama: $0 for everything except hardware/electricity.
When your catalog grows beyond ~5 sites, searching all of them for every query wastes time and tokens. vouch uses an LLM to route: given a query and your catalog metadata, pick the top-K most relevant sites.
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
router_top_k=3, # search top 3 sites per query
router_strategy="llm", # "llm" | "embedding" | "all" | "tags"
router_explain=True, # log why each site was chosen
)

- `llm` (default): a single LLM call reads catalog + query and returns ranked sites. ~200 tokens, ~$0.0001 with Haiku, free with Ollama. Most accurate.
- `embedding`: pre-embeds catalog descriptions; routes via cosine similarity. ~10ms, no LLM call. Cheaper at scale, slightly less accurate. Uses `sentence-transformers` if installed.
- `all`: searches every site every time. Honest fallback for small catalogs (≤5).
- `tags`: filter by tag match. Useful when the query domain is explicit.
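The embedding strategy boils down to cosine similarity between the query and each site description. A toy bag-of-words version for intuition — the real router uses `sentence-transformers` vectors, which also match synonyms, while this sketch only matches literal words:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

catalog = {
    "arxiv.org": "academic preprints machine learning papers",
    "cvm.gov.br": "brazilian financial regulation rules",
}

def route(query: str, top_k: int = 1) -> list[str]:
    q = Counter(query.lower().split())
    ranked = sorted(catalog, key=lambda s: -cosine(q, Counter(catalog[s].split())))
    return ranked[:top_k]

print(route("machine learning papers"))
# → ['arxiv.org']
```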
# Examples of router behavior
engine.search("Resolução CVM 175")
# → routes to: cvm.gov.br, valor.globo.com (skips arxiv, hacker news)
engine.search("attention is all you need paper")
# → routes to: arxiv.org (skips financial sites)
engine.search("python async best practices")
# → routes to: news.ycombinator.com (skips arxiv if no description match)

Inspect what the router did:
results = engine.search("LLM eval methods")
print(results.routing_decisions)
# [
# RouteDecision(site="arxiv.org", score=0.92, reason="academic ML papers"),
# RouteDecision(site="news.ycombinator.com", score=0.61, reason="developer discussions"),
# ]The first time vouch visits a site, it spends an LLM call to figure out the search interface (where's the input? where's the submit? how do results render?). It caches what it learns in SQLite. Subsequent searches on the same site use Playwright directly, with no LLM in the loop — milliseconds, $0.
The cache is keyed by (domain, dom_fingerprint). If the site redesigns and the cached selectors fail, vouch invalidates the entry and re-discovers automatically.
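The fingerprint function itself is internal to vouch, but the idea is to hash the page's tag skeleton while ignoring text, so content updates keep the cache key stable and layout changes invalidate it. One plausible implementation, using only the standard library:

```python
import hashlib
from html.parser import HTMLParser

class _Skeleton(HTMLParser):
    """Collects tag names + attribute names, ignoring text and attribute values."""
    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(tag + ":" + ",".join(sorted(k for k, _ in attrs)))

def dom_fingerprint(html: str) -> str:
    p = _Skeleton()
    p.feed(html)
    return hashlib.sha256("|".join(p.parts).encode()).hexdigest()[:16]

same = dom_fingerprint("<form><input name='q'><button>Go</button></form>")
text_changed = dom_fingerprint("<form><input name='q'><button>Search</button></form>")
assert same == text_changed  # text edits keep the cache key stable
```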
# First search on cvm.gov.br: ~8s, $0.01 (LLM discovery)
# Second search on cvm.gov.br: ~2s, $0 (cached)
# After 1000 searches: still ~2s, still $0
engine.cache_stats()
# {
# "cvm.gov.br": {"discovered": "2026-05-09", "hits": 47, "fails": 0},
# "arxiv.org": {"discovered": "2026-05-10", "hits": 12, "fails": 1},
# }
# Force re-discovery for a site
engine.invalidate_cache("cvm.gov.br")

The cache lives at `~/.vouch/selectors.db` — a tiny SQLite file you can ship in version control if you want a head start for teammates.
vouch ships with curated profiles for popular sites — preconfigured Site objects with the right search_url_template, tags, and category. You don't have to figure out arxiv's search URL or how to phrase HuggingFace's description.
from vouch import SearchEngine, get_profile, list_profiles
engine = SearchEngine(llm="ollama/qwen2.5:14b")
# List what's bundled
list_profiles()
# → ['arxiv.org', 'bbc.com', 'crates.io', 'developer.mozilla.org',
# 'docs.python.org', 'elpais.com', 'en.wikipedia.org', 'es.wikipedia.org',
# 'folha.uol.com.br', 'github.com', 'gov.uk', 'huggingface.co', 'irs.gov',
# 'jusbrasil.com.br', 'news.ycombinator.com', 'npmjs.com',
# 'paperswithcode.com', 'pkg.go.dev', 'planalto.gov.br', 'pt.wikipedia.org',
# 'pypi.org', 'reddit.com', 'sede.agenciatributaria.gob.es',
# 'stackoverflow.com']
# Use a profile directly
engine.add(get_profile("arxiv.org"))
engine.add(get_profile("github.com"))
engine.add(get_profile("jusbrasil.com.br")) # for BR legal/tax queries
results = engine.search("LLM benchmarks 2026")

Or via CLI:
vouch profiles list # show bundled profiles
vouch profiles show arxiv.org # see one in detail
vouch profiles import arxiv.org,github.com,huggingface.co
vouch profiles import all              # add everything to your local catalog

The profile YAML lives at `vouch/profiles/builtin.yaml`. Community contributions via PR welcome — see the `new_profile` issue template.
A separate community-maintained registry (vouch-profiles repo) is on the roadmap; teams can publish their domain-specific profiles there and run vouch profiles update to pull them. The update command needs internet access — it falls back silently to the bundled profiles on any network failure, so air-gapped installs keep working.
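The silent fallback described above amounts to a try/except around the registry fetch. A sketch under assumptions — the registry URL and JSON shape are illustrative, not the real format:

```python
import json
import urllib.request

def load_profiles(registry_url: str, bundled: dict) -> dict:
    """Prefer the remote registry; fall back silently to bundled profiles
    so air-gapped installs keep working."""
    try:
        with urllib.request.urlopen(registry_url, timeout=5) as resp:
            return json.load(resp)
    except (OSError, ValueError):  # network failure or bad JSON -> bundled
        return bundled

# An unreachable host (reserved .invalid TLD) exercises the fallback path:
profiles = load_profiles("https://registry.invalid/profiles.json", {"arxiv.org": {}})
print(sorted(profiles))
# → ['arxiv.org']
```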
vouch uses LiteLLM underneath, which means any of 100+ providers work with the same syntax.
Recommended models:
- `ollama/qwen2.5:14b` — best quality/size balance, runs in 16GB RAM
- `ollama/qwen2.5:7b` — faster, lower hardware requirements
- `ollama/qwen2.5-vl:7b` — multimodal, needed for vision features
- `ollama/llama3.2:8b` — alternative
engine = SearchEngine(llm="ollama/qwen2.5:14b")
# Or split: cheap model for routing, capable model for extraction
engine = SearchEngine(
llm="ollama/qwen2.5:14b", # discovery + extraction
router_llm="ollama/qwen2.5:7b", # routing decisions
)

# Anthropic
engine = SearchEngine(llm="anthropic/claude-haiku-4-5")
# requires ANTHROPIC_API_KEY env var
# OpenAI
engine = SearchEngine(llm="openai/gpt-4.1-mini")
# Google
engine = SearchEngine(llm="gemini/gemini-2.5-flash")
# Mix providers
engine = SearchEngine(
llm="anthropic/claude-sonnet-4-6",
router_llm="gemini/gemini-2.5-flash-lite", # cheaper routing
vision_llm="anthropic/claude-haiku-4-5", # for CAPTCHA / visual extraction
)

engine = SearchEngine(
llm="anthropic/claude-haiku-4-5",
api_keys={
"anthropic": "sk-ant-...",
"openai": "sk-...",
},
)

engine = SearchEngine(
llm=["anthropic/claude-haiku-4-5", "openai/gpt-4.1-mini", "ollama/qwen2.5:14b"],
)
# Tries each in order if the previous fails

vouch's primary audience is agentic frameworks. First-class adapters:
from crewai import Agent, Task, Crew
from vouch.integrations.crewai import VouchSearchTool
search_tool = VouchSearchTool(catalog="sites.yaml", default_depth=2)
researcher = Agent(
role="Financial Regulation Researcher",
goal="Find recent CVM rules on fund management",
tools=[search_tool],
llm="anthropic/claude-sonnet-4-6",
)

from langchain.agents import AgentExecutor
from vouch.integrations.langchain import VouchSearchTool
tools = [VouchSearchTool.from_yaml("sites.yaml")]
agent = AgentExecutor(tools=tools, ...)

from pydantic_ai import Agent
from vouch.integrations.pydantic_ai import vouch_tool
agent = Agent("anthropic:claude-sonnet-4-6", tools=[vouch_tool("sites.yaml")])

vouch mcp serve --catalog sites.yaml

Configure in claude_desktop_config.json:
{
"mcpServers": {
"vouch": {
"command": "vouch",
"args": ["mcp", "serve", "--catalog", "/path/to/sites.yaml"]
}
}
}Now Claude Desktop can call search_curated_sources(query, depth) directly.
Reduces bot-detection signals on sites with simple behavioral analytics. Does not bypass Cloudflare/DataDome/PerimeterX — see limits.
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
humanize=True, # default: lognormal pauses, variable typing speed
typing_speed="natural", # "fast" | "natural" | "slow"
business_hours_only=False, # restrict to user-local business hours
)

What it does:
- Lognormal pause distribution between actions (60–250ms typing, 0.5–3s reading)
- Random typo injection with backspace correction (~2% rate)
- Optional business-hours-only execution (`business_hours_only=True`)
- Single, recent Chrome user-agent applied per session (no rotation)
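The lognormal choice is deliberate: it clusters delays near a typical value but keeps the occasional long pause humans produce, unlike uniform jitter. A sketch — the mu/sigma values below are illustrative assumptions, not vouch's actual parameters:

```python
import random

def humanized_pause(kind: str = "typing") -> float:
    """Lognormal delay in seconds: most samples near the mode, long right tail."""
    # Illustrative parameters tuned to land typing roughly in the 60-250ms band
    mu, sigma = (-2.0, 0.4) if kind == "typing" else (0.3, 0.5)
    return random.lognormvariate(mu, sigma)

random.seed(42)
delays = sorted(humanized_pause() for _ in range(1000))
print(f"median {delays[500] * 1000:.0f}ms, p95 {delays[950] * 1000:.0f}ms")
```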
Patchright is a drop-in replacement for Playwright. It helps with mild Cloudflare Bot Fight Mode and basic fingerprint checks; it does not defeat invisible Turnstile or DataDome.
engine.add(Site("portal-protegido.com", behavior="stealth"))
# OR globally:
engine = SearchEngine(default_behavior="stealth")Optional, opt-in, multi-backend. Two tiers used in order from cheapest to heaviest:
1. Tesseract OCR — for distorted-text CAPTCHAs (Receita, CVM, Jusbrasil, agencia tributaria, gov.uk). CPU-only, ~50-200 ms per image, $0, no GPU, no model download.
pip install "vouch[ocr]"
# plus the system binary:
# Ubuntu/Debian: sudo apt install tesseract-ocr tesseract-ocr-por tesseract-ocr-spa
# macOS: brew install tesseract tesseract-lang
# Windows: https://github.com/UB-Mannheim/tesseract/wiki

That's already enough for the common text-CAPTCHA case — no vision_llm needed.
2. Vision LLM — upgrade for image-grid CAPTCHAs ("select all traffic lights").
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
vision_llm="ollama/qwen2.5vl:7b", # ~5 GB, enables image-grid solving
captcha_min_confidence=0.7,
captcha_max_attempts=2,
)

The CaptchaSolver chains them automatically: Tesseract first → vision LLM on low confidence → return best. `result.solver` tells you which one produced the answer.
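The chaining logic amounts to "cheapest solver first, stop when confident, otherwise keep the best attempt". A standalone sketch with stub solvers — the real CaptchaSolver interface may differ:

```python
def solve(image, solvers, min_confidence: float = 0.7):
    """Run solvers cheapest-first; return (text, confidence, solver_name)."""
    best = ("", 0.0, "none")
    for name, fn in solvers:
        text, conf = fn(image)
        if conf >= min_confidence:   # confident enough: stop early
            return text, conf, name
        if conf > best[1]:           # otherwise remember the best attempt so far
            best = (text, conf, name)
    return best

# Stubs: OCR is unsure on a distorted image, the vision model is confident.
ocr = lambda img: ("k3x9", 0.40)
vision = lambda img: ("K3X8", 0.90)
print(solve(None, [("tesseract", ocr), ("vision_llm", vision)]))
# → ('K3X8', 0.9, 'vision_llm')
```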
Supported types: text OCR, reCAPTCHA v2 image grid, hCaptcha image grid (~50–80% success rate).
Not supported: reCAPTCHA v3 (invisible), Cloudflare Turnstile, Arkose 3D puzzles. These return status="blocked".
# When solver fails or type is unsupported:
result.status # "blocked"
result.reason # "captcha_unsupported"
result.suggestion  # "Use the official API: https://..."

Watch sites for changes, get notified.
from vouch import Monitor
monitor = Monitor(engine)
monitor.watch(
site="cvm.gov.br",
query="resoluções recentes",
interval="1h",
notify_via="email", # or "slack", "discord", "webhook"
)
monitor.start()

Uses 3-tier change detection: hash → diff → embedding similarity. Avoids LLM calls when content is unchanged, keeping monitoring cheap (~$1–3/month for 100 sites hourly with Gemini Flash-Lite).
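The first two tiers can be sketched with the standard library; the embedding tier (not shown) would only run when the diff ratio is ambiguous. The 0.95 threshold is an illustrative assumption:

```python
import difflib
import hashlib

def changed(old: str, new: str, threshold: float = 0.95) -> tuple[bool, str]:
    """Tier 1: hash equality (free). Tier 2: diff ratio (cheap).
    Returns (did it change?, which tier decided)."""
    if hashlib.sha256(old.encode()).digest() == hashlib.sha256(new.encode()).digest():
        return False, "hash"
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    return ratio < threshold, "diff"

print(changed("same page", "same page"))
# → (False, 'hash')
print(changed("stable boilerplate " * 40, "stable boilerplate " * 40 + "ts"))
# → (False, 'diff')
```

Only pages that make it past both cheap tiers would be worth an embedding comparison or an LLM extraction call, which is what keeps per-site monitoring cost low.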
vouch serve --port 8080
# → web UI at http://localhost:8080
# → manage catalog, view cache, run searches, see costs

SearchEngine(
# LLM
llm: str | list[str] = "ollama/qwen2.5:14b",
router_llm: str | None = None, # defaults to llm
vision_llm: str | None = None, # disables CAPTCHA features if None
api_keys: dict | None = None,
# Routing
router_strategy: str = "llm", # "llm" | "embedding" | "all" | "tags"
router_top_k: int = 3,
router_explain: bool = False,
# Search
default_depth: int = 1,
parallel_sites: int = 3, # how many sites to search concurrently
# Browser
default_behavior: str = "natural", # "natural" | "stealth" | "external"
humanize: bool = True,
typing_speed: str = "natural",
headless: bool = True,
# Caching
cache_dir: str = "~/.vouch",
cache_ttl_days: int = 30,
# Politeness
respect_robots_txt: bool = True,
default_rate_limit: str = "2/min",
user_agent: str | None = None, # defaults to recent Chrome
# Captcha (only if vision_llm is set)
captcha_min_confidence: float = 0.7,
captcha_max_attempts: int = 2,
# Misc
verbose: bool = False,
)

engine.search(
query: str,
depth: int = None, # overrides default_depth
sites: list[str] | None = None, # restrict to specific sites
only_tags: list[str] | None = None,
exclude_tags: list[str] | None = None,
max_results: int = 10,
timeout: float = 60.0,
) -> SearchResult

SearchResult(
query: str,
chunks: list[Chunk],
sites_searched: list[str],
routing_decisions: list[RouteDecision],
duration_ms: float,
cost_estimate_usd: float,
tokens_used: dict[str, int], # {"input": ..., "output": ...}
cache_stats: dict, # {"hits": ..., "misses": ...}
)
Chunk(
source_url: str,
site: str, # "cvm.gov.br"
site_category: str | None,
title: str,
snippet: str, # ~200 chars
content: str | None, # full markdown if depth >= 2
relevance_score: float, # 0–1
extracted_at: datetime,
metadata: dict, # extra fields per adapter
)

┌──────────────────────────────────────────────────────────────────────┐
│ SearchEngine │
└──────────────────────────────────────────────────────────────────────┘
│
┌──────────────────┴──────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Catalog │ │ Router │
│ (SQLite) │ ─── metadata ────▶ │ (LLM/emb) │
└──────────────┘ └──────────────┘
│
│ ranked sites
▼
┌────────────────────┐
│ parallel fetch │
└────────────────────┘
│
┌───────────────────────────────┼───────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ SiteAdapter │ │ SiteAdapter │ │ SiteAdapter │
│ cvm.gov.br │ │ arxiv.org │ │ news.yc.com │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ selector │ │ │ │ selector │ │ │ │ selector │ │
│ │ cache │ │ │ │ cache │ │ │ │ cache │ │
│ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │
│ │ MISS │ │ │ HIT │ │ │ HIT │
│ ▼ │ │ ▼ │ │ ▼ │
│ discover │ │ playwright │ │ playwright │
│ via LLM │ │ replay │ │ replay │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└──────────────────────────────┴──────────────────────────────┘
│
▼
┌────────────────────┐
│ Aggregator/Rank │
└────────────────────┘
│
▼
SearchResult
vouch/
├── __init__.py # public API: search, SearchEngine, Site
├── engine.py # SearchEngine orchestrator
├── catalog.py # Catalog (SQLite-backed Site registry)
├── models.py # Site, Chunk, SearchResult, RouteDecision
├── config.py # SearchEngine config dataclass
├── exceptions.py # VouchError + back-compat CurioError alias
├── dns_resolver.py # auto-fixes wrong-host catalog entries
├── plugins.py # entry_points-based plugin discovery
├── cli.py # typer-based CLI (vouch ...)
├── server.py # optional: FastAPI dashboard
├── _lang.py # language detection helpers
├── _llm.py # LiteLLM wrapper (sync + async)
├── router/
│ ├── base.py # Router Protocol
│ ├── llm_router.py # LLM-based routing (default)
│ ├── embedding_router.py# cosine-similarity routing
│ ├── tag_router.py # tag-match routing
│ └── all_router.py # "search every site" fallback
├── adapters/
│ ├── base.py # SiteAdapter Protocol
│ ├── http.py # depth=0, fast HTTP+Trafilatura
│ ├── browser.py # depth=1+, Playwright
│ ├── browser_pool.py # shared Chromium pool across searches
│ └── stealth.py # patchright wrapper
├── discovery/
│ ├── search_bar.py # LLM-powered selector discovery
│ ├── cache.py # SQLite selector cache
│ ├── humanize.py # lognormal pauses, typing, business hours
│ └── probe.py # opt-in profile probe on add()
├── extraction/
│ ├── trafilatura.py # HTML → markdown
│ ├── pdf.py # PDF handling (pypdf)
│ ├── llm.py # structured extraction via Pydantic
│ ├── llm_extract.py # quality detector + LLM-driven extraction
│ └── css_selectors.py # cached-selector replay (lxml.cssselect)
├── captcha/ # optional: Tesseract OCR + vision_llm CAPTCHA
│ ├── solver.py
│ └── tesseract.py
├── monitor/ # optional: change tracking + notifications
│ ├── watcher.py
│ └── notify.py
├── profiles/ # 24 curated site profiles + registry
│ ├── registry.py
│ ├── update.py
│ └── builtin.yaml
├── integrations/
│ ├── _common.py
│ ├── crewai.py # VouchSearchTool (CrewAI)
│ ├── langchain.py # VouchSearchTool (LangChain)
│ ├── pydantic_ai.py # vouch_tool (PydanticAI)
│ └── mcp.py # MCP server (Claude Desktop, Cursor)
- playwright — browser automation
- litellm — multi-LLM, BYOK, fallback chains
- pydantic v2 — models everywhere
- sqlalchemy + sqlite — catalog and cache
- trafilatura — fast HTML→markdown extraction
- typer — CLI
- apscheduler — optional, for monitor
- fastapi + uvicorn — optional, for dashboard
- patchright — optional, for stealth
- sentence-transformers — optional, for embedding router
No CrewAI/LangChain/etc as hard deps. They're integration extras only.
Typical numbers from internal runs. They depend heavily on the LLM, network, site, and cache state — treat them as rough order-of-magnitude figures, not guarantees. Reproducible benchmark scripts live under examples/ if you want to validate on your stack.
Typical query, depth=1, 3 sites routed, all cache hits:
| LLM | Routing | Discovery | Extraction | Total |
|---|---|---|---|---|
| Ollama qwen2.5:14b | $0 | $0 (cached) | $0 | $0 |
| Gemini Flash-Lite | $0.0001 | $0 (cached) | $0.001 | ~$0.001 |
| Claude Haiku 4.5 | $0.0005 | $0 (cached) | $0.005 | ~$0.005 |
| GPT-4.1-mini | $0.0003 | $0 (cached) | $0.003 | ~$0.003 |
First search on a new site adds ~$0.01–0.05 for discovery (one-time per site).
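A back-of-envelope check using the Haiku column above shows how quickly the one-time discovery cost amortizes (the $0.03 discovery figure is the midpoint of the quoted range):

```python
discovery = 0.03    # one-time LLM discovery per site (midpoint of the range above)
per_search = 0.005  # depth=1, cache hit, Claude Haiku (cost table above)

for n in (1, 10, 100):
    avg = (discovery + n * per_search) / n
    print(f"{n:>3} searches -> ${avg:.4f}/search")
```

After a hundred cached searches the discovery cost is noise next to the per-search extraction cost.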
- HTTP-only fetch (depth=0): ~1–3s/site
- Cached browser search (depth=1): ~2–5s/site
- First-time discovery (depth=1): ~8–15s/site
- Full content extraction (depth=2): ~15–40s/site
Searches across multiple sites run in parallel (controlled by parallel_sites).
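On top of that per-query site parallelism, independent queries can fan out concurrently via the async API. A sketch with a stub coroutine standing in for `engine.asearch` (the real call hits the network; the stub just simulates I/O latency):

```python
import asyncio

async def asearch(query: str) -> str:
    """Stand-in for engine.asearch: simulates an I/O-bound site search."""
    await asyncio.sleep(0.01)
    return f"results for {query!r}"

async def main() -> list[str]:
    queries = ["Resolução CVM 175", "LLM benchmarks 2026"]
    # gather() runs both searches concurrently instead of back-to-back
    return list(await asyncio.gather(*(asearch(q) for q in queries)))

print(asyncio.run(main()))
```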
- Pure HTTP mode: any machine with Python
- Browser mode: 4GB RAM minimum, 8GB recommended
- Local LLM mode: depends on model. Qwen2.5:14b needs ~16GB RAM (CPU) or ~10GB VRAM
- Vision mode: Qwen2.5-VL 7B needs ~8GB VRAM
vouch is opinionated about what it tries to do and what it doesn't.
What's in scope:
- Public sites that respond well to standard web requests
- Sites where you want to discover search behavior dynamically
- Sites you've explicitly added to your catalog
- Polite, rate-limited, robots.txt-respecting access
- Educational, research, regulatory, professional use cases
What's explicitly out of scope:
- Bypassing Cloudflare's modern bot protection (invisible Turnstile, aggressive Bot Fight Mode). vouch includes humanize and stealth options that help with mild detection, but no open-source library reliably defeats current Cloudflare. For those cases, plug in a commercial bypass service (Browserbase, Bright Data, ZenRows) via the `behavior="external"` site option.
- Mass scraping — vouch is built for "search across my list of trusted sources", not "scrape the entire web."
- CAPTCHA solving as a sales pitch. The optional vision-LLM CAPTCHA assist is best-effort and only works on simple visual challenges. It won't (and shouldn't) work on adversarial challenges.
- Logged-in personal accounts (LinkedIn, Instagram, etc.) — even when technically possible, this is usually a ToS violation. vouch supports authentication for sites where it's legitimate (gov portals, enterprise dashboards you own), not for this.
This isn't a limitation we apologize for — it's the position. A library that promises to bypass everything breaks every week and isn't sustainable as open-source.
Shipped (see CHANGELOG.md for the full list):
- v0.1 — Levels 1-3 API, HTTP + browser adapters, LLM router, selector cache, BYOK via LiteLLM, CrewAI + LangChain integrations, CLI.
- v0.2 — embedding/tag/all routers, vision-LLM + Tesseract CAPTCHA assist, MCP server, PydanticAI integration, browser pool, plugin model via entry_points, profile registry + auto-update, async API, FastAPI dashboard, stealth mode (patchright), change monitor, auto-escalation ladder, auto-DNS resolution, quality detector, CSS selector pinning, Accept-Language header.
Coming next (no committed dates — direction shaped by feedback and real use cases):
- Auth flow helpers for sites that legitimately need a login (cookie persistence, refresh-token rotation)
- Distributed search workers (multi-machine fanout for large catalogs)
- Built-in vector store for chunk reuse across queries
- Docling-backed PDF extraction
- Workflow chains (search → reason → search → extract)
- Standalone community profile registry repo (`vouch-profiles`)
- Per-site adapter packages on PyPI (`vouch-adapter-*`) as the plugin ecosystem matures
See CONTRIBUTING.md. Quick start for dev:
git clone https://github.com/evertonsjjj/vouch
cd vouch
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,all]"
playwright install chromium
pytest

Issues, PRs, ideas welcome. Especially:
- New `SiteAdapter` implementations (specific site optimizations)
- New `Router` strategies
- Integrations with other agent frameworks
- Translations of docs
vouch stands on the shoulders of these projects. Read their docs — most of them solve adjacent problems brilliantly.
- Tavily & Exa — defined the "search API for agents" category. vouch is "Tavily, but for your own sources."
- Kagi Lenses & Brave Goggles — proved curated-source search is a real product. vouch brings it to OSS.
- Skyvern — pioneered "discover-then-cache" for browser automation. vouch applies the same pattern to search specifically.
- Stagehand — the `act()`/`observe()` API inspired the discovery layer.
- Playwright — browser automation
- browser-use — DOM-to-LLM serialization patterns
- Crawl4AI — extraction strategies, JsonCssExtractionStrategy
- LiteLLM — multi-provider LLM client
- Trafilatura — HTML to clean text
- Patchright — stealth Playwright
- Camoufox — for serious anti-fingerprinting
- semantic-router — embedding-based routing patterns
- APScheduler — for the optional monitor module
- Apprise — 100+ notification channels for monitor
- Perplexica — open-source Perplexity clone (open-web)
- gpt-researcher — research agent (open-web)
- local-deep-research — local research bot (engine-based, not site-based)
- changedetection.io — site change tracking (no LLM extraction)
- bdambrosio/llmsearch — small precedent for source-curated LLM search
- Firecrawl — `/search` endpoint backed by Google/Bing
- ScrapeGraphAI — LLM-driven scraping
- SearXNG — federated meta-search engine
- Stagehand caching internals: https://www.browserbase.com/blog/stagehand-caching
- Skyvern adaptive caching: https://github.com/Skyvern-AI/skyvern (recent PRs)
- Anti-detection landscape 2026: https://github.com/pim97/anti-detect-browser-tools-tech-comparison
- LLM Gateway comparison: https://github.com/BerriAI/litellm
MIT — see LICENSE.
Built by @evertonsjjj. Early-stage open source — looking for real-world feedback, contributors, and battle-tested use cases. The gap this fills (curated-source search for agents) is one I needed myself, but I'd rather hear what's wrong with it from people using it than guess in isolation.
If vouch helps you, a ⭐ on GitHub goes a long way.