Submit a startup pitch. Watch five distinct AI investor personas debate it peer-to-peer over A2A. The Chair synthesises a verdict.
A self-contained, fully-local agentic web app that demonstrates real
peer-to-peer Agent-to-Agent (A2A) protocol use, MCP tool grounding,
multi-agent orchestration, and cost-aware production deployment β all
runnable from docker compose up on your laptop.
git clone <this-repo> && cd sharkhouse
cp .env.example .env # set OPENAI_API_KEY
docker compose up -d # 9 services, ~3 min cold build, ~30 s warm boot
open http://localhost:3000A full debate runs in 60β90 s, costs ~$0.03β0.05 against gpt-4o-mini,
and ends with a colour-coded verdict you can permalink.
The five "sharks" are fictional investor archetypes built for a simulation β not real people, not real funds, no real capital. Each has a sharp opinion. Most of them disagree. The point isn't accuracy; the point is the disagreement.
| Shark | Archetype | What they push on |
|---|---|---|
| Vera | The YC Believer | Founder-market fit. "Has the founder lived the problem?" |
| Hiro | The Deep-Tech Skeptic | Technical defensibility. "This isn't a moat. Anyone with a weekend and an API key could ship this." |
| Mira | The Distribution Hawk | CAC and channel. "No channel, no deal." |
| Derek | The Contrarian Grump | Market structure, valuation discipline. Quotes 19th-century economists unironically. |
| Priya | The Mission Investor | Founder integrity, second-order effects. "And then what?" |
Plus The Chair β a non-participating moderator that picks the most- divergent pairs for peer debate and synthesises the final verdict.
The real demonstration of A2A's value is framework-agnostic interop. The five investors are implemented with three different agent frameworks:
| Investor | Framework | Why |
|---|---|---|
| Vera, Mira, Priya | OpenAI SDK (direct) | Minimalist reference implementation via our StructuredLLM wrapper β one less abstraction when calling models |
| Hiro | LangGraph (create_react_agent) + langchain-mcp-adapters |
Defensibility checks are a natural ReAct loop β think β maybe-call-tools β think β output |
| Derek | Google ADK (LlmAgent + LiteLlm) |
Declarative output_schema-driven structured output; LiteLLM bridges ADK to OpenAI |
The Chair's A2A client calls all five identically. It doesn't know β and doesn't need to know β which framework each investor uses internally. Peer debates in Phase 2 cross framework boundaries transparently (Hiro on LangGraph can argue with Derek on ADK, because both speak the same A2A wire protocol). That's the point of an open protocol: pick the framework that fits your agent, keep interop for free.
See docs/architecture.md for the detailed framework breakdown and sequence diagrams.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Phase 1: 5 openings in parallel β
β Each shark reads the pitch independently. May call MCP β
β tools (TAM, unit economics, comparables, web search) to β
β ground their critique. β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Phase 2: Peer debate (the real A2A β see below) β
β Chair picks the 2 most-divergent investor pairs by signed β
β (position, confidence) score. For each pair, the more- β
β confident side initiates and drives 4 turns of back-and- β
β forth. Chair drops out of the message path. β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Phase 3: 5 closings in parallel β
β Each shark sees the full Phase 1+2 transcript. Emits a β
β final position + a confidence delta vs their opening. β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Phase 4: Chair verdict β
β No A2A. The Chair runs a single LLM synthesis across the β
β full debate state and emits a Verdict artifact: overall β
β position, consensus strength, summary, notable moments. β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Total: ~19 LLM calls per debate, ~60β90 s wall clock, ~$0.03β0.05 on
gpt-4o-mini. Hard caps enforced at the orchestrator layer β see
Hard budgets.
This is the unique angle. Most "multi-agent" systems are an orchestrator that loops over tool calls and calls itself collaborative. Sharkhouse is not that. Phase 2 uses the official Google A2A protocol for genuine peer-to-peer agent communication.
Each investor is a fully A2A-compliant server. It exposes:
GET /.well-known/agent-card.jsonβ the standard A2A discovery endpointPOST /a2aβ JSON-RPC entrypoint with the v0.3 method set (message/send,tasks/get, β¦)- Five declared skills:
assess_pitch,challenge,respond,closing_statement,initiate_debate
Each investor is also an A2A client of the others. On boot, every
investor process fetches the agent cards of the four other investors
from their /.well-known/agent-card.json URLs and caches the parsed
cards. There is no hardcoded URL list β peers are discovered.
In Phase 2, the Chair drops out of the message path. The Chair
sends ONE A2A message per pair (initiate_debate(target=X, max_turns=4))
to the chosen initiator. The initiator then drives the loop entirely
peer-to-peer:
Chair β Vera : initiate_debate(target=derek, max_turns=4)
Vera β Derek : challenge(opponent_assessment=...) [turn 1]
Derek β Vera : (response artifact returned) [turn 2]
Vera β Derek : respond(prior_turn=...) [turn 3]
Derek β Vera : (response artifact returned) [turn 4]
Each turn is a real A2A Task with its own taskId, sharing
contextId = debate_id so the lineage is traceable. After every turn,
the speaker fires a debate_event webhook to the Chair so the UI can
show the debate progressing. Hard turn-cap enforcement re-checks on
every iteration body, not just at entry β covered by the test suite.
The implementation is initiator-driven: the chosen investor makes every outbound A2A call (target only responds). A fully distributed variant (each side alternately initiating) is on the upgrade path; both are spec-compliant and operationally equivalent. Initiator-driven is significantly easier to reason about and budget-track.
A shared MCP server exposes four research tools every investor can call:
| Tool | What | Backend | Per-(debate Γ investor) cap |
|---|---|---|---|
unit_economics_check |
Sanity-test CAC, LTV, gross margin, payback. PASS/WARN/FAIL with quotable summary. | Pure Python | 3 |
tam_sanity_check |
Compare claimed TAM to a curated reference set. plausible / inflated / absurd. | Bundled JSON (14 categories) | 2 |
comparables_lookup |
Filter bundled startup snapshot by sector + stage. | Bundled JSON (~25 companies, 11 sectors) | 2 |
web_search |
Web search for founder background / category checks. | Tavily (free 1000/mo). Falls back to "disabled" no-op when no key. | 2 |
Tools speak streamable HTTP MCP;
investors connect via the official mcp Python SDK. Server-side rate
limits are enforced per (debate_id, investor_id, tool_name) β never
via prompts. Calls debit the per-debate DebateBudget on the investor
side too, so a single investor can't exhaust the per-debate tool
budget by hammering one tool.
The tool integration uses OpenAI's function-calling loop in the
investor's LangGraph: the model decides whether to call a tool,
the executor invokes the MCP tool, the result feeds back into the
model context, and the final structured assessment quotes the tool
output verbatim. Mira and Derek tend to lean on unit_economics_check
and tam_sanity_check; Hiro reaches for comparables_lookup; Priya
uses web_search sparingly for character context.
Every budget below is enforced at the orchestrator layer, NOT via prompts. Breach is a graceful failure (the debate ends with a clear reason) β never a runaway.
| Cap | v1 default |
|---|---|
| Peer-debate turns per pair | 4 |
| Peer-debate pairs per debate | 2 |
| Total A2A calls per debate | 20 |
| Output tokens per call (investors) | 500 |
| Output tokens per call (Chair) | 1500 |
| Output tokens per debate | 15,000 |
| Tool calls per investor per debate | 5 |
| Wall clock per debate | 180 s |
| Concurrent debates (process-wide) | 8 |
| Per-IP debates per day (anonymous) | 3 |
Each budget has a unit test that confirms the cap triggers correctly. At ~$0.05 per debate, even a Hacker News front-page burst of 5,000 debates costs ~$250 β plus the platform-level OpenAI spend cap is the second independent ceiling.
Every pitch passes through three gates BEFORE any investor sees it:
- Size β 200β5,000 chars (Pydantic + explicit check)
- Prompt-injection heuristics β regex coverage for the five most-
attempted patterns:
ignore previous instructions, persona swaps, system-prompt leaks, jailbreak markers, "you are now β¦". Reject on match. - PII detection β regex for email, US/intl phone, SSN, credit card.
On match, return a redacted version to the user and require explicit
confirmation (
confirmed_redacted=true) before the pitch goes anywhere near the model. - OpenAI moderation β free
omni-moderation-latestAPI. Reject on any flagged category. Fails OPEN on network error so a moderation outage doesn't block legitimate users.
Outputs are post-validated against Pydantic schemas with bounded
retries β every artifact (InvestorAssessment, InvestorChallenge,
InvestorResponse, ClosingStatement, Verdict) has a strict contract
the LLM must satisfy or the call retries with a corrective turn.
Upgrade path documented in code: Presidio for richer PII coverage, Prompt-Guard-2 for stronger injection detection, defamation NER. None are blocking for v1.
sharkhouse/
βββ frontend/ # Next.js 15 + React 19 + Tailwind
β βββ app/ # App-router pages: /, /pitch, /debate/[id]
β βββ components/ # InvestorOpening, PeerTurn, ClosingStatement, VerdictPanel, β¦
β βββ lib/api.ts # Typed orchestrator client
β
βββ packages/ # Python uv workspace
β βββ core/sharkhouse_core/
β β βββ models.py # Pydantic contracts (every artifact)
β β βββ budgets.py # DebateBudget hard-cap enforcement
β β βββ settings.py # Typed Pydantic Settings
β β βββ llm.py # OpenAI wrapper with budget + retries +
β β β # tool-calling loop + LangSmith hook
β β βββ mcp_client.py # MCP toolset wrapper (streamable HTTP)
β β βββ a2a_client.py # Wraps a2a-sdk for Chair-side calls
β β βββ events.py # debate_event webhook helper
β β βββ guardrails.py # PII + injection + moderation pipeline
β β
β βββ orchestrator/ # The Chair (FastAPI)
β β βββ orchestrator/
β β βββ main.py # POST /api/pitches, GET /api/debates/{id}, /events webhook
β β βββ chair.py # Pair selection, phase orchestration, verdict synthesis
β β
β βββ investors/ # Five A2A servers + shared template
β β βββ _template/ # Shared base: executor, persona, peer client, peer prompts
β β βββ vera/ hiro/ mira/ derek/ priya/
β β β βββ prompts.py # Persona contract
β β β βββ persona.py # Persona dataclass (system prompt + signature/banned phrases)
β β β βββ agent_card.json # A2A skill manifest
β β β βββ __main__.py # uvicorn entrypoint
β β
β βββ mcp_tools/mcp_server/ # FastMCP server with 4 tools
β βββ cli/sharkhouse_cli/ # `sharkhouse pitch submit` CLI
β
βββ samples/ # 3 test pitches: dentist_crm, saas_with_numbers, weak_pitch
βββ infra/docker/ # Shared Python base Dockerfile
βββ docker-compose.yml # 9 services (5 investors + orchestrator + mcp + postgres + frontend)
βββ scripts/ # a2a_raw_client.py
βββ docs/architecture.md # Deeper diagrams + sequence flows
# First time
uv sync --all-packages
make setup # installs pre-commit + verifies tools
# Run the full stack
docker compose up -d # 9 services come up healthy in ~30 s
# Or run a service standalone (faster iteration)
INVESTOR_URL_VERA=http://localhost:8100 PORT=8100 uv run python -m vera
# Tests
uv run pytest # 76 unit tests, ~2 s
# Lint
uv run ruff check .
# Type check (core only by default; expand per phase)
uv run mypy packages/core/sharkhouse_core
# CLI flow against a running stack
ORCHESTRATOR_PUBLIC_URL=http://localhost:8000 \
uv run sharkhouse pitch submit --from samples/saas_with_numbers.txtIf something on your machine already owns port 3000, 5432, or 8000:
SHARKHOUSE_HOST_PORT_FRONTEND=3010 \
SHARKHOUSE_HOST_PORT_ORCHESTRATOR=8500 \
SHARKHOUSE_HOST_PORT_POSTGRES=55432 \
NEXT_PUBLIC_ORCHESTRATOR_URL=http://localhost:8500 \
docker compose up -dThe orchestrator + Postgres are the only services bound to the host;
the five investors and MCP server are compose-network-only (use
docker compose exec <svc> to debug from outside).
Only OPENAI_API_KEY is mandatory. Everything else has a working
local default. See .env.example for the full list including the
optional Tavily key (web search), LangSmith trace credentials, and
budget overrides.
The product is feature-complete for v1. The items below are intentionally deferred and make obvious contribution targets:
- Presidio + Prompt-Guard-2 β current input guardrails use lightweight regex heuristics + the OpenAI moderation API. Adding Microsoft Presidio for PII detection and Prompt-Guard-2 for prompt-injection classification would harden the pipeline without changing the API surface.
- Redis SSE fan-out β live streaming is in-process today (one orchestrator replica is fine for the demo). A Redis-backed pub/sub transport would allow horizontal scaling of the orchestrator without changing call sites.
- Persona-distinctness regression eval β the personas are sharp today, but they drift quietly when prompts get edited. An automated nightly eval that flags voice collisions across the five sharks would catch drift before it ships.
- OG image generation for permalink social cards β debate permalinks exist (Postgres-backed), but the social card preview is the default Open Graph image. Generating a per-debate card with the verdict and shark portraits would make share-outs more compelling.
None of these are launch blockers. The full demo loop runs end-to-end.
| Capability | Status |
|---|---|
| Five investor personas with distinct voice contracts | shipped |
| Peer-to-peer A2A protocol (cross-framework) | shipped |
| MCP tool grounding (4 tools, rate-limited, fail-soft) | shipped |
| LangGraph + Google ADK + OpenAI SDK interop | shipped |
| Postgres-backed permalinks (survive restart) | shipped |
| Live SSE streaming of debate events | shipped |
| Input guardrails (size, PII, prompt-injection, moderation) | shipped |
| LangSmith trace propagation | shipped |
| Next.js frontend with typewriter reveal | shipped |
| Presidio / Prompt-Guard-2 / Redis SSE fan-out | deferred |
| Persona-distinctness regression eval | deferred |
| OG image generation for permalinks | deferred |
76 unit tests, no secrets in repo, CI clean.
- The A2A protocol for the open agent-to-agent standard that makes Phase 2 possible.
- The Model Context Protocol for the tool-server spec.
- LangGraph + LangChain for the LLM orchestration plumbing.
MIT β see LICENSE.
See CONTRIBUTING.md. Persona changes (prompts, banned-phrase lists, signature phrases) are reviewed against the distinctness contract β if your changes make any persona sound more like a generic LLM assistant, the PR will be rejected.
Disclaimer. The five investor personas are fictional composites created for a simulation. They do not represent real people, real funds, or real capital. The verdicts you see are entertainment, not investment advice β even when they sound confident. The Chair will say PASS on perfectly good companies and INVEST on bad ones. The spectacle is in the disagreement, not the accuracy.