Skip to content

chaubes/sharkhouse

Repository files navigation

sharkhouse

Submit a startup pitch. Watch five distinct AI investor personas debate it peer-to-peer over A2A. The Chair synthesises a verdict.

A self-contained, fully-local agentic web app that demonstrates real peer-to-peer Agent-to-Agent (A2A) protocol use, MCP tool grounding, multi-agent orchestration, and cost-aware production deployment β€” all runnable from docker compose up on your laptop.

git clone <this-repo> && cd sharkhouse
cp .env.example .env   # set OPENAI_API_KEY
docker compose up -d   # 9 services, ~3 min cold build, ~30 s warm boot
open http://localhost:3000

A full debate runs in 60–90 s, costs ~$0.03–0.05 against gpt-4o-mini, and ends with a colour-coded verdict you can permalink.


What it is

The five "sharks" are fictional investor archetypes built for a simulation β€” not real people, not real funds, no real capital. Each has a sharp opinion. Most of them disagree. The point isn't accuracy; the point is the disagreement.

Shark Archetype What they push on
Vera The YC Believer Founder-market fit. "Has the founder lived the problem?"
Hiro The Deep-Tech Skeptic Technical defensibility. "This isn't a moat. Anyone with a weekend and an API key could ship this."
Mira The Distribution Hawk CAC and channel. "No channel, no deal."
Derek The Contrarian Grump Market structure, valuation discipline. Quotes 19th-century economists unironically.
Priya The Mission Investor Founder integrity, second-order effects. "And then what?"

Plus The Chair β€” a non-participating moderator that picks the most- divergent pairs for peer debate and synthesises the final verdict.


Three frameworks, one wire protocol

The real demonstration of A2A's value is framework-agnostic interop. The five investors are implemented with three different agent frameworks:

Investor Framework Why
Vera, Mira, Priya OpenAI SDK (direct) Minimalist reference implementation via our StructuredLLM wrapper β€” one less abstraction when calling models
Hiro LangGraph (create_react_agent) + langchain-mcp-adapters Defensibility checks are a natural ReAct loop β€” think β†’ maybe-call-tools β†’ think β†’ output
Derek Google ADK (LlmAgent + LiteLlm) Declarative output_schema-driven structured output; LiteLLM bridges ADK to OpenAI

The Chair's A2A client calls all five identically. It doesn't know β€” and doesn't need to know β€” which framework each investor uses internally. Peer debates in Phase 2 cross framework boundaries transparently (Hiro on LangGraph can argue with Derek on ADK, because both speak the same A2A wire protocol). That's the point of an open protocol: pick the framework that fits your agent, keep interop for free.

See docs/architecture.md for the detailed framework breakdown and sequence diagrams.


A four-phase debate

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Phase 1: 5 openings in parallel                                     β”‚
β”‚           Each shark reads the pitch independently. May call MCP     β”‚
β”‚           tools (TAM, unit economics, comparables, web search) to    β”‚
β”‚           ground their critique.                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Phase 2: Peer debate (the real A2A β€” see below)                     β”‚
β”‚           Chair picks the 2 most-divergent investor pairs by signed  β”‚
β”‚           (position, confidence) score. For each pair, the more-     β”‚
β”‚           confident side initiates and drives 4 turns of back-and-   β”‚
β”‚           forth. Chair drops out of the message path.                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Phase 3: 5 closings in parallel                                     β”‚
β”‚           Each shark sees the full Phase 1+2 transcript. Emits a     β”‚
β”‚           final position + a confidence delta vs their opening.      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                                  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Phase 4: Chair verdict                                              β”‚
β”‚           No A2A. The Chair runs a single LLM synthesis across the   β”‚
β”‚           full debate state and emits a Verdict artifact: overall    β”‚
β”‚           position, consensus strength, summary, notable moments.    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Total: ~19 LLM calls per debate, ~60–90 s wall clock, ~$0.03–0.05 on gpt-4o-mini. Hard caps enforced at the orchestrator layer β€” see Hard budgets.


How A2A actually works here

This is the unique angle. Most "multi-agent" systems are an orchestrator that loops over tool calls and calls itself collaborative. Sharkhouse is not that. Phase 2 uses the official Google A2A protocol for genuine peer-to-peer agent communication.

Each investor is a fully A2A-compliant server. It exposes:

  • GET /.well-known/agent-card.json β€” the standard A2A discovery endpoint
  • POST /a2a β€” JSON-RPC entrypoint with the v0.3 method set (message/send, tasks/get, …)
  • Five declared skills: assess_pitch, challenge, respond, closing_statement, initiate_debate

Each investor is also an A2A client of the others. On boot, every investor process fetches the agent cards of the four other investors from their /.well-known/agent-card.json URLs and caches the parsed cards. There is no hardcoded URL list β€” peers are discovered.

In Phase 2, the Chair drops out of the message path. The Chair sends ONE A2A message per pair (initiate_debate(target=X, max_turns=4)) to the chosen initiator. The initiator then drives the loop entirely peer-to-peer:

Chair β†’ Vera   : initiate_debate(target=derek, max_turns=4)
Vera  β†’ Derek  : challenge(opponent_assessment=...)              [turn 1]
Derek β†’ Vera   : (response artifact returned)                    [turn 2]
Vera  β†’ Derek  : respond(prior_turn=...)                         [turn 3]
Derek β†’ Vera   : (response artifact returned)                    [turn 4]

Each turn is a real A2A Task with its own taskId, sharing contextId = debate_id so the lineage is traceable. After every turn, the speaker fires a debate_event webhook to the Chair so the UI can show the debate progressing. Hard turn-cap enforcement re-checks on every iteration body, not just at entry β€” covered by the test suite.

The implementation is initiator-driven: the chosen investor makes every outbound A2A call (target only responds). A fully distributed variant (each side alternately initiating) is on the upgrade path; both are spec-compliant and operationally equivalent. Initiator-driven is significantly easier to reason about and budget-track.


MCP tools

A shared MCP server exposes four research tools every investor can call:

Tool What Backend Per-(debate Γ— investor) cap
unit_economics_check Sanity-test CAC, LTV, gross margin, payback. PASS/WARN/FAIL with quotable summary. Pure Python 3
tam_sanity_check Compare claimed TAM to a curated reference set. plausible / inflated / absurd. Bundled JSON (14 categories) 2
comparables_lookup Filter bundled startup snapshot by sector + stage. Bundled JSON (~25 companies, 11 sectors) 2
web_search Web search for founder background / category checks. Tavily (free 1000/mo). Falls back to "disabled" no-op when no key. 2

Tools speak streamable HTTP MCP; investors connect via the official mcp Python SDK. Server-side rate limits are enforced per (debate_id, investor_id, tool_name) β€” never via prompts. Calls debit the per-debate DebateBudget on the investor side too, so a single investor can't exhaust the per-debate tool budget by hammering one tool.

The tool integration uses OpenAI's function-calling loop in the investor's LangGraph: the model decides whether to call a tool, the executor invokes the MCP tool, the result feeds back into the model context, and the final structured assessment quotes the tool output verbatim. Mira and Derek tend to lean on unit_economics_check and tam_sanity_check; Hiro reaches for comparables_lookup; Priya uses web_search sparingly for character context.


Hard budgets

Every budget below is enforced at the orchestrator layer, NOT via prompts. Breach is a graceful failure (the debate ends with a clear reason) β€” never a runaway.

Cap v1 default
Peer-debate turns per pair 4
Peer-debate pairs per debate 2
Total A2A calls per debate 20
Output tokens per call (investors) 500
Output tokens per call (Chair) 1500
Output tokens per debate 15,000
Tool calls per investor per debate 5
Wall clock per debate 180 s
Concurrent debates (process-wide) 8
Per-IP debates per day (anonymous) 3

Each budget has a unit test that confirms the cap triggers correctly. At ~$0.05 per debate, even a Hacker News front-page burst of 5,000 debates costs ~$250 β€” plus the platform-level OpenAI spend cap is the second independent ceiling.


Input guardrails

Every pitch passes through three gates BEFORE any investor sees it:

  1. Size β€” 200–5,000 chars (Pydantic + explicit check)
  2. Prompt-injection heuristics β€” regex coverage for the five most- attempted patterns: ignore previous instructions, persona swaps, system-prompt leaks, jailbreak markers, "you are now …". Reject on match.
  3. PII detection β€” regex for email, US/intl phone, SSN, credit card. On match, return a redacted version to the user and require explicit confirmation (confirmed_redacted=true) before the pitch goes anywhere near the model.
  4. OpenAI moderation β€” free omni-moderation-latest API. Reject on any flagged category. Fails OPEN on network error so a moderation outage doesn't block legitimate users.

Outputs are post-validated against Pydantic schemas with bounded retries β€” every artifact (InvestorAssessment, InvestorChallenge, InvestorResponse, ClosingStatement, Verdict) has a strict contract the LLM must satisfy or the call retries with a corrective turn.

Upgrade path documented in code: Presidio for richer PII coverage, Prompt-Guard-2 for stronger injection detection, defamation NER. None are blocking for v1.


Repo layout

sharkhouse/
β”œβ”€β”€ frontend/                     # Next.js 15 + React 19 + Tailwind
β”‚   β”œβ”€β”€ app/                      # App-router pages: /, /pitch, /debate/[id]
β”‚   β”œβ”€β”€ components/               # InvestorOpening, PeerTurn, ClosingStatement, VerdictPanel, …
β”‚   └── lib/api.ts                # Typed orchestrator client
β”‚
β”œβ”€β”€ packages/                     # Python uv workspace
β”‚   β”œβ”€β”€ core/sharkhouse_core/
β”‚   β”‚   β”œβ”€β”€ models.py             # Pydantic contracts (every artifact)
β”‚   β”‚   β”œβ”€β”€ budgets.py            # DebateBudget hard-cap enforcement
β”‚   β”‚   β”œβ”€β”€ settings.py           # Typed Pydantic Settings
β”‚   β”‚   β”œβ”€β”€ llm.py                # OpenAI wrapper with budget + retries +
β”‚   β”‚   β”‚                         #   tool-calling loop + LangSmith hook
β”‚   β”‚   β”œβ”€β”€ mcp_client.py         # MCP toolset wrapper (streamable HTTP)
β”‚   β”‚   β”œβ”€β”€ a2a_client.py         # Wraps a2a-sdk for Chair-side calls
β”‚   β”‚   β”œβ”€β”€ events.py             # debate_event webhook helper
β”‚   β”‚   └── guardrails.py         # PII + injection + moderation pipeline
β”‚   β”‚
β”‚   β”œβ”€β”€ orchestrator/             # The Chair (FastAPI)
β”‚   β”‚   └── orchestrator/
β”‚   β”‚       β”œβ”€β”€ main.py           # POST /api/pitches, GET /api/debates/{id}, /events webhook
β”‚   β”‚       └── chair.py          # Pair selection, phase orchestration, verdict synthesis
β”‚   β”‚
β”‚   β”œβ”€β”€ investors/                # Five A2A servers + shared template
β”‚   β”‚   β”œβ”€β”€ _template/            # Shared base: executor, persona, peer client, peer prompts
β”‚   β”‚   β”œβ”€β”€ vera/  hiro/  mira/  derek/  priya/
β”‚   β”‚   β”‚   β”œβ”€β”€ prompts.py        # Persona contract
β”‚   β”‚   β”‚   β”œβ”€β”€ persona.py        # Persona dataclass (system prompt + signature/banned phrases)
β”‚   β”‚   β”‚   β”œβ”€β”€ agent_card.json   # A2A skill manifest
β”‚   β”‚   β”‚   └── __main__.py       # uvicorn entrypoint
β”‚   β”‚
β”‚   β”œβ”€β”€ mcp_tools/mcp_server/     # FastMCP server with 4 tools
β”‚   └── cli/sharkhouse_cli/       # `sharkhouse pitch submit` CLI
β”‚
β”œβ”€β”€ samples/                      # 3 test pitches: dentist_crm, saas_with_numbers, weak_pitch
β”œβ”€β”€ infra/docker/                 # Shared Python base Dockerfile
β”œβ”€β”€ docker-compose.yml            # 9 services (5 investors + orchestrator + mcp + postgres + frontend)
β”œβ”€β”€ scripts/                      # a2a_raw_client.py
└── docs/architecture.md          # Deeper diagrams + sequence flows

Local development

# First time
uv sync --all-packages
make setup                  # installs pre-commit + verifies tools

# Run the full stack
docker compose up -d        # 9 services come up healthy in ~30 s

# Or run a service standalone (faster iteration)
INVESTOR_URL_VERA=http://localhost:8100 PORT=8100 uv run python -m vera

# Tests
uv run pytest               # 76 unit tests, ~2 s

# Lint
uv run ruff check .

# Type check (core only by default; expand per phase)
uv run mypy packages/core/sharkhouse_core

# CLI flow against a running stack
ORCHESTRATOR_PUBLIC_URL=http://localhost:8000 \
  uv run sharkhouse pitch submit --from samples/saas_with_numbers.txt

Host port collisions

If something on your machine already owns port 3000, 5432, or 8000:

SHARKHOUSE_HOST_PORT_FRONTEND=3010 \
SHARKHOUSE_HOST_PORT_ORCHESTRATOR=8500 \
SHARKHOUSE_HOST_PORT_POSTGRES=55432 \
NEXT_PUBLIC_ORCHESTRATOR_URL=http://localhost:8500 \
docker compose up -d

The orchestrator + Postgres are the only services bound to the host; the five investors and MCP server are compose-network-only (use docker compose exec <svc> to debug from outside).

Required env vars

Only OPENAI_API_KEY is mandatory. Everything else has a working local default. See .env.example for the full list including the optional Tavily key (web search), LangSmith trace credentials, and budget overrides.


Roadmap β€” deferred work

The product is feature-complete for v1. The items below are intentionally deferred and make obvious contribution targets:

  • Presidio + Prompt-Guard-2 β€” current input guardrails use lightweight regex heuristics + the OpenAI moderation API. Adding Microsoft Presidio for PII detection and Prompt-Guard-2 for prompt-injection classification would harden the pipeline without changing the API surface.
  • Redis SSE fan-out β€” live streaming is in-process today (one orchestrator replica is fine for the demo). A Redis-backed pub/sub transport would allow horizontal scaling of the orchestrator without changing call sites.
  • Persona-distinctness regression eval β€” the personas are sharp today, but they drift quietly when prompts get edited. An automated nightly eval that flags voice collisions across the five sharks would catch drift before it ships.
  • OG image generation for permalink social cards β€” debate permalinks exist (Postgres-backed), but the social card preview is the default Open Graph image. Generating a per-debate card with the verdict and shark portraits would make share-outs more compelling.

None of these are launch blockers. The full demo loop runs end-to-end.


Project status

Capability Status
Five investor personas with distinct voice contracts shipped
Peer-to-peer A2A protocol (cross-framework) shipped
MCP tool grounding (4 tools, rate-limited, fail-soft) shipped
LangGraph + Google ADK + OpenAI SDK interop shipped
Postgres-backed permalinks (survive restart) shipped
Live SSE streaming of debate events shipped
Input guardrails (size, PII, prompt-injection, moderation) shipped
LangSmith trace propagation shipped
Next.js frontend with typewriter reveal shipped
Presidio / Prompt-Guard-2 / Redis SSE fan-out deferred
Persona-distinctness regression eval deferred
OG image generation for permalinks deferred

76 unit tests, no secrets in repo, CI clean.


Acknowledgements

  • The A2A protocol for the open agent-to-agent standard that makes Phase 2 possible.
  • The Model Context Protocol for the tool-server spec.
  • LangGraph + LangChain for the LLM orchestration plumbing.

License

MIT β€” see LICENSE.


Contributing

See CONTRIBUTING.md. Persona changes (prompts, banned-phrase lists, signature phrases) are reviewed against the distinctness contract β€” if your changes make any persona sound more like a generic LLM assistant, the PR will be rejected.


Disclaimer. The five investor personas are fictional composites created for a simulation. They do not represent real people, real funds, or real capital. The verdicts you see are entertainment, not investment advice β€” even when they sound confident. The Chair will say PASS on perfectly good companies and INVEST on bad ones. The spectacle is in the disagreement, not the accuracy.

About

A public, agentic web app where founders submit a startup pitch and watch five AI "investor" agents

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors