Adds a probe runner that asks the same ClickHouse question N times with different personas, then checks if the generated queries agree on which tables to use. Disagreements surface schema ambiguity.

- probes/analysis.py: LLM-based table extraction + agreement scoring
- scripts/run_probes.py: standalone async runner with rich output
- scripts/plot_probe.py: convergence plots over time
- cases/probes.yaml: 31 probe questions seeded from Grafana dashboards
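The N-way agreement scoring could be sketched like this (a hypothetical helper, not the actual analysis.py code): given the table sets extracted from each run, score agreement as the share of runs that picked the modal table set.

```python
from collections import Counter

def agreement_score(table_sets: list[frozenset[str]]) -> float:
    """Fraction of runs agreeing with the most common table choice.

    1.0 means every run picked the same tables; values near 1/N mean
    total disagreement (each run picked something different).
    """
    if not table_sets:
        return 0.0
    counts = Counter(table_sets)
    _, modal_count = counts.most_common(1)[0]
    return modal_count / len(table_sets)

# Three runs, two of which agree on fct_block_head:
runs = [
    frozenset({"fct_block_head"}),
    frozenset({"fct_block_head"}),
    frozenset({"canonical_execution_block"}),
]
```

A score of 2/3 here corresponds to the "33% to 100% agreement" framing used later: with 3 runs, the possible scores are 1/3, 2/3, and 1.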
Runner auto-starts a local panda-server on :2481 using config-probe.yaml, falling back gracefully if the port is already in use. Use --url to skip auto-start and connect to an existing server instead. Also fixes a {network} KeyError in table extraction and switches to LLM-based extraction via OpenRouter.
The native server takes 5+ min on first run for EIP embedding. Use the --local-server flag to start a native server instead.
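The port-in-use fallback can be checked with a quick TCP probe before starting the server. A minimal sketch (the helper name is illustrative; :2481 comes from config-probe.yaml above):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on a successful connection, an errno otherwise
        return sock.connect_ex((host, port)) == 0

# if port_in_use(2481): connect to the existing server instead of starting one
```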
3 runs against haiku showing persistent disagreements on:

- max_block_size: 5 different tables across runs
- block_gas_used: fct_engine_new_payload vs canonical_execution_block
- orphaned_blocks_24h: total confusion, no convergence

avg_block_arrival_time and block_arrival_by_client converge well thanks to existing examples for fct_block_first_seen_by_node.
…biguity Probes for gas usage and block size were splitting across 4+ tables (canonical_execution_block, fct_prepared_block, fct_execution_block, etc.). Adding explicit examples steers both to fct_block_head on xatu-cbt, bringing block_gas_used and max_block_size from 33% to 100% agreement.
Probing showed disagreement on which table to use for orphaned block queries. Add block_status category with fct_block examples and negative guidance to steer away from fct_block_canonical/fct_block_head.
One knob controls everything: --concurrency N means at most N agents running simultaneously across all probes and personas (default 5). All probes fire concurrently; a semaphore gates individual agents. Also bumps sandbox max_sessions to 50.
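The fire-everything-then-gate pattern described above can be sketched with asyncio (function and argument names are illustrative, not the runner's actual API):

```python
import asyncio

async def run_agent(probe: str, persona: str) -> str:
    """Stand-in for a real agent invocation."""
    await asyncio.sleep(0.01)
    return f"{probe}/{persona}"

async def run_all(probes, personas, concurrency: int = 5):
    # One knob: at most `concurrency` agents in flight at once.
    sem = asyncio.Semaphore(concurrency)

    async def gated(probe, persona):
        async with sem:
            return await run_agent(probe, persona)

    # Every probe x persona pair is scheduled concurrently;
    # the semaphore, not the task list, limits parallelism.
    tasks = [gated(p, q) for p in probes for q in personas]
    return await asyncio.gather(*tasks)
```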
Probe server tags its containers with instance=probe. When the probe runner stops, it kills only containers with that label, leaving the docker server's containers untouched.
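The label-scoped cleanup could look like this. A sketch: `docker ps --filter label=…` and `docker rm -f` are real docker CLI invocations, but the wrapper function (and its injectable `run` parameter, used here to keep it testable) is hypothetical:

```python
import subprocess

def kill_probe_containers(label: str = "instance=probe",
                          run=subprocess.run) -> list[str]:
    """Force-remove only containers carrying the probe label."""
    out = run(["docker", "ps", "-q", "--filter", f"label={label}"],
              capture_output=True, text=True, check=True)
    ids = out.stdout.split()
    if ids:
        # Scoped to the label filter above, so the docker server's
        # containers are never touched.
        run(["docker", "rm", "-f", *ids], check=True)
    return ids
```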
…t, cost, duration)
Schema discovery for xatu-cbt returned 0 tables because SHOW TABLES requires a default database, which xatu-cbt doesn't have. Tables live in per-network databases (e.g. mainnet.fct_block_head). When SHOW TABLES returns empty, fall back to SHOW DATABASES + SHOW TABLES FROM <db> to discover tables across per-network databases. Networks are derived from the database names. Also removes fetchTableNetworks which ran SELECT DISTINCT meta_network_name against every table — a full table scan that made schema discovery take 10+ minutes. Schema fetch now completes in ~30s.
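The fallback logic could be sketched as follows (the `query` callable stands in for a real ClickHouse client; the system-database skip list is an assumption):

```python
def discover_tables(query) -> dict[str, list[str]]:
    """Map database (network) name -> tables.

    `query` is a callable that runs a SQL statement and returns rows.
    Tries SHOW TABLES first; on an empty result (no default database),
    walks SHOW DATABASES and queries each one individually.
    """
    system_dbs = {"system", "default",
                  "INFORMATION_SCHEMA", "information_schema"}
    tables = query("SHOW TABLES")
    if tables:
        return {"default": tables}
    # Empty result: tables live in per-network databases instead.
    result = {}
    for db in query("SHOW DATABASES"):
        if db in system_dbs:
            continue
        result[db] = query(f"SHOW TABLES FROM {db}")
    return result
```

Networks fall out of the database names (the keys of the returned dict), avoiding any per-table `SELECT DISTINCT` scan.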
When type is omitted from the search tool, the search fans out across examples, runbooks, and EIPs and returns combined results. This means the model finds relevant runbooks even when it only thinks to search for examples.
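The fan-out behaviour could be sketched like this (corpus names match the three types above; the function shape and substring matching are illustrative assumptions, not the tool's actual implementation):

```python
from typing import Optional

def search(term: str, corpora: dict[str, list[str]],
           kind: Optional[str] = None) -> list[tuple[str, str]]:
    """Search one corpus when `kind` is given, else fan out across all."""
    targets = [kind] if kind else list(corpora)
    hits = []
    for name in targets:
        for doc in corpora[name]:
            if term.lower() in doc.lower():
                hits.append((name, doc))
    return hits

corpora = {
    "examples": ["block gas usage example"],
    "runbooks": ["gas spike runbook"],
    "eips": ["EIP-4844 blobs"],
}
```

A query for "gas" with no type returns hits from both examples and runbooks, which is exactly the case where the combined search pays off.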
Add attestation agreement example mapping the concept to fct_attestation_correctness_head. Add runbook for correlating blob gossip propagation with engine_getBlobs success rates across clusters.
Adds a probe runner that asks the same ClickHouse question multiple times with different personas, then checks if the generated queries agree on which tables to use. Disagreements surface schema ambiguity that can be fixed by adding examples or runbooks.
- tests/eval/scripts/run_probes.py: standalone async runner with concurrent agent execution, live progress, and timestamped JSON results
- tests/eval/probes/analysis.py: LLM-based table extraction (Gemini Flash via OpenRouter) and N-way agreement scoring
- tests/eval/cases/probes.yaml: 40 probe questions seeded from Grafana dashboards, alerts, and the notebooks repo
- tests/eval/scripts/plot_probe.py: convergence plots over time
- .claude/skills/self-play/SKILL.md: Claude Code skill that drives the fix loop: run probes, show disagreements, human picks the right table, agent writes the fix
- tests/eval/config-probe.yaml: server config for local probe runs on :2481
- pkg/sandbox/docker.go + pkg/config/config.go: instance label on sandbox containers so the probe runner can clean up its own containers without touching the docker server's
- modules/clickhouse/examples.yaml: initial example fixes for block properties and orphaned block queries