feat: schema probing via self-play #87

Merged
samcm merged 31 commits into master from playful-beaver-229 on Mar 18, 2026

samcm commented Mar 18, 2026

Adds a probe runner that asks the same ClickHouse question multiple times with different personas, then checks if the generated queries agree on which tables to use. Disagreements surface schema ambiguity that can be fixed by adding examples or runbooks.

  • tests/eval/scripts/run_probes.py — standalone async runner with concurrent agent execution, live progress, and timestamped JSON results
  • tests/eval/probes/analysis.py — LLM-based table extraction (Gemini Flash via OpenRouter) and N-way agreement scoring
  • tests/eval/cases/probes.yaml — 40 probe questions seeded from Grafana dashboards, alerts, and the notebooks repo
  • tests/eval/scripts/plot_probe.py — convergence plots over time
  • .claude/skills/self-play/SKILL.md — Claude Code skill that drives the fix loop: run probes, show disagreements, human picks the right table, agent writes the fix
  • tests/eval/config-probe.yaml — server config for local probe runs on :2481
  • pkg/sandbox/docker.go + pkg/config/config.go — instance label on sandbox containers so the probe runner can clean up its own containers without touching the docker server's
  • modules/clickhouse/examples.yaml — initial example fixes for block properties and orphaned block queries
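
The agreement check described above can be sketched as follows, assuming each persona's generated query has already been reduced to a set of table names. The function name `agreement_score` and the scoring rule (share of runs that match the most common table set) are illustrative assumptions, not necessarily what tests/eval/probes/analysis.py implements.

```python
from collections import Counter


def agreement_score(table_sets):
    """Return (score, majority) for one probe.

    table_sets holds one collection of table names per persona run.
    score is the share of runs whose table set equals the most common one.
    """
    if not table_sets:
        return 0.0, frozenset()
    # Normalise so ordering, case, and duplicates do not count as disagreement.
    normalised = [frozenset(t.lower() for t in tables) for tables in table_sets]
    majority, votes = Counter(normalised).most_common(1)[0]
    return votes / len(normalised), majority


# Three persona runs for one probe, two of which picked the same table.
runs = [
    {"fct_block_head"},
    {"canonical_execution_block"},
    {"fct_block_head"},
]
score, majority = agreement_score(runs)
print(f"{score:.0%} agree on {sorted(majority)}")
```

A probe whose score stays below 1.0 across runs is a candidate for a new example or runbook.
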

samcm added 30 commits March 18, 2026 09:56
Adds a probe runner that asks the same ClickHouse question N times
with different personas, then checks if the generated queries agree
on which tables to use. Disagreements surface schema ambiguity.

- probes/analysis.py: LLM-based table extraction + agreement scoring
- scripts/run_probes.py: standalone async runner with rich output
- scripts/plot_probe.py: convergence plots over time
- cases/probes.yaml: 31 probe questions seeded from Grafana dashboards

Runner auto-starts a local panda-server on :2481 using config-probe.yaml.
Falls back gracefully if port is already in use. Use --url to skip and
connect to an existing server instead. Also fixes {network} KeyError in
table extraction and uses LLM-based extraction via OpenRouter.
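
The port fallback could look roughly like this. `port_in_use` and `plan_server` are hypothetical helpers, the `http://127.0.0.1` URL form is an assumption, and the actual runner may detect the conflict differently. The sketch only shows the decision between starting a server and reusing one that is already listening.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising.
        return sock.connect_ex((host, port)) == 0


def plan_server(url=None, port=2481):
    """Decide whether to start a server and which URL the probes should use."""
    if url:
        # --url given: never start anything, just connect.
        return {"start": False, "url": url}
    local = f"http://127.0.0.1:{port}"
    # Reuse whatever is already listening rather than failing on bind.
    return {"start": not port_in_use(port), "url": local}
```
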
Native server takes 5+ min on first run for EIP embedding.
Use --local-server flag to start a native server instead.

3 runs against haiku showing persistent disagreements on:
- max_block_size: 5 different tables across runs
- block_gas_used: fct_engine_new_payload vs canonical_execution_block
- orphaned_blocks_24h: total confusion, no convergence

avg_block_arrival_time and block_arrival_by_client converge well
thanks to existing examples for fct_block_first_seen_by_node.

…biguity

Probes for gas usage and block size were splitting across 4+ tables
(canonical_execution_block, fct_prepared_block, fct_execution_block, etc).
Adding explicit examples steers to fct_block_head on xatu-cbt, bringing
both block_gas_used and max_block_size from 33% to 100% agreement.

Probing showed disagreement on which table to use for orphaned block
queries. Add block_status category with fct_block examples and negative
guidance to steer away from fct_block_canonical/fct_block_head.

One knob controls everything: --concurrency N means at most N agents
running simultaneously across all probes and personas. Default 5.
All probes fire concurrently, semaphore gates individual agents.
Also bumps sandbox max_sessions to 50.
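
A minimal sketch of the single-semaphore pattern described here, assuming an async `run_agent(probe, persona)` coroutine. Both that name and `run_all` are hypothetical; only the "fire everything, gate each agent" structure comes from the commit message.

```python
import asyncio


async def run_all(probes, personas, run_agent, concurrency=5):
    """Run every (probe, persona) pair with at most `concurrency` agents in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def gated(probe, persona):
        async with gate:  # waits here until one of the N slots is free
            return await run_agent(probe, persona)

    # Every pair is scheduled up front; the semaphore, not the loop,
    # limits how many agents actually run at once.
    tasks = [gated(probe, persona) for probe in probes for persona in personas]
    return await asyncio.gather(*tasks)
```

Because `asyncio.gather` preserves input order, results come back grouped by probe and then by persona regardless of which agent finishes first.
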
Probe server tags its containers with instance=probe. When the probe
runner stops, it kills only containers with that label, leaving the
docker server's containers untouched.
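
Label-scoped cleanup can be sketched with the docker CLI. The label key and value (`instance=probe`) come from the commit message; shelling out to `docker ps` and `docker rm` from Python, and the helper names, are assumptions about how a runner might do it.

```python
import subprocess


def list_cmd(instance):
    """docker ps invocation that lists only containers labelled instance=<instance>."""
    return ["docker", "ps", "-aq", "--filter", f"label=instance={instance}"]


def cleanup(instance="probe", run=subprocess.run):
    """Force-remove containers carrying the instance label and return their IDs."""
    listed = run(list_cmd(instance), capture_output=True, text=True, check=True)
    ids = listed.stdout.split()
    if ids:
        # Only IDs returned by the label filter are removed, so containers
        # started by any other server are never touched.
        run(["docker", "rm", "-f", *ids], check=True)
    return ids
```

The `run` parameter exists so the function can be exercised without a Docker daemon by passing a stub.
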
Schema discovery for xatu-cbt returned 0 tables because SHOW TABLES
requires a default database, which xatu-cbt doesn't have. Tables live
in per-network databases (e.g. mainnet.fct_block_head).

When SHOW TABLES returns empty, fall back to SHOW DATABASES + SHOW
TABLES FROM <db> to discover tables across per-network databases.
Networks are derived from the database names.

Also removes fetchTableNetworks which ran SELECT DISTINCT
meta_network_name against every table — a full table scan that made
schema discovery take 10+ minutes. Schema fetch now completes in ~30s.
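
The fallback can be sketched as below. The real change lives in the Go server; this Python version is only an illustration, with `query` standing in for a hypothetical callable that runs one statement and returns the first result column as a list of strings. Skipping system databases is an added assumption.

```python
SYSTEM_DATABASES = {"system", "information_schema"}  # assumption: never probe these


def discover_tables(query):
    """Return (tables, networks) using only metadata statements."""
    tables = query("SHOW TABLES")
    if tables:
        # A default database exists, so the plain listing is enough.
        return tables, []
    qualified, networks = [], []
    for db in query("SHOW DATABASES"):
        if db.lower() in SYSTEM_DATABASES:
            continue
        db_tables = query(f"SHOW TABLES FROM `{db}`")
        if db_tables:
            # Per-network layout: the database name doubles as the network name.
            networks.append(db)
            qualified.extend(f"{db}.{table}" for table in db_tables)
    return qualified, networks
```

This issues one statement per database and never reads table data, which is why it avoids the cost of a per-table SELECT DISTINCT.
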
When type is omitted, the search tool fans out across examples,
runbooks, and EIPs and returns combined results. This means the model
finds relevant runbooks even when it only thinks to search for examples.
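
A rough sketch of the fan-out behaviour, assuming each index is exposed as a callable that returns a list of hits. The `search` signature, the `backends` mapping, and the result shape are hypothetical and only illustrate the "omitted type means search everything" rule.

```python
def search(query, backends, search_type=None):
    """Search one index when search_type is given, otherwise all of them.

    backends maps a type name to a callable returning a list of hits; each
    hit is tagged with its type so combined results stay traceable.
    """
    if search_type is not None and search_type not in backends:
        raise ValueError(f"unknown search type: {search_type}")
    selected = [search_type] if search_type is not None else list(backends)
    results = []
    for name in selected:
        results.extend({"type": name, "hit": hit} for hit in backends[name](query))
    return results
```

Tagging each hit with its source type lets the caller tell a runbook from an example in the combined list.
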
Add attestation agreement example mapping the concept to
fct_attestation_correctness_head. Add runbook for correlating blob
gossip propagation with engine_getBlobs success rates across clusters.

samcm merged commit 014f3c7 into master on Mar 18, 2026
7 checks passed