Adds a probe runner that asks the same ClickHouse question N times with different personas, then checks if the generated queries agree on which tables to use. Disagreements surface schema ambiguity.

- probes/analysis.py: LLM-based table extraction + agreement scoring
- scripts/run_probes.py: standalone async runner with rich output
- scripts/plot_probe.py: convergence plots over time
- cases/probes.yaml: 31 probe questions seeded from Grafana dashboards
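The N-way agreement scoring could be sketched like this (a hypothetical helper, not the actual analysis.py code): given the table sets extracted from each run, score agreement as the share of runs that picked the modal table set.

```python
from collections import Counter

def agreement_score(table_sets: list[frozenset[str]]) -> float:
    """Fraction of runs agreeing with the most common table choice.

    1.0 means every run picked the same tables; values near 1/N mean
    total disagreement (each run picked something different).
    """
    if not table_sets:
        return 0.0
    counts = Counter(table_sets)
    _, modal_count = counts.most_common(1)[0]
    return modal_count / len(table_sets)

# Three runs, two of which agree on fct_block_head:
runs = [
    frozenset({"fct_block_head"}),
    frozenset({"fct_block_head"}),
    frozenset({"canonical_execution_block"}),
]
```

A score of 2/3 here corresponds to the "33% to 100% agreement" framing used later: with 3 runs, the possible scores are 1/3, 2/3, and 1.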
Runner auto-starts a local panda-server on :2481 using config-probe.yaml, falling back gracefully if the port is already in use. Use --url to skip auto-start and connect to an existing server instead. Also fixes a {network} KeyError in table extraction and switches to LLM-based extraction via OpenRouter.
The native server takes 5+ min on first run for EIP embedding. Use the --local-server flag to start a native server instead.
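The port-in-use fallback can be checked with a quick TCP probe before starting the server. A minimal sketch (the helper name is illustrative; :2481 comes from config-probe.yaml above):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on a successful connection, an errno otherwise
        return sock.connect_ex((host, port)) == 0

# if port_in_use(2481): connect to the existing server instead of starting one
```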
3 runs against haiku showing persistent disagreements on:

- max_block_size: 5 different tables across runs
- block_gas_used: fct_engine_new_payload vs canonical_execution_block
- orphaned_blocks_24h: total confusion, no convergence

avg_block_arrival_time and block_arrival_by_client converge well thanks to existing examples for fct_block_first_seen_by_node.
…biguity Probes for gas usage and block size were splitting across 4+ tables (canonical_execution_block, fct_prepared_block, fct_execution_block, etc.). Adding explicit examples steers both to fct_block_head on xatu-cbt, bringing block_gas_used and max_block_size from 33% to 100% agreement.
Probing showed disagreement on which table to use for orphaned block queries. Add block_status category with fct_block examples and negative guidance to steer away from fct_block_canonical/fct_block_head.
One knob controls everything: --concurrency N means at most N agents running simultaneously across all probes and personas (default 5). All probes fire concurrently; a semaphore gates individual agents. Also bumps sandbox max_sessions to 50.
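The fire-everything-then-gate pattern described above can be sketched with asyncio (function and argument names are illustrative, not the runner's actual API):

```python
import asyncio

async def run_agent(probe: str, persona: str) -> str:
    """Stand-in for a real agent invocation."""
    await asyncio.sleep(0.01)
    return f"{probe}/{persona}"

async def run_all(probes, personas, concurrency: int = 5):
    # One knob: at most `concurrency` agents in flight at once.
    sem = asyncio.Semaphore(concurrency)

    async def gated(probe, persona):
        async with sem:
            return await run_agent(probe, persona)

    # Every probe x persona pair is scheduled concurrently;
    # the semaphore, not the task list, limits parallelism.
    tasks = [gated(p, q) for p in probes for q in personas]
    return await asyncio.gather(*tasks)
```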
Probe server tags its containers with instance=probe. When the probe runner stops, it kills only containers with that label, leaving the docker server's containers untouched.
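The label-scoped cleanup could look like this. A sketch: `docker ps --filter label=…` and `docker rm -f` are real docker CLI invocations, but the wrapper function (and its injectable `run` parameter, used here to keep it testable) is hypothetical:

```python
import subprocess

def kill_probe_containers(label: str = "instance=probe",
                          run=subprocess.run) -> list[str]:
    """Force-remove only containers carrying the probe label."""
    out = run(["docker", "ps", "-q", "--filter", f"label={label}"],
              capture_output=True, text=True, check=True)
    ids = out.stdout.split()
    if ids:
        # Scoped to the label filter above, so the docker server's
        # containers are never touched.
        run(["docker", "rm", "-f", *ids], check=True)
    return ids
```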
…t, cost, duration)
Schema discovery for xatu-cbt returned 0 tables because SHOW TABLES requires a default database, which xatu-cbt doesn't have. Tables live in per-network databases (e.g. mainnet.fct_block_head). When SHOW TABLES returns empty, fall back to SHOW DATABASES + SHOW TABLES FROM <db> to discover tables across per-network databases. Networks are derived from the database names. Also removes fetchTableNetworks which ran SELECT DISTINCT meta_network_name against every table — a full table scan that made schema discovery take 10+ minutes. Schema fetch now completes in ~30s.
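The fallback logic could be sketched as follows (the `query` callable stands in for a real ClickHouse client; the system-database skip list is an assumption):

```python
def discover_tables(query) -> dict[str, list[str]]:
    """Map database (network) name -> tables.

    `query` is a callable that runs a SQL statement and returns rows.
    Tries SHOW TABLES first; on an empty result (no default database),
    walks SHOW DATABASES and queries each one individually.
    """
    system_dbs = {"system", "default",
                  "INFORMATION_SCHEMA", "information_schema"}
    tables = query("SHOW TABLES")
    if tables:
        return {"default": tables}
    # Empty result: tables live in per-network databases instead.
    result = {}
    for db in query("SHOW DATABASES"):
        if db in system_dbs:
            continue
        result[db] = query(f"SHOW TABLES FROM {db}")
    return result
```

Networks fall out of the database names (the keys of the returned dict), avoiding any per-table `SELECT DISTINCT` scan.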
When type is omitted from the search tool, the search fans out across examples, runbooks, and EIPs and returns combined results. This means the model finds relevant runbooks even when it only thinks to search for examples.
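The fan-out behaviour could be sketched like this (corpus names match the three types above; the function shape and substring matching are illustrative assumptions, not the tool's actual implementation):

```python
from typing import Optional

def search(term: str, corpora: dict[str, list[str]],
           kind: Optional[str] = None) -> list[tuple[str, str]]:
    """Search one corpus when `kind` is given, else fan out across all."""
    targets = [kind] if kind else list(corpora)
    hits = []
    for name in targets:
        for doc in corpora[name]:
            if term.lower() in doc.lower():
                hits.append((name, doc))
    return hits

corpora = {
    "examples": ["block gas usage example"],
    "runbooks": ["gas spike runbook"],
    "eips": ["EIP-4844 blobs"],
}
```

A query for "gas" with no type returns hits from both examples and runbooks, which is exactly the case where the combined search pays off.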
Add attestation agreement example mapping the concept to fct_attestation_correctness_head. Add runbook for correlating blob gossip propagation with engine_getBlobs success rates across clusters.
Adds a probe runner that asks the same ClickHouse question multiple times with different personas, then checks if the generated queries agree on which tables to use. Disagreements surface schema ambiguity that can be fixed by adding examples or runbooks.
- tests/eval/scripts/run_probes.py: standalone async runner with concurrent agent execution, live progress, and timestamped JSON results
- tests/eval/probes/analysis.py: LLM-based table extraction (Gemini Flash via OpenRouter) and N-way agreement scoring
- tests/eval/cases/probes.yaml: 40 probe questions seeded from Grafana dashboards, alerts, and the notebooks repo
- tests/eval/scripts/plot_probe.py: convergence plots over time
- .claude/skills/self-play/SKILL.md: Claude Code skill that drives the fix loop: run probes, show disagreements, human picks the right table, agent writes the fix
- tests/eval/config-probe.yaml: server config for local probe runs on :2481
- pkg/sandbox/docker.go + pkg/config/config.go: instance label on sandbox containers so the probe runner can clean up its own containers without touching the docker server's
- modules/clickhouse/examples.yaml: initial example fixes for block properties and orphaned block queries