feat: entropy loss metric, embedding enrichment, and example fixes for self-play probes #92
Merged
Add Shannon entropy as the primary metric for tracking schema ambiguity. Lower entropy = less confusion about which tables to use. This replaces binary agree/disagree with a continuous signal that trends over time.

Changes:
- Add entropy computation to probe analysis and results JSON
- Add --tag filter for domain-level probe runs (e.g. --tag blobs)
- Propagate probe tags into result JSON for traceability
- Rewrite plot dashboard around entropy (trend line + heatmap)
- Backfill entropy for older results without it
- Add execution_traces category with int_transaction_call_frame examples
- Add missed slots and empty blob detection examples
- Add runbook for block-number-partitioned table queries
- Rewrite self-play skill for autonomous schema-informed resolution
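The entropy computation itself is standard Shannon entropy over the tables the personas picked. A minimal sketch (the function name and input shape are illustrative, not the actual implementation):

```python
from collections import Counter
from math import log2

def table_choice_entropy(choices):
    """Shannon entropy (bits) of the personas' table choices.

    0.0 when every persona picks the same table (all agree);
    log2(n) when all n personas disagree.
    """
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# All five personas agree: zero entropy
assert table_choice_entropy(["fct_block"] * 5) == 0.0

# A 3/1/1 split over three candidate tables: ~1.37 bits
split = ["fct_block"] * 3 + ["canonical_beacon_block",
                             "beacon_api_eth_v1_events_block"]
assert round(table_choice_entropy(split), 2) == 1.37

# Full disagreement among 5 personas: log2(5) ~ 2.32 bits
assert round(table_choice_entropy(list("abcde")), 2) == 2.32
```

Under this reading, the thresholds quoted below fall out of 5-persona splits: 1.37 bits is a 3/1/1 split, 1.92 is a 2/1/1/1 split, and 2.32 is total disagreement.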
Schema-informed fixes for probes with entropy >= 1.37:
- Fix wrong table name in gossip example (libp2p_gossipsub_beacon_aggregate_and_proof -> libp2p_gossipsub_aggregate_and_proof)
- Add reorg investigation example (fct_block + fct_block_first_seen_by_node)
- Add parent distance distribution example (fct_block self-join)
- Add getBlobs success rate distribution example (fct_engine_get_blobs_by_slot)
- Add head accuracy by entity vs blob count example
- Add per-node peer count estimate example (libp2p_connected)
- Update SKILL.md with commit/rollback evaluation step
Reduce schema confusion by clarifying which tables to use:
- Validator section: canonical_beacon_validators alone for counts/status
- Entity section: prefer pre-aggregated entity tables over per-validator
- MEV section: clarify mev_relay_bid_trace for bid values, fct_block_mev_head for market share
- Add MEV bid value distribution example
- Fix gossip table name (already committed but ensuring consistency)
- Add timing games by entity example (fct_block_first_seen_by_node + fct_block_proposer_entity)
- Expand MEV description to enumerate all real MEV tables and their purposes
- Addresses timing_games_by_entity (1.92) and mev_bid_value_distribution (1.37) probes
…ations)

Critical fixes:
- Remove fct_mev_bid_value example (table doesn't exist, was teaching hallucination)
- Replace fct_attestation_correctness_by_entity_head (doesn't exist) with fct_attestation_correctness_by_validator_head + ethseer_validator_entity join
- Fix entity_analysis category description to stop recommending non-existent table
- Deduplicate "Attestation participation rate" (rename xatu version)
- Deduplicate "Block arrival by consensus client" (rename network_health version)
- Remove default. prefix from xatu queries for consistency
Baseline: avg entropy 0.5756 (16/40 agreed)
Current: avg entropy 0.5325 (17/40 agreed)

Key improvements from this session:
- precompile_gas_usage: 0.20 agreement -> 1.00 (int_transaction_call_frame)
- slot_13505944_reorg: 2.32 entropy -> 0.00 (fct_block + fct_block_first_seen_by_node)
- aggregate_propagation_timing: 1.52 -> 0.00 (fixed table name typo)
- mev_bid_value_distribution: 1.37 -> 0.00 (removed poisoned fct_mev_bid_value example)
- head_accuracy_vs_blob_count: 1.92 -> 0.72
- entity_performance_comparison: fixed non-existent table reference
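The backfill step mentioned above could be as simple as recomputing entropy for any result file missing the field. A sketch under assumed file layout (one JSON per probe run, with a table_choices list; neither the layout nor the key names are taken from the repository):

```python
import json
import pathlib
from collections import Counter
from math import log2

def entropy(choices):
    """Shannon entropy (bits) over persona table choices."""
    counts = Counter(choices)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def backfill(results_dir):
    """Add an 'entropy' field to older result JSONs that lack one."""
    for path in pathlib.Path(results_dir).glob("*.json"):
        result = json.loads(path.read_text())
        if "entropy" not in result:
            result["entropy"] = entropy(result["table_choices"])
            path.write_text(json.dumps(result, indent=2))
```

Recomputing from stored choices (rather than storing only the final number) keeps old and new runs comparable if the metric definition changes.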
The embedding text for query examples now includes category name, cluster,
and table names extracted from the SQL query. Previously only the example
name and description were embedded, making it impossible for semantic search
to distinguish examples using different tables for similar questions.
Before: "MEV bid value distribution. Get the distribution of MEV bid values"
After: "MEV Analysis: MEV bid value distribution. Distribution of MEV bid
values... Cluster: xatu. Tables: mev_relay_bid_trace"
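One way to build the enriched text is to pull table names straight out of the SQL with a FROM/JOIN regex. A sketch; the regex, field names, and formatting are assumptions based on the before/after strings above, not the actual implementation:

```python
import re

# Capture identifiers following FROM or JOIN (handles db.table prefixes)
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def embedding_text(example):
    """Compose the text to embed: category, name, description,
    cluster, and the table names extracted from the SQL."""
    tables = sorted({t.split(".")[-1] for t in TABLE_RE.findall(example["sql"])})
    return (f"{example['category']}: {example['name']}. "
            f"{example['description']} "
            f"Cluster: {example['cluster']}. Tables: {', '.join(tables)}")

example = {
    "category": "MEV Analysis",
    "name": "MEV bid value distribution",
    "description": "Distribution of MEV bid values...",
    "cluster": "xatu",
    "sql": "SELECT value FROM mev_relay_bid_trace",
}
print(embedding_text(example))
# MEV Analysis: MEV bid value distribution. Distribution of MEV bid
# values... Cluster: xatu. Tables: mev_relay_bid_trace
```

Embedding the table names directly is what lets semantic search separate two examples that answer similar questions against different tables.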
avg entropy: 0.5756 (baseline) -> 0.4993 (now), a 13.2% reduction
zero-entropy probes: 16 -> 23 out of 40
…ttestation

The previous example used length(aggregation_bits) from beacon_api_eth_v1_events_attestation, which gives the hex string length, not the number of attesting validators. canonical_beacon_elaborated_attestation has a validators array, so length(validators) gives the correct count.
Fixed broken example that used length(aggregation_bits) string length instead of length(validators) array count. Switched to canonical_beacon_elaborated_attestation which has the validators array. Result: validators_per_included_attestation 1.92 -> 0.00 (all 5 agreed)
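The distinction is easy to see outside SQL: length of a hex string counts characters, while length of an array counts elements. A toy Python illustration (values are made up):

```python
# What the broken example measured: length() of the aggregation_bits
# hex string counts characters, not attesting validators.
aggregation_bits = "0xffff01"          # toy hex value
assert len(aggregation_bits) == 8      # 8 characters, meaningless as a count

# What the fixed example measures: canonical_beacon_elaborated_attestation
# exposes a validators array, so length(validators) is a real count.
validators = [101, 207, 333, 415]      # toy validator indices
assert len(validators) == 4            # 4 attesting validators
```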
…uation

The validator status distribution probe keeps adding dim_validator_status (lifecycle transitions) instead of using canonical_beacon_validators alone (current state). Enriched descriptions to explicitly name the correct table and explain why dim_validator_status is not appropriate for current state queries.
Best run yet. Down from 0.5756 baseline (-20.5%). 24 probes at zero entropy, up from 16.
Signed-off-by: Sam Calder-Mason <sam@puritydev.io>
Adds entropy as a loss metric for self-play schema probing. Entropy measures table choice confusion across personas — 0 means all agree, higher means more disagreement. Average entropy went from 0.5756 to 0.4993 (-13%), zero-entropy probes from 16 to 23 out of 40.
Changes:
- --tag filter on the probe runner for domain-level testing (e.g. --tag blobs)
- Removed examples referencing non-existent tables (fct_mev_bid_value, fct_attestation_correctness_by_entity_head)
- Fixed gossip table name (libp2p_gossipsub_beacon_aggregate_and_proof -> libp2p_gossipsub_aggregate_and_proof)
- New examples: precompile gas usage (int_transaction_call_frame), reorg investigation, parent distance, timing games, getBlobs success rate, peer count estimation, head accuracy by entity