feat: entropy loss metric, embedding enrichment, and example fixes for self-play probes #92
Merged
Add Shannon entropy as the primary metric for tracking schema ambiguity. Lower entropy = less confusion about which tables to use. This replaces binary agree/disagree with a continuous signal that trends over time.

Changes:
- Add entropy computation to probe analysis and results JSON
- Add --tag filter for domain-level probe runs (e.g. --tag blobs)
- Propagate probe tags into result JSON for traceability
- Rewrite plot dashboard around entropy (trend line + heatmap)
- Backfill entropy for older results without it
- Add execution_traces category with int_transaction_call_frame examples
- Add missed slots and empty blob detection examples
- Add runbook for block-number-partitioned table queries
- Rewrite self-play skill for autonomous schema-informed resolution
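The entropy computation itself is standard Shannon entropy over the tables the personas picked. A minimal sketch (the function name and input shape are illustrative, not the actual implementation):

```python
from collections import Counter
from math import log2

def table_choice_entropy(choices):
    """Shannon entropy (bits) of the personas' table choices.

    0.0 when every persona picks the same table (all agree);
    log2(n) when all n personas disagree.
    """
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# All five personas agree: zero entropy
assert table_choice_entropy(["fct_block"] * 5) == 0.0

# A 3/1/1 split over three candidate tables: ~1.37 bits
split = ["fct_block"] * 3 + ["canonical_beacon_block",
                             "beacon_api_eth_v1_events_block"]
assert round(table_choice_entropy(split), 2) == 1.37

# Full disagreement among 5 personas: log2(5) ~ 2.32 bits
assert round(table_choice_entropy(list("abcde")), 2) == 2.32
```

Under this reading, the thresholds quoted below fall out of 5-persona splits: 1.37 bits is a 3/1/1 split, 1.92 is a 2/1/1/1 split, and 2.32 is total disagreement.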
Schema-informed fixes for probes with entropy >= 1.37:
- Fix wrong table name in gossip example (libp2p_gossipsub_beacon_aggregate_and_proof -> libp2p_gossipsub_aggregate_and_proof)
- Add reorg investigation example (fct_block + fct_block_first_seen_by_node)
- Add parent distance distribution example (fct_block self-join)
- Add getBlobs success rate distribution example (fct_engine_get_blobs_by_slot)
- Add head accuracy by entity vs blob count example
- Add per-node peer count estimate example (libp2p_connected)
- Update SKILL.md with commit/rollback evaluation step
Reduce schema confusion by clarifying which tables to use:
- Validator section: canonical_beacon_validators alone for counts/status
- Entity section: prefer pre-aggregated entity tables over per-validator
- MEV section: clarify mev_relay_bid_trace for bid values, fct_block_mev_head for market share
- Add MEV bid value distribution example
- Fix gossip table name (already committed but ensuring consistency)
- Add timing games by entity example (fct_block_first_seen_by_node + fct_block_proposer_entity)
- Expand MEV description to enumerate all real MEV tables and their purposes
- Addresses timing_games_by_entity (1.92) and mev_bid_value_distribution (1.37) probes
…ations)

Critical fixes:
- Remove fct_mev_bid_value example (table doesn't exist, was teaching hallucination)
- Replace fct_attestation_correctness_by_entity_head (doesn't exist) with fct_attestation_correctness_by_validator_head + ethseer_validator_entity join
- Fix entity_analysis category description to stop recommending non-existent table
- Deduplicate "Attestation participation rate" (rename xatu version)
- Deduplicate "Block arrival by consensus client" (rename network_health version)
- Remove default. prefix from xatu queries for consistency
Baseline: avg entropy 0.5756 (16/40 agreed)
Current: avg entropy 0.5325 (17/40 agreed)

Key improvements from this session:
- precompile_gas_usage: 0.20 agreement -> 1.00 (int_transaction_call_frame)
- slot_13505944_reorg: 2.32 entropy -> 0.00 (fct_block + fct_block_first_seen_by_node)
- aggregate_propagation_timing: 1.52 -> 0.00 (fixed table name typo)
- mev_bid_value_distribution: 1.37 -> 0.00 (removed poisoned fct_mev_bid_value example)
- head_accuracy_vs_blob_count: 1.92 -> 0.72
- entity_performance_comparison: fixed non-existent table reference
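The backfill step mentioned above could be as simple as recomputing entropy for any result file missing the field. A sketch under assumed file layout (one JSON per probe run, with a table_choices list; neither the layout nor the key names are taken from the repository):

```python
import json
import pathlib
from collections import Counter
from math import log2

def entropy(choices):
    """Shannon entropy (bits) over persona table choices."""
    counts = Counter(choices)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def backfill(results_dir):
    """Add an 'entropy' field to older result JSONs that lack one."""
    for path in pathlib.Path(results_dir).glob("*.json"):
        result = json.loads(path.read_text())
        if "entropy" not in result:
            result["entropy"] = entropy(result["table_choices"])
            path.write_text(json.dumps(result, indent=2))
```

Recomputing from stored choices (rather than storing only the final number) keeps old and new runs comparable if the metric definition changes.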
The embedding text for query examples now includes category name, cluster,
and table names extracted from the SQL query. Previously only the example
name and description were embedded, making it impossible for semantic search
to distinguish examples using different tables for similar questions.
Before: "MEV bid value distribution. Get the distribution of MEV bid values"
After: "MEV Analysis: MEV bid value distribution. Distribution of MEV bid
values... Cluster: xatu. Tables: mev_relay_bid_trace"
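One way to build the enriched text is to pull table names straight out of the SQL with a FROM/JOIN regex. A sketch; the regex, field names, and formatting are assumptions based on the before/after strings above, not the actual implementation:

```python
import re

# Capture identifiers following FROM or JOIN (handles db.table prefixes)
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def embedding_text(example):
    """Compose the text to embed: category, name, description,
    cluster, and the table names extracted from the SQL."""
    tables = sorted({t.split(".")[-1] for t in TABLE_RE.findall(example["sql"])})
    return (f"{example['category']}: {example['name']}. "
            f"{example['description']} "
            f"Cluster: {example['cluster']}. Tables: {', '.join(tables)}")

example = {
    "category": "MEV Analysis",
    "name": "MEV bid value distribution",
    "description": "Distribution of MEV bid values...",
    "cluster": "xatu",
    "sql": "SELECT value FROM mev_relay_bid_trace",
}
print(embedding_text(example))
# MEV Analysis: MEV bid value distribution. Distribution of MEV bid
# values... Cluster: xatu. Tables: mev_relay_bid_trace
```

Embedding the table names directly is what lets semantic search separate two examples that answer similar questions against different tables.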
avg entropy: 0.5756 (baseline) -> 0.4993 (now), a 13.2% reduction
zero-entropy probes: 16 -> 23 out of 40
…ttestation

The previous example used length(aggregation_bits) from beacon_api_eth_v1_events_attestation, which gives the hex string length, not the number of attesting validators. canonical_beacon_elaborated_attestation has a validators array, so length(validators) gives the correct count.
Fixed broken example that used length(aggregation_bits) string length instead of length(validators) array count. Switched to canonical_beacon_elaborated_attestation which has the validators array. Result: validators_per_included_attestation 1.92 -> 0.00 (all 5 agreed)
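The distinction is easy to see outside SQL: length of a hex string counts characters, while length of an array counts elements. A toy Python illustration (values are made up):

```python
# What the broken example measured: length() of the aggregation_bits
# hex string counts characters, not attesting validators.
aggregation_bits = "0xffff01"          # toy hex value
assert len(aggregation_bits) == 8      # 8 characters, meaningless as a count

# What the fixed example measures: canonical_beacon_elaborated_attestation
# exposes a validators array, so length(validators) is a real count.
validators = [101, 207, 333, 415]      # toy validator indices
assert len(validators) == 4            # 4 attesting validators
```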
…uation

The validator status distribution probe keeps adding dim_validator_status (lifecycle transitions) instead of using canonical_beacon_validators alone (current state). Enriched descriptions to explicitly name the correct table and explain why dim_validator_status is not appropriate for current state queries.
Best run yet. Down from 0.5756 baseline (-20.5%). 24 probes at zero entropy, up from 16.
Signed-off-by: Sam Calder-Mason <sam@puritydev.io>
Adds entropy as a loss metric for self-play schema probing. Entropy measures table choice confusion across personas — 0 means all agree, higher means more disagreement. Average entropy went from 0.5756 to 0.4993 (-13%), zero-entropy probes from 16 to 23 out of 40.
Changes:
- --tag filter on the probe runner for domain-level testing (e.g. --tag blobs)
- Removed examples referencing non-existent tables (fct_mev_bid_value, fct_attestation_correctness_by_entity_head)
- Fixed gossip table name (libp2p_gossipsub_beacon_aggregate_and_proof -> libp2p_gossipsub_aggregate_and_proof)
- New examples: precompile gas usage (int_transaction_call_frame), reorg investigation, parent distance, timing games, getBlobs success rate, peer count estimation, head accuracy by entity