feat: entropy loss metric, embedding enrichment, and example fixes for self-play probes#92

Merged
samcm merged 16 commits into master from smart-butterfly-882
Mar 25, 2026

samcm commented Mar 20, 2026

Adds entropy as a loss metric for self-play schema probing. Entropy measures how much the personas disagree on which tables to use: 0 means all personas agree, and higher values mean more disagreement. Average entropy fell from 0.5756 to 0.4993 (-13%), and zero-entropy probes rose from 16 to 23 out of 40.
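
As a rough illustration of the metric, here is a minimal Python sketch. It assumes base-2 logarithms and one table choice per persona; the function name and input shape are hypothetical and not taken from the probe code. The per-probe figures quoted in this PR (0.72, 1.37, 1.52, 1.92, 2.32) are consistent with base-2 entropy over five personas, which is why the sketch uses `log2`.

```python
import math
from collections import Counter


def table_choice_entropy(choices):
    """Shannon entropy in bits over the table choices made by each persona.

    `choices` holds one hashable item per persona, for example a table name
    or a frozenset of table names (hypothetical representation).
    """
    total = len(choices)
    if total == 0:
        return 0.0
    counts = Counter(choices)
    # Sum p * log2(1/p) over distinct choices; 0.0 when everyone agrees.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Five personas, placeholder choices "a".."e".
print(round(table_choice_entropy(["a", "a", "a", "a", "a"]), 2))  # 0.0
print(round(table_choice_entropy(["a", "a", "a", "a", "b"]), 2))  # 0.72
print(round(table_choice_entropy(["a", "b", "c", "d", "e"]), 2))  # 2.32
```

With five personas, a 4-1 split gives 0.72 and five different answers give 2.32, the base-2 logarithm of 5.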

Changes:

  • Shannon entropy in probe analysis, results JSON, CLI output, comparison tables, and the plot dashboard
  • --tag filter on the probe runner for domain-level testing (e.g. --tag blobs)
  • Enrich example search embeddings with category name, cluster, and table names extracted from SQL. Previously only the name and description were embedded, so search couldn't tell apart examples that use different tables
  • Remove two non-existent tables from examples that were teaching the model to hallucinate (fct_mev_bid_value, fct_attestation_correctness_by_entity_head)
  • Fix wrong gossip table name (libp2p_gossipsub_beacon_aggregate_and_proof -> libp2p_gossipsub_aggregate_and_proof)
  • Deduplicate example names that were confusing search results
  • New examples: execution traces (int_transaction_call_frame), reorg investigation, parent distance, timing games, getBlobs success rate, peer count estimation, head accuracy by entity
  • Runbook for querying block-number-partitioned tables
  • Rewrite self-play skill for autonomous schema-informed resolution with commit/rollback evaluation

samcm added 16 commits March 19, 2026 16:23
Add Shannon entropy as the primary metric for tracking schema ambiguity.
Lower entropy = less confusion about which tables to use. This replaces
binary agree/disagree with a continuous signal that trends over time.

Changes:
- Add entropy computation to probe analysis and results JSON
- Add --tag filter for domain-level probe runs (e.g. --tag blobs)
- Propagate probe tags into result JSON for traceability
- Rewrite plot dashboard around entropy (trend line + heatmap)
- Backfill entropy for older results without it
- Add execution_traces category with int_transaction_call_frame examples
- Add missed slots and empty blob detection examples
- Add runbook for block-number-partitioned table queries
- Rewrite self-play skill for autonomous schema-informed resolution
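
The tag filter and tag propagation listed above could look roughly like the sketch below. The `Probe` class, its fields, and the result-record keys are assumptions made for illustration; the actual probe runner may structure this differently. The `mev` tag is also a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Probe:
    # Hypothetical probe definition; field names are illustrative only.
    name: str
    question: str
    tags: list[str] = field(default_factory=list)


def filter_by_tag(probes, tag=None):
    """Return only probes carrying `tag`; with no tag, return all probes."""
    if tag is None:
        return list(probes)
    return [p for p in probes if tag in p.tags]


def result_record(probe, entropy):
    """Copy the probe's tags into its result so a run can be traced later."""
    return {"probe": probe.name, "tags": list(probe.tags), "entropy": entropy}


probes = [
    Probe("head_accuracy_vs_blob_count", "placeholder question", ["blobs"]),
    Probe("mev_bid_value_distribution", "placeholder question", ["mev"]),
]
selected = filter_by_tag(probes, "blobs")
print([p.name for p in selected])  # ['head_accuracy_vs_blob_count']
```

Carrying the tags into each result record means a later comparison can be restricted to one domain without re-reading the probe definitions.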

Schema-informed fixes for probes with entropy >= 1.37:

- Fix wrong table name in gossip example (libp2p_gossipsub_beacon_aggregate_and_proof
  -> libp2p_gossipsub_aggregate_and_proof)
- Add reorg investigation example (fct_block + fct_block_first_seen_by_node)
- Add parent distance distribution example (fct_block self-join)
- Add getBlobs success rate distribution example (fct_engine_get_blobs_by_slot)
- Add head accuracy by entity vs blob count example
- Add per-node peer count estimate example (libp2p_connected)
- Update SKILL.md with commit/rollback evaluation step

Reduce schema confusion by clarifying which tables to use:
- Validator section: canonical_beacon_validators alone for counts/status
- Entity section: prefer pre-aggregated entity tables over per-validator
- MEV section: clarify mev_relay_bid_trace for bid values, fct_block_mev_head for market share
- Add MEV bid value distribution example
- Fix gossip table name (already committed but ensuring consistency)
- Add timing games by entity example (fct_block_first_seen_by_node + fct_block_proposer_entity)
- Expand MEV description to enumerate all real MEV tables and their purposes
- Addresses timing_games_by_entity (1.92) and mev_bid_value_distribution (1.37) probes

…ations)

Critical fixes:
- Remove fct_mev_bid_value example (table doesn't exist, was teaching hallucination)
- Replace fct_attestation_correctness_by_entity_head (doesn't exist) with
  fct_attestation_correctness_by_validator_head + ethseer_validator_entity join
- Fix entity_analysis category description to stop recommending non-existent table
- Deduplicate "Attestation participation rate" (rename xatu version)
- Deduplicate "Block arrival by consensus client" (rename network_health version)
- Remove default. prefix from xatu queries for consistency

Baseline: avg entropy 0.5756 (16/40 agreed)
Current:  avg entropy 0.5325 (17/40 agreed)

Key improvements from this session:
- precompile_gas_usage: 0.20 agreement -> 1.00 (int_transaction_call_frame)
- slot_13505944_reorg: 2.32 entropy -> 0.00 (fct_block + fct_block_first_seen_by_node)
- aggregate_propagation_timing: 1.52 -> 0.00 (fixed table name typo)
- mev_bid_value_distribution: 1.37 -> 0.00 (removed poisoned fct_mev_bid_value example)
- head_accuracy_vs_blob_count: 1.92 -> 0.72
- entity_performance_comparison: fixed non-existent table reference
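
The baseline and current lines in this commit boil down to two aggregates over the per-probe entropies: the mean, and the number of probes at exactly zero. A minimal sketch follows, assuming results are available as a mapping from probe name to entropy; the function names and dictionary keys are hypothetical.

```python
def summarize(entropies):
    """Aggregate per-probe entropies into the headline numbers for a run."""
    values = list(entropies.values())
    if not values:
        raise ValueError("no probe results to summarize")
    return {
        "avg_entropy": sum(values) / len(values),
        # A probe counts as "agreed" when its entropy is exactly zero.
        "zero_entropy": sum(1 for v in values if v == 0.0),
        "total": len(values),
    }


def relative_change(baseline, current):
    """Percentage change from baseline; negative means entropy went down."""
    return (current - baseline) / baseline * 100.0


# Post-fix entropies for four of the probes listed above.
run = {
    "slot_13505944_reorg": 0.0,
    "aggregate_propagation_timing": 0.0,
    "mev_bid_value_distribution": 0.0,
    "head_accuracy_vs_blob_count": 0.72,
}
summary = summarize(run)
print(summary["zero_entropy"], "of", summary["total"], "probes agreed")
```

Reporting the zero-entropy count next to the mean helps distinguish a run where a few probes improved a lot from one where many probes improved slightly.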

The embedding text for query examples now includes category name, cluster,
and table names extracted from the SQL query. Previously only the example
name and description were embedded, making it impossible for semantic search
to distinguish examples using different tables for similar questions.

Before: "MEV bid value distribution. Get the distribution of MEV bid values"
After:  "MEV Analysis: MEV bid value distribution. Distribution of MEV bid
         values... Cluster: xatu. Tables: mev_relay_bid_trace"
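
A sketch of how such enriched text might be assembled is shown below. The regex-based table extraction is a heuristic chosen for this illustration, not necessarily what the PR implements: it only looks at `FROM` and `JOIN` clauses, would also pick up CTE names, and can misfire on constructs such as `EXTRACT(... FROM ...)`. Function names, the sample SQL, and its column names are assumptions.

```python
import re

# Matches an optionally database-qualified identifier after FROM or JOIN.
_TABLE_RE = re.compile(
    r"\b(?:FROM|JOIN)\s+((?:[A-Za-z_]\w*\.)?[A-Za-z_]\w*)", re.IGNORECASE
)


def extract_tables(sql):
    """Return table names referenced in `sql`, in first-seen order."""
    tables = []
    for ref in _TABLE_RE.findall(sql):
        name = ref.split(".")[-1]  # drop a database prefix such as "default."
        if name not in tables:
            tables.append(name)
    return tables


def embedding_text(category, name, description, cluster, sql):
    """Build the enriched text in the same shape as the "After" example."""
    tables = ", ".join(extract_tables(sql))
    return f"{category}: {name}. {description} Cluster: {cluster}. Tables: {tables}"


sql = "SELECT value FROM mev_relay_bid_trace WHERE slot > 0"  # hypothetical query
print(embedding_text(
    "MEV Analysis",
    "MEV bid value distribution",
    "Distribution of MEV bid values.",
    "xatu",
    sql,
))
```

Because the table names end up in the embedded text, two examples with near-identical descriptions but different source tables no longer map to almost the same vector.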

avg entropy: 0.5756 (baseline) -> 0.4993 (now), -13.2%
zero-entropy probes: 16 -> 23 out of 40

…ttestation

The previous example used length(aggregation_bits) from
beacon_api_eth_v1_events_attestation, which gives the hex string length,
not the number of attesting validators. canonical_beacon_elaborated_attestation
has a validators array, so length(validators) gives the correct count.
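
To make the difference concrete, the Python sketch below mimics the two measurements on made-up rows. The row values and helper names are hypothetical; only the column names come from the commit message.

```python
def hex_string_length(aggregation_bits: str) -> int:
    """What length(aggregation_bits) measures: characters in the hex string."""
    return len(aggregation_bits)


def attesting_validator_count(validators: list[int]) -> int:
    """What length(validators) measures: one entry per attesting validator."""
    return len(validators)


# Hypothetical rows: same bitfield width, very different participation.
sparse = {"aggregation_bits": "0x0101", "validators": [11]}
dense = {"aggregation_bits": "0xff01",
         "validators": [11, 12, 13, 14, 15, 16, 17, 18]}

for row in (sparse, dense):
    print(hex_string_length(row["aggregation_bits"]),
          attesting_validator_count(row["validators"]))
# prints "6 1" then "6 8"
```

Both rows report a string length of 6, so the old example could not distinguish an attestation with one participant from one with eight.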

Fixed broken example that used length(aggregation_bits) string length
instead of length(validators) array count. Switched to
canonical_beacon_elaborated_attestation which has the validators array.

Result: validators_per_included_attestation 1.92 -> 0.00 (all 5 agreed)

…uation

The validator status distribution probe keeps adding dim_validator_status
(lifecycle transitions) instead of using canonical_beacon_validators alone
(current state). Enriched descriptions to explicitly name the correct table
and explain why dim_validator_status is not appropriate for current state queries.

Best run yet: average entropy is down 20.5% from the 0.5756 baseline.
24 probes at zero entropy, up from 16.
Signed-off-by: Sam Calder-Mason <sam@puritydev.io>
samcm merged commit a75a7e0 into master Mar 25, 2026
7 checks passed
