Scout: Statistical Evaluation Framework (from eval-robuste) #44

@bacoco

Description

Statistical Evaluation Framework

Source: Alexmacapple/alex-claude-skill
Score: 4.3/5.0 (Impact: 4, Novelty: 5, Applicability: 4, Effort: 4)

Three techniques from eval-robuste

1. Prompt Hash Chaining for Baseline Integrity
Hash the full evaluation chain (SKILL.md, zone definitions, agent prompts, and the git SHA) and store the hash alongside the results. Reject any comparison when the hash differs; this prevents mistaking prompt changes for code improvements.
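A minimal sketch of what prompt_hash.py could look like, using only the standard library. The function names and the exact set of hashed inputs are assumptions, not the eval-robuste implementation:

```python
# Hypothetical sketch of prompt_hash.py (names are assumptions).
import hashlib
import subprocess
from pathlib import Path


def evaluation_chain_hash(paths, repo_dir="."):
    """Hash the concatenated evaluation inputs plus the current git SHA."""
    h = hashlib.sha256()
    for p in sorted(paths):  # sort so ordering cannot change the hash
        h.update(Path(p).read_bytes())
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        sha = b"no-git"  # still hash something stable outside a repo
    h.update(sha)
    return h.hexdigest()


def comparable(baseline_hash, current_hash):
    """Refuse to compare runs produced by different evaluation chains."""
    return baseline_hash == current_hash
```

Stored in each result file, the hash makes stale baselines self-invalidating: a changed SKILL.md or prompt yields a different hash, and `comparable()` returns False.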

2. Deterministic Stats Delegation
All math (mean, stdev, confidence intervals, verdict) is delegated to a Python script; the LLM never computes numbers. Any arithmetic performed by the LLM is a protocol violation. Add score_audit.py to ShipGuard.
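A possible shape for the deterministic aggregation in score_audit.py, stdlib only. The function name and the normal-approximation 95% CI are assumptions; the point is that every number comes out of this script, never out of the LLM:

```python
# Hypothetical sketch of score_audit.py aggregation (names are assumptions).
import math
import statistics


def aggregate(scores, z=1.96):
    """Deterministically compute mean, sample stdev, and a normal-approx 95% CI."""
    n = len(scores)
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if n > 1 else 0.0
    margin = z * stdev / math.sqrt(n) if n > 1 else 0.0
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - margin, mean + margin),
    }
```

The LLM's only job is to produce the raw scores; it hands them to this script and reports the returned dict verbatim.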

3. Intrinsic Stability Verdict (NOISE/REGRESSION/IMPROVEMENT)
Run the audit N times and compute the stdev of the finding counts; report STABLE if stdev < threshold. With a baseline, classify the delta as NOISE, REGRESSION, or IMPROVEMENT depending on whether |delta| > sigma * ref_stdev and on its sign. This prevents chasing phantom regressions caused by LLM stochasticity.
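The verdict logic above could be sketched as follows. The default thresholds, the sigma value, and the sign convention (more findings = worse) are assumptions for illustration:

```python
# Hypothetical sketch of the stability verdict (thresholds are assumptions).
import statistics


def verdict(run_counts, baseline_mean=None, ref_stdev=None,
            sigma=2.0, stable_threshold=1.0):
    """Classify N-run finding counts: STABLE/UNSTABLE intrinsically,
    or NOISE/REGRESSION/IMPROVEMENT against a baseline."""
    mean = statistics.mean(run_counts)
    stdev = statistics.stdev(run_counts) if len(run_counts) > 1 else 0.0
    if baseline_mean is None:
        return "STABLE" if stdev < stable_threshold else "UNSTABLE"
    delta = mean - baseline_mean
    if abs(delta) <= sigma * ref_stdev:
        return "NOISE"  # within expected run-to-run variation
    # Assumption: a higher finding count means the audit got worse.
    return "REGRESSION" if delta > 0 else "IMPROVEMENT"
```

With sigma = 2, a delta only counts as real when it exceeds two reference standard deviations; anything smaller is reported as NOISE instead of triggering an investigation.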

What ShipGuard should do

Build a sg-eval skill or extend sg-improve with:

  • prompt_hash.py — hash evaluation chain, store in results, reject invalid comparisons
  • score_audit.py — deterministic aggregation (no LLM math)
  • N-run stability detection with NOISE/REGRESSION/IMPROVEMENT verdicts

Affected skill: sg-improve
Mutation type: add_constraint
Scouted: 2026-04-16

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
Projects: none
Milestone: none
Relationships: none yet
Development: no branches or pull requests
