Statistical Evaluation Framework
Source: Alexmacapple/alex-claude-skill
Score: 4.3/5.0 (Impact: 4, Novelty: 5, Applicability: 4, Effort: 4)
Three techniques from eval-robuste
1. Prompt Hash Chaining for Baseline Integrity
Hash the full evaluation chain (SKILL.md + zone defs + agent prompts + git SHA). Store in results. Reject comparisons when hash changes — prevents confusing prompt changes with code improvements.
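A minimal sketch of the hash-chaining idea, assuming SHA-256 over the concatenated chain inputs (the function names and record layout are hypothetical, not the actual `prompt_hash.py` API):

```python
import hashlib

def hash_eval_chain(skill_md: str, zone_defs: str, agent_prompts: str, git_sha: str) -> str:
    """Hash every input that shapes the evaluation, so a stored baseline is
    only comparable when the whole chain is byte-identical."""
    h = hashlib.sha256()
    for part in (skill_md, zone_defs, agent_prompts, git_sha):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator avoids boundary collisions between parts
    return h.hexdigest()

def check_comparable(baseline: dict, current: dict) -> None:
    """Reject a baseline comparison when the evaluation chain changed."""
    if baseline["chain_hash"] != current["chain_hash"]:
        raise ValueError("evaluation chain changed; baseline not comparable")
```

Storing the hash alongside each result set means a later run can fail fast instead of silently comparing scores produced under different prompts.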
2. Deterministic Stats Delegation
ALL math (mean, stdev, CI, verdict) is delegated to a Python script; the LLM never computes numbers. Any arithmetic performed by the LLM is a protocol violation. Add score_audit.py to ShipGuard.
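A sketch of what the deterministic aggregation step could look like, assuming a 95% confidence interval via the normal approximation (the function name and output schema are illustrative, not the real `score_audit.py`):

```python
import math
import statistics

def aggregate(scores: list[float]) -> dict:
    """Deterministic aggregation of per-run scores: mean, sample stdev,
    and a 95% CI (normal approximation). The LLM only reads this output."""
    n = len(scores)
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if n > 1 else 0.0
    half_width = 1.96 * stdev / math.sqrt(n) if n > 1 else 0.0
    return {
        "n": n,
        "mean": round(mean, 4),
        "stdev": round(stdev, 4),
        "ci95": (round(mean - half_width, 4), round(mean + half_width, 4)),
    }
```

Because the script is pure arithmetic over recorded scores, two people running it on the same results file always get the same verdict inputs.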
3. Intrinsic Stability Verdict (NOISE/REGRESSION/IMPROVEMENT)
Run audit N times, compute stdev of finding counts. STABLE if stdev < threshold. With baseline: NOISE/REGRESSION/IMPROVEMENT based on |delta| > sigma * ref_stdev. Prevents chasing phantom regressions from LLM stochasticity.
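The verdict logic above can be sketched as follows, assuming more findings means worse (so a positive shift is a regression); the thresholds and function name are illustrative defaults, not values from eval-robuste:

```python
import statistics
from typing import Optional

def verdict(run_counts: list[int],
            baseline_mean: Optional[float] = None,
            ref_stdev: Optional[float] = None,
            sigma: float = 2.0,
            stable_threshold: float = 1.5) -> str:
    """Classify N-run audit results. Without a baseline, report intrinsic
    stability from the stdev of finding counts. With a baseline, call the
    shift NOISE unless |delta| exceeds sigma * ref_stdev."""
    mean = statistics.mean(run_counts)
    stdev = statistics.stdev(run_counts) if len(run_counts) > 1 else 0.0
    if baseline_mean is None or ref_stdev is None:
        return "STABLE" if stdev < stable_threshold else "UNSTABLE"
    delta = mean - baseline_mean
    if abs(delta) <= sigma * ref_stdev:
        return "NOISE"
    return "REGRESSION" if delta > 0 else "IMPROVEMENT"
```

For example, three runs yielding 5, 5, and 6 findings against a baseline of 5.3 with ref_stdev 0.6 would land in the NOISE band rather than triggering a regression alert.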
What ShipGuard should do
Build a sg-eval skill or extend sg-improve with:
- prompt_hash.py — hash evaluation chain, store in results, reject invalid comparisons
- score_audit.py — deterministic aggregation (no LLM math)
- N-run stability detection with NOISE/REGRESSION/IMPROVEMENT verdicts
Affected skill: sg-improve
Mutation type: add_constraint
Scouted: 2026-04-16