Releases: adewale/skill-eval-harness
Skill Eval Harness v0.4.2
Changes
- Add skill-benchmark token-overhead for static skill footprint, paired runtime token deltas, objective lift, and lift per 1k extra tokens.
- Document token-overhead usage in README.
- Bump package version to 0.4.2.
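The notes above do not say how lift per 1k extra tokens is computed. The sketch below shows one plausible reading, assuming objective lift is the with-skill pass rate minus the without-skill pass rate, and that extra tokens is the paired runtime token delta. The function name and parameters are hypothetical and are not taken from the harness.

```python
from typing import Optional


def lift_per_1k_extra_tokens(
    pass_rate_with: float,
    pass_rate_without: float,
    tokens_with: int,
    tokens_without: int,
) -> Optional[float]:
    """Hypothetical helper: objective lift per 1,000 extra tokens.

    Assumes lift = pass_rate_with - pass_rate_without and
    extra tokens = tokens_with - tokens_without.
    """
    lift = pass_rate_with - pass_rate_without
    extra_tokens = tokens_with - tokens_without
    if extra_tokens <= 0:
        # No extra token cost, so the ratio is undefined in this sketch.
        return None
    return lift / (extra_tokens / 1000)


# Example: pass rate rises from 0.50 to 0.75 for 2,000 extra tokens.
ratio = lift_per_1k_extra_tokens(0.75, 0.5, 3000, 1000)
```

Returning None when the skill costs no extra tokens is a design choice of this sketch; the actual command may report that case differently.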
Validation
- CI passed on Python 3.10, 3.11, and 3.12.
- Local: python3 -m py_compile *.py examples/adewale-workspace/*.py
- Local: python3 -m unittest discover tests -v
- Local: uv build
v0.4.1
Highlights
- Grounds trace support in live Pi and Codex JSONL event shapes.
- Adds variant-scoped assertions for process checks that differ by variant.
- Adds shared trace artifact writing for import-trace, run-codex, Jetty imports, Pi smoke, and Pi trigger traces.
- Isolates the Adewale Pi smoke runner workspace so without_skill cannot read source-repo skill files.
- Adds skill-pi-trigger-eval --trace-runs.
Validation
- python3 -m py_compile *.py examples/adewale-workspace/*.py
- python3 -m unittest discover tests -v
v0.4.0
Highlights
- Adds trace-aware evaluation support: import-trace, run-codex, normalized events/metrics, process and efficiency assertions.
- Adds benchmark paired deltas, normalized gain, negative-delta cases, telemetry availability, and taxonomy slice summaries.
- Adds taxonomy audit warnings and profile-skill size/reference/module reporting.
- Documents trace artifacts and Codex JSONL execution.
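As a rough illustration of the benchmark terms above, the sketch below computes paired deltas, flags negative-delta cases, and derives a normalized gain. It assumes normalized gain means the mean improvement divided by the available headroom, (with - without) / (max - without); the harness may define it differently. All names are hypothetical.

```python
from typing import List, Optional, Sequence


def paired_deltas(with_scores: Sequence[float], without_scores: Sequence[float]) -> List[float]:
    """Per-case delta: with-skill score minus without-skill score."""
    if len(with_scores) != len(without_scores):
        raise ValueError("paired runs must cover the same cases")
    return [w - wo for w, wo in zip(with_scores, without_scores)]


def negative_delta_cases(case_ids: Sequence[str], deltas: Sequence[float]) -> List[str]:
    """Cases where the skill made the score worse."""
    return [cid for cid, d in zip(case_ids, deltas) if d < 0]


def normalized_gain(mean_with: float, mean_without: float, max_score: float = 1.0) -> Optional[float]:
    """Assumed definition: improvement as a fraction of remaining headroom."""
    headroom = max_score - mean_without
    if headroom <= 0:
        # Baseline is already at the ceiling; gain is undefined here.
        return None
    return (mean_with - mean_without) / headroom


case_ids = ["c1", "c2", "c3"]
deltas = paired_deltas([1.0, 0.5, 0.0], [0.5, 0.5, 0.5])
regressions = negative_delta_cases(case_ids, deltas)
gain = normalized_gain(0.75, 0.5)
```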
Validation
- python3 -m py_compile *.py examples/adewale-workspace/*.py
- python3 -m unittest discover tests -v
v0.3.0
Adds three eval-quality features:
- script objective assertions for deterministic repo-owned oracle commands, gated by explicit --allow-scripts.
- Prompt/assertion leakage lint in validate and audit-manifest; use --strict-leakage to fail validation.
- skill-benchmark judge --judge-cmd ... for pluggable judge backends that emit judge-results.jsonl compatible with existing benchmark merging.
Also includes tests and README updates.
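The opt-in gate on script assertions can be pictured with a minimal sketch. It assumes an assertion passes when the oracle command exits with status 0 and that nothing is executed unless the caller opts in, mirroring --allow-scripts. The function, exception, and parameter names are hypothetical; this is not the harness's implementation.

```python
import subprocess
import sys
from typing import Sequence


class ScriptsNotAllowed(RuntimeError):
    """Raised when a script assertion is requested without an explicit opt-in."""


def run_script_assertion(
    command: Sequence[str],
    allow_scripts: bool = False,
    timeout: float = 30.0,
) -> bool:
    """Hypothetical sketch: run an oracle command and pass on exit code 0."""
    if not allow_scripts:
        # Refuse to execute anything unless the caller opted in.
        raise ScriptsNotAllowed("script assertions require an explicit opt-in")
    completed = subprocess.run(
        list(command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode == 0


# Example oracle: a trivially passing command run with the current interpreter.
passed = run_script_assertion(
    [sys.executable, "-c", "raise SystemExit(0)"],
    allow_scripts=True,
)
```

Passing the command as an argument list rather than a shell string avoids shell interpolation, which suits commands that are meant to be deterministic and repo-owned.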
v0.2.0
Adds the first Jetty adapter slice:
- skill-benchmark export-jetty
- skill-benchmark run-jetty
- skill-benchmark import-jetty-results
The adapter follows Jetty docs and jettyio/jettyio-skills runbook-mode conventions: system-message runbook, jetty.template_variables, model_provider, snapshot, uploaded files, and local deterministic grading after import.
Also includes mocked Jetty execution/import tests, hidden-prompt non-executable safety, README updates, and TODO/spec status updates.
Not included yet: live token-backed validation, streaming, concurrent live submissions, Jetty simple_judge, and materialized ablated skill files.