Releases · adewale/skill-eval-harness

12 Jun 13:43

adewale

v0.4.2

31ec765

Skill Eval Harness v0.4.2 Latest

Changes

Add skill-benchmark token-overhead for static skill footprint, paired runtime token deltas, objective lift, and lift per 1k extra tokens.
Document token-overhead usage in README.
Bump package version to 0.4.2.
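
As a rough illustration of the last metric in the first bullet, the sketch below computes lift per 1k extra tokens for one paired run. The function name, the argument names, and the formula itself (objective lift divided by extra tokens in thousands) are assumptions made for illustration; the release notes do not spell out how skill-benchmark token-overhead defines it.

```python
from typing import Optional


def lift_per_1k_extra_tokens(
    score_with: float,
    score_without: float,
    tokens_with: int,
    tokens_without: int,
) -> Optional[float]:
    """Hypothetical helper: objective lift normalized by extra token cost.

    Assumed definition: (score_with - score_without) per 1,000 extra tokens
    spent by the with-skill run compared with the without-skill run.
    """
    lift = score_with - score_without
    extra_tokens = tokens_with - tokens_without
    if extra_tokens <= 0:
        # No extra tokens were spent, so the ratio is undefined here.
        return None
    return lift / (extra_tokens / 1000)


if __name__ == "__main__":
    # Made-up numbers, not results from the harness.
    print(lift_per_1k_extra_tokens(0.8, 0.6, 3000, 2000))
```

Returning None when no extra tokens were spent avoids a division by zero and keeps "the skill was free" distinct from "the skill gave no lift".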

Validation

CI passed on Python 3.10, 3.11, and 3.12.
Local: python3 -m py_compile *.py examples/adewale-workspace/*.py
Local: python3 -m unittest discover tests -v
Local: uv build

11 Jun 19:50

adewale

v0.4.1

5ceea87

v0.4.1

Highlights

Grounds trace support in live Pi and Codex JSONL event shapes.
Adds variant-scoped assertions for process checks that differ by variant.
Adds shared trace artifact writing for import-trace, run-codex, Jetty imports, Pi smoke, and Pi trigger traces.
Isolates the Adewale Pi smoke runner workspace so without_skill cannot read source-repo skill files.
Adds skill-pi-trigger-eval --trace-runs.
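
One way to picture variant-scoped assertions is a filter that keeps only the checks that apply to the variant being graded. This is a sketch under stated assumptions: the "variants" key, the dictionary shape, and the with_skill variant name are hypothetical and are not taken from the harness's manifest format.

```python
from typing import Any, Dict, List


def assertions_for_variant(
    assertions: List[Dict[str, Any]], variant: str
) -> List[Dict[str, Any]]:
    """Return the assertions that apply to the given variant.

    Assumption: an assertion without a "variants" key applies to every
    variant; one with the key applies only to the variants it lists.
    """
    selected = []
    for assertion in assertions:
        scope = assertion.get("variants")  # hypothetical field name
        if scope is None or variant in scope:
            selected.append(assertion)
    return selected


if __name__ == "__main__":
    checks = [
        {"id": "ran-tests"},
        {"id": "read-skill-file", "variants": ["with_skill"]},
    ]
    print([c["id"] for c in assertions_for_variant(checks, "without_skill")])
```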

Validation

python3 -m py_compile *.py examples/adewale-workspace/*.py
python3 -m unittest discover tests -v

11 Jun 18:47

adewale

v0.4.0

5a8144c

v0.4.0

Highlights

Adds trace-aware evaluation support: import-trace, run-codex, normalized events/metrics, process and efficiency assertions.
Adds benchmark paired deltas, normalized gain, negative-delta cases, telemetry availability, and taxonomy slice summaries.
Adds taxonomy audit warnings and profile-skill size/reference/module reporting.
Documents trace artifacts and Codex JSONL execution.
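
The paired-delta metrics in the second highlight can be sketched as follows. The function names and the normalized-gain formula (delta divided by the remaining headroom of the baseline) are assumptions based on the common definition of normalized gain; the release notes do not state the harness's exact formula.

```python
from typing import List, Optional, Tuple


def paired_delta(with_skill: float, without_skill: float) -> float:
    """Score difference for one case run under both variants."""
    return with_skill - without_skill


def normalized_gain(
    with_skill: float, without_skill: float, max_score: float = 1.0
) -> Optional[float]:
    """Assumed definition: delta as a fraction of the baseline's headroom."""
    headroom = max_score - without_skill
    if headroom <= 0:
        # The baseline already hit the maximum, so gain is undefined.
        return None
    return (with_skill - without_skill) / headroom


def negative_delta_cases(pairs: List[Tuple[str, float, float]]) -> List[str]:
    """Return IDs of cases where the with-skill run scored lower."""
    return [case_id for case_id, w, wo in pairs if paired_delta(w, wo) < 0]


if __name__ == "__main__":
    # Made-up scores: (case_id, with_skill, without_skill).
    runs = [("case-a", 0.9, 0.5), ("case-b", 0.2, 0.4)]
    print(negative_delta_cases(runs))
    print(normalized_gain(0.75, 0.5))
```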

Validation

python3 -m py_compile *.py examples/adewale-workspace/*.py
python3 -m unittest discover tests -v

10 Jun 23:53

adewale

v0.3.0

04610f6

v0.3.0

Adds three eval-quality features:

Script objective assertions for deterministic, repo-owned oracle commands, gated by an explicit --allow-scripts flag.
Prompt/assertion leakage lint in validate and audit-manifest; pass --strict-leakage to make leakage fail validation.
skill-benchmark judge --judge-cmd ... for pluggable judge backends that emit a judge-results.jsonl compatible with existing benchmark merging.

Also includes tests and README updates.
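
To make the idea of prompt/assertion leakage concrete, here is a minimal sketch of such a check: it flags expected values that already appear verbatim in the prompt, where the eval would be giving the answer away. The function name, the case-insensitive substring rule, and the min_length threshold are all assumptions; the harness's actual lint rules are not described in these notes.

```python
from typing import List


def find_leaked_assertions(
    prompt: str, expected_values: List[str], min_length: int = 4
) -> List[str]:
    """Hypothetical lint: return expected values that occur in the prompt.

    Very short values are skipped (assumed threshold) because they match
    by accident too often to be a useful signal.
    """
    prompt_lower = prompt.lower()
    leaked = []
    for value in expected_values:
        needle = value.strip().lower()
        if len(needle) >= min_length and needle in prompt_lower:
            leaked.append(value)
    return leaked


if __name__ == "__main__":
    prompt = "Write a function named parse_config that returns a dict."
    print(find_leaked_assertions(prompt, ["parse_config", "yaml"]))
```

A strict mode along the lines of --strict-leakage would treat a non-empty result as a validation failure rather than a warning.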

10 Jun 10:32

adewale

v0.2.0

3c6c4c9

v0.2.0

Adds the first Jetty adapter slice:

skill-benchmark export-jetty
skill-benchmark run-jetty
skill-benchmark import-jetty-results

The adapter follows Jetty docs and jettyio/jettyio-skills runbook-mode conventions: system-message runbook, jetty.template_variables, model_provider, snapshot, uploaded files, and local deterministic grading after import.

Also includes mocked Jetty execution/import tests, hidden-prompt non-executable safety, README updates, and TODO/spec status updates.

Not included yet: live token-backed validation, streaming, concurrent live submissions, Jetty simple_judge, and materialized ablated skill files.

09 Jun 13:57

adewale

v0.1.1

ea25cc6

v0.1.1

Switch installation instructions to uv.

09 Jun 13:32

adewale

v0.1.0

0d9a75c

v0.1.0

Initial standalone Skill Eval Harness release.
