Releases: adewale/skill-eval-harness

Skill Eval Harness v0.4.2

12 Jun 13:43
31ec765

Changes

  • Add skill-benchmark token-overhead, which reports static skill footprint, paired runtime token deltas, objective lift, and lift per 1k extra tokens.
  • Document token-overhead usage in README.
  • Bump package version to 0.4.2.
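
As a rough illustration of the "lift per 1k extra tokens" metric named above, here is a minimal sketch. It assumes that lift is the mean paired difference in objective score (with_skill minus without_skill) and that extra tokens is the mean paired difference in runtime tokens. The PairedRun type, its field names, and the function name are hypothetical and are not taken from the harness; the real token-overhead report may define these quantities differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class PairedRun:
    """One eval case run twice: without and with the skill (hypothetical shape)."""
    score_without: float
    score_with: float
    tokens_without: int
    tokens_with: int


def lift_per_1k_extra_tokens(runs: Sequence[PairedRun]) -> Optional[float]:
    """Mean objective lift divided by mean extra tokens, scaled to 1,000 tokens.

    Returns None when the skill adds no tokens on average, because the
    ratio is undefined (or misleading) in that case.
    """
    lift = mean(r.score_with - r.score_without for r in runs)
    extra_tokens = mean(r.tokens_with - r.tokens_without for r in runs)
    if extra_tokens <= 0:
        return None
    return lift / (extra_tokens / 1000)


runs = [
    PairedRun(score_without=0.5, score_with=0.75, tokens_without=1000, tokens_with=1500),
    PairedRun(score_without=0.25, score_with=0.75, tokens_without=2000, tokens_with=2500),
]
result = lift_per_1k_extra_tokens(runs)
print(result)  # 0.75: a mean lift of 0.375 for 500 extra tokens
```

Dividing by extra tokens rather than total tokens keeps the metric focused on what the skill itself costs at runtime, which is the framing the changelog entry suggests.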

Validation

  • CI passed on Python 3.10, 3.11, and 3.12.
  • Local: python3 -m py_compile *.py examples/adewale-workspace/*.py
  • Local: python3 -m unittest discover tests -v
  • Local: uv build

v0.4.1

11 Jun 19:50
5ceea87

Highlights

  • Grounds trace support in live Pi and Codex JSONL event shapes.
  • Adds variant-scoped assertions for process checks that differ by variant.
  • Adds shared trace artifact writing for import-trace, run-codex, Jetty imports, Pi smoke, and Pi trigger traces.
  • Isolates the Adewale Pi smoke runner workspace so without_skill cannot read source-repo skill files.
  • Adds skill-pi-trigger-eval --trace-runs.

Validation

  • python3 -m py_compile *.py examples/adewale-workspace/*.py
  • python3 -m unittest discover tests -v

v0.4.0

11 Jun 18:47

Highlights

  • Adds trace-aware evaluation support: import-trace, run-codex, normalized events/metrics, process and efficiency assertions.
  • Adds benchmark paired deltas, normalized gain, negative-delta cases, telemetry availability, and taxonomy slice summaries.
  • Adds taxonomy audit warnings and profile-skill size/reference/module reporting.
  • Documents trace artifacts and Codex JSONL execution.
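
To make "paired deltas", "normalized gain", and "negative-delta cases" concrete, here is a small sketch under stated assumptions. It takes the paired delta to be the with_skill score minus the without_skill score for the same case, and the normalized gain to be that delta divided by the remaining headroom (max_score minus the without_skill score), a common definition that the harness may or may not use. The function names and the (case_id, without, with) tuple shape are hypothetical.

```python
from typing import List, Optional, Tuple

Case = Tuple[str, float, float]  # (case_id, score_without_skill, score_with_skill)


def paired_delta(score_without: float, score_with: float) -> float:
    """Score change attributable to the skill on a single case."""
    return score_with - score_without


def normalized_gain(score_without: float, score_with: float,
                    max_score: float = 1.0) -> Optional[float]:
    """Delta as a fraction of the headroom left by the baseline.

    Returns None when the baseline is already at max_score, since there
    is no headroom to normalize against.
    """
    headroom = max_score - score_without
    if headroom <= 0:
        return None
    return paired_delta(score_without, score_with) / headroom


def negative_delta_cases(cases: List[Case]) -> List[str]:
    """IDs of cases where the skill made the score worse."""
    return [cid for cid, wo, w in cases if paired_delta(wo, w) < 0]


cases = [("case-a", 0.5, 0.75), ("case-b", 0.5, 0.25), ("case-c", 1.0, 1.0)]
gains = {cid: normalized_gain(wo, w) for cid, wo, w in cases}
regressions = negative_delta_cases(cases)
print(gains)        # {'case-a': 0.5, 'case-b': -0.5, 'case-c': None}
print(regressions)  # ['case-b']
```

Listing the regressing cases separately from the aggregate matters because a positive mean delta can hide individual cases where the skill hurts.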

Validation

  • python3 -m py_compile *.py examples/adewale-workspace/*.py
  • python3 -m unittest discover tests -v

v0.3.0

10 Jun 23:53
04610f6

Adds three eval-quality features:

  • script objective assertions for deterministic repo-owned oracle commands, gated by explicit --allow-scripts.
  • Prompt/assertion leakage lint in validate and audit-manifest; use --strict-leakage to fail validation.
  • skill-benchmark judge --judge-cmd ... for pluggable judge backends that emit judge-results.jsonl compatible with existing benchmark merging.

Also includes tests and README updates.

v0.2.0

10 Jun 10:32
3c6c4c9

Adds the first Jetty adapter slice:

  • skill-benchmark export-jetty
  • skill-benchmark run-jetty
  • skill-benchmark import-jetty-results

The adapter follows Jetty docs and jettyio/jettyio-skills runbook-mode conventions: system-message runbook, jetty.template_variables, model_provider, snapshot, uploaded files, and local deterministic grading after import.

Also includes mocked Jetty execution/import tests, hidden-prompt non-executable safety, README updates, and TODO/spec status updates.

Not included yet: live token-backed validation, streaming, concurrent live submissions, Jetty simple_judge, and materialized ablated skill files.

v0.1.1

09 Jun 13:57

Switch installation instructions to uv.

v0.1.0

09 Jun 13:32

Initial standalone Skill Eval Harness release.