feat: on-disk cert evidence via [CERT-FIRE] stdout log + v0.1.0a3 by coredipper · Pull Request #6 · coredipper/operon-openhands-gates

coredipper · 2026-04-21T08:45:06Z

Summary

Closes the "cert metadata not serialized" caveat from the n=10 delta artifact. The blog post had to cite retry counts as a proxy for critic firing because no on-disk cert evidence existed; this PR lands the workaround as a side-channel stdout log.

Library (ships in v0.1.0a3):

OperonStagnationCritic.evaluate() writes a single logger.info("[CERT-FIRE] %s", <json>) line on the certificate transition. Fires exactly once per critic instance per conversation (same guard as the cert emission itself). Payload: theorem, source, cert_evidence_n, epiplexic_integral, severity, detection_index.
instance_id is NOT in the payload; the critic doesn't receive one. Correlation happens downstream via the logs/instance_<iid>.output.log filename.
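The single-fire behaviour described above can be sketched as follows. This is a minimal illustration and not the actual `OperonStagnationCritic` code: the class name `OneShotCertEmitter`, the `observe` method, and the logger name are hypothetical, and only the `[CERT-FIRE]` prefix and the payload key names come from this PR. Where the line ends up depends on how the logger's handlers are configured.

```python
import json
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("cert_fire_sketch")


class OneShotCertEmitter:
    """Hypothetical stand-in for the critic's transition guard."""

    def __init__(self) -> None:
        # Flips to True the first time a stagnation transition is seen.
        self._fired = False

    def observe(self, is_stagnant: bool, payload: dict) -> bool:
        """Log one [CERT-FIRE] line on the first stagnant observation.

        Returns True only for the call that emitted the line.
        """
        if is_stagnant and not self._fired:
            self._fired = True
            # One JSON object per line keeps the record easy to parse later.
            logger.info("[CERT-FIRE] %s", json.dumps(payload, sort_keys=True))
            return True
        return False
```

Because the guard lives on the instance, a fresh emitter per conversation yields at most one record per conversation, which is what lets a downstream parser treat the first record in a log file as the record for that instance.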

Generator:

_load_cert_fires_from_logs parses both bare [CERT-FIRE] lines and the openhands [DOCKER] {"message": "[CERT-FIRE] {...}"} JSON-wrapper.
CLI: --baseline-logs-dir / --treatment-logs-dir (independently optional; back-compat with pre-instrumentation runs that parse to zero cert records).
Artifact: per-instance treatment_certificate payload + summary certificates_emitted count/rate.
Markdown: new Certificates emitted headline row, Treat cert column in the instance table, and the "cert metadata not serialized" caveat flips to a "side-channel log via [CERT-FIRE]" variant when real records are present.
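For readers who want a feel for the two log shapes, here is a rough sketch of a parser in the spirit of `_load_cert_fires_from_logs`. It is not the shipped implementation: the function names `parse_cert_line` and `load_cert_fires` are hypothetical, and it assumes that the marker (or the `[DOCKER]` wrapper) starts the line and that log files are named `instance_<iid>.output.log`.

```python
import json
import re
from pathlib import Path
from typing import Optional

CERT_PREFIX = "[CERT-FIRE] "
DOCKER_PREFIX = "[DOCKER] "
# Assumed filename shape, taken from the PR description.
LOG_NAME = re.compile(r"^instance_(?P<iid>.+)\.output\.log$")


def parse_cert_line(line: str) -> Optional[dict]:
    """Return the cert payload from one log line, or None if there is none."""
    text = line.strip()
    if text.startswith(DOCKER_PREFIX):
        # Wrapped shape: [DOCKER] {"message": "[CERT-FIRE] {...}"}
        try:
            wrapper = json.loads(text[len(DOCKER_PREFIX):])
        except json.JSONDecodeError:
            return None
        message = wrapper.get("message") if isinstance(wrapper, dict) else None
        text = message if isinstance(message, str) else ""
    if not text.startswith(CERT_PREFIX):
        return None
    try:
        payload = json.loads(text[len(CERT_PREFIX):])
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def load_cert_fires(logs_dir: Optional[Path]) -> dict:
    """Map instance id -> first cert payload found in that instance's log."""
    fires: dict = {}
    if logs_dir is None or not logs_dir.is_dir():
        return fires  # back-compat: no logs means zero cert records
    for path in sorted(logs_dir.iterdir()):
        match = LOG_NAME.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        for line in path.read_text(errors="replace").splitlines():
            payload = parse_cert_line(line)
            if payload is not None:
                fires[match["iid"]] = payload
                break  # keep the first record only
    return fires
```

Returning an empty mapping for a missing directory mirrors the back-compat behaviour described above, where pre-instrumentation runs parse to zero cert records.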

Tests: +10 new, 94 total pass:

3 on the critic (single-fire on transition, no-fire without, no re-fire after fire)
7 on the generator (parser with both log shapes, missing-dir, no-cert-lines, keeps-first-only, artifact population, markdown mutation, back-compat)

Regen

Ran the updated generator against the existing logs. Count is 0/10 — logs predate this library change. The new cert-caveat variant says so explicitly: "0 of 10 treatment instances have on-disk cert records on this run." Future post-instrumentation runs will populate real records.

Out of scope (parallel follow-ups)

Sibling operon-langgraph-gates StagnationGate — same instrumentation pattern, separate repo, separate plan.
Upstream openhands-sdk patch to serialize CriticResult.metadata into event history — would obsolete this workaround.

Test plan

pytest tests/ — 94 pass (3 critic + 7 generator + existing 84).
ruff check scripts/ tests/ clean.
Artifact regen via --baseline-logs-dir / --treatment-logs-dir produces valid JSON + md on pre-instrumentation logs (0 cert records, back-compat caveat text correct).
After merge: gh release create v0.1.0a3 → publish.yml → PyPI.

🤖 Generated with Claude Code

Closes caveat 2 from the n=10 delta md ("``OperonStagnationCritic`` emits ``CriticResult.metadata`` but openhands-sdk drops it"). The blog post had to cite retry counts as a proxy for critic firing because no on-disk cert evidence existed; v0.1.0a3 lands the workaround.

**Library** (src/operon_openhands_gates/stagnation_critic.py): ``OperonStagnationCritic.evaluate()`` now writes a single ``logger.info("[CERT-FIRE] %s", <json>)`` line on the certificate transition (guarded by the same ``was_stagnant / should_be_stagnant`` check, so it fires exactly once per critic instance per conversation). Payload: ``theorem``, ``source``, ``cert_evidence_n``, ``epiplexic_integral``, ``severity``, ``detection_index``. ``instance_id`` is NOT in the payload; the critic doesn't receive one. Correlation happens downstream via the ``logs/instance_<iid>.output.log`` filename.

**Generator** (scripts/generate_delta_artifact.py): new ``_load_cert_fires_from_logs`` parser handling both bare ``[CERT-FIRE]`` lines and the openhands ``[DOCKER] {"message": "[CERT-FIRE] {...}"}`` wrapper. New CLI flags ``--baseline-logs-dir`` / ``--treatment-logs-dir`` (independently optional, for back-compat with pre-instrumentation runs). When logs are supplied:

- Per-instance ``baseline_certificate`` / ``treatment_certificate`` payloads (or ``None``).
- Summary ``certificates_emitted`` + ``certificates_emitted_rate``.
- Markdown: new "Certificates emitted" headline row, "Treat cert" column in the instance table, caveat 2 flips to a "side-channel log" variant that names the ``[CERT-FIRE]`` prefix and the ``--*-logs-dir`` flags.
- "Next steps" drops the "fix cert serialization" item when real records are present (it's done).

**Tests** (+10 new, 94 total pass):

- 3 on the critic: single-fire on transition, no-fire without transition, no re-fire after initial fire.
- 7 on the generator: parser (Docker-wrapped + bare), missing-dir, no-cert-lines, keeps-first-only on duplicate, artifact population, markdown mutation + back-compat.

**Regen**: artifact regenerated with ``--*-logs-dir`` against existing logs. Count is 0/10 because the logs predate this library change (captured from v0.1.0a2 inference runs); the markdown's new cert caveat variant calls that out explicitly. Future post-instrumentation runs will populate real cert records.

Version: 0.1.0a2 → 0.1.0a3.

Out of scope (parallel follow-ups):

- Sibling ``operon-langgraph-gates`` ``StagnationGate`` would benefit from the same instrumentation (separate plan).
- Upstream openhands-sdk patch to serialize ``CriticResult.metadata`` into event history; it would obsolete the side-channel log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two findings on the v0.1.0a3 cert-log PR:

**High (M1, #848)**: ``logger.info(...)`` writes to stderr by default (Python stdlib logging's StreamHandler defaults to ``sys.stderr``). The benchmarks runner captures *stdout* into ``logs/instance_<iid>.output.log``, so our cert marker would have been invisible to the downstream parser in the common container config, silently breaking the advertised workaround.

Fix: emit via ``print("[CERT-FIRE] ...", file=sys.stdout, flush=True)`` in ``OperonStagnationCritic.evaluate()``. Explicit stdout, explicit flush (so the line lands before the container teardown truncates buffered output). Dropped the module-level ``logger`` since it's no longer used for this purpose.

Tests: swapped ``caplog`` for ``capsys`` in the 3 existing cert-fire tests so they exercise the actual emitted stream (not the logging hierarchy, which doesn't reflect what the runner captures). Added ``test_cert_fire_goes_to_stdout_not_stderr`` as an explicit channel-pinning regression.

**Medium (M2, #848)**: ``--baseline-logs-dir`` / ``--treatment-logs-dir`` weren't validated against the run rows. A stale/mistyped directory would silently yield ``certificates_emitted: 0`` because non-matching filenames are ignored and missing matches look identical to "no cert fires". This is the same bug class that ``_validate_eval_report_covers_rows`` catches for eval reports.

Fix: added ``_validate_logs_dir_covers_rows`` with three failure modes surfaced clearly (dir doesn't exist, dir is empty, some expected ``instance_<iid>.output.log`` files are missing). Called from ``main()`` before the artifact is built. Extra files beyond the row set are fine (only missing IDs trigger the error).

Tests: 4 new cases covering each failure mode + the happy path.

99 total tests pass, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
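The M2 coverage check can be sketched roughly as below. This is an illustration under stated assumptions and not the code of ``_validate_logs_dir_covers_rows``: the function name, the ``LogsDirError`` type, the error messages, and the ``instance_ids`` argument are hypothetical, while the three failure modes and the tolerance for extra files follow the description above.

```python
from pathlib import Path
from typing import Iterable


class LogsDirError(ValueError):
    """Hypothetical error type raised when a logs dir cannot cover the rows."""


def validate_logs_dir_covers_rows(logs_dir: Path, instance_ids: Iterable[str]) -> None:
    """Raise LogsDirError unless every instance id has a matching log file."""
    # Failure mode 1: the directory is missing (stale or mistyped path).
    if not logs_dir.is_dir():
        raise LogsDirError(f"logs dir does not exist: {logs_dir}")

    present = {path.name for path in logs_dir.iterdir()}

    # Failure mode 2: the directory exists but holds nothing at all.
    if not present:
        raise LogsDirError(f"logs dir is empty: {logs_dir}")

    # Failure mode 3: some expected per-instance logs are absent.
    # Extra files are allowed; only missing ids count as an error.
    missing = sorted(
        iid for iid in instance_ids
        if f"instance_{iid}.output.log" not in present
    )
    if missing:
        raise LogsDirError(
            f"logs dir {logs_dir} is missing log files for instance ids: {missing}"
        )
```

Failing loudly here is what separates "the directory was wrong" from "no certificate fired", which otherwise both show up as a count of zero.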

coredipper and others added 2 commits April 21, 2026 10:44

coredipper merged commit a5948ef into main Apr 21, 2026
2 checks passed

coredipper deleted the feat/cert-stdout-log branch April 21, 2026 09:03

coredipper merged 2 commits into main from feat/cert-stdout-log
