feat: on-disk cert evidence via [CERT-FIRE] stdout log + v0.1.0a3 #6
Merged
Closes caveat 2 from the n=10 delta md ("``OperonStagnationCritic``
emits ``CriticResult.metadata`` but openhands-sdk drops it"). The
blog post had to cite retry counts as a proxy for critic firing
because no on-disk cert evidence existed; v0.1.0a3 lands the
workaround.
**Library** (src/operon_openhands_gates/stagnation_critic.py):
``OperonStagnationCritic.evaluate()`` now writes a single
``logger.info("[CERT-FIRE] %s", <json>)`` line on the certificate
transition (guarded by the same ``was_stagnant / should_be_stagnant``
check, so it fires exactly once per critic instance per conversation).
Payload: ``theorem``, ``source``, ``cert_evidence_n``,
``epiplexic_integral``, ``severity``, ``detection_index``.
Instance_id is NOT in the payload — the critic doesn't receive one.
Correlation happens downstream via the ``logs/instance_<iid>.output.log``
filename.
**Generator** (scripts/generate_delta_artifact.py): new
``_load_cert_fires_from_logs`` parser handling both bare ``[CERT-FIRE]``
lines and the openhands ``[DOCKER] {"message": "[CERT-FIRE] {...}"}``
wrapper. New CLI flags ``--baseline-logs-dir`` / ``--treatment-logs-dir``
(independently optional — back-compat with pre-instrumentation runs).
When logs are supplied:
- Per-instance ``baseline_certificate`` / ``treatment_certificate``
payloads (or ``None``).
- Summary ``certificates_emitted`` + ``certificates_emitted_rate``.
- Markdown: new "Certificates emitted" headline row, "Treat cert"
column in the instance table, caveat 2 flips to a "side-channel
log" variant that names the ``[CERT-FIRE]`` prefix and the
``--*-logs-dir`` flags.
- "Next steps" drops the "fix cert serialization" item when real
records are present (it's done).
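The parsing behaviour described above (bare and Docker-wrapped lines, keep-first-only, missing directory yields nothing) could look roughly like the sketch below. It is not the actual ``_load_cert_fires_from_logs`` code: the function names ``parse_cert_fire_line``, ``load_cert_fires`` and ``summarize_cert_fires`` are hypothetical, and it assumes the marker sits at the start of the line (or at the start of the wrapper's ``message`` field).

```python
import json
from pathlib import Path
from typing import Iterable, Optional

CERT_PREFIX = "[CERT-FIRE]"
DOCKER_PREFIX = "[DOCKER]"


def parse_cert_fire_line(line: str) -> Optional[dict]:
    """Return the cert payload carried by one log line, or None."""
    text = line.strip()
    if text.startswith(DOCKER_PREFIX):
        # Wrapped form: [DOCKER] {"message": "[CERT-FIRE] {...}"}
        try:
            wrapper = json.loads(text[len(DOCKER_PREFIX):])
        except json.JSONDecodeError:
            return None
        if not isinstance(wrapper, dict):
            return None
        text = str(wrapper.get("message", "")).strip()
    if not text.startswith(CERT_PREFIX):
        return None
    try:
        payload = json.loads(text[len(CERT_PREFIX):])
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def load_cert_fires(logs_dir: Path) -> dict[str, dict]:
    """Map instance id -> first cert payload found in instance_<iid>.output.log."""
    fires: dict[str, dict] = {}
    if not logs_dir.is_dir():
        return fires  # missing directory: no records, no error (assumed behaviour)
    for path in sorted(logs_dir.glob("instance_*.output.log")):
        iid = path.name[len("instance_"):-len(".output.log")]
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            payload = parse_cert_fire_line(line)
            if payload is not None:
                fires[iid] = payload  # keep the first record only
                break
    return fires


def summarize_cert_fires(fires: dict[str, dict], instance_ids: Iterable[str]) -> dict:
    """Count how many of the given instances have a cert record."""
    ids = list(instance_ids)
    emitted = sum(1 for iid in ids if iid in fires)
    rate = emitted / len(ids) if ids else 0.0
    return {"certificates_emitted": emitted, "certificates_emitted_rate": rate}
```

Instances without an entry in the returned mapping correspond to the ``None`` certificate case in the artifact.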
**Tests** (+10 new, 94 total pass):
- 3 on the critic: single-fire on transition, no-fire without
transition, no re-fire after initial fire.
- 7 on the generator: parser (Docker-wrapped + bare), missing-dir,
no-cert-lines, keeps-first-only on duplicate, artifact population,
markdown mutation + back-compat.
**Regen**: artifact regenerated with ``--*-logs-dir`` against
existing logs. Count is 0/10 because the logs predate this
library change (captured from v0.1.0a2 inference runs); the
markdown's new cert caveat variant calls that out explicitly.
Future post-instrumentation runs will populate real cert records.
Version: 0.1.0a2 → 0.1.0a3.
Out of scope (parallel follow-ups):
- Sibling ``operon-langgraph-gates`` ``StagnationGate`` would benefit
from the same instrumentation — separate plan.
- Upstream openhands-sdk patch to serialize ``CriticResult.metadata``
into event history — would obsolete the side-channel log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings on the v0.1.0a3 cert-log PR:
**High (M1, #848)**: ``logger.info(...)`` writes to stderr by default
(Python stdlib logging's StreamHandler defaults to ``sys.stderr``).
The benchmarks runner captures *stdout* into
``logs/instance_<iid>.output.log``, so our cert marker would have
been invisible to the downstream parser in the common container
config — silently breaking the advertised workaround.
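The stream mismatch can be reproduced with the standard library alone. The sketch below is illustrative only: the logger name ``cert_demo`` and the helper ``capture_streams`` are hypothetical, and it attaches a default ``logging.StreamHandler()`` explicitly rather than relying on whatever logging configuration the container uses.

```python
import io
import logging
from contextlib import redirect_stderr, redirect_stdout


def capture_streams(emit):
    """Run emit() and return (stdout_text, stderr_text)."""
    out, err = io.StringIO(), io.StringIO()
    with redirect_stdout(out), redirect_stderr(err):
        emit()
    return out.getvalue(), err.getvalue()


def emit_via_logging():
    logger = logging.getLogger("cert_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo independent of root handlers
    handler = logging.StreamHandler()  # no stream argument: uses sys.stderr
    logger.addHandler(handler)
    try:
        logger.info("[CERT-FIRE] %s", "{}")
    finally:
        logger.removeHandler(handler)


if __name__ == "__main__":
    out, err = capture_streams(emit_via_logging)
    print("stdout has marker:", "[CERT-FIRE]" in out)
    print("stderr has marker:", "[CERT-FIRE]" in err)
```

A runner that only persists stdout would therefore never see the marker.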
Fix: emit via ``print("[CERT-FIRE] ...", file=sys.stdout, flush=True)``
in ``OperonStagnationCritic.evaluate()``. Explicit stdout, explicit
flush (so the line lands before the container teardown truncates
buffered output). Dropped the module-level ``logger`` since it's
no longer used for this purpose.
Tests: swapped ``caplog`` for ``capsys`` in the 3 existing cert-fire
tests so they exercise the actual emitted stream (not the logging
hierarchy, which doesn't reflect what the runner captures). Added
``test_cert_fire_goes_to_stdout_not_stderr`` as an explicit
channel-pinning regression.
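A minimal sketch of the once-only stdout emission, under stated assumptions: ``CertFireEmitter`` and ``observe`` are hypothetical stand-ins for the guard inside ``OperonStagnationCritic.evaluate()``, and the stagnant flag is assumed to latch once set. Only the ``[CERT-FIRE]`` prefix, the payload keys, and the ``print(..., file=sys.stdout, flush=True)`` call come from the description above.

```python
import json
import sys


class CertFireEmitter:
    """Hypothetical stand-in for the critic's certificate-transition guard."""

    def __init__(self) -> None:
        self._stagnant = False

    def observe(self, should_be_stagnant: bool, payload: dict) -> bool:
        """Emit the marker on the first not-stagnant -> stagnant transition.

        Returns True only when a line was written.
        """
        was_stagnant = self._stagnant
        # Assumption: once stagnant, the flag stays set for the conversation.
        self._stagnant = was_stagnant or should_be_stagnant
        if was_stagnant or not should_be_stagnant:
            return False
        # Explicit stdout and explicit flush, so the line is written
        # immediately instead of sitting in a buffer.
        print("[CERT-FIRE] " + json.dumps(payload, sort_keys=True),
              file=sys.stdout, flush=True)
        return True
```

Returning a boolean is a choice made for this sketch so a test can assert on the firing decision separately from the captured stream.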
**Medium (M2, #848)**: ``--baseline-logs-dir`` / ``--treatment-logs-dir``
weren't validated against the run rows. A stale/mistyped directory
would silently yield ``certificates_emitted: 0`` because non-matching
filenames are ignored and missing matches look identical to "no
cert fires" — same bug class that ``_validate_eval_report_covers_rows``
catches for eval reports.
Fix: added ``_validate_logs_dir_covers_rows`` with three failure
modes surfaced clearly (dir doesn't exist, dir is empty, some
expected ``instance_<iid>.output.log`` files are missing). Called
from ``main()`` before the artifact is built. Extra files beyond the
row set are fine (only missing IDs trigger the error).
Tests: 4 new cases covering each failure mode + the happy path.
99 total tests pass, ruff clean.
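The three failure modes could be checked along these lines. This is a sketch rather than the actual ``_validate_logs_dir_covers_rows``: the signature, the ``ValueError`` exception type, and the message wording are all assumptions.

```python
from pathlib import Path
from typing import Iterable


def validate_logs_dir_covers_rows(logs_dir: Path, instance_ids: Iterable[str]) -> None:
    """Raise ValueError unless every instance id has a log file in logs_dir."""
    if not logs_dir.is_dir():
        raise ValueError(f"logs dir does not exist: {logs_dir}")
    present = {p.name for p in logs_dir.iterdir() if p.is_file()}
    if not present:
        raise ValueError(f"logs dir is empty: {logs_dir}")
    missing = sorted(
        iid for iid in instance_ids
        if f"instance_{iid}.output.log" not in present
    )
    if missing:
        # Extra files are fine; only ids without a log are an error.
        raise ValueError(
            f"{len(missing)} expected log file(s) missing in {logs_dir}: "
            + ", ".join(missing)
        )
```

Failing before the artifact is built keeps a mistyped directory from being reported as a genuine zero-cert result.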
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes the "cert metadata not serialized" caveat from the n=10 delta artifact. The blog post had to cite retry counts as a proxy for critic firing because no on-disk cert evidence existed; this PR lands the workaround as a side-channel stdout log.
Library (ships in v0.1.0a3):
- ``OperonStagnationCritic.evaluate()`` writes a single ``logger.info("[CERT-FIRE] %s", <json>)`` line on the certificate transition. Fires exactly once per critic instance per conversation (same guard as the cert emission itself).
- Payload: ``theorem``, ``source``, ``cert_evidence_n``, ``epiplexic_integral``, ``severity``, ``detection_index``.
- Instance_id is not in the payload; correlation happens downstream via the ``logs/instance_<iid>.output.log`` filename.
Generator:
- ``_load_cert_fires_from_logs`` parses both bare ``[CERT-FIRE]`` lines and the openhands ``[DOCKER] {"message": "[CERT-FIRE] {...}"}`` JSON wrapper.
- New CLI flags ``--baseline-logs-dir`` / ``--treatment-logs-dir`` (independently optional; back-compat with pre-instrumentation runs that parse to zero cert records).
- Per-instance ``treatment_certificate`` payload + summary ``certificates_emitted`` count/rate.
- Markdown: ``Certificates emitted`` headline row, ``Treat cert`` column in the instance table, and the "cert metadata not serialized" caveat flips to a "side-channel log via ``[CERT-FIRE]``" variant when real records are present.
Tests: +10 new, 94 total pass (3 critic + 7 generator).
Regen
Ran the updated generator against the existing logs. Count is 0/10 because the logs predate this library change. The new cert-caveat variant says so explicitly: "0 of 10 treatment instances have on-disk cert records on this run." Future post-instrumentation runs will populate real records.
Out of scope (parallel follow-ups)
- ``operon-langgraph-gates`` ``StagnationGate``: same instrumentation pattern, separate repo, separate plan.
- Upstream ``openhands-sdk`` patch to serialize ``CriticResult.metadata`` into event history; would obsolete this workaround.
Test plan
- ``pytest tests/``: 94 pass (3 critic + 7 generator + existing 84).
- ``ruff check scripts/ tests/`` clean.
- Regen with ``--baseline-logs-dir`` / ``--treatment-logs-dir`` produces valid JSON + md on pre-instrumentation logs (0 cert records, back-compat caveat text correct).
- ``gh release create v0.1.0a3`` → publish.yml → PyPI.
🤖 Generated with Claude Code