feat: on-disk cert evidence via [CERT-FIRE] stdout log + v0.1.0a3#6

Merged
coredipper merged 2 commits into main from feat/cert-stdout-log
Apr 21, 2026
@coredipper
Owner

Summary

Closes the "cert metadata not serialized" caveat from the n=10 delta artifact. The blog post had to cite retry counts as a proxy for critic firing because no on-disk cert evidence existed; this PR lands the workaround as a side-channel stdout log.

Library (ships in v0.1.0a3):

  • OperonStagnationCritic.evaluate() writes a single logger.info("[CERT-FIRE] %s", <json>) line on the certificate transition. Fires exactly once per critic instance per conversation (same guard as the cert emission itself). Payload: theorem, source, cert_evidence_n, epiplexic_integral, severity, detection_index.
  • instance_id is NOT in the payload because the critic doesn't receive one. Correlation happens downstream via the logs/instance_<iid>.output.log filename.
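
The once-per-instance guard described above can be sketched roughly as follows. This is a toy stand-in, not the actual `OperonStagnationCritic`: the class name, the `threshold` parameter, the repeat-counting heuristic, and every payload value are hypothetical placeholders. Only the `[CERT-FIRE]` prefix, the six payload keys, and the `was_stagnant` / `should_be_stagnant` transition check come from this PR.

```python
import json
import logging

logger = logging.getLogger("cert_fire_sketch")  # hypothetical logger name


class SingleFireCritic:
    """Toy critic that models only the fire-once-on-transition guard."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # hypothetical stagnation threshold
        self.was_stagnant = False   # latches to True after the first fire
        self.repeats = 0
        self.steps = 0

    def evaluate(self, repeated: bool) -> bool:
        """Record one step; return True only on the step where the cert fires."""
        self.steps += 1
        self.repeats = self.repeats + 1 if repeated else 0
        should_be_stagnant = self.repeats >= self.threshold

        fired = should_be_stagnant and not self.was_stagnant
        if fired:
            # Keys match the PR description; all values are placeholders.
            payload = {
                "theorem": "placeholder",
                "source": "placeholder",
                "cert_evidence_n": self.repeats,
                "epiplexic_integral": 0.0,
                "severity": 1,
                "detection_index": self.steps - 1,
            }
            logger.info("[CERT-FIRE] %s", json.dumps(payload, sort_keys=True))
            self.was_stagnant = True  # never reset, so no re-fire
        return fired
```

Because `was_stagnant` is never cleared, a second stagnation episode in the same conversation stays silent, which is the "no re-fire after fire" behaviour the critic tests pin down.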

Generator:

  • _load_cert_fires_from_logs parses both bare [CERT-FIRE] lines and the openhands [DOCKER] {"message": "[CERT-FIRE] {...}"} JSON-wrapper.
  • CLI: --baseline-logs-dir / --treatment-logs-dir (independently optional; back-compat with pre-instrumentation runs that parse to zero cert records).
  • Artifact: per-instance treatment_certificate payload + summary certificates_emitted count/rate.
  • Markdown: new Certificates emitted headline row, Treat cert column in the instance table, and the "cert metadata not serialized" caveat flips to a "side-channel log via [CERT-FIRE]" variant when real records are present.
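
A minimal sketch of parsing the two log shapes, assuming the Docker wrapper is a JSON object with a `message` string field as shown in the bullet above. The function names `parse_cert_fire_line` and `first_cert_fire` are hypothetical; the real `_load_cert_fires_from_logs` may be structured differently.

```python
import json
from typing import Iterable, Optional

MARKER = "[CERT-FIRE] "
DOCKER_PREFIX = "[DOCKER] "


def parse_cert_fire_line(line: str) -> Optional[dict]:
    """Return the cert payload carried by one log line, or None."""
    text = line.strip()

    # Wrapped shape: '[DOCKER] {"message": "[CERT-FIRE] {...}"}'
    if text.startswith(DOCKER_PREFIX):
        try:
            wrapper = json.loads(text[len(DOCKER_PREFIX):])
        except json.JSONDecodeError:
            return None
        if not isinstance(wrapper, dict):
            return None
        text = wrapper.get("message", "")
        if not isinstance(text, str):
            return None

    # Bare shape (or unwrapped message): '... [CERT-FIRE] {...}'
    idx = text.find(MARKER)
    if idx == -1:
        return None
    try:
        payload = json.loads(text[idx + len(MARKER):])
    except json.JSONDecodeError:
        return None
    return payload if isinstance(payload, dict) else None


def first_cert_fire(lines: Iterable[str]) -> Optional[dict]:
    """Keep only the first cert record found; later duplicates are ignored."""
    for line in lines:
        payload = parse_cert_fire_line(line)
        if payload is not None:
            return payload
    return None
```

Returning `None` for malformed lines rather than raising keeps pre-instrumentation logs parsing to zero records, in line with the back-compat note above.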

Tests: +10 new, 94 total pass:

  • 3 on the critic (single-fire on transition, no-fire without, no re-fire after fire)
  • 7 on the generator (parser with both log shapes, missing-dir, no-cert-lines, keeps-first-only, artifact population, markdown mutation, back-compat)

Regen

Ran the updated generator against the existing logs. Count is 0/10 — logs predate this library change. The new cert-caveat variant says so explicitly: "0 of 10 treatment instances have on-disk cert records on this run." Future post-instrumentation runs will populate real records.

Out of scope (parallel follow-ups)

  • Sibling operon-langgraph-gates StagnationGate — same instrumentation pattern, separate repo, separate plan.
  • Upstream openhands-sdk patch to serialize CriticResult.metadata into event history — would obsolete this workaround.

Test plan

  • pytest tests/ — 94 pass (3 critic + 7 generator + existing 84).
  • ruff check scripts/ tests/ clean.
  • Artifact regen via --baseline-logs-dir / --treatment-logs-dir produces valid JSON + md on pre-instrumentation logs (0 cert records, back-compat caveat text correct).
  • After merge: gh release create v0.1.0a3 → publish.yml → PyPI.

🤖 Generated with Claude Code

coredipper and others added 2 commits April 21, 2026 10:44
Closes caveat 2 from the n=10 delta md ("``OperonStagnationCritic``
emits ``CriticResult.metadata`` but openhands-sdk drops it"). The
blog post had to cite retry counts as a proxy for critic firing
because no on-disk cert evidence existed; v0.1.0a3 lands the
workaround.

**Library** (src/operon_openhands_gates/stagnation_critic.py):
``OperonStagnationCritic.evaluate()`` now writes a single
``logger.info("[CERT-FIRE] %s", <json>)`` line on the certificate
transition (guarded by the same ``was_stagnant / should_be_stagnant``
check, so fires exactly once per critic instance per conversation).
Payload: ``theorem``, ``source``, ``cert_evidence_n``,
``epiplexic_integral``, ``severity``, ``detection_index``.

``instance_id`` is NOT in the payload because the critic doesn't receive one.
Correlation happens downstream via the ``logs/instance_<iid>.output.log``
filename.
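
As an illustration of the filename-based correlation, the instance ID could be recovered along these lines. The helper name and the example ID used in the usage note are hypothetical; only the `instance_<iid>.output.log` pattern comes from this commit.

```python
import re
from pathlib import Path
from typing import Optional, Union

# Matches the log naming convention: instance_<iid>.output.log
_LOG_NAME = re.compile(r"^instance_(?P<iid>.+)\.output\.log$")


def instance_id_from_log(path: Union[str, Path]) -> Optional[str]:
    """Extract <iid> from a log path, or return None for unrelated files."""
    match = _LOG_NAME.match(Path(path).name)
    return match.group("iid") if match else None
```

For example, `instance_id_from_log("logs/instance_example-123.output.log")` yields `"example-123"`, while a file that does not follow the convention yields `None` and can be skipped.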

**Generator** (scripts/generate_delta_artifact.py): new
``_load_cert_fires_from_logs`` parser handling both bare ``[CERT-FIRE]``
lines and the openhands ``[DOCKER] {"message": "[CERT-FIRE] {...}"}``
wrapper. New CLI flags ``--baseline-logs-dir`` / ``--treatment-logs-dir``
(independently optional — back-compat with pre-instrumentation runs).
When logs are supplied:
- Per-instance ``baseline_certificate`` / ``treatment_certificate``
  payloads (or ``None``).
- Summary ``certificates_emitted`` + ``certificates_emitted_rate``.
- Markdown: new "Certificates emitted" headline row, "Treat cert"
  column in the instance table, caveat 2 flips to a "side-channel
  log" variant that names the ``[CERT-FIRE]`` prefix and the
  ``--*-logs-dir`` flags.
- "Next steps" drops the "fix cert serialization" item when real
  records are present (it's done).
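
A rough sketch of how the summary fields might be derived, assuming the count is taken over treatment rows and that each row is a dict holding a ``treatment_certificate`` entry that is either a payload or ``None``. The function name and row shape are assumptions, not the generator's actual code.

```python
from typing import Any, Dict, List


def summarize_certificates(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count rows with a treatment cert record and express it as a rate."""
    total = len(rows)
    emitted = sum(1 for row in rows if row.get("treatment_certificate") is not None)
    # Guard against an empty run so the rate is always defined.
    rate = emitted / total if total else 0.0
    return {
        "certificates_emitted": emitted,
        "certificates_emitted_rate": rate,
    }
```

On logs that contain no cert lines, every row carries ``None`` and the summary reports a count of 0 with a rate of 0.0.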

**Tests** (+10 new, 94 total pass):
- 3 on the critic: single-fire on transition, no-fire without
  transition, no re-fire after initial fire.
- 7 on the generator: parser (Docker-wrapped + bare), missing-dir,
  no-cert-lines, keeps-first-only on duplicate, artifact population,
  markdown mutation + back-compat.

**Regen**: artifact regenerated with ``--*-logs-dir`` against
existing logs. Count is 0/10 because the logs predate this
library change (captured from v0.1.0a2 inference runs); the
markdown's new cert caveat variant calls that out explicitly.
Future post-instrumentation runs will populate real cert records.

Version: 0.1.0a2 → 0.1.0a3.

Out of scope (parallel follow-ups):
- Sibling ``operon-langgraph-gates`` ``StagnationGate`` would benefit
  from the same instrumentation — separate plan.
- Upstream openhands-sdk patch to serialize ``CriticResult.metadata``
  into event history — would obsolete the side-channel log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings on the v0.1.0a3 cert-log PR:

**High (M1, #848)**: ``logger.info(...)`` writes to stderr by default
(Python stdlib logging's StreamHandler defaults to ``sys.stderr``).
The benchmarks runner captures *stdout* into
``logs/instance_<iid>.output.log``, so our cert marker would have
been invisible to the downstream parser in the common container
config — silently breaking the advertised workaround.
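
The stream mismatch is easy to reproduce with the standard library alone. The sketch below uses a hypothetical logger name and helper and does not touch the critic; it only shows that a default ``logging.StreamHandler`` lands on stderr, so a stdout-only capture sees nothing.

```python
import contextlib
import io
import logging
from typing import Tuple


def capture_info_log(message: str) -> Tuple[str, str]:
    """Log one INFO line through a default StreamHandler; return (stdout, stderr)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        log = logging.getLogger("stderr_demo")  # hypothetical logger name
        log.setLevel(logging.INFO)
        log.propagate = False
        # No stream argument: StreamHandler binds to the current sys.stderr.
        handler = logging.StreamHandler()
        log.addHandler(handler)
        try:
            log.info(message)
        finally:
            log.removeHandler(handler)
    return out.getvalue(), err.getvalue()
```

Calling ``capture_info_log("[CERT-FIRE] {}")`` returns an empty stdout string, with the marker line present only in the stderr string.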

Fix: emit via ``print("[CERT-FIRE] ...", file=sys.stdout, flush=True)``
in ``OperonStagnationCritic.evaluate()``. Explicit stdout, explicit
flush (so the line lands before the container teardown truncates
buffered output). Dropped the module-level ``logger`` since it's
no longer used for this purpose.
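
A minimal sketch of the stdout emission, assuming the payload is serialized with ``json.dumps``. The helper name ``emit_cert_fire`` and the ``sort_keys`` choice are illustrative, not necessarily what the library does.

```python
import json
import sys
from typing import Any, Dict


def emit_cert_fire(payload: Dict[str, Any]) -> None:
    """Write one marker line to stdout and flush immediately."""
    line = "[CERT-FIRE] " + json.dumps(payload, sort_keys=True)
    # sys.stdout is looked up at call time, so stream redirection
    # (for example pytest's capsys) observes the line.
    print(line, file=sys.stdout, flush=True)
```

Flushing on every call costs little here because the line is written at most once per critic instance.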

Tests: swapped ``caplog`` for ``capsys`` in the 3 existing cert-fire
tests so they exercise the actual emitted stream (not the logging
hierarchy, which doesn't reflect what the runner captures). Added
``test_cert_fire_goes_to_stdout_not_stderr`` as an explicit
channel-pinning regression.

**Medium (M2, #848)**: ``--baseline-logs-dir`` / ``--treatment-logs-dir``
weren't validated against the run rows. A stale/mistyped directory
would silently yield ``certificates_emitted: 0`` because non-matching
filenames are ignored and missing matches look identical to "no
cert fires" — same bug class that ``_validate_eval_report_covers_rows``
catches for eval reports.

Fix: added ``_validate_logs_dir_covers_rows`` with three failure
modes surfaced clearly (dir doesn't exist, dir is empty, some
expected ``instance_<iid>.output.log`` files are missing). Called
from ``main()`` before the artifact is built. Extra files beyond the
row set are fine (only missing IDs trigger the error).
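
The three failure modes could be checked roughly as below. This is a sketch under assumptions: the real ``_validate_logs_dir_covers_rows`` signature, exception type, and messages are not shown in this PR, so the public-style name, the ``ValueError``, and the message wording here are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Union


def validate_logs_dir_covers_rows(
    logs_dir: Union[str, Path], instance_ids: Iterable[str]
) -> None:
    """Raise ValueError unless every instance ID has a matching log file."""
    root = Path(logs_dir)
    if not root.is_dir():
        raise ValueError(f"logs dir does not exist: {root}")

    present = {p.name for p in root.iterdir() if p.is_file()}
    if not present:
        raise ValueError(f"logs dir is empty: {root}")

    # Extra files are tolerated; only missing expected logs are an error.
    missing = sorted(
        iid for iid in instance_ids
        if f"instance_{iid}.output.log" not in present
    )
    if missing:
        raise ValueError(
            f"{len(missing)} expected log file(s) missing in {root}: "
            + ", ".join(missing)
        )
```

Failing loudly before the artifact is built keeps a mistyped directory from being reported as a run with zero cert fires.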

Tests: 4 new cases covering each failure mode + the happy path.

99 total tests pass, ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coredipper coredipper merged commit a5948ef into main Apr 21, 2026
2 checks passed
@coredipper coredipper deleted the feat/cert-stdout-log branch April 21, 2026 09:03

1 participant