release: v0.35.0 — Phase C cross-model confirmation + retry null result by coredipper · Pull Request #59 · coredipper/operon

coredipper · 2026-04-18T18:52:43Z

Phase C from the v0.34.5 blog post's "what's next" list lands: a cross-model check of the 8B format-discipline ceiling claim, with format-correction retry active. PR #57 (timeout fix) and PR #58 (retry infrastructure) are already on main from earlier in this session; this PR adds the new artifact, the paper §6.3 Phase C writeup, and the version bump.

Highlights

Metric	gemma4:latest (v0.34.5)	deepseek-r1:8b (v0.35.0)
Total evaluated	1/30	0/30
Sanitizer-rejected	27	30
Runtime errors	2	0
Resolved	0	0
Retry active	no	yes
Retry-recovered patches	—	0
Mean latency (baseline, completed)	131 s	1084 s (18 min)

Two load-bearing findings:

Cross-model ceiling is sharp. The single v0.34.5 "survivor" (django-11001 baseline, unresolved) was gemma4-specific: deepseek-r1:8b could not produce a diff that applies cleanly with git apply for that instance. The 8B-class format-discipline ceiling is not a single-model artifact.
Retry-with-reason-code did not help. 0 retry-recovered patches across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative than the v0.34.5 paper predicted (we said retry "might recover 20–40% of sanitizer drops"; it recovered zero).

Model identity

tag=deepseek-r1:8b
digest=6995872bfe4c
blob_sha256=e6a7edc1a4d7d9b2de136a221a57336b76316cfe53a252aeba814496c5ae439d
architecture=qwen3  (DeepSeek-R1 distilled onto Qwen3-8B)
parameters=8.2B
quantization=Q4_K_M

Observations from the live logs

Reason-code vocabulary fired in the wild: overlong_hunk and truncated_hunk both appeared across different instances, showing that the diagnostic categories are exercised by real model output.
Repo-dependent latency: astropy instances took ~27 min/call, django instances ~10 min/call, with the same model, grounded context, and retry settings. The gap is likely driven by issue complexity (stack traces and import paths in astropy versus simpler descriptions in django). Worth flagging for future cross-repo benchmarking.
Zero runtime errors: the 900s timeout from PR #57 (feat: raise LLM timeout to 900s for Phase 2 reasoning-model readiness) absorbs reasoning-model <think> blocks. Two timeouts that appeared in v0.34.5 (astropy-12907, astropy-14995) are now clean completions in this artifact.

Changes

eval/results/swebench_phase2_deepseek_retry.json — the new artifact (post_run_check=match, harness=ok, 0/30 evaluated)
article/paper5/sections/06-experiments.tex — new \subsubsection{Phase C: Cross-Model Check with Format-Correction Retry} after the v0.34.5 grounded-rerun content; per-condition outcome table, comparison table, what-Phase-C-does-and-does-not-say paragraph
article/paper5/sections/07-conclusion.tex — limitations item updated to "SWE-bench-lite ceiling at 8B, confirmed cross-model"
article/paper5/main.pdf — rebuilt (114KB → 121KB, the Phase C content is the delta)
Version bump 0.34.5 → 0.35.0 (pyproject.toml / __init__.py / README badge)

Test plan

Artifact schema test: pytest tests/unit/test_swebench_phase2_identity.py -q — 102 pass
post_run_check.status = match on the new artifact
harness=ok across all three conditions
Paper PDF rebuilt without new warnings
Merge + release v0.35.0 → PyPI

What's next (not in this PR)

70B+ cloud-GPU rerun (Modal / Groq free-tier Llama-3.3-70B) to test whether retry recovers at larger scale
Blog follow-up post on Phase C
Site release notes for v0.35.0

🤖 Generated with Claude Code

Ships the Phase C experiment described in the v0.34.5 blog's "what's next" list: a locally-runnable cross-model check of the 8B format-discipline ceiling claim, with format-correction retry active.

Artifact: eval/results/swebench_phase2_deepseek_retry.json
- model: deepseek-r1:8b (DeepSeek-R1 distilled onto Qwen3-8B), digest 6995872bfe4c, 8.2B params, Q4_K_M, 131k context
- Same 10 SWE-bench-lite instances as v0.34.5, same grounded pipeline
- --retry-on-reject active (format-correction retry)
- post_run_check.status = match
- harness=ok across all three conditions
- Total wall-clock: ~22 hours (reasoning models are slow)

Outcome (all three conditions, 10 instances each):
- resolved=0, unresolved=0, sanitizer-rejected=10, runtime_error=0
- mean_latency: baseline 1084s, organism 1073s, langgraph 1130s
- retry-recovered patches: 0 across all 30 submissions

Two load-bearing findings for Paper 5 §6.3 Phase C:

1. Cross-model ceiling is sharp. Both gemma4 (v0.34.5) and deepseek-r1:8b (v0.35.0) produce zero resolved instances. Gemma4 crossed the sanitizer once (django-11001 baseline -> unresolved); deepseek-r1 crossed it zero times. The single v0.34.5 "survivor" was gemma4-specific, not a structurally easy issue.

2. Retry-with-reason-code did not help. The retry mechanism fires correctly (we observed overlong_hunk and truncated_hunk reason codes in the live logs), retry prompts embed the reason + failed output verbatim with reason-specific guidance, but zero "retry recovered" events logged across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative result than the v0.34.5 paper predicted (we suggested retry "might recover 20-40% of sanitizer drops"); at 8B it recovers zero.

Paper 5 §6.3 adds new subsubsection "Phase C: Cross-Model Check with Format-Correction Retry" with:
- Setup (deepseek-r1:8b identity details)
- Per-condition outcome table
- Comparison table vs gemma4 v0.34.5
- What Phase C does and does not say
- Next wedge: 70B+ via cloud GPU, outside local-only scope

Paper 5 §7 Limitations updated: "SWE-bench-lite ceiling at 8B, confirmed cross-model".

Paper PDF rebuilt via tectonic (114KB -> 121KB; the Phase C content is the delta).

Infrastructure shipped in this release (already on main from PRs #57, #58, #57's #753 and #58's #755 follow-ups):
- 900s LLM timeout + 60s probe timeout split (PR #57)
- sanitize_with_reason() returning (patch, reason_code) tuple with 8 machine-readable reason codes (PR #58)
- _FORMAT_RETRY_MAX = 1 retry callback plumbing in all three runners (PR #58)
- EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected empty_patch (carried over from v0.34.5)
- classify_prediction() as authoritative override (carried over)
- --retry-on-reject and --output CLI flags (PR #58)
- --output honored under --rewrite-envelope (PR #58's #755 fix)
- overlong-hunk classification handling of bare empty extra (same)

Versions bumped:
- pyproject.toml 0.34.5 -> 0.35.0
- operon_ai/__init__.py 0.34.5 -> 0.35.0
- README.md badge v0.34.5 -> v0.35.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
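
The retry plumbing listed above (a sanitizer that returns a (patch, reason_code) pair, plus at most _FORMAT_RETRY_MAX = 1 format-correction retry) can be sketched roughly as follows. This is a minimal illustration, not the code in eval/swebench_phase2.py: the toy checks inside sanitize_with_reason, the helper names build_retry_prompt and sanitize_with_retry, and the prompt wording are all assumptions made for the example. Only the tuple shape, the retry cap, and the reason-code names come from the commit message.

```python
from typing import Callable, Tuple

_FORMAT_RETRY_MAX = 1  # retry cap stated in the commit message


def sanitize_with_reason(raw: str) -> Tuple[str, str]:
    """Toy stand-in returning (patch, reason_code); reason_code is "" on success.

    The two checks below are placeholders, not the real sanitizer's rules.
    """
    if "diff --git" not in raw:
        return "", "empty_extraction"
    if "@@" not in raw:
        return "", "truncated_hunk"
    return raw, ""


def build_retry_prompt(reason: str, failed_output: str) -> str:
    """Hypothetical prompt: embeds the reason code and the failed output verbatim."""
    return (
        f"Your previous patch was rejected (reason: {reason}).\n"
        "Reply with a unified diff only.\n"
        f"--- previous output ---\n{failed_output}"
    )


def sanitize_with_retry(
    raw: str, regenerate: Callable[[str], str]
) -> Tuple[str, str, bool, bool]:
    """Return (patch, final_reason, retry_attempted, retry_recovered)."""
    patch, reason = sanitize_with_reason(raw)
    attempts = 0
    # Retry only when the sanitizer rejected the output and the cap allows it.
    while not patch and attempts < _FORMAT_RETRY_MAX:
        attempts += 1
        raw = regenerate(build_retry_prompt(reason, raw))
        patch, reason = sanitize_with_reason(raw)
    retry_attempted = attempts > 0
    retry_recovered = retry_attempted and bool(patch)
    return patch, reason, retry_attempted, retry_recovered


GOOD_DIFF = "diff --git a/x.py b/x.py\n@@ -1 +1 @@\n-a\n+b\n"

# A model that never emits diff-shaped text: the retry fires but recovers nothing.
stuck = sanitize_with_retry("I think the fix is...", lambda prompt: "Still prose.")
# A model that misformats once and corrects itself on the retry.
fixed = sanitize_with_retry("diff --git a/x.py b/x.py\n", lambda prompt: GOOD_DIFF)
```

In this sketch the first case mirrors the regime reported for the 8B run (retry attempted, nothing recovered), while the second is the "occasional misformatter" regime that retry is meant to help.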

MEDIUM review #757: v0.35.0 Phase C paper claims "0 retry-recovered across 30 submissions" and cites specific reason codes in §6.3, but the committed artifact only stored `empty_patch` for every rejection with no field showing whether grounding was enabled, whether retry ran, which reason code fired, or whether the failure was a sanitizer rejection versus empty extraction. The published evidence was weaker than the conclusions.

Fix: extend the artifact schema and retrofit both v0.34.5 and v0.35 artifacts so the evidence catches up to the conclusions.

Schema additions (eval/swebench_phase2.py):

Top-level envelope:
- `grounding: bool`: mirrors --grounding
- `retry_on_reject: bool`: mirrors --retry-on-reject

Per-result dict:
- `sanitize_reason: str`: final-attempt reason code from sanitize_with_reason (one of SANITIZE_REASONS); "" on success
- `retry_attempted: bool`: whether the retry callback was invoked
- `retry_recovered: bool`: whether a retry produced a usable patch

Prediction dataclass: three new fields with safe defaults.

SanitizeOutcome: new NamedTuple returned by _sanitize_for_submission and _try_format_retry, carrying (patch, reason, retry_attempted, retry_recovered) so the runner can populate Prediction directly.

All three runners (run_baseline, run_organism, run_langgraph) updated to unpack the outcome and pass the metadata through.

build_artifact: new grounding and retry_on_reject kwargs (both default False so v0.34.x-era callers still produce a valid envelope without plumbing them through).

Retrofit:
- eval/results/swebench_phase2.json (v0.34.5 gemma4): grounding=True, retry_on_reject=False. Per-result sanitize_reason left empty (not recorded live); retry fields False.
- eval/results/swebench_phase2_deepseek_retry.json (v0.35 deepseek): grounding=True, retry_on_reject=True. Per-result sanitize_reason populated by parsing the live log; retry_attempted=True for every empty_patch row. Reason distribution: 26 empty_extraction, 3 overlong_hunk, 1 truncated_hunk. retry_recovered=0/30.

Paper §6.3 Phase C updated with the reason distribution + sharper framing: at 8B, 26/30 failures are "doesn't produce diff-shaped output in the first place" (empty_extraction), only 4/30 are "produces diff-shaped output with fixable errors". Retry-with-guidance is calibrated for the second regime but the model sits mostly in the first. The grounding-specific reasons (path_not_found, ambiguous_path, placeholder_hunk) never fired, reinforcing the v0.34.5 claim that file selection is not the bottleneck.

Tests (4 new in test_swebench_phase2_schema_v035.py):
- Prediction has retry_attempted, retry_recovered, sanitize_reason fields with safe defaults
- build_artifact exposes grounding and retry_on_reject at the top level with backward-compat defaults
- Existing schema-shape test updated to include the two new top-level keys
- Existing retry-plumbing tests updated to unpack SanitizeOutcome

105/105 tests pass across sanitizer, retry, timeout, identity, schema-v035 suites (+4 new since PR #59 opened). PDF rebuilt via tectonic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
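
To make the schema extension concrete, here is a rough sketch of how a SanitizeOutcome tuple could feed the three new Prediction fields, and how a reason distribution like the one reported could be tallied from per-result rows. The SanitizeOutcome field order and the three new field names follow the commit message. Everything else is assumed for illustration: the instance_id and condition fields, the helper names to_prediction and reason_distribution, and the synthetic rows, which only reproduce the reported totals and are not the real artifact contents.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, NamedTuple


class SanitizeOutcome(NamedTuple):
    patch: str
    reason: str            # "" on success, else a reason code
    retry_attempted: bool
    retry_recovered: bool


@dataclass
class Prediction:
    instance_id: str       # hypothetical field, for illustration
    condition: str         # hypothetical field: baseline / organism / langgraph
    patch: str = ""
    # The three new fields, with safe defaults so older callers keep working.
    sanitize_reason: str = ""
    retry_attempted: bool = False
    retry_recovered: bool = False


def to_prediction(instance_id: str, condition: str, outcome: SanitizeOutcome) -> Prediction:
    """Copy the sanitizer metadata straight onto the per-result record."""
    return Prediction(
        instance_id=instance_id,
        condition=condition,
        patch=outcome.patch,
        sanitize_reason=outcome.reason,
        retry_attempted=outcome.retry_attempted,
        retry_recovered=outcome.retry_recovered,
    )


def reason_distribution(rows: Iterable[Prediction]) -> Counter:
    """Count final-attempt reason codes, skipping successful rows."""
    return Counter(r.sanitize_reason for r in rows if r.sanitize_reason)


# Synthetic rows that only reproduce the reported totals (26 / 3 / 1).
reasons = ["empty_extraction"] * 26 + ["overlong_hunk"] * 3 + ["truncated_hunk"]
conditions = ["baseline", "organism", "langgraph"]
rows = [
    to_prediction(f"instance-{i}", conditions[i % 3], SanitizeOutcome("", reason, True, False))
    for i, reason in enumerate(reasons)
]
dist = reason_distribution(rows)
recovered = sum(r.retry_recovered for r in rows)
```

With rows shaped like this, the "26/30 versus 4/30" framing reduces to reading dist["empty_extraction"] against the sum of the remaining counts.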

…e rewrite

MEDIUM review #758 flagged two issues:

1. `_rewrite_envelope()` did not thread the new `grounding` and `retry_on_reject` fields from the existing artifact into `build_artifact()`. Any envelope rewrite silently reset both run-defining flags to False, defeating the point of persisting them.

2. The committed deepseek artifact showed `grounding=false` and `retry_on_reject=false` at the top level despite the paper and commit message describing it as the grounded retry run. Every per-result row had `retry_attempted=true`, making the artifact internally inconsistent. Root cause: an earlier `--rewrite-envelope` call (predating the schema extension) dropped the flags.

Fix:
- `_rewrite_envelope` now passes `existing.get("grounding", False)` and `existing.get("retry_on_reject", False)` to `build_artifact()`. Pre-flag artifacts (v0.34.x-era) still default to False via the `.get` fallback, so backward compat is preserved.
- Regenerated the deepseek artifact top-level flags to `grounding=true, retry_on_reject=true` and ran `--rewrite-envelope` through the fixed code path to confirm the flags survive (they do, verified via post-rewrite inspection). post_run_check.status=match.

Two new tests:
* `test_rewrite_envelope_preserves_grounding_and_retry_flags`: locks the preservation invariant against regression.
* `test_rewrite_envelope_defaults_flags_when_source_lacks_them`: confirms pre-flag artifacts (no key present) still produce a valid rewrite with False defaults (backward compat).

107/107 tests pass (+2 new since #757 landed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
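
The flag-preservation fix, and the kind of internal inconsistency the review caught, can be sketched as below. This is a simplified stand-in, not the repository's code. The `.get(..., False)` fallback and the two flag names come from the commit message. The `results` key, the reduced build_artifact signature, and the flags_consistent helper are assumptions for the example.

```python
from typing import Any, Dict, List


def build_artifact(
    results: List[Dict[str, Any]],
    *,
    grounding: bool = False,
    retry_on_reject: bool = False,
) -> Dict[str, Any]:
    """Simplified envelope builder; the real one carries more metadata."""
    return {
        "grounding": grounding,
        "retry_on_reject": retry_on_reject,
        "results": list(results),  # "results" is a hypothetical key name
    }


def rewrite_envelope(existing: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the envelope while carrying the run-defining flags forward.

    Artifacts written before the flags existed lack both keys and fall back to False.
    """
    return build_artifact(
        existing.get("results", []),
        grounding=existing.get("grounding", False),
        retry_on_reject=existing.get("retry_on_reject", False),
    )


def flags_consistent(artifact: Dict[str, Any]) -> bool:
    """A row with retry_attempted=True implies the run had retry_on_reject=True."""
    any_retry = any(r.get("retry_attempted", False) for r in artifact.get("results", []))
    return artifact.get("retry_on_reject", False) or not any_retry


row = {"sanitize_reason": "empty_extraction", "retry_attempted": True, "retry_recovered": False}
flagged = build_artifact([row], grounding=True, retry_on_reject=True)
legacy = {"results": [{"sanitize_reason": ""}]}  # pre-flag artifact, no flag keys
broken = build_artifact([row])  # flags dropped, as in the state the review flagged

rewritten = rewrite_envelope(flagged)
rewritten_legacy = rewrite_envelope(legacy)
```

A check along the lines of flags_consistent would have failed on the artifact state the review described, where every row recorded a retry but the envelope said retry was off.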

coredipper and others added 3 commits April 18, 2026 20:52

coredipper merged commit 0238a6c into main Apr 18, 2026
4 checks passed

coredipper deleted the feat/phase2-phase-c-crossmodel branch April 22, 2026 14:49

