release: v0.35.0 - Phase C cross-model confirmation + retry null result #59
Merged
Ships the Phase C experiment described in the v0.34.5 blog's "what's next" list: a locally-runnable cross-model check of the 8B format-discipline ceiling claim, with format-correction retry active.

Artifact: eval/results/swebench_phase2_deepseek_retry.json
- model: deepseek-r1:8b (DeepSeek-R1 distilled onto Qwen3-8B), digest 6995872bfe4c, 8.2B params, Q4_K_M, 131k context
- Same 10 SWE-bench-lite instances as v0.34.5, same grounded pipeline
- --retry-on-reject active (format-correction retry)
- post_run_check.status = match
- harness=ok across all three conditions
- Total wall-clock: ~22 hours (reasoning models are slow)

Outcome (all three conditions, 10 instances each):
- resolved=0, unresolved=0, sanitizer-rejected=10, runtime_error=0
- mean_latency: baseline 1084s, organism 1073s, langgraph 1130s
- retry-recovered patches: 0 across all 30 submissions

Two load-bearing findings for Paper 5 §6.3 Phase C:

1. Cross-model ceiling is sharp. Both gemma4 (v0.34.5) and deepseek-r1:8b (v0.35.0) produce zero resolved instances. Gemma4 crossed the sanitizer once (django-11001 baseline -> unresolved); deepseek-r1 crossed it zero times. The single v0.34.5 "survivor" was gemma4-specific, not a structurally easy issue.

2. Retry-with-reason-code did not help. The retry mechanism fires correctly (we observed overlong_hunk and truncated_hunk reason codes in the live logs), retry prompts embed the reason + failed output verbatim with reason-specific guidance, but zero "retry recovered" events logged across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative result than the v0.34.5 paper predicted (we suggested retry "might recover 20-40% of sanitizer drops"); at 8B it recovers zero.

Paper 5 §6.3 adds new subsubsection "Phase C: Cross-Model Check with Format-Correction Retry" with:
- Setup (deepseek-r1:8b identity details)
- Per-condition outcome table
- Comparison table vs gemma4 v0.34.5
- What Phase C does and does not say
- Next wedge: 70B+ via cloud GPU, outside local-only scope

Paper 5 §7 Limitations updated: "SWE-bench-lite ceiling at 8B, confirmed cross-model".

Paper PDF rebuilt via tectonic (114KB -> 121KB; the Phase C content is the delta).

Infrastructure shipped in this release (already on main from PRs #57, #58, #57's #753 and #58's #755 follow-ups):
- 900s LLM timeout + 60s probe timeout split (PR #57)
- sanitize_with_reason() returning (patch, reason_code) tuple with 8 machine-readable reason codes (PR #58)
- _FORMAT_RETRY_MAX = 1 retry callback plumbing in all three runners (PR #58)
- EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected empty_patch (carried over from v0.34.5)
- classify_prediction() as authoritative override (carried over)
- --retry-on-reject and --output CLI flags (PR #58)
- --output honored under --rewrite-envelope (PR #58's #755 fix)
- overlong-hunk classification handling of bare empty extra (same)

Versions bumped:
- pyproject.toml 0.34.5 -> 0.35.0
- operon_ai/__init__.py 0.34.5 -> 0.35.0
- README.md badge v0.34.5 -> v0.35.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
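
For readers who have not seen PR #58, the retry plumbing described above can be sketched roughly as follows. This is a minimal illustration, not the code in eval/swebench_phase2.py: the sanitizer checks, `build_retry_prompt`, and `submit_with_retry` are hypothetical stand-ins. Only the `(patch, reason_code)` tuple shape, the reason-code names, and `_FORMAT_RETRY_MAX = 1` come from the notes above.

```python
from typing import Callable, Tuple

_FORMAT_RETRY_MAX = 1  # from the release notes: one format-correction retry


def sanitize_with_reason(raw: str) -> Tuple[str, str]:
    """Toy stand-in for the real sanitizer: returns (patch, reason_code).

    The checks below are illustrative assumptions; only the tuple shape and
    the reason-code names are taken from the release notes.
    """
    text = raw.strip()
    if not text.startswith("diff --git"):
        return "", "empty_extraction"   # no diff-shaped output at all
    if "\n@@" not in text:
        return "", "truncated_hunk"     # file header present, hunk missing
    return text + "\n", ""              # "" means the patch was accepted


def build_retry_prompt(reason: str, failed_output: str) -> str:
    """Hypothetical prompt builder: reason code plus failed output verbatim."""
    return (
        f"Your previous patch was rejected ({reason}).\n"
        "Reply with a unified diff only.\n"
        f"--- previous output ---\n{failed_output}"
    )


def submit_with_retry(
    raw: str,
    regenerate: Callable[[str], str],
    retry_on_reject: bool = True,
) -> Tuple[str, str, int]:
    """Sanitize once, then retry at most _FORMAT_RETRY_MAX times on rejection."""
    patch, reason = sanitize_with_reason(raw)
    retries = 0
    while not patch and retry_on_reject and retries < _FORMAT_RETRY_MAX:
        retries += 1
        raw = regenerate(build_retry_prompt(reason, raw))
        patch, reason = sanitize_with_reason(raw)
    return patch, reason, retries
```

In this sketch, a model that answers the retry prompt with more prose ends with an empty patch and reason `empty_extraction` after one retry, which is the pattern the Phase C logs show for most submissions.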
MEDIUM review #757: v0.35.0 Phase C paper claims "0 retry-recovered across 30 submissions" and cites specific reason codes in §6.3, but the committed artifact only stored `empty_patch` for every rejection with no field showing whether grounding was enabled, whether retry ran, which reason code fired, or whether the failure was a sanitizer rejection versus empty extraction. The published evidence was weaker than the conclusions.

Fix: extend the artifact schema and retrofit both v0.34.5 and v0.35 artifacts so the evidence catches up to the conclusions.

Schema additions (eval/swebench_phase2.py):

Top-level envelope:
- `grounding: bool`: mirrors --grounding
- `retry_on_reject: bool`: mirrors --retry-on-reject

Per-result dict:
- `sanitize_reason: str`: final-attempt reason code from sanitize_with_reason (one of SANITIZE_REASONS); "" on success
- `retry_attempted: bool`: whether the retry callback was invoked
- `retry_recovered: bool`: whether a retry produced a usable patch

Prediction dataclass: three new fields with safe defaults.

SanitizeOutcome: new NamedTuple returned by _sanitize_for_submission and _try_format_retry, carrying (patch, reason, retry_attempted, retry_recovered) so the runner can populate Prediction directly. All three runners (run_baseline, run_organism, run_langgraph) updated to unpack the outcome and pass the metadata through.

build_artifact: new grounding and retry_on_reject kwargs (both default False so v0.34.x-era callers still produce a valid envelope without plumbing them through).

Retrofit:
- eval/results/swebench_phase2.json (v0.34.5 gemma4): grounding=True, retry_on_reject=False. Per-result sanitize_reason left empty (not recorded live); retry fields False.
- eval/results/swebench_phase2_deepseek_retry.json (v0.35 deepseek): grounding=True, retry_on_reject=True. Per-result sanitize_reason populated by parsing the live log; retry_attempted=True for every empty_patch row. Reason distribution: 26 empty_extraction, 3 overlong_hunk, 1 truncated_hunk. retry_recovered=0/30.

Paper §6.3 Phase C updated with the reason distribution + sharper framing: at 8B, 26/30 failures are "doesn't produce diff-shaped output in the first place" (empty_extraction), only 4/30 are "produces diff-shaped output with fixable errors". Retry-with-guidance is calibrated for the second regime but the model sits mostly in the first. The grounding-specific reasons (path_not_found, ambiguous_path, placeholder_hunk) never fired, reinforcing the v0.34.5 claim that file selection is not the bottleneck.

Tests (4 new in test_swebench_phase2_schema_v035.py):
- Prediction has retry_attempted, retry_recovered, sanitize_reason fields with safe defaults
- build_artifact exposes grounding and retry_on_reject at the top level with backward-compat defaults
- Existing schema-shape test updated to include the two new top-level keys
- Existing retry-plumbing tests updated to unpack SanitizeOutcome

105/105 tests pass across sanitizer, retry, timeout, identity, schema-v035 suites (+4 new since PR #59 opened). PDF rebuilt via tectonic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
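
The schema extension above might look roughly like the sketch below. The new field names and the `(patch, reason, retry_attempted, retry_recovered)` ordering come from the commit message. Everything else is an assumption made for illustration: `instance_id`, `model_patch`, the helper `to_prediction`, and the reduced `build_artifact` signature are hypothetical, and the real envelope carries more keys than shown. The 30 rows are synthetic placeholders built only to reproduce the reported counts.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, NamedTuple


class SanitizeOutcome(NamedTuple):
    """Field order as described in the commit message."""
    patch: str
    reason: str
    retry_attempted: bool
    retry_recovered: bool


@dataclass
class Prediction:
    instance_id: str              # assumed field, for illustration only
    model_patch: str = ""         # assumed field, for illustration only
    sanitize_reason: str = ""     # new: "" on success
    retry_attempted: bool = False  # new: safe default
    retry_recovered: bool = False  # new: safe default


def to_prediction(instance_id: str, outcome: SanitizeOutcome) -> Prediction:
    """Hypothetical helper: copy the outcome metadata onto a Prediction."""
    return Prediction(instance_id, outcome.patch, outcome.reason,
                      outcome.retry_attempted, outcome.retry_recovered)


def build_artifact(results: List[Prediction], *, grounding: bool = False,
                   retry_on_reject: bool = False) -> Dict[str, Any]:
    """Simplified envelope: both new flags default to False for old callers."""
    return {
        "grounding": grounding,
        "retry_on_reject": retry_on_reject,
        "results": [asdict(p) for p in results],
    }


# Synthetic rows matching the reported distribution (counts from the text).
reported = {"empty_extraction": 26, "overlong_hunk": 3, "truncated_hunk": 1}
rows = [
    to_prediction(f"row-{reason}-{i}", SanitizeOutcome("", reason, True, False))
    for reason, n in reported.items()
    for i in range(n)
]
artifact = build_artifact(rows, grounding=True, retry_on_reject=True)
counts = Counter(r["sanitize_reason"] for r in artifact["results"])
recovered = sum(r["retry_recovered"] for r in artifact["results"])
```

With the per-result fields present, the 26/3/1 split and the 0/30 recovery figure can be recomputed from the artifact alone instead of from the live log.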
…e rewrite
MEDIUM review #758 flagged two issues:
1. `_rewrite_envelope()` did not thread the new `grounding` and
`retry_on_reject` fields from the existing artifact into
`build_artifact()`. Any envelope rewrite silently reset both
run-defining flags to False, defeating the point of persisting them.
2. The committed deepseek artifact showed `grounding=false` and
`retry_on_reject=false` at the top level despite the paper and
commit message describing it as the grounded retry run. Every
per-result row had `retry_attempted=true`, making the artifact
internally inconsistent. Root cause: an earlier
`--rewrite-envelope` call (predating the schema extension)
dropped the flags.
Fix:
- `_rewrite_envelope` now passes
`existing.get("grounding", False)` and
`existing.get("retry_on_reject", False)` to `build_artifact()`.
Pre-flag artifacts (v0.34.x-era) still default to False via the
`.get` fallback, so backward compat is preserved.
- Regenerated the deepseek artifact top-level flags to
`grounding=true, retry_on_reject=true` and ran
`--rewrite-envelope` through the fixed code path to confirm the
flags survive (they do, verified via post-rewrite inspection).
post_run_check.status=match.
Two new tests:
* `test_rewrite_envelope_preserves_grounding_and_retry_flags` —
locks the preservation invariant against regression.
* `test_rewrite_envelope_defaults_flags_when_source_lacks_them` —
confirms pre-flag artifacts (no key present) still produce a
valid rewrite with False defaults (backward compat).
107/107 tests pass (+2 new since #757 landed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
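
The preservation fix from review #758 can be illustrated with a reduced sketch. It assumes a simplified `build_artifact` and a hypothetical `rewrite_envelope` function; the real `_rewrite_envelope()` handles more of the envelope than the two flags shown here. The `.get(..., False)` fallback is the part taken directly from the commit message.

```python
from typing import Any, Dict, List


def build_artifact(results: List[Dict[str, Any]], *, grounding: bool = False,
                   retry_on_reject: bool = False) -> Dict[str, Any]:
    """Simplified stand-in for the real envelope builder."""
    return {
        "grounding": grounding,
        "retry_on_reject": retry_on_reject,
        "results": results,
    }


def rewrite_envelope(existing: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the envelope while carrying the run-defining flags forward.

    Before the fix, the flags were not passed through, so every rewrite
    fell back to the False defaults of build_artifact.
    """
    return build_artifact(
        list(existing.get("results", [])),
        # .get fallback keeps pre-flag (v0.34.x-era) artifacts valid.
        grounding=existing.get("grounding", False),
        retry_on_reject=existing.get("retry_on_reject", False),
    )


# Flags survive a rewrite when the source artifact has them...
flagged = {"grounding": True, "retry_on_reject": True, "results": []}
rewritten = rewrite_envelope(flagged)

# ...and default to False when the source predates the schema extension.
legacy = {"results": []}
rewritten_legacy = rewrite_envelope(legacy)
```

The two cases at the bottom correspond in spirit to the two new regression tests named above: one for preservation, one for the backward-compatible default.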
Phase C from the v0.34.5 blog post's "what's next" list lands: a cross-model check of the 8B format-discipline ceiling claim, with format-correction retry active. PR #57 (timeout fix) and PR #58 (retry infrastructure) are already on main from earlier in this session; this PR adds the new artifact + paper §6.3 Phase C writeup + version bump.
Highlights
Two load-bearing findings:

1. Cross-model ceiling is sharp. The single v0.34.5 "survivor" (`django-11001` baseline, unresolved) was gemma4-specific. deepseek-r1:8b could not produce a `git apply`-clean diff for it either. The 8B-class format-discipline ceiling is not a single-model artifact.

2. Retry-with-reason-code did not help. 0 retry-recovered patches across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative than the v0.34.5 paper predicted (we said retry "might recover 20–40% of sanitizer drops"; it recovered zero).
Model identity
Observations from the live logs
- `overlong_hunk` and `truncated_hunk` both appeared across different instances, confirming the diagnostic categorization is accurate.
- … `<think>` blocks.
- Two timeouts that appeared in v0.34.5 (astropy-12907, astropy-14995) are now clean completions in this artifact.

Changes
- `eval/results/swebench_phase2_deepseek_retry.json`: the new artifact (post_run_check=match, harness=ok, 0/30 evaluated)
- `article/paper5/sections/06-experiments.tex`: new `\subsubsection{Phase C: Cross-Model Check with Format-Correction Retry}` after the v0.34.5 grounded-rerun content; per-condition outcome table, comparison table, what-Phase-C-does-and-does-not-say paragraph
- `article/paper5/sections/07-conclusion.tex`: limitations item updated to "SWE-bench-lite ceiling at 8B, confirmed cross-model"
- `article/paper5/main.pdf`: rebuilt (114KB → 121KB, the Phase C content is the delta)
- … `__init__.py` / README badge)

Test plan
- `pytest tests/unit/test_swebench_phase2_identity.py -q`: 102 pass
- `post_run_check.status = match` on the new artifact
- `harness=ok` across all three conditions

What's next (not in this PR)
🤖 Generated with Claude Code