release: v0.35.0 — Phase C cross-model confirmation + retry null result#59

Merged
coredipper merged 3 commits into main from feat/phase2-phase-c-crossmodel
Apr 18, 2026

@coredipper
Owner

Phase C from the v0.34.5 blog post's "what's next" list lands: a cross-model check of the 8B format-discipline ceiling claim, with format-correction retry active. PR #57 (timeout fix) and PR #58 (retry infrastructure) are already on main from earlier in this session; this PR adds the new artifact, the paper §6.3 Phase C writeup, and the version bump.

Highlights

| | gemma4:latest (v0.34.5) | deepseek-r1:8b (v0.35.0) |
|---|---|---|
| Total evaluated | 1/30 | 0/30 |
| Sanitizer-rejected | 27 | 30 |
| Runtime errors | 2 | 0 |
| Resolved | 0 | 0 |
| Retry active | no | yes |
| Retry-recovered patches | | 0 |
| Mean latency (baseline, completed) | 131 s | 1084 s (18 min) |

Two load-bearing findings:

  1. Cross-model ceiling is sharp. The single v0.34.5 "survivor" (django-11001 baseline, unresolved) was gemma4-specific: deepseek-r1:8b could not produce a diff for it that applies cleanly under git apply. The 8B-class format-discipline ceiling is not a single-model artifact.

  2. Retry-with-reason-code did not help. 0 retry-recovered patches across 30 submissions. At 8B, diff-format failure is not mistake-that-can-be-corrected but cannot-produce-format-at-all. Retry helps occasional misformatters, not capability-ceiling models. Sharper negative than the v0.34.5 paper predicted (we said retry "might recover 20–40% of sanitizer drops"; it recovered zero).

Model identity

tag=deepseek-r1:8b
digest=6995872bfe4c
blob_sha256=e6a7edc1a4d7d9b2de136a221a57336b76316cfe53a252aeba814496c5ae439d
architecture=qwen3  (DeepSeek-R1 distilled onto Qwen3-8B)
parameters=8.2B
quantization=Q4_K_M

Observations from the live logs

  • Reason-code vocabulary fired in the wild: overlong_hunk and truncated_hunk both appeared across different instances, confirming that the diagnostic categories trigger on real model output.
  • Repo-dependent latency: astropy instances ~27 min/call, django instances ~10 min/call. Same model, same grounded context, same retry. Driven by issue complexity (stack traces + import paths in astropy vs simpler descriptions in django). Worth flagging for future cross-repo benchmarking.
  • Zero runtime errors: the 900s timeout from PR #57 (feat: raise LLM timeout to 900s for Phase 2 reasoning-model readiness) absorbs reasoning-model <think> blocks. Two timeouts that appeared in v0.34.5 (astropy-12907, astropy-14995) are now clean completions in this artifact.

Changes

  • eval/results/swebench_phase2_deepseek_retry.json — the new artifact (post_run_check=match, harness=ok, 0/30 evaluated)
  • article/paper5/sections/06-experiments.tex — new \subsubsection{Phase C: Cross-Model Check with Format-Correction Retry} after the v0.34.5 grounded-rerun content; per-condition outcome table, comparison table, what-Phase-C-does-and-does-not-say paragraph
  • article/paper5/sections/07-conclusion.tex — limitations item updated to "SWE-bench-lite ceiling at 8B, confirmed cross-model"
  • article/paper5/main.pdf — rebuilt (114KB → 121KB, the Phase C content is the delta)
  • Version bump 0.34.5 → 0.35.0 (pyproject.toml / __init__.py / README badge)

Test plan

  • Artifact schema test: pytest tests/unit/test_swebench_phase2_identity.py -q — 102 pass
  • post_run_check.status = match on the new artifact
  • harness=ok across all three conditions
  • Paper PDF rebuilt without new warnings
  • Merge + release v0.35.0 → PyPI

What's next (not in this PR)

  • 70B+ cloud-GPU rerun (Modal / Groq free-tier Llama-3.3-70B) to test whether retry recovers at larger scale
  • Blog follow-up post on Phase C
  • Site release notes for v0.35.0

🤖 Generated with Claude Code

coredipper and others added 3 commits April 18, 2026 20:52
Ships the Phase C experiment described in the v0.34.5 blog's "what's
next" list: a locally-runnable cross-model check of the 8B format-
discipline ceiling claim, with format-correction retry active.

Artifact: eval/results/swebench_phase2_deepseek_retry.json
- model: deepseek-r1:8b (DeepSeek-R1 distilled onto Qwen3-8B),
  digest 6995872bfe4c, 8.2B params, Q4_K_M, 131k context
- Same 10 SWE-bench-lite instances as v0.34.5, same grounded pipeline
- --retry-on-reject active (format-correction retry)
- post_run_check.status = match
- harness=ok across all three conditions
- Total wall-clock: ~22 hours (reasoning models are slow)

Outcome (all three conditions, 10 instances each):
  resolved=0, unresolved=0, sanitizer-rejected=10, runtime_error=0
  mean_latency: baseline 1084s, organism 1073s, langgraph 1130s
  retry-recovered patches: 0 across all 30 submissions

Two load-bearing findings for Paper 5 §6.3 Phase C:

1. Cross-model ceiling is sharp. Both gemma4 (v0.34.5) and
   deepseek-r1:8b (v0.35.0) produce zero resolved instances. Gemma4
   crossed the sanitizer once (django-11001 baseline -> unresolved);
   deepseek-r1 crossed it zero times. The single v0.34.5 "survivor"
   was gemma4-specific, not a structurally easy issue.

2. Retry-with-reason-code did not help. The retry mechanism fires
   correctly (we observed overlong_hunk and truncated_hunk reason
   codes in the live logs), retry prompts embed the reason + failed
   output verbatim with reason-specific guidance, but zero
   "retry recovered" events logged across 30 submissions. At 8B,
   diff-format failure is not mistake-that-can-be-corrected but
   cannot-produce-format-at-all. Retry helps occasional misformatters,
   not capability-ceiling models. Sharper negative result than the
   v0.34.5 paper predicted (we suggested retry "might recover 20-40%
   of sanitizer drops"); at 8B it recovers zero.

Paper 5 §6.3 adds new subsubsection "Phase C: Cross-Model Check with
Format-Correction Retry" with:
- Setup (deepseek-r1:8b identity details)
- Per-condition outcome table
- Comparison table vs gemma4 v0.34.5
- What Phase C does and does not say
- Next wedge: 70B+ via cloud GPU, outside local-only scope

Paper 5 §7 Limitations updated: "SWE-bench-lite ceiling at 8B,
confirmed cross-model".

Paper PDF rebuilt via tectonic (114KB -> 121KB; the Phase C content
is the delta).

Infrastructure shipped in this release (already on main from
PRs #57 and #58 and their follow-ups, #753 for #57 and #755 for #58):
- 900s LLM timeout + 60s probe timeout split (PR #57)
- sanitize_with_reason() returning (patch, reason_code) tuple with
  8 machine-readable reason codes (PR #58)
- _FORMAT_RETRY_MAX = 1 retry callback plumbing in all three runners
  (PR #58)
- EVAL_RUNTIME_ERROR status distinct from sanitizer-rejected
  empty_patch (carried over from v0.34.5)
- classify_prediction() as authoritative override (carried over)
- --retry-on-reject and --output CLI flags (PR #58)
- --output honored under --rewrite-envelope (PR #58's #755 fix)
- overlong-hunk classification handling of bare empty extra (PR #58's
  #755 fix)
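
A minimal sketch of how the tuple-returning sanitizer and the single-retry cap listed above could fit together. Only the `(patch, reason_code)` return shape of `sanitize_with_reason` and `_FORMAT_RETRY_MAX = 1` come from this commit; the toy sanitizer body, the `generate` callback signature, the retry prompt wording, and `submit_with_retry` are assumptions for illustration, not the code in eval/swebench_phase2.py.

```python
from typing import Callable, Tuple

_FORMAT_RETRY_MAX = 1  # from the commit: at most one format-correction retry


def sanitize_with_reason(raw: str) -> Tuple[str, str]:
    """Toy stand-in (assumption): return (patch, reason_code), "" on success."""
    if "diff --git" not in raw:
        return "", "empty_extraction"
    if "@@" not in raw:
        return "", "truncated_hunk"
    return raw, ""


def submit_with_retry(
    first_output: str, generate: Callable[[str], str]
) -> Tuple[str, str, bool, bool]:
    """Return (patch, reason, retry_attempted, retry_recovered)."""
    patch, reason = sanitize_with_reason(first_output)
    if not reason:
        return patch, "", False, False

    retry_attempted = False
    failed = first_output
    for _ in range(_FORMAT_RETRY_MAX):
        retry_attempted = True
        # Hypothetical prompt: embeds the reason code and the failed output verbatim.
        prompt = (
            f"Previous output was rejected ({reason}). "
            f"Failed output:\n{failed}\nReturn a unified diff only."
        )
        failed = generate(prompt)
        patch, reason = sanitize_with_reason(failed)
        if not reason:
            return patch, "", True, True
    return "", reason, retry_attempted, False
```

In this sketch a model that keeps answering in prose ends with `retry_attempted=True` and `retry_recovered=False`, which is the pattern this run reports for all 30 submissions.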

Versions bumped:
- pyproject.toml 0.34.5 -> 0.35.0
- operon_ai/__init__.py 0.34.5 -> 0.35.0
- README.md badge v0.34.5 -> v0.35.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEDIUM review #757: v0.35.0 Phase C paper claims "0 retry-recovered
across 30 submissions" and cites specific reason codes in §6.3, but
the committed artifact only stored `empty_patch` for every rejection
with no field showing whether grounding was enabled, whether retry
ran, which reason code fired, or whether the failure was a sanitizer
rejection versus empty extraction. The published evidence was weaker
than the conclusions.

Fix: extend the artifact schema and retrofit both v0.34.5 and v0.35
artifacts so the evidence catches up to the conclusions.

Schema additions (eval/swebench_phase2.py):

Top-level envelope:
- `grounding: bool` — mirrors --grounding
- `retry_on_reject: bool` — mirrors --retry-on-reject

Per-result dict:
- `sanitize_reason: str` — final-attempt reason code from
  sanitize_with_reason (one of SANITIZE_REASONS); "" on success
- `retry_attempted: bool` — whether the retry callback was invoked
- `retry_recovered: bool` — whether a retry produced a usable patch

Prediction dataclass: three new fields with safe defaults.
SanitizeOutcome: new NamedTuple returned by _sanitize_for_submission
and _try_format_retry, carrying (patch, reason, retry_attempted,
retry_recovered) so the runner can populate Prediction directly.
All three runners (run_baseline, run_organism, run_langgraph) updated
to unpack the outcome and pass the metadata through.

build_artifact: new grounding and retry_on_reject kwargs (both
default False so v0.34.x-era callers still produce a valid envelope
without plumbing them through).
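
The schema additions above might look roughly like the following. This is a sketch under assumptions: the three new per-result fields, the four `SanitizeOutcome` members, and the two `build_artifact` kwargs with `False` defaults are taken from this commit message, while `instance_id`, `model_patch`, the `results` key, and the function bodies are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, NamedTuple


class SanitizeOutcome(NamedTuple):
    patch: str
    reason: str             # one of the sanitizer reason codes; "" on success
    retry_attempted: bool
    retry_recovered: bool


@dataclass
class Prediction:
    instance_id: str        # hypothetical pre-existing field
    model_patch: str        # hypothetical pre-existing field
    # New fields, with defaults so older call sites keep working.
    sanitize_reason: str = ""
    retry_attempted: bool = False
    retry_recovered: bool = False


def build_artifact(
    results: List[Prediction],
    *,
    grounding: bool = False,
    retry_on_reject: bool = False,
) -> Dict[str, Any]:
    """Assumed envelope shape: run-defining flags at the top level."""
    return {
        "grounding": grounding,
        "retry_on_reject": retry_on_reject,
        "results": [asdict(p) for p in results],
    }


# A runner unpacks the outcome straight into the Prediction.
outcome = SanitizeOutcome("", "empty_extraction", True, False)
pred = Prediction(
    instance_id="example-0001",  # placeholder id, not a real instance
    model_patch=outcome.patch,
    sanitize_reason=outcome.reason,
    retry_attempted=outcome.retry_attempted,
    retry_recovered=outcome.retry_recovered,
)
artifact = build_artifact([pred], grounding=True, retry_on_reject=True)
```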

Retrofit:
- eval/results/swebench_phase2.json (v0.34.5 gemma4):
  grounding=True, retry_on_reject=False. Per-result sanitize_reason
  left empty (not recorded live); retry fields False.
- eval/results/swebench_phase2_deepseek_retry.json (v0.35 deepseek):
  grounding=True, retry_on_reject=True. Per-result sanitize_reason
  populated by parsing the live log; retry_attempted=True for every
  empty_patch row. Reason distribution: 26 empty_extraction,
  3 overlong_hunk, 1 truncated_hunk. retry_recovered=0/30.

Paper §6.3 Phase C updated with the reason distribution + sharper
framing: at 8B, 26/30 failures are "doesn't produce diff-shaped
output in the first place" (empty_extraction), only 4/30 are
"produces diff-shaped output with fixable errors". Retry-with-guidance
is calibrated for the second regime but the model sits mostly in the
first. The grounding-specific reasons (path_not_found, ambiguous_path,
placeholder_hunk) never fired, reinforcing the v0.34.5 claim that
file selection is not the bottleneck.
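
The 26/30 versus 4/30 split can be recomputed from the per-result fields. The sketch below assumes the rows are available as a flat list of dicts carrying `sanitize_reason` and `retry_recovered`; the real artifact layout may differ. The rows here are synthetic, rebuilt from the counts stated above, and `reason_distribution` and `regime_split` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Reason codes this commit treats as "diff-shaped output with fixable errors".
DIFF_SHAPED = {"overlong_hunk", "truncated_hunk"}


def reason_distribution(rows: List[Dict]) -> Counter:
    """Count final-attempt reason codes, skipping successes ("")."""
    return Counter(r["sanitize_reason"] for r in rows if r["sanitize_reason"])


def regime_split(dist: Counter) -> Tuple[int, int]:
    """Return (no diff-shaped output, diff-shaped but rejected)."""
    no_diff = dist.get("empty_extraction", 0)
    diff_shaped = sum(n for reason, n in dist.items() if reason in DIFF_SHAPED)
    return no_diff, diff_shaped


def make_rows(reason: str, count: int) -> List[Dict]:
    return [{"sanitize_reason": reason, "retry_recovered": False} for _ in range(count)]


# Synthetic rows matching the distribution reported in this commit.
rows = (
    make_rows("empty_extraction", 26)
    + make_rows("overlong_hunk", 3)
    + make_rows("truncated_hunk", 1)
)

dist = reason_distribution(rows)
no_diff, diff_shaped = regime_split(dist)
recovered = sum(r["retry_recovered"] for r in rows)
print(f"no diff: {no_diff}/30, diff-shaped: {diff_shaped}/30, recovered: {recovered}/30")
```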

Tests (4 new in test_swebench_phase2_schema_v035.py):
- Prediction has retry_attempted, retry_recovered, sanitize_reason
  fields with safe defaults
- build_artifact exposes grounding and retry_on_reject at the top
  level with backward-compat defaults
- Existing schema-shape test updated to include the two new
  top-level keys
- Existing retry-plumbing tests updated to unpack SanitizeOutcome

105/105 tests pass across sanitizer, retry, timeout, identity,
schema-v035 suites (+4 new since PR #59 opened).

PDF rebuilt via tectonic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e rewrite

MEDIUM review #758 flagged two issues:

1. `_rewrite_envelope()` did not thread the new `grounding` and
   `retry_on_reject` fields from the existing artifact into
   `build_artifact()`. Any envelope rewrite silently reset both
   run-defining flags to False, defeating the point of persisting them.

2. The committed deepseek artifact showed `grounding=false` and
   `retry_on_reject=false` at the top level despite the paper and
   commit message describing it as the grounded retry run. Every
   per-result row had `retry_attempted=true`, making the artifact
   internally inconsistent. Root cause: an earlier
   `--rewrite-envelope` call (predating the schema extension)
   dropped the flags.
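
The inconsistency in item 2 is the kind of thing a small invariant check can catch before an artifact is committed. The following is a sketch only: `flag_inconsistencies` is a hypothetical helper, not part of this PR, and it assumes a flat `results` list under the top-level envelope.

```python
from typing import Any, Dict, List


def flag_inconsistencies(artifact: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means consistent."""
    problems = []
    rows = artifact.get("results", [])

    # Rows cannot have retried if the run-level flag says retry was off.
    if any(r.get("retry_attempted") for r in rows) and not artifact.get(
        "retry_on_reject", False
    ):
        problems.append("rows report retry_attempted but retry_on_reject is false")

    # A recovery without an attempt is impossible by construction.
    if any(r.get("retry_recovered") and not r.get("retry_attempted") for r in rows):
        problems.append("a row reports retry_recovered without retry_attempted")

    return problems


# The state described above: flags reset to false, rows still marked as retried.
broken = {
    "grounding": False,
    "retry_on_reject": False,
    "results": [{"retry_attempted": True, "retry_recovered": False}],
}
fixed = dict(broken, grounding=True, retry_on_reject=True)
```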

Fix:

- `_rewrite_envelope` now passes
  `existing.get("grounding", False)` and
  `existing.get("retry_on_reject", False)` to `build_artifact()`.
  Pre-flag artifacts (v0.34.x-era) still default to False via the
  `.get` fallback, so backward compat is preserved.

- Regenerated the deepseek artifact top-level flags to
  `grounding=true, retry_on_reject=true` and ran
  `--rewrite-envelope` through the fixed code path to confirm the
  flags survive (they do, verified via post-rewrite inspection).
  post_run_check.status=match.

Two new tests:
  * `test_rewrite_envelope_preserves_grounding_and_retry_flags` —
    locks the preservation invariant against regression.
  * `test_rewrite_envelope_defaults_flags_when_source_lacks_them` —
    confirms pre-flag artifacts (no key present) still produce a
    valid rewrite with False defaults (backward compat).
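
The preservation invariant the two tests lock in can be sketched as follows. The `.get(..., False)` calls mirror the fix described above; `rewrite_envelope_sketch`, the simplified `build_artifact`, and the `results` key are stand-ins for illustration rather than the repository's actual functions.

```python
from typing import Any, Dict, List


def build_artifact(
    results: List[Dict[str, Any]],
    *,
    grounding: bool = False,
    retry_on_reject: bool = False,
) -> Dict[str, Any]:
    """Simplified stand-in for the real envelope builder."""
    return {
        "grounding": grounding,
        "retry_on_reject": retry_on_reject,
        "results": results,
    }


def rewrite_envelope_sketch(existing: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the envelope while threading the run-defining flags through."""
    return build_artifact(
        existing.get("results", []),
        # Pre-flag (v0.34.x-era) artifacts lack these keys and fall back to False.
        grounding=existing.get("grounding", False),
        retry_on_reject=existing.get("retry_on_reject", False),
    )


flagged = {"grounding": True, "retry_on_reject": True, "results": []}
legacy = {"results": []}  # no flag keys at all

rewritten = rewrite_envelope_sketch(flagged)
rewritten_legacy = rewrite_envelope_sketch(legacy)
```

Reading the flags from the existing artifact rather than from CLI arguments keeps a rewrite idempotent: running it twice yields the same envelope.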

107/107 tests pass (+2 new since #757 landed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coredipper coredipper merged commit 0238a6c into main Apr 18, 2026
4 checks passed
@coredipper coredipper deleted the feat/phase2-phase-c-crossmodel branch April 22, 2026 14:49