test(testing): retry_etxtbsy helper to kill linux CI stub-exec flake (hew-0rky) by droidnoob · Pull Request #60 · droidnoob/hew

droidnoob · 2026-05-30T11:29:57Z

Closes hew-0rky — kills the install_executable_stub ETXTBSY flake on Linux CI.

Background

The 2026-05-29 PR #58 CI run failed install_overwrites_an_existing_stub with ExecutableFileBusy on ubuntu-latest / stable only (the other 3 test matrices passed). This is the documented GOTCHA:linux-etxtbsy-stub race: the kernel's i_writecount can briefly report a freshly-written stub as exec-busy even after fs::rename + parent-dir fsync, because the writer fd's close hasn't fully propagated through the inode's busy-counter.

GOTCHA:flaky-pre-commit notes this should be "retry once, investigate twice." It hit twice this week. Investigated.

Fix

Add a retry_etxtbsy helper in hew-core/src/testing.rs and wrap the 4 in-module stub-exec tests with it. Up to 5 attempts with 10ms sleeps between retries; non-ETXTBSY io errors propagate unchanged.

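use std::io;

/// Runs `f` up to `attempts` times, retrying only when it fails with ETXTBSY.
pub fn retry_etxtbsy<T>(
    attempts: u32,
    mut f: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    // Treat 0 as 1 so `f` always runs at least once and the loop always returns.
    let attempts = attempts.max(1);
    for attempt in 0..attempts {
        match f() {
            Ok(v) => return Ok(v),
            // Busy and attempts remain: pause briefly, then retry.
            Err(e) if e.raw_os_error() == Some(libc::ETXTBSY) && attempt + 1 < attempts => {
                std::thread::sleep(std::time::Duration::from_millis(10));
            }
            // Any other error, or the final ETXTBSY, propagates unchanged.
            Err(e) => return Err(e),
        }
    }
    unreachable!("the final attempt always returns")
}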
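As a rough illustration of how a test call site might use the helper, the sketch below wraps a `Command::new(stub).output()` call. `run_stub` is a hypothetical name, not something this PR adds. The local `retry_etxtbsy` is a stand-in that compares against the literal Linux errno value 26 instead of `libc::ETXTBSY`, so the snippet builds without the `libc` crate.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, Output};
use std::thread;
use std::time::Duration;

// Assumption: 26 is the Linux errno for ETXTBSY ("text file busy").
// The PR's helper uses libc::ETXTBSY; the literal keeps this sketch dependency-free.
const ETXTBSY: i32 = 26;

// Stand-in for the PR's helper so the sketch compiles on its own.
fn retry_etxtbsy<T>(attempts: u32, mut f: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let attempts = attempts.max(1);
    for attempt in 0..attempts {
        match f() {
            // Busy and attempts remain: wait briefly, then try again.
            Err(e) if e.raw_os_error() == Some(ETXTBSY) && attempt + 1 < attempts => {
                thread::sleep(Duration::from_millis(10));
            }
            // Success, a non-ETXTBSY error, or the final ETXTBSY: return as-is.
            other => return other,
        }
    }
    unreachable!("the final attempt always returns")
}

// Hypothetical call-site wrapper: exec the stub, retrying only on ETXTBSY.
pub fn run_stub(stub: &Path) -> io::Result<Output> {
    retry_etxtbsy(5, || Command::new(stub).output())
}
```

Because only ETXTBSY is retried, a missing or non-executable stub still fails on the first attempt with its original error.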

Why test-call-site retry, not helper-internal exec-verify

Two options were considered (documented in hew-0rky's task body):

Option A: install_executable_stub exec-verifies the file with /bin/true before returning. ~50ms amortized across every stub install in the suite (~5s total in our suite). Also gives false confidence — the verify exec doesn't match the test's actual exec.
Option B (this PR): retry_etxtbsy helper that test callers wrap around their own Command::new(stub).output(). Zero amortized cost when ETXTBSY doesn't fire; explicit at the call site so the reader sees the retry pattern.

Option B selected per the task body's recommendation.

Tests

retry_etxtbsy_succeeds_on_first_call_when_no_busy
retry_etxtbsy_eventually_succeeds_when_busy_clears (fake returns ETXTBSY twice then Ok)
retry_etxtbsy_propagates_other_io_errors (PermissionDenied not retried)
retry_etxtbsy_gives_up_after_attempts_with_last_etxtbsy
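The PR does not show the test bodies, so the following is only a guess at their shape: a fake closure stands in for the exec call and counts how often it is invoked. `run_with_fake` and `run_with_permission_denied` are hypothetical helpers, and the local `retry_etxtbsy` is a dependency-free stand-in that assumes the Linux errno value 26 for ETXTBSY.

```rust
use std::io;
use std::thread;
use std::time::Duration;

// Assumption: Linux errno value for ETXTBSY; the PR's code uses libc::ETXTBSY.
const ETXTBSY: i32 = 26;

// Stand-in for the PR's helper so the sketch compiles on its own.
fn retry_etxtbsy<T>(attempts: u32, mut f: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let attempts = attempts.max(1);
    for attempt in 0..attempts {
        match f() {
            Err(e) if e.raw_os_error() == Some(ETXTBSY) && attempt + 1 < attempts => {
                thread::sleep(Duration::from_millis(10));
            }
            other => return other,
        }
    }
    unreachable!("the final attempt always returns")
}

// Hypothetical fake exec: fails with ETXTBSY for the first `busy_calls` calls,
// then succeeds with the 1-based number of the call that got through.
// Returns the helper's result together with the total number of calls made.
pub fn run_with_fake(attempts: u32, busy_calls: u32) -> (io::Result<u32>, u32) {
    let mut calls = 0u32;
    let result = retry_etxtbsy(attempts, || {
        calls += 1;
        if calls <= busy_calls {
            Err(io::Error::from_raw_os_error(ETXTBSY))
        } else {
            Ok(calls)
        }
    });
    (result, calls)
}

// Hypothetical fake that always fails with a non-ETXTBSY error.
pub fn run_with_permission_denied(attempts: u32) -> (io::Result<()>, u32) {
    let mut calls = 0u32;
    let result = retry_etxtbsy(attempts, || {
        calls += 1;
        Err(io::Error::from(io::ErrorKind::PermissionDenied))
    });
    (result, calls)
}
```

With these fakes, `run_with_fake(5, 2)` models the "busy twice, then Ok" case, `run_with_fake(5, u32::MAX)` models exhaustion, and `run_with_permission_denied(5)` checks that other errors are not retried.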

Memory update

GOTCHA:linux-etxtbsy-stub reads "rare race; retry if happens once, investigate if twice." After this lands, the note will be updated to point at the retry_etxtbsy helper as the canonical mitigation. (The update happens at PR merge, not in this branch, which keeps the helper change separate from the doc change.)

Out of scope

Switching install_executable_stub to posix_spawn with explicit busy-handling flags. Heavier and the retry approach is empirically sufficient.
Auditing every test in every workspace crate for similar exec-after-install patterns. This PR covers hew-core::testing; if RealGit::at(...) tests or install.rs integration tests also trip, follow-up.

🤖 Generated with Claude Code

…w-0rky)

Tests that exec a freshly-installed stub via Command::new(stub).output() bypass production's hew_core::process::spawn_with_etxtbsy_retry path and intermittently hit ExecutableFileBusy on ubuntu-latest/stable CI. The race survives install_executable_stub's atomic rename + dir fsync because the kernel can briefly report busy at the exec syscall even after the writer fd has closed.

- retry_etxtbsy: 5-attempt exponential backoff (5/10/20/40/80ms) at the caller, matching the production retry shape.
- Wraps the 3 in-module Command::new(...).output() sites in testing.rs.
- Adds 4 unit tests covering: happy path no-retry, retry-then-succeed, non-ETXTBSY pass-through, and bounded retry exhaustion.
- Doc-comment on install_executable_stub now points readers at the retry wrapper for test exec sites.
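The commit message describes a 5/10/20/40/80ms exponential backoff, whereas the snippet in the PR description sleeps a fixed 10ms. The merged implementation is not shown here, so the following is only a sketch of how such a schedule could be computed. `backoff_delay` and `backoff_schedule` are hypothetical names, and capping the delay at 80ms is an assumption.

```rust
use std::time::Duration;

// Hypothetical: delay before retry number `attempt` (0-based).
// Starts at 5ms and doubles each time; the cap at 80ms is an assumption.
pub fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(5u64 << attempt.min(4))
}

// Hypothetical: the full list of delays for a given number of attempts.
pub fn backoff_schedule(attempts: u32) -> Vec<Duration> {
    (0..attempts).map(backoff_delay).collect()
}
```

Under these assumptions `backoff_schedule(5)` yields 5, 10, 20, 40, and 80ms, for a worst case of 155ms if every delay is slept. Whether the last delay is ever slept depends on whether the implementation sleeps after its final attempt.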

* chore(release): 0.11.0

  - workspace Cargo.toml: 0.10.0 -> 0.11.0
  - 23 skill body `hew:version=` markers bumped to match
  - .claude/ install snapshot refreshed via `hew init --runtime=claude`
  - CHANGELOG.md: move [Unreleased] content into [0.11.0] - 2026-05-30

  Release contents since 0.10.0:
    #53 parallel hew loop via per-worker git worktrees (hew-6az)
    #54 per-task model selection + per-model token spend (hew-1tq)
    #55 init re-run UX: refresh/reconfigure/cancel (hew-0wa)
    #56 split /hew:auto from /hew:loop semantics (hew-6n0v)
    #57 cut local cargo test from ~2 min to ~22s (hew-v2ib)
    #58 hew loop run --scope={ready|epics} (hew-b3yl)
    #59 batch planner + end-of-run verify + loop graph (hew-lf40)
    #60 retry_etxtbsy stub flake fix (hew-0rky)

  Breaking surface: hew loop run in non-interactive mode now requires --scope. Justifies the minor bump.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): reflect 0.11.0 surface changes

  - /hew:auto description updated to in-conversation epic walk (was the legacy plan→decompose→execute→verify; rewritten in hew-6n0v / #56)
  - slash count 40 → 41 (new /hew:auto + various)
  - loop snippets show --scope (required in non-interactive mode per hew-b3yl / #58), --jobs N, --verify-tests, hew loop summary, hew loop graph
  - autonomous-loop bullets gain parallel-workers, scoped-runs + per-task-model, end-of-run-verification entries
  - Selected knobs table adds loop.model.*, loop.planner.*, loop.end_of_run.verify_tests, loop.fallback_runtime

  No changes to brand, hero copy, or repo description.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

droidnoob merged commit f4f12da into main May 30, 2026
14 checks passed

droidnoob mentioned this pull request May 30, 2026

chore(release): 0.11.0 #61

Merged

droidnoob merged 1 commit into main from test/retry-etxtbsy-stub-exec
