test(testing): retry_etxtbsy helper to kill linux CI stub-exec flake (hew-0rky) #60

Merged: droidnoob merged 1 commit into main from test/retry-etxtbsy-stub-exec on May 30, 2026
@droidnoob
Owner

Closes hew-0rky — kills the install_executable_stub ETXTBSY flake on Linux CI.

Background

The 2026-05-29 CI run for PR #58 failed install_overwrites_an_existing_stub with ExecutableFileBusy on ubuntu-latest / stable only; the other 3 matrix jobs passed. This is the documented GOTCHA:linux-etxtbsy-stub race: the kernel's i_writecount can briefly report a freshly written stub as exec-busy even after fs::rename + parent-dir fsync, because the close of the writer fd hasn't yet been reflected in the inode's busy counter.

GOTCHA:flaky-pre-commit notes this should be "retry once, investigate twice." It hit twice this week. Investigated.

Fix

Add a retry_etxtbsy helper in hew-core/src/testing.rs and wrap the 4 in-module stub-exec tests with it. Up to 5 attempts with 10ms sleeps between retries; non-ETXTBSY io errors propagate unchanged.

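use std::io;

/// Calls `f` up to `attempts` times, retrying only when it fails with ETXTBSY.
/// Any other io error, or ETXTBSY on the final attempt, is returned unchanged.
pub fn retry_etxtbsy<T>(
    attempts: u32,
    mut f: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    for attempt in 0..attempts {
        match f() {
            Ok(v) => return Ok(v),
            // Busy and attempts remain: sleep briefly, then try again.
            Err(e) if e.raw_os_error() == Some(libc::ETXTBSY) && attempt + 1 < attempts => {
                std::thread::sleep(std::time::Duration::from_millis(10));
            }
            Err(e) => return Err(e),
        }
    }
    // Reached only when `attempts == 0`; callers must pass at least 1.
    unreachable!()
}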

Why test-call-site retry, not helper-internal exec-verify

Two options were considered (documented in hew-0rky's task body):

  • Option A: install_executable_stub exec-verifies the file with /bin/true before returning. Costs ~50ms on every stub install in the suite (~5s total in our suite). Also gives false confidence: the verify exec is not the exec the test actually runs.
  • Option B (this PR): retry_etxtbsy helper that test callers wrap around their own Command::new(stub).output(). Zero added cost when ETXTBSY doesn't fire; explicit at the call site so the reader sees the retry pattern.

Option B selected per the task body's recommendation.
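
For illustration, a minimal sketch of what an Option B call site could look like. The `run_stub` wrapper and the stub path are hypothetical, the ETXTBSY constant is hard-coded to the Linux errno value instead of coming from `libc`, and the helper is a stand-in written so that it always calls the closure at least once; it is not the code merged in this PR.

```rust
use std::io;
use std::process::{Command, Output};
use std::thread;
use std::time::Duration;

// Assumption: errno 26 is ETXTBSY on Linux; real code would use libc::ETXTBSY.
const ETXTBSY: i32 = 26;

// Stand-in for the PR's helper, written so it always calls `f` at least once.
pub fn retry_etxtbsy<T>(attempts: u32, mut f: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let mut attempt = 0;
    loop {
        match f() {
            // Busy with attempts left: wait briefly and try again.
            Err(e) if e.raw_os_error() == Some(ETXTBSY) && attempt + 1 < attempts => {
                attempt += 1;
                thread::sleep(Duration::from_millis(10));
            }
            // Success, a non-ETXTBSY error, or the final ETXTBSY: return as-is.
            other => return other,
        }
    }
}

// Hypothetical call site: the test execs the stub itself, inside the retry.
pub fn run_stub(stub: &str) -> io::Result<Output> {
    retry_etxtbsy(5, || Command::new(stub).output())
}
```

Because the closure performs the test's own exec, the retry covers exactly the syscall that can fail, and a missing or non-executable stub still surfaces immediately as its original io error.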

Tests

  • retry_etxtbsy_succeeds_on_first_call_when_no_busy
  • retry_etxtbsy_eventually_succeeds_when_busy_clears (fake returns ETXTBSY twice then Ok)
  • retry_etxtbsy_propagates_other_io_errors (PermissionDenied not retried)
  • retry_etxtbsy_gives_up_after_attempts_with_last_etxtbsy
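
The four cases above can be exercised with a counting fake. The sketch below is an assumption about how such a fake might look, not the test code from this PR: `run_fake`, `busy`, and `denied` are hypothetical names, ETXTBSY is hard-coded to the Linux errno value, and the helper is a compact stand-in for the one in testing.rs.

```rust
use std::io;
use std::thread;
use std::time::Duration;

// Assumption: errno 26 is ETXTBSY on Linux; real code would use libc::ETXTBSY.
const ETXTBSY: i32 = 26;

// Compact stand-in for the helper under test.
pub fn retry_etxtbsy<T>(attempts: u32, mut f: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let mut attempt = 0;
    loop {
        match f() {
            Err(e) if e.raw_os_error() == Some(ETXTBSY) && attempt + 1 < attempts => {
                attempt += 1;
                thread::sleep(Duration::from_millis(10));
            }
            other => return other,
        }
    }
}

// Error constructors for the fake (hypothetical names).
pub fn busy() -> io::Error {
    io::Error::from_raw_os_error(ETXTBSY)
}

pub fn denied() -> io::Error {
    io::Error::from(io::ErrorKind::PermissionDenied)
}

// Fake operation: fails with `err()` for the first `failures` calls, then succeeds.
// Returns the retry result together with how many times the fake was invoked.
pub fn run_fake(failures: u32, err: fn() -> io::Error) -> (io::Result<&'static str>, u32) {
    let mut calls = 0u32;
    let result = retry_etxtbsy(5, || {
        calls += 1;
        if calls <= failures {
            Err(err())
        } else {
            Ok("stub ran")
        }
    });
    (result, calls)
}
```

Counting calls alongside the result lets each test assert both the outcome and that the helper retried exactly as often as expected, for example one call for PermissionDenied and five for a fake that stays busy.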

Memory update

GOTCHA:linux-etxtbsy-stub reads "rare race; retry if happens once, investigate if twice." After this lands, the note will be updated to point at the retry_etxtbsy helper as the canonical mitigation. (The update happens at PR merge, not in this branch, which keeps the helper change separate from the doc change.)

Out of scope

  • Switching install_executable_stub to posix_spawn with explicit busy-handling flags. Heavier and the retry approach is empirically sufficient.
  • Auditing every test in every workspace crate for similar exec-after-install patterns. This PR covers hew-core::testing; if RealGit::at(...) tests or install.rs integration tests also trip, they get a follow-up.

🤖 Generated with Claude Code

…w-0rky)

Tests that exec a freshly-installed stub via Command::new(stub).output()
bypass production's hew_core::process::spawn_with_etxtbsy_retry path and
intermittently hit ExecutableFileBusy on ubuntu-latest/stable CI. The
race survives install_executable_stub's atomic rename + dir fsync
because the kernel can briefly report busy at the exec syscall even
after the writer fd has closed.

- retry_etxtbsy: 5-attempt exponential backoff (5/10/20/40/80ms) at the
  caller, matching the production retry shape.
- Wraps the 3 in-module Command::new(...).output() sites in testing.rs.
- Adds 4 unit tests covering: happy path no-retry, retry-then-succeed,
  non-ETXTBSY pass-through, and bounded retry exhaustion.
- Doc-comment on install_executable_stub now points readers at the
  retry wrapper for test exec sites.
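
The commit message describes a doubling schedule (5/10/20/40/80ms), whereas the helper quoted in the PR description sleeps a fixed 10ms. As a rough sketch of the doubling shape only: `backoff_delay` and `backoff_schedule` are hypothetical names, and whether the merged helper sleeps after its final attempt is not stated here.

```rust
use std::time::Duration;

// Hypothetical schedule: 5ms base, doubling on each retry (5, 10, 20, 40, 80, ...).
pub fn backoff_delay(attempt: u32) -> Duration {
    // Cap the shift so very large attempt numbers cannot overflow the u64.
    let shift = attempt.min(16);
    Duration::from_millis(5u64 << shift)
}

// The delays for a run of `attempts` tries, one entry per attempt index.
pub fn backoff_schedule(attempts: u32) -> Vec<Duration> {
    (0..attempts).map(backoff_delay).collect()
}
```

With five entries the delays sum to 155ms, so even a worst-case exhausted retry stays well under a second per exec site.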
@droidnoob droidnoob merged commit f4f12da into main May 30, 2026
14 checks passed
@droidnoob droidnoob mentioned this pull request May 30, 2026
droidnoob added a commit that referenced this pull request May 30, 2026
* chore(release): 0.11.0

- workspace Cargo.toml: 0.10.0 -> 0.11.0
- 23 skill body `hew:version=` markers bumped to match
- .claude/ install snapshot refreshed via `hew init --runtime=claude`
- CHANGELOG.md: move [Unreleased] content into [0.11.0] — 2026-05-30

Release contents since 0.10.0:

#53 parallel hew loop via per-worker git worktrees (hew-6az)
#54 per-task model selection + per-model token spend (hew-1tq)
#55 init re-run UX — refresh/reconfigure/cancel (hew-0wa)
#56 split /hew:auto from /hew:loop semantics (hew-6n0v)
#57 cut local cargo test from ~2 min to ~22s (hew-v2ib)
#58 hew loop run --scope={ready|epics} (hew-b3yl)
#59 batch planner + end-of-run verify + loop graph (hew-lf40)
#60 retry_etxtbsy stub flake fix (hew-0rky)

Breaking surface: hew loop run in non-interactive mode now requires
--scope. Justifies the minor bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): reflect 0.11.0 surface changes

- /hew:auto description updated to in-conversation epic walk (was the
  legacy plan→decompose→execute→verify; rewritten in hew-6n0v / #56)
- slash count 40 → 41 (new /hew:auto + various)
- loop snippets show --scope (required in non-interactive mode per
  hew-b3yl / #58), --jobs N, --verify-tests, hew loop summary,
  hew loop graph
- autonomous-loop bullets gain parallel-workers, scoped-runs +
  per-task-model, end-of-run-verification entries
- Selected knobs table adds loop.model.*, loop.planner.*,
  loop.end_of_run.verify_tests, loop.fallback_runtime

No changes to brand, hero copy, or repo description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>