test(harness): wait for an HTTP response, not a bare TCP connect #346

Merged
vieiralucas merged 1 commit into main from worktree-cleanup-flaky-ci-server-ready
Apr 13, 2026

@vieiralucas vieiralucas commented Apr 13, 2026

Summary

Both test harnesses (`crates/fakecloud-e2e/tests/helpers/mod.rs` and `crates/fakecloud-conformance/tests/helpers/mod.rs`) spawn fakecloud as a child process and gate `TestServer::start` on a `wait_for_port` loop that only checks `TcpStream::connect`. A successful TCP connect proves that fakecloud's listener socket is bound and listening and that the kernel placed the connection in its accept queue, but not that axum has reached `serve().await` and installed the request handlers. Tests that issued a real HTTP call immediately after the TCP probe occasionally raced that window and saw `ConnectionRefused` / early EOF. Prior sessions wrote these failures off as flakes across RDS, IAM OIDC, Bedrock, KMS, and SES and just re-ran the workflow instead of investigating.

Fix: two-stage readiness probe.

  1. Wait for `TcpStream::connect` to succeed (SYN accepted into the listen queue).
  2. Issue a reqwest `GET /` and require any HTTP response — including 404s from unmatched routes, which is what fakecloud's root path returns. That's the guarantee the AWS SDK clients actually need.

`reqwest` is already a dev-dependency in both crates, so this is zero new cost.
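
The difference between the two stages can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the names `tcp_ready` and `http_ready` are hypothetical, and the HTTP stage hand-writes an HTTP/1.1 request over a `TcpStream` instead of using `reqwest`, purely to keep the sketch dependency-free.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Stage 1 (hypothetical helper): does the kernel accept a TCP connection?
/// This succeeds as soon as the listener is bound and listening, even if
/// nothing in the server process is calling accept() yet.
fn tcp_ready(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

/// Stage 2 (hypothetical helper): does the server answer an HTTP request?
/// Any status line counts as ready, including a 404 for an unmatched route.
/// Assumption: hand-rolled HTTP/1.1 here; the PR itself uses reqwest.
fn http_ready(addr: SocketAddr, timeout: Duration) -> bool {
    let Ok(mut stream) = TcpStream::connect_timeout(&addr, timeout) else {
        return false;
    };
    // Bound both directions so a stalled server cannot hang the probe.
    if stream.set_read_timeout(Some(timeout)).is_err()
        || stream.set_write_timeout(Some(timeout)).is_err()
    {
        return false;
    }
    let request = format!("GET / HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    if stream.write_all(request.as_bytes()).is_err() {
        return false;
    }
    // Every HTTP/1.x response starts with "HTTP/"; the status code is ignored.
    let mut prefix = [0u8; 5];
    stream.read_exact(&mut prefix).is_ok() && &prefix == b"HTTP/"
}
```

Against a listener that is bound but never accepts, `tcp_ready` returns `true` while `http_ready` returns `false` once the read timeout elapses. That gap is the window the old single-stage probe could not see.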

Test plan

  • Both harnesses build with `cargo build -p fakecloud-e2e --tests` / `-p fakecloud-conformance --tests`.
  • CI will exercise the new probe on every check; prior-session flakes across RDS/IAM/Bedrock/KMS/SES should stop recurring under this change.

Summary by cubic

Make the e2e and conformance test harnesses wait for a real HTTP response before running tests. This ensures axum is serving and eliminates ConnectionRefused/early-EOF flakes in CI.

  • Bug Fixes
    • Replaced bare TCP probe with two-stage readiness: TCP connect, then reqwest GET to http://127.0.0.1:{port}/; any HTTP status (including 404) counts as ready.
    • Added 500ms per-request timeout; kept 100ms retry loop (~30s total) and early exit if the child process exits.
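
The retry behavior described above (fixed polling interval, overall deadline, early exit when the child dies) could look roughly like the sketch below. The function name `wait_until_ready`, the `WaitOutcome` enum, and the closure-based shape are assumptions made for illustration; the harness code itself is not shown in this PR description.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical result type for the readiness loop.
#[derive(Debug, PartialEq)]
enum WaitOutcome {
    Ready,
    ChildExited,
    TimedOut,
}

/// Polls `probe` every `interval` until it returns true, the child is
/// reported as exited, or `total` has elapsed.
///
/// `child_exited` is a closure so the loop stays testable; with a real
/// `std::process::Child` it could be
/// `|| matches!(child.try_wait(), Ok(Some(_)))`.
fn wait_until_ready(
    total: Duration,
    interval: Duration,
    mut child_exited: impl FnMut() -> bool,
    mut probe: impl FnMut() -> bool,
) -> WaitOutcome {
    let deadline = Instant::now() + total;
    while Instant::now() < deadline {
        // Stop early: a dead server will never become ready.
        if child_exited() {
            return WaitOutcome::ChildExited;
        }
        if probe() {
            return WaitOutcome::Ready;
        }
        sleep(interval);
    }
    WaitOutcome::TimedOut
}
```

With the values from this summary, the call would pass `Duration::from_secs(30)` as `total` and `Duration::from_millis(100)` as `interval`, with the 500ms per-request timeout applied inside the probe closure.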

Written for commit 6c41896. Summary will update on new commits.

The e2e and conformance test harnesses both spawn fakecloud as a child
process and then block on a 30s wait_for_port loop that considers the
server ready as soon as TcpStream::connect succeeds. A successful TCP
connect only proves fakecloud's listener socket is in the kernel accept
queue — it does not prove axum has reached serve().await and installed
the request handlers. Tests that issued a real HTTP request immediately
after the TCP probe occasionally raced that window and saw
ConnectionRefused / early EOF across unrelated services (RDS, IAM OIDC,
Bedrock, KMS, SES).

Upgrade wait_for_port to a two-stage probe: first wait for TCP to
connect, then issue a reqwest GET to http://127.0.0.1:{port}/. Any HTTP
response — including 404 from an unmatched route — proves axum is
actually serving, which is the guarantee the tests need.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 2 files

@vieiralucas vieiralucas merged commit d2576bc into main Apr 13, 2026
22 checks passed
@vieiralucas vieiralucas deleted the worktree-cleanup-flaky-ci-server-ready branch April 13, 2026 12:45