fix(http): bound read and connect timeouts on the runtime HTTP client#1071
Merged
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3863a63ae
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Aspect CI Status ·
|
| Task | Job | Result | |
|---|---|---|---|
| ✅ passed | build |
build-task |
153 targets built |
| ✅ passed | build |
build-task |
153 targets built |
| ✅ passed | test |
test-task |
all 25 tests cached |
| ✅ passed | test |
test-task |
all 25 tests cached |
`ctx.http().*` calls had no timeout configured on the underlying reqwest `Client`, so a hung peer (slow GitHub, broken proxy, dropped connection without TCP RST) could pin a single request indefinitely. AXL feature handlers run synchronously, so one hung request wedges the whole task. - DEFAULT_HTTP_READ_TIMEOUT = 120s max gap between reads - DEFAULT_HTTP_CONNECT_TIMEOUT = 15s TCP/TLS handshake ceiling `read_timeout` resets every time bytes arrive, so slow-but-progressing transfers (large artifact uploads on a 1 Mbps CI link) are fine — only stalls trip the limit. We deliberately do NOT set a total `timeout`: that would cap every transfer regardless of progress and regress legitimate long uploads/downloads on slow links. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d3863a6 to
b7c65cc
Compare
thesayyn
reviewed
May 12, 2026
Member
thesayyn
left a comment
There was a problem hiding this comment.
this is nice default, but we probably should have knobs in http methods to allow configuring this.
gregmagolan
added a commit
that referenced
this pull request
May 12, 2026
…prefactor The HTTP read/connect-timeout change was pre-factored out of this branch and landed separately in #1071. The split left a merge artifact in crates/axl-runtime/src/engine/http.rs: the `impl StarlarkValue for Http` block and `starlark_simple_value!(Http)` macro were dropped, leaving a dangling `#[starlark_value(type = "Http")]` attribute and breaking the build (Http no longer implements StarlarkValue, so it can't be exposed via `StarlarkValueAsType` in engine/mod.rs). Restore the impl block + macro exactly as they were on main prior to the prefactor. Pure revert of the accidental deletion; no behavioural change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 13, 2026
gregmagolan
added a commit
that referenced
this pull request
May 13, 2026
…P failures (#1078) ## Summary A transient transport failure on a GitHub Actions Twirp `DeleteArtifact` call propagated out of the GHSC publisher and failed an in-flight `lint` task. Observability features must never fail the underlying task. `lib/github.axl` now applies one consistent HTTP failure-handling contract to every helper: - **Transport errors** (DNS / TLS / refused / reset / timeout) are caught via `.map_err(lambda e: str(e)).block()` and surfaced as `(False, 0, "transport error: ...")`. The exception that aborted the lint task can't happen. - **Transient failures** (transport, 5xx, 429) are retried up to 3 times with exponential backoff (250 → 500 → 1000 ms). 4xx is deterministic and not retried. - **Retry is gated on idempotency**: GET / PUT / PATCH / DELETE retry, POST runs exactly once. Retrying POST after the server accepted the first request would duplicate check-runs / PR comments / reviews / review comments — strictly worse than the original error. - For the GH Actions Twirp ArtifactService, retry is gated on a per-method allowlist (`_TWIRP_IDEMPOTENT_METHODS = ["DeleteArtifact"]`). `CreateArtifact` and `FinalizeArtifact` run once each. - **Terminal failures emit a one-line `WARNING:`** with method, URL, attempts count (when retry applied), and status/transport-error detail. Before this change, the only signal was an opaque AXL parse trace pointing at the call site with no underlying cause. Also fixed two adjacent paths surfaced while auditing every `ctx.http()` call site that feeds into a feature handler: `lib/circleci.axl::_request` (CircleCI runner API) and `lib/artifacts.axl::_upload_file_gitlab` (GitLab generic-packages PUT) — both now catch transport errors so a flaky CI API can't fail a task that just happens to be uploading artifacts, and both emit the same one-line WARNING on terminal failure. `.map_err` is the pre-existing `StarlarkFuture` transform (`axl-runtime/src/engine/async/future.rs:221`); no Rust changes. ### Builds on #1071 #1071 (already landed) put a hard upper bound on how long a single HTTP call can hang. This PR layers retry + graceful-failure on top: timeouts bound the worst case, retry covers the common case, graceful failure means observability features never abort tasks. --- ### Changes are visible to end-users: yes - Searched for relevant documentation and updated as needed: no (in-code only) - Breaking change (forces users to change their own code or config): no - Suggested release notes appear below: yes #### Suggested release notes - Transient GitHub / CI API failures no longer fail `aspect lint` / `aspect format` / `aspect gazelle` (or any other task using the GHSC / GHC / ArtifactUpload features). Idempotent calls (GET / PUT / PATCH / DELETE) retry up to 3 times with exponential backoff; POST runs once to avoid duplicate check-runs / comments / reviews. Terminal failures now log a `WARNING:` line with method, URL, attempts, and the underlying transport error or HTTP status. ### Test plan - Covered by existing test cases: 727 AXL tests + 7 snapshot suites pass. - Manual: ran `aspect lint` / `aspect format` / `aspect gazelle` locally; happy path unchanged. Retry behavior exercised whenever GitHub's edge flakes — visible in CI logs as `WARNING: GitHub API POST /repos/.../check-runs failed after 3 attempts: transport error: ...` followed by the next cycle's retry-and-recover. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ctx.http().*had no timeout configured on the underlying reqwestClient, so a hung peer (slow GitHub, broken proxy, dropped connection without TCP RST) could pin a single request indefinitely. AXL feature handlers run synchronously, so one hung request wedges the whole task.Adds two ceilings on the runtime HTTP client:
DEFAULT_HTTP_READ_TIMEOUT= 120s — maximum gap between successive reads on a single response. Resets every time bytes arrive, so a slow-but-progressing transfer (a large artifact upload on a 1 Mbps CI link) is fine; only stalls (hung peer, dropped connection without TCP RST, half-closed socket) trip the limit.DEFAULT_HTTP_CONNECT_TIMEOUT= 15s — TCP/TLS handshake ceiling. Independent of the read window so DNS or handshake stalls fail fast.We deliberately do not set a total
timeout(...): that would cap every transfer regardless of progress and regress legitimate long uploads/downloads on slow CI links.No new API surface; defaults apply to every existing caller.
Changes are visible to end-users: no
Test plan
cargo build -p aspect-clicleancargo test -p axl-runtimeclean (no http-specific tests exist; the change is a Client-builder configuration with no behavioural branching)ctx.http()continues to work; a request to an unresponsive host now errors out around the 15s (connect) or 120s (read-stall) marks instead of hanging; a long-but-progressing upload/download still completes.