Skip to content

fix(http): bound read and connect timeouts on the runtime HTTP client#1071

Merged
thesayyn merged 1 commit into
mainfrom
pre-http-timeouts
May 12, 2026
Merged

fix(http): bound read and connect timeouts on the runtime HTTP client#1071
thesayyn merged 1 commit into
mainfrom
pre-http-timeouts

Conversation

@gregmagolan
Copy link
Copy Markdown
Member

@gregmagolan gregmagolan commented May 11, 2026

Summary

ctx.http().* had no timeout configured on the underlying reqwest Client, so a hung peer (slow GitHub, broken proxy, dropped connection without TCP RST) could pin a single request indefinitely. AXL feature handlers run synchronously, so one hung request wedges the whole task.

Adds two ceilings on the runtime HTTP client:

  • DEFAULT_HTTP_READ_TIMEOUT = 120s — maximum gap between successive reads on a single response. Resets every time bytes arrive, so a slow-but-progressing transfer (a large artifact upload on a 1 Mbps CI link) is fine; only stalls (hung peer, dropped connection without TCP RST, half-closed socket) trip the limit.
  • DEFAULT_HTTP_CONNECT_TIMEOUT = 15s — TCP/TLS handshake ceiling. Independent of the read window so DNS or handshake stalls fail fast.

We deliberately do not set a total timeout(...): that would cap every transfer regardless of progress and regress legitimate long uploads/downloads on slow CI links.

No new API surface; defaults apply to every existing caller.


Changes are visible to end-users: no

  • Searched for relevant documentation and updated as needed: n/a
  • Breaking change (forces users to change their own code or config): no
  • Suggested release notes appear below: no

Test plan

  • cargo build -p aspect-cli clean
  • cargo test -p axl-runtime clean (no http-specific tests exist; the change is a Client-builder configuration with no behavioural branching)
  • Manual: any AXL task using ctx.http() continues to work; a request to an unresponsive host now errors out around the 15s (connect) or 120s (read-stall) marks instead of hanging; a long-but-progressing upload/download still completes.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3863a63ae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/http.rs Outdated
@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented May 11, 2026

Aspect CI Status · 2026-05-11T23:40:41Z

Task Job Result
✅ passed build build-task 153 targets built
✅ passed build build-task 153 targets built
✅ passed test test-task all 25 tests cached
✅ passed test test-task all 25 tests cached

`ctx.http().*` calls had no timeout configured on the underlying
reqwest `Client`, so a hung peer (slow GitHub, broken proxy, dropped
connection without TCP RST) could pin a single request indefinitely.
AXL feature handlers run synchronously, so one hung request wedges
the whole task.

  - DEFAULT_HTTP_READ_TIMEOUT    = 120s   max gap between reads
  - DEFAULT_HTTP_CONNECT_TIMEOUT = 15s    TCP/TLS handshake ceiling

`read_timeout` resets every time bytes arrive, so slow-but-progressing
transfers (large artifact uploads on a 1 Mbps CI link) are fine — only
stalls trip the limit. We deliberately do NOT set a total `timeout`:
that would cap every transfer regardless of progress and regress
legitimate long uploads/downloads on slow links.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregmagolan gregmagolan changed the title fix(http): bound request and connect timeouts on the runtime HTTP client fix(http): bound read and connect timeouts on the runtime HTTP client May 11, 2026
@gregmagolan gregmagolan requested a review from thesayyn May 11, 2026 23:38
Copy link
Copy Markdown
Member

@thesayyn thesayyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice default, but we probably should have knobs in http methods to allow configuring this.

@thesayyn thesayyn merged commit 425a476 into main May 12, 2026
20 checks passed
@thesayyn thesayyn deleted the pre-http-timeouts branch May 12, 2026 20:17
gregmagolan added a commit that referenced this pull request May 12, 2026
…prefactor

The HTTP read/connect-timeout change was pre-factored out of this branch
and landed separately in #1071. The split left a merge artifact in
crates/axl-runtime/src/engine/http.rs: the `impl StarlarkValue for Http`
block and `starlark_simple_value!(Http)` macro were dropped, leaving a
dangling `#[starlark_value(type = "Http")]` attribute and breaking the
build (Http no longer implements StarlarkValue, so it can't be exposed
via `StarlarkValueAsType` in engine/mod.rs).

Restore the impl block + macro exactly as they were on main prior to the
prefactor. Pure revert of the accidental deletion; no behavioural change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gregmagolan added a commit that referenced this pull request May 13, 2026
…P failures (#1078)

## Summary

A transient transport failure on a GitHub Actions Twirp
`DeleteArtifact` call propagated out of the GHSC publisher and
failed an in-flight `lint` task. Observability features must never
fail the underlying task.

`lib/github.axl` now applies one consistent HTTP failure-handling
contract to every helper:

- **Transport errors** (DNS / TLS / refused / reset / timeout) are
  caught via `.map_err(lambda e: str(e)).block()` and surfaced as
  `(False, 0, "transport error: ...")`. The exception that aborted
  the lint task can't happen.
- **Transient failures** (transport, 5xx, 429) are retried up to 3
  times with exponential backoff (250 → 500 → 1000 ms). 4xx is
  deterministic and not retried.
- **Retry is gated on idempotency**: GET / PUT / PATCH / DELETE
  retry, POST runs exactly once. Retrying POST after the server
  accepted the first request would duplicate check-runs / PR
  comments / reviews / review comments — strictly worse than the
  original error.
- For the GH Actions Twirp ArtifactService, retry is gated on a
per-method allowlist (`_TWIRP_IDEMPOTENT_METHODS = ["DeleteArtifact"]`).
  `CreateArtifact` and `FinalizeArtifact` run once each.
- **Terminal failures emit a one-line `WARNING:`** with method, URL,
  attempts count (when retry applied), and status/transport-error
  detail. Before this change, the only signal was an opaque AXL
  parse trace pointing at the call site with no underlying cause.

Also fixed two adjacent paths surfaced while auditing every
`ctx.http()` call site that feeds into a feature handler:
`lib/circleci.axl::_request` (CircleCI runner API) and
`lib/artifacts.axl::_upload_file_gitlab` (GitLab generic-packages
PUT) — both now catch transport errors so a flaky CI API can't fail
a task that just happens to be uploading artifacts, and both emit
the same one-line WARNING on terminal failure.

`.map_err` is the pre-existing `StarlarkFuture` transform
(`axl-runtime/src/engine/async/future.rs:221`); no Rust changes.

### Builds on #1071

#1071 (already landed) put a hard upper bound on how long a single
HTTP call can hang. This PR layers retry + graceful-failure on top:
timeouts bound the worst case, retry covers the common case,
graceful failure means observability features never abort tasks.

---

### Changes are visible to end-users: yes

- Searched for relevant documentation and updated as needed: no (in-code
only)
- Breaking change (forces users to change their own code or config): no
- Suggested release notes appear below: yes

#### Suggested release notes

- Transient GitHub / CI API failures no longer fail `aspect lint` /
  `aspect format` / `aspect gazelle` (or any other task using the
  GHSC / GHC / ArtifactUpload features). Idempotent calls (GET / PUT
  / PATCH / DELETE) retry up to 3 times with exponential backoff;
  POST runs once to avoid duplicate check-runs / comments / reviews.
  Terminal failures now log a `WARNING:` line with method, URL,
  attempts, and the underlying transport error or HTTP status.

### Test plan

- Covered by existing test cases: 727 AXL tests + 7 snapshot suites
  pass.
- Manual: ran `aspect lint` / `aspect format` / `aspect gazelle`
  locally; happy path unchanged. Retry behavior exercised whenever
  GitHub's edge flakes — visible in CI logs as `WARNING: GitHub API
  POST /repos/.../check-runs failed after 3 attempts: transport
  error: ...` followed by the next cycle's retry-and-recover.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants