Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea1a2a799d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
retry: sink::retry::RetryConfig {
    max_retries: max_retries as u32,
    retry_min_delay,
    retry_max_buffer_size: retry_max_buffer_size as usize,
    timeout,
    error_strategy,
Honor the configured BES timeout
The Starlark API and Workflows feature expose timeout/bes_timeout as an overall BES upload deadline, but this value is only stored in RetryConfig and is never read by Grpc::work, drive_stream, retry_lifecycle, or Client. Users setting a finite timeout still wait through all lifecycle calls, stream retries, and final upload indefinitely if the backend stalls or keeps retrying, so the new knob is effectively a no-op.
bessie_sinks.append(bazel.build_events.grpc(
    uri = bessie_endpoint,
-   metadata = {} # TODO: how does bessie authenticate?
+   metadata = {}, # TODO: how does bessie authenticate?
Within the deployment, it is just TLS so you can drop the TODO.
Address maintainer feedback on PR #1065:

1. Bound the gRPC sink's forwarder channel by retry_max_buffer_size. Previously the Tokio mpsc was unbounded, so events accumulated freely while drive_stream was sleeping between reconnects or replaying the retry buffer, defeating the very knob meant to cap memory. The forwarder now uses try_send and signals overflow via an atomic flag; drive_stream sees the closed receiver, checks the flag, and exits with BufferFull so error_strategy applies.
2. Wire the configured timeout. RetryConfig.timeout was previously stored but never read, so 'bes_timeout' was a no-op. Wrap the sink's work in tokio::time::timeout so finite deadlines actually bound lifecycle calls, stream retries, and the final upload, matching Bazel's --bes_timeout semantics.
3. Drop the TODO about bessie authentication on the workflows BES sink: within the deployment, transport security is plain TLS.
The gRPC sink uses tonic::Status::unavailable directly, but tonic was only listed in Cargo.toml. Bazel's rust_library deps are the authoritative source, so the bazel build of //:cli failed with 'use of unresolved module or unlinked crate `tonic`'. Add @crates//:tonic alongside the other tokio-family deps.
Aspect CI Status

| Task | Job | Result |
|---|---|---|
| 🟡 running | build (build-task) | in progress |
| 🟡 running | test (test-task) | in progress |
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eecf20f93d
// `tokio::time::timeout` mirrors Bazel's `--bes_timeout` and gives
// the user-set knob a real effect.
match timeout {
    Some(d) => match tokio::time::timeout(d, inner).await {
Start BES timeout after the build finishes
When timeout is set, this wraps the entire sink task from the moment the Bazel process is spawned, but work_inner spends most of a long build just waiting for BEP events from event_rx. Bazel documents --bes_timeout as the time to wait for BES/BEP upload after the build and tests have finished (https://bazel.build/versions/7.5.0/reference/command-line-reference), so a healthy build that runs longer than bes_timeout will terminate the sink early; with warn/ignore it silently drops the rest of the upload, and with abort/fail_at_end it can fail an otherwise successful build.
The bounded mpsc with try_send introduced in c7702e9 caused BES handling hangs in CI. Revert to the prior unbounded forwarder so the sink behaves the same as before that commit. The timeout wrapper from work() is kept — it's a pure no-op when 'bes_timeout' defaults to "0s"/None, and it actually wires up the knob when the user sets a finite deadline. The memory bound concern raised on PR #1065 still stands as a follow-up; bounding the forwarder safely needs more thought than the simple try_send/overflow flag did.
After sending last_message and half-closing the request side, drive_stream waited for the server to close the response stream before returning Done. Some BES backends keep the response stream open after the final ack, which made every successful upload hang indefinitely at end-of-build, visible in CI as build/test/lint/delivery jobs running until the 12-minute cancellation. Exit Done as soon as last_message_sent is true and the retry buffer drains. The same shortcut applies for the upstream-closed path: if we already half-closed without a last_message and the buffer drains, return UpstreamClosed without waiting for the server.
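The exit decision this commit describes can be written as a small pure function (a sketch: `exit_state` and its parameters mirror the state named in the commit message but are not the actual sink code):

```rust
#[derive(Debug, PartialEq)]
enum Exit {
    Done,
    UpstreamClosed,
    KeepWaiting,
}

// last_message_sent / upstream_closed / buffer_len mirror the state the
// commit message describes; the function name is illustrative.
fn exit_state(last_message_sent: bool, upstream_closed: bool, buffer_len: usize) -> Exit {
    if buffer_len > 0 {
        // Acks still outstanding: keep reading the response stream.
        return Exit::KeepWaiting;
    }
    if last_message_sent {
        // Half-closed after last_message and buffer drained: don't wait
        // for the server to close the response stream.
        Exit::Done
    } else if upstream_closed {
        // Half-closed because the upstream broadcaster went away.
        Exit::UpstreamClosed
    } else {
        Exit::KeepWaiting
    }
}

fn main() {
    println!("{:?}", exit_state(true, false, 0));
    println!("{:?}", exit_state(false, true, 0));
    println!("{:?}", exit_state(true, false, 3));
}
```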
Symptom: CI Build Task hits "153 Targets Built" then sits until the 20-minute job timeout. wait()/sink_handles.join() is blocked on the gRPC sink thread because drive_stream is parked in response_stream.next() after the request side was half-closed; the server never sent another ack and never closed the response stream. The previous "exit Done once buffer drains" only fired on an ack, so a server that goes silent post-half-close pinned the entire build.

Add a 30s hard deadline that arms when sender_opt is dropped (either after sending last_message or after the upstream broadcaster closed). If we hit it, return Done/UpstreamClosed; the events were already buffered into tonic before we dropped the sender, so further waiting only wedges the build.
Changes are visible to end-users: yes
Test plan