feat(taskbroker): Add Useful Push Taskbroker Metrics#595
Conversation
| "taskworker.worker.push_rpc.duration", | ||
| time.monotonic() - start_time, | ||
| tags={"result": "accepted", "processing_pool": self.worker._processing_pool_name}, | ||
| ) |
There was a problem hiding this comment.
Metrics after abort lack structural mutual exclusivity
Low Severity
The "accepted" metrics on lines 88–96 are emitted unconditionally after the if block, relying on context.abort() raising an exception to prevent them from firing in the busy case. While this works in production gRPC, the existing test test_push_task_worker_busy uses a MagicMock for context, where abort() does not raise — meaning both "busy" and "accepted" metrics fire for the same request during testing. Using an else branch (or adding a return after context.abort()) would make the mutual exclusivity between "busy" and "accepted" structurally guaranteed rather than reliant on a side effect.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit fbd2453. Configure here.
There was a problem hiding this comment.
Metrics being wrong when mocks are used is fine imo. The mocks should raise
| "taskworker.worker.push_rpc.duration", | ||
| time.monotonic() - start_time, | ||
| tags={"result": "accepted", "processing_pool": self.worker._processing_pool_name}, | ||
| ) |
There was a problem hiding this comment.
Metrics being wrong when mocks are used is fine imo. The mocks should raise
| PushError::Channel(_) => "channel_error", | ||
| }; | ||
| metrics::counter!("push.fetch.submit", "result" => reason) | ||
| .increment(1); |
There was a problem hiding this comment.
Should the match block on 153 be indented?
There was a problem hiding this comment.
Yes, fixed. I think cargofmt didn't do anything because it was already indented too far.
| Ok(Err(e)) => Err(PushError::Channel(e)), | ||
| Ok(Err(e)) => { | ||
| metrics::histogram!("push.queue.wait_duration").record(start.elapsed()); | ||
| metrics::counter!("push.queue.submit", "result" => "channel_error").increment(1); |
There was a problem hiding this comment.
Do you need a counter for successful submits?
There was a problem hiding this comment.
Yes. I realized that the metrics here are identical to the metrics at the callsite in fetch.rs, so I removed these. At the callsite, I have counters for successful submits and failed submits categorized by error kind (either timeout or channel error).
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2ae4a59. Configure here.
…eorge/push-taskbroker/add-useful-metrics


Linear
Completes STREAM-871
Description
Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.
This PR adds metrics around the main push mode handoff points so we can see how the system is behaving in production and tell where work is getting stuck.
Right now, there are some scattered metrics, but they do not make it easy to answer questions like...
SetTaskStatus?This PR fills in those gaps.
Changes
mark_activation_processingoutcome metrics for success vs no-op casesSetTaskStatusmetrics for success, not-found, and error outcomesPushTaskingress metrics for attempt, accept, busy, and request duration