feat(taskbroker): Add Useful Push Taskbroker Metrics#595

Merged
george-sentry merged 8 commits into main from george/push-taskbroker/add-useful-metrics
Apr 16, 2026

Conversation

@george-sentry
Member

Linear

Completes STREAM-871

Description

Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.

This PR adds metrics around the main push mode handoff points so we can see how the system is behaving in production and tell where work is getting stuck.

Right now, there are some scattered metrics, but they do not make it easy to answer questions like...

  • Are fetch loops finding work, or just polling empty?
  • Are brokers claiming work successfully but getting stuck on the internal push queue?
  • Are pushes to workers succeeding, timing out, or failing with specific gRPC errors?
  • Are claimed activations being moved to processing successfully?
  • Are workers accepting pushed tasks, or rejecting them because they are busy?
  • Are pushed tasks actually closing the loop through SetTaskStatus?

This PR fills in those gaps.

Changes

  • Add fetch loop metrics for empty polls, claimed batch sizes, store errors, and submit outcomes
  • Add push pool metrics for worker connection attempts, queue depth, queue wait time, RPC attempts, RPC outcomes, and delivery outcomes
  • Add mark_activation_processing outcome metrics for success vs no-op cases
  • Add push mode SetTaskStatus metrics for success, not-found, and error outcomes
  • Add worker-side PushTask ingress metrics for attempt, accept, busy, and request duration
  • Tag gRPC middleware request metrics with response status

@george-sentry george-sentry requested a review from a team as a code owner April 14, 2026 21:21

linear-code Bot commented Apr 14, 2026

"taskworker.worker.push_rpc.duration",
time.monotonic() - start_time,
tags={"result": "accepted", "processing_pool": self.worker._processing_pool_name},
)

Metrics after abort lack structural mutual exclusivity

Low Severity

The "accepted" metrics on lines 88–96 are emitted unconditionally after the if block, relying on context.abort() raising an exception to prevent them from firing in the busy case. While this works in production gRPC, the existing test test_push_task_worker_busy uses a MagicMock for context, where abort() does not raise — meaning both "busy" and "accepted" metrics fire for the same request during testing. Using an else branch (or adding a return after context.abort()) would make the mutual exclusivity between "busy" and "accepted" structurally guaranteed rather than reliant on a side effect.

Reviewed by Cursor Bugbot for commit fbd2453.
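
The structural fix suggested above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeMetrics`, `push_task`, the metric name `push_rpc.result`, and the string status code are all hypothetical stand-ins. It only shows that adding a `return` after `context.abort()` keeps "busy" and "accepted" mutually exclusive even when `abort()` does not raise.

```python
from unittest.mock import MagicMock


class FakeMetrics:
    """Hypothetical in-memory recorder standing in for the real metrics backend."""

    def __init__(self):
        self.results = []

    def incr(self, name, tags):
        # Store only the metric name and its "result" tag for easy inspection.
        self.results.append((name, tags["result"]))


def push_task(worker_busy, context, metrics):
    """Hypothetical handler that emits exactly one result per request."""
    if worker_busy:
        metrics.incr("push_rpc.result", tags={"result": "busy"})
        context.abort("RESOURCE_EXHAUSTED", "worker busy")
        # Reached only if abort() did not raise, e.g. under a plain MagicMock.
        return None
    metrics.incr("push_rpc.result", tags={"result": "accepted"})
    return "ok"


metrics = FakeMetrics()
context = MagicMock()  # abort() is a no-op on a plain mock, unlike real gRPC
push_task(worker_busy=True, context=context, metrics=metrics)
push_task(worker_busy=False, context=context, metrics=metrics)
```

With this shape, the busy request records only "busy" and the second request records only "accepted", regardless of how the context behaves.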

Member

Metrics being wrong when mocks are used is fine imo. The mocks should raise
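
One way to make a mock raise, as suggested here, is to give `abort` a `side_effect`. The sketch below is an assumption-laden illustration rather than the project's test code: `AbortError`, `make_raising_context`, and `handler` are hypothetical names, and the string status code stands in for a real gRPC status.

```python
from unittest.mock import MagicMock


class AbortError(Exception):
    """Hypothetical stand-in for the exception a real gRPC context raises on abort()."""


def make_raising_context():
    # Configure the mock so abort() raises, mirroring real gRPC behavior.
    context = MagicMock()
    context.abort.side_effect = AbortError("aborted")
    return context


def handler(worker_busy, context, emitted):
    """Hypothetical handler that relies on abort() raising to skip 'accepted'."""
    if worker_busy:
        emitted.append("busy")
        context.abort("RESOURCE_EXHAUSTED", "worker busy")
    emitted.append("accepted")


emitted = []
context = make_raising_context()
try:
    handler(True, context, emitted)
except AbortError:
    pass  # the test would normally assert on this with pytest.raises
```

Because the mock now raises, only the "busy" entry is recorded for the busy request, which matches what happens in production.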

Comment thread src/fetch/mod.rs
PushError::Channel(_) => "channel_error",
};
metrics::counter!("push.fetch.submit", "result" => reason)
.increment(1);
Member

Should the match block on line 153 be indented?

Member Author

Yes, fixed. I think cargo fmt didn't do anything because it was already indented too far.

Comment thread src/push/mod.rs Outdated
Ok(Err(e)) => Err(PushError::Channel(e)),
Ok(Err(e)) => {
metrics::histogram!("push.queue.wait_duration").record(start.elapsed());
metrics::counter!("push.queue.submit", "result" => "channel_error").increment(1);
Member

Do you need a counter for successful submits?

Member Author

Yes. I realized that the metrics here are identical to the metrics at the callsite in fetch.rs, so I removed these. At the callsite, I have counters for successful submits and failed submits categorized by error kind (either timeout or channel error).
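
As a language-neutral illustration of the call-site approach described above, here is a small Python sketch (the real code is Rust). The metric name `push.fetch.submit` and the `channel_error` tag come from the snippet in this thread; the `ok` and `timeout` tag values, the `PushError` enum, and the helper names are assumptions made for the example.

```python
from collections import Counter
from enum import Enum


class PushError(Enum):
    """Hypothetical mirror of the two error kinds mentioned in the reply."""

    TIMEOUT = "timeout"          # assumed tag value
    CHANNEL = "channel_error"    # tag value taken from the snippet above


def submit_result_tag(error):
    # One counter, one "result" tag: success or the specific error kind.
    return "ok" if error is None else error.value


def record_submit(counts, error=None):
    counts[("push.fetch.submit", submit_result_tag(error))] += 1


counts = Counter()
record_submit(counts)                      # successful submit
record_submit(counts, PushError.TIMEOUT)   # queue submit timed out
record_submit(counts, PushError.CHANNEL)   # channel closed or failed
record_submit(counts, PushError.CHANNEL)
```

Keeping a single counter with a `result` tag at the call site avoids double counting the same event in both the fetch loop and the push pool.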

Comment thread src/fetch/mod.rs
Comment thread src/push/mod.rs Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 2ae4a59.

Comment thread src/fetch/mod.rs
Comment thread src/grpc/metrics_middleware.rs Outdated
Comment thread src/push/mod.rs
Member

@evanh evanh left a comment

LGTM!

Comment thread clients/python/src/taskbroker_client/worker/worker.py
Comment thread src/grpc/server.rs
Comment thread clients/python/src/taskbroker_client/worker/worker.py
@george-sentry george-sentry merged commit 00d8073 into main Apr 16, 2026
23 checks passed
@george-sentry george-sentry deleted the george/push-taskbroker/add-useful-metrics branch April 16, 2026 16:40

3 participants