feat(taskbroker): Add Useful Push Taskbroker Metrics by george-sentry · Pull Request #595 · getsentry/taskbroker

george-sentry · 2026-04-14T21:21:49Z

Linear

Description

Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.

This PR adds metrics around the main push mode handoff points so we can see how the system is behaving in production and tell where work is getting stuck.

Right now, there are some scattered metrics, but they do not make it easy to answer questions like:

Are fetch loops finding work, or just polling empty?
Are brokers claiming work successfully but getting stuck on the internal push queue?
Are pushes to workers succeeding, timing out, or failing with specific gRPC errors?
Are claimed activations being moved to processing successfully?
Are workers accepting pushed tasks, or rejecting them because they are busy?
Are pushed tasks actually closing the loop through SetTaskStatus?

This PR fills in those gaps.

Changes

Add fetch loop metrics for empty polls, claimed batch sizes, store errors, and submit outcomes
Add push pool metrics for worker connection attempts, queue depth, queue wait time, RPC attempts, RPC outcomes, and delivery outcomes
Add mark_activation_processing outcome metrics for success vs no-op cases
Add push mode SetTaskStatus metrics for success, not-found, and error outcomes
Add worker-side PushTask ingress metrics for attempt, accept, busy, and request duration
Tag gRPC middleware request metrics with response status
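
As a rough illustration of the fetch loop item above, the sketch below separates empty polls, store errors, and claimed batch sizes. It is written in Python for brevity (the broker itself is Rust), and everything in it is an assumption made for the example: the `InMemoryMetrics` class, the `claim_batch` callable, and the metric names are hypothetical and are not taken from the PR.

```python
from collections import Counter, defaultdict


class InMemoryMetrics:
    """Hypothetical stand-in for a metrics backend; keeps values in memory."""

    def __init__(self):
        self.counters = Counter()
        self.distributions = defaultdict(list)

    def incr(self, name, tags=None):
        # Key each counter by name plus its sorted tag pairs.
        self.counters[(name, tuple(sorted((tags or {}).items())))] += 1

    def distribution(self, name, value):
        self.distributions[name].append(value)


def fetch_once(claim_batch, metrics):
    """Run one fetch iteration and record what happened.

    `claim_batch` is a hypothetical callable that returns a list of claimed
    activations and raises when the store fails.
    """
    try:
        batch = claim_batch()
    except Exception:
        # Store error: the poll itself failed.
        metrics.incr("fetch.store_error")
        return []
    if not batch:
        # Empty poll: the loop ran but found no work.
        metrics.incr("fetch.empty_poll")
        return []
    # Work was claimed: record how much.
    metrics.distribution("fetch.claimed_batch_size", len(batch))
    return batch
```

With counters split this way, a dashboard could show whether the loop is mostly polling empty or mostly claiming work, which is the first question in the description.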

linear-code · 2026-04-14T21:21:53Z

STREAM-871 Add Useful Metrics

cursor · 2026-04-14T21:30:11Z

+            "taskworker.worker.push_rpc.duration",
+            time.monotonic() - start_time,
+            tags={"result": "accepted", "processing_pool": self.worker._processing_pool_name},
+        )


Metrics after abort lack structural mutual exclusivity

Low Severity

The "accepted" metrics on lines 88–96 are emitted unconditionally after the if block, relying on context.abort() raising an exception to prevent them from firing in the busy case. While this works in production gRPC, the existing test test_push_task_worker_busy uses a MagicMock for context, where abort() does not raise — meaning both "busy" and "accepted" metrics fire for the same request during testing. Using an else branch (or adding a return after context.abort()) would make the mutual exclusivity between "busy" and "accepted" structurally guaranteed rather than reliant on a side effect.
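
A minimal sketch of the suggested shape, assuming a simplified handler: `handle_push`, its parameters, and the `emit` callback are hypothetical stand-ins, not the actual worker code. The explicit `return` after `abort()` keeps the "busy" and "accepted" paths separate even when `abort()` returns normally, as it does on a plain `MagicMock`.

```python
from unittest.mock import MagicMock


def handle_push(worker_busy, context, emit):
    """Hypothetical handler shape, not the actual worker code."""
    if worker_busy:
        emit("busy")
        # Placeholder arguments; real gRPC code would pass a grpc.StatusCode.
        context.abort("RESOURCE_EXHAUSTED", "worker busy")
        # Explicit return: the accepted path cannot run even if abort()
        # returns normally instead of raising.
        return
    emit("accepted")


def results_for(worker_busy):
    """Collect the result tags emitted for a single request."""
    emitted = []
    handle_push(worker_busy, MagicMock(), emitted.append)
    return emitted
```

Here `results_for(True)` yields only the "busy" tag and `results_for(False)` yields only "accepted", regardless of how the context object behaves.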

Additional Locations (1)

clients/python/src/taskbroker_client/worker/worker.py#L75-L86

Reviewed by Cursor Bugbot for commit fbd2453.

markstory · 2026-04-15T15:17:42Z

+            "taskworker.worker.push_rpc.duration",
+            time.monotonic() - start_time,
+            tags={"result": "accepted", "processing_pool": self.worker._processing_pool_name},
+        )


Metrics being wrong when mocks are used is fine imo. The mocks should raise
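
One way to make a mock raise, sketched under assumptions: the handler below only mirrors the shape under discussion (no `else` or `return` after `abort()`), and `AbortCalled`, `handle_push`, and `make_raising_context` are hypothetical names for this example. Setting `side_effect` on the mocked `abort` makes the test double behave like a call that never returns.

```python
from unittest.mock import MagicMock


class AbortCalled(Exception):
    """Hypothetical exception used only by this test double."""


def handle_push(worker_busy, context, emitted):
    # Mirrors the shape under discussion: nothing stops execution after
    # abort() except the exception it raises.
    if worker_busy:
        emitted.append("busy")
        context.abort("RESOURCE_EXHAUSTED", "worker busy")
    emitted.append("accepted")


def make_raising_context():
    context = MagicMock()
    # side_effect makes the mocked abort() raise instead of returning.
    context.abort.side_effect = AbortCalled("aborted")
    return context


def run_busy_case():
    emitted = []
    try:
        handle_push(True, make_raising_context(), emitted)
    except AbortCalled:
        pass
    return emitted
```

With the raising mock, the busy case records only "busy", so the test sees the same sequence as a request that is actually aborted.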

markstory · 2026-04-15T15:19:09Z

+                                                    PushError::Channel(_) => "channel_error",
+                                                };
+                                                metrics::counter!("push.fetch.submit", "result" => reason)
+                                                    .increment(1);


Should the match block on line 153 be indented?

Yes, fixed. I think cargo fmt didn't do anything because it was already indented too far.

markstory · 2026-04-15T15:23:08Z

-            Ok(Err(e)) => Err(PushError::Channel(e)),
+            Ok(Err(e)) => {
+                metrics::histogram!("push.queue.wait_duration").record(start.elapsed());
+                metrics::counter!("push.queue.submit", "result" => "channel_error").increment(1);


Do you need a counter for successful submits?

Yes. I realized that the metrics here are identical to the metrics at the callsite in fetch.rs, so I removed these. At the callsite, I have counters for successful submits and failed submits categorized by error kind (either timeout or channel error).
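
The callsite pattern described above can be sketched roughly as follows. This is an illustration in Python, not the Rust code in fetch.rs: the `PushError` enum here, the "ok" and "timeout" tag values, and the `record_submit` helper are assumptions, while the "push.fetch.submit" name and the "channel_error" tag come from the diff hunk.

```python
from collections import Counter
from enum import Enum


class PushError(Enum):
    """Hypothetical mirror of the two error kinds named in the reply."""

    TIMEOUT = "timeout"
    CHANNEL = "channel_error"


def record_submit(error, counters):
    """Increment one counter per submit, tagged by outcome.

    `error` is None on success or a PushError on failure.
    """
    result = "ok" if error is None else error.value
    counters[("push.fetch.submit", result)] += 1
    return result


def summarize(outcomes):
    """Tally a sequence of submit outcomes into a Counter."""
    counters = Counter()
    for error in outcomes:
        record_submit(error, counters)
    return counters
```

Keeping one counter name with a `result` tag means success and each failure kind can be compared on a single chart, and emitting only at the callsite avoids counting the same submit twice.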

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 2ae4a59.

evanh

LGTM!

Add Useful Push Taskbroker Metrics

fbd2453

george-sentry requested a review from a team as a code owner April 14, 2026 21:21

cursor Bot reviewed Apr 14, 2026

markstory reviewed Apr 15, 2026

Metrics Tweaks

2ae4a59

sentry Bot reviewed Apr 15, 2026

Comment thread src/fetch/mod.rs

Comment thread src/push/mod.rs Outdated

cursor Bot reviewed Apr 15, 2026

Comment thread src/fetch/mod.rs

More Metrics Tweaks

115b8f0

sentry Bot reviewed Apr 15, 2026

Comment thread src/grpc/metrics_middleware.rs Outdated

Comment thread src/push/mod.rs

george-sentry added 2 commits April 15, 2026 12:05

Final (?) Metrics Tweaks

a16f3bc

Merge branch 'main' of https://github.com/getsentry/taskbroker into g…

92209d7

…eorge/push-taskbroker/add-useful-metrics

evanh approved these changes Apr 15, 2026

sentry Bot reviewed Apr 15, 2026

Comment thread clients/python/src/taskbroker_client/worker/worker.py

Address AI Comments

2320bd4

sentry Bot reviewed Apr 15, 2026

Comment thread src/grpc/server.rs

markstory approved these changes Apr 16, 2026

george-sentry added 2 commits April 16, 2026 09:01

Change Fetch Metrics Prefix, Emit set_status Duration on All Paths

768fa11

Lint w/New Clippy Version

15cddde

sentry Bot reviewed Apr 16, 2026

Comment thread clients/python/src/taskbroker_client/worker/worker.py

george-sentry merged commit 00d8073 into main Apr 16, 2026
23 checks passed

george-sentry deleted the george/push-taskbroker/add-useful-metrics branch April 16, 2026 16:40


3 participants