feat(taskbroker): Add Claimed Status to Handle Push Failures#586

Merged
george-sentry merged 18 commits into main from george/push-taskbroker/add-sent-flag
Apr 13, 2026

Conversation

Member

@george-sentry george-sentry commented Apr 2, 2026

Linear

Completes STREAM-860

Description

Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.

Right now, I rely on processing_deadline to revert processing tasks back to pending when pushing them fails. This isn't ideal because it eats through processing attempts, resulting in needlessly dropped tasks.

I want to add a Sending status that indicates a task is being sent. Now, upkeep increments processing attempts only for tasks that are still in "sending" when their processing deadlines expire. If the status is "processing," that means the task was already sent successfully and its processing attempts can be incremented.

This will help us avoid dropping tasks needlessly when workers are busy.

Note that my original plan was different. You can see it in the commit history. Here is a description of that plan.

I want to add a sent column to the activation table to track whether a task was successfully sent after being fetched from the table. Now, upkeep increments processing attempts only for tasks that are processing and have sent = true.

If the status is processing and sent = false, that means pushing failed or timed out (or didn't happen yet), and we can revert back to pending without incrementing processing attempts.
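The deadline-expiry rule described above can be sketched as follows. This is a hypothetical illustration, not the actual code in src/store/inflight_activation.rs; variant and function names are illustrative, and `Sending` was ultimately renamed `Claimed` (see the title changes in this thread):

```rust
// Hypothetical sketch of the expiry rule from the PR description:
// only tasks that were actually delivered to a worker (Processing)
// consume a processing attempt when their deadline lapses; tasks
// still stuck in Claimed are released back to Pending for free.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum InflightActivationStatus {
    Unspecified, // unused, aligns with sentry-protos
    Pending,
    Claimed,    // taken by a push thread; send still in flight
    Processing, // delivered to a worker successfully
    Complete,
    Retry,
    Failure,
}

/// Decide what upkeep does when a task's deadline expires.
fn on_deadline_expired(
    status: InflightActivationStatus,
    attempts: u32,
) -> (InflightActivationStatus, u32) {
    match status {
        // Push failed, timed out, or hasn't happened: revert for free.
        InflightActivationStatus::Claimed => (InflightActivationStatus::Pending, attempts),
        // Task was delivered: retrying it costs an attempt.
        InflightActivationStatus::Processing => (InflightActivationStatus::Pending, attempts + 1),
        // Terminal or idle states are untouched by this rule.
        other => (other, attempts),
    }
}
```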

@george-sentry george-sentry requested a review from a team as a code owner April 2, 2026 22:03

linear-code Bot commented Apr 2, 2026

Comment thread src/upkeep.rs
Comment thread src/push/mod.rs Outdated
Comment thread benches/store_bench.rs Outdated
Comment thread src/grpc/server.rs
Comment thread pg_migrations/0001_create_inflight_activations.sql Outdated
Comment thread src/store/inflight_activation.rs Outdated
Comment thread src/store/inflight_activation.rs Outdated
Comment thread src/store/inflight_activation.rs
Comment thread src/upkeep.rs
@george-sentry george-sentry changed the title from "feat(taskbroker): Add Sent Flag to Prevent Dropping Tasks on Push Failure" to "feat(taskbroker): Add Sending Status to Handle Push Failures" Apr 7, 2026
Comment thread src/store/inflight_activation.rs
Comment thread src/upkeep.rs
Comment thread src/store/inflight_activation.rs Outdated
Comment thread src/grpc/server.rs Outdated
Comment thread src/upkeep.rs Outdated
Comment thread src/upkeep.rs
Comment thread src/store/inflight_activation.rs
Comment thread src/upkeep.rs Outdated
Comment thread src/store/postgres_activation_store.rs
Comment thread src/store/postgres_activation_store.rs Outdated
@george-sentry george-sentry requested a review from evanh April 9, 2026 21:10
Comment thread src/store/inflight_activation.rs Outdated
```rust
    /// Unused but necessary to align with sentry-protos
    Unspecified,
    Pending,
    Sending,
```
Member

What if we had a Claimed status? The lifecycle could be

```mermaid
flowchart TD
    pending -- taken by a push thread and going to send soon --> claimed
    claimed -- sent to a worker successfully --> processing
    processing --> complete
    processing --> retry
    processing --> failure
```

You have 'claim' in a bunch of the methods but no status reflecting that, and with the addition of Sending, the Processing status can mean two different things depending on the broker mode.
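The proposed lifecycle can be encoded as a tiny transition check. This is purely illustrative, using the state names from the flowchart, not the actual state machine in the store code:

```rust
// Hypothetical encoding of the proposed claimed lifecycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Pending,
    Claimed,
    Processing,
    Complete,
    Retry,
    Failure,
}

/// Returns true if `from -> to` is an allowed edge in the flowchart above.
fn can_transition(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Pending, Claimed)          // taken by a push thread
            | (Claimed, Processing) // sent to a worker successfully
            | (Processing, Complete)
            | (Processing, Retry)
            | (Processing, Failure)
    )
}
```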

Member


This is functionally the same as before, right? Changing sending to claimed?

Member Author


It sounds like it is. I don't mind changing sending to claimed if it's clearer that way.

Member


Yeah, it is just a naming change.

Comment thread src/store/inflight_activation.rs Outdated
```rust
        status: InflightActivationStatus,
    ) -> Result<Vec<InflightActivation>, Error>;

    /// Claims `limit` activations within the `bucket` range. Push mode uses status `Sending` until `mark_activation_sent` moves to `Processing`.
```
Member


With the methods using claim in their name, Claimed could be a better status name.

Member Author


I think this is a good idea. Thoughts @evanh?

Member


Yeah I think it makes sense. Cleans up the nomenclature for sure.

Comment thread src/store/inflight_activation.rs Outdated
```rust
    }

    #[instrument(skip_all)]
    async fn mark_activation_sent(&self, id: &str) -> Result<(), Error> {
```
Member


Sent isn't a status/state in the state machine, should this be mark_activation_processing?

Member Author


Good point - yes.

Comment thread src/store/inflight_activation.rs Outdated
```rust
        if let Ok(query_res) = most_once_result {
            processing_deadline_modified_rows = query_res.rows_affected();
        }
        // Revert activations that weren't delivered back to 'pending' without consuming an attempt
```
Member


In other tasks systems this is referred to as releasing a claim

Comment thread src/grpc/server.rs
@george-sentry george-sentry changed the title from "feat(taskbroker): Add Sending Status to Handle Push Failures" to "feat(taskbroker): Add Claimed Status to Handle Push Failures" Apr 10, 2026
Comment thread src/store/postgres_activation_store.rs Outdated
@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: SQLite unixepoch uses unsupported 'milliseconds' modifier
    • Replaced the invalid SQLite milliseconds modifier with a seconds expression using fractional seconds so claim_expires_at is now populated correctly in push claims.

Preview (4d21eabfe0)
```diff
diff --git a/src/store/inflight_activation.rs b/src/store/inflight_activation.rs
--- a/src/store/inflight_activation.rs
+++ b/src/store/inflight_activation.rs
@@ -930,7 +930,7 @@
             query_builder.push_bind(InflightActivationStatus::Processing);
         } else {
             query_builder.push(format!(
-                "claim_expires_at = unixepoch('now', '+' || {claim_lease_ms} || ' milliseconds', '+' || {grace_period} || ' seconds'), processing_deadline = NULL, status = "
+                "claim_expires_at = unixepoch('now', '+' || ({claim_lease_ms} / 1000.0) || ' seconds', '+' || {grace_period} || ' seconds'), processing_deadline = NULL, status = "
             ));

             query_builder.push_bind(InflightActivationStatus::Claimed);
```

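For context: SQLite's date/time functions accept a fractional `NNN.NNNN seconds` modifier but have no `milliseconds` unit, which is why the millisecond lease has to be converted. A hypothetical helper showing the same conversion done on the Rust side (the actual fix builds the division into the SQL expression instead; this name is illustrative):

```rust
// Hypothetical helper: render a claim-expiry expression for SQLite.
// SQLite modifiers support fractional seconds ("+5.000 seconds") but
// not a "milliseconds" unit, so the lease is converted before formatting.
fn claim_expiry_modifier(claim_lease_ms: u64, grace_period_s: u64) -> String {
    format!(
        "unixepoch('now', '+{:.3} seconds', '+{} seconds')",
        claim_lease_ms as f64 / 1000.0,
        grace_period_s
    )
}
```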

Comment thread src/store/inflight_activation.rs
Comment thread src/store/postgres_activation_store.rs Outdated
Member

@markstory markstory left a comment


Looks good to me, outside of the at-most-once task handling. While I don't think what you have is wrong, it does run the risk of dropping tasks without ever executing them.

Comment thread src/store/postgres_activation_store.rs Outdated
Comment thread src/upkeep.rs Outdated
@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 2156a43.

```rust
    max_processing_attempts: config.max_processing_attempts,
    vacuum_page_count: config.vacuum_page_count,
    processing_deadline_grace_sec: config.processing_deadline_grace_sec,
    claim_lease_ms: config.fetch_batch_size.max(1) as u64 * config.push_queue_timeout_ms,
```

Claim lease formula omits push timeout causing premature expiration

Medium Severity

The claim_lease_ms formula is fetch_batch_size * push_queue_timeout_ms, which only covers the time to enqueue tasks into the push pool's bounded channel. After enqueuing, the actual gRPC push can take up to push_timeout_ms (default 30s). With defaults, claims expire after ~8s (5s lease + 3s grace), but pushes can legitimately take up to 30s. If a push takes longer than 8s but succeeds, upkeep reverts the task to Pending before mark_activation_processing runs, causing the successfully-delivered task to be re-claimed and pushed again — duplicate execution.


Member Author


This is possibly a valid point. It may be a better idea to just make this value configurable, but then we'll need to pick an appropriate value, which is hard. Let's remember this for the future.
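A minimal sketch of what a configurable lease might look like, under stated assumptions: the field names are hypothetical (not the real config struct), `push_timeout_ms` comes from Bugbot's comment above, and the fallback extends the current formula by one push timeout:

```rust
// Hypothetical configurable claim lease. When no explicit value is set,
// fall back to covering a full batch of queue waits plus one gRPC push.
struct PushConfig {
    fetch_batch_size: u64,
    push_queue_timeout_ms: u64,
    push_timeout_ms: u64,
    claim_lease_ms: Option<u64>, // explicit operator override, if any
}

fn effective_claim_lease_ms(cfg: &PushConfig) -> u64 {
    cfg.claim_lease_ms.unwrap_or_else(|| {
        cfg.fetch_batch_size.max(1) * cfg.push_queue_timeout_ms + cfg.push_timeout_ms
    })
}
```

The trade-off the author notes still applies: a longer lease avoids re-claiming slow-but-successful pushes, at the cost of genuinely stuck tasks waiting longer before being released.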

Comment thread src/store/inflight_activation.rs Outdated
Comment thread src/upkeep.rs
Comment on lines 318 to 324
```diff
         .expect("Could not create kafka producer in upkeep"),
     );
     if let Ok(tasks) = store
-        .get_pending_activations_from_namespaces(None, Some(&demoted_namespaces), None, None)
+        .claim_activations(None, Some(&demoted_namespaces), None, None, false)
         .await
     {
         // Produce tasks to Kafka with updated namespace
```

Bug: Tasks for demoted namespaces that fail Kafka publishing get stuck in an infinite retry loop because processing_attempts is never incremented, preventing them from reaching Failure status.
Severity: HIGH

Suggested Fix

To fix this, ensure the processing_attempts counter is incremented for this failure path. This can be done by either using mark_processing: true for demoted namespaces, calling mark_activation_processing after a successful publish, or incrementing the counter when reverting a task from Claimed to Pending upon expiration.

Location: src/upkeep.rs#L318-L324

Potential issue: When forwarding tasks for demoted namespaces, the system first moves them to a `Claimed` status. If the subsequent Kafka publish fails, these tasks remain `Claimed` until they expire. The expiration handler, `handle_claim_expiration()`, reverts them to `Pending` but does not increment the `processing_attempts` counter. This creates an infinite loop where tasks that consistently fail to publish are retried indefinitely, never reaching the `max_processing_attempts` limit and never being moved to the `Failure` status for dead-lettering.

Member Author


The AI reviewers constantly get tripped up by this code. If I mark them as claimed, it will complain about failed publishing resulting in infinite retries. If I mark them as processing, it will complain that we will run out of tries after the processing deadline is exceeded too many times. In other words, no matter which way you go, this chunk of code will be flagged.

@george-sentry george-sentry merged commit 6e9b71d into main Apr 13, 2026
22 checks passed
@george-sentry george-sentry deleted the george/push-taskbroker/add-sent-flag branch April 13, 2026 21:08