feat(taskbroker): Implement retry support for raw topics #630

Open
untitaker wants to merge 12 commits into main from feat/retry-support-raw-topics

@untitaker
Member

@untitaker untitaker commented May 11, 2026

Summary

Implement two independent features in this PR:

  • kafka_retry_topic, so that activations from retried tasks can be sent to a separate topic. That topic never contains raw messages.
  • The ability to get max_retries from the worker. Previously the producer configured max_retries and sent it as part of the activation, but that cannot work with raw topics, so an architecture change is needed.

See Architecture doc → Stage 4 for full context.

Dependencies

Depends on: getsentry/sentry-protos#251 (adds max_retries field to SetTaskStatusRequest)

ref STREAM-981

Add retry support for raw/passthrough topics (e.g. `ingest-events`) where
tasks don't have retry_state embedded in the message.

Changes:
- Config: Add `kafka_retry_topic` option for dedicated retry topic
- Store: Add `update_retry_state` method to update activation's retry_state
- gRPC: Handle `max_retries` in SetTaskStatus, call store.update_retry_state
- Upkeep: Route retries to dedicated retry topic when configured
- Consumer: Subscribe to both main and retry topics
- Deserializer: Topic-aware routing (retry topic always uses activation deserializer)
- Python client: Extract max_retries from Retry config, send in SetTaskStatusRequest

When a worker reports RETRY status with max_retries, the broker updates
the activation's retry_state and routes the retry to the dedicated retry
topic. This prevents retries from polluting the main topic where other
consumers (like SBC) can't parse activations.

See https://www.notion.so/3448b10e4b5d80e7a1efee6145d504c2 → Stage 4

Depends on: getsentry/sentry-protos#251

ref STREAM-981

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
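
The upkeep routing described in this commit message can be sketched roughly as follows. This is only an illustration, not the broker's Rust code: `BrokerConfig`, `kafka_topic`, and `topic_for_retry` are hypothetical names, and the fallback to the main topic when no retry topic is configured is an assumption read from "when configured". Only `kafka_retry_topic` is named in the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BrokerConfig:
    # Hypothetical config shape; only `kafka_retry_topic` is named in this PR.
    kafka_topic: str
    kafka_retry_topic: Optional[str] = None


def topic_for_retry(config: BrokerConfig) -> str:
    """Return the topic that a retried activation should be produced to."""
    # Assumption: fall back to the main topic when no retry topic is configured.
    return config.kafka_retry_topic or config.kafka_topic
```

With a raw topic such as `ingest-events`, setting `kafka_retry_topic` keeps activation protobufs out of the topic that other consumers read.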
@linear-code

linear-code Bot commented May 11, 2026

STREAM-981

max_retries (from Python's @task decorator) excludes the initial attempt,
while max_attempts includes it. Add 1 when storing to retry_state.

Example: @task(max_retries=3) means 4 total attempts (1 initial + 3 retries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
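
The off-by-one described above can be written down as a tiny helper. This is a sketch for illustration; `max_attempts_from_retries` is a hypothetical name and not a function from the client library.

```python
def max_attempts_from_retries(max_retries: int) -> int:
    """Convert a retry count (initial attempt excluded) to a total attempt count."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # One initial attempt plus the configured number of retries.
    return max_retries + 1
```

For `max_retries=3` this yields 4 total attempts, matching the example in the commit message.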
Comment thread src/grpc/server.rs Outdated
@untitaker untitaker marked this pull request as ready for review May 18, 2026 15:27
@untitaker untitaker requested a review from a team as a code owner May 18, 2026 15:27
Comment thread clients/python/src/taskbroker_client/worker/workerchild.py
Comment thread src/store/adapters/postgres.rs
untitaker and others added 2 commits May 18, 2026 17:31
The sentry-protos stubs now include max_attempts in SetTaskStatusRequest.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated
Comment thread src/store/adapters/sqlite.rs
Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated
Comment thread src/store/adapters/postgres.rs
@untitaker untitaker requested a review from george-sentry May 18, 2026 17:48
Comment thread src/grpc/server.rs
}

if let Some(ref tx) = self.update_tx {
let max_attempts = request.get_ref().max_attempts;
Member

I already asked in a different comment, but I think that code is now outdated so I'll ask here again. How often will there be a max_attempts field on the message? One out of every... 10? 100? 1000? If it's going to be present often, we'll need to rethink how batching works. Because batching is necessary to reach high throughput.

This should be present always in tasks that run out of standard topics. I'd go further and say that maybe we should simplify the system by moving all tasks to specify the retries via the set_status method so we do not maintain more than one implementation.

Why does this affect the way batching is implemented?

Member Author

The field will be there unconditionally on all retry status updates. So retries are effectively not batched. I can implement batching for retries but this would increase complexity.

Note that not every retry status update results in an additional DB query. max_attempts is only set in the DB if it wasn't there before.

> This should be present always in tasks that run out of standard topics.

It's optional here for the sake of rollout. We can gradually increase the number of tasks that send max_attempts through the worker and observe the impact on the broker. (The rollout mechanism isn't implemented here.)
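
To make the batching trade-off in this thread concrete, here is a rough Python model of the dispatch decision in the gRPC handler. It is a sketch under assumptions: `route_status_update`, the `batch` list, and the `set_status` callback are stand-ins for the Rust channel and store, not the real API.

```python
from typing import Callable, List, Optional, Tuple

StatusUpdate = Tuple[str, str]


def route_status_update(
    task_id: str,
    status: str,
    max_attempts: Optional[int],
    batch: Optional[List[StatusUpdate]],
    set_status: Callable[[str, str, Optional[int]], None],
) -> str:
    """Send an update down the batched path when possible, else write it directly."""
    # The batched path carries only (id, status), so it cannot update retry state.
    if batch is not None and max_attempts is None:
        batch.append((task_id, status))
        return "batched"
    set_status(task_id, status, max_attempts)
    return "individual"
```

Under this model, only updates that carry max_attempts leave the batched path, which is why the frequency of that field matters for throughput.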

Comment thread src/store/traits.rs
&self,
id: &str,
status: InflightActivationStatus,
max_attempts: Option<u32>,
Member

Adding this argument here sort of changes what set_status does. Before, it only set the status. Now it also updates retry_state. Most calls to set_status throughout the tests pass None because in most cases, it isn't needed. Perhaps there should be a separate method to handle this scenario? Like set_retry_status or set_retriable_status? Whatever name makes sense.

Member Author

It used to be a separate method, but in order to allow storage backends to optimize these queries I think we should pass it in a single method call.

Right now the postgres backend runs two queries, but one could imagine altering the internal schema so that setting the status and updating the max_attempts can be done in a single UPDATE. But the implementation already benefits from the fact that everything happens in set_status, because it can skip the second UPDATE for task activations that already have max_retries set.
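
The "skip the second UPDATE" behaviour can be modelled with an in-memory store. This is a minimal sketch, assuming a simplified row where a single `max_attempts` field stands in for retry_state; `InMemoryStore`, `Activation`, and the `writes` counter are hypothetical and only mimic the two-query shape described for the postgres backend.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Activation:
    status: str = "pending"
    max_attempts: Optional[int] = None  # stands in for retry_state


@dataclass
class InMemoryStore:
    rows: Dict[str, Activation] = field(default_factory=dict)
    writes: int = 0  # counts simulated UPDATE statements

    def set_status(self, id: str, status: str, max_attempts: Optional[int] = None) -> None:
        row = self.rows[id]
        row.status = status
        self.writes += 1  # first UPDATE: the status itself
        # Second UPDATE only when the worker sent a value and none is stored yet.
        if max_attempts is not None and row.max_attempts is None:
            row.max_attempts = max_attempts
            self.writes += 1
```

Calling `set_status` twice with the same max_attempts costs two writes the first time and one the second time, because the stored value is never overwritten.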

Comment thread src/kafka/deserialize.rs
pub struct DeserializeConfig {
activation_config: DeserializeActivationConfig,
raw_config: Option<RawConfig>,
/// Retry topic always contains activations, even in raw_mode.
Member

What does this comment mean? Why is the raw mode distinction important? I thought raw mode was the only mode in which we used the retry topic?

Member

I think I understand now. You mean that every message in the retry topic is guaranteed to be an activation even in raw mode, whereas messages in the "normal" topic in raw mode may not be?

Member Author

> I think I understand now. You mean that every message in the retry topic is guaranteed to be an activation even in raw mode, whereas messages in the "normal" topic in raw mode may not be?

Every message in the retry topic is a TaskActivation protobuf regardless of whether the consumer is in raw mode or normal mode. This is because we need to store the retry count in the topic, somehow.
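
The topic-aware routing discussed in this thread could look roughly like the sketch below. It is written in Python for illustration, while the real code lives in src/kafka/deserialize.rs; `pick_deserializer` and its parameters are hypothetical names.

```python
from typing import Callable, Optional

Deserializer = Callable[[bytes], dict]


def pick_deserializer(
    topic: str,
    retry_topic: Optional[str],
    raw_mode: bool,
    activation_deserializer: Deserializer,
    raw_deserializer: Deserializer,
) -> Deserializer:
    """Choose a deserializer from the topic a message arrived on."""
    # The retry topic always carries TaskActivation messages, even in raw mode.
    if retry_topic is not None and topic == retry_topic:
        return activation_deserializer
    # The main topic carries raw payloads only when the consumer is in raw mode.
    return raw_deserializer if raw_mode else activation_deserializer
```

Keying the choice on the source topic rather than on the mode alone is what lets one consumer subscribe to both topics.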

@fpacifici fpacifici left a comment

id=processing_result.task_id,
status=processing_result.status,
fetch_next_task=fetch_next_task,
max_attempts=processing_result.max_attempts,

Do we have an example of how the task will define this value?

Member Author

retry = task_func.retry
if retry and retry.should_retry(inflight.activation.retry_state, err):
    next_state = TASK_ACTIVATION_STATUS_RETRY
    max_attempts_val = retry._times + 1

I'm not following this. Why add 1?

Member

max_attempts shouldn't change. If we increase that value, the task will get an increasing number of retries and might not ever exhaust all retries.

Member Author

This is the constant retry.times from the public API: https://github.com/getsentry/sentry/blob/9ee21b63ae7bcd3bf9a002e077e7fc73b860c656/src/sentry/deletions/tasks/scheduled.py#L102-L103

retry._times is a constant, so retry._times + 1 is also a constant. We're not continuously incrementing it.

Comment thread src/grpc/server.rs
Comment on lines +111 to +116
// Use batching channel if available and we don't need to update retry state.
// If max_attempts is Some, we can't use batching API to update the activation, and have to
// fall back to individual set_status.
if let Some(ref tx) = self.update_tx
    && max_attempts.is_none()
{

Taking a step back on the retry topic.
If the retry topic contains activations rather than the original message, why do we even need a topic? Can't we just store the activation in the DB as pending and treat it like a task for the rest of its lifetime?

I recognize this is a departure from the original intent of this PR, but it seems a lot simpler to me to manage it this way. The idea of the topic, to me, was to use it as a DLQ as well. Am I missing something?

Member Author

@untitaker untitaker May 19, 2026

> can't we just store the activation in the DB as pending and treat it like a task for the rest of its lifetime ?

I am missing the rationale for why the retry system was set up this way to begin with. Your suggestion also applies to how regular tasks work. I would not want to special-case raw mode to handle retries fundamentally differently from regular tasks.

My guess is that we wanted to keep the size of the database under control, and therefore pruned queued retries out of the DB and put them back into Kafka. If we say that this is not really a concern with AlloyDB then that's fine, but we'd have to validate that, IMO.

I can explore this option in another PR, but not sure we should roll it out without having more context from folks who originally worked on taskbroker.

> The idea of the topic, to me, was meant to use it as a DLQ as well

That is yet another topic. It can stay or go away regardless of what we decide wrt retries.

Member Author

@markstory @enochtangg do you remember why we didn't "just" keep retries in the DB, and instead produce them back into Kafka?

@fpacifici fpacifici May 19, 2026

> Your suggestion also applies to how regular tasks work.

I thought we picked up the task from the database to do the retry. @george-sentry did we change something in the push model?

Member Author

From Slack:

@markstory: Keeping retries in sqlite/postgres could work but those retries will consume slots in sqlite/postgres. It will mean that a retry with no delay runs right away though instead of 'later' when it is found again in the topic.

@enochtangg: Another benefit I remember was because it resets the latency metric. Since latency is task dispatched - kafka receive latency, re-producing in kafka means we don't need to somehow fix that.

@fpacifici: The latency metric argument is important. I think we can keep it as it is and use the topic. It is true that in AlloyDB there will be more room, but we will have the sqlite around for a while. No need to make the change.

So I think I won't change anything here.
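
The latency argument can be illustrated with a small worked example. This is a sketch with invented timestamps, and it assumes the metric is computed as dispatch time minus Kafka receive time, which is how the Slack quote reads; `dispatch_latency` is a hypothetical helper.

```python
def dispatch_latency(kafka_received_at: float, dispatched_at: float) -> float:
    """Seconds between the broker consuming a message and dispatching the task."""
    return dispatched_at - kafka_received_at


# First attempt: consumed at t=0s, dispatched at t=2s.
first_attempt = dispatch_latency(0.0, 2.0)

# Retry re-produced to Kafka: consumed again at t=60s, dispatched at t=61s.
retry_via_topic = dispatch_latency(60.0, 61.0)

# Retry kept in the DB: the only receive timestamp is still the original one.
retry_via_db = dispatch_latency(0.0, 61.0)
```

In this example the re-produced retry reports 1 second, while the DB-resident retry would report 61 seconds unless the metric were adjusted.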

untitaker and others added 3 commits May 20, 2026 16:27
- Sync sentry-protos version across Rust (Cargo.toml) and Python (pyproject.toml)
- Update uv.lock to use sentry-protos 0.10.0 (was 0.8.13)
- Merge main to fix rebalance integration test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment thread clients/python/src/taskbroker_client/worker/workerchild.py
Comment thread src/kafka/deserialize.rs
Comment thread src/grpc/server.rs
Comment on lines +109 to 119
let max_attempts = request.get_ref().max_attempts;

// Use batching channel if available and we don't need to update retry state.
// If max_attempts is Some, we can't use batching API to update the activation, and have to
// fall back to individual set_status.
if let Some(ref tx) = self.update_tx
    && max_attempts.is_none()
{
    tx.send((id, status))
        .await
        .map_err(|_| Status::internal("Status update channel closed"))?;

Bug: The Python worker unconditionally sends max_attempts for tasks with retry policies. This forces the gRPC server to bypass its batching optimization, causing an N+1 database query issue.
Severity: HIGH

Suggested Fix

Modify the Python worker to only send max_attempts when a task is actually being retried. For completed or failed tasks, max_attempts should not be sent, allowing the gRPC server's max_attempts.is_none() check to pass and utilize the intended batching optimization.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/grpc/server.rs#L106-L119

Potential issue: The Python worker unconditionally sends a `max_attempts` value for any
task with a retry decorator, regardless of its final status (complete, failure, or
retry). In the gRPC server, the batching channel (`update_tx`) is only used if
`max_attempts.is_none()`. Because the Python client always sends `max_attempts`, this
condition is never met for tasks with retry policies. This forces a fallback to an
individual `self.store.set_status()` call for each task, creating a significant
performance regression by introducing an N+1 database query problem instead of using a
single batched update.
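
The suggested fix amounts to computing max_attempts only for RETRY updates. A minimal sketch of that rule, assuming a string placeholder for the status constant (the real one comes from sentry-protos) and a hypothetical helper name `max_attempts_for_update`:

```python
from typing import Optional

# Placeholder value; the real constant is defined in sentry-protos.
TASK_ACTIVATION_STATUS_RETRY = "retry"


def max_attempts_for_update(next_state: str, retry_times: Optional[int]) -> Optional[int]:
    """Return max_attempts only for RETRY updates of tasks that have a retry policy."""
    if retry_times is not None and next_state == TASK_ACTIVATION_STATUS_RETRY:
        # Total attempts = initial attempt + configured retries.
        return retry_times + 1
    # Complete or failed updates carry no max_attempts, so they can stay batched.
    return None
```

Here `retry_times` stands in for the task's configured retry count, with `None` meaning the task has no retry policy.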

@untitaker untitaker force-pushed the feat/retry-support-raw-topics branch from 62df298 to 0bf9105 Compare May 20, 2026 17:19
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 0bf9105.

task_func.retry._times + 1
if task_func.retry and next_state == TASK_ACTIVATION_STATUS_RETRY
else None
),

Raw retries lose delay

Medium Severity

On retry, the worker only sends max_attempts to the broker. For raw-mode activations, retry_state was absent in the stored blob; set_status then inserts a minimal state with only max_attempts. Upkeep republish uses delay_on_retry from that blob, so configured retry backoff is dropped and retries run immediately.

Member Author

@untitaker untitaker May 20, 2026

Damn, I thought we had a nonzero default value for this. I will probably add delay_on_retry to SetTaskStatusRequest.
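
The reported problem can be reproduced in a toy model. This is a sketch under assumptions: `RetryState` here is a simplified stand-in for the protobuf message, the `attempts` field and `retry_delay_seconds` helper are hypothetical, and treating an unset delay as zero is inferred from the bug report rather than checked against the broker code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryState:
    max_attempts: int
    attempts: int = 0
    # Seconds; stays unset when the worker sends only max_attempts.
    delay_on_retry: Optional[int] = None


def retry_delay_seconds(state: RetryState) -> int:
    """Delay a republish step would apply; an unset delay means 'retry immediately'."""
    return state.delay_on_retry or 0
```

A state built from max_attempts alone yields a delay of 0, which is why carrying delay_on_retry in SetTaskStatusRequest would be needed to preserve the configured backoff.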
