-
-
Notifications
You must be signed in to change notification settings - Fork 6
feat(taskbroker): Implement retry support for raw topics #630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
e0c6fcc
95413d0
5354ee7
78400d7
4344e64
a085bbb
4bdfe18
2322c19
c32a3be
71ff30e
1fd5b47
0bf9105
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -378,6 +378,14 @@ def handle_alarm(signum: int, frame: FrameType | None) -> None: | |
| status=next_state, | ||
| host=inflight.host, | ||
| receive_timestamp=inflight.receive_timestamp, | ||
| # Send max_attempts if this is a retry. Don't send it | ||
| # on every task as this codepath is relatively | ||
| # unoptimized on the broker side. | ||
| max_attempts=( | ||
| task_func.retry._times + 1 | ||
| if task_func.retry and next_state == TASK_ACTIVATION_STATUS_RETRY | ||
| else None | ||
| ), | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Raw retries lose delayMedium Severity On retry, the worker only sends Reviewed by Cursor Bugbot for commit 0bf9105. Configure here.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. damn, i thought we had a nonzero default value for this. will probably add
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| ) | ||
| ) | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -106,7 +106,14 @@ impl ConsumerService for TaskbrokerServer { | |
| metrics::counter!("grpc_server.set_status.failure").increment(1); | ||
| } | ||
|
|
||
| if let Some(ref tx) = self.update_tx { | ||
| let max_attempts = request.get_ref().max_attempts; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I already asked in a different comment, but I think that code is now outdated so I'll ask here again. How often will there be a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be present always in tasks that run out of standard topics. I'd go further and say that maybe we should simplify the system by moving all tasks to specify the retries via the set_status method so we do not maintain more than one implementation. Why does this affect the way batch is implemented ?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The field will be there unconditionally on all retry status updates. So retries are effectively not batched. I can implement batching for retries but this would increase complexity. Note that not every retry status update results in an additional DB query.
It's optional here for the sake of rollout. We can gradually increase the amount of tasks that send |
||
|
|
||
| // Use batching channel if available and we don't need to update retry state. | ||
| // If max_attempts is Some, we can't use batching API to update the activation, and have to | ||
| // fall back to individual set_status. | ||
| if let Some(ref tx) = self.update_tx | ||
| && max_attempts.is_none() | ||
| { | ||
|
Comment on lines
+111
to
+116
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taking a step back on the retry topic. I recognize this is a departure from the original intent of this PR, but it seems a lot simpler to me to manage it this way. The idea of the topic, to me, was meant to use it as a DLQ as well. Am I missing something ?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I am missing rationale for why the retry system was set up this way to begin with. Your suggestion also applies to how regular tasks work. I would not want to special-case raw-mode to handle retries fundamentally differently than regular tasks. My guess is that we wanted to keep the size of a database under control, therefore pruning queued retries out of the DB and putting them back into Kafka. If we say that this is not really a concern with AlloyDB then that's fine, but we'd have to validate that IMO I can explore this option in another PR, but not sure we should roll it out without having more context from folks who originally worked on taskbroker.
That is yet another topic. It can stay or go away regardless of what we decide wrt retries.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @markstory @enochtangg do you remember why we didn't "just" stick retries into the DB and produce them back into kafka? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I thought we picked up the task from the database to do the retry. @george-sentry did we change something in the push model ?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. from slack:
So I think I won't change anything here. |
||
| tx.send((id, status)) | ||
| .await | ||
| .map_err(|_| Status::internal("Status update channel closed"))?; | ||
|
Comment on lines
+109
to
119
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Bug: The Python worker unconditionally sends Suggested FixModify the Python worker to only send Prompt for AI Agent |
||
|
|
@@ -115,7 +122,7 @@ impl ConsumerService for TaskbrokerServer { | |
| return Ok(Response::new(SetTaskStatusResponse { task: None })); | ||
| } | ||
|
|
||
| match self.store.set_status(&id, status).await { | ||
| match self.store.set_status(&id, status, max_attempts).await { | ||
| Ok(Some(_)) => metrics::counter!( | ||
| "grpc_server.set_status", | ||
| "result" => "ok", | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,7 @@ | ||
| use std::sync::Arc; | ||
|
|
||
| use anyhow::Error; | ||
| use rdkafka::Message; | ||
| use rdkafka::message::OwnedMessage; | ||
|
|
||
| use crate::config::Config; | ||
|
|
@@ -12,27 +13,40 @@ use super::deserialize_raw::{self, RawConfig}; | |
| pub struct DeserializeConfig { | ||
| activation_config: DeserializeActivationConfig, | ||
| raw_config: Option<RawConfig>, | ||
| /// Retry topic always contains activations, even in raw_mode. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does this comment mean? Why is the raw mode distinction important? I thought raw mode was the only mode in which we used the retry topic?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I understand now. You mean that every message in the retry topic is guaranteed to be an activation even in raw mode, whereas messages in the "normal" topic in raw mode may not be?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Every message in the retry topic is a TaskActivation protobuf regardless of whether the consumer is in raw mode or normal mode. This is because we need to store the retry count in the topic, somehow. |
||
| retry_topic: Option<String>, | ||
| } | ||
|
|
||
| impl DeserializeConfig { | ||
| pub fn from_config(config: &Config) -> Self { | ||
| Self { | ||
| activation_config: DeserializeActivationConfig::from_config(config), | ||
| raw_config: RawConfig::from_config(config), | ||
| retry_topic: config.kafka_retry_topic.clone(), | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /// Create a unified deserializer that handles both normal and raw modes. | ||
| /// In raw mode, raw Kafka bytes are wrapped into a TaskActivation. | ||
| /// In normal mode, Kafka messages are expected to contain encoded TaskActivation protos. | ||
| /// Messages from the retry topic are always deserialized as activations. | ||
| pub fn new( | ||
| config: DeserializeConfig, | ||
| ) -> impl Fn(Arc<OwnedMessage>) -> Result<InflightActivation, Error> { | ||
| let raw_deserializer = config.raw_config.map(deserialize_raw::new); | ||
| let activation_deserializer = deserialize_activation::new(config.activation_config); | ||
| let retry_topic = config.retry_topic; | ||
|
|
||
| move |msg: Arc<OwnedMessage>| { | ||
| // Messages from the retry topic are always activations | ||
| if let Some(ref retry_topic) = retry_topic | ||
| && msg.topic() == retry_topic | ||
| { | ||
| return activation_deserializer(msg); | ||
| } | ||
|
|
||
| // For main topic: use raw deserializer in raw_mode, else activation deserializer | ||
|
cursor[bot] marked this conversation as resolved.
|
||
| if let Some(ref raw_deserializer) = raw_deserializer { | ||
| raw_deserializer(msg) | ||
| } else { | ||
|
|
||


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an example of how the task will define this value ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing changes about how this value is defined. it's already public API: https://github.com/getsentry/sentry/blob/9ee21b63ae7bcd3bf9a002e077e7fc73b860c656/src/sentry/deletions/tasks/scheduled.py#L102-L103