fix(taskbroker): Add at_most_once support #81048
Conversation
Codecov Report

✅ All tests successful. No failed tests found.

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #81048      +/-   ##
==========================================
- Coverage   80.35%   80.35%   -0.01%
==========================================
  Files        7221     7221
  Lines      319599   319544      -55
  Branches    20783    20768      -15
==========================================
- Hits       256808   256760      -48
+ Misses      62396    62391       -5
+ Partials      395      393       -2
    metrics.incr(
        "taskworker.worker.at_most_once.skipped", extra={"task": activation.taskname}
    )
    return None
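The skip path above implies a check-then-set against an idempotency cache. A minimal sketch of that claim logic, using an in-memory set as a stand-in for the real shared cache (the class, key format, and method names here are illustrative assumptions, not Sentry's actual implementation):

```python
from collections import namedtuple

# Hypothetical activation shape; the real protobuf message has more fields.
Activation = namedtuple("Activation", ["id", "taskname"])


class AtMostOnceCache:
    """Illustrative stand-in for a shared cache (e.g. memcached)."""

    def __init__(self):
        self._seen = set()

    def _key(self, activation):
        # Assumed key format for demonstration purposes.
        return f"taskworker:at_most_once:{activation.id}"

    def try_acquire(self, activation):
        # Returns True only for the first worker to claim this task.
        # On a duplicate delivery the worker should skip and return None.
        key = self._key(activation)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a real deployment this check-then-set must be a single atomic operation (e.g. a cache `add`), otherwise two workers could both pass the check before either stores the key.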
We could reply to the broker here that the task is 'complete'. That could help prevent the task from being given to another worker as this worker would appear to be 'dead' due to no response.
I don't think we want to do that, because the task isn't necessarily complete, it might be running elsewhere.
True, the task could be running in another worker. The scenario I was concerned about is:
- Worker A picks up an at_most_once task. The task will be status=processing
- Worker A dies and can't send an update to taskbroker.
- The task will exceed its processing deadline, and be put back into pending.
- Worker B takes the task and returns early without any updates. The task will once again exceed its processing_deadline, and we'll burn worker time looping from steps 2 to 4 until the message is deadlettered.
We could add another status/state for 'failed because of idempotency' that could escape the loop. I haven't thought through how that status would be materially different from a regular failure, though.
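The loop in steps 2 through 4 can be modeled as a toy state machine. The status names and the max_deadline_misses threshold below are illustrative assumptions, not the broker's actual configuration:

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    DEADLETTER = "deadletter"


def run_until_deadletter(max_deadline_misses=3):
    """Model the failure mode described above: each worker claims the
    task, sees the idempotency key, and returns without sending an
    update, so the processing deadline expires every time until the
    broker gives up and deadletters the message."""
    status, misses = Status.PENDING, 0
    while status != Status.DEADLETTER:
        status = Status.PROCESSING  # broker hands the task to a worker
        # worker hits the idempotency cache and returns with no update
        misses += 1                 # processing_deadline expires
        if misses < max_deadline_misses:
            status = Status.PENDING  # broker re-queues the task
        else:
            status = Status.DEADLETTER
    return misses
```

Each iteration burns one worker slot for a full processing_deadline, which is the wasted time the comment above is worried about.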
What if the worker marked the task completed as soon as it added the task to the idempotency cache? There's no way for the broker to safely assign the task again after it has been assigned once. And if the worker fails before or during setting the ID, we know that no actual work has been done, so another worker could still execute it.
The issue here is that if the worker fails without updating the status to failure, the task won't be deadlettered because the broker thinks it was completed.
An alternative is to propagate the at_most_once state into the protobuf. Then the broker will know never to retry an at-most-once task, and it can send the task straight to the deadletter queue when a processing deadline is exceeded.
Including at_most_once in the protobuf would let us solve processing_deadlines more efficiently as the broker could skip doing deadline retries if the worker never responds, and regular retries could continue to work correctly, despite retries + at_most_once being nonsense.
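Under that design, the broker's deadline handler could branch on the flag. A hedged sketch, where the at_most_once field and the handler function are assumptions rather than the actual broker code:

```python
def on_processing_deadline_exceeded(activation):
    """Hypothetical broker-side decision when a worker misses its
    processing deadline. If the activation carried an at_most_once
    flag, re-queueing is never safe: the work may have already
    started, so the task goes straight to the deadletter queue."""
    if activation.get("at_most_once"):
        return "deadletter"
    # Normal tasks go back to pending for another delivery attempt.
    return "pending"
```

This removes the re-delivery loop entirely for at-most-once tasks, while leaving regular retry behavior untouched.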
Given that, I think this PR is OK to be merged as is. If a worker happens to get assigned a task that is in the cache, it should skip it.
There will be a followup PR in the taskbroker to do some checks to avoid that scenario.
markstory left a comment:
Looks good other than the package changes. I'll add a task to our backlog for the potential failure mode on processing_deadline loops.
If a task is marked as at_most_once, then check the cache to see if the task has already been seen before. If it has, assume the task has already been executed and continue. Otherwise store the task ID and execute the task.

Depends on getsentry/sentry-protos#66

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>