fix(taskbroker): Add at_most_once support #81048
Conversation
Codecov Report

✅ All tests successful. No failed tests found.

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #81048      +/-   ##
==========================================
- Coverage   80.35%   80.35%   -0.01%
==========================================
  Files        7221     7221
  Lines      319599   319544      -55
  Branches    20783    20768      -15
==========================================
- Hits       256808   256760      -48
+ Misses      62396    62391       -5
+ Partials      395      393       -2
    metrics.incr(
        "taskworker.worker.at_most_once.skipped", extra={"task": activation.taskname}
    )
    return None
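The skip path above implies a check-then-set against an idempotency cache. A minimal sketch of that claim logic, using an in-memory set as a stand-in for the real shared cache (the class, key format, and method names here are illustrative assumptions, not Sentry's actual implementation):

```python
from collections import namedtuple

# Hypothetical activation shape; the real protobuf message has more fields.
Activation = namedtuple("Activation", ["id", "taskname"])


class AtMostOnceCache:
    """Illustrative stand-in for a shared cache (e.g. memcached)."""

    def __init__(self):
        self._seen = set()

    def _key(self, activation):
        # Assumed key format for demonstration purposes.
        return f"taskworker:at_most_once:{activation.id}"

    def try_acquire(self, activation):
        # Returns True only for the first worker to claim this task.
        # On a duplicate delivery the worker should skip and return None.
        key = self._key(activation)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a real deployment this check-then-set must be a single atomic operation (e.g. a cache `add`), otherwise two workers could both pass the check before either stores the key.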
We could reply to the broker here that the task is 'complete'. That could help prevent the task from being given to another worker as this worker would appear to be 'dead' due to no response.
I don't think we want to do that, because the task isn't necessarily complete, it might be running elsewhere.
True, the task could be running in another worker. The scenario I was concerned about is:
- Worker A picks up an at_most_once task. The task will be status=processing
- Worker A dies and can't send an update to taskbroker.
- The task will exceed its processing deadline, and be put back into pending.
- Worker B takes the task and returns early without any updates. The task will once again exceed its processing_deadline, and we'll burn worker time looping from steps 2 to 4 until the message is deadlettered.
We could add another status/state for 'failed because of idempotency' that could escape the loop. I haven't thought through how that status would be materially different from a regular failure, though.
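The loop in steps 2 through 4 can be modeled as a toy state machine. The status names and the max_deadline_misses threshold below are illustrative assumptions, not the broker's actual configuration:

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    DEADLETTER = "deadletter"


def run_until_deadletter(max_deadline_misses=3):
    """Model the failure mode described above: each worker claims the
    task, sees the idempotency key, and returns without sending an
    update, so the processing deadline expires every time until the
    broker gives up and deadletters the message."""
    status, misses = Status.PENDING, 0
    while status != Status.DEADLETTER:
        status = Status.PROCESSING  # broker hands the task to a worker
        # worker hits the idempotency cache and returns with no update
        misses += 1                 # processing_deadline expires
        if misses < max_deadline_misses:
            status = Status.PENDING  # broker re-queues the task
        else:
            status = Status.DEADLETTER
    return misses
```

Each iteration burns one worker slot for a full processing_deadline, which is the wasted time the comment above is worried about.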
What if the worker marked the task completed as soon as it added the task to the idempotency cache? There's no way for the broker to safely assign the task again after it has been assigned once. And if the worker fails before or during setting the ID, we know that no actual work has been done, so another worker could still execute it.
The issue here is that if the worker fails without updating the status to failure, the task won't be deadlettered because the broker thinks it was completed.
An alternative is to propagate the at_most_once state into the protobuf. Then the broker will know never to retry an at-most-once task, and it can send the task straight to the deadletter queue when a processing deadline is exceeded.
Including at_most_once in the protobuf would let us solve processing_deadlines more efficiently as the broker could skip doing deadline retries if the worker never responds, and regular retries could continue to work correctly, despite retries + at_most_once being nonsense.
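Under that design, the broker's deadline handler could branch on the flag. A hedged sketch, where the at_most_once field and the handler function are assumptions rather than the actual broker code:

```python
def on_processing_deadline_exceeded(activation):
    """Hypothetical broker-side decision when a worker misses its
    processing deadline. If the activation carried an at_most_once
    flag, re-queueing is never safe: the work may have already
    started, so the task goes straight to the deadletter queue."""
    if activation.get("at_most_once"):
        return "deadletter"
    # Normal tasks go back to pending for another delivery attempt.
    return "pending"
```

This removes the re-delivery loop entirely for at-most-once tasks, while leaving regular retry behavior untouched.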
Given that, I think this PR is OK to be merged as is. If a worker happens to get assigned a task that is in the cache, it should skip it.
There will be a followup PR in the taskbroker to do some checks to avoid that scenario.
markstory left a comment:
Looks good other than the package changes. I'll add a task to our backlog for the potential failure mode on processing_deadline loops.
If a task is marked as at_most_once, then check the cache to see if the task has already been seen before. If it has, assume the task has already been executed and continue. Otherwise store the task ID and execute the task.

Depends on getsentry/sentry-protos#66

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>