
Conversation

@evanh evanh commented Nov 20, 2024

If a task is marked as at_most_once, check the cache to see whether the task has already been seen. If it has, assume the task has already been executed and continue. Otherwise, store the task ID and execute the task.

Depends on getsentry/sentry-protos#66
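The guard described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the cache class, key format, and `should_run` helper are assumptions standing in for Sentry's actual cache backend and worker internals.

```python
class InMemoryCache:
    """Stand-in for the shared cache used for idempotency keys."""

    def __init__(self):
        self._store = {}

    def add(self, key, value):
        # Atomic check-and-set semantics: returns True only if the key was
        # not already present. The first worker to add the key "wins".
        if key in self._store:
            return False
        self._store[key] = value
        return True


cache = InMemoryCache()


def at_most_once_key(namespace, taskname, task_id):
    # Hypothetical key format for illustration.
    return f"tw:amo:{namespace}:{taskname}:{task_id}"


def should_run(activation):
    """Return False if this at_most_once task has already been seen."""
    key = at_most_once_key(
        activation["namespace"], activation["taskname"], activation["id"]
    )
    # If add() fails, another worker already recorded this task ID, so we
    # assume it has been (or is being) executed and skip it.
    return cache.add(key, "1")
```

The first worker to call `should_run` for a given activation gets `True`; any later worker handed the same activation gets `False` and skips execution.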

@evanh evanh requested a review from a team November 20, 2024 17:01
@evanh evanh requested a review from a team as a code owner November 20, 2024 17:01
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Nov 20, 2024

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines            | Patch %  | Lines
src/sentry/taskworker/worker.py     | 94.11%   | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #81048      +/-   ##
==========================================
- Coverage   80.35%   80.35%   -0.01%     
==========================================
  Files        7221     7221              
  Lines      319599   319544      -55     
  Branches    20783    20768      -15     
==========================================
- Hits       256808   256760      -48     
+ Misses      62396    62391       -5     
+ Partials      395      393       -2     

metrics.incr(
    "taskworker.worker.at_most_once.skipped",
    extra={"task": activation.taskname},
)
return None
Member
We could reply to the broker here that the task is 'complete'. That would help prevent the task from being handed to another worker, since with no response this worker would appear to be 'dead'.

Member Author

I don't think we want to do that, because the task isn't necessarily complete; it might be running elsewhere.

Member

True, the task could be running in another worker. The scenario I was concerned about is:

  1. Worker A picks up an at_most_once task. The task will be status=processing
  2. Worker A dies and can't send an update to taskbroker.
  3. The task will exceed its processing deadline, and be put back into pending.
  4. Worker B takes the task and returns early without any updates. The task will once again exceed its processing_deadline, and we'll burn worker time looping from steps 2 to 4 until the message is deadlettered.

We could have another status/state for 'failed because of idempotency' that could escape the loop. I haven't thought through how that status should be materially different from a regular failure, though.
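The failure mode in steps 1-4 can be simulated as a toy loop: the broker re-pends a task each time its processing deadline lapses with no worker update, until an attempts threshold deadletters it. All names and the threshold here are illustrative assumptions, not taskbroker's actual behavior.

```python
MAX_PROCESSING_ATTEMPTS = 3  # assumed broker threshold, for illustration


def broker_lifecycle(worker_responds):
    """Return the terminal (state, attempts) for a task.

    worker_responds(attempt) models whether the worker holding the task
    on that attempt sends any status update before the deadline.
    """
    attempts = 0
    while attempts < MAX_PROCESSING_ATTEMPTS:
        attempts += 1
        # A worker picks up the task; status=processing.
        if worker_responds(attempts):
            return "complete", attempts
        # No status update: the processing_deadline lapses and the task
        # goes back to pending, to be handed to another worker.
    return "deadletter", attempts


# An at_most_once task already in the idempotency cache: every worker
# returns early without reporting, so each attempt burns a full
# processing_deadline before the task is finally deadlettered.
state, attempts = broker_lifecycle(lambda attempt: False)

# Contrast: a worker that does respond on its second attempt completes.
ok_state, ok_attempts = broker_lifecycle(lambda attempt: attempt == 2)
```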

Member Author
@evanh evanh Nov 25, 2024

What if the worker marked the task completed as soon as it added the task to the idempotency cache? There's no way for the broker to safely assign the task again after it has been assigned once. And if the task fails before/during setting the ID, we know that no actual work has been done, so another worker could still execute it.

The issue here is that if the worker fails without updating the status to failure, the task won't be deadlettered because the broker thinks it was completed.
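The proposed ordering can be sketched like this, with the trade-off from the previous paragraph called out in the comments. The cache, broker, and function names are stand-ins for illustration, not the PR's code.

```python
class Cache:
    """Minimal idempotency cache with atomic add semantics."""

    def __init__(self):
        self._store = {}

    def add(self, key, value):
        if key in self._store:
            return False
        self._store[key] = value
        return True


class Broker:
    """Records the last status reported for each task."""

    def __init__(self):
        self.status = {}

    def set_status(self, task_id, status):
        self.status[task_id] = status


def execute_at_most_once(activation, cache, broker, run_task):
    key = f"tw:amo:{activation['id']}"  # hypothetical key format
    if not cache.add(key, "1"):
        # Another worker already claimed this task; skip without re-acking.
        return "skipped"
    # Acking 'complete' here means the broker can never reassign the task,
    # at the cost noted above: if run_task() now fails, the broker still
    # believes the task completed and it will not be deadlettered.
    broker.set_status(activation["id"], "complete")
    run_task(activation)
    return "executed"
```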

Member Author
@evanh evanh Nov 25, 2024

An alternative to this is to propagate the at_most_once state into the protobuf. Then the broker will know to never retry an at_most_once task, and it can send it immediately to the deadletter queue on a missed processing deadline.

Member

Including at_most_once in the protobuf would also let us handle processing_deadlines more efficiently: the broker could skip deadline retries when the worker never responds, while regular retries continue to work correctly (even though retries + at_most_once is nonsense anyway).
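The broker-side rule being discussed could look roughly like this, assuming the at_most_once flag is carried on the task. The field names, states, and retry policy shown are assumptions for illustration only.

```python
def on_processing_deadline_exceeded(task):
    """Decide the next state for a task whose worker stopped responding.

    `task` is a dict standing in for the decoded task protobuf; assumed
    fields: at_most_once (bool), attempts (int), max_attempts (int).
    """
    if task.get("at_most_once"):
        # Never retry: the task may have started running before the worker
        # went silent, so handing it to another worker could execute it a
        # second time. Deadletter immediately.
        return "deadletter"
    if task["attempts"] < task["max_attempts"]:
        return "pending"  # regular retry path for normal tasks
    return "deadletter"
```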

Member Author

Given that, I think this PR is OK to be merged as is. If a worker happens to get assigned a task that is in the cache, it should skip it.

There will be a followup PR in the taskbroker to do some checks to avoid that scenario.

Member
@markstory markstory left a comment

Looks good other than the package changes. I'll add a task to our backlog for the potential failure mode on processing_deadline loops.

@evanh evanh requested review from a team as code owners November 25, 2024 17:36
@evanh evanh requested review from a team and removed request for a team November 25, 2024 17:38
@evanh evanh merged commit 4618b45 into master Nov 26, 2024
52 checks passed
@evanh evanh deleted the evanh/fix/add-at-most-once-taskbroker branch November 26, 2024 20:52
andrewshie-sentry pushed a commit that referenced this pull request Dec 2, 2024
asottile-sentry added a commit that referenced this pull request Dec 4, 2024
we suspect this may have increased error rates in CI -- it appears to have been accidentally upgraded in #81048
@github-actions github-actions bot locked and limited conversation to collaborators Dec 12, 2024

Labels: Scope: Backend, Scope: Frontend

4 participants