Terminate tasks with late acknowledgement on connection loss #6654
Codecov Report
@@ Coverage Diff @@
## master #6654 +/- ##
==========================================
+ Coverage 70.52% 70.58% +0.06%
==========================================
Files 138 138
Lines 16497 16531 +34
Branches 2066 2074 +8
==========================================
+ Hits 11634 11668 +34
Misses 4663 4663
Partials 200 200
This pull request introduces 1 alert and fixes 2 when merging acc6c1e into cfa1b41 (LGTM.com).
@thedrow Can we add a test for this?
I'm not sure how since you need to cause a disconnect from the broker.
@thedrow That's what I was thinking. I was trying to figure out a way to mark a connection as disconnected so we can fake this case, but I don't know if it's worth it.
LGTM
@matusvalo Actually, I was waiting on you to review this PR thoroughly. The problem is that by the time we restart the connection, the message's channel has its connection set to
Is there an alternative here?
@xirdneh @matusvalo There's also a request from a client of mine to terminate these tasks without revoking them.
@thedrow Can you also show the Celery part of the stack? Kombu can automatically recover the connection/channel after a connection failure (see [1]), and it also supports refreshing the channel if needed. Regarding your second question about revoking tasks: in this PR, when you terminate the request, is the task revoked? (I don't know all the details of Celery in depth, so this may be a stupid question.) [1] https://docs.celeryproject.org/projects/kombu/en/master/userguide/failover.html#operation-failover
My humble opinion: In distributed systems you can guarantee only one of the following two possibilities:
This means that a user of late acknowledgement must be aware that tasks can be processed multiple times, and must design them to be idempotent. Processing a task multiple times is therefore acceptable in this case, and the user must take that into account.
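To illustrate the point about idempotency, here is a minimal sketch in plain Python (no Celery or Kombu involved) of a handler that tolerates the same message being delivered twice, as can happen when a late-acknowledged task is redelivered after a connection loss. The names `process_once` and `completed`, and the message layout, are hypothetical and only stand in for whatever durable store and task body a real application would use.

```python
# Hypothetical sketch: an idempotent handler that tolerates redelivery.
# Nothing here is Celery API; `completed` stands in for a durable store.

completed = {}  # task id -> result of the first successful run


def process_once(task_id, payload):
    """Run the work once per task id; return the stored result on repeats."""
    if task_id in completed:
        return completed[task_id]
    result = sum(payload)  # placeholder for the real work
    completed[task_id] = result
    return result


# Simulate a redelivery: the message "a1" arrives twice because the
# first acknowledgement never reached the broker.
deliveries = [("a1", [1, 2, 3]), ("a1", [1, 2, 3]), ("b2", [10])]
results = [process_once(task_id, payload) for task_id, payload in deliveries]
```

The second delivery of `a1` returns the stored result instead of repeating the work, which is the property a task needs if it may run more than once.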
That's the only traceback we get.
Yes. A client of mine thinks it's the wrong behavior. According to them, we should terminate the tasks without revoking them.
Yes, we shouldn't revoke, as it will prevent the task from running again. What I'm also working on now is marking the tasks that were already done as successful, if that's possible.
This pull request introduces 1 alert and fixes 2 when merging 7149523 into 4f2213a (LGTM.com).
85d8e9c to edf6e77
This pull request introduces 1 alert and fixes 1 when merging cd3bf99 into 4f2213a (LGTM.com).
cd3bf99 to 8342818
c4579f1 to 7d5917b
This pull request introduces 1 alert when merging 7d5917b into 850c62a (LGTM.com).
This pull request introduces 1 alert when merging 028f334 into 850c62a (LGTM.com).
Overall this LGTM. I'm happy to approve the structure of this and allow @thedrow to address or resolve my minor comments as he sees fit.
Edit: Oops, forgot to do this as a review rather than comments!
* Terminate tasks with late acknowledgement on connection loss.
* Abort task instead of terminating.
  Instead of terminating the task (which revokes it and prevents its execution in the future), abort the task.
* Fix serialization error.
* Remove debugging helpers.
* Avoid revoking the task if it is aborted.
* Rename `abort` to `cancel`.
  There already is a concept of abortable tasks so the term is overloaded.
* The revoke flow is no longer called twice.
  If the worker already managed to report the task is revoked, there's no need to do it again. Without this change, the `task-revoked` event and the `task_revoked` signal are sent twice.
* Unify the flow of announcing a task as cancelled.
* Add feature flag.
  worker_cancel_long_running_tasks_on_connection_loss is False by default since it is possibly a breaking change. In 6.0 it will be True by default.
* Add documentation.
* Add unit test for the task cancelling behavior.
* isort.
* Add unit tests for request.cancel().
* isort & autopep8.
* Add test coverage for request.on_failure() changes.
* Add more test coverage.
* Add more test coverage.
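A minimal sketch of opting in to the new behaviour, assuming the settings live in a plain configuration module. The module name `celeryconfig` is an assumption for illustration; `task_acks_late` is the existing Celery setting for late acknowledgement, and the flag name and its default are taken from the commit message above.

```python
# celeryconfig.py (hypothetical module name), loaded with
# app.config_from_object("celeryconfig")

# Acknowledge messages after the task has run rather than before.
task_acks_late = True

# Opt in to cancelling long-running late-ack tasks when the broker
# connection is lost. False by default, per the commit message.
worker_cancel_long_running_tasks_on_connection_loss = True
```

Leaving the flag unset keeps the previous behaviour, which is why the change was considered safe to ship before 6.0.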
Description
Tasks with late acknowledgement keep running after restart although the connection is lost and they cannot be acked anymore (to the best of my knowledge).
This results in log errors such as these:
If we cannot recover from this situation, we must terminate all tasks with late acknowledgement.
Any other suggestions would be welcome.