Terminate tasks with late acknowledgement on connection loss #6654

Merged
thedrow merged 17 commits into master from terminate-ack-late-tasks-on-connection-loss on Apr 28, 2021

Conversation

thedrow
Member

@thedrow thedrow commented Mar 2, 2021

Note: Before submitting this pull request, please review our contributing guidelines.

Description

Tasks with late acknowledgement keep running after a restart even though the connection is lost, and (to the best of my knowledge) they can no longer be acked.

This results in log errors such as these:

[2021-03-02 18:29:58,355: CRITICAL/MainProcess] Couldn't ack 5, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
Traceback (most recent call last):
  File "/home/thedrow/.virtualenvs/celery/lib/python3.9/site-packages/kombu/message.py", line 128, in ack_log_error
    self.ack(multiple=multiple)
  File "/home/thedrow/.virtualenvs/celery/lib/python3.9/site-packages/kombu/message.py", line 123, in ack
    self.channel.basic_ack(self.delivery_tag, multiple=multiple)
  File "/home/thedrow/.virtualenvs/celery/lib/python3.9/site-packages/amqp/channel.py", line 1391, in basic_ack
    return self.send_method(
  File "/home/thedrow/.virtualenvs/celery/lib/python3.9/site-packages/amqp/abstract_channel.py", line 54, in send_method
    raise RecoverableConnectionError('connection already closed')
amqp.exceptions.RecoverableConnectionError: connection already closed
[2021-03-02 18:29:58,356: CRITICAL/MainProcess] Couldn't ack 12, reason:RecoverableConnectionError(None, 'connection already closed', None, '')

If we cannot recover from this situation, we must terminate all tasks with late acknowledgement.
Any other suggestions would be welcome.
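To make the intent above concrete, here is a rough, hypothetical sketch of the idea rather than the merged diff; `active_requests` and the `cancel()` call are assumed names standing in for the worker's real internals:

```python
# Hypothetical sketch only, not the code merged in this PR.
# `active_requests` is assumed to be the set of requests the worker is
# currently executing when the broker connection is lost.
def on_connection_lost(active_requests, exc):
    for request in active_requests:
        if request.task.acks_late:
            # The message's channel belongs to the dead connection, so a
            # later ack would raise RecoverableConnectionError. Cancel the
            # task (without revoking it) so it can run again once the
            # broker redelivers the message.
            request.cancel(exc)  # assumed cancellation primitive
```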

@codecov

codecov bot commented Mar 2, 2021

Codecov Report

Merging #6654 (028f334) into master (850c62a) will increase coverage by 0.06%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #6654      +/-   ##
==========================================
+ Coverage   70.52%   70.58%   +0.06%     
==========================================
  Files         138      138              
  Lines       16497    16531      +34     
  Branches     2066     2074       +8     
==========================================
+ Hits        11634    11668      +34     
  Misses       4663     4663              
  Partials      200      200              
Flag Coverage Δ
unittests 70.58% <100.00%> (+0.06%) ⬆️


Impacted Files Coverage Δ
celery/app/defaults.py 97.33% <ø> (ø)
celery/bin/amqp.py 0.00% <ø> (ø)
celery/worker/consumer/consumer.py 93.57% <100.00%> (+0.18%) ⬆️
celery/worker/request.py 96.91% <100.00%> (+0.19%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lgtm-com

lgtm-com bot commented Mar 2, 2021

This pull request introduces 1 alert and fixes 2 when merging acc6c1e into cfa1b41 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

@xirdneh xirdneh self-requested a review March 11, 2021 22:06
@xirdneh
Member

xirdneh commented Mar 11, 2021

@thedrow Can we add a test for this?

@thedrow
Member Author

thedrow commented Mar 16, 2021

I'm not sure how, since you would need to cause a disconnection from the broker.
There are other problems with this PR; I'll update you soon.

@xirdneh
Member

xirdneh commented Mar 16, 2021

@thedrow That's what I was thinking. I was trying to figure out a way to mark a connection as disconnected so we can fake this case, but I don't know if it's worth it.
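For what it's worth, faking the disconnected state in a unit test might look roughly like this; the helper name is made up and the mock only reproduces the error seen in the traceback above:

```python
# Hypothetical test helper: build a message whose channel has lost its
# broker connection, so that acking it fails the same way as in the log.
from unittest.mock import Mock

from amqp.exceptions import RecoverableConnectionError


def make_unackable_message():
    message = Mock(name='message')
    message.channel.connection = None
    message.ack.side_effect = RecoverableConnectionError(
        None, 'connection already closed', None, '')
    return message
```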

matusvalo
matusvalo previously approved these changes Mar 22, 2021
Member

@matusvalo matusvalo left a comment


LGTM

@thedrow
Member Author

thedrow commented Mar 23, 2021

@matusvalo Actually, I was waiting for you to review this PR thoroughly.
Is there a way to revive the channel?

The problem is that by the time we restart the connection, the message's channel has its connection set to None.
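In other words, the dead channel can be detected directly on the message; a minimal, hypothetical check (the function name is illustrative):

```python
def channel_is_unusable(message):
    # After the worker reconnects, the old message still references the old
    # channel, whose connection attribute has been set to None.
    channel = getattr(message, 'channel', None)
    return channel is None or channel.connection is None
```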

Is there an alternative here?
I'd rather ack the message since the task is done.

@thedrow
Member Author

thedrow commented Mar 23, 2021

@xirdneh @matusvalo There's also a request from a client of mine to terminate these tasks without revoking them.
Maybe we should introduce a setting for it? Is it always better to revoke these tasks?

@matusvalo
Member

@thedrow Can you also show the Celery part of the stack trace? Kombu is able to recover the connection/channel automatically after a connection failure (see [1]), and it also supports refreshing the channel if needed.

Regarding your second question about revoking tasks: in this PR, when you terminate the request, is the task revoked? (I don't know all of Celery's internals in depth, so this may be a naive question.)

[1] https://docs.celeryproject.org/projects/kombu/en/master/userguide/failover.html#operation-failover
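A minimal sketch of the kombu facilities referenced in [1], assuming placeholder broker URLs; this is illustrative and not taken from the PR:

```python
# Illustrative only: kombu can fail over between broker URLs, re-establish
# the connection with retries, and wrap callables so they are retried on
# recoverable connection errors.
from kombu import Connection

with Connection('amqp://broker1:5672//;amqp://broker2:5672//') as conn:
    conn.ensure_connection(max_retries=3)

    producer = conn.Producer()
    safe_publish = conn.ensure(producer, producer.publish, max_retries=3)
    safe_publish({'hello': 'world'}, exchange='', routing_key='example')
```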

@matusvalo
Member

matusvalo commented Mar 26, 2021

My humble opinion: in distributed systems you can guarantee only one of the following two possibilities:

  1. the message is processed at most once (this is early acknowledgement)
  2. the message is processed at least once (this is late acknowledgement)

So when a user opts for late acknowledgement, they must be aware (and must design their solution accordingly) that data can be processed multiple times, i.e. the processing must be idempotent. Processing a task multiple times is therefore fine in this case, and the user has to take that into account.
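As a concrete (and assumed, not taken from this PR) illustration of that trade-off, a late-acknowledged task should be written so that a redelivery does no extra harm:

```python
# Illustrative idempotent task for late acknowledgement; the in-memory set
# stands in for a durable store keyed on a unique identifier.
from celery import Celery

app = Celery('example', broker='amqp://localhost//')

_settled = set()


@app.task(acks_late=True)
def settle_payment(payment_id):
    """Safe to run more than once: a redelivered message is a no-op."""
    if payment_id in _settled:
        return 'already settled'
    _settled.add(payment_id)
    return 'settled'
```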

@thedrow
Member Author

thedrow commented Apr 1, 2021

> @thedrow Can you also show the Celery part of the stack trace? Kombu is able to recover the connection/channel automatically after a connection failure (see [1]), and it also supports refreshing the channel if needed.

That's the only traceback we get.
The message object has a channel with its connection set to None.

> Regarding your second question about revoking tasks: in this PR, when you terminate the request, is the task revoked? (I don't know all of Celery's internals in depth, so this may be a naive question.)
>
> [1] https://docs.celeryproject.org/projects/kombu/en/master/userguide/failover.html#operation-failover

Yes. A client of mine thinks it's the wrong behavior. According to them, we should terminate the tasks without revoking them.

@thedrow
Member Author

thedrow commented Apr 8, 2021

Yes, we shouldn't revoke, as that would prevent the task from running again.
I feel like we need to introduce a new event.
I also hope that marking the task as retried is good enough and won't have any side effects. I'm testing this now.

What I'm also working on now is marking tasks that have already completed as successful, if that's possible.
If we can do that and check whether the task was done, we can ack it when we receive it again.
@celery/core-developers Does that sound feasible?
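If I understand the idea correctly, it would amount to something like this hypothetical check (the function name is illustrative, not a Celery API):

```python
# Hypothetical sketch: a redelivered message whose task already succeeded
# according to the result backend could be acked without re-executing it.
from celery import states


def can_ack_without_running(app, message, task_id):
    if not message.delivery_info.get('redelivered'):
        return False
    return app.AsyncResult(task_id).state == states.SUCCESS
```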

@lgtm-com

lgtm-com bot commented Apr 8, 2021

This pull request introduces 1 alert and fixes 2 when merging 7149523 into 4f2213a - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Non-exception in 'except' clause
  • 1 for Module is imported with 'import' and 'import from'

@thedrow thedrow force-pushed the terminate-ack-late-tasks-on-connection-loss branch from 85d8e9c to edf6e77 Compare April 8, 2021 12:35
@lgtm-com

lgtm-com bot commented Apr 8, 2021

This pull request introduces 1 alert and fixes 1 when merging cd3bf99 into 4f2213a - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Non-exception in 'except' clause

ahopkins
ahopkins previously approved these changes Apr 9, 2021
@thedrow thedrow force-pushed the terminate-ack-late-tasks-on-connection-loss branch from cd3bf99 to 8342818 Compare April 18, 2021 12:23
There already is a concept of abortable tasks so the term is overloaded.
If the worker already managed to report the task is revoked, there's no need to do it again.
Without this change, the `task-revoked` event and the `task_revoked` signal are sent twice.
worker_cancel_long_running_tasks_on_connection_loss is False by default since it is possibly a breaking change.
In 6.0 it will be True by default.
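Assuming the standard configuration mechanism, enabling the new behaviour would look roughly like this (see the documentation added in this PR for the authoritative reference):

```python
# Assumed usage of the setting described above; it is off by default until 6.0.
from celery import Celery

app = Celery('proj', broker='amqp://localhost//')
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True
```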
@thedrow thedrow force-pushed the terminate-ack-late-tasks-on-connection-loss branch from c4579f1 to 7d5917b Compare April 26, 2021 12:44
@thedrow thedrow marked this pull request as ready for review April 26, 2021 13:02
@thedrow thedrow requested a review from a team April 26, 2021 13:02
@lgtm-com

lgtm-com bot commented Apr 26, 2021

This pull request introduces 1 alert when merging 7d5917b into 850c62a - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Apr 26, 2021

This pull request introduces 1 alert when merging 028f334 into 850c62a - view on LGTM.com

new alerts:

  • 1 for Unused import

Contributor

@maybe-sybr maybe-sybr left a comment


Overall this lgtm. I'm happy to approve the structure of this and allow @thedrow to address or resolve my minor comments as he sees fit.

Edit: Oops, forgot to do this as a review rather than comments!

@thedrow thedrow merged commit 934a227 into master Apr 28, 2021
@thedrow thedrow deleted the terminate-ack-late-tasks-on-connection-loss branch April 28, 2021 09:53
jeyrce pushed a commit to jeyrce/celery that referenced this pull request Aug 25, 2021

* Terminate tasks with late acknowledgement on connection loss.

* Abort task instead of terminating.

Instead of terminating the task (which revokes it and prevents its execution in the future), abort the task.

* Fix serialization error.

* Remove debugging helpers.

* Avoid revoking the task if it is aborted.

* Rename `abort` to `cancel`.

There already is a concept of abortable tasks so the term is overloaded.

* The revoke flow is no longer called twice.

If the worker already managed to report the task is revoked, there's no need to do it again.
Without this change, the `task-revoked` event and the `task_revoked` signal are sent twice.

* Unify the flow of announcing a task as cancelled.

* Add feature flag.

worker_cancel_long_running_tasks_on_connection_loss is False by default since it is possibly a breaking change.
In 6.0 it will be True by default.

* Add documentation.

* Add unit test for the task cancelling behavior.

* isort.

* Add unit tests for request.cancel().

* isort & autopep8.

* Add test coverage for request.on_failure() changes.

* Add more test coverage.

* Add more test coverage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment