
Successful tasks are not acked if acks_late=True on warm shutdown. #3802

Closed
2 tasks done
sww opened this issue Jan 31, 2017 · 22 comments

Comments

@sww
Contributor

sww commented Jan 31, 2017

Checklist

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
$ celery -A app report

software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:3.4.3
            billiard:3.5.0.2 redis:2.10.5
platform -> system:Darwin arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

broker_url: 'redis://127.0.0.1:6379/3'
  • I have verified that the issue exists against the master branch of Celery.

Steps to reproduce

Tasks that finish and succeed after the warm shutdown signal are not acked if acks_late=True. To reproduce:

  1. Create app.py:
from time import sleep

from celery import Celery

app = Celery('foo', broker='redis://127.0.0.1:6379/3')

@app.task(acks_late=True)
def foo():
    sleep(5)
    return 'foo'
  2. Queue up a task:
$ python -c "from app import foo; foo.apply_async()"
  3. Start the worker and issue a warm shutdown (CTRL+C) before the task finishes.
$ celery -A app worker --loglevel=debug
[2017-01-30 16:23:46,092: DEBUG/MainProcess] | Worker: Preparing bootsteps.
[2017-01-30 16:23:46,094: DEBUG/MainProcess] | Worker: Building graph...
[2017-01-30 16:23:46,094: DEBUG/MainProcess] | Worker: New boot order: {Timer, Hub, Pool, Autoscaler, StateDB, Beat, Consumer}
[2017-01-30 16:23:46,108: DEBUG/MainProcess] | Consumer: Preparing bootsteps.
[2017-01-30 16:23:46,109: DEBUG/MainProcess] | Consumer: Building graph...
[2017-01-30 16:23:46,124: DEBUG/MainProcess] | Consumer: New boot order: {Connection, Events, Mingle, Gossip, Agent, Tasks, Control, Heart, event loop}



celery@foobar v4.0.2 (latentcall)

Darwin-16.3.0-x86_64-i386-64bit 2017-01-30 16:23:46

[config]
.> app:         foo:0x1034ddc88
.> transport:   redis://127.0.0.1:6379/3
.> results:     disabled://
.> concurrency: 4 (prefork)
.> task events: OFF (enable -E to monitor tasks in this worker)

[queues]
.> celery           exchange=celery(direct) key=celery


[tasks]
  . app.foo
  . celery.accumulate
  . celery.backend_cleanup
  . celery.chain
  . celery.chord
  . celery.chord_unlock
  . celery.chunks
  . celery.group
  . celery.map
  . celery.starmap

[2017-01-30 16:23:46,139: DEBUG/MainProcess] | Worker: Starting Hub
[2017-01-30 16:23:46,139: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:46,140: DEBUG/MainProcess] | Worker: Starting Pool
[2017-01-30 16:23:46,248: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:46,249: DEBUG/MainProcess] | Worker: Starting Consumer
[2017-01-30 16:23:46,250: DEBUG/MainProcess] | Consumer: Starting Connection
[2017-01-30 16:23:46,268: INFO/MainProcess] Connected to redis://127.0.0.1:6379/3
[2017-01-30 16:23:46,268: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:46,269: DEBUG/MainProcess] | Consumer: Starting Events
[2017-01-30 16:23:46,278: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:46,278: DEBUG/MainProcess] | Consumer: Starting Mingle
[2017-01-30 16:23:46,278: INFO/MainProcess] mingle: searching for neighbors
[2017-01-30 16:23:47,309: INFO/MainProcess] mingle: all alone
[2017-01-30 16:23:47,309: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:47,309: DEBUG/MainProcess] | Consumer: Starting Gossip
[2017-01-30 16:23:47,316: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:47,316: DEBUG/MainProcess] | Consumer: Starting Tasks
[2017-01-30 16:23:47,320: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:47,320: DEBUG/MainProcess] | Consumer: Starting Control
[2017-01-30 16:23:47,325: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:47,325: DEBUG/MainProcess] | Consumer: Starting Heart
[2017-01-30 16:23:47,329: DEBUG/MainProcess] ^-- substep ok
[2017-01-30 16:23:47,329: DEBUG/MainProcess] | Consumer: Starting event loop
[2017-01-30 16:23:47,329: INFO/MainProcess] celery@foobar ready.
[2017-01-30 16:23:47,330: DEBUG/MainProcess] | Worker: Hub.register Pool...
[2017-01-30 16:23:47,330: DEBUG/MainProcess] basic.qos: prefetch_count->16
[2017-01-30 16:23:48,093: INFO/MainProcess] Received task: app.foo[a1d83951-0e24-488b-b400-0ddbd4161254]
[2017-01-30 16:23:48,093: DEBUG/MainProcess] TaskPool: Apply <function _fast_trace_task at 0x10357c598> (args:('app.foo', 'a1d83951-0e24-488b-b400-0ddbd4161254', {'kwargsrepr': '{}', 'id': 'a1d83951-0e24-488b-b400-0ddbd4161254', 'task': 'app.foo', 'expires': None, 'reply_to': '6de20f21-2460-3189-b534-29ae63c15d1c', 'lang': 'py', 'group': None, 'argsrepr': '()', 'eta': None, 'retries': 0, 'delivery_info': {'exchange': '', 'redelivered': None, 'routing_key': 'celery', 'priority': 0}, 'root_id': 'a1d83951-0e24-488b-b400-0ddbd4161254', 'correlation_id': 'a1d83951-0e24-488b-b400-0ddbd4161254', 'origin': 'gen52076@foobar', 'timelimit': [None, None], 'parent_id': None}, b'[[], {}, {"chord": null, "callbacks": null, "chain": null, "errbacks": null}]', 'application/json', 'utf-8') kwargs:{})
[2017-01-30 16:23:48,095: DEBUG/MainProcess] Task accepted: app.foo[a1d83951-0e24-488b-b400-0ddbd4161254] pid:52112
^C
worker: Hitting Ctrl+C again will terminate all running tasks!

worker: Warm shutdown (MainProcess)
[2017-01-30 16:23:49,613: DEBUG/MainProcess] | Worker: Closing Hub...
[2017-01-30 16:23:49,613: DEBUG/MainProcess] | Worker: Closing Pool...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Worker: Closing Consumer...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Worker: Stopping Consumer...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Consumer: Closing Connection...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Consumer: Closing Events...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Consumer: Closing Mingle...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Consumer: Closing Gossip...
[2017-01-30 16:23:49,614: DEBUG/MainProcess] | Consumer: Closing Tasks...
[2017-01-30 16:23:49,615: DEBUG/MainProcess] | Consumer: Closing Control...
[2017-01-30 16:23:49,615: DEBUG/MainProcess] | Consumer: Closing Heart...
[2017-01-30 16:23:49,615: DEBUG/MainProcess] | Consumer: Closing event loop...
[2017-01-30 16:23:49,615: DEBUG/MainProcess] | Consumer: Stopping event loop...
[2017-01-30 16:23:49,615: DEBUG/MainProcess] | Consumer: Stopping Heart...
[2017-01-30 16:23:49,617: DEBUG/MainProcess] | Consumer: Stopping Control...
[2017-01-30 16:23:49,620: DEBUG/MainProcess] | Consumer: Stopping Tasks...
[2017-01-30 16:23:49,620: DEBUG/MainProcess] Canceling task consumer...
[2017-01-30 16:23:49,621: DEBUG/MainProcess] | Consumer: Stopping Gossip...
[2017-01-30 16:23:49,624: DEBUG/MainProcess] | Consumer: Stopping Mingle...
[2017-01-30 16:23:49,624: DEBUG/MainProcess] | Consumer: Stopping Events...
[2017-01-30 16:23:49,624: DEBUG/MainProcess] | Consumer: Stopping Connection...
[2017-01-30 16:23:49,625: DEBUG/MainProcess] | Worker: Stopping Pool...
[2017-01-30 16:23:53,099: INFO/PoolWorker-2] Task app.foo[a1d83951-0e24-488b-b400-0ddbd4161254] succeeded in 5.002987815998495s: 'foo'
[2017-01-30 16:23:54,103: DEBUG/MainProcess] | Worker: Stopping Hub...
[2017-01-30 16:23:54,103: DEBUG/MainProcess] | Consumer: Shutdown Heart...
[2017-01-30 16:23:54,104: DEBUG/MainProcess] | Consumer: Shutdown Control...
[2017-01-30 16:23:54,104: DEBUG/MainProcess] | Consumer: Shutdown Tasks...
[2017-01-30 16:23:54,104: DEBUG/MainProcess] Canceling task consumer...
[2017-01-30 16:23:54,104: DEBUG/MainProcess] Closing consumer channel...
[2017-01-30 16:23:54,104: DEBUG/MainProcess] | Consumer: Shutdown Gossip...
[2017-01-30 16:23:54,104: DEBUG/MainProcess] | Consumer: Shutdown Events...
[2017-01-30 16:23:54,105: DEBUG/MainProcess] | Consumer: Shutdown Connection...
[2017-01-30 16:23:54,106: WARNING/MainProcess] Restoring 1 unacknowledged message(s)
[2017-01-30 16:23:54,111: DEBUG/MainProcess] removing tasks from inqueue until task handler finished

(Note the second-to-last line: the message is restored even though the log says the task succeeded.)

Expected behavior

The task is acked before exit.

Actual behavior

The task is not acked and is restored to the queue. The next time the worker starts up, the same task will be executed again.
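
For anyone who wants to confirm the redelivery from the broker side, a quick check is to look at the queue in Redis directly (a sketch, assuming the default kombu Redis transport layout, where pending messages sit in a list named after the queue):

$ redis-cli -n 3 LLEN celery          # 1 after "Restoring 1 unacknowledged message(s)"
$ redis-cli -n 3 LRANGE celery 0 -1   # the restored payload, including the original task id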

@markine

markine commented Feb 8, 2017

I wonder if this is related to #3796

@calvinwyoung

We're running into this issue as well. All long-running tasks that we kill gracefully mid-execution will get requeued, even if they complete successfully.

Has anyone made progress on this?

@dejlek
Contributor

dejlek commented Mar 27, 2017

I can confirm this is happening with Celery 4.0.2 (Redis backend)

@daviddavis
Contributor

FWIW, I just reproduced this problem with RabbitMQ and no Redis backend.

@cristianocca

I'm using RabbitMQ and I also notice that warm shutdowns requeue tasks with acks_late=True. However, it seems to me that the warm shutdown is actually killing the task as well, not really letting it finish, which is probably why it is queued back again.

My question is: shouldn't a warm shutdown ALWAYS allow the task to finish, even with acks_late=True, and ack the task once it completes, rather than killing it mid-progress? I couldn't find more info anywhere about how warm shutdowns interact with acks_late=True; can anyone shed some light here?

@cristianocca

After testing more: doing a warm shutdown without killing the actual workers, I get the task succeeded log, yet the task is never acked and is queued again once Celery is started. This is quite bad. So far, the only workaround for "critical" tasks that must always run while avoiding duplicates as much as possible is to store whatever task id the job was queued with, clear that task id once the job is done, and always check for a non-empty task id when the task starts.

Are there any other workarounds for this? Is this something that might be fixed soon?
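
For reference, a minimal sketch of that kind of task-id guard (not the exact scheme described above; the Redis connection, key prefix, and expiry below are hypothetical): the task checks for a completion marker keyed on its task id, so a message redelivered after the work already finished becomes a no-op.

import redis

from app import app  # the Celery app from the reproduction above

marker_store = redis.Redis(host='127.0.0.1', port=6379, db=4)  # hypothetical dedup store

@app.task(bind=True, acks_late=True)
def critical_task(self):
    key = 'done:%s' % self.request.id
    # If a completion marker for this task id already exists, the message was
    # redelivered after the work finished (e.g. on warm shutdown) -- skip it.
    if marker_store.exists(key):
        return 'already processed'
    # ... do the actual critical work here ...
    marker_store.set(key, 1, ex=24 * 3600)  # mark this task id as done
    return 'ok'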

@alexmic

alexmic commented Dec 3, 2017

Similar issue here, although our configuration is to ack early (the default). What seems to have happened is that a task was prefetched by the worker, the worker received a TERM, the task started executing just after the signal and finished successfully, the worker restarted, and when it came back it received the same task again.

Here's a trace of our logs to illustrate:

16:37:47,299 celery.worker.strategy INFO [-]: Received task: ***[c341fea7-12a0-457c-bfab-65290ceeaab3] <- Task prefetched
1/Dec/2017:16:38:13 *** info worker: Warm shutdown (MainProcess) <- TERM 
16:38:14,347 plum.common.clients.http.base INFO [c341fea7-12a0-457c-bfab-65290ceeaab3]: HTTP client request: ... <- Task execution starts
16:38:14,733 celery.app.trace INFO [c341fea7-12a0-457c-bfab-65290ceeaab3]: Task ***[c341fea7-12a0-457c-bfab-65290ceeaab3] succeeded in 0.571790334997786s: None <- Task finished successfully 
1/Dec/2017:16:38:27 *** info  -------------- celery@878351b1-1a45-429d-b912-58f72dd1058b v4.1.0 (latentcall) <- worker restarted
16:38:27,610 celery.worker.strategy INFO [-]: Received task: task.accounts.transfer_deposit_from_receivables[c341fea7-12a0-457c-bfab-65290ceeaab3]  <- same task received again (note task id)

@auvipy
Member

auvipy commented Dec 21, 2017

Can anyone verify the issue on the latest master branch using the latest dependencies?

@auvipy auvipy added this to the v5.0.0 milestone Jan 15, 2018
@drien

drien commented Feb 20, 2018

I'm able to reproduce with the latest master (f3ef0df) and RabbitMQ.

Here's the output; I ran the same kind of sleep() task as the OP, with a couple of extra log messages:

...
[2018-02-20 18:28:08,873: INFO/MainProcess] Connected to amqp://user:**@some-rabbit:5672/testhost
[2018-02-20 18:28:08,888: INFO/MainProcess] mingle: searching for neighbors
[2018-02-20 18:28:09,948: INFO/MainProcess] mingle: all alone
[2018-02-20 18:28:09,981: INFO/MainProcess] celery@d920d5fc20d2 ready.
[2018-02-20 18:28:42,780: INFO/MainProcess] Received task: celeryapp.foo[e4b9ce2e-1bff-460d-b18f-033f50c2c90e]  
[2018-02-20 18:28:42,783: WARNING/ForkPoolWorker-2] Starting Task
^C
worker: Hitting Ctrl+C again will terminate all running tasks!

worker: Warm shutdown (MainProcess)
[2018-02-20 18:28:47,791: WARNING/ForkPoolWorker-2] Finished Sleeping
[2018-02-20 18:28:47,792: INFO/ForkPoolWorker-2] Task celeryapp.foo[e4b9ce2e-1bff-460d-b18f-033f50c2c90e] succeeded in 5.010052320998511s: 'foo'


...



/opt/celery/celery/platforms.py:796: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  uid=uid, euid=euid, gid=gid, egid=egid,
 
 -------------- celery@0bc72f9325b0 v4.2.0 (latentcall)
---- **** ----- 
--- * ***  * -- Linux-4.9.60-linuxkit-aufs-x86_64-with-debian-9.3 2018-02-20 18:28:57
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         foo:0x7f694179c6d8
- ** ---------- .> transport:   amqp://user:**@some-rabbit:5672/testhost
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 2 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . celeryapp.foo

[2018-02-20 18:28:57,634: INFO/MainProcess] Connected to amqp://user:**@some-rabbit:5672/testhost
[2018-02-20 18:28:57,650: INFO/MainProcess] mingle: searching for neighbors
[2018-02-20 18:28:58,689: INFO/MainProcess] mingle: all alone
[2018-02-20 18:28:58,760: INFO/MainProcess] celery@0bc72f9325b0 ready.
[2018-02-20 18:28:58,762: INFO/MainProcess] Received task: celeryapp.foo[e4b9ce2e-1bff-460d-b18f-033f50c2c90e]  
[2018-02-20 18:28:58,867: WARNING/ForkPoolWorker-2] Starting Task
[2018-02-20 18:29:03,873: WARNING/ForkPoolWorker-2] Finished Sleeping
[2018-02-20 18:29:03,874: INFO/ForkPoolWorker-2] Task celeryapp.foo[e4b9ce2e-1bff-460d-b18f-033f50c2c90e] succeeded in 5.007625436002854s: 'foo'

(Also hi there @alexmic! 👋)

@lovemyliwu

@sww I just did a tricky workaround that fixed this problem, please check:

https://gist.github.com/lovemyliwu/af5112de25b594205a76c3bfd00b9340

@auvipy auvipy modified the milestones: v5.0.0, v4.3 May 27, 2018
@crazeekennee

I battled this for a few days, googling like crazy, tripping through Python code, and thinking I must have screwed something up.

Finally, I found the magic spell that seems to have patched this issue on my end: I downgraded Celery to 4.1.1.

Note: patch != solution

Hope this helps someone else.
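
For anyone who needs the same stop-gap, pinning the version looks like this (a temporary measure, as noted above, not a fix):

$ pip install 'celery==4.1.1'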

@auvipy
Member

auvipy commented Jul 23, 2018

so this is a regression

@max-l

max-l commented Aug 23, 2018

I'm getting this problem with celery 4.2.0

  1. a task with acks_late=True is sent to a queue
  2. the task completes
  3. the worker is terminated with CTRL+C (running in the foreground)
  4. when the worker is restarted, the task is resubmitted

@ksonj

ksonj commented Jun 25, 2019

Might be fixed by celery/kombu#926

@auvipy auvipy closed this as completed Jun 25, 2019
@auvipy auvipy removed this from the 4.5 milestone Jun 25, 2019
@auvipy auvipy added this to the 4.4.0 milestone Jun 25, 2019
@davidt99

davidt99 commented Nov 7, 2019

This is still reproducible on celery 4.4.0rc3. The conversation doesn't state which commit closed the issue, and celery/kombu#926 doesn't seem related.

@auvipy auvipy reopened this Nov 7, 2019
@auvipy
Member

auvipy commented Nov 7, 2019

thanks for your report david

@auvipy auvipy modified the milestones: 4.4.0, 4.5 Nov 7, 2019
@dipAch
Contributor

dipAch commented Nov 18, 2019

@davidt99 Do you have the results backend configured? If yes, what does the task state in the result store say? Just enquiring to check the behaviour for one of my use-cases.

@davidt99

The celery output:

  -------------- celery@c7c755a0cc2d v4.3.0 (rhubarb)
 ---- **** ----- 
 --- * ***  * -- Linux-4.15.0-70-generic-x86_64-with-debian-9.6 2019-11-19 08:24:35
 -- * - **** --- 
 - ** ---------- [config]
 - ** ---------- .> app:         foo:0x7f8e15f7fba8
 - ** ---------- .> transport:   redis://foo_redis:6379/1
 - ** ---------- .> results:     redis://foo_redis:6379/1
 - *** --- * --- .> concurrency: 4 (prefork)
 -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
 --- ***** ----- 
  -------------- [queues]
                 .> celery-foo exchange=celery-foo(direct) key=celery-foo

We have task_acks_late and task_track_started turned on.
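
For reference, turning those two settings on in Celery 4.x looks roughly like this (a sketch using the lowercase setting names; the broker and backend URLs match the report above, adapt to your configuration style):

from celery import Celery

app = Celery('foo', broker='redis://foo_redis:6379/1', backend='redis://foo_redis:6379/1')
app.conf.task_acks_late = True        # ack only after the task returns
app.conf.task_track_started = True    # report a STARTED state to the result backend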

@dipAch
Contributor

dipAch commented Dec 6, 2019

I'm not able to reproduce this issue on Celery v4.4.0rc4 (cliffs). I have task_acks_late and task_track_started turned on as well. The task gets successfully acked before exit (on warm shutdown).

  • Task state in Redis:
127.0.0.1:6379[1]> GET "celery-task-meta-412fa999-b591-405c-b2b0-87681d2038f8"
"{\"status\": \"SUCCESS\", \"date_done\": \"2019-12-06T05:43:29.590482\", \"task_id\": \"412fa999-b591-405c-b2b0-87681d2038f8\", \"traceback\": null, \"result\": 215, \"children\": []}"
  • Project report:
software -> celery:4.4.0rc4 (cliffs) kombu:4.6.5 py:2.7.17
            billiard:3.6.1.0 redis:3.3.11
platform -> system:Linux arch:64bit
            kernel version:4.9.184-linuxkit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://redis:6379/1

broker_url: u'redis://redis:6379/0'
include: ['tasks']
result_backend: u'redis://redis:6379/1'

@auvipy auvipy modified the milestones: 4.5, 4.4.0 Dec 6, 2019
@auvipy auvipy closed this as completed Dec 6, 2019
@freedrikp

freedrikp commented Oct 29, 2020

I can reproduce this on 4.4.2 but only if I use a RabbitMQ broker. If I use a Redis broker it works as expected.

software -> celery:4.4.2 (cliffs) kombu:4.6.8 py:3.6.12
            billiard:3.6.3.0 py-amqp:2.5.2
platform -> system:Linux arch:64bit, ELF
            kernel version:4.15.0-65-generic imp:CPython

task_acks_late: True
task_acks_on_failure_or_timeout: True
task_publish_retry: False
task_reject_on_worker_lost: False
task_track_started: True
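
For anyone comparing the two transports, only the broker URL in the reproduction app needs to change (the credentials and hostname below are placeholders):

from celery import Celery

app = Celery('foo', broker='amqp://guest:guest@localhost:5672//')   # RabbitMQ: the failing case reported here
# app = Celery('foo', broker='redis://127.0.0.1:6379/3')            # Redis: reported as working as expected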

@freedrikp

Can reproduce with Celery 5.0.4 as well:

software -> celery:5.0.4 (singularity) kombu:5.0.2 py:3.6.12
            billiard:3.6.3.0 py-amqp:5.0.2

@supreme-core

Is this issue resolved for Celery 5.3.1?
