Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same task runs multiple times at once? #4400

Open
jenstroeger opened this issue Nov 20, 2017 · 44 comments
Open

Same task runs multiple times at once? #4400

jenstroeger opened this issue Nov 20, 2017 · 44 comments

Comments

@jenstroeger
Copy link

@jenstroeger jenstroeger commented Nov 20, 2017

The issue is a repost of an unattended Google groups post Same task runs multiple times?

> ./bin/celery -A celery_app report

software -> celery:4.1.0 (latentcall) kombu:4.1.0 py:3.6.1
            billiard:3.5.0.3 redis:2.10.6
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://localhost:6379/2

broker_url: 'redis://localhost:6379/2'
result_backend: 'redis://localhost:6379/2'
task_serializer: 'json'
result_serializer: 'json'
accept_content: ['json']
timezone: 'Europe/Berlin'
enable_utc: True
imports: 'tasks'
task_routes: {
 'tasks': {'queue': 'celery-test-queue'}}

My application schedules a single group of two, sometimes three tasks, each of which with their own ETA within one hour. When the ETA arrives, I see the following in my celery log:

[2017-11-20 09:55:34,470: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 33.81780316866934s: None
[2017-11-20 09:55:34,481: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.009824380278587341s: None
[2017-11-20 09:55:34,622: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.14010038413107395s: None
…
[2017-11-20 09:55:37,890: INFO/ForkPoolWorker-8] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.012678759172558784s: None
[2017-11-20 09:55:37,891: INFO/ForkPoolWorker-2] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.01177949644625187s: None
[2017-11-20 09:55:37,899: INFO/ForkPoolWorker-8] Task tasks._test_exec[bd08ab85-28a8-488f-ba03-c2befde10054] succeeded in 0.008250340819358826s: None
…

This can repeat dozens of times. Note the first task’s 33 seconds execution time, and the use of different workers!

I have no explanation for this behavior, and would like to understand what’s going on here.

@georgepsarakis
Copy link
Member

@georgepsarakis georgepsarakis commented Nov 22, 2017

Maybe this is related to visibility timeout?

@jenstroeger
Copy link
Author

@jenstroeger jenstroeger commented Nov 23, 2017

@georgepsarakis Could you please elaborate on your suspicion?

@georgepsarakis
Copy link
Member

@georgepsarakis georgepsarakis commented Nov 24, 2017

As far as I know, this is a known issue for broker transports that do not have built-in acknowledgement characteristics of AMQP. The task will be assigned to a new worker if the task completion time exceeds the visibility timeout, thus you may see tasks being executed in parallel.

@jenstroeger
Copy link
Author

@jenstroeger jenstroeger commented Nov 24, 2017

@georgepsarakis So if the task is scheduled far ahead in the future, then I might see the above? The “visibility timeout” addresses that? From the documentation you linked:

The default visibility timeout for Redis is 1 hour.

Meaning that if within the hour the worker does not ack the task (i.e. run it?) that task is being sent to another worker which wouldn’t ack, and so on… Indeed this seems to be the case looking at the caveats section of the documentation; this related issue celery/kombu#337; or quoting from this blog:

But when developers just start using it, they regularly face abnormal behaviour of workers, specifically multiple execution of the same task by several workers. The reason which causes it is a visibility timeout setting.

Looks like setting the visibility_timeout to 31,540,000 seconds (one year) might be a quick fix.

@georgepsarakis
Copy link
Member

@georgepsarakis georgepsarakis commented Nov 25, 2017

I would say that if you increase the visibility timeout to 2 hours, your tasks will be executed only once.

So if you combine:

  • Redis broker
  • Late acknowledgement
  • ETA equal/above the visibility timeout
    you get multiple executions on the task.

I think what happens is:

  • After one hour passes one worker process starts processing the task.
  • A second worker will see that this message has been in the queue for longer than the visibility timeout and is being processed by another worker.
  • Message is restored in the queue.
  • Another worker starts processing the same message.
  • The above happens for all worker processes.

Looking into the Redis transport implementation, you will notice that it uses Sorted Sets, passing the queued time as a score to zadd. The message is restored based on that timestamp and comparing to an interval equal to the visibility timeout.

Hope this explains a bit the internals of the Redis transport.

@jenstroeger
Copy link
Author

@jenstroeger jenstroeger commented Nov 27, 2017

@georgepsarakis, I’m now thoroughly confused. If a task’s ETA is set for two months from now, why would a worker pick it up one hour after the tasks has been scheduled? Am I missing something?

My (incorrect?) assumption is that:

  • I schedule a task with an ETA at any time in the future; then
  • the task (i.e. its marshaled arguments) and ETA will sit in the queue until ETA arrives; then
  • a worker begins processing the task at ETA.

Your “I think what happens is:” above is quite different from my assumption.

@zivsu
Copy link

@zivsu zivsu commented Dec 17, 2017

I also encountered the same problem,have you solved it? @jenstroeger

Thanks!

@georgepsarakis
Copy link
Member

@georgepsarakis georgepsarakis commented Dec 17, 2017

@jenstroeger that does not sound like a feasible flow, I think the worker just continuously requeues the message in order to postpone execution until the ETA condition is finally met. The concept of the queue is to distribute messages as soon as they arrive, so the worker examines the message and just requeues.

Please note that this is my guess, I am not really aware of the internals of the ETA implementation.

@jenstroeger
Copy link
Author

@jenstroeger jenstroeger commented Dec 21, 2017

@zivsu, as mentioned above I’ve set the visibility_timeout to a very large number and that seems to have resolved the symptoms. However, as @georgepsarakis points out, that seems to be a poor approach.

I do not know the cause of the original problem nor how to address it properly.

@zivsu
Copy link

@zivsu zivsu commented Dec 21, 2017

@jenstroeger I read some blog, change visibility_timeout can not solve the problem completely, so I change my borker to rabbitmq.

@jenstroeger
Copy link
Author

@jenstroeger jenstroeger commented Dec 22, 2017

@zivsu, can you please share the link to the blog? Did you use Redis before?

@zivsu
Copy link

@zivsu zivsu commented Dec 22, 2017

@jenstroeger I can't find the blog, I used Redis as broker before. For schedule task, I choose rebbitmq to avoid the error happen again.

@Anton-Shutik
Copy link

@Anton-Shutik Anton-Shutik commented Apr 10, 2018

I have exactly same issue, my config is:
django==1.11.6
celery==4.2rc2
django-celery-beat==1.0.1

settings:

CELERY_ENABLE_UTC = True
# CELERY_TIMEZONE = 'America/Los_Angeles'

And that is the only one working combination of this settings. Also I have to schedule my periodic tasks in UTC timezone.
If you enable CELERY_TIMEZONE or disable CELERY_ENABLE_UTC it starts running periodic tasks multiple times.

@ghost
Copy link

@ghost ghost commented Apr 27, 2018

I have the save problem. the eta task excute multiply times when using redis as a broker.
any way to solve this..
look like change broker from redis to rabbitmq solve this problem..

@auvipy auvipy added this to the v5.0.0 milestone Apr 27, 2018
@2ps
Copy link

@2ps 2ps commented May 31, 2018

Using redis, there is a well-known issue when you specify a timezone other than UTC. To work around the issue, subclass the default app, and add your own timezone handling function:

from celery import Celery


class MyAppCelery(Celery):
    def now(self):
        """Return the current time and date as a datetime."""
        from datetime import datetime
        return datetime.now(self.timezone)

Hope that helps anyone else that is running into this problem.

@auvipy auvipy modified the milestones: v5.0.0, v4.3 May 31, 2018
@chrisconlan
Copy link

@chrisconlan chrisconlan commented Sep 26, 2018

I get this problem sometimes when frequently restarting celery jobs with beat on multicore machines. I've gotten in the habit of running ps aux | grep celery then kill <each_pid> to resolve it.

Best advice I have is to always make sure you see the "restart DONE" message before disconnecting from the machine.

@zetaab
Copy link

@zetaab zetaab commented Oct 10, 2018

{"log":"INFO 2018-10-09 17:41:08,468 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T17:41:08.468912644Z"}
{"log":"INFO 2018-10-09 17:41:08,468 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T17:41:08.468955918Z"}
{"log":"INFO 2018-10-09 19:46:04,293 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T19:46:04.293780045Z"}
{"log":"INFO 2018-10-09 19:46:04,293 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T19:46:04.293953621Z"}
{"log":"INFO 2018-10-09 20:46:04,802 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T20:46:04.802819711Z"}
{"log":"INFO 2018-10-09 20:46:04,802 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T20:46:04.802974829Z"}
{"log":"INFO 2018-10-09 21:46:05,335 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T21:46:05.336081133Z"}
{"log":"INFO 2018-10-09 21:46:05,335 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T21:46:05.336107517Z"}
{"log":"INFO 2018-10-09 22:46:05,900 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T22:46:05.901078395Z"}
{"log":"INFO 2018-10-09 22:46:05,900 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T22:46:05.901173663Z"}
{"log":"INFO 2018-10-09 23:46:06,484 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T23:46:06.485276904Z"}
{"log":"INFO 2018-10-09 23:46:06,484 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-09T23:46:06.485415253Z"}
{"log":"INFO 2018-10-10 00:46:07,072 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T00:46:07.072529828Z"}
{"log":"INFO 2018-10-10 00:46:07,072 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T00:46:07.072587887Z"}
{"log":"INFO 2018-10-10 01:46:07,602 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T01:46:07.60325321Z"}
{"log":"INFO 2018-10-10 01:46:07,602 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T01:46:07.603327426Z"}
{"log":"INFO 2018-10-10 02:46:08,155 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T02:46:08.155868992Z"}
{"log":"INFO 2018-10-10 02:46:08,155 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T02:46:08.155921893Z"}
{"log":"INFO 2018-10-10 03:46:08,753 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T03:46:08.75401387Z"}
{"log":"INFO 2018-10-10 03:46:08,753 strategy celery.worker.strategy 1 140031597243208 Received task: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f]  ETA:[2018-10-10 04:00:00+00:00] \n","stream":"stderr","time":"2018-10-10T03:46:08.754056891Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.013548928Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.013592318Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:71\n","stream":"stderr","time":"2018-10-10T04:00:00.014000106Z"}
{"log":"DEBUG 2018-10-10 04:00:00,013 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:71\n","stream":"stderr","time":"2018-10-10T04:00:00.014167558Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:64\n","stream":"stderr","time":"2018-10-10T04:00:00.014661348Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:64\n","stream":"stderr","time":"2018-10-10T04:00:00.014684354Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.01514884Z"}
{"log":"DEBUG 2018-10-10 04:00:00,014 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.015249646Z"}
{"log":"DEBUG 2018-10-10 04:00:00,015 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:66\n","stream":"stderr","time":"2018-10-10T04:00:00.01571124Z"}
{"log":"DEBUG 2018-10-10 04:00:00,015 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:66\n","stream":"stderr","time":"2018-10-10T04:00:00.01580249Z"}
{"log":"DEBUG 2018-10-10 04:00:00,019 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:68\n","stream":"stderr","time":"2018-10-10T04:00:00.019260948Z"}
{"log":"DEBUG 2018-10-10 04:00:00,019 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:68\n","stream":"stderr","time":"2018-10-10T04:00:00.019322151Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.245159563Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:70\n","stream":"stderr","time":"2018-10-10T04:00:00.245177267Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:67\n","stream":"stderr","time":"2018-10-10T04:00:00.245338722Z"}
{"log":"DEBUG 2018-10-10 04:00:00,245 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:67\n","stream":"stderr","time":"2018-10-10T04:00:00.245351289Z"}
{"log":"DEBUG 2018-10-10 04:00:00,256 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.256770035Z"}
{"log":"DEBUG 2018-10-10 04:00:00,256 request celery.worker.request 1 140031597243208 Task accepted: main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] pid:65\n","stream":"stderr","time":"2018-10-10T04:00:00.256788689Z"}
{"log":"INFO 2018-10-10 04:00:00,371 trace celery.app.trace 68 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.35710329699213617s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.371967002Z"}
{"log":"INFO 2018-10-10 04:00:00,371 trace celery.app.trace 68 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.35710329699213617s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.371983293Z"}
{"log":"INFO 2018-10-10 04:00:00,387 trace celery.app.trace 69 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.10637873200175818s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.388119538Z"}
{"log":"INFO 2018-10-10 04:00:00,387 trace celery.app.trace 69 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.10637873200175818s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.388166317Z"}
{"log":"INFO 2018-10-10 04:00:00,404 trace celery.app.trace 70 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.16254851799749304s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.404834545Z"}
{"log":"INFO 2018-10-10 04:00:00,404 trace celery.app.trace 70 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.16254851799749304s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.404862208Z"}
{"log":"INFO 2018-10-10 04:00:00,421 trace celery.app.trace 65 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.1654666289978195s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.421607856Z"}
{"log":"INFO 2018-10-10 04:00:00,421 trace celery.app.trace 65 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.1654666289978195s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.421674687Z"}
{"log":"INFO 2018-10-10 04:00:00,438 trace celery.app.trace 67 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.19588526099687442s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.438295459Z"}
{"log":"INFO 2018-10-10 04:00:00,438 trace celery.app.trace 67 140031597243208 Task main.batch.sendspam[2a6e5dc8-5fd2-40bd-8f65-7e7334a14b3f] succeeded in 0.19588526099687442s: None\n","stream":"stderr","time":"2018-10-10T04:00:00.438311386Z"}
...

if we check Received task timestamps, every hour it will get new task with same id. The result is that all ETA messages are sent more than 10 times. Looks like rabbitmq is only option if we want to use ETA

@Dedal-O
Copy link

@Dedal-O Dedal-O commented Jan 2, 2019

Rcently meet similar bug. Also ps aux | grep celery showed more processes than workers started, twice more. Appending parameter --pool gevent to command launching celery workers lowered number of processes to exact number of started workers and celery beat. And now i'm wathnig my tasks execution.

@auvipy auvipy modified the milestones: v4.3, v5.0.0 Jan 8, 2019
@killthekitten
Copy link

@killthekitten killthekitten commented May 25, 2019

Might another solution be disabling ack emulation entirely? i.e. "broker_transport_options": {"ack_emulation": False}. Any drawbacks for short-running tasks / countdowns?

@nitish-itilite
Copy link

@nitish-itilite nitish-itilite commented Dec 1, 2019

Celery is receiving same task twice, with same task id at same time.
Here are the logs

[2019-11-29 08:07:35,464: INFO/MainProcess] Received task: app.jobs.booking.bookFlightTask[657985d5-c3a3-438d-a524-dbb129529443]  
[2019-11-29 08:07:35,465: INFO/MainProcess] Received task: app.jobs.booking.bookFlightTask[657985d5-c3a3-438d-a524-dbb129529443]  
[2019-11-29 08:07:35,471: WARNING/ForkPoolWorker-4] in booking funtion1
[2019-11-29 08:07:35,473: WARNING/ForkPoolWorker-3] in booking funtion1
[2019-11-29 08:07:35,537: WARNING/ForkPoolWorker-3] book_request_pp
[2019-11-29 08:07:35,543: WARNING/ForkPoolWorker-4] book_request_pp

both are are running simultaneously,

using celery==4.4.0rc4 , boto3==1.9.232, kombu==4.6.6 with SQS in pyhton flask.
in SQS, Default Visibility Timeout is 30 minutes, and my task is not having ETA and not ack.

running worker like,
celery worker -A app.jobs.run -l info --pidfile=/var/run/celery/celery.pid --logfile=/var/log/celery/celery.log --time-limit=7200 --concurrency=8

@auvipy , any help would be great.

@auvipy
Copy link
Member

@auvipy auvipy commented Dec 1, 2019

which broker and result back end are you using? can you try switching to another back end?

@nitish-itilite
Copy link

@nitish-itilite nitish-itilite commented Dec 1, 2019

which broker and result back end are you using? can you try switching to another back end?

using SQS, result backend is MYSQL, with sqlalchemy.
details are here at SO, https://stackoverflow.com/questions/59123536/celery-is-receiving-same-task-twice-with-same-task-id-at-same-time
@auvipy can you please have a look.

@auvipy
Copy link
Member

@auvipy auvipy commented Dec 1, 2019

@thedrow do you face this issue in bloomberg?

@2ps
Copy link

@2ps 2ps commented Dec 2, 2019

@nitish-itilite : what timezone are you using for celery?

@nitish-itilite
Copy link

@nitish-itilite nitish-itilite commented Dec 2, 2019

@nitish-itilite : what timezone are you using for celery?

it is default UTC. in sqs , region is this US East (N. Virginia).

@chaws
Copy link

@chaws chaws commented Feb 18, 2020

I had a similar case running celery with SQS. I ran a dummy task with countdown=60, while visibility timeout in SQS is 30 seconds. Here's what I get:

NOTE: I've started celery with --concurrency=1, so there are two threads, right?

[2020-02-18 14:46:32 +0000] [INFO] Received task: notification[b483a22f-31cc-4335-9709-86041baa8f05]  ETA:[2020-02-18 14:47:31.898563+00:00] 
[2020-02-18 14:47:02 +0000] [INFO] Received task: notification[b483a22f-31cc-4335-9709-86041baa8f05]  ETA:[2020-02-18 14:47:31.898563+00:00] 
[2020-02-18 14:47:32 +0000] [INFO] Task notification[b483a22f-31cc-4335-9709-86041baa8f05] succeeded in 0.012232275999849662s: None
[2020-02-18 14:47:32 +0000] [INFO] Task notification[b483a22f-31cc-4335-9709-86041baa8f05] succeeded in 0.012890915997559205s: None

What happened in chronological order:

  1. 14:46:32 task was received, and SQS put it in inflight mode for 30 seconds
  2. 14:47:02 same task was received, because the visibility timeout expired
  3. 14:47:32 both tasks are executed at the same time

My guess is that this is a bug in Celery (?), I think it should've checked if the message id (b483a22f-31cc-4335-9709-86041baa8f05) has already been taken by that worker.

Maybe there could be a hash list with all message ids, so that celery could decide if a received task is valid for processing. Can celery do that?

NOTE 2:
We can't set a visibility timeout for too long, because if the worker does actually die, the message would take too long to be picked up by another worker. Setting it too low would expose this condition.

@JulieGoldberg
Copy link

@JulieGoldberg JulieGoldberg commented May 11, 2020

This seems to be happening to me too.

[2020-05-11 15:31:23,673: INFO/MainProcess] Received task: ee_external_attributes.tasks. recreate_specific_values[53046bd7-2a19-4f72-808f-d712eaecb0e8]
[2020-05-11 15:31:28,673: INFO/MainProcess] Received task: ee_external_attributes.tasks.recreate_specific_values[53046bd7-2a19-4f72-808f-d712eaecb0e8]

(I tweaked the task name in the logs for public posting.)

Due to uniqueness constraints, one of my workers throws an error partway through the task, and the other one succeeds.

I tried setting
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
That turned out not to help.

I'm using
celery==4.4.1
django-celery-results==1.2.1

And I'm using AWS SQS for the queue.

@JulieGoldberg
Copy link

@JulieGoldberg JulieGoldberg commented May 12, 2020

I do have a theory. Apparently my "Default Visibility Timeout" setting on my queue was only set to 5 seconds. It may be that the second worker pulled the job while the first was working on it, because it assumed the first worker had died. I upped the visibility timeout to 2 minutes, and it seems to be doing better. I had plenty of tasks that took 8-12 seconds, so 2 minutes may be overkill. But hopefully that solves it.

@jenstroeger
Copy link
Author

@jenstroeger jenstroeger commented May 12, 2020

It may be that the second worker pulled the job while the first was working on it, because it assumed the first worker had died.

@JulieGoldberg, that would be a crummy way for Celery to handle jobs. A Celery worker should never start a job that another worker has pulled off the queue and is actively processing; think that would be seriously broken. (But it’s Celery, I’m surprised by nothing anymore 😒)

@elMateso
Copy link

@elMateso elMateso commented Jul 17, 2020

I have a similar problem with an application that is running in Kubernetes. In the Kubernetes instance, we have 10 workers (celery app instance) who consume the tasks from the Redis.

Symptoms:
The celery worker schedules an ETA task twhich will be planed after 30 minutes. If the Kubernetes pod is rotated (the worker is killed by Kubernetes) or a newer version of the application is deployed (all workers are killed and new workers are created), all workers will take the scheduled task and start executing in the defined time.
For the worker, I tried to set different values of visibility_timeout for several hours up to one year, but the result was still the same. The same behavior was reached with the setting enable_utc = True, or a reduction of worker_prefetch_multiplier = 1.

@adnathanail
Copy link

@adnathanail adnathanail commented Jul 18, 2020

I don't know if this will help anyone but this was my issue:

I had tasks (report generation) that were being run when a page was loaded via GET. For some reason (something to do with favicons) Chrome would send 2 GET requests on every page load, triggering the task twice.
GET requests are supposed to be side effect free, so I turned them all into forms that you submit and the issue was resolved.

@thedrow
Copy link
Member

@thedrow thedrow commented Jul 27, 2020

It may be that the second worker pulled the job while the first was working on it, because it assumed the first worker had died.

@JulieGoldberg, that would be a crummy way for Celery to handle jobs. A Celery worker should never start a job that another worker has pulled off the queue and is actively processing; think that would be seriously broken. (But it’s Celery, I’m surprised by nothing anymore 😒)

Instead of complaining, you can help us fix the issue by coming up with a solution and a PR.

@turbaszek
Copy link

@turbaszek turbaszek commented Jul 27, 2020

I have a similar problem with an application that is running in Kubernetes. In the Kubernetes instance, we have 10 workers (celery app instance) who consume the tasks from the Redis.

Symptoms:
The celery worker schedules an ETA task twhich will be planed after 30 minutes. If the Kubernetes pod is rotated (the worker is killed by Kubernetes) or a newer version of the application is deployed (all workers are killed and new workers are created), all workers will take the scheduled task and start executing in the defined time.

@elMateso I faced similar issues with Airflow deployment on k8s (consumers on pods and redis as a queue). But I was able to make the deployment stable and working as expected, maybe those tips will help you:
https://www.polidea.com/blog/application-scalability-kubernetes/#tips-for-hpa

@jgbmattos
Copy link

@jgbmattos jgbmattos commented Jul 29, 2020

Facing the same here.

Doesn't seems to be a problem with any timing configuration (visibility timeout, ETA, etc..), for me at least. In mine case it happens microseconds between executions. Didn't find how celery does in fact ACK a message, but, if, in rabbitMQ it's working perfectly it seems to be a problem with concurrency and ACK in Redis.

@ErikKalkoken
Copy link

@ErikKalkoken ErikKalkoken commented Aug 19, 2020

I am seeing the same issue and we are using Redis as broker too. Changing to rabbitMQ is not an option for us.

Does anyone have tried using a lock to ensure the task is only executed once only. Could that work?

e.g. https://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time

@jgbmattos
Copy link

@jgbmattos jgbmattos commented Aug 19, 2020

@ErikKalkoken we end up doing exactly that.

def semaphore(fn):
    @wraps(fn)
    def wrapper(self_origin, *args, **kwargs):
        cache_name = f"{UID}-{args[0].request.id}-semaphore"
        agreement_redis = AgreementsRedis()
        if not agreement_redis.redis.set(cache_name, "", ex=30, nx=True):
          Raise Exception("...")
        try:
            return fn(self_origin, *args, **kwargs)
        finally:
            agreement_redis.redis.delete(cache_name)

    return wrapper

The code above is not used for celery, but celery multiple execution is the same logic, you just need to get the task_id and set the cache. So far is working fine.

@auvipy
Copy link
Member

@auvipy auvipy commented Oct 14, 2020

can someone check this pr vinayinvicible/kombu@a755ba1

@thedrow thedrow added this to To do in Celery 5.1.0 Feb 24, 2021
@thedrow thedrow moved this from To do to Backlog in Celery 5.1.0 Mar 23, 2021
@auvipy auvipy modified the milestones: 5.1.0, 5.2 Mar 28, 2021
@xiaozuo7
Copy link

@xiaozuo7 xiaozuo7 commented Apr 27, 2021

This seems to be a feasible way, https://github.com/cameronmaske/celery-once

@thedrow thedrow moved this from Backlog to Postponed in Celery 5.1.0 May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Celery 5.1.0
  
Postponed
Linked pull requests

Successfully merging a pull request may close this issue.

None yet