Celery worker deletes task from MQ during fetching after shutdown #7950

Open · 11 of 18 tasks
igor-gorohovsky opened this issue Dec 9, 2022 · 2 comments

igor-gorohovsky commented Dec 9, 2022

Checklist

  • I have verified that the issue exists against the master branch of Celery.
  • This has already been asked to the discussions forum first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the master branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the master branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

  • None

Environment & Settings

Celery version: 5.2.7 and 4.3.0

celery report Output:

software -> celery:5.2.7 (dawn-chorus) kombu:5.2.4 py:3.8.0
            billiard:3.6.4.0 redis:4.4.0
platform -> system:Linux arch:64bit, ELF
            kernel version:5.15.0-56-generic imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://:**@localhost:6379/

Queue: <class 'kombu.entity.Queue'>
broker_url: 'redis://:********@localhost:6379//'
imports: 
    ('worker',)
logging: <module 'logging' from '/usr/local/lib/python3.8/logging/__init__.py'>
result_backend: 'redis://:********@localhost:6379/'
task_queues: [<unbound Queue test -> <unbound Exchange ''(direct)> -> >]
task_routes: {
 'task*': {'queue': 'test'}}
worker_concurrency: 1
worker_prefetch_multiplier: 1
deprecated_settings: None

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: 3.8
  • Minimal Celery Version: 4.3.0
  • Minimal Kombu Version: 4.6.11
  • Minimal Broker Version: 7.0
  • Minimal Result Backend Version: 7.0
  • Minimal OS and/or Kernel Version: Ubuntu 22.04 and kernel version:5.15.0-56-generic
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

amqp==5.1.1
async-timeout==4.0.2
billiard==3.6.4.0
celery==5.2.7
click==8.1.3
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
kombu==5.2.4
prompt-toolkit==3.0.36
psutil==5.9.4
pytz==2022.6
redis==4.4.0
six==1.16.0
vine==5.0.0
wcwidth==0.2.5

Other Dependencies

N/A

Minimally Reproducible Test Case

celeryconfig.py

from kombu import Queue

broker_url = "redis://:password@localhost:6379"
result_backend = "redis://:password@localhost:6379"
imports = ("worker",)
worker_concurrency = 1
worker_prefetch_multiplier = 1
task_routes = {'task*': {'queue': 'test'}}
task_queues = [Queue("test")]

worker.py

import logging
import time

from celery import Celery

import celeryconfig

app = Celery("test")
app.config_from_object(celeryconfig)
logger = logging.getLogger(__name__)


@app.task(name="task1")
def task1():
    pass

main.py

import signal
import subprocess
import time

import redis

from worker import task1


if __name__ == '__main__':
    r = redis.Redis(host='localhost', port=6379, db=0, password="")
    while True:
        # You may need to adjust the initial timeout, depending on how fast/slow your PC is
        timeout = 1.2439  # this value worked on my PC
        step = 0.00001  # increase the timeout by this step until we catch the race
        task_id = task1.delay().task_id

        while True:
            try:
                p = subprocess.Popen(["celery", "-A", "worker", "worker", "-l", "info", "-E"])
                time.sleep(timeout)
                p.send_signal(signal.SIGINT)
                p.wait()  # wait for the warm shutdown to finish before inspecting Redis
            except (subprocess.TimeoutExpired, BrokenPipeError):
                pass

            print(timeout)

            # Check whether the task is still sitting in the queue; if it is,
            # the shutdown came too early, so retry with a slightly longer timeout
            if r.lrange("test", 0, 1):
                timeout += step
            else:
                break

        # Sometimes we miss the timing entirely: if the task was actually processed
        # (a result key exists), send a fresh task and start over
        is_task_lost = not r.get("celery-task-meta-" + task_id)
        if is_task_lost:
            break
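
To run the reproduction (assuming the three files above are in one directory and a local Redis on port 6379 matching the password in celeryconfig.py): start Redis, then run python main.py. The script keeps sending a task, starting a worker, and interrupting it with SIGINT after a slowly growing delay; it exits once a task has disappeared from both the test queue and the result backend, i.e. once the loss has been reproduced.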

Expected Behavior

Even when a warm shutdown is received in the "middle" of the fetching process, the task should either be moved all the way into the unacked set or left in the queue; it must not be lost.

Expected redis operations:

LPUSH test {"body": "W1tdLCB7fSwgeyJjYWxsYmFja3MiOiBudWxsLCAiZXJyYmFja3MiOiBudWxsLCAiY2hhaW4iOiBudWxsLCAiY2hvcmQiOiBudWxsfV0=", "content-encoding": "utf-8", "content-type": "application/json", "headers": {"lang": "py", "task": "task1", "id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "shadow": null, "eta": null, "expires": null, "group": null, "group_index": null, "retries": 0, "timelimit": [null, null], "root_id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "parent_id": null, "argsrepr": "()", "kwargsrepr": "{}", "origin": "gen13915@igor-pc-work", "ignore_result": false}, "properties": {"correlation_id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "reply_to": "36f586c8-1387-3a19-8eb3-12dbefcfaeb8", "delivery_mode": 2, "delivery_info": {"exchange": "", "routing_key": "test"}, "priority": 0, "body_encoding": "base64", "delivery_tag": "beeb3fc1-8524-4212-a14d-9bee5343f353"}}
RPOP test
MULTI
ZADD unacked_index 1670591793.386102 beeb3fc1-8524-4212-a14d-9bee5343f353
HSET unacked beeb3fc1-8524-4212-a14d-9bee5343f353 [{"body": "W1tdLCB7fSwgeyJjYWxsYmFja3MiOiBudWxsLCAiZXJyYmFja3MiOiBudWxsLCAiY2hhaW4iOiBudWxsLCAiY2hvcmQiOiBudWxsfV0=", "content-encoding": "utf-8", "content-type": "application/json", "headers": {"lang": "py", "task": "task1", "id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "shadow": null, "eta": null, "expires": null, "group": null, "group_index": null, "retries": 0, "timelimit": [null, null], "root_id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "parent_id": null, "argsrepr": "()", "kwargsrepr": "{}", "origin": "gen13915@igor-pc-work", "ignore_result": false}, "properties": {"correlation_id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "reply_to": "36f586c8-1387-3a19-8eb3-12dbefcfaeb8", "delivery_mode": 2, "delivery_info": {"exchange": "", "routing_key": "test"}, "priority": 0, "body_encoding": "base64", "delivery_tag": "beeb3fc1-8524-4212-a14d-9bee5343f353"}}, "", "test"]
EXEC
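
(For context: as far as I can tell from kombu's Redis transport, not-yet-acknowledged deliveries are kept in the unacked hash and the unacked_index sorted set, keyed by the delivery tag, so they can be pushed back onto their original queue if the worker dies or shuts down before acknowledging them. The MULTI/EXEC block above is that transfer.)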

Actual Behavior

The task is removed from Redis.

When fetching a task from the queue, the worker issues an RPOP against Redis and then moves the returned message into the unacked set. If the warm shutdown hits exactly the right timing, the worker only gets as far as the RPOP and then shuts down without transferring the message to the unacked set, so the task is simply lost (see the sketch after the log below).

Actual redis operations logs:

LPUSH test {"body": "W1tdLCB7fSwgeyJjYWxsYmFja3MiOiBudWxsLCAiZXJyYmFja3MiOiBudWxsLCAiY2hhaW4iOiBudWxsLCAiY2hvcmQiOiBudWxsfV0=", "content-encoding": "utf-8", "content-type": "application/json", "headers": {"lang": "py", "task": "task1", "id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "shadow": null, "eta": null, "expires": null, "group": null, "group_index": null, "retries": 0, "timelimit": [null, null], "root_id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "parent_id": null, "argsrepr": "()", "kwargsrepr": "{}", "origin": "gen13915@igor-pc-work", "ignore_result": false}, "properties": {"correlation_id": "697a9736-46a6-4978-96cd-547a7b46f6ac", "reply_to": "36f586c8-1387-3a19-8eb3-12dbefcfaeb8", "delivery_mode": 2, "delivery_info": {"exchange": "", "routing_key": "test"}, "priority": 0, "body_encoding": "base64", "delivery_tag": "beeb3fc1-8524-4212-a14d-9bee5343f353"}}
RPOP test
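
To make the window concrete, here is a minimal redis-py sketch of the fetch as I understand it. The key names ("test", "unacked", "unacked_index") are taken from the logs above; the function and variable names are placeholders rather than kombu's actual internals. Everything between the RPOP and the MULTI/EXEC is the window in which a warm shutdown loses the task:

import json
import time

import redis

r = redis.Redis(host="localhost", port=6379)


def fetch_one(queue="test"):
    raw = r.rpop(queue)  # step 1: the message leaves the queue
    if raw is None:
        return None

    # If a warm shutdown lands here, the message exists only in this
    # process's memory: it is gone from "test" and not yet in "unacked".

    msg = json.loads(raw)
    tag = msg["properties"]["delivery_tag"]
    with r.pipeline(transaction=True) as pipe:  # step 2: MULTI ... EXEC
        pipe.zadd("unacked_index", {tag: time.time()})
        pipe.hset("unacked", tag, json.dumps([msg, "", queue]))
        pipe.execute()
    return msg

Because the two steps are separate round trips, only something like an atomic LMOVE into a per-worker backup list (or a server-side Lua script that pops and records the delivery in one call) would close the window completely.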

P.S. We hit this problem in production with Kubernetes autoscaling on Google Cloud.

@open-collective-bot

Hey @igor-gorohovsky 👋,
Thank you for opening an issue. We will get back to you as soon as we can.
Also, check out our Open Collective and consider backing us - every little helps!

We also offer priority support for our sponsors.
If you require immediate assistance please consider sponsoring us.

Shaked commented Mar 18, 2023

Hi @igor-gorohovsky - have you figured out a way to overcome this problem?
