Messages requested from SQS queue just prior to warm shutdown are taken off queue and processed only when visibility timeout is reached #8875

wcislo-saleor opened this issue on Feb 26, 2024

Checklist

  • I have verified that the issue exists against the main branch of Celery.
  • This has already been asked to the discussions forum first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the main branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the main branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

Environment & Settings

Celery version:

celery report Output:

$ celery -A tasks report

software -> celery:5.3.6 (emerald-rush) kombu:5.3.5 py:3.9.18
            billiard:4.2.0 sqs:N/A
platform -> system:Darwin arch:64bit
            kernel version:21.6.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:sqs results:disabled

broker_url: 'sqs://localhost//'
broker_transport_options: {
 'region': 'eu-west-1'}
task_create_missing_queues: False
deprecated_settings: None
task_default_queue: 'test-queue'

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: 3.9.18
  • Minimal Celery Version: 5.3.6
  • Minimal Kombu Version: 5.3.5
  • Minimal Broker Version: N/A
  • Minimal Result Backend Version: N/A
  • Minimal OS and/or Kernel Version: macOS 12.5
  • Minimal Broker Client Version: N/A
  • Minimal Result Backend Client Version: N/A

Python Packages

pip freeze Output:

https://gist.github.com/wcislo-saleor/969d036545879f28b24b5594dc9b49ab

Other Dependencies

SQS queue created with Terraform

resource "aws_sqs_queue" "queue" {
  name                       = "test-queue"
  delay_seconds              = 0
  max_message_size           = 262144
  message_retention_seconds  = 345600
  receive_wait_time_seconds  = 0
  visibility_timeout_seconds = 150
}
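
For anyone reproducing without Terraform, a roughly equivalent queue can be created with boto3. This is only a sketch; the attribute values mirror the resource above.

import boto3

# Create a queue equivalent to the Terraform resource above (sketch).
sqs = boto3.client("sqs", region_name="eu-west-1")
sqs.create_queue(
    QueueName="test-queue",
    Attributes={
        "DelaySeconds": "0",
        "MaximumMessageSize": "262144",
        "MessageRetentionPeriod": "345600",
        "ReceiveMessageWaitTimeSeconds": "0",
        "VisibilityTimeout": "150",
    },
)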

Minimally Reproducible Test Case

Run run_test.py.

run_test.py

#!/usr/bin/env python3

import subprocess
import time
from datetime import datetime
from pathlib import Path
from tasks import simulate_async_task


time_started = datetime.now().isoformat()

main_n = 1
initial_n = 3
late_n = 2
tasks_n = 80

min_prefetch_multiplier = False


def run_workers(group_name, n):
    procs = []

    args = [
        "celery",
        "-A" "tasks",
        "worker",
        "--loglevel=INFO",
        "--concurrency",
        "2",
    ]

    if min_prefetch_multiplier:
        args.append("--prefetch-multiplier")
        args.append("1")

    for i in range(n):
        fd = open(f"./logs/{time_started}/{group_name}-{i}.log", "w")
        procs.append(
            subprocess.Popen(
                args,
                stdout=fd,
                stderr=fd,
            )
        )

    print(f"Started {[proc.pid for proc in procs]}")
    return procs


def terminate_workers(procs):
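    # Popen.terminate() sends SIGTERM, which triggers Celery's warm shutdown.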
    for proc in procs:
        print(f"Terminating {proc.pid}")
        time.sleep(5)
        proc.terminate()


def run_simulated_tasks(name, n=5):
    print(f"Running {n} simulated tasks")

    for i in range(1, n + 1):
        simulate_async_task.delay(f"{name} task {i}")


def main():
    Path(f"./logs/{time_started}").mkdir(parents=True, exist_ok=True)

    print(f"Started at {time_started}")
    main_proc = run_workers(group_name="main", n=main_n)
    initial_procs = run_workers(group_name="initial", n=initial_n)

    time.sleep(10)

    run_simulated_tasks(name="run_test", n=tasks_n)

    late_procs = run_workers(group_name="late", n=late_n)

    terminate_workers(initial_procs)
    terminate_workers(late_procs)

    prompt = None
    while prompt != "stop":
        prompt = input("Type 'stop' to finish: ")

    terminate_workers(main_proc)


if __name__ == "__main__":
    main()

tasks.py

from celery import Celery
import time

app = Celery(
    "tasks",
    broker_url="sqs://",
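    # No credentials in the broker URL: the SQS transport falls back to
    # boto3's default credential chain (environment variables, ~/.aws/credentials, etc.).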
    broker_transport_options={
        "region": "eu-west-1",
    },
    task_create_missing_queues=False,
)
app.conf.task_default_queue = "test-queue"


@app.task(queue="test-queue", bind=True)
def simulate_async_task(self, name):
    print(f"Starting {name} with id {self.request.id} ")
    time.sleep(3)
    print(f"Ending   {name} with id {self.request.id} ")

Expected Behavior

All messages added to SQS are processed within at most a few minutes after the test script starts: there are multiple workers and 80 messages, and each message takes 3 seconds to process, so the whole batch amounts to only a few minutes of work for the pool.

Actual Behavior

With the given setup (number of workers and number of messages), I can reliably reproduce a situation where a few messages are not processed until the visibility timeout is reached.

Why does this matter? In our setup the SQS queue has a visibility timeout of 30 minutes. Celery workers run in Kubernetes, where they are autoscaled based on the number of messages in the queue. Once the number of messages gets low enough, most Celery workers are shut down (SIGTERM, initiating a warm shutdown). This regularly provokes a situation where a certain number of messages are left in the queue and are processed only after the visibility timeout is reached.

By adding print statements, I compared the following:

  1. The number of "ReceiveMessage" calls initiated against the AWS API (https://github.com/celery/kombu/blob/abe91098b7c4b8008117030d0f806b4a743ba972/kombu/asynchronous/aws/sqs/connection.py#L141).
  2. The number of invocations of the "ReceiveMessage" callback (https://github.com/celery/kombu/blob/abe91098b7c4b8008117030d0f806b4a743ba972/kombu/transport/SQS.py#L602).
  3. The number of "ReceiveMessage" calls recorded in AWS CloudTrail.

I can observe that when the issue is reproduced, numbers 1 and 3 are the same, while number 2 is slightly lower. This tells me that Celery did ask SQS for messages but never got a chance to process them (neither to actually execute the tasks nor to put them back onto the queue). When that happens, those messages sit in the SQS queue until the visibility timeout is reached, and only then are they taken off the queue and processed by a worker.
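
A convenient way to confirm from outside Celery that the leftover messages are in flight rather than lost is to poll the queue's approximate counters while reproducing the issue. Below is a minimal sketch using boto3 (queue name and region mirror the setup above); the stuck messages show up under "ApproximateNumberOfMessagesNotVisible" until the 150-second visibility timeout expires and they become visible again.

#!/usr/bin/env python3
import time

import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
queue_url = sqs.get_queue_url(QueueName="test-queue")["QueueUrl"]

while True:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",  # visible, waiting to be received
            "ApproximateNumberOfMessagesNotVisible",  # received but not yet deleted (in flight)
        ],
    )["Attributes"]
    print(
        f"visible={attrs['ApproximateNumberOfMessages']} "
        f"in_flight={attrs['ApproximateNumberOfMessagesNotVisible']}"
    )
    time.sleep(10)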
