Worker receiving SIGTERM causes issues with SQS Long Polling #7283
JMiller-FINCAD started this conversation in General
Replies: 1 comment 4 replies
-
I'm having the same issue. We host our workers in an AWS EC2 Auto Scaling group, and when we scale in we get some messages stuck in … But looking more into that, I found that the message is never processed and never appears in our logs, and using the DEBUG option I found that I can reproduce the issue if I interrupt Celery after it sends the … Here is the log:
-
We've noticed an issue with Celery using SQS. We host our workers in AWS ECS containers and scale out the number of containers when there are many messages in the SQS queue. When the message count drops we scale back in, which sends SIGTERM to the containers to shut them down.
We hit an edge case where the worker in a container receives the SIGTERM while it is polling SQS for messages. The worker appears to receive a message from SQS, but is shut down before it processes it. Because we have TASK_ACKS_LATE = True, the message then waits out its visibility timeout (set to 30 minutes) before being picked up by another worker. We seem to minimize this by lowering wait_time_seconds from the default of 10 seconds down to 2 seconds, but I'm not sure the issue has gone away entirely.
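The race described above can be sketched in plain Python: if the SIGTERM handler only sets a flag and the loop finishes (and acks) the in-flight message before re-checking that flag, a signal arriving mid-poll cannot strand a received-but-unprocessed message. This is a minimal illustration, not Celery's actual internals; `GracefulWorker` and its `source` argument are hypothetical names.

```python
import queue
import signal
import threading


class GracefulWorker:
    """Minimal sketch (hypothetical, not Celery's internals) of a polling
    loop that drains the in-flight message before honoring SIGTERM."""

    def __init__(self, source):
        self.source = source              # anything with get(timeout=...) -> message
        self.shutdown = threading.Event()
        self.processed = []

    def install(self):
        # The handler only sets a flag; the loop decides when to stop.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.shutdown.set())

    def run(self):
        while not self.shutdown.is_set():
            try:
                # Analogous to SQS long polling with WaitTimeSeconds; a short
                # timeout narrows the window where a message is received
                # right as shutdown begins.
                msg = self.source.get(timeout=2)
            except queue.Empty:
                continue
            # Process (and only then "ack") before re-checking the flag, so a
            # SIGTERM during the poll cannot strand this message.
            self.processed.append(msg)
```

In a real worker, `install()` would be called at startup so the container's SIGTERM triggers a drain rather than an immediate exit.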
This is a tough issue to reproduce, and it only happens under heavy load on our system (with lots of workers). Lowering wait_time_seconds does seem to reduce the chance of it happening, but it would be nice to eliminate it completely. Any insight would be helpful.
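One possible mitigation, sketched below with the real boto3 SQS calls (`receive_message` and `change_message_visibility`): if shutdown has started by the time a long poll returns a message, set that message's visibility timeout to 0 so it becomes available to other workers immediately instead of waiting out the full 30 minutes. The `poll_once` helper and its arguments are hypothetical names for illustration.

```python
def poll_once(sqs, queue_url, shutting_down):
    """Receive at most one message; if shutdown began mid-poll, hand the
    message straight back to the queue rather than stranding it.

    `sqs` is an SQS client (e.g. boto3.client("sqs")); `shutting_down` is a
    zero-argument callable returning True once SIGTERM has been seen.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=2,  # mirrors the lowered wait_time_seconds above
    )
    for msg in resp.get("Messages", []):
        if shutting_down():
            # Undo the implicit "checkout": visibility timeout 0 makes the
            # message visible to other workers right away.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=0,
            )
            return None
        return msg
    return None
```

With a real client this would be `sqs = boto3.client("sqs")`; the function is written against a duck-typed client so the requeue logic can be exercised without AWS.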
Some relevant settings:
celery==5.1.2
WORKER_CONCURRENCY = 1
WORKER_PREFETCH_MULTIPLIER = 1
TASK_ACKS_LATE = True
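For reference, the settings above map onto a Celery 5.x configuration roughly like this (a sketch, not the poster's actual config; the SQS-specific knobs such as `wait_time_seconds` and `visibility_timeout` live in `broker_transport_options`, and the values shown are taken from the discussion):

```python
# Hedged sketch of a Celery 5.x SQS configuration matching the settings above.
broker_url = "sqs://"            # SQS transport via kombu

worker_concurrency = 1
worker_prefetch_multiplier = 1
task_acks_late = True

broker_transport_options = {
    "wait_time_seconds": 2,      # lowered from the default of 10
    "visibility_timeout": 1800,  # the 30-minute queue timeout mentioned above
}
```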