Worker receiving SIGTERM causes issues with SQS Long Polling #7283
JMiller-FINCAD started this conversation in General
Replies: 1 comment 4 replies
-
I'm having the same issue. We host our workers in an AWS EC2 Auto Scaling group, and when we scale in we get some messages stuck in … But looking more into that, I found that the message is never processed and never appears in our logs, and using the DEBUG option I found that I can reproduce the issue if I interrupt Celery after it sends the … Here is the log:
-
We've noticed an issue with Celery using SQS. We host our workers in AWS ECS containers and scale out the number of containers when there are many messages in the SQS queue. When the message count drops we scale back in, which sends SIGTERM to the containers to shut them down.
We hit an edge case where the worker in a container receives the SIGTERM while it is polling SQS for messages. The worker appears to receive a message from SQS, but is shut down before it processes it. Because we have TASK_ACKS_LATE = True, the message then waits out its visibility timeout (set to 30 minutes) before being picked up by another worker. We seem to minimize this by lowering wait_time_seconds from the default of 10 seconds down to 2 seconds, but I'm not sure the issue has gone away entirely.
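The race described above can be sketched in plain Python: if the SIGTERM handler only sets a flag and the loop finishes (and acks) the in-flight message before re-checking that flag, a signal arriving mid-poll cannot strand a received-but-unprocessed message. This is a minimal illustration, not Celery's actual internals; `GracefulWorker` and its `source` argument are hypothetical names.

```python
import queue
import signal
import threading


class GracefulWorker:
    """Minimal sketch (hypothetical, not Celery's internals) of a polling
    loop that drains the in-flight message before honoring SIGTERM."""

    def __init__(self, source):
        self.source = source              # anything with get(timeout=...) -> message
        self.shutdown = threading.Event()
        self.processed = []

    def install(self):
        # The handler only sets a flag; the loop decides when to stop.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.shutdown.set())

    def run(self):
        while not self.shutdown.is_set():
            try:
                # Analogous to SQS long polling with WaitTimeSeconds; a short
                # timeout narrows the window where a message is received
                # right as shutdown begins.
                msg = self.source.get(timeout=2)
            except queue.Empty:
                continue
            # Process (and only then "ack") before re-checking the flag, so a
            # SIGTERM during the poll cannot strand this message.
            self.processed.append(msg)
```

In a real worker, `install()` would be called at startup so the container's SIGTERM triggers a drain rather than an immediate exit.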
This is a tough issue to reproduce, and it only happens under heavy load on our system (with lots of workers). Lowering wait_time_seconds does seem to reduce the chance of it happening, but it would be nice to eliminate it completely. Any insight would be helpful.
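One possible mitigation, sketched below with the real boto3 SQS calls (`receive_message` and `change_message_visibility`): if shutdown has started by the time a long poll returns a message, set that message's visibility timeout to 0 so it becomes available to other workers immediately instead of waiting out the full 30 minutes. The `poll_once` helper and its arguments are hypothetical names for illustration.

```python
def poll_once(sqs, queue_url, shutting_down):
    """Receive at most one message; if shutdown began mid-poll, hand the
    message straight back to the queue rather than stranding it.

    `sqs` is an SQS client (e.g. boto3.client("sqs")); `shutting_down` is a
    zero-argument callable returning True once SIGTERM has been seen.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=2,  # mirrors the lowered wait_time_seconds above
    )
    for msg in resp.get("Messages", []):
        if shutting_down():
            # Undo the implicit "checkout": visibility timeout 0 makes the
            # message visible to other workers right away.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=0,
            )
            return None
        return msg
    return None
```

With a real client this would be `sqs = boto3.client("sqs")`; the function is written against a duck-typed client so the requeue logic can be exercised without AWS.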
Some relevant settings:
celery==5.1.2
WORKER_CONCURRENCY = 1
WORKER_PREFETCH_MULTIPLIER = 1
TASK_ACKS_LATE = True
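For reference, the settings above map onto a Celery 5.x configuration roughly like this (a sketch, not the poster's actual config; the SQS-specific knobs such as `wait_time_seconds` and `visibility_timeout` live in `broker_transport_options`, and the values shown are taken from the discussion):

```python
# Hedged sketch of a Celery 5.x SQS configuration matching the settings above.
broker_url = "sqs://"            # SQS transport via kombu

worker_concurrency = 1
worker_prefetch_multiplier = 1
task_acks_late = True

broker_transport_options = {
    "wait_time_seconds": 2,      # lowered from the default of 10
    "visibility_timeout": 1800,  # the 30-minute queue timeout mentioned above
}
```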