Redis result backend connections leak #6819
Hey @ronlut 👋, We also offer priority support for our sponsors.
I've just experienced the same issue. Setting
P.S. Celery report output:
An update: at the moment I have set the Redis idle connection timeout to 60 seconds, which kills all leaked connections after a minute.
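For reference, a minimal sketch of that workaround using redis-py (the host, port, and 60-second value are placeholders; the same effect can be achieved by putting `timeout 60` in redis.conf):

```python
import redis

# Connect to the Redis instance used as the Celery result backend
# (connection details are assumptions for this sketch).
r = redis.Redis(host="localhost", port=6379)

# Ask Redis to close client connections that stay idle for 60 seconds,
# which reaps leaked result-backend connections after a minute.
r.config_set("timeout", 60)
print(r.config_get("timeout"))
```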
Is there any progress on this?
5.0.2 is the max version that doesn't have this leak (at least for me). Something happened after that.
Well, I played around with it and found out that if we change

```diff
diff --git a/celery/app/base.py b/celery/app/base.py
index a00d46513..cc244e77d 100644
--- a/celery/app/base.py
+++ b/celery/app/base.py
@@ -1243,7 +1243,7 @@ class Celery:
         """AMQP related functionality: :class:`~@amqp`."""
         return instantiate(self.amqp_cls, app=self)
 
-    @property
+    @cached_property
     def backend(self):
         """Current backend instance."""
         try:
```

connections do not leak anymore. Not sure if it is a valid fix, but at least it makes Celery usable with Redis again. What do you think @matusvalo?
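For context, here is a minimal standalone sketch (my own illustration, not Celery's code) of the difference the one-line change relies on: a plain `@property` runs its body on every access, while `@cached_property` builds the object once and then reuses it for the lifetime of the instance.

```python
from functools import cached_property

class EagerBackend:
    @property
    def backend(self):
        return object()          # built anew on every access

class CachedBackend:
    @cached_property
    def backend(self):
        return object()          # built once, then stored on the instance

eager, cached = EagerBackend(), CachedBackend()
assert eager.backend is not eager.backend   # a different object each time
assert cached.backend is cached.backend     # the same object is reused
```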
celery/kombu@96ca00f was added; you can try both.
Well, it did not help, only
You can come with your proposed change as well, as it is solving the problem.
Let me have a look at the issue. I will get back ASAP when I find something.
The 'leak' is caused by the threading behaviour of Flask. When Flask runs with the default
For Celery versions above 5.0.3, the backend is created and used per thread. It uses
Because every request thread is new, Celery creates a new backend and new connections every time these new request threads send tasks. In my opinion, I am not sure it is a problem with Celery; I thought Flask had a thread pool to reuse the threads.
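To illustrate what "created and used per thread" means in practice, here is a minimal standalone sketch (my own illustration, not Celery's code) that stores a fake backend in a `threading.local()`: every short-lived request thread starts with an empty thread-local, so it builds its own backend and, with it, its own connections.

```python
import threading

_local = threading.local()
created = []

def get_backend():
    # A thread that has never called this before builds its own "backend".
    if not hasattr(_local, "backend"):
        _local.backend = object()
        created.append(_local.backend)
    return _local.backend

def handle_request():
    get_backend()  # roughly what sending a task from a request thread triggers

threads = [threading.Thread(target=handle_request) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(created))  # 5 backends for 5 one-shot request threads
```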
As per my comment, the leak exists in a pure Celery setup with gevent workers, too, making it essentially a problem in Celery itself.
I checked the backends and they seemed to be destroyed properly (
I was thinking, could it be caused by WSL? I am running Celery under WSL Ubuntu 20.04 (Windows Server 2019) using gevent as the pool.
OK, I have checked the issue more deeply using the master branch of Celery, and here are my findings:

**Running the reproducer over the Flask development server**

As mentioned by @bright2227, the Flask development server spawns a new thread for each HTTP request. This causes creation of new

```
# omitted multiple runs of redis-cli with increasing connected_clients
matus@matus-debian:~$ redis-cli info Clients
# Clients
connected_clients:497
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:95
client_recent_max_output_buffer:0
blocked_clients:1
tracking_clients:0
clients_in_timeout_table:1
matus@matus-debian:~$ redis-cli info Clients
# Clients
connected_clients:77
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:95
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
```

It is clearly seen that for some time the count rose to more than 400 connections, but after that the GC destroys some of the unused instances and frees connections back to the pool. To be honest, I am totally fine with this behaviour for the Flask development web server, since it is not intended to be used in production.

**Running the reproducer in gunicorn with threads**

I have also checked the deployment with gunicorn using threads. I installed the latest stable version of gunicorn and ran:

```
matus@matus-debian:~/dev/celery$ gunicorn --threads=5 web:app
[2021-10-25 22:26:44 +0200] [13162] [INFO] Starting gunicorn 20.1.0
[2021-10-25 22:26:44 +0200] [13162] [INFO] Listening at: http://127.0.0.1:8000 (13162)
[2021-10-25 22:26:44 +0200] [13162] [INFO] Using worker: gthread
[2021-10-25 22:26:44 +0200] [13169] [INFO] Booting worker with pid: 13169
```

I have run 3 parallel instances of

From the pstree snippet it can be seen that 5 threads were spawned to serve the requests. I waited for a longer time and the number of Redis connections stayed stable at 30:

```
matus@matus-debian:~$ redis-cli info Clients
# Clients
connected_clients:30
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:95
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
```

So this deployment was running flawlessly. I also increased the load by lowering the sleep in the task:

```python
@app.task
def sleep():
    time.sleep(0.01)
```

and still the number of connections was stable at 30 with 3 clients posting data to gunicorn. This deployment works 100% fine. Hence, I was not able to reproduce any serious problem with connection leaks. I did not check the gevent case, so it is possible that this problem is specific to gevent deployments. @pomo-mondreganto @ronlut could you provide a reproducer for the gevent case?
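For what it's worth, a gevent-based reproducer along the lines being asked for could look roughly like this (a sketch only: it assumes gevent is installed and a `tasks` module exposing the `sleep` task from the minimal test case further down; the numbers are arbitrary):

```python
from gevent import monkey
monkey.patch_all()  # must run before the other imports

import gevent
from tasks import sleep  # hypothetical: the task module from the reproducer

def call_task():
    # Publish a task and block on its result, like the Flask view does.
    sleep.delay().get(timeout=30)

# Spawn many concurrent greenlets and watch `redis-cli info Clients` meanwhile.
jobs = [gevent.spawn(call_task) for _ in range(200)]
gevent.joinall(jobs)
```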
G'day folks, we've been seeing what I think is the same issue. We're doing daily reboots of our busiest production site to prevent hitting the OS file handle limit from all the Redis connections. This gets me out of bed each day with purpose. Checking the Redis client list, we have thousands of old connections all with
These connections are all for the same Redis database, which contains keys of the pattern
There is a minimal repro over here with instructions: https://github.com/LivePreso/redis-leak
The issue is reproducible using
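For anyone who wants to check their own instance for the same pattern, here is a small redis-py sketch that prints the longest-idle client connections (the connection details are placeholders):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# CLIENT LIST returns one entry per connection, including how long it has been idle.
clients = r.client_list()
oldest = sorted(clients, key=lambda c: int(c["idle"]), reverse=True)
for c in oldest[:10]:
    print(f'{c["addr"]} idle={c["idle"]}s last_cmd={c["cmd"]}')
```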
I've confirmed the repro at https://github.com/LivePreso/redis-leak still holds with the new 5.2.0 release.
Hello @matusvalo, I've created a new rebased PR (#7631) with the fix in #6895.
I am facing the error but with different settings. The Celery workers in my application are running fine; however, if I chain two tasks to the main task, I receive this error. In addition, the error persists: even the next time I run any task (even with no chained tasks) it still fails with the same error, and the only fix then is to restart Celery and Redis. There are two strange things I noticed:
This is how I am chaining the tasks:

```python
link_tasks = []
link_tasks.append(task1.signature(args=(...)))
link_tasks.append(task2.signature(args=(...)))

main_task.apply_async((...),
                      kwargs=dict(...),
                      task_id=str(task.reference_id),
                      queue=queue,
                      priority=priority,
                      link_error=[error_handler.s()],
                      delivery_mode=2,
                      link=link_tasks or None)
```

I tried the following answers from above:
However, the error still persists.
We're running into the same problem since adding Redis as the result backend. A fix or guidance on how to fix this would be much appreciated.
@wochinge Can your fix pass the tests?
@uzi0espil
@hiimdoublej-swag I want to investigate them today 👍🏻
@hiimdoublej-swag Here are my findings so far. I think the tests are not failing because of the change but rather because the test itself is rather flaky. I ran

We also don't think it's a connection leak, but since this PR Redis connections are no longer shared between threads/geventlets/eventlets. Especially when using gevent/eventlet this causes a spike in connections due to the larger pool size. I've also received the warning below, which in my opinion tells us that

Ideally it shouldn't be a big deal if we have one or two connections per eventlet/gevent, but we still can't explain the spikes in our production system by that 😬
In our Grafana logs we saw a huge spike in Redis connections after the tasks were finished. The connections were closed once the results were deleted from the result backend after the default result expire time (1 day). Our quick fix is now to use
Can you help improve the flaky test?
I think the issue is that it spawns so many threads, which are then sometimes scheduled by the OS in a non-ideal way, so that waiting times can be quite large and the test takes more time than normal. In my opinion it's not the most pressing issue as the

I think my personal take-aways from the investigation yesterday are:
|
@uzi0espil I think you're running into this issue: #6963
I think this should be what we're targeting to fix, instead of trying to share the connections?
Agree! I created a draft PR with an integration test which (I think) reproduces the problem: #7685
All my tasks are like that, but at some point even then Redis memory grows to a point where it becomes unusable. If Redis is periodically flushed (some data is lost, so it's bad), then at some point the Redis maximum connection limit is reached. If that limit is raised, the machine runs out of ports it can use for establishing connections. At the moment I have resorted to restarting the entire stack periodically, after five days or after a billion tasks.
As mentioned above, this is how I bypass it at the moment. It's not perfect, but it keeps things from piling up and collapsing...
I would like more feedback on #8058
After upgrading to 5.3.1, setting result_backend_thread_safe=true works for us. P.S. We've tried 4.4.7 & 5.2.7; both failed with the same problem.
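For reference, a minimal sketch of that configuration (the app name and broker/backend URLs are placeholders; `result_backend_thread_safe` is the setting the comment above refers to, available once you are on Celery 5.3+):

```python
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

# Share a single, thread-safe result-backend instance instead of one per thread.
app.conf.result_backend_thread_safe = True
```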
Checklist
- I have verified that the issue exists against the `master` branch of Celery.
- I have read the relevant section in the contribution guide on reporting bugs.
- I have checked the issues list for similar or identical bug reports.
- I have checked the pull requests list for existing proposed fixes.
- I have checked the commit log to find out if the bug was already fixed in the master branch.
- I have included all related issues and possible duplicate issues in this issue (If there are none, check this box anyway).
Mandatory Debugging Information
- I have included the output of `celery -A proj report` in the issue (if you are not able to do this, then at least specify the Celery version affected).
- I have verified that the issue exists against the `master` branch of Celery.
- I have included the contents of `pip freeze` in the issue.
- I have included all the versions of all the external dependencies required to reproduce this bug.
Optional Debugging Information
- I have tried reproducing the issue on more than one Python version and/or implementation.
- I have tried reproducing the issue on more than one message broker and/or result backend.
- I have tried reproducing the issue on more than one version of the message broker and/or result backend.
- I have tried reproducing the issue with retries, ETA/Countdown & rate limits disabled.
- I have tried reproducing the issue after downgrading and/or upgrading Celery and its dependencies.
Related Issues and Possible Duplicates
Related Issues
- #4465
Possible Duplicates
Environment & Settings
Celery version:
`celery report` Output:

Steps to Reproduce
Required Dependencies
Python Packages
`pip freeze` Output:

Other Dependencies
N/A
Minimally Reproducible Test Case
tasks.py
web.py
simulate.py

- `pip install celery[redis] requests flask`
- `docker run -p 6379:6379 --name redis redis`
- `celery -A tasks worker --loglevel=INFO`
- `python web.py`
- Note the `connected_clients` number
- `python simulate.py`
- Run `info Clients` again in redis-cli. Note the `connected_clients` number, which is now a lot higher.
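The contents of tasks.py, web.py, and simulate.py are not included above. As a rough reconstruction from the rest of the thread (the `sleep` task, the `result.wait()`/`result.get()` call, and the threaded Flask server), not the reporter's original code, the three files could look roughly like this:

```python
# --- tasks.py (separate file): a Celery app with one slow task -------------
import time
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def sleep():
    time.sleep(1)

# --- web.py (separate file): a threaded Flask app that waits on results ----
# from flask import Flask
# from tasks import sleep
#
# web = Flask(__name__)
#
# @web.route("/")
# def index():
#     result = sleep.delay()
#     result.wait()              # or result.get(); where connections pile up
#     return "done"
#
# if __name__ == "__main__":
#     web.run(threaded=True)     # threaded=False makes the leak disappear
#
# --- simulate.py (separate file): fire many requests at the Flask app ------
# import requests
# for _ in range(100):
#     requests.get("http://127.0.0.1:5000/")
```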
Connections close after getting/waiting for a result
Actual Behavior
I'm experiencing a very strange problem with redis connections staying open after each request (after each result.wait() or result.get()).
Help will be super appreciated.
Running Flask with `threaded=False` solves the issue.

Obviously this is simplified reproduction code. Our real environment is gunicorn, eventlet, Flask, Redis, Celery.
In production we are getting 60k open connections to redis quite fast and we had to restart our server a few times to reset the leaks.