Redis result backend connections leak #6819

Closed
11 of 18 tasks
ronlut opened this issue Jun 21, 2021 · 50 comments

@ronlut

ronlut commented Jun 21, 2021

Checklist

  • I have verified that the issue exists against the master branch of Celery.
  • This has already been asked to the discussion group first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the master branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the master branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

- #4465

Possible Duplicates

  • None

Environment & Settings

Celery version:

celery report Output:

software -> celery:5.1.0 (sun-harmonics) kombu:5.1.0 py:3.8.8
            billiard:3.6.4.0 redis:3.5.3
platform -> system:Darwin arch:64bit
            kernel version:20.4.0 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis:///

broker_url: 'redis://localhost:6379//'
result_backend: 'redis:///'
task_queue_max_priority: 10
deprecated_settings: None

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: N/A or Unknown
  • Minimal Celery Version: N/A or Unknown
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

amqp==5.0.6
billiard==3.6.4.0
celery==5.1.0
certifi==2021.5.30
chardet==4.0.0
click==7.1.2
click-didyoumean==0.0.3
click-plugins==1.1.1
click-repl==0.2.0
dnspython==1.16.0
eventlet==0.31.0
Flask==2.0.1
greenlet==1.1.0
idna==2.10
itsdangerous==2.0.1
Jinja2==3.0.1
kombu==5.1.0
MarkupSafe==2.0.1
prompt-toolkit==3.0.18
pytz==2021.1
redis==3.5.3
requests==2.25.1
six==1.16.0
urllib3==1.26.5
vine==5.0.0
wcwidth==0.2.5
Werkzeug==2.0.1

Other Dependencies

N/A

Minimally Reproducible Test Case

tasks.py

import time
from celery import Celery

app = Celery('tasks', backend='redis://', broker='redis://', task_queue_max_priority=10)


@app.task
def sleep():
    time.sleep(1)

web.py

from flask import Flask
from tasks import sleep
app = Flask(__name__)

@app.route('/')
def hello_world():
    result = sleep.apply_async()
    result.wait()
    return '', 200

if __name__ == '__main__':
    app.run(port=4444)

simulate.py

import requests
for i in range(20):
    requests.get("http://localhost:4444")
  1. Install the dependencies: pip install celery[redis] requests flask
  2. Start redis:
    docker run -p 6379:6379 --name redis redis
  3. Run celery: celery -A tasks worker --loglevel=INFO
  4. Run the web server: python web.py
  5. Connect to redis and check the client count:
    docker exec -it redis /bin/bash
    redis-cli
    info Clients

Note the connected_clients number

  6. Simulate requests: python simulate.py
  7. Run info Clients again in redis-cli (or use the Python helper shown after this list).
    Note the connected_clients number, which is now a lot higher.
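To watch the client count from Python instead of redis-cli, a small helper along these lines works (assuming redis-py is installed, as in the pip freeze above, and Redis runs on localhost:6379):

import time

import redis

r = redis.Redis()  # localhost:6379 by default
while True:
    # same figure as connected_clients in `redis-cli info Clients`; Ctrl-C to stop
    print("connected_clients:", r.info("clients")["connected_clients"])
    time.sleep(1)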

Expected Behavior

Connections close after getting/waiting for a result

Actual Behavior

I'm experiencing a very strange problem with redis connections staying open after each request (after each result.wait() or result.get()).

Help would be much appreciated.
Running Flask with threaded=False works around the issue.
Obviously this is simplified reproduction code; our real environment is gunicorn, eventlet, Flask, Redis, and Celery.
In production we reach 60k open connections to Redis quite fast, and we have had to restart our server a few times to clear the leaked connections.


@pomo-mondreganto

pomo-mondreganto commented Jun 21, 2021

I've just experienced the same issue. Setting redis_max_connections does not fix it. We're using gevent workers and RabbitMQ as the broker, so the leak is definitely in the result backend part. The only app writing to the affected Redis DB is Celery. Redis's CLIENT LIST looks like this (truncated):

id=28 addr=172.24.0.13:58812 fd=32 name= age=309 idle=295 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=755 addr=172.24.0.13:60638 fd=437 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=756 addr=172.24.0.13:60640 fd=438 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=757 addr=172.24.0.13:60642 fd=439 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=758 addr=172.24.0.13:60644 fd=440 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=759 addr=172.24.0.13:60646 fd=441 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=760 addr=172.24.0.13:60648 fd=442 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=761 addr=172.24.0.13:60650 fd=443 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=762 addr=172.24.0.13:60652 fd=444 name= age=9 idle=9 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=29 addr=172.24.0.13:58814 fd=33 name= age=309 idle=309 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=30 addr=172.24.0.13:58816 fd=34 name= age=309 idle=309 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=31 addr=172.24.0.13:58818 fd=35 name= age=309 idle=29 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=20 addr=172.24.0.13:58792 fd=24 name= age=309 idle=230 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=132 addr=172.24.0.13:59084 fd=91 name= age=270 idle=230 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=133 addr=172.24.0.13:59086 fd=92 name= age=270 idle=230 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=21 addr=172.24.0.13:58794 fd=25 name= age=309 idle=230 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=468 addr=172.24.0.13:59928 fd=270 name= age=130 idle=129 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=469 addr=172.24.0.13:59930 fd=271 name= age=130 idle=110 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=470 addr=172.24.0.13:59932 fd=272 name= age=130 idle=130 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=279 addr=172.24.0.13:59466 fd=176 name= age=210 idle=210 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=280 addr=172.24.0.13:59468 fd=177 name= age=210 idle=210 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
id=306 addr=172.24.0.13:59532 fd=179 name= age=191 idle=190 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=353 addr=172.24.0.13:59648 fd=202 name= age=171 idle=110 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe
id=354 addr=172.24.0.13:59650 fd=203 name= age=170 idle=110 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=unsubscribe

P.S. Celery report output:

software -> celery:5.1.1 (sun-harmonics) kombu:5.1.0 py:3.9.5
            billiard:3.6.4.0 librabbitmq:2.0.0
platform -> system:Linux arch:64bit, ELF
            kernel version:5.10.25-linuxkit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:librabbitmq results:redis://:**@redis:6379/1

include: ['tasks.actions', 'tasks.handlers']
deprecated_settings: None
broker_url: 'amqp://forcad:********@rabbitmq:5672/forcad'
result_backend: 'redis://:********@redis:6379/1'
timezone: 'Europe/Moscow'
worker_prefetch_multiplier: 1
redis_socket_timeout: 10
redis_socket_keepalive: True
redis_retry_on_timeout: True
accept_content: ['pickle']
result_serializer: 'pickle'
task_serializer: 'pickle'
redis_max_connections: 10

@ronlut
Author

ronlut commented Jun 30, 2021

An update:
I tried making the oid and backend properties of Celery shared between all threads by using a double-checked locking pattern (if-lock-if), but I ran into trouble: I think other places also rely on backend and oid being per-thread.
I haven't had enough time to play with that further yet.

For now I have set the Redis idle connection timeout to 60 seconds, which kills all leaked connections after a minute.
To do that, in redis-cli: config set timeout 60.
Or use the equivalent mechanism in your cloud provider (a parameter group for AWS ElastiCache).
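A rough sketch of the if-lock-if (double-checked locking) idea mentioned above; the attribute and lock names are illustrative, not Celery's actual ones:

import threading

class Celery:
    def __init__(self):
        self._backend = None
        self._backend_lock = threading.Lock()

    def _get_backend(self):
        ...  # build the real result backend here

    @property
    def backend(self):
        # One backend shared by all threads via double-checked locking.
        if self._backend is None:              # fast path once initialised
            with self._backend_lock:
                if self._backend is None:      # re-check while holding the lock
                    self._backend = self._get_backend()
        return self._backend

As noted above, other parts of Celery assume backend (and oid) are per-thread, so this alone may break those assumptions.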

@pomo-mondreganto

Is there any progress on this?

@reederz

reederz commented Sep 15, 2021

5.0.2 is the latest version that doesn't have this leak (at least for me). Something changed after that.

@d0d0

d0d0 commented Oct 5, 2021

I agree with @reederz: the 5.0.2 version is fine, but after upgrading to 5.0.3 I am hitting the connection limit.

EDIT: definitely broken by PR #6416. I reverted it and connections no longer leak.

@auvipy
Member

auvipy commented Oct 5, 2021

@matusvalo

@d0d0

d0d0 commented Oct 6, 2021

Well, I played around with it and found out that if we change

diff --git a/celery/app/base.py b/celery/app/base.py
index a00d46513..cc244e77d 100644
--- a/celery/app/base.py
+++ b/celery/app/base.py
@@ -1243,7 +1243,7 @@ class Celery:
         """AMQP related functionality: :class:`~@amqp`."""
         return instantiate(self.amqp_cls, app=self)
 
-    @property
+    @cached_property
     def backend(self):
         """Current backend instance."""
         try:

connections no longer leak. I'm not sure if it is a valid fix, but at least it makes Celery usable with Redis again.

What do you think @matusvalo?
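To illustrate the difference, a rough standalone sketch (functools.cached_property stands in for the one Celery imports): threading.local yields a separate backend object per thread, while cached_property caches a single instance on the app that all threads share.

import threading
from functools import cached_property

class App:
    def __init__(self):
        self._local = threading.local()

    @property
    def per_thread_backend(self):
        # roughly what the @property version above does
        try:
            return self._local.backend
        except AttributeError:
            self._local.backend = object()  # stands in for _get_backend()
            return self._local.backend

    @cached_property
    def shared_backend(self):
        # the patched version: created once, cached on the instance
        return object()  # stands in for _get_backend()

app = App()
seen = {}

def probe(name):
    seen[name] = (app.per_thread_backend, app.shared_backend)

worker = threading.Thread(target=probe, args=("worker",))
worker.start()
worker.join()
probe("main")

assert seen["main"][0] is not seen["worker"][0]  # two distinct backends -> two sets of connections in real code
assert seen["main"][1] is seen["worker"][1]      # one backend shared across threads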

@auvipy
Member

auvipy commented Oct 6, 2021

celery/kombu@96ca00f was also added; you can try both.

@d0d0

d0d0 commented Oct 6, 2021

celery/kombu@96ca00f was also added; you can try both.

Well, it did not help; only the cached_property change for backend does.

@auvipy
Member

auvipy commented Oct 6, 2021

You can submit your proposed change as well, since it is solving the problem.

d0d0 added a commit to d0d0/celery that referenced this issue Oct 6, 2021
@matusvalo
Member

Let me have a look at the issue. I will get back ASAP when I find something.

@bright2227

The 'leak' is caused by Flask's threading mechanism.

When Flask runs with the default threaded=True, it creates a new thread to handle every incoming request.

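# CPython's socketserver.ThreadingMixIn, which Werkzeug's threaded dev server builds on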
class ThreadingMixIn:
    daemon_threads = False
    block_on_close = True
    _threads = _NoThreads()

    def process_request(self, request, client_address):
        if self.block_on_close:
            vars(self).setdefault('_threads', _Threads())
        t = threading.Thread(target = self.process_request_thread,
                             args = (request, client_address))
        t.daemon = self.daemon_threads
        self._threads.append(t)
        t.start()

For Celery versions 5.0.3 and above, the backend is created and used per thread. It uses threading.local to keep a separate backend for every thread.

@property
def backend(self):
    """Current backend instance."""
    try:
        return self._local.backend
    except AttributeError:
        self._local.backend = new_backend = self._get_backend()
        return new_backend

Because every request thread is new, Celery creates a new backend and new connections every time one of these requests sends a task.

In my opinion, I am not sure this is a problem with Celery. I had thought Flask used a thread pool to reuse threads.

@pomo-mondreganto

pomo-mondreganto commented Oct 24, 2021

@bright2227

As per my comment, the leak exists in a pure Celery setup with gevent workers, too, making it essentially a problem in Celery itself.

@matusvalo
Member

matusvalo commented Oct 24, 2021

I checked the backends and they seemed to be destroyed properly (the __del__ method was called), but I still was not able to find why the connections were not closed. I need to spend more time on the investigation.

@d0d0

d0d0 commented Oct 24, 2021

I checked the backends and they seemed to be destroyed properly (the __del__ method was called), but I still was not able to find why the connections were not closed. I need to spend more time on the investigation.

I was thinking: could it be caused by WSL? I am running Celery under WSL Ubuntu 20.04 (Windows Server 2019) using gevent as the pool.

@matusvalo
Member

matusvalo commented Oct 25, 2021

OK, I have checked the issue more deeply using the master branch of Celery, and here are my findings:

Running the reproducer over flask develop server

As mentioned by @bright2227, the Flask development server spawns a new thread for each HTTP request. This causes the creation of new RedisBackend and ResultConsumer instances, both of which are stored in thread-local storage. The ResultConsumer instance takes a connection from the Redis connection pool and keeps it allocated until it is destroyed. After a request is served, the thread is destroyed and with it the pointers to ResultConsumer and RedisBackend. Hence, both instances wait to be destroyed by the garbage collector; the GC destroys them after some time, and with them the Redis connection is returned to the connection pool. See the Redis connection information for 3 clients pushing data to Flask:

# omitted multiple runs of redis-cli with increasing connected_clients
matus@matus-debian:~$ redis-cli info Clients
# Clients
connected_clients:497
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:95
client_recent_max_output_buffer:0
blocked_clients:1
tracking_clients:0
clients_in_timeout_table:1

matus@matus-debian:~$ redis-cli info Clients
# Clients
connected_clients:77
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:95
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

It is clearly seen that for some time the count rose to more than 400 connections, but after that the GC destroyed some of the unused instances and freed the connections back to the pool. To be honest, I am totally fine with this behaviour for the Flask development web server, since it is not intended to be used in production.
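One way to see that GC dependence directly is to add a debug route to the web.py reproducer that forces a collection inside the web process; connected_clients should drop right after it is hit (a sketch, the route name is arbitrary):

import gc

@app.route('/gc')
def force_gc():
    # Collects the RedisBackend/ResultConsumer instances left behind by
    # finished request threads; their Redis connections are freed with them.
    collected = gc.collect()
    return f'collected {collected} objects', 200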

Running the reproducer in gunicorn with threads

I have also checked a deployment with gunicorn using threads. I installed the latest stable version of gunicorn and ran web.py under gunicorn with the following command:

matus@matus-debian:~/dev/celery$ gunicorn --threads=5 web:app
[2021-10-25 22:26:44 +0200] [13162] [INFO] Starting gunicorn 20.1.0
[2021-10-25 22:26:44 +0200] [13162] [INFO] Listening at: http://127.0.0.1:8000 (13162)
[2021-10-25 22:26:44 +0200] [13162] [INFO] Using worker: gthread
[2021-10-25 22:26:44 +0200] [13169] [INFO] Booting worker with pid: 13169

I ran 3 parallel instances of simulate.py posting data indefinitely, and checked gunicorn with pstree to verify that the threads were spawned:

|-sshd---sshd---sshd---bash---screen---screen-+-bash---gunicorn---gunicorn---5*[{gunicorn}]
|                                             |-2*[bash---python]
|                                             |-bash---celery---4*[celery]
|                                             |-bash
|                                             `-bash---pstree

From the pstree snippet it can be seen that 5 threads were spawned to serve the requests. I waited for a longer time and the number of Redis connections stayed stable at 30:

matus@matus-debian:~$ redis-cli info Clients
# Clients
connected_clients:30
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:95
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

So this deployment was running flawlessly. I also increased the load by lowering the sleep in the task:

@app.task
def sleep():
    time.sleep(0.01)

and the number of connections was still stable at 30, with 3 clients posting data to gunicorn. This deployment works 100% fine.

Hence, I was not able to reproduce any serious problem with connection leaks. I did not check the gevent case, so it is possible that this problem is specific to gevent deployments. @pomo-mondreganto @ronlut, could you provide a reproducer for the gevent case?

@bennullgraham

bennullgraham commented Nov 3, 2021

G'day folks, we've been seeing what I think is the same issue. We're doing daily reboots of our busiest production site to prevent hitting the OS file handle limit from all the Redis connections. This gets me out of bed each day with purpose.

Checking the Redis client list, we have thousands of old connections all with cmd=unsubscribe:

id=14458162  ...  age=70890  idle=70886  ...  events=r  cmd=unsubscribe  user=default
id=14460081  ...  age=61500  idle=61497  ...  events=r  cmd=unsubscribe  user=default
id=14463207  ...  age=56382  idle=56300  ...  events=r  cmd=unsubscribe  user=default
<snip>

These connections are all for the same Redis database, which contains keys of the pattern celery-task-meta-<uuid>.

There is a minimal repro over here with instructions: https://github.com/LivePreso/redis-leak

The issue is reproducible using gunicorn -k gevent but not gunicorn -k sync. It's also not reproducible on Celery 5.0.2 but becomes reproducible somewhere between 5.0.2 and 5.0.5 (the version we are currently using in production), and remains so on 5.1.2 (the version in the repro).

@bennullgraham

I've confirmed the repro at https://github.com/LivePreso/redis-leak still holds with the new 5.2.0 release.

@auvipy auvipy added this to the 5.2.x milestone Nov 16, 2021
@hiimdoublej-swag

Hello @matusvalo, I've created a new rebased PR (#7631) with the fix from #6895.
However, it introduced another problem, so maybe the #6895 fix isn't the right one.
I've put together a pure-Celery reproducible case here, without involving any web framework; it would be great if you could take a look and see if it gives you some ideas for fixing this bug.

@uzi0espil

uzi0espil commented Jul 27, 2022

I am facing the error but with a different setup. The Celery workers in my application run fine; however, if I chain two tasks to the main task, I receive this error. In addition, the error persists: the next time I run any task (even with no chained tasks) it still fails with the same error, and the only fix then is to restart Celery and Redis.

There are two strange things I noticed:

  1. If I chain only a single task to the main task, it works fine no matter how many times I run it.
  2. If I chain two tasks to the main task, the error only shows for the first task, while the other linked tasks run normally.

This is how I am chaining the tasks:

link_tasks  = []
link_tasks.append(task1.signature(args=(...)))
link_tasks.append(task2.signature(args=(...)))

main_task.apply_async((...),
                      kwargs=dict(...),
                      task_id=str(task.reference_id),
                      queue=queue,
                      priority=priority,
                      link_error=[error_handler.s()],
                      delivery_mode=2,
                      link=link_tasks or None)

I tried the following suggestions from above:

  • Setting the Redis timeout.
  • Reverting to Celery 5.0.2.

However, the error still persists.

@wochinge
Contributor

wochinge commented Aug 2, 2022

We've been running into the same problem since adding Redis as a result backend. A fix, or guidance on how to fix this, would be much appreciated.

@hiimdoublej-swag

hiimdoublej-swag commented Aug 3, 2022

@wochinge Can your fix pass the tests?
I had a fix similar to yours, but it didn't pass some tests in threaded environments.
That probably suggests we shouldn't be reusing the pubsub object from redis-py.
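For context, a rough standalone illustration of why idle cmd=unsubscribe clients can pile up: in redis-py, a pubsub object keeps a dedicated connection until it is explicitly closed (or garbage-collected), even after unsubscribing. The channel name below is only an example:

import redis

r = redis.Redis()
p = r.pubsub()
p.subscribe('celery-task-meta-example')    # the pubsub object gets its own connection
p.get_message(timeout=1)
p.unsubscribe('celery-task-meta-example')  # shows up as cmd=unsubscribe in CLIENT LIST
# Until close() is called (or the object is collected), that connection
# stays open on the Redis side.
p.close()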

@hiimdoublej-swag

@uzi0espil
I'd think that if the revert to 5.0.2 didn't work for you, then this is a different bug. None of my reproducible cases occurred with 5.0.2.

@wochinge
Contributor

wochinge commented Aug 3, 2022

@hiimdoublej-swag I want to investigate them today 👍🏻

@wochinge
Contributor

wochinge commented Aug 3, 2022

@hiimdoublej-swag Here are my findings so far.

I think the tests are not failing because of the change, but rather because the test itself is rather flaky. I ran test_multithread_producer (one of the failing tests) locally and it sometimes passes and sometimes doesn't.

We also don't think it's a connection leak; rather, since this PR, Redis connections are no longer shared between threads/greenlets/eventlets. Especially when using gevent/eventlet, this causes a spike in connections due to the larger pool size.

I've also received the warning below, which in my opinion tells us that redis-py is not as thread-safe as it should be:

[Screenshot from 2022-08-03 showing the redis-py warning]

Ideally it shouldn't be a big deal if we have one or two connections per eventlet/gevent green thread, but we still can't explain the spikes in our production system by that 😬

@wochinge
Contributor

wochinge commented Aug 3, 2022

In our Grafana logs we saw a huge spike in Redis connections after the tasks were finished. The connections were only closed once the results were deleted from the result backend, after the default result expiry time (1 day). Our quick fix is now to use ignore_result=True for all tasks where we don't require the result.
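For reference, the workaround looks roughly like this (assuming an existing Celery app named app, as in the tasks.py reproducer above):

@app.task(ignore_result=True)
def fire_and_forget():
    # no celery-task-meta-<id> key is written to the result backend for this task
    ...

# or globally, for every task:
# app.conf.task_ignore_result = True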

@auvipy
Member

auvipy commented Aug 4, 2022

Can you help improve the flaky test?

@wochinge
Contributor

wochinge commented Aug 4, 2022

I think the issue is that the test spawns so many threads, which are sometimes scheduled by the OS in a non-ideal way, so waiting times can be quite large and the test takes more time than normal. In my opinion it's not the most pressing issue, as the @flaky decoration (with the timeout / re-run) seems to work reasonably well.

I think my personal take-aways from the investigation yesterday are:

  • It's correct that the Redis connection is no longer shared between threads/green threads
  • It's very weird that we observed a spike in connections after a result is published and that these connections persist until the result expires on the result backend

@wochinge
Contributor

wochinge commented Aug 4, 2022

@uzi0espil I think you're running into this issue: #6963

@hiimdoublej-swag

It's very weird that we observed a spike in connections after a result is published and that these connections persist until the result expires on the result backend

I think this should be what we're targeting to fix, instead of trying to share the connections?

@wochinge
Contributor

wochinge commented Aug 9, 2022

Agree! I created a draft PR with an integration test which (I think) reproduces the problem: #7685

@Avamander

Our quick fix is now to use ignore_result=True for all tasks where we don't require the result.

All my tasks are like that, but even then Redis memory eventually grows to the point where it becomes unusable.

If Redis is periodically flushed (some data is lost, so that's bad), then at some point Redis's maximum connection limit is reached. If that limit is raised, the machine runs out of ports it can use for establishing connections.

At the moment I have resorted to restarting the entire stack periodically, after five days or after a billion tasks.

@jobec

jobec commented Aug 9, 2022

I bumped into the same issue. Celery 5.2.1, Kombu 5.2.2, gevent 21.8.0.

One sort-of-workaround seems to be setting the timeout setting in redis.conf (see here), which eventually closes the stale connections from Redis's side.

As mentioned above, this is how I bypass it at the moment. It's not perfect, but it keeps things from piling up and collapsing...

@auvipy
Member

auvipy commented Feb 7, 2023

I would like more feedback on #8058

@auvipy auvipy modified the milestones: 5.3.x, 5.3 Feb 9, 2023
@auvipy auvipy closed this as completed Feb 9, 2023
@hzc989

hzc989 commented Jul 12, 2023

I would like more feedback on #8058

After upgrading to 5.3.1 and setting result_backend_thread_safe=true, it works for us.

P.S. We've tried 4.4.7 and 5.2.7; both failed with the same problem.
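For anyone landing here later, that setting can be applied like this (a minimal sketch; requires Celery 5.3 or newer):

from celery import Celery

app = Celery('tasks', broker='redis://', backend='redis://')
app.conf.result_backend_thread_safe = True  # share one backend (and its connection pool) across threads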
