InvalidResponse: Protocol Error: b'1' #6335
Thank you for the very detailed bug report. This exception does not get handled. @andymccurdy I'm not very familiar with the Redis protocol. Do you have a clue about what's going on here?
I can confirm that there is no error when the WSGI settings are changed to this (notice the single thread per process):
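The original snippet isn't shown above; a minimal sketch of what such a mod_wsgi configuration might look like (process group name, paths, and counts are placeholders, not the reporter's actual values):

# Hypothetical mod_wsgi daemon config: several processes, one thread each.
WSGIDaemonProcess myproject processes=8 threads=1 python-home=/opt/venv
WSGIProcessGroup myproject
WSGIScriptAlias / /srv/myproject/wsgi.py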
My guess is that this is a race condition. All Redis messages start with a control character that indicates what type of data comes next. For some reason, the stack trace above shows that the control character is not present. The only time I've seen this happen is when multiple threads are reading from the same socket at the same time. Keep in mind that while the connection pool is indeed threadsafe, the connections the pool hands out are not thread safe. This includes PubSub objects, as they hold a reference to a single connection that's in pubsub mode. When Celery enqueues a task, does it use a single pubsub connection? Multiple threads are likely enqueueing at the same time, and if Celery is enqueueing those tasks with a single pubsub connection without guarding access with a threading.Lock, that would lead to an error like this.
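For illustration, a minimal sketch (not Celery's actual code) of guarding one shared PubSub object with a lock, assuming redis-py; the channel name is made up:

# Serializing access to a single shared PubSub across threads.
import threading

import redis

client = redis.Redis()
pubsub = client.pubsub()
pubsub.subscribe('results')  # hypothetical channel name
pubsub_lock = threading.Lock()

def safe_get_message():
    # Without the lock, two threads can interleave reads on the same
    # socket and desynchronize the protocol stream, which is how a stray
    # byte like b'1' ends up parsed as a control character.
    with pubsub_lock:
        return pubsub.get_message(timeout=1.0)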
I'm not really familiar with Celery's internals, so there is nothing I can add to your response. But I'm fairly sure that this is a race condition and that it would be fixed by some kind of internal locking, because I have faced similar problems (nonsensical responses from the server) while working with sockets in multiple threads, and fixed them this way. Is there any workaround in the meantime, so that I could use WSGI's multi-threading? The single-threaded setup is performing poorly.
This is how we subscribe to a channel:
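For context, a sketch of what a channel subscription looks like with redis-py; the key format below mimics Celery's result keys but is an assumption, not quoted source:

import redis

client = redis.Redis()
pubsub = client.pubsub()
# Illustrative key; Celery's Redis result backend uses per-task keys.
pubsub.subscribe('celery-task-meta-<task_id>')

for message in pubsub.listen():
    if message['type'] == 'message':
        print(message['data'])
        break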
How do
@thedrow ah, that makes sense then and explains this issue. Two threads are calling into the same pubsub connection at the same time.

You could use a thread local as #5145 suggests. Remember that each thread will have its own socket connection, which matters if Celery still puts all those file descriptors in a single list to wait on.

An easier way might be to keep the single consumer instance and add a lock that guards every read on its connection.

Alternatively, you could abandon pubsub completely and replace it with Redis streams (added in Redis 5.0). Streams are a time-ordered data structure that holds messages. They also remember which consumers have read which messages. They're far more durable than pubsub: if a connection is severed, the client can pick up where it left off with no message loss.
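A minimal sketch of the streams alternative using redis-py; the stream, group, and consumer names are invented for illustration:

import redis

client = redis.Redis()

# Producer side: append a result message to a stream.
client.xadd('task-results', {'task_id': 'abc123', 'state': 'SUCCESS'})

# Consumer side: a consumer group remembers which entries each consumer
# has read, so a reconnect resumes without message loss.
try:
    client.xgroup_create('task-results', 'result-readers', id='0', mkstream=True)
except redis.ResponseError:
    pass  # group already exists

entries = client.xreadgroup('result-readers', 'consumer-1',
                            {'task-results': '>'}, count=10, block=1000)
for stream, messages in entries:
    for message_id, fields in messages:
        client.xack('task-results', 'result-readers', message_id)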
We are going for a new implementation of the Redis broker, but in the meantime we need to support the broken one. @andymccurdy You make good points. Thank you.
@thedrow The ConnectionPool is thread-safe. The connections the pool hands out are not. Making connections thread safe would incur a massive performance penalty: you'd either have to synchronize the entire request/response round trip (which would block all other threads while waiting for the socket), or synchronize each send/read and keep track of the order in which each thread made a request so that responses can be distributed to the appropriate thread. In both cases it's far more performant for each thread to have its own connection. The docs (last paragraph of this section) explicitly state that PubSub objects are not thread safe.
I encountered this same issue with the Redis result backend, sometime after completing a large number of tasks, when my API queries task results. Thought I would chime in that I can't reproduce this when load testing with hundreds of concurrent threads after downgrading these dependencies:
I may attempt to introduce a global lock when retrieving results (per the previous comment) in my gunicorn process so I can stick with Celery 4.x until this is fixed.
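A sketch of that stopgap, with the lock scope and timeout invented for illustration:

import threading

from celery.result import AsyncResult

# One process-wide lock so only one thread at a time touches the shared
# result-backend connection (trades throughput for correctness).
result_lock = threading.Lock()

def get_task_result(task_id, app):
    with result_lock:
        return AsyncResult(task_id, app=app).get(timeout=10)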
Hi. I encountered this same issue with the Redis result backend. Most of the time starting a task works fine, but sometimes I get this error. If I understand correctly, these are similar errors. My error:
Yes, that's correct. If someone can, please do.
@thedrow I guess that would have a massive impact on performance (in environments with a high request count per second), as @andymccurdy said, because the lock would block all other threads from accessing Redis until the entire request/response cycle finishes for the one active thread.
@thedrow @HosseyNJF I'm sorry, but I didn't really understand.
In that case we should test the patch we do have and complete it.
@mskogorevrmc If you want to work around this bug temporarily, just set the WSGI settings to use a single thread per process, as described earlier in this thread.
I changed the WSGI settings, but I still get errors. My WSGI configuration:

I currently use the following versions of packages:
Errors:
When I use RabbitMQ as a broker, I get these errors:
Could someone tell me if there is a fix?
Also, I use the mpm_event module for Apache; can this cause problems with threads?
I'm not familiar with that module, but a quick Google search shows that it shouldn't.
When reading the Apache docs, it seems that requests are served by multiple threads. @mskogorevrmc can you check using the prefork module [1]? Please let us know whether it helps.
@matusvalo With the prefork module for Apache, I haven't gotten errors for about a week, but it uses a lot of resources. Are there any plans to fix this?
#6416 was fixed and merged to master. @mskogorevrmc could you test your bug against master of Celery to check if it is fixed for you?
I've had a similar issue with
Same problem when using Celery with Tornado. I know it is not a recommended way to make it async. It works fine when there is no concurrency, but the error appears when using benchmark tools. It still happens after upgrading to 5.0.4. My current solution is to run an individual process as the executor: ProcessPoolExecutor is OK, but ThreadPoolExecutor still hits this error. Just something I tried; I hope it helps.
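A sketch of that workaround (not the reporter's exact code): dispatching Celery work from a Tornado handler through a ProcessPoolExecutor so each worker process owns its own broker/backend connections. The module 'myapp.tasks' and the task 'add' are hypothetical names:

from concurrent.futures import ProcessPoolExecutor

from tornado.ioloop import IOLoop
from tornado.web import Application, RequestHandler

executor = ProcessPoolExecutor(max_workers=4)

def dispatch_and_wait(x, y):
    # Runs in a child process, so the Redis connection is never shared
    # with the Tornado threads.
    from myapp.tasks import add  # hypothetical Celery task
    return add.delay(x, y).get(timeout=10)

class AddHandler(RequestHandler):
    async def get(self):
        result = await IOLoop.current().run_in_executor(
            executor, dispatch_and_wait, 2, 3)
        self.write({'result': result})

app = Application([(r'/add', AddHandler)])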
@kkito Can you please provide a minimally reproducible test case?
Here is a demo: https://github.com/kkito/celery-tornado-intergration @thedrow
I have checked the reproducer and the code, and here are my findings:
# file celery/app/base.py
import contextvars  # needed by the patch; shown here for completeness

class Celery:
    # code skipped
    def __init__(self, main=None, loader=None, backend=None,
                 amqp=None, events=None, log=None, control=None,
                 set_as_current=True, tasks=None, broker=None, include=None,
                 changes=None, config_source=None, fixups=None, task_cls=None,
                 autofinalize=True, namespace=None, strict_typing=True,
                 **kwargs):
        # Per-context storage so each thread gets its own oid and backend
        # instead of sharing a single connection across threads.
        self.context_oid = contextvars.ContextVar('context_oid')
        self.context_backend = contextvars.ContextVar('context_backend')
        # code skipped

    @property
    def thread_oid(self):
        """Per-thread unique identifier for this app."""
        # The var is never set in this snippet, so the default is
        # evaluated on each access.
        return self.context_oid.get(oid_from(self, threads=True))

    # code skipped

    @property
    def backend(self):
        """Current backend instance."""
        return self.context_backend.get(self._get_backend())

    # code skipped

The proposed solution is just a workaround and needs to be verified in depth to check that nothing else breaks (e.g. multithreaded/multiprocessed calling). @thedrow, does Celery support AsyncIO? If not, this issue should still be fixed, but as part of a bigger effort: migrating the whole Celery stack to AsyncIO.
Celery currently does not support asyncio. |
Can I use it like kkito/celery-tornado-intergration@7e6eb39?
Theoretically yes. I tested it and the testing script worked with this fix. BUT recent versions of Tornado are built on top of the AsyncIO loop (as far as I understand), and the Celery stack is not adjusted for the AsyncIO stack, so there can be other problems, like crashes in other corner cases or performance degradation. You can use it, but you need to understand what you are doing :-)
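As an aside, a minimal standalone sketch (not Celery code) of the contextvars behavior the patch above relies on: each thread runs in its own context, so a value set in one thread is invisible to the others.

import contextvars
import threading

var = contextvars.ContextVar('var')

def worker(name):
    var.set(name)                # stored in this thread's own context
    print(name, '->', var.get())

threads = [threading.Thread(target=worker, args=(f'thread-{i}',))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print('main ->', var.get('unset'))  # the main thread never saw the sets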
Checklist
- I have verified that the issue exists against the master branch of Celery. (I cannot test this because it's a production server)
- I have read the relevant section in the contribution guide on reporting bugs.
- I have checked the issues list for similar or identical bug reports.
- I have checked the pull requests list for existing proposed fixes.
- I have checked the commit log to find out if the bug was already fixed in the master branch.
- I have included all related issues and possible duplicate issues in this issue (If there are none, check this box anyway).

Mandatory Debugging Information
- I have included the output of celery -A proj report in the issue. (if you are not able to do this, then at least specify the Celery version affected).
- I have verified that the issue exists against the master branch of Celery. (I cannot test this because it's a production server)
- I have included the contents of pip freeze in the issue.
- I have included all the versions of all the external dependencies required to reproduce this bug.

Optional Debugging Information
- I have tried reproducing the issue on more than one Python version and/or implementation.
- I have tried reproducing the issue on more than one message broker and/or result backend.
- I have tried reproducing the issue on more than one version of the message broker and/or result backend.
- I have tried reproducing the issue with autoscaling, retries, ETA/Countdown & rate limits disabled.
- I have tried reproducing the issue after downgrading and/or upgrading Celery and its dependencies.
Related Issues and Possible Duplicates
Related Issues
Possible Duplicates
Environment & Settings
Celery version: 4.4.7 (cliffs)
celery report Output:

Steps to Reproduce
Required Dependencies
Python Packages
pip freeze Output:

Other Dependencies
I'm using Celery with Django, placed behind an Apache2 server, using mod_wsgi. The entries for WSGI are:
Minimally Reproducible Test Case
Just use Celery with Redis and Django behind Apache2, in multi-threaded mode, as in the sketch below.
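A hypothetical Django view reproducing the setup: each request enqueues a task, so under a multi-threaded mod_wsgi daemon concurrent requests exercise the shared backend connection. 'myapp.tasks.my_task' is a placeholder name:

from django.http import JsonResponse

from myapp.tasks import my_task  # hypothetical Celery task

def start_task(request):
    result = my_task.delay()
    return JsonResponse({'task_id': result.id})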
Expected Behavior
I expect Celery to work every time I run a task.
Actual Behavior
Most of the time starting a task works fine, but sometimes I get this error (or something similar with different values inside ProtocolError):
Full stacktrace here