Race condition in Consul result backend when saving a result #5605
@ShaheedHaque Basically, if you are using the prefork pool, which is the default for Celery, it forks worker processes instead of using threads to get past the limitations of the Python GIL. This means that if you are running on a multi-core CPU (you probably are), having more than one worker process means results may be written in parallel while tasks are being consumed. I would have to know more about the setup, but if your python-consul library is only listening in one spot for those responses and assumes a linear order of execution, that could be the issue. |
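For concreteness, a minimal sketch of how pool and concurrency are pinned in Celery configuration (app name and URLs illustrative):

```python
# Sketch only: with the default prefork pool, concurrency means N forked
# child *processes* (not threads), each able to store results concurrently.
from celery import Celery

app = Celery('proj', broker='amqp://', backend='consul://localhost:8500/')
app.conf.worker_pool = 'prefork'     # the default pool implementation
app.conf.worker_concurrency = 4      # four forked worker processes
```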
Hi @matteius, I am indeed using the default prefork pool. I have taken a cursory look at the Celery code, but could not easily see where the result backend code is called from, so it is unclear to me whether there are multiple ports involved (i.e. one per forked worker) or just one (i.e. in the master process). However, my suspicion is that in either event, the result backend is doing both "set" operations to store results and "delete" operations to expire them. The suspicion is based on instrumenting the Consul backend code, and seeing that the "set" code, which expects to have created a Consul session and to receive a dictionary response describing the newly created session, something like this:
is actually sometimes getting a simple boolean response. What can return such a result? The Consul HTTP API call that returns a boolean is the "delete". The "set" and "delete" code paths both appear to be called, and the evidence is clear that the various responses are getting mixed up: for example, I see instances where the code attempts to get the ['ID'] of a boolean! From the way the backend code is written with inline callbacks, the only way I can see this happening is if a given listener is threaded and the session context is getting confused. But before I plunge into the deep end and propose a fix for the Consul library's session handling, I am looking for confirmation that my suspicions match Celery's usage of its backend. |
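To make the two response shapes concrete, here is a sketch of the underlying Consul HTTP API calls (host/port and key illustrative; shapes per the Consul HTTP API documentation):

```python
# Illustration of the two Consul HTTP API calls whose responses appear
# to be getting crossed.
import requests

base = 'http://localhost:8500/v1'

# Creating a session returns a JSON object describing it:
resp = requests.put(f'{base}/session/create', json={'TTL': '60s'})
print(resp.json())    # e.g. {'ID': '4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4'}

# Deleting a key returns a bare JSON boolean:
resp = requests.delete(f'{base}/kv/celery-task-meta-example')
print(resp.json())    # True
```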
Which backend are you using for the broker, i.e. RabbitMQ, Redis, SQS, or something else? |
RabbitMQ. |
In that case, every task that gets a result must define a unique response queue, which Celery handles internally -- is Consul reading from the queue for both expired results, or just using the result.get() method? I read somewhere here recently that you have to call .get() or .forget() on every result promise object when using the results backend, otherwise it can lead to something possibly like what you are describing. It can happen if your code only cares about the results of some tasks. |
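For illustration, the "settle every result promise" pattern being described might look like this (the task module is hypothetical):

```python
# Illustrative only: settle every AsyncResult, either by fetching the
# value or by explicitly discarding it with forget().
from myapp.tasks import add   # hypothetical task

result = add.delay(2, 2)      # returns an AsyncResult promise
try:
    value = result.get(timeout=10)   # fetch the stored result
finally:
    result.forget()                  # drop the backend entry either way
```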
For clarity, I'm just a user; I didn't write either the Consul result backend or the python-consul package. That said, my code does wait for and check every result. However, I don't understand what any of that has to do with the problem I described, or the clarification I am seeking. All I need to know is whether, when Celery invokes the backend to save results, it uses multiple threads. The reason for asking is that I am seeing errors which can be explained if threads are in use, because the python-consul package is not thread-safe. |
In the future, please use the template for bug reports. Do you have a test case? |
@ShaheedHaque As I already stated above, if you are using the default prefork pool, which you have said you are, then Celery is invoking multiple processes and not multiple threads. |
@thedrow Apologies, I did not consider this to fit any of the templates, and specifically I did not consider this a bug report against Celery. Here is the template anyway...
Mandatory Debugging Information
Optional Debugging Information
Related Issues and Possible Duplicates
Related Issues
Possible Duplicates
Environment & Settings
Celery version:
|
I now believe this is an issue in Consul itself. I'll leave this open to report on progress. |
OK, I believe there is a bug in Celery here too. When these errors occur (which as I say I currently believe to be caused upstream), the celery worker process gets "stuck" with the job in hand:
until all workers are stuck like this. Here is an example of what Celery reports:
My guess is that the Celery worker, having caught this error, does not go on to clean up as expected. |
Filed #5642. |
I finally have what I think is a smoking gun pointing to an issue in Celery. My setup is much as before though with versions as follows:
and using, as before, the Consul result backend. I modified the Celery code slightly to emit these logs:
However, it is probably easier to see what is going on by looking at the attached wireshark packet capture.
Entries 8953 and 8963 seem to be a request/response pair between Consul on port 8500 and the instance of Celery running on port 50268 (which I take to be ForkPoolWorker-27). Note the request is:
and the response is encoded as application/json with content:
Note that
which successfully stored the result of an earlier task. The obvious implication is that, after initiating the storing of the result for celery-task-meta-4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4, Celery did not wait for the acknowledgement. Does that make sense? |
Just to refine the previous hypothesis based on looking into how Celery uses python-consul:
Thus, once the Celery worker thinks it has finished sending the task result to the backend, even though the associated callback has not fired, it can pick up the next task. Since the next task uses the same consul.std.Consul() and thereby the same requests.session(), the stage is set for the scenario I see. As a quick hack mostly intended to minimise the source changes, I am trying this edit:
where each reference to the shared client is replaced by a freshly constructed one. An initial run looks promising; I will report back after further runs. |
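In outline, the hack amounts to something like the following minimal sketch (not the actual diff; class and method names are hypothetical):

```python
# Minimal sketch of the hack: build a fresh python-consul client, and
# hence a fresh requests session, for every backend operation instead
# of sharing one client across operations.
import consul

class OneClientPerRequest:
    def __init__(self, host='localhost', port=8500):
        self.host, self.port = host, port

    def client(self):
        # A new client per call means a new connection per request, so a
        # response can never be attributed to the wrong operation.
        return consul.Consul(host=self.host, port=self.port)

    def set(self, key, value):
        return self.client().kv.put(key, value)    # returns a boolean

    def get(self, key):
        _, data = self.client().kv.get(key)        # returns (index, data)
        return data

    def delete(self, key):
        return self.client().kv.delete(key)        # returns a boolean
```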
Multiple test runs on Linux and Mac OSX have not shown the problem after the hack in my previous comment was applied. |
You could create a client per thread using local thread storage. |
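For concreteness, a minimal sketch of that client-per-thread suggestion:

```python
# Sketch: one python-consul client per thread via thread-local storage.
import threading
import consul

_local = threading.local()

def get_client():
    # Each thread lazily constructs and reuses its own client.
    if not hasattr(_local, 'client'):
        _local.client = consul.Consul()
    return _local.client
```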
@thedrow As far as I know, that won't help because the problem is visible when Celery is not using threads. I think the correct approach is to create one client per request (which the above hack achieves in a fragile fashion). |
But that means we have to connect each time we make a request. |
We discussed this briefly at the core contributors meeting today and I wanted to chime in with some thoughts and questions for @ShaheedHaque.
@ShaheedHaque - do you think you might be able to get some information about the return value from that put() call? One other option would be to try using one of the other Consul client implementations rather than python-consul. |
@maybe-sybr Thanks for your analysis. The bad news is that I have seen no evidence of anything like an interrupted syscall for either the set() or the put(). The callbacks always get a well-formed JSON response (either the right one or the wrong one). However, I think I may have found an explanation for the race... according to the documentation, a requests.Session:
I certainly was not expecting any pooling...presumably that is what is giving rise to the race. Assuming this is the root cause, a bit of poking around the API documentation suggests that there might be a way to nobble the pool using something called an "HTTPAdapter" (it looks a little unpleasant...but better than the present hack). |
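Roughly, disabling the pool via an HTTPAdapter looks like this (a sketch; as the next comment shows, it did not cure the race):

```python
# Sketch: trying to neutralise requests' connection pooling by mounting
# an HTTPAdapter with a single, blocking pool slot.
import requests
from requests.adapters import HTTPAdapter

session = requests.session()
session.mount('http://', HTTPAdapter(pool_connections=1,
                                     pool_maxsize=1,
                                     pool_block=True))
```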
Well, either the pooling is not the problem, or my attempt to disable it was wrongly conceived:
I'm stumped. |
It seems like python-consul is unmaintained. @ShaheedHaque Can you please uninstall python-consul and install python-consul2? |
The unit tests pass with python-consul2. @ShaheedHaque Please test it and see if you can reproduce the problem. |
python-consul has not been maintained in a long while now. python-consul2 is a maintained fork of the same package. Ref #5605.
I confirm that switching to python-consul2 shows exactly the same behaviour:
Nevertheless, given the support situation, switching to it is clearly a good step to take. |
As per the discussion in #5642, if there is interest in incorporating the above workaround (on the basis that correctness trumps performance), I'd be happy to provide a PR. It is perhaps worth saying that we have run the workaround extensively in test and in production, and have not seen a single instance of this error with the workaround in place. We do see very occasional errors connecting to Consul (possibly as a result of more frequent connection attempts), but at least those result in a very clear connection error rather than this very murky issue. I plan to submit a PR to python-consul2 to add support for connection retries (which it currently disables), which further reduces this uncommon knock-on effect. |
I've submitted the PR to python-consul2 to add support for connection retries (which it currently disables), compensating for the knock-on effect of the workaround (i.e. more connection attempts): poppyred/python-consul2#31. |
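For reference, enabling retries at the requests/urllib3 layer can look like the following sketch (the actual change lives in the linked PR):

```python
# Sketch: retry transient failures (including HTTP 429) with backoff.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=3, backoff_factor=0.2,
              status_forcelist=(429, 500, 502, 503))
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
```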
Since there has been no progress on a refined solution, I'd like to propose a PR based on my fix at #5605 (comment). This has been extensively tested by now, and works perfectly. As per the discussion following the above comment, there is a connection per request (which nobody is keen on), but at least the code works reliably. I'll put one together later for your consideration. |
…6823) * As per #5605, the Consul backend does not cleanly associate responses from Consul with the outbound Celery request that caused them. This leaves it prone to mistaking the (final) response from an operation N as the response to an (early) part of operation N + 1. This change fixes that by using a separate connection for each request. That of course has the downsides of (a) being relatively expensive and (b) increasing the rate of connection requests into Consul: the former is annoying, but at least the backend works reliably; the latter can cause Consul to reject excessive connection attempts, but if it does, at least it returns a clear indication of this (IIRC, it responds with an HTTP 429 "too many connections" indication). Additionally, this issue can be ameliorated by enabling retries in python-consul2 (which I believe should be turned on regardless, to handle transient network issues); that is addressed by the PR at https://github.com/poppyred/python-consul2/pull/31. Note that we have never seen (b) outside a test specifically trying to hammer the system, but we see (a) all the time in our normal system tests. To opt out of the new behaviour, add the parameter "one_client=1" to the connection URL. * Increase code coverage. * Rewrite the Consul backend documentation, and describe the options now available.
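With that merged, choosing the behaviour is a one-line configuration choice (host/port illustrative):

```python
# Per-request clients are the new default; one_client=1 restores the
# old single shared client, as described in the commit message above.
result_backend = 'consul://localhost:8500/'
# result_backend = 'consul://localhost:8500/?one_client=1'
```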
Should we actually close this with the merge of #6823? We have a workaround, but not a fix that we really understand. That said, I'm not confident we'll ever spend much more time working out what's going on here. It seems like we can only really chalk it up to weirdness in an upstream project, and probably not a great use of our time. As long as we can make the workarounds (the per-request clients and the one_client trapdoor) discoverable for others who might hit this issue, perhaps leaving this closed is fine. I think @ShaheedHaque has done a good job on the documentation in #6823, so I'd be happy to leave this closed rather than have it linger on the backlog forever. |
I'm encountering an issue using the Consul result backend which I suspect is NOT a Celery issue as such, but rather an issue with the lack of thread safety in the library the backend uses.
The symptoms are consistent with confusion between the sorts of responses that Consul gives the library when creating saved results and when expiring them. So, if Celery both writes and expires saved results using threads, the lack of thread-safety in the python-consul library could produce the results described.
Can somebody confirm if Celery access to a results backend involves multiple threads?