Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356
Comments
Thanks for the report and fixes Russel! Could you also provide the tests for your patches?
Not sure how to go about an automated test for these, as on both ends, the test involves running a version of Celery that isn't the version being tested. Any tips/suggestions appreciated.
@thedrow any insights?
NB: This can also happen with a 4.1.0 producer and 3.1.25 consumer, where the consumer calls I manually applied #4357 to my production system (eek!) and it appears to be working.
Great!
…ersions (#4358)
* handle "hybrid" messages which have passed through a protocol 1 and protocol 2 consumer in its life. we detected an edgecase which is proofed out in https://gist.github.com/ewdurbin/ddf4b0f0c0a4b190251a4a23859dd13c#file-readme-md which mishandles messages which have been retried by a 3.1.25, then a 4.1.0, then again by a 3.1.25 consumer. as an extension, this patch handles the "next" iteration of these mutant payloads.
* explicitly construct proto2 from "hybrid" messages
* remove unused kwarg
* fix pydocstyle check
* flake8 fixes
* correct fix for misread pydocstyle error
The issue referenced was [resolved](celery/celery#6374), but it actually does not apply to the bit of code it's pointing to. Fixes #4649 I tried removing the code referenced by the comment, and it causes the test `tests/contrib/celery/test_integration.py::CeleryDistributedTracingIntegrationTask::test_distributed_tracing_propagation_async` to fail. As far as I can tell from about an hour of investigation, the celery [fix](celery/celery#4356) only affects "hybrid messages" (those that blend protocol 1 and 2 syntax as part of a rolling Celery upgrade), which our test suite doesn't use.
If you have a Celery 3.1.25 deployment involving many workers and you want to upgrade to Celery 4, you may wish to do "canary" testing on a limited subset of workers to validate that the upgrade won't introduce any problems before upgrading your entire worker fleet to Celery 4. This "canary" mode involves having both Celery 3.1.25 and Celery 4 workers running at the same time.
However, if you do this and you have tasks that retry, you will run into problems when a task is attempted on a Celery 3.1.25 node, then a Celery 4 node, and then a Celery 3.1.25 node again.
When a task from Celery 3.1.25 is executed on a Celery 4 worker, the task message is upgraded to Protocol 2. However, the upgrade results in a "hybrid" message that complies with both formats. When the task fails and is retried on the Celery 3.1.25 worker, the hybrid message is mis-identified as a Protocol 1 message, resulting in a hard crash and message loss.
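To make the ambiguity concrete, here is a minimal sketch of how a message carrying both sets of markers can be classified differently depending on which marker a consumer looks at first. It relies only on the documented difference between the two protocols (Protocol 1 keeps the task name and metadata in a dict body, Protocol 2 keeps them in the message headers with an `(args, kwargs, embed)` body). The function names, the task name `proj.add`, and the exact shape of the `hybrid` payload are assumptions for illustration; this is not Celery's actual detection code, and the real payload is the one shown in the gist.

```python
from collections.abc import Mapping


def detect_body_first(headers, body):
    """Hypothetical detector: look for Protocol 1 markers in the body first."""
    if isinstance(body, Mapping) and "task" in body:
        return 1
    if "task" in headers:
        return 2
    return None


def detect_headers_first(headers, body):
    """Hypothetical detector: look for Protocol 2 markers in the headers first."""
    if "task" in headers:
        return 2
    if isinstance(body, Mapping) and "task" in body:
        return 1
    return None


messages = {
    # Protocol 1: everything lives in a dict body, headers carry no task name.
    "proto1": ({}, {"task": "proj.add", "id": "abc123",
                    "args": [2, 2], "kwargs": {}, "retries": 0}),
    # Protocol 2: metadata in headers, body is (args, kwargs, embed).
    "proto2": ({"task": "proj.add", "id": "abc123", "retries": 1},
               ([2, 2], {}, {})),
    # Assumed hybrid shape: Protocol 2 style headers AND a Protocol 1 style body.
    "hybrid": ({"task": "proj.add", "id": "abc123", "retries": 2},
               {"task": "proj.add", "id": "abc123",
                "args": [2, 2], "kwargs": {}, "retries": 1}),
}

# Map each message name to (body-first verdict, headers-first verdict).
results = {
    name: (detect_body_first(headers, body), detect_headers_first(headers, body))
    for name, (headers, body) in messages.items()
}

for name, verdicts in results.items():
    print(name, verdicts)
```

The two pure messages get the same answer from both detectors, while the hybrid one does not. The headers-first variant is included only for contrast and should not be read as a description of the patches discussed in this thread.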
Checklist
* I have included the output of `celery -A proj report` in the issue.
* I have verified that the issue exists against the `master` branch of Celery.
Steps to reproduce
A full reproduction case can be found in this gist:
https://gist.github.com/ewdurbin/ddf4b0f0c0a4b190251a4a23859dd13c
In local testing, the following two versions were used:
### Celery 3.1.25:
### Celery 4.1.0:
Although these test results were obtained on a Mac running Sierra, the problem has also been observed in production on AWS EC2 Linux machines.
Expected behavior
A task should be able to move back and forth between a 3.1.25 worker and a 4.1.0 worker without any problems.
Actual behavior
A task can be executed on a Celery 3.1.25 worker, then on a Celery 4.1.0 worker; but when the task is then run on a Celery 3.1.25 worker again, the following error is produced:
This kills the worker, and the task message is lost.