APM agent crashes in Celery at random times during the day #823
Hmmm, I wonder how we're managing to send an invalid gzip header? Very odd. We'll see if we can track this down next week. Can you do a …
Here you are:
I wonder if #783 fixed this? Can you try upgrading to 5.6.0 and see if the crashes go away?
Sure, just did, and I'll let you know.
The crash is still happening randomly. Since I enabled it last night I've gotten around 50 crashes. I do get a few more errors since I deployed 5.6.0. I'm getting lots of Celery data in APM, so I assume everything is configured correctly. Here are the multiple crashes I have in Sentry:
I'm pretty certain these errors are around remote config, not normal data sending to the APM Server. When you say it crashes, does the celery worker actually stop or restart? Or do you just see an error log? Crashing is definitely not behavior we want; I'm going to figure out how to make these errors less intrusive and more useful. As a band-aid fix, assuming you're not actually using the remote config feature, you can disable it by setting `central_config` to `false`.
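In a Django setup that's a single key in the `ELASTIC_APM` settings dict; a minimal sketch (the service name is a placeholder):

```python
# settings.py -- only the CENTRAL_CONFIG key matters here; the
# service name is a placeholder.
ELASTIC_APM = {
    "SERVICE_NAME": "my-service",
    "CENTRAL_CONFIG": False,  # stop polling the APM Server for remote config
}
```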
One other question: what are you using for your Celery pool? Are you using …?
We get a crash report in Sentry, but I'm not sure what is actually crashing. I'm adding two more errors I get; maybe it helps. We're using …
No, we're not using central_config; disabled it as suggested.
Are there any errors present in the APM Server logs? I wonder if it's overloaded or needs to be tuned. Those errors do not appear to be anything on the agent side; it could be network issues or APM Server load issues. I'm still a bit mystified about the 400 errors that you're seeing. I'm going to do some more digging.
Do you have any sort of proxy that could be modifying headers? The strangest piece of those errors is the fact that they're intermittent. We gzip-compress all data that we send to the APM Server to reduce payload sizes, and I can't imagine why most data would be getting through fine but some would have invalid headers...
No, nothing in the logs. It can't be the network: on this particular server I installed the APM Server locally. Celery runs on the same server as the APM Server, and I'm using the minimal default settings in Django to get APM working:
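The settings block itself didn't survive, but a "minimal default" Django setup presumably looks roughly like this (SERVICE_NAME and SERVER_URL are placeholders):

```python
# settings.py -- a sketch of a minimal elastic-apm Django configuration;
# the service name and server URL are placeholders.
INSTALLED_APPS = [
    "elasticapm.contrib.django",
    # ...the rest of the project's apps
]

ELASTIC_APM = {
    "SERVICE_NAME": "my-service",
    "SERVER_URL": "http://localhost:8200",  # APM Server running locally
}
```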
On the timeouts, it may be that your APM Server is a touch overloaded and thus responding slowly. You can modify those timeouts with the `server_timeout` setting. I'm still trying to track down the source of the gzip errors. One of the APM Server devs pointed out that golang will throw that error in two situations:
I haven't been able to figure out how we would botch either of those situations, especially inconsistently like you're seeing. It may be a bug in Python's gzip library... anyway, I'm still digging.
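Raising the timeout from the Django side is a one-line change; a sketch using the 10s value mentioned below (service name is a placeholder):

```python
# settings.py -- sketch; "10s" matches the value tried later in this thread.
ELASTIC_APM = {
    "SERVICE_NAME": "my-service",
    "SERVER_TIMEOUT": "10s",
}
```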
This specific server is massively over-provisioned, with a load average consistently below the available cores, so I think it's strange that it's a server timeout. I increased the timeout to 10s and will let you know if anything changes.
Maybe all these issues are related to Celery workers exiting when the apm agent wasn't expecting it? Here's another crash from today.
That could very well be the case. I could see that causing most of the issues that you're seeing -- in fact, if the celery worker shuts down while a message is being written to the gzip buffer in memory, it's possible that could be causing the corruption that results in the original error you reported. Perhaps it would be worth turning off …
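Both failure modes described here are easy to reproduce outside the agent; a standalone sketch (not agent code) of what an abandoned or corrupted gzip buffer looks like:

```python
import gzip
import io

# Simulate an in-memory gzip buffer being abandoned mid-write.
buf = io.BytesIO()
writer = gzip.GzipFile(fileobj=buf, mode="wb")
writer.write(b'{"span": {"name": "demo"}}')
# If the worker dies here, writer.close() never runs, so the buffered
# compressed bytes and the gzip trailer (CRC32 + length) are lost.
truncated = buf.getvalue()
try:
    gzip.decompress(truncated)
except EOFError as exc:
    print("truncated stream:", exc)

# Corruption at the start of the buffer instead breaks the gzip magic
# bytes, which is the "invalid header" flavour of error.
corrupted = b"\x00" + gzip.compress(b"payload")[1:]
try:
    gzip.decompress(corrupted)
except gzip.BadGzipFile as exc:  # Python 3.8+; a plain OSError before that
    print("corrupted header:", exc)
```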
Disabled; I'll let you know tomorrow.
It looks like the worker theory is correct. I'm left with one crash in Sentry, related to the size of the event, that happens between 1 am and 2 am and that I hadn't even noticed before. We're inserting around 700k new rows into a table between 1 am and 2 am. Is there a smart way to prevent sending this huge SQL to the APM Server?
Yes, we should definitely truncate that. Can you create a separate issue for that SQL truncation? I want to keep this issue focused on making the transport shut down more gracefully.
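Until agent-side truncation lands, one application-side workaround would be a custom processor that caps statement size before events are sent. A hedged sketch, not the agent's eventual fix; the length cap, function name, and module path are made up:

```python
# myapp/apm_processors.py -- illustrative only; MAX_STATEMENT_LEN and
# this module path are invented for the sketch.
from elasticapm.conf.constants import SPAN
from elasticapm.processors import for_events

MAX_STATEMENT_LEN = 10_000

@for_events(SPAN)
def truncate_sql(client, event):
    # Clip oversized SQL in db spans before the event is queued for sending.
    db = event.get("context", {}).get("db", {})
    statement = db.get("statement")
    if statement and len(statement) > MAX_STATEMENT_LEN:
        db["statement"] = statement[:MAX_STATEMENT_LEN] + "..."
    return event
```

Registering it would go through the agent's `processors` setting; overriding that list likely replaces the default sanitizing processors, so the stock entries would need to be re-listed alongside it.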
Opened #827
Opened #838 to address these errors and make them more descriptive.
* Catch and log any exceptions while closing the transport

  Previously these would show up as big tracebacks in the logs which was ugly, considering all sorts of things can go wrong when suddenly closing the transport (such as when a celery worker terminates)

  Fixes #823

* Remove trailing comma

  I don't know how this happened, I think black added it? But then it didn't re-add it this time when I removed it. Probably more of black's magic comma stuff that has been added recently

* Add to changelog
I'm not sure how to reproduce, but I've had this issue in production for several months. The crash happens randomly during the day, 20-50 times (we run 400k+ tasks per day).
The crash is:
Environment: