# First exception not flushed on Celery+Python until second occurrence #687
How are you launching the celery workers? Is there anything special about the way you are configuring celery?

---
Thanks for the follow-up @untitaker: nothing special. I can reproduce the issue 100% of the time locally, and I launch it with

In production it runs in a Docker container.

---
I'm also seeing this issue. It appears to happen when an error is logged in the celery worker's main process before celery creates the process pool child processes. I can reproduce it with this code:

```python
# tasks.py
import logging
import time
from secrets import token_hex

from celery import Celery
import sentry_sdk

sentry_sdk.init(dsn="CHANGE_ME", debug=True)
app = Celery(backend="rpc://")

logger = logging.getLogger(__name__)
logger.error(f"SETUP {token_hex(2)}")


@app.task
def my_task():
    logger.error(f"MY_TASK {token_hex(2)}")
    time.sleep(1)
```

To reproduce:
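One way to exercise this, assuming the module above is saved as `tasks.py` (a sketch; the exact commands may differ from the original steps):

```python
# Hedged sketch: trigger my_task on a running worker.
# Assumes the module above is saved as tasks.py and a worker was started
# against it, e.g. with:  celery -A tasks worker --loglevel=info
from tasks import my_task

result = my_task.delay()  # executes my_task inside a pool worker process
print(result.id)          # the MY_TASK error is logged in that worker's process
```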
When the worker first starts, we see in the logs:

The "Sending event" line indicates the SETUP error was sent to Sentry. When the task is submitted to the worker (step 4 above), we see in the worker logs:

There is no "Sending event" line, so the MY_TASK error was not sent. When we kill the celery worker, we see in the warm shutdown logs:

This indicates the error is dropped and so never gets reported to Sentry. The fact it says "2 event(s) pending on flush" is interesting, because there is actually only one event that didn't get sent. Note that if we submit two events (and they are handled by the same pool worker process) we see:
This matches what @fellowshipofone reported: the first error is not processed until there is a 2nd occurrence. Unfortunately this could be a long time after the first error is logged, or never, if the pool worker is terminated before a 2nd error is logged. This is especially a problem when using celery's autoscale feature, as worker processes will be terminated shortly after the task is complete.

I did a little bit of debugging and found that if I put log messages before and after `self.not_empty.wait()`:

```python
while not self._qsize():
    logger.warning("before self.not_empty.wait()")
    self.not_empty.wait()
    logger.warning("after self.not_empty.wait()")
```

and put a sleep and a log before `self.not_empty.notify()`:

```python
sleep(1)
logger.warning("before self.not_empty.notify()")
self.not_empty.notify()
logger.warning("after self.not_empty.notify()")
```

we get the following log output:
Note that we see

My system configuration is:
Here is the full output of the celery worker:

- worker output before task is run
- worker output when task is run
- worker output when worker is stopped with ctrl-c
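The symptoms above are consistent with a general fork pitfall: a background consumer thread started in the parent process does not survive into forked pool children, so anything put on the inherited queue in a child is never drained. A minimal sketch of that mechanism, using only the standard library (this is not Sentry's actual transport code, and the root cause here is not confirmed):

```python
# Hedged sketch: a consumer thread started before os.fork() exists only
# in the parent; the child inherits the queue's state but not the thread.
# Unix-only, since it uses os.fork().
import os
import queue
import threading
import time

q = queue.Queue()

def consumer():
    while True:
        item = q.get()  # blocks on the queue's internal not_empty condition
        print(f"[pid {os.getpid()}] consumed {item!r}", flush=True)

threading.Thread(target=consumer, daemon=True).start()

q.put("parent event")  # drained: the consumer thread runs in this process
time.sleep(0.5)

if os.fork() == 0:        # child process: no consumer thread exists here
    q.put("child event")  # never consumed; stays pending in the child's queue
    time.sleep(0.5)
    print(f"[pid {os.getpid()}] child qsize = {q.qsize()}", flush=True)
    os._exit(0)

os.wait()
```

---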
I've confirmed that the bug can still be reproduced with the steps outlined above using the latest release of sentry-sdk (v1.3.1).

---
@untitaker Would you mind removing the

---
I've confirmed that this bug is still present in sentry-sdk 1.5.1 and master, although the behaviour has changed due to commit a6cc971. There is still the issue that the first logged error doesn't get processed immediately and could stay pending on the queue indefinitely. However:

I still consider this an issue because the Celery worker may not be terminated for quite some time, and so the error may not be reported until it is too late. The code in #687 (comment) can still be used to replicate the issue. By default Celery autoscales down inactive workers after 30 seconds, so the error will be reported after 30 seconds. This can be changed with the

---
Hey @nicolasgrasset and @RobbieClarken, better late than never. I just tried to reproduce this with Sentry SDK version 1.35.0, Python 3.9.15, celery 5.3.5, and redis as the Celery broker (Redis server version 7.2.1). I used this

This is how I run celery:

And this is how I call the task:

This is the output of Celery I get:

So immediately after my_task finishes, the error (and one transaction event) is sent to Sentry. I am not able to reproduce this behavior. As this is a quite old issue, could you please check whether the issue has been resolved by now, or if it is still occurring?

---
@antonpirker Yes, I can still reproduce the issue, including with the dependencies you specified. However, if I include

Could you please try reproducing without the

Logs

versions

---
Thanks for the update. Yes, I will try with

---
Hi @RobbieClarken, I attempted to reproduce the issue using the instructions you provided. My reproduction is available here. It does indeed appear that there is a delay in sending the error event; however, in my testing the event still got sent eventually, either after a delay of approximately a few minutes or after interrupting the Celery worker with Ctrl-C. Does this match the behavior you are observing?

---
@szokeasaurusrex I'm seeing the same behaviour when I install the latest sentry-sdk (v2.1.1). The behaviour looks to have changed in sentry-sdk v1.4.0; prior to this version (e.g. v1.3.1), when the Celery worker was scaled down (after ~60 seconds of inactivity) we would see:

From sentry-sdk v1.4.0 onwards we see:

So it seems that the errors will now eventually be reported... it may just be after a ~60 second delay.
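That the pending error goes out when the worker process exits is consistent with the SDK's shutdown behaviour: sentry_sdk registers an atexit hook that flushes queued events, bounded by the `shutdown_timeout` option. A small sketch of tuning that option (the option is real; the value is illustrative, and this is not presented as a fix for the underlying delay):

```python
# Hedged sketch: give the atexit flush more time to drain pending events
# when a pool worker is terminated. CHANGE_ME is a placeholder DSN.
import sentry_sdk

sentry_sdk.init(
    dsn="CHANGE_ME",
    shutdown_timeout=5,  # default is 2 seconds; wait up to 5s on process exit
)
```

---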
@RobbieClarken So, if I am understanding you correctly, it sounds like this issue has now been resolved. Is that correct? Or is there still some problem we need to address before we close this issue?

---
@szokeasaurusrex IMO the fact that the error doesn't get reported until the celery worker is scaled down is an indication there is still a bug in sentry-sdk, since errors should be reported immediately. I'm not sure if there is any situation where the worker will never get scaled down (which would mean the error would never be reported), or whether the bug can present itself in other ways. That said, it may be that the current behaviour is acceptable: with the default celery autoscaling configuration, I doubt anyone would notice the delay in error reporting.

---
@RobbieClarken adding some more info here because I just tried this out. In general, we've been telling celery users to initialize the SDK in the `celeryd_init` signal handler rather than at module level:

```python
@signals.celeryd_init.connect
# @signals.worker_init.connect
def init_worker(**_kwargs):
    init_sentry()
```

With the signal init, I can see the error events sent immediately, while with the module-level init there does seem to be a consistent delay. I don't know the root cause (would need to dig into the celery worker / threading condition mess), but maybe this helps other people looking at this issue. We've tried to clarify this in the docs, but maybe we should make this really clear.
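Spelled out as a self-contained module, the suggested pattern might look like the sketch below; the broker URL and DSN are placeholders, not from this thread:

```python
# Hedged sketch of the signal-based init suggested above.
import sentry_sdk
from celery import Celery, signals

app = Celery(broker="redis://localhost:6379/0")  # placeholder broker URL

def init_sentry():
    sentry_sdk.init(dsn="CHANGE_ME", debug=True)  # placeholder DSN

@signals.celeryd_init.connect  # fires in the worker's main process at startup
def init_worker(**_kwargs):
    init_sentry()
```

---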
Here is an issue that's a little hard to understand. We run Sentry on Django+Celery. We run many packages on that project, so I suspect it's a conflict with another one, but I do not know where to start.

Details

- `Hub.current.client.flush()` doesn't change any of it

Celery task

Sentry init
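For orientation, a typical Sentry init for a Django+Celery stack looks like the following sketch; these details are assumptions, not necessarily the reporter's actual configuration:

```python
# Hedged sketch of a typical init for this stack; the issue's actual
# "Sentry init" block was collapsed, so these specifics are assumptions.
import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn="CHANGE_ME",  # placeholder
    integrations=[DjangoIntegration(), CeleryIntegration()],
)
```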