Monitor intermittent failures on long running tasks #2422

Open
anze3db opened this issue Oct 8, 2023 · 14 comments
Labels
Feature: Crons · Triaged · Type: Bug

Comments

anze3db commented Oct 8, 2023

How do you use Sentry?

Sentry Saas (sentry.io)

Version

1.31.0

Steps to Reproduce

  1. Have a job that runs for over 1 hour
  2. Set up a Cron Monitor with a 300-minute max runtime threshold
    [Screenshot 2023-10-08 at 09 29 01]
  3. Use the monitor in the code:
    # Module-level imports needed by this excerpt:
    from django.core import management
    from sentry_sdk.crons import monitor

    @monitor(monitor_slug="daily-sync")
    def job(self):
        self.console.print("Starting daily job")
        self.console.print("Running crawler")
        with monitor(monitor_slug="daily-sync-crawler"):
            management.call_command("crawler", skip_inactive_for=3, pre_filter=True)
        self.console.print("Running indexer")
        with monitor(monitor_slug="daily-sync-indexer"):
            management.call_command("indexer")
        self.console.print("Running optimizer")
        with monitor(monitor_slug="daily-sync-optimizer"):
            management.call_command("optimizer")
        self.console.print("Running statuser")
        with monitor(monitor_slug="daily-sync-statuser"):
            management.call_command("statuser")
        self.console.print("Running dailystats")
        with monitor(monitor_slug="daily-sync-stats"):
            management.call_command("dailystats")
        self.console.print("All done! 🎉")

Expected Result

No alerts when the job finishes without errors, as seen in the logs:

Oct 08 02:00:00 raspberrypi bash[405]: Starting daily job
Oct 08 02:00:00 raspberrypi bash[405]: Running crawler
Oct 08 02:07:00 raspberrypi bash[405]: Running optimizer
Oct 08 02:21:31 raspberrypi bash[405]: Running statuser
Oct 08 03:29:49 raspberrypi bash[405]: Running dailystats
Oct 08 03:29:49 raspberrypi bash[405]: Gathering daily stats
Oct 08 03:29:50 raspberrypi bash[405]: All done! 🎉

Actual Result

The monitor sometimes fails on the job that takes more than 1 hour:

[Screenshot 2023-10-08 at 09 33 14]

Even though the monitor after it (daily-sync-stats) worked fine:

[Screenshot 2023-10-08 at 09 35 11]

And interestingly, the daily-sync monitor, which wraps the whole job function, also passed without problems.

I initially only had the daily-sync monitor configured since I don't really need every step monitored, but when I only had a single monitor it failed consistently every day. With multiple monitors the failures are more intermittent.

I'm also not sure if it's relevant, but the job function is part of an always-running Django command that looks like this:

    # Module-level imports needed by this excerpt (scheduling is done with
    # the `schedule` library):
    import time

    import schedule

    def handle(self, *args, run_now=False, **options):
        self.console.print("Starting scheduler 🕐")
        schedule.every().day.at("01:00").do(self.job)

        while True:
            schedule.run_pending()
            time.sleep(1)
antonpirker (Member) commented:

Hey @anze3db !
Thanks for reporting this.
Indeed this should not happen. The fact that daily-sync passes is OK, because monitors are independent and we do not have a concept of child-monitors and thus do not check for status of child-monitors.

What are the grace and timeout settings of the daily-sync and the other sub-monitors?

anze3db (Author) commented Oct 9, 2023

The fact that daily-sync passes is OK, because monitors are independent and we do not have a concept of child-monitors and thus do not check for status of child-monitors.

Yes, I agree. I just thought I'd mention that the wrapper monitor passed. I think this should rule out any errors in my function.

What are the grace and timeout settings of the daily-sync and the other sub-monitors?

All the monitors have a grace period of 60 minutes and a max runtime of 300 minutes:

[Screenshot 2023-10-09 at 09 23 49]

I can try tightening these values and will report back if the issue still reproduces.
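
(Side note, not from the thread: the thresholds above are configured in the Sentry UI. Python SDK versions newer than the 1.31.0 reported here can also upsert them from code through the monitor_config argument; a minimal sketch, assuming the checkin_margin / max_runtime fields documented for Sentry Crons, both in minutes:)

    from sentry_sdk.crons import monitor

    # Sketch only: monitor_config upserts the monitor's settings with each
    # check-in; field support depends on the SDK and server version.
    monitor_config = {
        "schedule": {"type": "crontab", "value": "0 1 * * *"},
        "timezone": "UTC",
        "checkin_margin": 60,  # grace period, in minutes
        "max_runtime": 300,    # max runtime, in minutes
    }

    @monitor(monitor_slug="daily-sync", monitor_config=monitor_config)
    def job(self):
        ...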

anze3db (Author) commented Oct 10, 2023

Just an update that the job timed out today with the following threshold settings:

[Screenshot 2023-10-10 at 14 24 54]

[Screenshot 2023-10-10 at 14 30 36]

Let me know if there is anything I can do to help reproduce/resolve this.

antonpirker (Member) commented:

Thanks for the update. Hm, indeed very strange. At the moment I don't even have a hunch as to why this is happening.
All your daily tasks are Django management commands and look the same. Is there anything in the statuser task that is fundamentally different from the other tasks? Does it use async code, for example, or anything else you can think of?

anze3db (Author) commented Oct 11, 2023

The only difference is that statuser runs the longest, at more than an hour, while the other tasks all take less than 20 minutes.

Statuser does use async, but in exactly the same way as the crawler command. I use @async_to_sync in both cases to get around the fact that Django commands are sync-only:

    # Module-level import needed by this excerpt:
    from asgiref.sync import async_to_sync

    # Main entry point for a Django Command:
    def handle(
        self,
        *args,
        offset=0,
        instances=None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
        **options,
    ):
        self.main(
            offset=offset,
            instances=instances,
            skip_inactive_for=skip_inactive_for,
            pre_filter=pre_filter,
        )

    # My Async method that does all the work:
    @async_to_sync
    async def main(
        self,
        offset: int,
        instances: str | None,
        skip_inactive_for: int = 0,
        pre_filter: bool = False,
    ):
         ...

I can try rewriting the code to not use async to see if async is the cause of this problem.

anze3db (Author) commented Oct 12, 2023

Even without any async code in Statuser the monitor still failed today. 😔

antonpirker (Member) commented:

Dang. And the monitor is marked as "Timed out", right?
One thing you could check is whether the monitor's status a bit over an hour into processing is "in progress" or "timed out". Sentry only sets it to "timed out" after the full 5 hours, so if it is set to "timed out" before then, something strange is going on.

anze3db (Author) commented Oct 12, 2023

And the monitor is marked as "Timed out" right?

Correct.

One thing is that you could check if the status of the monitor a bit over an hour into processing is set to running or timed out.

I've just done that. It is in the "in progress" state while it's processing, and it stays in the "in progress" state after the job finishes:

[Screenshot 2023-10-12 at 13 22 26]
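
(Side note, not from the thread: one way to narrow this down would be to send the check-ins manually instead of using the decorator/context manager, so the closing check-in and its ID can be logged locally. A minimal sketch using sentry_sdk.crons.capture_checkin; the monitor slug and the wrapped command come from the report above, everything else is illustrative:)

    import time

    from django.core import management
    from sentry_sdk.crons import capture_checkin

    # Open the check-in explicitly and keep the returned ID so the closing
    # check-in refers to the same check-in on the server.
    check_in_id = capture_checkin(
        monitor_slug="daily-sync-statuser",
        status="in_progress",
    )
    start = time.monotonic()
    try:
        management.call_command("statuser")
    except Exception:
        capture_checkin(
            monitor_slug="daily-sync-statuser",
            check_in_id=check_in_id,
            status="error",
            duration=time.monotonic() - start,
        )
        raise
    else:
        capture_checkin(
            monitor_slug="daily-sync-statuser",
            check_in_id=check_in_id,
            status="ok",
            duration=time.monotonic() - start,
        )
        print(f"closing 'ok' check-in sent: {check_in_id}")

If the closing "ok" check-in shows up in the local logs but the monitor still ends up timed out, the check-in is most likely being lost between the SDK and Sentry rather than never being sent.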

antonpirker (Member) commented:

So it stays "in progress" for the whole 5 hours and then is set to "failed"? Is this correct?

anze3db (Author) commented Oct 17, 2023

Yes, correct.

antonpirker (Member) commented:

OK, thanks for confirming. I still don't know what is going wrong. I have asked the team handling cron check-ins in our backend whether they have any ideas about what is going wrong...

antonpirker (Member) commented:

Just wanted to check in: if you try with the newest version of the SDK, does this still happen?
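
(Side note, not from the thread: the SDK can be upgraded with `pip install --upgrade sentry-sdk`, and the version actually loaded by the long-running process can be confirmed in Python via the SDK's version constant:)

    import sentry_sdk

    # The report above was filed against 1.31.0; an always-running scheduler
    # process needs a restart to pick up the upgraded SDK.
    print(sentry_sdk.VERSION)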

anze3db (Author) commented Dec 16, 2023

Yesterday I re-enabled the monitor that was timing out before, and today it looks like it passed:

[Screenshot 2023-12-16 at 10 40 32]

Looks like the issue has been resolved, so I'm closing this, but I'll reopen if I see any more failures. Thank you for taking care of this!

anze3db closed this as completed on Dec 16, 2023
anze3db (Author) commented Dec 18, 2023

Oh no, I am reopening the issue because it has just happened again:

[Screenshot 2023-12-18 at 13 19 04]

I can see from my logs that there were no errors and that the statuser job finished after about 30 minutes:

Dec 18 01:05:08 raspberrypi bash[758000]: Running optimizer
Dec 18 01:05:08 raspberrypi bash[758000]: Running statuser
Dec 18 01:05:15 raspberrypi bash[758000]: Batch 0 done, sleeping for 90s
Dec 18 01:06:55 raspberrypi bash[758000]: Batch 1 done, sleeping for 90s
Dec 18 01:08:31 raspberrypi bash[758000]: Batch 2 done, sleeping for 90s
Dec 18 01:10:47 raspberrypi bash[758000]: Batch 3 done, sleeping for 90s
Dec 18 01:12:26 raspberrypi bash[758000]: Batch 4 done, sleeping for 90s
Dec 18 01:14:08 raspberrypi bash[758000]: Batch 5 done, sleeping for 90s
Dec 18 01:15:45 raspberrypi bash[758000]: Batch 6 done, sleeping for 90s
Dec 18 01:17:22 raspberrypi bash[758000]: Batch 7 done, sleeping for 90s
Dec 18 01:18:58 raspberrypi bash[758000]: Batch 8 done, sleeping for 90s
Dec 18 01:20:35 raspberrypi bash[758000]: Batch 9 done, sleeping for 90s
Dec 18 01:22:12 raspberrypi bash[758000]: Batch 10 done, sleeping for 90s
Dec 18 01:23:48 raspberrypi bash[758000]: Batch 11 done, sleeping for 90s
Dec 18 01:25:25 raspberrypi bash[758000]: Batch 12 done, sleeping for 90s
Dec 18 01:27:02 raspberrypi bash[758000]: Batch 13 done, sleeping for 90s
Dec 18 01:28:39 raspberrypi bash[758000]: Batch 14 done, sleeping for 90s
Dec 18 01:30:16 raspberrypi bash[758000]: Batch 15 done, sleeping for 90s
Dec 18 01:31:53 raspberrypi bash[758000]: Batch 16 done, sleeping for 90s
Dec 18 01:33:30 raspberrypi bash[758000]: Batch 17 done, sleeping for 90s
Dec 18 01:35:07 raspberrypi bash[758000]: Batch 18 done, sleeping for 90s
Dec 18 01:36:41 raspberrypi bash[758000]: Batch 19 done, sleeping for 90s
Dec 18 01:38:11 raspberrypi bash[758000]: Running dailystats
Dec 18 01:38:11 raspberrypi bash[758000]: Gathering daily stats
Dec 18 01:38:22 raspberrypi bash[758000]: All done! 🎉

@sentrivana added the Triaged label on Jul 1, 2024