
Sensors that return many run requests result in heavy querying of the run storage by the daemon #3726

Closed
gibsondan opened this issue Feb 23, 2021 · 15 comments
Labels: type: bug (Something isn't working)

Comments

@gibsondan
Member

GCP profiling suggests that several runs table queries take 8-15ms and are responsible for a large portion of the load on our postgres DB.

SELECT
runs.run_body
FROM
runs
WHERE
runs.status IN ($1,
$2,
$3)
ORDER BY
runs.id DESC

and

SELECT
runs.run_body
FROM
runs
JOIN
run_tags
ON
runs.run_id = run_tags.run_id
WHERE
run_tags.key = $1
AND run_tags.value = $2
GROUP BY
runs.run_body,
runs.id
HAVING
COUNT(runs.run_id) = $3
ORDER BY
runs.id DESC
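
For reference, these match the shape of the run-storage reads that can be issued through the instance API. A hedged sketch, assuming the RunsFilter API from recent Dagster releases (the older versions discussed below use slightly different class names, and the statuses and tag values here are illustrative):

from dagster import DagsterInstance, DagsterRunStatus, RunsFilter

instance = DagsterInstance.get()

# Status-filtered lookup, roughly the shape of the first query (runs.status IN (...)).
in_progress_runs = instance.get_runs(
    filters=RunsFilter(
        statuses=[
            DagsterRunStatus.STARTING,
            DagsterRunStatus.STARTED,
            DagsterRunStatus.CANCELING,
        ]
    )
)

# Tag-filtered lookup, roughly the shape of the second query (JOIN run_tags ... HAVING ...).
# "dagster/run_key" is the tag used to deduplicate sensor run requests.
runs_for_key = instance.get_runs(
    filters=RunsFilter(tags={"dagster/run_key": "some-run-key"})
)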

@gibsondan gibsondan added the type: bug Something isn't working label Feb 23, 2021
@nancydyc nancydyc added this to To do in Scheduled Runs & gRPC via automation Feb 23, 2021
@prha prha removed their assignment Sep 30, 2021
@jackowitz

@gibsondan I see you closed this issue recently. Was a fix for this issue ever merged but GitHub missed the tagging? We are currently encountering this, but even more pronounced: each execution takes close to 400ms and Dagster runs the query multiple times per second, effectively keeping our database (RDS PostgreSQL) CPU pegged while orchestrating only a small number of jobs.

SELECT runs.run_body FROM runs JOIN run_tags ON runs.run_id = run_tags.run_id
WHERE run_tags.key = ? AND run_tags.value = ? OR run_tags.key = ? AND run_tags.value = ?
GROUP BY runs.run_body, runs.id HAVING count(runs.run_id) = ? ORDER BY runs.id DESC

@gibsondan gibsondan reopened this Mar 3, 2022
@gibsondan
Member Author

Hi @jackowitz - thanks for the report, on which version of Dagster are you seeing this?

@jackowitz

We are on 0.13.13. An update to 0.14.x is on our eventual roadmap, but we are also trying to gauge the cost/benefit of timing that update. A fix for this issue would be good motivation, but nothing stood out to me in the relevant history since then, although I could be missing it.

@gibsondan
Member Author

Makes sense. You mentioned that Dagster is running the query multiple times per second; that's the part that jumps out as unexpected to me (and could be something that has been fixed since 0.13.13; we'll attempt to run that down). Does that extreme query behavior still happen if nobody is currently accessing Dagit in the browser?

@jackowitz

I actually had the same suspicion regarding Dagit. The query rate (as reported by RDS "Performance Insights") doesn't appear to meaningfully change outside of normal business hours, though, when both users and their laptops are likely to be asleep. Is it possible that Dagit is running something in the background? This isn't a crazy rate (2-4/second), but it does correlate with the average latency closely enough that I could plausibly believe there's always one such query "in flight" in a tight loop. I guess the true test would be to disable Dagit entirely for a short time and see if the queries cease, which might be something I can do and report back on.

@gibsondan
Member Author

Dagit does do a bit of polling in the background sometimes (we're working on improving that and making it smarter).

I will say that since this issue was filed almost a year ago, this particular query no longer shows up in our top 10 queries on the same deployment running Dagster 0.14.2, which is what prompted me to close this task. But I haven't narrowed that improvement down to a specific change yet.

@jackowitz

I was able to find a time window to shut off Dagit. Surprisingly, it made no difference in the database CPU load or query profile. Both with and without Dagit running we see a consistent pattern like the following:
[Screenshot: RDS Performance Insights, Mar 4 2022, showing the same steady query load with and without Dagit running]

The only query that shows up in our "Performance Insights" with anything other than negligible weight is the query in question (called ~3 times/second with an average latency of ~333ms), although it's definitely not clear to me where the very precise 6-minute valley comes from if this is a tight loop.

I'll start looking into 0.14.x on our end, but it does seem useful to understand what is causing this behavior, since it's quite an extreme performance impact on RDS for an application that seems like it should be pretty idle overall (we've only migrated a tiny fraction of our jobs at this point). Does Dagster provide any detailed instrumentation/tracing/metrics for its Dagit and daemon processes that I can enable?

@gibsondan
Member Author

Do you know what was still running in your cluster while Dagit was down? The main other possibilities would be the daemon, any gRPC servers (which could be running sensors that access the DB), and the runs themselves.

@gibsondan
Member Author

OK, here's another thing that would be helpful. Any chance you could send a representative output of your daemon logs during a couple of minutes where this is happening (either here or to daniel on Slack)? That might give a clue about what it's doing in such a tight loop (one shot in the dark: any chance you have any big backfills that are still running or something? That would show up in the Status tab in Dagit under Backfills).

@jackowitz

jackowitz commented Mar 4, 2022

Was actually doing just that, and I think I tracked down the source. The timing and parameters of the queries appear to correlate pretty exactly with these lines in the logs:

2022-03-04 19:35:35 +0000 - SensorDaemon - INFO - Skipping run for < run_tags.value >, found < some UUID >.

So it looks like it is coming from sensors, although they're scheduled to run only every 5 minutes and the log lines and queries are far more frequent. Possibly some kind of fan-out, where each sensor run results in many, many of these queries?

@gibsondan
Member Author

Got it - I like that theory. Do you have a sensor that's producing a large number of run requests and then relying on run keys to keep them running exactly once? If so, this very likely points to something we can make better on our side, but you can also set a cursor instead of relying solely on run keys for idempotence: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
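
For illustration, here is a minimal sketch of a cursor-based sensor along the lines of that doc page, assuming the sensor/cursor API it describes; all of the names (WATCHED_DIR, process_file, my_job, directory_sensor) are hypothetical:

import os

from dagster import RunRequest, job, op, sensor

WATCHED_DIR = "/tmp/watched"  # hypothetical directory being monitored


@op(config_schema={"filename": str})
def process_file(context):
    context.log.info("processing " + context.op_config["filename"])


@job
def my_job():
    process_file()


@sensor(job=my_job, minimum_interval_seconds=300)
def directory_sensor(context):
    # The cursor stores the newest modification time already handled, so each
    # tick only considers newer files instead of checking every run key
    # against run storage.
    last_mtime = float(context.cursor) if context.cursor else 0.0
    max_mtime = last_mtime

    for filename in sorted(os.listdir(WATCHED_DIR)):
        path = os.path.join(WATCHED_DIR, filename)
        mtime = os.path.getmtime(path)
        if mtime <= last_mtime:
            continue  # already requested on a previous tick
        yield RunRequest(
            run_key=f"{filename}:{mtime}",
            run_config={"ops": {"process_file": {"config": {"filename": filename}}}},
        )
        max_mtime = max(max_mtime, mtime)

    # Advancing the cursor is what avoids the per-run-key queries on later ticks.
    context.update_cursor(str(max_mtime))

Advancing the cursor each tick means the daemon never has to look up hundreds of old run keys in run storage just to decide that nothing new needs to launch.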

@jackowitz

That appears to be exactly what's happening:

Skipping 787 runs for sensor already completed with run keys: [].

>>> 787 / (5 * 60)
2.6233333333333335

That many runs on a 5-minute schedule gives exactly the reported query frequency. So it looks like the daemon is checking them one by one, and at ~350ms each that's enough to keep the CPU pretty constantly busy just doing this.

something we can make better on our side

Presumably the thinking is to batch these requests? Hopefully the information on this issue serves as a good starting point/motivation if that work gets taken up. In the meantime, I'll check out cursors to see if they can help with this particular case. Thanks, @gibsondan.
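
To make the batching idea concrete, here is an illustration of the approach only (not Dagster's actual daemon code): collapse the per-run-key lookups into one tag-filtered fetch per tick and compare run keys in memory. It assumes the RunsFilter API and the dagster/sensor_name and dagster/run_key run tags; the helper name and signature are hypothetical.

from typing import Iterable, List

from dagster import DagsterInstance, RunsFilter


def filter_unseen_run_keys(
    instance: DagsterInstance, sensor_name: str, candidate_keys: Iterable[str]
) -> List[str]:
    # One query for every run this sensor has launched, instead of one query per key.
    existing_runs = instance.get_runs(
        filters=RunsFilter(tags={"dagster/sensor_name": sensor_name})
    )
    seen_keys = {run.tags.get("dagster/run_key") for run in existing_runs}
    return [key for key in candidate_keys if key not in seen_keys]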

@gibsondan gibsondan changed the title Runs storage makes some expensive queries Sensors that return many run requests result in heavy querying of the run storage by the daemon Mar 4, 2022
@gibsondan
Member Author

Got it - there's definitely plenty of room for improvement here; I've updated the issue title to reflect our new understanding. If cursors are an option here, though, that should make this much faster.

@prha
Member

prha commented Mar 4, 2022

Oh, this is a very good insight. Thanks @jackowitz. I think we can implement some batching of the requests here, and maybe surface some more info on the run_request/run_key per tick on the sensors page.

@prha
Member

prha commented Mar 29, 2022

I think this has mostly been addressed / mitigated by better documentation, query batching, and surfacing the skipped checks in dagit.

@prha prha closed this as completed Mar 29, 2022