
Sensors that return many run requests result in heavy querying of the run storage by the daemon #3726

Closed
gibsondan opened this issue Feb 23, 2021 · 15 comments
Labels: type: bug (Something isn't working)

Comments

@gibsondan
Member

GCP profiling suggests that several runs table queries take 8-15ms and are responsible for a large portion of the load on our postgres DB.

SELECT
runs.run_body
FROM
runs
WHERE
runs.status IN ($1,
$2,
$3)
ORDER BY
runs.id DESC

and

SELECT
runs.run_body
FROM
runs
JOIN
run_tags
ON
runs.run_id = run_tags.run_id
WHERE
run_tags.key = $1
AND run_tags.value = $2
GROUP BY
runs.run_body,
runs.id
HAVING
COUNT(runs.run_id) = $3
ORDER BY
runs.id DESC
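
For reference, these match the shape of the run-storage reads that can be issued through the instance API. A hedged sketch, assuming the RunsFilter API from recent Dagster releases (the older versions discussed below use slightly different class names, and the statuses and tag values here are illustrative):

from dagster import DagsterInstance, DagsterRunStatus, RunsFilter

instance = DagsterInstance.get()

# Status-filtered lookup, roughly the shape of the first query (runs.status IN (...)).
in_progress_runs = instance.get_runs(
    filters=RunsFilter(
        statuses=[
            DagsterRunStatus.STARTING,
            DagsterRunStatus.STARTED,
            DagsterRunStatus.CANCELING,
        ]
    )
)

# Tag-filtered lookup, roughly the shape of the second query (JOIN run_tags ... HAVING ...).
# "dagster/run_key" is the tag used to deduplicate sensor run requests.
runs_for_key = instance.get_runs(
    filters=RunsFilter(tags={"dagster/run_key": "some-run-key"})
)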

@gibsondan gibsondan added the type: bug Something isn't working label Feb 23, 2021
@nancydyc nancydyc added this to To do in Scheduled Runs & gRPC via automation Feb 23, 2021
@prha prha removed their assignment Sep 30, 2021
@jackowitz

@gibsondan I see you closed this issue recently. Was a fix for this issue ever merged but GitHub missed the tagging? We are currently encountering this, but even more pronounced: each execution takes close to 400ms and Dagster runs the query multiple times per second, effectively keeping our database (RDS PostgreSQL) CPU pegged while orchestrating only a small number of jobs.

SELECT runs.run_body FROM runs JOIN run_tags ON runs.run_id = run_tags.run_id
WHERE run_tags.key = ? AND run_tags.value = ? OR run_tags.key = ? AND run_tags.value = ?
GROUP BY runs.run_body, runs.id HAVING count(runs.run_id) = ? ORDER BY runs.id DESC

@gibsondan gibsondan reopened this Mar 3, 2022
@gibsondan
Member Author

Hi @jackowitz - thanks for the report, on which version of Dagster are you seeing this?

@jackowitz

We are on 0.13.13. An update to 0.14.x is on our eventual roadmap, but we are also trying to gauge the cost/benefit of timing that update. A fix for this issue would be good motivation, but nothing stood out to me in the relevant history since then, although I could be missing it.

@gibsondan
Member Author

Makes sense. You mentioned that Dagster is running the query multiple times per second; that's the part that jumps out as unexpected to me (and could be something that has been fixed since 0.13.13; we'll attempt to run that down). Does that extreme query behavior still happen if nobody is currently accessing Dagit in the browser?

@jackowitz

I actually had the same suspicion regarding Dagit. The query rate (as reported by RDS "Performance Insights") doesn't appear to meaningfully change outside of normal business hours, though, when both users and their laptops are likely to be asleep. Is it possible that Dagit is running something in the background? This isn't a crazy rate (2-4/second), but it does correlate with the average latency closely enough that I could plausibly believe there's always one such query "in flight" in a tight loop. I guess the true test would be to disable Dagit entirely for a short time and see if the queries cease, which might be something I can do and report back on.

@gibsondan
Member Author

Dagit does do a bit of polling in the background sometimes (we're working on improving that and making it smarter).

I will say that since this issue was filed almost a year ago, this particular query no longer shows up in our top 10 queries on the same deployment running Dagster 0.14.2, which is what prompted me to close this task. But I haven't narrowed that improvement down to a specific change yet.

@jackowitz

I was able to find a time window to shut off Dagit. Surprisingly, it made no difference in the database CPU load or query profile. Both with and without Dagit running we see a consistent pattern like the following:
[Screenshot: RDS Performance Insights, Mar 4 2022, showing the same steady query load with and without Dagit running]

The only query that shows up in our "Performance Insights" with anything other than negligible weight is the query in question (called ~3 times/second with an average latency of ~333ms), although it's definitely not clear to me where the very precise 6-minute valley comes from if this is a tight loop.

I'll start looking into 0.14.x on our end, but it does seem useful to understand what is causing this behavior, since it's quite an extreme performance impact on RDS for an application that seems like it should be pretty idle overall (we've only migrated a tiny fraction of our jobs at this point). Does Dagster provide any detailed instrumentation/tracing/metrics for its Dagit and daemon processes that I can enable?

@gibsondan
Member Author

Do you know what was still running in your cluster while Dagit was down? The main other possibilities would be the daemon, any gRPC servers (which could be running sensors that access the DB), and the runs themselves.

@gibsondan
Member Author

OK, here's another thing that would be helpful. Any chance you could send a representative output of your daemon logs during a couple of minutes where this is happening (either here or to daniel on Slack)? That might give a clue about what it's doing in such a tight loop (one shot in the dark: any chance you have any big backfills that are still running or something? That would show up in the Status tab in Dagit under Backfills).

@jackowitz

jackowitz commented Mar 4, 2022

Was actually doing just that, and I think I tracked down the source. The timing and parameters of the queries appear to correlate pretty exactly with these lines in the logs:

2022-03-04 19:35:35 +0000 - SensorDaemon - INFO - Skipping run for < run_tags.value >, found < some UUID >.

So it looks like it is coming from sensors, although they're scheduled to run only every 5 minutes and the log lines and queries are far more frequent. Possibly some kind of fan-out, where each sensor run results in many, many of these queries?

@gibsondan
Member Author

Got it - I like that theory. Do you have a sensor that's producing a large number of run requests and then relying on run keys to keep them running exactly once? If so, this very likely points to something we can make better on our side, but you can also set a cursor instead of relying solely on run keys for idempotence: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
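
For illustration, here is a minimal sketch of a cursor-based sensor along the lines of that doc page, assuming the sensor/cursor API it describes; all of the names (WATCHED_DIR, process_file, my_job, directory_sensor) are hypothetical:

import os

from dagster import RunRequest, job, op, sensor

WATCHED_DIR = "/tmp/watched"  # hypothetical directory being monitored


@op(config_schema={"filename": str})
def process_file(context):
    context.log.info("processing " + context.op_config["filename"])


@job
def my_job():
    process_file()


@sensor(job=my_job, minimum_interval_seconds=300)
def directory_sensor(context):
    # The cursor stores the newest modification time already handled, so each
    # tick only considers newer files instead of checking every run key
    # against run storage.
    last_mtime = float(context.cursor) if context.cursor else 0.0
    max_mtime = last_mtime

    for filename in sorted(os.listdir(WATCHED_DIR)):
        path = os.path.join(WATCHED_DIR, filename)
        mtime = os.path.getmtime(path)
        if mtime <= last_mtime:
            continue  # already requested on a previous tick
        yield RunRequest(
            run_key=f"{filename}:{mtime}",
            run_config={"ops": {"process_file": {"config": {"filename": filename}}}},
        )
        max_mtime = max(max_mtime, mtime)

    # Advancing the cursor is what avoids the per-run-key queries on later ticks.
    context.update_cursor(str(max_mtime))

Advancing the cursor each tick means the daemon never has to look up hundreds of old run keys in run storage just to decide that nothing new needs to launch.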

@jackowitz

That appears to be exactly what's happening:

Skipping 787 runs for sensor already completed with run keys: [].

>>> 787 / (5 * 60)
2.6233333333333335

That many runs on a 5-minute schedule gives exactly the reported query frequency. So it looks like the daemon is checking them one by one, and at ~350ms each that's enough to keep the CPU pretty constantly busy just doing this.

something we can make better on our side

Presumably the thinking is to batch these requests? Hopefully the information on this issue serves as a good starting point/motivation if that work gets taken up. In the meantime, I'll check out cursors to see if they can help with this particular case. Thanks, @gibsondan.
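
To make the batching idea concrete, here is an illustration of the approach only (not Dagster's actual daemon code): collapse the per-run-key lookups into one tag-filtered fetch per tick and compare run keys in memory. It assumes the RunsFilter API and the dagster/sensor_name and dagster/run_key run tags; the helper name and signature are hypothetical.

from typing import Iterable, List

from dagster import DagsterInstance, RunsFilter


def filter_unseen_run_keys(
    instance: DagsterInstance, sensor_name: str, candidate_keys: Iterable[str]
) -> List[str]:
    # One query for every run this sensor has launched, instead of one query per key.
    existing_runs = instance.get_runs(
        filters=RunsFilter(tags={"dagster/sensor_name": sensor_name})
    )
    seen_keys = {run.tags.get("dagster/run_key") for run in existing_runs}
    return [key for key in candidate_keys if key not in seen_keys]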

@gibsondan gibsondan changed the title Runs storage makes some expensive queries Sensors that return many run requests result in heavy querying of the run storage by the daemon Mar 4, 2022
@gibsondan
Member Author

Got it - there's definitely plenty of room for improvement here; I've updated the issue title to reflect our new understanding. If cursors are an option here, though, that should make this much faster.

@prha
Member

prha commented Mar 4, 2022

Oh, this is a very good insight. Thanks @jackowitz. I think we can implement some batching of the requests here, and maybe surface some more info on the run_request/run_key per tick on the sensors page.

@prha
Member

prha commented Mar 29, 2022

I think this has mostly been addressed / mitigated by better documentation, query batching, and surfacing the skipped checks in dagit.

@prha prha closed this as completed Mar 29, 2022