Sensors that return many run requests result in heavy querying of the run storage by the daemon #3726
@gibsondan I see you closed this issue recently. Was a fix ever merged that GitHub missed tagging? We are currently encountering this, but even more pronounced: each execution takes close to 400ms and Dagster runs the query multiple times a second, effectively keeping our database (RDS PostgreSQL) CPU pegged while orchestrating only a small number of jobs.
Hi @jackowitz - thanks for the report. Which version of Dagster are you seeing this on?
We are on 0.13.13. An update to 0.14.x is on our eventual roadmap, but we are also trying to gauge the cost/benefit of timing that update. A fix to this issue would be good motivation, but nothing stood out to me in the relevant history since then, although I could be missing it.
Makes sense. You mentioned that Dagster is running the query multiple times per second; that's the part that jumps out as unexpected to me (and could be something that has been fixed since 0.13.13; we'll attempt to run that down). Does that extreme query behavior still happen if nobody is currently accessing Dagit in the browser?
I actually had the same suspicion regarding Dagit. The query rate (as reported by RDS "Performance Insights") doesn't appear to meaningfully change outside of normal business hours, though, when both users and their laptops are likely to be asleep. Is it possible that Dagit is running something in the background? This isn't a crazy rate (2-4/second), but it does correlate with the average latency closely enough that I could plausibly believe there's always one such query "in-flight" in a tight loop. I guess the true test would be to disable Dagit entirely for a short time and see if the queries cease, which might be something I can do and report back on.
Dagit does do a bit of polling in the background sometimes (we're working on improving that / making it smarter). I will say that since this issue was filed almost a year ago, this particular query no longer shows up in our top 10 queries on the same deployment running Dagster 0.14.2, which is what prompted me to close this task. But I haven't narrowed that down to a specific change yet.
Do you know what was still running in your cluster while Dagit was down? The main other possibilities would be the daemon, any gRPC servers (which could be running sensors that access the DB), and the runs themselves.
OK, here's another thing that would be helpful: any chance you could send a representative sample of your daemon logs from a couple of minutes where this is happening (either here or to daniel on Slack)? That might give a clue about what it's doing in such a tight loop. (One shot in the dark: any chance you have any big backfills still running or something? That would show up in the Status tab in Dagit under Backfills.)
I was actually doing just that, and I think I tracked down the source. The timing and parameters of the queries appear to correlate almost exactly with these lines in the logs:
So it looks like it is coming from sensors, although they're scheduled to run only every 5 minutes and the log lines and queries are far more frequent. Possibly some kind of fan-out where one sensor run results in many, many of these queries?
Got it - I like that theory. Do you have a sensor that's producing a large number of run requests and then relying on run keys to ensure they run exactly once? If so, this very likely points to something we can make better on our side, but you can also set a cursor instead of relying solely on run keys for idempotence: https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#sensor-optimizations-using-cursors
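For reference, a minimal sketch of that cursor pattern, assuming a job named `my_job` and a hypothetical `list_records_since()` helper standing in for whatever source the real sensor polls; the exact shape of an actual sensor will differ:

```python
from dagster import RunRequest, sensor

# Hypothetical helper for illustration: returns new record ids strictly after
# `cursor`, in ascending order. Replace with the real event source.
def list_records_since(cursor):
    return []

@sensor(job_name="my_job", minimum_interval_seconds=300)
def my_cursor_sensor(context):
    cursor = context.cursor or ""
    for record_id in list_records_since(cursor):
        # run_key still de-duplicates per record, but the cursor means each
        # tick only considers records newer than the last one processed,
        # instead of re-checking every historical run_key against run storage.
        yield RunRequest(run_key=record_id)
        cursor = record_id
    context.update_cursor(cursor)
```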
That appears to be exactly what's happening:
That many run requests on a 5-minute schedule gives exactly the reported query frequency. So it looks like it's checking them one by one, and at ~350ms each that's enough to keep the CPU pretty much constantly busy just doing this.
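(Back-of-the-envelope check, with an illustrative request count since the actual figure isn't shown here; the latency and interval are the ones reported above:)

```python
# Illustrative arithmetic only; run_requests_per_tick is a hypothetical value
# consistent with the 2-4 queries/second observed in RDS.
run_requests_per_tick = 900        # assumed, not the actual figure from this sensor
tick_interval_s = 5 * 60           # sensor evaluates every 5 minutes
query_latency_s = 0.35             # ~350 ms per run-key lookup, as reported above

queries_per_second = run_requests_per_tick / tick_interval_s   # = 3.0
db_busy_fraction = queries_per_second * query_latency_s        # = 1.05, i.e. pegged
print(queries_per_second, db_busy_fraction)
```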
Presumably the thinking is to batch these requests? Hopefully the information on this issue serves as a good starting point/motivation if that work gets taken up. In the meantime, I'll check out cursors to see if they can help with this particular case. Thanks, @gibsondan.
Got it - there's definitely plenty of room for improvement here; I've updated the issue to reflect our new understanding. If cursors are an option here, that should make this much faster, though.
Oh, this is a very good insight. Thanks @jackowitz. I think we can implement some batching of the requests here, and maybe surface some more info on the run_request/run_key per tick on the sensors page.
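To illustrate the batching idea (a sketch only, not Dagster's actual internals; the table and tag names mirror the run_tags query quoted below and should be treated as assumptions about the schema): rather than issuing one run-storage query per run_key, a single round trip could resolve all of a tick's run keys at once.

```python
import psycopg2

# Sketch of batched run-key resolution against an assumed schema/tag name.
# One query answers "which of these run keys already have a run?" instead of
# one query per key.
def already_launched(conn, run_keys):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT run_tags.value
            FROM run_tags
            WHERE run_tags.key = %s
              AND run_tags.value = ANY(%s)
            """,
            ("dagster/run_key", list(run_keys)),
        )
        return {value for (value,) in cur.fetchall()}
```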
I think this has mostly been addressed / mitigated by better documentation, query batching, and surfacing the skipped checks in dagit. |
GCP profiling suggests that several runs-table queries take 8-15ms each and are responsible for a large portion of the load on our Postgres DB:
```sql
SELECT runs.run_body
FROM runs
WHERE runs.status IN ($1, $2, $3)
ORDER BY runs.id DESC
```
and
```sql
SELECT runs.run_body
FROM runs
JOIN run_tags ON runs.run_id = run_tags.run_id
WHERE run_tags.key = $1
  AND run_tags.value = $2
GROUP BY runs.run_body, runs.id
HAVING COUNT(runs.run_id) = $3
ORDER BY runs.id DESC
```