Configure pools via Helm chart #29401
@potiuk @jedcunningham tests are (finally) passing (the failing ones are due to timeouts). Could you please review this PR?
I'm not really a fan of this approach, unfortunately. This has "source of truth" problems in my eyes, as one can still modify these values in the UI. At that point, is the chart right to overwrite it on the next update, or is the UI value the right one? Also, removing a pool here doesn't actually remove it from the db.
There isn't really a "better" option, short of doing some Airflow-side enhancements of pools. I'd almost rather wait for that and only support it on newer Airflow versions, instead of supporting something with a number of nuances to it.
I'm curious what other maintainers think.
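To make the two concerns above concrete, here is a minimal simulation of a one-off import. It assumes the import job only upserts the pools listed in the chart and never deletes anything; `import_pools` and the plain dict standing in for the database are hypothetical names used for illustration, not Airflow code.

```python
from typing import Dict


def import_pools(db: Dict[str, int], chart_pools: Dict[str, int]) -> None:
    """Upsert every chart-defined pool; existing rows are never deleted."""
    for name, slots in chart_pools.items():
        db[name] = slots


# First deploy: the chart defines two pools.
db: Dict[str, int] = {}
import_pools(db, {"etl": 5, "reports": 2})

# Someone raises the slot count in the UI.
db["etl"] = 10

# Next deploy: "reports" was removed from the chart, "etl" is unchanged there.
import_pools(db, {"etl": 5})

# The UI edit is silently overwritten and the removed pool is still present.
print(db)
```

After the second import the UI change to `etl` is lost, while `reports` survives even though the chart no longer mentions it.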
#############################################
## Airflow Import Pools Job ServiceAccount
##############################################
{{- if .Values.importPoolsJob.serviceAccount.create }}
No reason to deploy anything when pools aren't defined.
@@ -757,6 +757,80 @@ scheduler:

env: []

# Pools that will be added to Airflow (Keys are pools names and values are pools settings)
We need to be very clear here that this doesn't actually manage the pools; it just does a one-off import at deploy time.
+1 -- Agreed with @jedcunningham. That is one of the things we said the user-community chart did that we would never do when we started building the official Helm chart for Airflow (or so I remember :) )
What if we just make pools optionally configurable with env vars, sort of like variables or connections now? It might screw up some queries, but maybe that could be handled. Then you could in effect accomplish the same from the chart.
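As a rough illustration of that suggestion, pools could be read from the environment in the same spirit as the `AIRFLOW_VAR_` and `AIRFLOW_CONN_` prefixes. The `AIRFLOW_POOL_` prefix and the `pools_from_env` helper below are purely hypothetical; Airflow does not define them.

```python
import os
from typing import Dict, Mapping

# Hypothetical prefix, modeled on AIRFLOW_VAR_ / AIRFLOW_CONN_ (an assumption).
POOL_ENV_PREFIX = "AIRFLOW_POOL_"


def pools_from_env(environ: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Collect pool slot counts from variables named AIRFLOW_POOL_<NAME>."""
    pools: Dict[str, int] = {}
    for key, value in environ.items():
        if key.startswith(POOL_ENV_PREFIX):
            # Strip the prefix and lower-case the remainder to get the pool name.
            name = key[len(POOL_ENV_PREFIX):].lower()
            pools[name] = int(value)
    return pools


example = pools_from_env({"AIRFLOW_POOL_ETL": "5", "PATH": "/usr/bin"})
print(example)
```

A chart could then set such variables on the scheduler and workers instead of running an import job, which is what would let it "accomplish the same".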
We had a discussion on that. See #18582
+1 too. Managing pools this way is a bad idea for the reasons @jedcunningham nicely laid out.
As @eladkal mentioned, this would have to redefine the way our SQL queries are done in the scheduler. Pools in a multi-scheduler context have a certain complexity that makes this hard (but not impossible, I think). I once assessed that this might be hard. But I looked closely at how it works, and I think we could possibly attempt to change how the Pools table is used. Either the code was different when I looked at it in #18582, or maybe I had not thought about this idea before, but let me explain the context:
What happens next in the method is that it builds the "PoolStats" dict and marks tasks as queued if they still have "free slots" (but it uses the in-memory dict for that). Only after the tasks are marked as queued does it release the lock on Pools. This is done to accommodate multiple schedulers: the critical section makes sure schedulers will not mark dag runs as executable when Pools would be over-subscribed. So while the Pools table is indeed locked and used in the query, fundamentally I think we could take the "total" values from ENV variables rather than from the Pools table. It is somewhat coincidental that we use "totals" from the Pools table: it's convenient, and the code already has the Pool objects returned by the query to build PoolStats, but it is not really necessary. We could even have a separate lock for this critical section in [...] There is no referential integrity between Pools and any other table, it seems, so I think we would just have to modify the slots_stats function to retrieve totals from elsewhere (env vars) and things would work. The
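A rough sketch of the idea described above, not the scheduler's actual code: the free-slot bookkeeping is done in an in-memory dict inside a critical section, and the pool totals are passed in from outside (for example, parsed from env vars) rather than read from the Pools table. `select_queueable` and `POOL_LOCK` are hypothetical names, and a `threading.Lock` merely stands in for the database lock.

```python
import threading
from typing import Dict, List, Tuple

# Stand-in for the database lock guarding the critical section (an assumption).
POOL_LOCK = threading.Lock()


def select_queueable(
    candidates: List[Tuple[str, str]],
    occupied: Dict[str, int],
    totals: Dict[str, int],
) -> List[str]:
    """Return the ids of tasks that still fit into the free slots of their pool."""
    with POOL_LOCK:
        # In-memory free-slot counts, built once per scheduling pass.
        open_slots = {
            pool: total - occupied.get(pool, 0) for pool, total in totals.items()
        }
        queued: List[str] = []
        for task_id, pool in candidates:
            # Tasks whose pool has no configured total are skipped in this sketch.
            if open_slots.get(pool, 0) > 0:
                open_slots[pool] -= 1
                queued.append(task_id)
        # The lock is released only after the queueing decisions are made.
        return queued


totals = {"etl": 2}      # would come from env vars under this proposal
occupied = {"etl": 1}    # slots already taken by running or queued tasks
queued = select_queueable(
    [("a", "etl"), ("b", "etl"), ("c", "missing")], occupied, totals
)
print(queued)
```

With one of the two `etl` slots already occupied, only the first candidate is queued; the second would over-subscribe the pool.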
I still do not like the "multiple sources of truth" in this case. But there is potentially an option where we disable the pool counts in the UI entirely and replace them with pool counts managed in env vars only. Also, I think changing the approach in this area is something that we want anyway, so we could kill two birds with one stone. See #29416: our pool count currently does not take deferred tasks into account, but there are cases where that would be desired.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
closes #11707
Credits to @FloChehab (#15093)
With no pools defined