Secrets backend failover #16404
Submitting for initial review. Assuming I will need to make changes based on feedback :)
I believe the conclusion of our discussion was that in case of …
@potiuk For …
But I think we should add a better message in this case and a test to cover this scenario explicitly. Then no one will ever "fix it" thinking it is a bug.
Ah yes, I understand now. Agreed; I was planning to add tests for all changes once I got feedback on the approach. I will implement your suggestion for …
Thanks for the PR. Can you fix the static checks and rebase on the latest main, @fhoda?
This leads to some other possible bugs: if the connection info in the DB differs from the connection info in the external secrets store, then sometimes (somewhat randomly) you would end up with different credentials being returned. I'm not sure that is good behaviour...
If the goal is to protect against a misconfigured secrets backend only, then we should change it to test the settings somehow in …
I quite agree. Just a bit of context, @ashb: we had a big discussion about this in #14592, and my original "extreme" position was that we should crash Airflow if the secrets backend is unreachable (I had the same concern). We finally reached a somewhat "softer" approach, which might allow tasks to survive temporary unavailability of the secrets manager if the user decides that this is acceptable. We decided the crash should only happen when reading configuration (because there are default fallback values that might change Airflow's behaviour), but there might be valid cases for falling back for connections or variables. The idea is that the user can then implement scenarios where the fallback values are OK. This requires a deliberate action from the user: defining fallbacks for both variables and connections. There are no defaults for either connections or variables, so unless the user explicitly defines them, the task will not proceed (it will fail) if the secrets backend is unreachable, because there will be no connection or variable to use. While this is not as "clean" as "crash whenever the backend is not available", it makes sense to implement it this way. I'd love to hear your thoughts on it.
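The policy described above (crash for configuration, tolerate backend failures for variables and connections only when the user has defined a fallback) can be sketched as follows. This is a minimal, self-contained illustration, not Airflow's implementation: `BackendUnavailable`, `get_config_option`, `get_variable` and the plain dict standing in for the metastore are all hypothetical. Only the `AIRFLOW_VAR_<KEY>` environment variable naming follows Airflow's convention.

```python
import os
from typing import Callable, Dict, Optional


class BackendUnavailable(Exception):
    """Hypothetical: raised when the external secrets backend cannot be reached."""


# A fetcher returns the value for a key, None if the key is absent,
# or raises BackendUnavailable if the backend cannot be reached.
Fetcher = Callable[[str], Optional[str]]


def get_config_option(key: str, external: Fetcher) -> Optional[str]:
    # Configuration: no fallback. Airflow has built-in defaults for config
    # options, so silently using them could change its behaviour; the
    # BackendUnavailable error is allowed to propagate (i.e. "crash").
    return external(key)


def get_variable(key: str, external: Fetcher, metastore: Dict[str, str]) -> Optional[str]:
    try:
        value = external(key)
        if value is not None:
            return value
    except BackendUnavailable:
        pass  # tolerated for variables: try the user-defined fallbacks below
    # Variables have no built-in defaults. A fallback exists only if the user
    # deliberately defined one, as an AIRFLOW_VAR_<KEY> environment variable
    # or as a metastore entry; otherwise the result is None and the task fails.
    env_value = os.environ.get(f"AIRFLOW_VAR_{key.upper()}")
    if env_value is not None:
        return env_value
    return metastore.get(key)
```

With an unreachable backend, `get_variable` returns the user's deliberate fallback if one exists and `None` otherwise, while `get_config_option` surfaces the error instead of quietly using a default.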
airflow/models/variable.py (outdated)
log.exception(
'Unable to retrieve variable from alternative secrets backend. '
This will raise the same error if there is any error getting the variable from the default secrets backends (env var and DB).
Let's split this PR into two:
- GCP change in airflow/providers/google/cloud/secrets/secret_manager.py
- Broader secrets failover change (with optional retries controlled via airflow.cfg 🤷; that can even be a 3rd PR)
I will split this PR up and reference the new PRs accordingly, as @kaxil suggested.
Due to recent changes in Airflow (apache/airflow#16404), getting a variable will never raise an exception; it returns None instead. This causes the setdefault call to reset everything to its default if the database is not available when it tries to fetch the value. For now, we set the default value once during initialization and let the DAG fail if it gets a None parameter.
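The hazard described above can be reproduced with a small simulation. This is a hypothetical sketch, not Airflow's `Variable` class: `FlakyStore` models storage whose reads return `None` during a transient outage, and its `setdefault` follows the get-then-set pattern the commit message describes. `require` is a hypothetical helper illustrating the workaround of failing on `None` instead of writing a default.

```python
from typing import Dict, Optional


class FlakyStore:
    """Hypothetical stand-in for variable storage whose reads return None
    (instead of raising) while the backend is unreachable."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.reads_fail = False  # simulate a transient read outage

    def get(self, key: str) -> Optional[str]:
        if self.reads_fail:
            return None  # an outage looks exactly like "not set"
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def setdefault(self, key: str, default: str) -> str:
        # Get-then-set: if get() returns None, the default is written back,
        # overwriting any real value that the outage hid.
        current = self.get(key)
        if current is None:
            self.set(key, default)
            return default
        return current


def require(store: FlakyStore, key: str) -> str:
    # Workaround in the spirit of the commit: read without writing and fail
    # loudly on None, so an outage cannot clobber the stored value.
    value = store.get(key)
    if value is None:
        raise RuntimeError(f"variable {key!r} is unavailable")
    return value
```

If a user-set value `"42"` exists and a read fails during `setdefault(key, "10")`, the stored value silently becomes `"10"`; with `require`, the same outage fails the run but leaves `"42"` intact.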
Currently, Airflow does not check the default secrets backends (env and metastore db) if there is any sort of connection-related error with an alternative backend, which causes related tasks to fail. The change proposed here allows Airflow to fail over to checking the default backends when this happens.

Additionally, GCP Secret Manager causes the Airflow webserver to crash at startup if credentials for the backend cannot be found. This behavior seems to be unique to GCP Secret Manager; this PR addresses it so that all backends behave consistently when credentials are missing.
closes: #14592
@kaxil