Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Secrets backend failover #16404

Merged
merged 7 commits into from
Aug 27, 2021
Merged

Secrets backend failover #16404

merged 7 commits into from
Aug 27, 2021

Conversation

fhoda
Copy link
Contributor

@fhoda fhoda commented Jun 11, 2021

  1. Currently Airflow does not check the default secrets backends (env and metastore db) if there is any sort of connection related error to an Alternative Backend, causing related tasks to fail. The change proposed here allows it to fail over to checking the default backends when this happens.

  2. Additionally GCP Secrets Manager causes the Airflow Webserver to crash at startup if credentials for the backend cannot be found. This behavior seems to be unique to GCP Secret Manager and this PR addresses that for parity in behavior regarding missing credentials across all backends.

closes: #14592

@kaxil

@boring-cyborg boring-cyborg bot added area:providers area:secrets provider:google Google (including GCP) related issues labels Jun 11, 2021
@fhoda fhoda marked this pull request as draft June 11, 2021 21:57
@fhoda
Copy link
Contributor Author

fhoda commented Jun 11, 2021

Submitting for initial review. Assuming I will need to make changes based on feedback :)

@potiuk
Copy link
Member

potiuk commented Jun 13, 2021

I believe the conclusion of our discussion was that in case of get_config use where Secrets Backend is missing-in-action we should raise an Exception. I gues It should be added here (including tests?)

@fhoda
Copy link
Contributor Author

fhoda commented Jun 15, 2021

I believe the conclusion of our discussion was that in case of get_config use where Secrets Backend is missing-in-action we should raise an Exception. I gues It should be added here (including tests?)

@potiuk For get_config currently it causes airflow to crash if it cannot reach the Secrets Backend. This happens when get_config is called. If I understood properly we considered this behavior as expected so I left it alone.

@potiuk
Copy link
Member

potiuk commented Jun 15, 2021

I believe the conclusion of our discussion was that in case of get_config use where Secrets Backend is missing-in-action we should raise an Exception. I gues It should be added here (including tests?)

@potiuk For get_config currently it causes airflow to crash if it cannot reach the Secrets Backend. This happens when get_config is called. If I understood properly we considered this behavior as expected so I left it alone.

But I think we should add a better message in this case and a test to cover this scenario explicitly. Then no-one will ever "fix it" thinking it is a bug.

@fhoda
Copy link
Contributor Author

fhoda commented Jun 15, 2021

But I think we should add a better message in this case and a test to cover this scenario explicitly. Then no-one will ever "fix it" thinking it is a bug.

Ah yes, I understand now. Agree, was planning to add tests for all changes once I got feedback on the approach. I will implement your suggestion for get_config as well and then add tests.

@kaxil
Copy link
Member

kaxil commented Jun 28, 2021

Thanks for the PR, can you fix the static checks and rebase on latest main @fhoda

airflow/configuration.py Outdated Show resolved Hide resolved
@ashb
Copy link
Member

ashb commented Jun 29, 2021

Currently Airflow does not check the default secrets backends (env and metastore db) if there is any sort of connection related error to an Alternative Backend, causing related tasks to fail. The change proposed here allows it to fail over to checking the default backends when this happens.

This leads to some other possible bugs -- if you have different connection info in the DB to the external secrets store then sometimes/somewhat randomly you would end up with different crednetials returned.

I'm not sure that is a good behaviour...

@ashb
Copy link
Member

ashb commented Jun 29, 2021

If the goal is to protect against a misconfigured secrets backend only, then we should change it to test the settings somehow in get_custom_secret_backend, and if it's not working there, to not use it, rather than to put the try/except around every use of the secrets backends.

@potiuk
Copy link
Member

potiuk commented Jun 29, 2021

I'm not sure that is a good behaviour...
If the goal is to protect against a misconfigured secrets backend only, then we should change it to test the settings somehow in get_custom_secret_backend, and if it's not working there, to not use it, rather than to put the try/except around every use of the secrets backends.

I quite agree. Just a bit of context @ashb

We had big discussion about it in #14592 and my original "extreme" statement was that we should crash Airflow if secret backend is unreachable (I had same concern).

Finally we reached a bit "softer" approach (which might allow the tasks to survive secret manager temporary unavailability if the user decides that this is the case). We decideed the crash should only happen when you read configuration (because there are default fallback values which might change airflow behaviour), but there might be some valid cases of fallback for connections or variables.

The idea here is that the user might then implement scenarios where the fallback values will be OK. It requires a deliberate action from the user - of defining both variable and connection fallback. There are no "defaults" for neither connections nor variables, so unless the users explicitly defines them, the task will not proceed/fail in case secret backend is unreachable (there will be no connection nor variable to use). While this is not as "clean" as ('crash whenever the backend is not available'), it makes sense to be implemented this way.

I'd love to hear your thoughts about it.

@fhoda fhoda marked this pull request as ready for review June 30, 2021 04:08
Comment on lines 209 to 210
log.exception(
'Unable to retrieve variable from alternative secrets backend. '
Copy link
Member

@kaxil kaxil Jul 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will raise the same error if there is any error getting variable from default secret backends (env var and db).

Let's separate this PR into 2:

  1. GCP Change in airflow/providers/google/cloud/secrets/secret_manager.py
  2. Broader Secrets Failover Change (with optional retries controlled via airflow.cfg 🤷 - that can even be a 3rd PR )

@fhoda
Copy link
Contributor Author

fhoda commented Jul 19, 2021

I will split this PR up and reference accordingly as @kaxil suggested.

@fhoda fhoda requested a review from kaxil August 17, 2021 15:37
@kaxil kaxil merged commit 0abbd2d into apache:main Aug 27, 2021
@kaxil kaxil added this to the Airflow 2.2 milestone Aug 27, 2021
ranlu added a commit to seung-lab/seuron that referenced this pull request Jun 23, 2022
Due to recent changes in airflow

apache/airflow#16404

get variable will never raise an exception, it will return a None
instead. This will cause setdefault call to reset everything to default
if the database is not available when it try to fetch the value. For now
we set the default value once during the initialization and let the dag
fail if it gets a None parameter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers area:secrets provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unreachable Secrets Backend Causes Web Server Crash
5 participants