Set up airflow variable defaults with descriptions automatically #4297
Hi @AetherUnbound here's a draft PR tackling this issue, kindly take a look and let me know if I'm missing something.
This is a fantastic start! I hate to say it yet again, but unfortunately this looks a little bit more complicated than I initially thought 😆
The `entrypoint.sh` file runs for every Airflow container on every start. When I tried running `just down -v && just c && just logs`, all the containers failed to start because they were all trying to run `airflow variables list` and unable to read from the (not yet initialized!) database. That database initialization happens during the `exec /entrypoint "$@"` line for the `scheduler` container only, and so we get into a situation where we're trying to modify the database when it hasn't been initialized yet, and we can't initialize the database because we're trying to modify it first!
Thankfully, Airflow gives us some tools for this. What I was able to do was wrap all of the variable adding logic in the following:
if [[ "$*" == "webserver" ]]; then
    # Wait for the database to initialize, will time out if not
    airflow db check-migrations
    # ... the rest of the logic you added
fi
What this does is:
- Check that we're only executing this command on the `webserver` container (by checking the command passed in, accessible via `$*`, which should be `webserver` for the webserver)
- Wait for the scheduler container to finish the migrations (by running `airflow db check-migrations`, which will wait up to 60s for migrations to be applied)
This ensures that we're only running the variable adding command on one container, and only after the database has already been set up!
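To make the `$*` check above concrete, here is a minimal sketch. The `is_webserver` helper is hypothetical and not part of the PR; it only illustrates that `$*` joins all arguments into one string, so the comparison succeeds only when the command is exactly `webserver`.

```shell
# Hypothetical helper (not in the PR): succeeds only when the arguments,
# joined by spaces, are exactly the single word "webserver".
is_webserver() {
    [ "$*" = "webserver" ]
}

# The webserver container is started with the command "webserver".
is_webserver webserver && echo "webserver: would set variable defaults"

# Any other command (for example the scheduler) skips the block.
is_webserver scheduler || echo "scheduler: skipping variable defaults"

# Extra arguments change the joined string, so this does not match either.
is_webserver webserver --debug || echo "webserver --debug: skipping"
```

If the webserver were ever started with extra arguments, a stricter or looser match (for example on `$1`) might be needed; that is a design choice outside what this review covers.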
We also need one more change to the catalog's Dockerfile:
diff --git a/catalog/Dockerfile b/catalog/Dockerfile
index 64ad1837d..7a164fd52 100644
--- a/catalog/Dockerfile
+++ b/catalog/Dockerfile
@@ -69,5 +69,6 @@ ARG CONSTRAINTS_FILE="https://raw.githubusercontent.com/apache/airflow/constrain
RUN pip install -r ${REQUIREMENTS_FILE} -c ${CONSTRAINTS_FILE}
COPY entrypoint.sh /opt/airflow/entrypoint.sh
+COPY variables.tsv /opt/airflow/variables.tsv
ENTRYPOINT ["/usr/bin/dumb-init", "--", "/opt/airflow/entrypoint.sh"]
This will bake the `variables.tsv` file into the image, so it should also be available in our production image when we deploy (our Docker mounts differ between development and production).
With all that in place, I was able to get this working great! Let me know if you'd like any assistance making those changes 🙂
catalog/entrypoint.sh
Outdated
@@ -61,4 +61,71 @@ while read -r var_string; do
# only include Slack airflow connections
done < <(env | grep "^AIRFLOW_CONN_SLACK*")

# Set up Airflow Variable defaults with descriptions automatically
It would be good to add a `header` here, for clarity:
- # Set up Airflow Variable defaults with descriptions automatically
+ # Set up Airflow Variable defaults with descriptions automatically
+ header "SETTING VARIABLE DEFAULTS"
The other calls to `header` which have been added below should be changed to `echo` instead.
catalog/entrypoint.sh
Outdated
@@ -61,4 +61,71 @@ while read -r var_string; do
# only include Slack airflow connections
done < <(env | grep "^AIRFLOW_CONN_SLACK*")

# Set up Airflow Variable defaults with descriptions automatically
# List all existing airflow variables
output=$(airflow variables list -o plain)
As you've captured below, Airflow adds `key` as the first line of output on this command. We can just skip that first line altogether by using the following:
- output=$(airflow variables list -o plain)
+ output=$(airflow variables list -o plain | tail -n +2)
This can also be done for the creation of `new_variables_list`.
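As a quick illustration of what `tail -n +2` does, here is a sketch that uses a stand-in string instead of real `airflow variables list -o plain` output. The variable names are made up for the example.

```shell
# Stand-in for the command output: a "key" header row followed by
# two hypothetical variable names.
sample_output=$(printf 'key\nEXAMPLE_VAR_A\nEXAMPLE_VAR_B\n')

# `tail -n +2` starts printing at line 2, so the header row is dropped
# and only the variable names remain.
existing_vars=$(printf '%s\n' "$sample_output" | tail -n +2)

printf '%s\n' "$existing_vars"
```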
catalog/entrypoint.sh
Outdated
fi
done

if [ "$column1" != "Key" ] && ! $matched; then
I am...confused about how this logic works 😅 We're checking if `column3` is "description" above and skipping on that case, but then we're checking once again for "Key" here? And it looks like this would always be true even on the first line because the TSV uses the term `key` and not `Key`. I think this first predicate can be removed.
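One way to avoid a header check inside the loop entirely is to drop the header row before reading, then test each key against the existing variables. The sketch below is not the PR's code: the TSV contents, the column order (`key`, default value, description), and the variable names are assumptions, and it only prints what it would do rather than calling Airflow.

```shell
# Hypothetical stand-ins: a small TSV file and a list of existing variables.
tsv_file=$(mktemp)
printf 'key\tdefault\tdescription\n' > "$tsv_file"
printf 'EXAMPLE_VAR_A\t1\tFirst example\n' >> "$tsv_file"
printf 'EXAMPLE_VAR_B\t2\tSecond example\n' >> "$tsv_file"
existing_vars="EXAMPLE_VAR_A"

tab=$(printf '\t')
to_add=""

# The header row is removed once with `tail`, so the loop body never has
# to ask whether the current row is the header.
while IFS="$tab" read -r key default description; do
    # -x matches whole lines only, -F treats the key as a literal string.
    if printf '%s\n' "$existing_vars" | grep -qxF -- "$key"; then
        echo "Skipping existing variable: $key"
    else
        echo "Would set: $key=$default ($description)"
        to_add="$to_add$key "
    fi
done <<EOF
$(tail -n +2 "$tsv_file")
EOF

rm -f "$tsv_file"
```

Note that a tab counts as whitespace for `IFS`, so consecutive tabs are collapsed by `read`; a row with an empty default value would shift the description into the second field in this sketch.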
239cf14 to a99ac68
Hi @AetherUnbound thanks for all the insights.
This is really close! One more change and it should be ready 😄
catalog/entrypoint.sh
Outdated
done <<<"$output"

if $found_existing_vars; then
    echoe -e "Found the following existing variables(The values of these will not be overwritten):\n"
- echoe -e "Found the following existing variables(The values of these will not be overwritten):\n"
+ echo -e "Found the following existing variables (the values of these will not be overwritten):\n"
Thanks for spotting that! Updated now.
a99ac68 to d45461f
Looks great, thanks for this awesome addition and for addressing all the feedback! 😄 🚀
Very nice DX improvement, @madewithkode! Thank you for your contribution.
I've added several spelling suggestions.
Fixes
Fixes #4202 by @AetherUnbound
Description
Set up airflow variable defaults with descriptions automatically.
This change adds a `variables.tsv` file inside `/catalog`. This file contains tab-separated values of variables, their default values, and their descriptions. It also adds new logic in `/catalog/entrypoint.sh` to read this file and use the contents to automatically set the corresponding Airflow variables. This logic does not override existing Airflow variables with any duplicates found in the file, but it does add any that do not exist yet.
Testing Instructions
This change can be observed by:
- `openverse_catalog` container (airflow) - `just c`
- http://localhost:9090/variable/list/: you should be able to see all variables from `variables.tsv` alongside their descriptions, now automatically set.
Checklist