Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set up airflow variable defaults with descriptions automatically #4297

Conversation

madewithkode
Copy link
Contributor

@madewithkode madewithkode commented May 9, 2024

Fixes

Fixes #4202 by @AetherUnbound

Description

Set up airflow variable defaults with descriptions automatically.

This change adds a variables.tsv file inside /catalog, this file contains tab seperated values of variables, their default values and their description.
It also adds new logic in /catalog/entrypoint.sh to read this file and use the contents to automatically set corresponding airflow variables. This logic is smart enough to not override existing airflow variables with any duplicates found in the file, but it goes ahead to add inexistent ones.

Testing Instructions

This change can be observed by:

  • Spin up the openverse_catalog container(airflow) - just c
  • navigate to http://localhost:9090/variable/list/ you should be able to see all variables from variables.tsv alongside their descriptions now automatically set.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@madewithkode madewithkode requested a review from a team as a code owner May 9, 2024 18:50
@madewithkode madewithkode requested review from krysal and obulat May 9, 2024 18:50
@openverse-bot openverse-bot added 🟩 priority: low Low priority and doesn't need to be rushed 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🤖 aspect: dx Concerns developers' experience with the codebase 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels May 9, 2024
@madewithkode madewithkode marked this pull request as draft May 9, 2024 18:57
@madewithkode
Copy link
Contributor Author

Hi @AetherUnbound here's a draft PR tackling this issue, kindly take a look and let me know if I'm missing something.

Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fantastic start! I hate to say it yet again, but unfortunately this looks a little bit more complicated than I initially thought 😆

The entrypoint.sh file runs for every Airflow container on every start. When I tried running just down -v && just c && just logs, all the containers failed to start because they were all trying to run airflow variables list and unable to read from the (not yet initialized!) database. That database initialization happens during the exec /entrypoint "$@" line for the scheduler container only, and so we get into a situation where we're trying to modify the database when it hasn't been initialized yet, and we can' initialize the database because we're trying to modify it first!

Thankfully, Airflow gives us some tools for this. What I was able to do was wrap all of the variable adding logic in the following:

if [[ "$*" == "webserver" ]]; then
  # Wait for the database to initialize, will time out if not
  airflow db check-migrations
  # ... the rest of the logic you added

fi

What this does is:

  1. Check that we're only executing this command on the webserver container (by checking the command passed in, accessible via $*, which should be webserver for the webserver)
  2. Wait for the scheduler container to finish the migrations (by running airflow db check-migrations, which will wait to 60s for migrations to be applied)

This ensures that we're only running the variable adding command on one container, and only after the database has already been set up!

We also need one more change to the catalog's Dockerfile:

diff --git a/catalog/Dockerfile b/catalog/Dockerfile
index 64ad1837d..7a164fd52 100644
--- a/catalog/Dockerfile
+++ b/catalog/Dockerfile
@@ -69,5 +69,6 @@ ARG CONSTRAINTS_FILE="https://raw.githubusercontent.com/apache/airflow/constrain
 RUN pip install -r ${REQUIREMENTS_FILE} -c ${CONSTRAINTS_FILE}
 
 COPY entrypoint.sh /opt/airflow/entrypoint.sh
+COPY variables.tsv /opt/airflow/variables.tsv
 
 ENTRYPOINT ["/usr/bin/dumb-init", "--", "/opt/airflow/entrypoint.sh"]

This will bake the variables.tsv file into the image, so it should also be available in our production image when we deploy (our docker mounts differ between development and production).

With all that in place, I was able to get this working great! Let me know if you'd like any assistance making those changes 🙂

@@ -61,4 +61,71 @@ while read -r var_string; do
# only include Slack airflow connections
done < <(env | grep "^AIRFLOW_CONN_SLACK*")

# Set up Airflow Variable defaults with descriptions automatically
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a header here, for clarity:

Suggested change
# Set up Airflow Variable defaults with descriptions automatically
# Set up Airflow Variable defaults with descriptions automatically
header "SETTING VARIABLE DEFAULTS"

The other calls to header which have been added below should be changed to echo instead.

@@ -61,4 +61,71 @@ while read -r var_string; do
# only include Slack airflow connections
done < <(env | grep "^AIRFLOW_CONN_SLACK*")

# Set up Airflow Variable defaults with descriptions automatically
# List all existing airflow variables
output=$(airflow variables list -o plain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you've captured below, Airflow adds key as the first line of output on this command. We can just skip that first line altogether by using the following:

Suggested change
output=$(airflow variables list -o plain)
output=$(airflow variables list -o plain | tail -n +2)

This can also be done for the creation of new_variables_list.

fi
done

if [ "$column1" != "Key" ] && ! $matched; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am...confused about how this logic works 😅 We're checking if column3 is "description" above and skipping on that case, but then we're checking once again for "Key" here? And it looks like this would always be true even on the first line because the TSV uses the term key and not Key. I think this first predicate can be removed.

@obulat obulat added 🧱 stack: catalog Related to the catalog and Airflow DAGs and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels May 10, 2024
@madewithkode madewithkode force-pushed the 4202_set_up_airflow_variable_defaults_with_descriptions_automatically branch from 239cf14 to a99ac68 Compare May 10, 2024 12:18
@madewithkode
Copy link
Contributor Author

Hi @AetherUnbound thanks for all the insights.
PR has been updated and I can confirm that everything works quite as expected locally. This was quite fun as it was my first time writing a bash script to solve an actual production issue. 😅

@madewithkode madewithkode marked this pull request as ready for review May 13, 2024 15:46
Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really close! One more change and it should be ready 😄

done <<<"$output"

if $found_existing_vars; then
echoe -e "Found the following existing variables(The values of these will not be overwritten):\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echoe -e "Found the following existing variables(The values of these will not be overwritten):\n"
echo -e "Found the following existing variables (the values of these will not be overwritten):\n"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting that! Updated Now

@madewithkode madewithkode force-pushed the 4202_set_up_airflow_variable_defaults_with_descriptions_automatically branch from a99ac68 to d45461f Compare May 14, 2024 11:12
Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for this awesome addition and for addressing all the feedback! 😄 🚀

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice DX improvement, @madewithkode ! Thank you for your contribution.
I've added several spelling suggestions.

catalog/entrypoint.sh Outdated Show resolved Hide resolved
catalog/entrypoint.sh Outdated Show resolved Hide resolved
catalog/entrypoint.sh Outdated Show resolved Hide resolved
catalog/entrypoint.sh Outdated Show resolved Hide resolved
catalog/entrypoint.sh Outdated Show resolved Hide resolved
@AetherUnbound AetherUnbound merged commit 84dd4bb into WordPress:main May 20, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Set up Airflow Variable defaults with descriptions automatically
4 participants