Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(helm): set worker safeToEvict properly #35130

Merged
merged 6 commits into from Oct 27, 2023
Merged

Conversation

hakuno
Copy link
Contributor

@hakuno hakuno commented Oct 23, 2023

It improves (or fixes) the workers.safeToEvict usage.

Kubernetes administrators/users agree that cluster-autoscaler.kubernetes.io/safe-to-evict can be managed to avoid the Autoscaler to kill the worker pod when scaling nodes down.

When this annotation is set to "true", the cluster autoscaler is allowed to evict a Pod even if other rules would normally prevent that. The cluster autoscaler never evicts Pods that have this annotation explicitly set to "false"; you could set that on an important Pod that you want to keep running. If this annotation is not set then the cluster autoscaler follows its Pod-level behavior.

So, if I set workers.safeToEvict to false, it gets nothing. The Autoscaler will kill that still. Because the Helm chart has no effect on handling that. See:

        {{- if .Values.workers.safeToEvict }}
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        {{- end }}

What about if I need to set that to false? I couldn't. Anybody is unable to.

A probaly workaround would set up workers.podAnnotations with map of annotations and workers.safeToEvict falsely together.

Please, check it out carefully. Thanks in advance!

@boring-cyborg
Copy link

boring-cyborg bot commented Oct 23, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Copy link
Member

@hussein-awala hussein-awala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's better than the current implementation, could you add a test for it with the different values?

@hakuno
Copy link
Contributor Author

hakuno commented Oct 24, 2023

Thank you so, @hussein-awala

I tested the Helm Chart itself with --validate on a Kubernetes cluster to verify the generated manifests.

$ helm template --dry-run --validate --set workers.safeToEvict=false .

We got

# Source: airflow/templates/workers/worker-deployment.yaml
################################
## Airflow Worker Deployment
#################################
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: release-name-worker
  labels:
    tier: airflow
    component: worker
    release: release-name
    chart: "airflow-1.12.0-dev"
    heritage: Helm
spec:
  serviceName: release-name-worker
  replicas: 1
  selector:
    matchLabels:
      tier: airflow
      component: worker
      release: release-name
  template:
    metadata:
      labels:
        tier: airflow
        component: worker
        release: release-name
      annotations:
        checksum/metadata-secret: 8b8ce685079b3075a4b91c47e267db7b50cd8bfda2269dd36fef1e258a3a38eb
        checksum/result-backend-secret: 98a68f230007cfa8f5d3792e1aff843a76b0686409e4a46ab2f092f6865a1b71
        checksum/pgbouncer-config-secret: 1dae2adc757473469686d37449d076b0c82404f61413b58ae68b3c5e99527688
        checksum/webserver-secret-key: 1b48a02846657f40ac54dc50b230572e67e9f4a88e9a486e63bd25209173bfc4
        checksum/kerberos-keytab: 80979996aa3c1f48c95dfbe9bb27191e71f12442a08c0ed834413da9d430fd0e
        checksum/airflow-config: 520107ea8f2b203f84baed2ba60545a7afc8fc9a9f6978aa7cd45c4dba7c8dbe
        checksum/extra-configmaps: e862ea47e13e634cf17d476323784fa27dac20015550c230953b526182f5cac8
        checksum/extra-secrets: e9582fdd622296c976cbc10a5ba7d6702c28a24fe80795ea5b84ba443a56c827
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
...

And

$ helm template --dry-run --validate --set workers.safeToEvict=true .

We got

# Source: airflow/templates/workers/worker-deployment.yaml
################################
## Airflow Worker Deployment
#################################
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: release-name-worker
  labels:
    tier: airflow
    component: worker
    release: release-name
    chart: "airflow-1.12.0-dev"
    heritage: Helm
spec:
  serviceName: release-name-worker
  replicas: 1
  selector:
    matchLabels:
      tier: airflow
      component: worker
      release: release-name
  template:
    metadata:
      labels:
        tier: airflow
        component: worker
        release: release-name
      annotations:
        checksum/metadata-secret: 8b8ce685079b3075a4b91c47e267db7b50cd8bfda2269dd36fef1e258a3a38eb
        checksum/result-backend-secret: 98a68f230007cfa8f5d3792e1aff843a76b0686409e4a46ab2f092f6865a1b71
        checksum/pgbouncer-config-secret: 1dae2adc757473469686d37449d076b0c82404f61413b58ae68b3c5e99527688
        checksum/webserver-secret-key: c20aeee646c14a8ab31991c17de268a577d34f05d4a1ae045725bf974f999c69
        checksum/kerberos-keytab: 80979996aa3c1f48c95dfbe9bb27191e71f12442a08c0ed834413da9d430fd0e
        checksum/airflow-config: 520107ea8f2b203f84baed2ba60545a7afc8fc9a9f6978aa7cd45c4dba7c8dbe
        checksum/extra-configmaps: e862ea47e13e634cf17d476323784fa27dac20015550c230953b526182f5cac8
        checksum/extra-secrets: e9582fdd622296c976cbc10a5ba7d6702c28a24fe80795ea5b84ba443a56c827
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
...

And default values are same than workers.safeToEvict=true as expected.

Please, tell me if you know more ways to check it out.

@jedcunningham
Copy link
Member

@hakuno, Hussein meant a test in our test suit to make sure this doesn't regress. Here is an example, and some details on our helm chart tests.

@hakuno
Copy link
Contributor Author

hakuno commented Oct 24, 2023

@jedcunningham Thanks for that. Unfortunately, I've had some difficult to setup this test environment.

Thoughts?

@potiuk
Copy link
Member

potiuk commented Oct 24, 2023

@jedcunningham Thanks for that. Unfortunately, I've had some difficult to setup this test environment.

Thoughts?

If you do not write what kind of difficulties you had, It's hard to have more thoughts. But I recommend to follow the steps in https://github.com/apache/airflow/blob/main/CONTRIBUTORS_QUICK_START.rst to setup breeze and then the docs pointed at by @jedcunningham provide a step-by-step (almost wizard-like) approach to make it works.

@hakuno
Copy link
Contributor Author

hakuno commented Oct 24, 2023

Thanks, @potiuk - I meant the environment setup itself. My Pytest couldn't discovery tests etc.

I'll read that docs as soon as possible... thanks again :)

@hakuno
Copy link
Contributor Author

hakuno commented Oct 25, 2023

@jedcunningham @hussein-awala @potiuk I got it.

pytest helm_tests/airflow_core/test_worker.py::TestWorker::test_safetoevict_annotations
============================================================================================ test session starts ============================================================================================
platform linux -- Python 3.8.18, pytest-7.4.2, pluggy-1.3.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /opt/airflow
configfile: pyproject.toml
plugins: timeouts-1.2.1, xdist-3.3.1, capture-warnings-0.0.4, instafail-0.5.0, mock-3.12.0, asyncio-0.21.1, requests-mock-1.11.0, time-machine-2.13.0, cov-4.1.0, rerunfailures-12.0, anyio-4.0.0, httpx-0.21.3
asyncio: mode=strict
setup timeout: 0.0s, execution timeout: 0.0s, teardown timeout: 0.0s
collected 2 items                                                                                                                                                                                           

helm_tests/airflow_core/test_worker.py::TestWorker::test_safetoevict_annotations[true-True] PASSED                                                                                                    [ 50%]
helm_tests/airflow_core/test_worker.py::TestWorker::test_safetoevict_annotations[false-False] PASSED                                                                                                  [100%]

============================================================================================ 2 passed in 16.65s =============================================================================================

Ps. I ran breeze testing helm-tests --helm-test-package airflow_core too.

@hakuno
Copy link
Contributor Author

hakuno commented Oct 25, 2023

Ps.¹ Other Airflow components also has safeToEvict values.

Ps.² At another moment, I see I can contribute with other PRs. Because workers.podAnnotations and airflowPodAnnotations can fight each other whenever it has a set of annotations in common.

@eladkal eladkal added this to the Airflow Helm Chart 1.12.0 milestone Oct 27, 2023
@eladkal eladkal merged commit 764a0e3 into apache:main Oct 27, 2023
53 checks passed
@boring-cyborg
Copy link

boring-cyborg bot commented Oct 27, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:helm-chart Airflow Helm Chart changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants