
Find Pod Before Cleanup In KubernetesPodOperator Execution #22092

Merged

Conversation

michaelmicheal
Contributor


As outlined in this issue, running multiple KubernetesPodOperators with random_name_suffix=False and is_delete_operator_pod=True leads to:

  1. The first task creates a pod (named 'my_pod', for example).
  2. The second task attempts to create a pod with the same name and fails because a pod named 'my_pod' already exists.
  3. The second task deletes the pod named 'my_pod', which is the pod from the first task.

The second task shouldn't delete the first task's pod, so I added a check that the task's pod actually exists (using the find_pod method) before calling the cleanup function (which handles deletion of the pod).
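In rough terms, the check looks like this (a sketch only, not the exact diff; the find_pod arguments shown here are assumptions):

    # Only run cleanup (which handles pod deletion) if this task's pod can
    # actually be found in the cluster.
    remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
    if remote_pod is not None:
        self.cleanup(pod=self.pod or self.pod_request_obj, remote_pod=remote_pod)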

Validation

To reproduce the issue and validate this change, I ran two DAG runs of the following DAG at the same time.

from datetime import timedelta
from airflow import models
from airflow import utils
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

dag = models.DAG(
    'kubernetes_change_validation',
    start_date=utils.dates.days_ago(2),
    max_active_runs=3,
    dagrun_timeout=timedelta(minutes=5),
    schedule_interval='@daily'
)

test_kubernetes_pod = KubernetesPodOperator(
    namespace='my_namespace',
    image="busybox",
    cmds=['sh', '-c', 'sleep 600'],
    name="test_kubernetes_pod",
    in_cluster=True,
    task_id="test_kubernetes_pod",
    get_logs=True,
    random_name_suffix=False,
    dag=dag,
    is_delete_operator_pod=True
)

@boring-cyborg boring-cyborg bot added the provider:cncf-kubernetes (Kubernetes provider related issues) and area:providers labels on Mar 8, 2022
@boring-cyborg

boring-cyborg bot commented Mar 8, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 81c823f to 7b2f11a Compare March 31, 2022 22:07
@potiuk
Member

potiuk commented Apr 4, 2022

I am planning to release the cncf.kubernetes provider soon (we need it for the 2.3.0 release), so fixing the problems today/tomorrow might be useful, @michaelmicheal, to get this one in :)

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 839de3f to f934912 Compare April 4, 2022 18:49
@michaelmicheal
Contributor Author

@potiuk @jedcunningham Is it possible to get approval to run all the CI workflows?

jedcunningham
jedcunningham previously approved these changes Apr 4, 2022
@github-actions github-actions bot added the okay to merge (It's ok to merge this PR as it does not require more tests) label on Apr 4, 2022
@github-actions

github-actions bot commented Apr 4, 2022

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@michaelmicheal
Contributor Author

@jedcunningham Do I need to update the helm chart tests?

@jedcunningham
Member

Yeah, it looks like those will need some attention. Hopefully you can reproduce by following these instructions:
https://github.com/apache/airflow/blob/main/TESTING.rst#running-tests-with-kubernetes

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from a77893e to ac30bee Compare April 7, 2022 15:24
@michaelmicheal
Contributor Author

@jedcunningham Is it possible to get the CI to run again? I updated the helm chart tests

@dstandish
Contributor

dstandish commented May 11, 2022

Hey @michaelmicheal, thanks for this PR. I think I understand the issue now.

I think this solution is a little indirect. The reason we want to skip deletion is that the task tried to create a pod but one with that name already exists. But your "skip deletion" logic is "can't find the pod", yet there is a pod there... It just seems like we can tighten it up a little bit. The other issue is that you make a backward-incompatible signature change to cleanup (since you're removing an arg).

Here's what I would propose.

When we attempt to create the pod and one with that name already exists, we get an ApiException object e whose e.body looks like this:

{'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'pods "test-kubernetes-pod" already exists', 'reason': 'AlreadyExists', 'details': {'name': 'test-kubernetes-pod', 'kind': 'pods'}, 'code': 409}

So what we could do is, in our try / finally, add an except clause to capture and store the exception, and then pass it to cleanup. Then in cleanup, if we get this kind of response (i.e. can't create the pod because it already exists), we can choose to skip pod deletion: something else created a pod that we did not expect to be there, so let's just fail and leave the pod there.

What do you think?

@michaelmicheal
Contributor Author

@dstandish Any suggestions on how I should pass the exception or tell cleanup to skip deletion? Would it be too hacky to set is_delete_operator_pod to False?

@dstandish
Contributor

dstandish commented May 11, 2022

So to do this sort of thing, I think you have to create a variable outside the scope of the try,
e.g.

        exc = None
        try:
            ...
        except Exception as e:
            exc = e
        finally:
            self.cleanup(
                pod=self.pod or self.pod_request_obj,
                remote_pod=remote_pod,
                exc=exc,
            )

Then you'd want some logic in cleanup to evaluate the exc and skip deletion in that scenario.
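For illustration, that cleanup-side logic might look roughly like this (a sketch; the exc parameter and the exact exception handling are assumptions, not the provider's actual code):

    from airflow.exceptions import AirflowException
    from kubernetes.client.exceptions import ApiException

    def cleanup(self, pod, remote_pod, exc=None):
        # Hypothetical: if pod creation failed with a conflict, the existing
        # pod was created by something else, so fail without deleting it.
        if isinstance(exc, ApiException) and exc.status == 409:
            raise AirflowException(f"Pod {pod.metadata.name} already exists; leaving it in place")
        # ... otherwise proceed with the normal log/delete cleanup steps ...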

I would not mess with is_delete_operator_pod -- that is something different and we should not mutate it. What we're doing here is conditionally skipping deletion because there's a conflicting pod there -- we don't care about the value of is_delete_operator_pod in this case, and in any case is_delete_operator_pod reflects the intention of the DAG author, not the circumstances encountered during task execution.

Stepping back, I realize the difference between "find before delete" and "skip delete if there was a 'pod already exists' error" is a bit subtle. Do you think this approach makes more sense? Or not really? I think this way better reveals the intention (e.g. "why are we trying to find it again?"). Do argue for your original way if you think it's better for whatever reason. Maybe @jedcunningham will take another look and chime in.

@dstandish
Contributor

Coincidentally, I just encountered a different issue where we get a 409 error. In that case, we were trying to patch the pod based on an outdated pod object and got this error response:

kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'b60acb80-fd12-433e-8cce-118d06160fa7', 'X-Kubernetes-Pf-Prioritylevel-Uid': '51a496a9-78a8-4594-91b9-2ea9c2d3d61e', 'Date': 'Thu, 12 May 2022 16:49:43 GMT', 'Content-Length': '388'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on pods \"test-kubernetes-pod-db9eedb7885c40099dd40cd4edc62415\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"test-kubernetes-pod-db9eedb7885c40099dd40cd4edc62415","kind":"pods"},"code":409}

So if we go the error-parsing route, we just have to make sure we're being targeted enough (i.e. just looking for code 409 is not sufficient; we must also verify it's a "pod already exists" scenario).
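Concretely, a targeted check would parse the response body rather than rely on the status code alone; something like this hypothetical helper (a sketch, not provider code):

    import json

    from kubernetes.client.exceptions import ApiException

    def is_pod_already_exists_error(exc):
        # A patch conflict also returns 409, so require reason == 'AlreadyExists'
        # in the response body, not just the status code.
        if not isinstance(exc, ApiException) or exc.status != 409:
            return False
        try:
            body = json.loads(exc.body or "{}")
        except ValueError:
            return False
        return body.get("reason") == "AlreadyExists"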

@michaelmicheal
Contributor Author

I think the argument for finding the pod before cleanup is that it ensures a pod exists before we attempt to delete it. This works not only for a specific edge case (like the situation where the task tried to create a pod but one with that name already exists), but for any situation in which the pod doesn't exist. I'm happy to implement your proposed solution @dstandish, but what do you think?

@dstandish
Contributor

I'm OK with it. Just try to document the intention with a comment and a test.
Thanks

@dstandish
Contributor

dstandish commented May 13, 2022

OK, actually... I think there's a simpler way to fix this.

When we are calling _process_pod_deletion (i.e. here), we could simply use remote_pod instead (if it's not None).

Then it will only delete a pod that it has already found. That will solve your issue. WDYT? This is similar to #23676.

Maybe we also add a remote_pod = self.find_pod(... after the get_or_create, to ensure that the variable is populated as early as possible.
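Sketched out, the execute flow would look something like this (method names match the provider, but the body is abbreviated and the exact signatures are assumptions):

    remote_pod = None
    try:
        self.pod_request_obj = self.build_pod_request_obj(context)
        self.pod = self.get_or_create_pod(self.pod_request_obj, context)
        # Populate remote_pod as early as possible, right after creation, so
        # that only a pod actually found in the cluster is ever deleted.
        remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
        # ... await pod start, fetch logs, await completion ...
    finally:
        self.cleanup(
            pod=self.pod or self.pod_request_obj,
            remote_pod=remote_pod,
        )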

@michaelmicheal
Contributor Author

maybe we also add a remote_pod = self.find_pod(... after the get_or_create, to ensure that the variable is populated as early as possible.

Makes sense to me. If we're calling find_pod though, why not just do it in the finally block or in cleanup so that it's the most up to date?

@dstandish
Contributor

Makes sense to me. If we're calling find_pod though, why not just do it in the finally block or in cleanup so that it's the most up to date?

Because to do that, we would have to change the signature of cleanup and change more code.

@dstandish
Contributor

dstandish commented May 13, 2022

Oh, you also mention the option of putting it in finally. I guess putting it in finally would be OK too, but the thing about finally is, we don't know how we got there, and we have to be careful not to do things that could fail and introduce more exceptions beyond the one that (potentially) brought us there. So to me it seems marginally cleaner to just do it after get_or_create. And indeed, in your case it would fail.

I think maybe ideally get_or_create would do the find itself, but for some reason it sometimes returns only the request object.

@michaelmicheal
Contributor Author

Fair enough, makes sense to me. I'll move the find_pod to right after the get_or_create and pass remote_pod to _process_pod_deletion if remote_pod isn't None?

@dstandish
Contributor

Fair enough, makes sense to me. I'll move the find_pod to right after the get_or_create and pass remote_pod to _process_pod_deletion if remote_pod isn't None?

Yeah, that sounds good to me.

@eladkal
Contributor

eladkal commented Jun 1, 2022

@michaelmicheal there are conflicts :(

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 83410f7 to d0fbe08 Compare June 2, 2022 20:18
@michaelmicheal
Contributor Author

@dstandish @eladkal I resolved the conflicts, could I get the CI workflow to run?

@@ -428,16 +430,18 @@ def cleanup(self, pod: k8s.V1Pod, remote_pod: k8s.V1Pod):
        with _suppress(Exception):
            for event in self.pod_manager.read_pod_events(pod).items:
                self.log.error("Pod Event: %s - %s", event.reason, event.message)
        with _suppress(Exception):
            self.process_pod_deletion(pod)
            if remote_pod is not None:
Member


I'm not sure if we care, but if the create succeeds but the find fails, we can leave the pod with this approach.

Contributor Author


Is it fair to assume that if the pod find fails then the pod doesn't exist and we don't need to delete it?

@jedcunningham
Member

I've kicked CI off for you.

@potiuk
Member

potiuk commented Jun 6, 2022

Looks green @dstandish @jedcunningham :)

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 941e969 to 39a0fdc Compare June 6, 2022 15:39
@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 39a0fdc to 8856aa8 Compare June 13, 2022 18:17
@michaelmicheal
Contributor Author

@potiuk @dstandish @jedcunningham Do I need to make any other changes or is this PR good to merge?

Contributor

@dstandish dstandish left a comment


Small changes. Sorry, I had a half-completed review that was just sitting there.

@michaelmicheal
Contributor Author

@dstandish I added the pod is None check to process_pod_deletion and removed the redundant mocking from that test. Let me know if I need to make any other changes
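For reference, the guard is roughly of this shape (a sketch with assumed log messages, not the exact provider code):

    def process_pod_deletion(self, pod):
        # Skip deletion entirely when no pod object is available.
        if pod is None:
            self.log.info("Skipping deletion; no pod was found")
            return
        if self.is_delete_operator_pod:
            self.log.info("Deleting pod: %s", pod.metadata.name)
            self.pod_manager.delete_pod(pod)
        else:
            self.log.info("Skipping deleting pod: %s", pod.metadata.name)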

@michaelmicheal
Contributor Author

@jedcunningham any suggestions for changes?
