
Find Pod Before Cleanup In KubernetesPodOperator Execution #22092

Merged

Conversation

michaelmicheal
Contributor


As outlined in this issue, running multiple KubernetesPodOperators with random_name_suffix=False and is_delete_operator_pod=True leads to:

  1. The first task creates a pod (named 'my_pod', for example).
  2. The second task attempts to create a pod with the same name and fails because a pod named 'my_pod' already exists.
  3. The second task deletes the pod named 'my_pod', which is the pod from the first task.

The second task shouldn't delete the first task's pod, so I added a check that the task's pod actually exists (using the find_pod method) before calling the cleanup function (which handles deletion of the pod).
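In rough terms, the check looks like this (a sketch only, not the exact diff; the find_pod arguments shown here are assumptions):

    # Only run cleanup (which handles pod deletion) if this task's pod can
    # actually be found in the cluster.
    remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
    if remote_pod is not None:
        self.cleanup(pod=self.pod or self.pod_request_obj, remote_pod=remote_pod)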

Validation

To reproduce the issue and validate this change, I ran two DAG runs of the following DAG at the same time.

from datetime import timedelta
from airflow import models
from airflow import utils
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

dag = models.DAG(
    'kubernetes_change_validation',
    start_date=utils.dates.days_ago(2),
    max_active_runs=3,
    dagrun_timeout=timedelta(minutes=5),
    schedule_interval='@daily'
)

test_kubernetes_pod = KubernetesPodOperator(
    namespace='my_namespace',
    image="busybox",
    cmds=['sh', '-c', 'sleep 600'],
    name="test_kubernetes_pod",
    in_cluster=True,
    task_id="test_kubernetes_pod",
    get_logs=True,
    random_name_suffix=False,
    dag=dag,
    is_delete_operator_pod=True
)

@boring-cyborg boring-cyborg bot added the provider:cncf-kubernetes (Kubernetes provider related issues) and area:providers labels on Mar 8, 2022
@boring-cyborg

boring-cyborg bot commented Mar 8, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 81c823f to 7b2f11a Compare March 31, 2022 22:07
@potiuk
Member

potiuk commented Apr 4, 2022

I am planning to release the cncf.kubernetes provider soon (we need it for the 2.3.0 release), so fixing the problems today/tomorrow might be useful, @michaelmicheal, to get this one in :)

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 839de3f to f934912 Compare April 4, 2022 18:49
@michaelmicheal
Contributor Author

@potiuk @jedcunningham Is it possible to get approval to run all the CI workflows?

jedcunningham
jedcunningham previously approved these changes Apr 4, 2022
@github-actions github-actions bot added the okay to merge (It's ok to merge this PR as it does not require more tests) label on Apr 4, 2022
@github-actions

github-actions bot commented Apr 4, 2022

The PR is likely OK to be merged with just a subset of tests for the default Python and Database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@michaelmicheal
Contributor Author

@jedcunningham Do I need to update the helm chart tests?

@jedcunningham
Member

Yeah, it looks like those will need some attention. Hopefully you can reproduce by following these instructions:
https://github.com/apache/airflow/blob/main/TESTING.rst#running-tests-with-kubernetes

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from a77893e to ac30bee Compare April 7, 2022 15:24
@michaelmicheal
Contributor Author

@jedcunningham Is it possible to get the CI to run again? I updated the helm chart tests

@dstandish
Contributor

dstandish commented May 11, 2022

Hey @michaelmicheal, thanks for this PR. I think I understand the issue now.

I think this solution is a little indirect. The reason we want to skip deletion is that the task tried to create a pod but one with that name already exists. But your "skip deletion" logic is "can't find the pod", yet there is a pod there... It just seems like we can tighten it up a little bit. The other issue is that you make a backward-incompatible signature change to cleanup (since you're removing an arg).

Here's what I would propose.

When we attempt to create the pod and one with that name already exists, we get an ApiException object e whose e.body looks like this:

{'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'pods "test-kubernetes-pod" already exists', 'reason': 'AlreadyExists', 'details': {'name': 'test-kubernetes-pod', 'kind': 'pods'}, 'code': 409}

So what we could do is, in our try / finally, add an except clause to capture and store the exception, and then pass it to cleanup. Then in cleanup, if we get this kind of response (i.e. can't create the pod because it already exists), we can choose to skip pod deletion: something else created a pod that we did not expect to be there, so let's just fail and leave the pod there.

What do you think?

@michaelmicheal
Contributor Author

@dstandish Any suggestions on how I should pass the exception or tell cleanup to skip deletion? Would it be too hacky to set is_delete_operator_pod to False?

@dstandish
Contributor

dstandish commented May 11, 2022

So to do this sort of thing, I think you have to create a variable outside the scope of the try,
e.g.

        exc = None
        try:
            ...
        except Exception as e:
            exc = e
        finally:
            self.cleanup(
                pod=self.pod or self.pod_request_obj,
                remote_pod=remote_pod,
                exc=exc,
            )

Then you'd want some logic in cleanup to evaluate the exc and skip deletion in that scenario.
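For illustration, that cleanup-side logic might look roughly like this (a sketch; the exc parameter and the exact exception handling are assumptions, not the provider's actual code):

    from airflow.exceptions import AirflowException
    from kubernetes.client.exceptions import ApiException

    def cleanup(self, pod, remote_pod, exc=None):
        # Hypothetical: if pod creation failed with a conflict, the existing
        # pod was created by something else, so fail without deleting it.
        if isinstance(exc, ApiException) and exc.status == 409:
            raise AirflowException(f"Pod {pod.metadata.name} already exists; leaving it in place")
        # ... otherwise proceed with the normal log/delete cleanup steps ...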

I would not mess with is_delete_operator_pod -- that is something different and we should not mutate it. What we're doing here is conditionally skipping deletion because there's a conflicting pod there -- we don't care about the value of is_delete_operator_pod in this case, and in any case is_delete_operator_pod reflects the intention of the DAG author, not the circumstances encountered during task execution.

Stepping back, I realize the difference between "find before delete" and "skip delete if there was a 'pod already exists' error" is a bit subtle. Do you think this approach makes more sense? Or not really? I think this way better reveals the intention (e.g. "why are we trying to find it again?"). Do argue for your original way if you think it's better for whatever reason. Maybe @jedcunningham will take another look and chime in.

@dstandish
Contributor

Coincidentally, I just encountered a different issue where we get a 409 error. In that case, we were trying to patch the pod based on an outdated pod object and got this error response:

kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'b60acb80-fd12-433e-8cce-118d06160fa7', 'X-Kubernetes-Pf-Prioritylevel-Uid': '51a496a9-78a8-4594-91b9-2ea9c2d3d61e', 'Date': 'Thu, 12 May 2022 16:49:43 GMT', 'Content-Length': '388'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on pods \"test-kubernetes-pod-db9eedb7885c40099dd40cd4edc62415\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"test-kubernetes-pod-db9eedb7885c40099dd40cd4edc62415","kind":"pods"},"code":409}

So if we go the error-parsing route, we just have to make sure we're being targeted enough (i.e. just looking for code 409 is not sufficient; we must also verify it's a "pod already exists" scenario).
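Concretely, a targeted check would parse the response body rather than rely on the status code alone; something like this hypothetical helper (a sketch, not provider code):

    import json

    from kubernetes.client.exceptions import ApiException

    def is_pod_already_exists_error(exc):
        # A patch conflict also returns 409, so require reason == 'AlreadyExists'
        # in the response body, not just the status code.
        if not isinstance(exc, ApiException) or exc.status != 409:
            return False
        try:
            body = json.loads(exc.body or "{}")
        except ValueError:
            return False
        return body.get("reason") == "AlreadyExists"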

@michaelmicheal
Contributor Author

I think the argument for finding the pod before cleanup is that it ensures a pod exists before we attempt to delete it. This works not only for a specific edge case (like the situation where the task tried to create a pod but one with that name already exists), but for any situation in which the pod doesn't exist. I'm happy to implement your proposed solution @dstandish, but what do you think?

@dstandish
Contributor

I'm OK with it. Just try to document the intention with a comment and a test.
Thanks

@dstandish
Contributor

dstandish commented May 13, 2022

OK, actually... I think there's a simpler way to fix this.

When we are calling _process_pod_deletion (i.e. here), we could simply use remote_pod instead (if it's not None).

Then it will only delete a pod that it has already found. That will solve your issue. WDYT? This is similar to #23676.

Maybe we also add a remote_pod = self.find_pod(... after the get_or_create, to ensure that the variable is populated as early as possible.
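Sketched out, the execute flow would look something like this (method names match the provider, but the body is abbreviated and the exact signatures are assumptions):

    remote_pod = None
    try:
        self.pod_request_obj = self.build_pod_request_obj(context)
        self.pod = self.get_or_create_pod(self.pod_request_obj, context)
        # Populate remote_pod as early as possible, right after creation, so
        # that only a pod actually found in the cluster is ever deleted.
        remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
        # ... await pod start, fetch logs, await completion ...
    finally:
        self.cleanup(
            pod=self.pod or self.pod_request_obj,
            remote_pod=remote_pod,
        )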

@michaelmicheal
Contributor Author

maybe we also add a remote_pod = self.find_pod(... after the get_or_create, to ensure that the variable is populated as early as possible.

Makes sense to me. If we're calling find_pod though, why not just do it in the finally block or in cleanup so that it's the most up to date?

@dstandish
Contributor

Makes sense to me. If we're calling find_pod though, why not just do it in the finally block or in cleanup so that it's the most up to date?

Because to do that, we would have to change the signature of cleanup and change more code.

@dstandish
Contributor

dstandish commented May 13, 2022

Oh, you also mention the option of putting it in finally. I guess putting it in finally would be OK too, but the thing about finally is, we don't know how we got there, and we have to be careful not to do things that could fail and introduce more exceptions beyond the one that (potentially) brought us there. So to me it seems marginally cleaner to just do it after get_or_create. And indeed, in your case it would fail.

I think maybe ideally get_or_create would do the find itself, but for some reason it sometimes returns only the request object.

@michaelmicheal
Contributor Author

Fair enough, makes sense to me. I'll move the find_pod to right after the get_or_create and pass remote_pod to _process_pod_deletion if remote_pod isn't None?

@dstandish
Contributor

Fair enough, makes sense to me. I'll move the find_pod to right after the get_or_create and pass remote_pod to _process_pod_deletion if remote_pod isn't None?

Yeah, that sounds good to me.

@eladkal
Contributor

eladkal commented Jun 1, 2022

@michaelmicheal there are conflicts :(

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 83410f7 to d0fbe08 Compare June 2, 2022 20:18
@michaelmicheal
Contributor Author

@dstandish @eladkal I resolved the conflicts, could I get the CI workflow to run?

@@ -428,16 +430,18 @@ def cleanup(self, pod: k8s.V1Pod, remote_pod: k8s.V1Pod):
        with _suppress(Exception):
            for event in self.pod_manager.read_pod_events(pod).items:
                self.log.error("Pod Event: %s - %s", event.reason, event.message)
        with _suppress(Exception):
            self.process_pod_deletion(pod)
            if remote_pod is not None:
Member


I'm not sure if we care, but if the create succeeds but the find fails, we can leave the pod with this approach.

Contributor Author


Is it fair to assume that if the pod find fails then the pod doesn't exist and we don't need to delete it?

@jedcunningham
Member

I've kicked CI off for you.

@potiuk
Member

potiuk commented Jun 6, 2022

Looks green @dstandish @jedcunningham :)

@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 941e969 to 39a0fdc Compare June 6, 2022 15:39
@michaelmicheal michaelmicheal force-pushed the mpe-kubernetes-pod-find-before-cleanup branch from 39a0fdc to 8856aa8 Compare June 13, 2022 18:17
@michaelmicheal
Contributor Author

@potiuk @dstandish @jedcunningham Do I need to make any other changes or is this PR good to merge?

Contributor

@dstandish dstandish left a comment


Small changes. Sorry, I had a half-completed review that was just sitting there.

@michaelmicheal
Contributor Author

@dstandish I added the pod is None check to process_pod_deletion and removed the redundant mocking from that test. Let me know if I need to make any other changes
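For reference, the guard is roughly of this shape (a sketch with assumed log messages, not the exact provider code):

    def process_pod_deletion(self, pod):
        # Skip deletion entirely when no pod object is available.
        if pod is None:
            self.log.info("Skipping deletion; no pod was found")
            return
        if self.is_delete_operator_pod:
            self.log.info("Deleting pod: %s", pod.metadata.name)
            self.pod_manager.delete_pod(pod)
        else:
            self.log.info("Skipping deleting pod: %s", pod.metadata.name)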

@michaelmicheal
Contributor Author

@jedcunningham any suggestions for changes?
