
Can we add a configurable setting to delay pod deletion when a pod is in the Error state? #8564

Closed
gwind opened this issue Apr 26, 2020 · 9 comments
Labels
kind:feature Feature Requests

Comments

@gwind

gwind commented Apr 26, 2020

Description

Add a configurable setting to delay pod deletion when a pod ends up in the Error state.

Use case / motivation

Using the apache/airflow:1.10.10 image.

I deployed Airflow in Kubernetes and want to use the Kubernetes Executor for task execution.
If a pod ends up in the Error state, the Airflow scheduler deletes it immediately,
so we cannot see what happened; the pod is gone within seconds.

When I add a time.sleep() loop in kubernetes_executor.py:896, like this:

    # note: requires `import time` at the top of kubernetes_executor.py
    def _change_state(self, key, state, pod_id, namespace):
        if state != State.RUNNING:
            if self.kube_config.delete_worker_pods:
                # debugging hack: keep the pod around for ~120 s before deleting it
                for x in range(120):
                    self.log.info('%s: sleep 1s before pod deletion...', x)
                    time.sleep(1)
                self.kube_scheduler.delete_pod(pod_id, namespace)
                self.log.info('Deleted pod: %s in namespace %s', str(key), str(namespace))
            try:
                self.running.pop(key)
            except KeyError:
                self.log.debug('Could not find key: %s', str(key))
        self.event_buffer[key] = state

When I trigger a task manually, I can see the pod reach the Error state quickly.

➜  ~ kubectl get po
NAME                                                         READY   STATUS    RESTARTS   AGE
airflow-564c84ff46-tn5mg                                     2/2     Running   0          67s
examplebashoperatorrunme0-76fd68aa96d64e8c93c7c87904f3312a   0/1     Error     0          24s

Watching the pod's log:

➜  ~ kubectl logs -f examplebashoperatorrunme0-76fd68aa96d64e8c93c7c87904f3312a
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 23, in <module>
    import argcomplete
ModuleNotFoundError: No module named 'argcomplete'

It's an error inside the container, and it's easy to debug now.

@gwind gwind added the kind:feature Feature Requests label Apr 26, 2020
@boring-cyborg

boring-cyborg bot commented Apr 26, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@hcbraun

hcbraun commented Apr 29, 2020

Hi gwind,
how did you solve the container error?
ModuleNotFoundError: No module named 'argcomplete'
I have the same issue in pods with the Kubernetes executor and the example DAGs.

There is an option to keep / not delete worker pods:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"

@gwind
Author

gwind commented Apr 29, 2020

Hi gwind,
how did you solve the container error?
ModuleNotFoundError: No module named 'argcomplete'

I've solved it by hacking the Airflow code.

I have the same issue in pods with the Kubernetes executor and the example DAGs

There is an option to keep / not delete worker pods:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"

👍 Indeed, there is an option in https://github.com/apache/airflow/blob/master/airflow/config_templates/default_airflow.cfg#L828

# If True, all worker pods will be deleted upon termination
delete_worker_pods = True

# If False (and delete_worker_pods is True),
# failed worker pods will not be deleted so users can investigate them.
delete_worker_pods_on_failure = False
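
For illustration, the interaction between those two options can be sketched with a hypothetical helper (this is not the actual Airflow code, just how the documented semantics combine):

```python
# Hypothetical helper sketching how delete_worker_pods and
# delete_worker_pods_on_failure interact; the real logic lives
# inside Airflow's KubernetesExecutor.
def should_delete_pod(pod_failed, delete_worker_pods, delete_worker_pods_on_failure):
    if not delete_worker_pods:
        return False  # never delete worker pods
    if pod_failed and not delete_worker_pods_on_failure:
        return False  # keep failed pods around so users can investigate
    return True

# With the defaults above (True / False), a failed pod survives:
print(should_delete_pod(pod_failed=True,
                        delete_worker_pods=True,
                        delete_worker_pods_on_failure=False))  # False
```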

But those using the apache/airflow:1.10.10 image should check whether it is available there.

Thanks!

@hcbraun

hcbraun commented Apr 29, 2020

AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"
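
(For context: Airflow maps environment variables of the form AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; onto airflow.cfg options, so the settings discussed in this thread can be expressed as env vars. A minimal config fragment, assuming that naming convention:)

```shell
# Env-var equivalents of the airflow.cfg options discussed here;
# AIRFLOW__<SECTION>__<KEY> maps onto "key" under "[section]".
export AIRFLOW__KUBERNETES__RUN_AS_USER="50000"
export AIRFLOW__KUBERNETES__DELETE_WORKER_PODS="True"
export AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE="False"
```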

@ousatov-ua

Hi guys!

How did you solve the problem?

ModuleNotFoundError: No module named 'argcomplete'

Is there any setting to fix it?

@gwind
Author

gwind commented Jun 1, 2020

Hi guys!

How did you solve the problem?

ModuleNotFoundError: No module named 'argcomplete'

Is there any setting to fix it?

This bug is mostly caused by a wrong user environment in the Airflow pod.

You can use kubectl exec -it ${THE_POD} -- bash to get inside the Airflow pod, then run the airflow command to test. That will show you which user the command is running as.
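
Once inside the container, a quick way to check which dependencies are visible to the current Python environment is a small generic snippet like this (a sketch, not Airflow code; the function name is made up for illustration):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported
    in the current environment (e.g. the wrong user's PATH/site-packages)."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Inside the worker container this would flag the broken dependency,
# e.g. missing_modules(["argcomplete"]) -> ["argcomplete"] when the
# package is not visible to the user running airflow.
```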

@ousatov-ua

ousatov-ua commented Jun 1, 2020

Hi!
Thanks!

I've set it up as proposed:
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"

And it worked. :)

@ashb
Member

ashb commented Jun 11, 2020

This has already been added to master via #7507 (and then renamed in #8312) -- we're hoping to include this in the 1.10.11 release.

@gwind Are you happy that the above two PRs would give you the behaviour you are after (once they are released, of course)?

@kaxil
Member

kaxil commented Feb 27, 2021

Looks like the issue was fixed by #7507 (and then renamed in #8312).

@kaxil kaxil closed this as completed Feb 27, 2021
@kaxil kaxil added this to To Do in Kubernetes Issues - Sprint via automation Feb 27, 2021
@kaxil kaxil moved this from To Do to Done in Kubernetes Issues - Sprint Mar 10, 2021
5 participants