Graceful handle for airflow application pods during K8s API brief interruption #19698

Closed
1 of 2 tasks
nayanen opened this issue Nov 19, 2021 · 4 comments
Assignees
Labels
kind:feature Feature Requests

Comments

@nayanen

nayanen commented Nov 19, 2021

Description

I came across this issue with a Kubernetes customer who runs Airflow. The pod gets terminated when there is a brief API server interruption. API server interruptions can happen in many scenarios, so ideally the application should handle them gracefully. Another key point is that this happens only when "is_delete_operator_pod" is set to True.

Reading the code, I understand that launcher.start_pod or launcher.monitor_pod has logic that checks the heartbeat to the API server and, once it detects an interruption, falls through to the AirflowException branch. Some nested exception handling or retry logic for brief API server unavailability should help avoid the pod termination, because the API server takes very little time to re-establish the connection.

                'airflow_version': airflow_version.replace('+', '-'),
                'kubernetes_pod_operator': 'True',
            }
        )

        self.log.debug("Starting pod:\n%s", yaml.safe_dump(self.pod.to_dict()))
        final_state = None
        try:
            launcher.start_pod(self.pod, startup_timeout=self.startup_timeout_seconds)
            final_state, remote_pod, result = launcher.monitor_pod(pod=self.pod, get_logs=self.get_logs)
        except AirflowException:

The KubernetesPodOperator is provided as part of open-source Airflow; its documentation describes the relevant parameter as follows.

is_delete_operator_pod (bool) – What to do when the pod reaches its final state, or the execution is interrupted. If False (default): do nothing, If True: delete the pod
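
The retry idea proposed above could look roughly like the following. This is only a minimal sketch and not the operator's actual implementation: `TransientApiError` and `call_with_retries` are hypothetical names introduced for illustration, and which real exceptions count as "brief API server interruption" is an assumption that would have to be decided against the Kubernetes client in use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientApiError(Exception):
    """Hypothetical stand-in for a brief API server connectivity failure."""


def call_with_retries(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TransientApiError,),
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff on the given exceptions.

    The last failure is re-raised once max_attempts is reached, so a real
    outage still surfaces to the caller instead of being swallowed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the failure
            sleep(delay)  # wait before the next attempt
            delay *= backoff
    raise AssertionError("unreachable")  # loop always returns or raises
```

In the snippet quoted from the operator, a call such as `launcher.monitor_pod(...)` could be wrapped in a lambda and passed as `func`, so that only repeated failures reach the `except AirflowException` branch that deletes the pod. Whether retrying `start_pod` is safe is a separate question, since creating a pod is not necessarily idempotent.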

Use case/motivation

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@nayanen nayanen added the kind:feature Feature Requests label Nov 19, 2021
@boring-cyborg

boring-cyborg bot commented Nov 19, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@raphaelauv
Contributor

WIP -> #19572

@uranusjr
Member

Assigning to avoid accidental overlap.

@raphaelauv
Contributor

I think we can close this issue now that the new KubernetesPodOperator retries.

@potiuk potiuk closed this as completed Feb 19, 2022