
Get spark driver pod status if log stream interrupted accidentally #9081

Closed · wants to merge 7 commits
Conversation


@dawany dawany commented May 31, 2020

#8963

Description

I am using the airflow SparkSubmitOperator to schedule my spark jobs on a kubernetes cluster.

For some reason, kubernetes often throws a 'too old resource version' exception, which interrupts the spark watcher; airflow then loses the log stream and can never get the 'Exit Code'. As a result, airflow marks the job failed once the log stream is lost, even though the job is still running.

This PR adds a simple retry mechanism: when the log stream is interrupted, call 'read_namespaced_pod()', provided by the kubernetes client API, to get the spark driver pod status.
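The fallback described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name `track_driver_pod_status` and its parameters are invented for this sketch, not taken from the PR), assuming the official kubernetes Python client, whose `CoreV1Api.read_namespaced_pod(name, namespace)` returns a pod object with a `status.phase` field:

```python
import time


def track_driver_pod_status(api, pod_name, namespace, poll_interval=10):
    """Poll the spark driver pod until it reaches a terminal phase.

    `api` is expected to expose read_namespaced_pod(name, namespace),
    as kubernetes.client.CoreV1Api does; a terminal pod phase is
    either 'Succeeded' or 'Failed'.
    """
    while True:
        pod = api.read_namespaced_pod(name=pod_name, namespace=namespace)
        phase = pod.status.phase
        if phase in ("Succeeded", "Failed"):
            return phase
        time.sleep(poll_interval)
```

When the watcher's log stream dies with a 'too old resource version' error, the operator could fall back to a poll like this instead of immediately marking the job failed.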

Target Github ISSUE

#8963


Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

Review comment on the diff:

    "Cannot execute: {}. Error code is: {}.".format(
        self._mask_cmd(spark_submit_cmd), returncode

    # double check by spark driver pod status (blocking function)
    spark_driver_pod_status = self._start_k8s_pod_status_tracking()
Member


This is going to fail hard when not in kubernetes mode.

Author


Yes, thanks for that. I've split the 'if' conditions so the check has no impact when not in k8s mode.
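One hedged sketch of what splitting the 'if' conditions could look like. The class and most names here are invented for illustration; only `_start_k8s_pod_status_tracking`, the returncode check, and the error message are modeled on the diff excerpt above, and the real PR code may differ:

```python
class SparkSubmitHookSketch:
    """Illustrative stand-in for the hook; not the actual Airflow code."""

    def __init__(self, is_kubernetes, returncode, driver_phase="Succeeded"):
        self._is_kubernetes = is_kubernetes   # hypothetical deploy-mode flag
        self._returncode = returncode
        self._driver_phase = driver_phase

    def _start_k8s_pod_status_tracking(self):
        # Stands in for the blocking pod-status poll from the diff above.
        return self._driver_phase

    def _check_result(self, spark_submit_cmd="spark-submit ..."):
        if self._returncode:
            # The k8s check is nested inside its own condition, so it can
            # never run (and fail hard) outside kubernetes mode.
            if self._is_kubernetes:
                if self._start_k8s_pod_status_tracking() == "Succeeded":
                    # Log stream was lost but the driver pod succeeded.
                    return
            raise RuntimeError(
                "Cannot execute: {}. Error code is: {}.".format(
                    spark_submit_cmd, self._returncode
                )
            )
```

Nesting the kubernetes check inside the returncode check means a non-k8s failure still raises immediately, while a k8s job whose log stream was merely interrupted gets a second chance via the pod status.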

@dawany
Author

dawany commented Jun 24, 2020

@ashb would you mind reviewing the new code change? Thank you!

@stale

stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 8, 2020
@stale stale bot closed this Aug 16, 2020
@berglh

berglh commented Jun 1, 2023

@dawany I know it's been a long time, but we're suffering from this exact issue and, for reasons I'd rather not get into, we're currently stuck on Airflow 1.10.11. Just curious, did you test this code? Are you no longer experiencing this in newer versions? This seems like a reasonable solution that could easily be patched in with a config map or built into a custom container.
