Get spark driver pod status if log stream interrupted accidentally #9081
Conversation
"Cannot execute: {}. Error code is: {}.".format(
    self._mask_cmd(spark_submit_cmd), returncode
# double check by spark driver pod status (blocking function)
spark_driver_pod_status = self._start_k8s_pod_status_tracking()
This is going to fail hard when not in Kubernetes mode.
Yes, thanks for that. I've split the 'if' conditions so there is no effect when not in k8s mode.
@ashb would you mind reviewing the new code change? Thank you!
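The fix discussed above can be sketched as follows. This is an illustrative outline, not the actual Airflow source: the function and parameter names (`handle_spark_submit_exit`, `is_kubernetes`, `track_pod_status`) are hypothetical stand-ins for the hook's internals, showing how the pod-status double-check is guarded so it only runs in Kubernetes mode.

```python
def handle_spark_submit_exit(returncode, is_kubernetes, track_pod_status):
    """Decide whether a non-zero spark-submit exit really means failure.

    Hypothetical sketch: when running against Kubernetes, double-check
    the driver pod status before failing the task, since a lost log
    stream can make spark-submit report an error for a healthy job.
    """
    if returncode != 0:
        if is_kubernetes:
            # Blocking call: wait for the driver pod's terminal phase.
            phase = track_pod_status()
            if phase == "Succeeded":
                return  # log stream was lost, but the job finished fine
        raise RuntimeError(
            "Cannot execute spark-submit. Error code is: {}.".format(returncode)
        )
```

Keeping the Kubernetes check nested inside the non-zero-returncode branch means non-k8s connections never touch the pod-tracking path, which is what the reviewer asked for.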
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@dawany I know it's been a long time, but we're suffering from this exact issue and, for reasons I'd rather not get into, we're currently stuck on Airflow 1.10.11. Just curious, did you test this code? Are you no longer experiencing this in newer versions? This seems like a reasonable solution that could easily be patched in with a config map or by building it into a custom container.
Description
I am using the Airflow SparkSubmitOperator to schedule my Spark jobs on a Kubernetes cluster.
For some reason, Kubernetes often throws a 'too old resource version' exception, which interrupts the Spark watcher; Airflow then loses the log stream and can never read the 'Exit Code'. As a result, Airflow marks the job as failed once the log stream is lost, even though the job is still running.
This is a simple retry mechanism: when the log stream is interrupted, call `read_namespaced_pod()`, provided by the Kubernetes client API, to get the Spark driver pod status.
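The fallback described above can be sketched with the official Kubernetes Python client. This is a minimal illustration, not the PR's actual code: `poll_driver_pod_status` and its parameters are hypothetical names, and the in-cluster config assumption may differ from your setup.

```python
TERMINAL_PHASES = ("Succeeded", "Failed")


def is_terminal(phase):
    """True once the driver pod has reached a final phase."""
    return phase in TERMINAL_PHASES


def poll_driver_pod_status(pod_name, namespace="default", interval=10):
    """Poll the Spark driver pod until it reaches a terminal phase.

    Hypothetical fallback for when the spark-submit log stream is
    interrupted (e.g. by a 'too old resource version' watch error)
    before the exit code could be read from the logs.
    """
    # Imported locally so the sketch stays importable without the package.
    import time
    from kubernetes import client, config

    config.load_incluster_config()  # or config.load_kube_config() locally
    v1 = client.CoreV1Api()
    while True:
        pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
        # pod.status.phase: Pending / Running / Succeeded / Failed / Unknown
        if is_terminal(pod.status.phase):
            return pod.status.phase
        time.sleep(interval)
```

Unlike the watch-based log stream, `read_namespaced_pod()` is a plain GET, so it is unaffected by resource-version staleness and can safely decide whether the job actually succeeded.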
Target GitHub issue
#8963
Make sure to mark the boxes below before creating PR: [x]
In case of a fundamental code change, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards-incompatible changes, please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.