[SPARK-47488][K8S] Fix driver pod stuck when driver runs on K8s #45667
base: master
Conversation
May I ask why this PR was re-created? Did you find some issues with wildfly-openssl there?
@dongjoon-hyun Yes, this PR is re-created. wildfly-openssl is OK. I made an incorrect commit, which left the code in a mess, so I decided to re-create this PR.
We have unresolved review comments from the previous PR.
if (SparkContext.getActive.isEmpty) {
  if (!DriverPodIsNormal) {
    logError(s"Driver Pod will exit because: $driverThrow")
    System.exit(EXIT_FAILURE)
Could you provide the YARN code link for this case? Specifically:
- Does the YARN job fail with EXIT_FAILURE?
- If this PR is for consistency between YARN and K8s, we should have the same exit code and the same error message.
Here is the YARN exit code:
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
It seems ApplicationMaster adopts its own failure exit codes, which are not suitable for the driver pod.
What is the YARN exit code for your case? It's not clear from your PR description, @littlelittlewhite09. It should be described in the PR description.
When the Spark app runs in yarn-cluster mode, everything is OK. The driver can terminate normally if it encounters an exception or error.
try {
  app.start(childArgs.toArray, sparkConf)
} catch {
  case t: Throwable =>
    throw findCause(t)
    logWarning("Some ERR/Exception happened when app is running.")
    if (args.master.startsWith("k8s")) {
Did you want to use DriverPodIsNormal here?
Yes, applying DriverPodIsNormal here may be better.
Then, please use it.
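(For illustration only: one possible reading of this exchange is that the catch block records the abnormal state before rethrowing. The placement, and the assumption that DriverPodIsNormal and driverThrow are driver-side fields introduced by this PR, are guesses, not the PR's literal diff.)

try {
  app.start(childArgs.toArray, sparkConf)
} catch {
  case t: Throwable =>
    logWarning("Some ERR/Exception happened when app is running.")
    if (args.master.startsWith("k8s")) {
      // Hypothetical: remember that the driver failed so the K8s shutdown
      // path can exit the pod with EXIT_FAILURE instead of hanging.
      DriverPodIsNormal = false
      driverThrow = t
    }
    throw findCause(t)
}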
Thank you for sharing, @littlelittlewhite09.
If this is a YARN-specific, esoteric feature, we had better not do this, @littlelittlewhite09, because Standalone and K8s are consistent with each other.
Sorry, but let's hold this PR a little in order to make sure of the behavior, because we don't have any test coverage (in this PR, or even an existing YARN test case?) and it's not clear whether this is consistent behavior across Spark resource managers, @littlelittlewhite09. When I have some time, I can revisit this from that perspective.
Or, you can ping someone else while I'm away. Thank you!
What changes were proposed in this pull request?
This PR is related to SPARK-47488.
The idea is that, when the driver encounters an exception and Spark runs on K8s, the driver will be terminated once the SparkContext is closed, regardless of whether non-daemon threads have stopped.
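A minimal sketch of how the pieces fit together, combining the two snippets quoted in the review above; the exact hook point and the normal-exit branch are assumptions, not the PR's literal diff:

// Hypothetical driver-side shutdown check on K8s: once no SparkContext is
// active, exit explicitly instead of waiting for non-daemon threads.
if (args.master.startsWith("k8s") && SparkContext.getActive.isEmpty) {
  if (!DriverPodIsNormal) {
    logError(s"Driver Pod will exit because: $driverThrow")
    System.exit(EXIT_FAILURE)
  } else {
    System.exit(0) // assumed success path; the PR only shows the failure path
  }
}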
Why are the changes needed?
We are migrating Spark apps from yarn-cluster mode to K8s. When a Spark app runs in yarn-cluster mode, everything is OK: the driver can terminate normally if it encounters an exception or error. But when running on K8s, the driver pod may get stuck after an exception even if the SparkContext is closed. We found this problem is caused by non-daemon threads that are not stopped; in yarn-cluster mode the driver can still stop even if a non-daemon thread remains, as illustrated by the sketch below.
This PR may help make the migration from yarn-cluster mode to K8s smoother.
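The underlying JVM behavior can be reproduced outside Spark: a process does not exit while any non-daemon thread is alive, so unless something calls System.exit, a leftover user thread keeps the driver pod running. A standalone sketch (not Spark code):

object NonDaemonDemo {
  def main(args: Array[String]): Unit = {
    val t = new Thread(() => {
      // e.g. a user thread pool that was never shut down
      while (true) Thread.sleep(60000)
    })
    t.setDaemon(false) // non-daemon: keeps the JVM alive after main() returns
    t.start()
    println("main() is done, but the process keeps running")
    // An explicit System.exit(0) here would terminate the process regardless.
  }
}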
Does this PR introduce any user-facing change?
no
How was this patch tested?
UT
Was this patch authored or co-authored using generative AI tooling?
no