[SPARK-47488][k8s] fix driver pod stuck when driver on k8s #45667

Open · wants to merge 3 commits into master

Conversation

@littlelittlewhite09 (Author):

What changes were proposed in this pull request?

This PR is related to SPARK-47488.
The idea is that, when the driver encounters an exception while Spark is running on K8s, the driver process terminates once the SparkContext is closed, regardless of whether any non-daemon threads have stopped. A sketch of the intended flow follows.
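A minimal sketch of the intended control flow, pieced together from the diff excerpts quoted later in this conversation. This is a fragment of SparkSubmit's launch path, not the verbatim patch: `app`, `args`, `findCause`, and the exit code are taken from those excerpts or assumed.

```scala
try {
  app.start(childArgs.toArray, sparkConf)
} catch {
  case t: Throwable =>
    val cause = findCause(t)
    logWarning("Some ERR/Exception happened when app is running.")
    // On K8s, force the JVM to exit once the SparkContext is gone so that
    // lingering non-daemon threads cannot keep the driver pod alive.
    if (args.master.startsWith("k8s") && SparkContext.getActive.isEmpty) {
      logError(s"Driver Pod will exit because: $cause")
      System.exit(1) // the patch uses an EXIT_FAILURE constant here
    }
    throw cause
}
```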

Why are the changes needed?

We are migrating Spark apps from yarn-cluster mode to K8s. In yarn-cluster mode everything is fine: the driver terminates normally even if it encounters an exception or error. On K8s, however, the driver pod may get stuck after an exception even though the SparkContext is closed. We found that this is caused by non-daemon threads that have not stopped (illustrated below); in yarn-cluster mode the driver can still exit even when a non-daemon thread is running.
This PR should make the migration from yarn-cluster mode to K8s smoother.
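For context, here is a minimal, self-contained Scala illustration of the root cause described above (not part of the patch): a single non-daemon thread keeps the JVM, and hence a driver pod, alive even after main() returns.

```scala
object NonDaemonDemo {
  def main(args: Array[String]): Unit = {
    // Threads are non-daemon by default; the JVM will not exit while one runs.
    val t = new Thread(() => while (true) Thread.sleep(1000))
    t.start()
    println("main() has returned, but the process keeps running...")
    // In YARN cluster mode the ApplicationMaster eventually calls System.exit
    // anyway; on K8s nothing does, so the driver pod stays up.
  }
}
```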

Does this PR introduce any user-facing change?

no

How was this patch tested?

UT

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions bot added the CORE label on Mar 22, 2024
@littlelittlewhite09 changed the title from "fix driver pod stuck when driver on k8s" to "[SPARK-47488][k8s] fix driver pod stuck when driver on k8s" on Mar 22, 2024
@dongjoon-hyun (Member) left a comment:

May I ask why this PR was re-created? Did you find some issues with wildfly-openssl there?

@littlelittlewhite09 (Author):

@dongjoon-hyun Yes, this PR was re-created. wildfly-openssl is OK. I made an incorrect commit, which left the code in a mess, so I decided to re-create this PR.

@dongjoon-hyun (Member) left a review comment on this diff excerpt:

```scala
if (SparkContext.getActive.isEmpty) {
  if (!DriverPodIsNormal) {
    logError(s"Driver Pod will exit because: $driverThrow")
    System.exit(EXIT_FAILURE)
  }
}
```
Member:

Could you provide the YARN code link for this case? Specifically:

  1. Does the YARN job fail with EXIT_FAILURE?
  2. If this PR is for consistency between YARN and K8s, we should have the same exit code and the same error message.

Author:

Here is the YARN exit code:
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
It seems ApplicationMaster adopts its own failure exit codes, which are not suitable for the driver pod (see the sketch below).
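For reference, a rough recollection of the exit-code constants ApplicationMaster.scala defines; the names and values below are recalled from memory, not copied from the linked file, so treat them as unverified.

```scala
// Recalled, approximate excerpt from ApplicationMaster.scala -- verify against
// the linked file before relying on any name or value.
private val EXIT_SUCCESS = 0
private val EXIT_UNCAUGHT_EXCEPTION = 10
private val EXIT_MAX_EXECUTOR_FAILURES = 11
private val EXIT_EXCEPTION_USER_CLASS = 15
```

These codes are specific to the YARN ApplicationMaster, which is why they do not map cleanly onto a K8s driver pod.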

Member:

What is the YARN exit code for your case? According to your PR description it's not clear, @littlelittlewhite09 ; it should be described in the PR description.

> In yarn-cluster mode everything is fine: the driver terminates normally even if it encounters an exception or error.

(Diff excerpt under review; it appears to mix the pre-existing `throw findCause(t)` with this PR's added lines:)

```scala
try {
  app.start(childArgs.toArray, sparkConf)
} catch {
  case t: Throwable =>
    throw findCause(t)
    logWarning("Some ERR/Exception happened when app is running.")
    if (args.master.startsWith("k8s")) {
```
Member:

Did you want to use DriverPodIsNormal here?

Author:

Yes, using DriverPodIsNormal here may be better.
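A sketch of what that could look like; `DriverPodIsNormal` is the flag from this PR's diff, the surrounding names come from the excerpt above, and the body is an assumption rather than the committed change (it also reorders the original `throw findCause(t)` so the K8s branch is reachable):

```scala
case t: Throwable =>
  logWarning("Some ERR/Exception happened when app is running.")
  // Reuse the same flag checked on the SparkContext shutdown path, so both
  // exit decisions stay consistent.
  if (args.master.startsWith("k8s") && !DriverPodIsNormal) {
    logError(s"Driver Pod will exit because: ${findCause(t)}")
    System.exit(EXIT_FAILURE)
  }
  throw findCause(t)
```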

Member:

Then, please use it.

@littlelittlewhite09 (Author):

> We have unresolved review comments from the previous PR.
>
>   1. [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down #45613 (review)
>   2. [SPARK-47488][core][k8s] Driver stuck when thread pool is not shut down #45613 (comment)

  1. I tried to follow this doc to test the Kubernetes integration tests, but the driver pod was never launched. I do not know how to fix it, because I am not familiar with K8s. Meanwhile, I tried to use the workflow to test, and it seems the K8s tests passed when my master branch was merged into this PR. Here is the result: https://github.com/littlelittlewhite09/spark/actions/runs/8403080524/job/23013202288
  2. I am sorry, but I am not familiar with Spark on standalone, and we do not use standalone as our resource manager.

@dongjoon-hyun (Member):

Thank you for sharing, @littlelittlewhite09 .

@dongjoon-hyun (Member):

If this is YARN's esoteric feature, we had better not do this, @littlelittlewhite09, because Standalone and K8s are consistent with each other.

> I am sorry, but I am not familiar with Spark on standalone, and we do not use standalone as our resource manager.

@dongjoon-hyun (Member) left a comment:

Sorry, but let's put this PR on hold for a bit to make sure, because we don't have any test coverage (in this PR, or even a YARN test case?) and it's not clear whether this is consistent behavior across Spark's resource managers, @littlelittlewhite09 . When I have some time, I can revisit it from that perspective.

Or you can ping someone else while I'm away. Thank you!

@dongjoon-hyun dismissed their stale review on March 24, 2024 at 00:29: "Stale review."
