
[SPARK-39541][YARN] Diagnostics in the YARN UI do not display the driver's exception when the driver exits before registerAM #36952

Closed · wants to merge 6 commits

Conversation

@lyy-pineapple (Contributor) commented Jun 22, 2022

What changes were proposed in this pull request?

If a job is submitted in YARN cluster mode and the driver exits before registerAM, the Diagnostics field of the YARN UI will show the actual failure reason instead of only "exitCode: 13".

Why are the changes needed?

If a job is submitted in YARN cluster mode and the driver exits before registerAM, the Diagnostics field of the YARN UI does not show the exception thrown by the driver. The YARN UI only shows:
Application application_xxx failed 1 times (global limit =10; local limit is =1) due to AM Container for appattempt_xxx_000001 exited with exitCode: 13

We must find the real reason in the Spark log.

The issue is that when the driver throws an exception before registerAM, it never calls unregisterAM, so the YARN UI cannot show the real reason. This happens, for example, when the user never initializes the SparkContext or when the user's class cannot be found. A simplified sketch of this path is shown below.
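A minimal, illustrative sketch (assuming Hadoop's AMRMClient API; the object and method names here are made up for illustration and this is not Spark's actual code) of why the message is lost:

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

object DiagnosticsPathSketch {
  // Diagnostics reach the ResourceManager (and the YARN UI) only through
  // unregisterApplicationMaster. If the driver dies before registerAM, the
  // guarded call below never runs, so the RM records only the bare exit code.
  def reportFinalStatus(
      amClient: AMRMClient[ContainerRequest],
      registered: Boolean,
      diagnostics: String,
      historyAddress: String): Unit = {
    if (registered) {
      amClient.unregisterApplicationMaster(
        FinalApplicationStatus.FAILED, diagnostics, historyAddress)
    }
    // else: nothing is sent; the YARN UI shows "exited with exitCode: 13" only.
  }
}
```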

With this change, the diagnostics will show the real reason when this happens.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Manual tests (a hypothetical minimal driver for the failure scenarios is sketched after this list)
  1. User jar packaging problem, throwing ClassNotFoundException:
     the diagnostics show the exception thrown by the driver.

  2. Exception during operation:
     the diagnostics show the exception thrown by the driver.

  3. Properly running jobs.

  4. The user program ended directly, so the SparkContext was never initialized.

  5. The thread sleeps for 600000 ms before initializing the SparkContext:
     the diagnostics show the exception thrown by the driver.

  6. The driver throws an exception across multiple AM retries.
     Test for this problem: [SPARK-21541][YARN]: Spark Logs show incorrect job status for a job that does not create SparkContext #18741
     Only the last attempt calls unregisterAM and deletes the Spark staging directory.

  7. Interrupt exit in client mode.
     Test for this problem: #18788
     The YARN UI shows SUCCEEDED, and the Spark log shows that it calls unregisterAM and deletes the Spark staging directory.

And for normal jobs, the change makes no difference.
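For reference, a hypothetical minimal driver (class and exception names are made up for illustration) that reproduces the failure scenarios above: it throws before any SparkContext is created, so when submitted in cluster mode the AM exits with EXIT_SC_NOT_INITED (13) without ever having registered.

```scala
// Hypothetical reproduction program; names are illustrative only.
object FailBeforeSparkContext {
  def main(args: Array[String]): Unit = {
    // Fail before any Spark initialization (e.g. a missing dependency), so the
    // AM never reaches registerAM and, without this patch, the YARN UI shows
    // only exitCode 13 rather than this exception.
    throw new ClassNotFoundException("com.example.MissingDependency")
  }
}
```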

@github-actions github-actions bot added the YARN label Jun 22, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@lyy-pineapple
Contributor Author

@tgravescs Can you review this patch? Thanks.

if (registered) {
amClient.unregisterApplicationMaster(status, diagnostics, uiHistoryAddress)
def unregister(status: FinalApplicationStatus,
conf: YarnConfiguration,
Contributor

indentation should be 4 spaces

Contributor Author

done

val inShutdown = ShutdownHookManager.inShutdown()
if (registered || !isClusterMode) {
  exitCode = code
  finalStatus = status
} else {
  finalStatus = FinalApplicationStatus.FAILED
  exitCode = ApplicationMaster.EXIT_SC_NOT_INITED
  if(msg == null) {
Contributor

should have space after if

Contributor Author

My fault. Fixed it.

val inShutdown = ShutdownHookManager.inShutdown()
if (registered || !isClusterMode) {
  exitCode = code
  finalStatus = status
} else {
  finalStatus = FinalApplicationStatus.FAILED
  exitCode = ApplicationMaster.EXIT_SC_NOT_INITED
  if(msg == null) {
    errorMsg = "User did not initialize spark context!"
Contributor

How do you know the user didn't initialize the Spark context vs. something bad happening and it failing to initialize? This could be more confusing. Maybe "Failed to initialize Spark context".

Contributor Author

You are right. I have changed it.

    conf: YarnConfiguration,
    diagnostics: String = ""): Unit = synchronized {
  if (!registered) {
    amClient = AMRMClient.createAMRMClient()
Contributor

This is very odd, registering during unregister. What if it was the register or the start that failed?
If it timed out and you have a very high timeout, I assume you would wait that long again.
I'm definitely a bit hesitant about this change.

Contributor Author

Should a timeout control be added?

Contributor

The question for me is whether this is really needed vs. possibly introducing other issues. While it may be convenient, the user can just go look in the log to find the error. You tested a bunch of scenarios, which is great, but I'm sure there are others. The main one I mention here: what if registration failed or the RM is not responsive? Can you test out that scenario and see what happens?
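For context, a minimal self-contained sketch of the register-during-unregister idea debated in this thread. The class name, field layout, and host/port/tracking-URL values are placeholders rather than the actual patch, and it carries the concern raised above that a late registerApplicationMaster can itself fail or hang.

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

class LateRegisteringRMClient {
  private var amClient: AMRMClient[ContainerRequest] = _
  private var registered = false
  private var uiHistoryAddress: String = ""

  def unregister(
      status: FinalApplicationStatus,
      conf: YarnConfiguration,
      diagnostics: String = ""): Unit = synchronized {
    if (!registered) {
      // The driver never registered: register a bare AM here purely so that the
      // following unregister call can carry the driver's exception to the RM.
      amClient = AMRMClient.createAMRMClient()
      amClient.init(conf)
      amClient.start()
      // Placeholder host/port/tracking URL for this sketch.
      amClient.registerApplicationMaster("localhost", 0, "")
      registered = true
    }
    // Diagnostics passed here are what the YARN UI displays.
    amClient.unregisterApplicationMaster(status, diagnostics, uiHistoryAddress)
  }
}
```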

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 16, 2022
@github-actions github-actions bot closed this Oct 17, 2022