Skip to content

Comments

[SPARK-46920][YARN] Improve executor exit error message on YARN#44951

Closed
pan3793 wants to merge 4 commits intoapache:masterfrom
pan3793:SPARK-46920
Closed

[SPARK-46920][YARN] Improve executor exit error message on YARN#44951
pan3793 wants to merge 4 commits intoapache:masterfrom
pan3793:SPARK-46920

Conversation

@pan3793
Copy link
Member

@pan3793 pan3793 commented Jan 30, 2024

What changes were proposed in this pull request?

Improve executor exit error message on YARN, with additional explanation of exit code defined by Spark.

Why are the changes needed?

Spark defines its own exit codes, which have overlap with exit codes defined by YARN, thus diagnostics reported by YARN may be misleading. For example, exit code 56 is defined as HEARTBEAT_FAILURE in Spark, but INVALID_DOCKER_IMAGE_NAME in Hadoop, thus the error message displayed in UI is misleading.

image

Does this PR introduce any user-facing change?

Yes, the UI displays more information when the executor runs on YARN exits without zero code.

How was this patch tested?

Because HEARTBEAT_FAILURE depends on the network and Driver's load, to simplify the test, I just use select java_method('java.lang.System', 'exit', 56) to simulate the above case.

image

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the YARN label Jan 30, 2024
@github-actions github-actions bot added the CORE label Jan 30, 2024
@pan3793
Copy link
Member Author

pan3793 commented Jan 30, 2024

cc @yaooqinn @srowen please take a look when you have time.

case OOM => "OutOfMemoryError"
case UNCAUGHT_EXCEPTION => "Uncaught exception."
case UNCAUGHT_EXCEPTION_TWICE => "Uncaught exception, and logging the exception failed."
case OOM => "OutOfMemoryError."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total nit but do we need these changes? they're not sentences.
Otherwise fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for review, removed.

@mridulm
Copy link
Contributor

mridulm commented Jan 30, 2024

+CC @tgravescs

@pan3793
Copy link
Member Author

pan3793 commented Feb 5, 2024

kindly ping @tgravescs and @yaooqinn, would you please take a look?

case UNCAUGHT_EXCEPTION => "Uncaught exception"
case UNCAUGHT_EXCEPTION_TWICE => "Uncaught exception, and logging the exception failed"
case UNCAUGHT_EXCEPTION => "Uncaught exception."
case UNCAUGHT_EXCEPTION_TWICE => "Uncaught exception, and logging the exception failed."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still trivial but these are not sentences. Here and below there is no reason to end with a period.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emm... So you mean we should remove all tailing dots in this function, and do appending on the caller side if necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think they're needed at all. Separate stuff with spaces. But it doesn't matter

Copy link
Member Author

@pan3793 pan3793 Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let me revert changes in this function, just keep it as-is.

@github-actions github-actions bot removed the CORE label Feb 5, 2024
@pan3793
Copy link
Member Author

pan3793 commented Mar 19, 2024

kindly ping @tgravescs

@tgravescs
Copy link
Contributor

its a little unclear to me exactly what the user changes. The screen shots in the description don't make sense with the code changes made. From what I can tell you are adding in another field:

s"Possible causes: $sparkExitCodeReason "

which is the ExecutorExitCode.explainExitCode(exitStatus). This seems fine to me.
But the screen shots above look like output is quite different with the shell error output, are those screen shots correct for a before and after this change? can you please explain if I'm missing something or how this is changing a 56 error code from invalid docker to unable to send heartbeat

@pan3793
Copy link
Member Author

pan3793 commented Mar 20, 2024

@tgravescs Thanks for checking, I replaced the verified result image.

the inconsistent diagnostics message is caused by yarn.nodemanager.container-executor.class, previous test uses org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor while our production Hadoop cluster uses org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor, thus the diagnostics message is quite different.

@tgravescs
Copy link
Contributor

ok, changes seem fine to me

@LuciferYang
Copy link
Contributor

Merged into master for Spark 4.0. Thanks @pan3793 @srowen @tgravescs @mridulm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants