New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YARN-11011. Make YARN Router throw Exception to client clearly. #6211
Conversation
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@goiri Can you help review this PR? Thank you very much! |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@@ -859,9 +869,8 @@ <R> Collection<R> invokeConcurrent(ClientMethod request, Class<R> clazz) | |||
results.put(subClusterId, clazz.cast(result)); | |||
} catch (InterruptedException | ExecutionException e) { | |||
Throwable cause = e.getCause(); | |||
LOG.error("Cannot execute {} on {} : {}", request.getMethodName(), | |||
subClusterId.getId(), cause.getMessage()); | |||
exceptions.put(subClusterId, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think is good to keep the map even though we just output the values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your help in reviewing the code! I will improve this part of the code.
@goiri Can you help review this PR again? Thank you very much! |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@goiri Can you help review this PR again? Thank you very much! |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@@ -125,8 +125,8 @@ public static void logAndThrowException(Throwable t, String errMsgFormat, Object | |||
public static void logAndThrowException(String errMsg, Throwable t) | |||
throws YarnException { | |||
if (t != null) { | |||
LOG.error(errMsg, t); | |||
throw new YarnException(errMsg, t); | |||
LOG.error(errMsg + "" + t.getMessage(), t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your help in reviewing the code! I will improve this part of the code.
cause = cause.getCause(); | ||
} | ||
String errMsg = (cause.getMessage() != null) ? cause.getMessage() : "UNKNOWN"; | ||
return Pair.of(subClusterId, new YarnException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract the exception for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will improve this code.
@@ -862,8 +873,11 @@ <R> Collection<R> invokeConcurrent(ClientMethod request, Class<R> clazz) | |||
Pair<SubClusterId, Object> pair = future.get(); | |||
subClusterId = pair.getKey(); | |||
Object result = pair.getValue(); | |||
if(result instanceof YarnException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
@@ -35,8 +35,7 @@ | |||
import org.apache.hadoop.test.GenericTestUtils; | |||
import org.apache.hadoop.thirdparty.com.google.common.collect.ImmutableSet; | |||
import org.apache.hadoop.yarn.api.ApplicationClientProtocol; | |||
import org.apache.hadoop.yarn.api.protocolrecords.SubmitApplicationRequest; | |||
import org.apache.hadoop.yarn.api.protocolrecords.SubmitApplicationResponse; | |||
import org.apache.hadoop.yarn.api.protocolrecords.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
🎊 +1 overall
This message was automatically generated. |
@goiri Can you help review this PR again? Thank you very much! |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@goiri Thank you very much for your help in reviewing the code! |
…he#6211) Contributed by Shilun Fan. Reviewed-by: Inigo Goiri <inigoiri@apache.org> Signed-off-by: Shilun Fan <slfan1989@apache.org>
Description of PR
JIRA: YARN-11011. Make YARN Router throw Exception to client clearly.
While using YARN Federation, users have provided feedback that the FederationClientInterceptor does not provide sufficient feedback to the client. For instance, it may return messages like "No active SubCluster available to submit the request," or it may not provide clear error messages, requiring users to check Router logs to understand the issue.
Upon careful review of the code, there are two categories of issues:
In cases such as
submitApplication
orgetNewApplication
, the lack of feedback is often due to inaccurate user configuration of the maximum retry count. For example, if there are 2 sub-clusters, the configured maximum retry count should be set to 1. If users use the default value of 3, it can lead to exceptions like "No active SubCluster available to submit the request." To address this, I will add an explanation of the maximum retry count configuration in YARN-11594 to help users better configure this parameter.In the case of
getClusterMetrics
, which involves merging results from multiple clusters, I have optimized theinvokeConcurrent
method. Now, when exceptions are thrown, the relevant error information is directly propagated to the client.These optimizations aim to improve the feedback provided to users, making error messages more informative and easier to understand.
How was this patch tested?
Add Junit Test.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?