[SPARK-4117] [YARN] Spark on Yarn handle AM being told command from RM #10129
When the RM throws ApplicationAttemptNotFoundException for an allocate invocation, make the ApplicationMaster finish immediately without any retries.
@@ -370,6 +371,12 @@ private[spark] class ApplicationMaster(
            failureCount = 0
          } catch {
            case i: InterruptedException =>
            case a: ApplicationAttemptNotFoundException => {
              val message = "ApplicationAttemptNotFoundException was thrown from Reporter thread.";
The trailing ; is not needed in Scala, and {...} is not necessary for this code block.
Jenkins, test this please
Test build #47140 has finished for PR 10129 at commit
The compilation failed on Hadoop 2.3 because it looks like ApplicationAttemptNotFoundException was introduced in Hadoop 2.4. We need to support versions back to Hadoop 2.2.
Thanks @tgravescs for the details. I missed this before creating the PR. I am considering these ways to support Apache Hadoop versions <2.4 as well as >=2.4.
This code can be changed to refer to the ApplicationAttemptNotFoundException class directly once we drop support for Hadoop versions <2.4. Please provide your suggestions.
I'm not overly concerned with Hadoop versions < 2.4 since they changed the API, so I say we just leave that unhandled until someone specifically requests it. So I think just changing the ApplicationAttemptNotFoundException handling to use reflection to see if it's there is good.
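As a rough illustration of the reflection idea suggested above, here is a minimal sketch that checks whether a class is on the classpath without referring to it at compile time. The object and helper names (`AttemptNotFoundCheck`, `loadClassIfPresent`, `isInstanceOfNamed`) are hypothetical and are not part of Spark or of this patch; only the fully qualified exception name comes from the PR.

```scala
import scala.util.Try

object AttemptNotFoundCheck {
  // Class name as used in this PR; per the thread it only exists in Hadoop 2.4+.
  val AttemptNotFoundClassName: String =
    "org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException"

  // Hypothetical helper: load a class by name, returning None when it is
  // not on the classpath (Class.forName throws ClassNotFoundException).
  def loadClassIfPresent(name: String): Option[Class[_]] =
    Try(Class.forName(name)).toOption

  // Hypothetical helper: true when `e` is an instance of the named class
  // (or a subclass of it); false when the class cannot be loaded at all.
  def isInstanceOfNamed(e: Throwable, name: String): Boolean =
    loadClassIfPresent(name).exists(_.isInstance(e))
}
```

With this shape, code compiled against Hadoop 2.2 or 2.3 would still build, and `isInstanceOfNamed(e, AttemptNotFoundCheck.AttemptNotFoundClassName)` would simply return false when the class is absent.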
6ff3840 to 636fd78
@tgravescs I have made the changes; please have a look.
338c4b2 to 636fd78
Handling it as part of the Throwable case.
@@ -372,7 +372,14 @@ private[spark] class ApplicationMaster(
            case i: InterruptedException =>
            case e: Throwable => {
              failureCount += 1
              if (!NonFatal(e) || failureCount >= reporterMaxFailures) {
              if ("org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException".equals(
                e.getClass().getName())) {
you can just do == here, this is Scala
also please add a comment to explain why we need to put this under this case, i.e. this exception was introduced in hadoop 2.x and this code would not compile otherwise
Thanks @andrewor14 for the review and comments. I have addressed them; can you have a look?
@andrewor14 sorry I hadn't gotten back to this. Yes, if it's fatal or if we have reached the max retries, we should exit immediately. That is still handled by the else if. Are you suggesting we just switch the order, keeping the !NonFatal check as the first if as it was and putting this in the else?
Oh I see, though in general we shouldn't even bother catching fatal errors; right now the fail message is a little strange. We can fix that separately. This patch looks OK to me. Thanks for addressing the comments quickly @devaraj-kavali
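The ordering of checks discussed in this thread can be sketched in isolation as follows. This is only an illustrative model, not the merged ApplicationMaster code: the `ReporterFailurePolicy` object, the `Action` type, and the `hasClassName` and `decide` helpers are hypothetical names, while the class-name comparison with `==`, the `!NonFatal(e)` check, and the `failureCount >= reporterMaxFailures` check are taken from the diff above.

```scala
import scala.util.control.NonFatal

object ReporterFailurePolicy {
  // Hypothetical outcome type used only for this sketch.
  sealed trait Action
  case object FinishImmediately extends Action   // attempt not found: no retries
  case object FinishAfterFailure extends Action  // fatal error or max failures reached
  case object RetryLater extends Action          // log a warning and keep going

  val AttemptNotFoundClassName: String =
    "org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException"

  // In Scala, == on strings is a null-safe call to equals, so no .equals is needed.
  def hasClassName(e: Throwable, name: String): Boolean =
    name == e.getClass.getName

  def decide(e: Throwable, failureCount: Int, reporterMaxFailures: Int): Action =
    // The exception is matched by name because, per the thread, it was introduced
    // in Hadoop 2.4 and a direct reference would not compile against older versions.
    if (hasClassName(e, AttemptNotFoundClassName)) FinishImmediately
    else if (!NonFatal(e) || failureCount >= reporterMaxFailures) FinishAfterFailure
    else RetryLater
}
```

Checking the class name first means an attempt-not-found failure ends the application master on its first occurrence, while every other exception still goes through the existing fatal-or-max-failures test.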
Merging into master, thanks @devaraj-kavali.
Spark on Yarn handle AM being told command from RM
When the RM throws ApplicationAttemptNotFoundException for an allocate
invocation, make the ApplicationMaster finish immediately without any
retries.