
[SPARK-4117] [YARN] Spark on Yarn handle AM being told command from RM #10129

Closed
wants to merge 3 commits

Conversation

devaraj-kavali

Spark on Yarn handle AM being told command from RM

When the RM throws ApplicationAttemptNotFoundException for an allocate
invocation, make the ApplicationMaster finish immediately without any
retries.

@@ -370,6 +371,12 @@ private[spark] class ApplicationMaster(
           failureCount = 0
         } catch {
           case i: InterruptedException =>
+          case a: ApplicationAttemptNotFoundException => {
+            val message = "ApplicationAttemptNotFoundException was thrown from Reporter thread.";
Contributor
The `;` is not needed in Scala; also, the `{...}` is not necessary for this code block.

@tgravescs
Contributor

Jenkins, test this please

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47140 has finished for PR 10129 at commit 636fd78.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

The compilation failed on Hadoop 2.3 because it looks like ApplicationAttemptNotFoundException was introduced in Hadoop 2.4. We need to support back to Hadoop 2.2.

@devaraj-kavali
Author

Thanks @tgravescs for the details. I missed it before creating the PR.

I am considering the following approaches for supporting Apache Hadoop versions < 2.4 as well as >= 2.4.

  1. For supporting < 2.4 versions, we can provide an implementation in ApplicationMaster for handling the AM_RESYNC and AM_SHUTDOWN commands. This handling can be removed once these deprecated commands are removed from later versions of Apache Hadoop.

  2. For supporting >= 2.4 versions (i.e. to avoid unnecessary retries when the RM throws ApplicationAttemptNotFoundException), we can check the exception by name, e.g. "ApplicationAttemptNotFoundException".equals(e.getClass().getSimpleName()), without referring to the ApplicationAttemptNotFoundException class directly in ApplicationMaster.scala.

     case e: Throwable =>
          if ("ApplicationAttemptNotFoundException".equals(e.getClass().getSimpleName())) {
            val message = "ApplicationAttemptNotFoundException was thrown from Reporter thread."
            logError(message, e)
            finish(FinalApplicationStatus.FAILED, ApplicationMaster.EXIT_REPORTER_FAILURE,
              message)
          }
          failureCount += 1
          if (!NonFatal(e) || failureCount >= reporterMaxFailures) {
          .......
    

And this code can be changed to refer ApplicationAttemptNotFoundException class directly when we withdraw the support for <2.4 Hadoop versions.

Please provide your suggestions.

@tgravescs
Contributor

I'm not overly concerned with Hadoop < 2.4 versions since they changed the API, so I say we just leave that unhandled until someone specifically requests it.

So I think just changing the ApplicationAttemptNotFoundException check to use reflection to see if it's there is good.
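A minimal sketch of the class-name check being discussed, with illustrative object and method names that are not from the PR itself: matching the exception by its fully qualified name means the source never references the Hadoop 2.4+ class at compile time, so builds against Hadoop 2.2/2.3 still succeed.

```scala
// Hypothetical sketch: recognize ApplicationAttemptNotFoundException without a
// compile-time reference to the class, so builds against Hadoop < 2.4 (which
// lacks the class) still succeed. Object and method names are illustrative.
object AttemptNotFoundCheck {
  private val AttemptNotFoundClass =
    "org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException"

  // getClass.getName returns the fully qualified class name, so compare
  // against the fully qualified string, not the simple name.
  def isAttemptNotFound(e: Throwable): Boolean =
    e.getClass.getName == AttemptNotFoundClass

  def main(args: Array[String]): Unit = {
    // An ordinary exception does not match the YARN class name.
    println(isAttemptNotFound(new RuntimeException("boom"))) // prints "false"
  }
}
```

The trade-off is that a typo in the string match fails silently rather than at compile time, which is why the PR later plans to switch back to a direct class reference once support for < 2.4 is dropped.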

@devaraj-kavali
Author

@tgravescs I have made the changes, please have a look.

@@ -372,7 +372,14 @@ private[spark] class ApplicationMaster(
           case i: InterruptedException =>
           case e: Throwable => {
             failureCount += 1
-            if (!NonFatal(e) || failureCount >= reporterMaxFailures) {
+            if ("org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException".equals(
+              e.getClass().getName())) {
Contributor

You can just do == here, this is Scala.
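For context on this review comment, a small illustrative sketch: in Scala, `==` on references delegates to `equals` and is null-safe, so the Java-style `"...".equals(...)` defensive ordering is unnecessary.

```scala
// Illustrative only: Scala's == calls equals (value comparison) and handles
// null operands safely, unlike Java's reference ==.
object ScalaEqDemo {
  def main(args: Array[String]): Unit = {
    // Distinct String objects with equal contents compare equal under ==.
    val name: String = new String("ApplicationAttemptNotFoundException")
    println(name == "ApplicationAttemptNotFoundException") // prints "true"

    // A null operand yields false instead of a NullPointerException.
    val missing: String = null
    println(missing == "anything") // prints "false"
  }
}
```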

Contributor

Also, please add a comment to explain why we need to put this under this case, i.e. this exception was introduced in Hadoop 2.x and this code would not compile otherwise.

@devaraj-kavali
Author

Thanks @andrewor14 for the review and comments. I have updated them; can you have a look?

@tgravescs
Contributor

@andrewor14 sorry I hadn't gotten back to this. Yes, if it's fatal we should exit immediately, or if we reached the max retries. That is still handled by the else if.

Are you suggesting just switching the order: have the first if be the !NonFatal check as it was, and put this in the else?

@andrewor14
Contributor

sorry I hadn't gotten back to this. Yes, if it's fatal we should exit immediately, or if we reached the max retries. That is still handled by the else if.

Oh I see, though in general we shouldn't even bother catching fatal errors; right now the fail message is a little strange. We can fix that separately.

This patch looks OK to me. Thanks for addressing the comments quickly @devaraj-kavali

@andrewor14
Contributor

Merging into master, thanks @devaraj-kavali.
