
SPARK-6735: [YARN] Add properties to disable the maximum executor failures check or make it relative to a duration #5449

Closed

Conversation

twinkle-sachdeva

For long-running applications, users might want to disable this check or make it relative to a duration window, so that older failures do not cause the application to abort late in the run.

Added properties:

1. spark.yarn.max.executor.failures.disable: disables the maximum executor failure check
2. spark.yarn.max.executor.failures.relative: makes the maximum executor failure count relative to a window
3. spark.yarn.max.executor.failures.relative.window: specifies the relative window duration in seconds, default being 600

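The relative (windowed) check proposed above could be teased out into a small, self-contained helper along these lines. This is a hedged sketch only: the class name `WindowedFailureTracker` and its API are illustrative and not part of the actual patch.

```scala
import scala.collection.mutable

// Illustrative sketch of a sliding-window failure check; names are
// hypothetical, not taken from the patch under review.
class WindowedFailureTracker(maxFailures: Int, windowMs: Long) {
  private val failureTimes = mutable.Queue[Long]()

  def recordFailure(nowMs: Long): Unit = {
    failureTimes.enqueue(nowMs)
    // Forget failures that have fallen out of the sliding window.
    while (failureTimes.nonEmpty && nowMs - failureTimes.head > windowMs) {
      failureTimes.dequeue()
    }
  }

  // A maxFailures of -1 disables the check outright; otherwise only
  // failures still inside the window count toward the limit.
  def shouldAbort: Boolean =
    maxFailures != -1 && failureTimes.size > maxFailures
}
```

With the proposed 600-second default window, old executor failures age out of the queue and no longer push a long-running application toward abort. Isolating the pruning logic in one class also makes the window behaviour unit-testable.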
@steveloughran
Contributor

  1. If the window logic were teased out, it could be testable.
  2. FWIW, in Slider we track node failures too, to try to see if recurrent failures are due to unreliable nodes. It's very hard to get the logic right here (on a small/busy cluster you may always get the same node). Again, long-lived apps need to window node unreliability.

@@ -59,6 +59,10 @@ private[spark] class ApplicationMaster(
private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
sparkConf.getInt("spark.yarn.max.worker.failures", math.max(args.numExecutors * 2, 3)))

// Disable the maximum executor failure check
private val disableMaxExecutorFailureCheck =
sparkConf.getBoolean("spark.yarn.max.executor.failures.disable", false)
Member

There's a cost and weight to making a flag for everything, and I think this doesn't add value. Just set max to a high value to "disable" it.

Contributor

I agree. We could make a special value mean disabled, like 0 or -1.

Author

Looks good; I will treat maxNumExecutorFailures = -1 as the disabled state for this check.

@twinkle-sachdeva
Author

Hi @srowen ,

Please review the changes.

Thanks,

@twinkle-sachdeva
Author

Hi,

Can somebody please review the change?

Regards,
Twinkle


@@ -59,6 +59,9 @@ private[spark] class ApplicationMaster(
private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
sparkConf.getInt("spark.yarn.max.worker.failures", math.max(args.numExecutors * 2, 3)))

// Disable the maximum executor failure check
private val disableMaxExecutorFailureCheck = if (maxNumExecutorFailures == -1) true else false
Contributor

There's an extra space here. Also, the right side of this equation can just be maxNumExecutorFailures == -1.
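Taken together with the earlier suggestion to treat -1 as a sentinel, the reviewers' points could look roughly like the sketch below. This is illustrative only: a plain `Map[String, String]` stands in for `SparkConf`, and the helper names are hypothetical.

```scala
// Illustrative only: a Map[String, String] stands in for SparkConf here.
// Resolve the limit, falling back through the deprecated key to a default.
def maxNumExecutorFailures(conf: Map[String, String], numExecutors: Int): Int =
  conf.get("spark.yarn.max.executor.failures")
    .orElse(conf.get("spark.yarn.max.worker.failures"))
    .map(_.toInt)
    .getOrElse(math.max(numExecutors * 2, 3))

// The sentinel -1 disables the check. The boolean expression is used
// directly, rather than the redundant `if (cond) true else false`.
def disableMaxExecutorFailureCheck(maxFailures: Int): Boolean =
  maxFailures == -1
```

This avoids both a separate disable flag and the redundant conditional flagged in the review.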

@tgravescs
Contributor

@twinkle-sachdeva do you have time to address Sandy's comments?

@twinkle-sachdeva
Author

Hi Tom,

I will do that in one or two days.

Thanks,
Twinkle


@twinkle-sachdeva
Author

Hi Tom,

It would be good if somebody else could take this up. I am not doing well, so it might get delayed a bit more.

Sorry for the inconvenience caused.

Regards,
Twinkle


@tgravescs
Contributor

Not a problem. Can you close this when you have a chance? I'll post a comment to see if someone can take over. Thanks.

@AmplabJenkins

Can one of the admins verify this patch?

@andrewor14
Contributor

@twinkle-sachdeva can you close this PR?

@asfgit asfgit closed this in 804a012 Sep 4, 2015