[SPARK-20079][Core][yarn] Re-registration of AM hangs Spark cluster in yarn-client mode. #17480
Conversation
Test build #75391 has finished for PR 17480 at commit
Could you please elaborate on the problem you met? That would help us better understand your scenario and the fix.
The ExecutorAllocationManager.reset method is called when the AM re-registers, and it sets the ExecutorAllocationManager.initializing field to true. While this field is true, the driver does not request new executors from the AM. Two cases can set the field back to false.
If the AM is killed and restarted after a stage has already been submitted, neither of those two cases occurs.
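For illustration, here is a minimal sketch (assumed names; not Spark's actual source) of the gate described above: while `initializing` is true, the allocation manager never grows its target, so the driver never asks the restarted AM for executors.

```scala
// Minimal model of the initializing gate discussed in this thread.
def executorsToRequest(
    initializing: Boolean,
    maxNumExecutorsNeeded: Int,
    numExecutorsTarget: Int): Int = {
  if (initializing) {
    0 // target is frozen: no RequestExecutors message is ever sent
  } else {
    math.max(maxNumExecutorsNeeded - numExecutorsTarget, 0)
  }
}

// After reset() during AM re-registration: initializing == true even though
// a stage is already running, so nothing is requested and the stage hangs.
assert(executorsToRequest(initializing = true,
  maxNumExecutorsNeeded = 10, numExecutorsTarget = 0) == 0)
```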
@witgo thanks for your explanation. But AFAIK, if the AM gets restarted, it will honor the initial executor number to launch executors, so after the executors are launched, the stage should be able to execute. Is your initial executor number set to 0?
@jerryshao Yes.
The fix seems reasonable to me. But I'm also wondering whether there is any way to handle this on the scheduler side.
@@ -249,7 +249,9 @@ private[spark] class ExecutorAllocationManager(
    * yarn-client mode when AM re-registers after a failure.
    */
   def reset(): Unit = synchronized {
-    initializing = true
+    if (maxNumExecutorsNeeded() == 0) {
Could you please add some comments about the purpose of this change?
Done.
Test build #75499 has started for PR 17480 at commit
Test build #75508 has finished for PR 17480 at commit
-    initializing = true
+    /**
+     * When some tasks need to be scheduled, resetting the initializing field may cause
+     * it to not be set to false in yarn.
I think this is not a YARN-only issue; the description is not precise.
Also, can you elaborate more? I think this issue only exists in the scenario where the initial executor number is 0 and stages are running.
Currently this method will only be called in yarn-client mode when AM re-registers after a failure.
Also CC @tgravescs @vanzin to help review; they may have more thoughts :).
Test build #75601 has finished for PR 17480 at commit
+     * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+     */
+    if (maxNumExecutorsNeeded() == 0) {
+      initializing = true
This kinda raises the question: is it ever correct to set this to true here?
This method is only called when the YARN client-mode AM is restarted, and at that point I'd expect initialization to have already happened (so I don't see a need to reset the field in any situation).
@jerryshao Can you explain the following comments? I do not understand.
if (initializing) {
  // Do not change our target while we are still initializing,
  // Otherwise the first job may have to ramp up unnecessarily
  0
} else if (maxNeeded < numExecutorsTarget) {
In the original design of dynamic executor allocation, this flag is set to true to avoid a sudden executor ramp-up (caused by the first job submission) during initialization. You can check the comment.
And for the AM restart scenario, because all the executors will be re-spawned, it is similar to the AM first-start scenario (if a job is submitted during the restart), so we set this flag to true.
Sorry but that doesn't really explain much. Why is it bad to ramp up quickly? At which point are things not "initializing" anymore?
Isn't the AM restarting the definition of "I should ramp up quickly because I might be in the middle of a big job being run"?
@vanzin sorry I think I didn't explain well.
If this flag initializing is set to false during initialization, updateAndSyncNumExecutorsTarget will recalculate the required executor number and ramp down the executors if there is no job at that time. Then, when the first job is submitted, it still has to ramp up executors to meet the requirement.
For the AM restart scenario, I think it is similar to initializing. One exception is the scenario mentioned here, where we should ramp up soon to meet the requirement.
One downside could be: while tasks are running, suppose the total number of executors equals spark.dynamicAllocation.maxExecutors and the AM fails. A new AM then restarts. Because the total number of executors tracked in ExecutorAllocationManager has not changed, the driver does not send RequestExecutors to the AM to ask for executors, so the AM's total becomes spark.dynamicAllocation.initialExecutors. The total number of executors in the driver and in the AM is therefore different.
Because when the AM is restarted, it resets its state to the initial state, whereas ExecutorAllocationManager's state is still the current state; the state maintained on the two sides will be out of sync, and the required executor number calculated on the AM side will be wrong.
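To make the mismatch concrete, here is an illustrative walk-through (the config keys are real Spark settings; the values and variable names below are assumptions, not Spark code):

```scala
val maxExecutors = 100    // spark.dynamicAllocation.maxExecutors
val initialExecutors = 0  // spark.dynamicAllocation.initialExecutors

val driverTargetBeforeFailure = maxExecutors  // driver's numExecutorsTarget at failure time
val driverTargetAfterRestart = maxExecutors   // unchanged: nothing recomputed it
val amTargetAfterRestart = initialExecutors   // a fresh AM falls back to its configuration

// RequestExecutors is only sent when the driver-side target changes, so the
// new AM never learns that it should be holding 100 executors.
assert(driverTargetAfterRestart == driverTargetBeforeFailure) // no change, no message
assert(driverTargetAfterRestart != amTargetAfterRestart)      // the two sides disagree
```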
You're just saying that when a new AM registers the driver needs to tell it how many executors it wants. So, basically, instead of the driver doing that, currently the driver just resets itself to the initial state, hurting any running jobs.
> when a new AM registers the driver needs to tell it how many executors it wants.
When an AM registers, it uses its configuration to decide the initial number of executors to create; the driver does not tell it how many executors it wants. That's why, on the driver side, if we don't change the executor number to match the AM side, we hit the problem mentioned above (because the driver hasn't yet told the AM the executor number).
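A sketch of how that AM-side initial executor number is derived from configuration (modeled on Spark's Utils.getDynamicAllocationInitialExecutors; treat the exact precedence as an assumption):

```scala
def dynamicAllocationInitialExecutors(conf: Map[String, String]): Int = {
  def intConf(key: String): Int = conf.get(key).map(_.toInt).getOrElse(0)
  // The largest of the three related settings wins.
  Seq(
    intConf("spark.dynamicAllocation.minExecutors"),
    intConf("spark.dynamicAllocation.initialExecutors"),
    intConf("spark.executor.instances")
  ).max
}

// With everything left at 0 (the scenario in this thread), a restarted AM
// starts no executors on its own.
assert(dynamicAllocationInitialExecutors(
  Map("spark.dynamicAllocation.initialExecutors" -> "0")) == 0)
```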
You're explaining what the code does as a justification for why a hacky fix should be applied to this issue. I'm asking why the code needs to behave like that. If there's no actual need for the code to behave like that, it should be fixed.
Basically, imagine that at t1 the AM dies, and at t2 a new AM comes up and registers. What should happen from the driver's point of view? (Note, what should happen, not what the code does.)
In my view, the answer is "nothing". The driver knows what it needs, so the new AM should start as closely as possible to the state of the previous AM. Doing that might be hard (e.g. caching the complete list of known containers somewhere, probably the driver), so some things are sub-optimal (containers will be re-started). But as far as numbers go, the new AM should basically start up with the same number of containers the previous AM was managing (ignoring the time needed to start them up).
If the AM doesn't do that currently, then why is that? It asks the driver for state related to the previous AM already (see the RetrieveLastAllocatedExecutorId call). Why can't that call return more state needed for the new AM to sync up to what the driver needs?
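A hypothetical sketch of that direction: RetrieveLastAllocatedExecutorId is a real message in Spark's CoarseGrainedClusterMessages, but the richer reply type and its field names below are invented for illustration.

```scala
// What the handshake could return so the new AM syncs to the driver's view
// instead of resetting to configured defaults.
case class DriverAllocationState(
    lastAllocatedExecutorId: Int,
    numExecutorsTarget: Int) // what the new AM should sync its target to

// On re-registration the new AM could then do roughly:
//   val state = driverRef.askSync[DriverAllocationState](RetrieveLastAllocatedExecutorId)
//   targetNumExecutors = state.numExecutorsTarget  // not the configured initial number
```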
Well, I understand your thinking.
This actually comes down to the definition of reset: should it restore the initial state, or the last state before the failure? In our previous commit we chose the former, rolling back to the initial state, but here you suggest the latter. I agree that the latter looks more reasonable and could also address the problem here. Thanks for the clarification.
@witgo are you planning to update this PR to fix the behavior of reset()? The biggest problem I have with this patch is that reading the code does not give you any insight into why initializing is handled this way. So in my view the right path here is to fix reset().
@vanzin
I probably won't have time to look at a proper fix for this anytime soon, but I don't think your current patch is the right fix.
OK, I will do the work over the weekend.
Test build #76076 has started for PR 17480 at commit
@@ -249,7 +249,6 @@ private[spark] class ExecutorAllocationManager(
    * yarn-client mode when AM re-registers after a failure.
    */
   def reset(): Unit = synchronized {
-    initializing = true
@jerryshao @vanzin I think that deleting the initializing = true line is a good idea.
Test build #76089 has finished for PR 17480 at commit
When there is a need for task scheduling, ExecutorAllocationManager instances do not reset the initializing field.

How was this patch tested?

Unit tests.
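As a rough illustration of the behavior being tested, here is a self-contained model (not Spark's ExecutorAllocationManager; all names are assumptions) of the fixed semantics: reset() no longer flips `initializing`, so a stage submitted before an AM restart can still ramp up afterwards.

```scala
final class AllocationModel(
    var initializing: Boolean,
    var numExecutorsTarget: Int,
    initialTarget: Int) {
  def reset(): Unit = synchronized {
    // Per this PR: reset() no longer sets initializing = true.
    numExecutorsTarget = initialTarget
  }
  def executorsToRequest(maxNeeded: Int): Int =
    if (initializing) 0 else math.max(maxNeeded - numExecutorsTarget, 0)
}

val m = new AllocationModel(initializing = false, numExecutorsTarget = 8, initialTarget = 0)
m.reset()
assert(m.executorsToRequest(maxNeeded = 10) == 10) // a running stage can still ramp up
```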