[SPARK-9552] Return "false" while nothing to kill in killExecutors #9796

GraceH · 2015-11-18T07:08:14Z

In the discussion on SPARK-9552, we proposed a force kill in killExecutors. But if there is nothing to kill, it still returns true (acknowledgement). As a result, executors that are not eligible to be killed get added to the pendingToRemove list for further action.

In this patch, we'd like to change the return semantics. If there is nothing to kill, we return "false", and therefore those non-eligible executors won't be added to the pendingToRemove list.
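To make the proposed semantics concrete, here is a minimal toy sketch. It is not Spark's CoarseGrainedSchedulerBackend; `ToyBackend`, `ToyCaller`, and `remove` are hypothetical names used only for illustration. It assumes the caller adds an executor to its pendingToRemove list only when the kill request is acknowledged with true.

```scala
import scala.collection.mutable

// Toy model only; ToyBackend and ToyCaller are hypothetical, not Spark classes.
class ToyBackend(initial: Set[String]) {
  private val alive = mutable.Set(initial.toSeq: _*)

  /** Returns true only if at least one requested executor was actually killed. */
  def killExecutors(ids: Seq[String]): Boolean = {
    // Only executors the backend still knows about are eligible.
    val toKill = ids.filter(alive.contains)
    alive --= toKill
    // Proposed semantics: "nothing to kill" is reported as false.
    toKill.nonEmpty
  }
}

class ToyCaller(backend: ToyBackend) {
  val pendingToRemove = mutable.Set[String]()

  /** Tracks an executor as pending removal only when the backend acknowledges the kill. */
  def remove(id: String): Boolean = {
    val acked = backend.killExecutors(Seq(id))
    if (acked) pendingToRemove += id
    acked
  }
}
```

With this shape, a second request for an executor that is already gone returns false, so the caller does not track it again.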

@vanzin @andrewor14 As a follow-up to PR #7888, please let me know your comments.

andrewor14 · 2015-11-18T08:54:17Z

@GraceH please put `[SPARK-9552]` in JIRA title

SparkQA · 2015-11-18T09:55:29Z

Test build #46186 has finished for PR 9796 at commit 589083b.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2015-11-18T11:28:15Z

Test build #46187 has finished for PR 9796 at commit 0d8e6c1.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-11-18T18:21:22Z

core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

s/should/will

thanks a lot. i will change that.

vanzin · 2015-11-18T18:25:17Z

retest this please

SparkQA · 2015-11-18T21:43:48Z

Test build #46224 has finished for PR 9796 at commit 0d8e6c1.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-11-18T21:45:14Z

ah, pyspark. retest this please

andrewor14 · 2015-11-18T23:15:36Z

core/src/test/scala/org/apache/spark/deploy/StandaloneDynamicAllocationSuite.scala

just do !sc.killExecutor

why is there actually nothing to kill?

Because this executor was already killed in the replacement step:

    assert(executors.size === 2)
    // kill executor 1, and replace it
    assert(sc.killAndReplaceExecutor(executors.head)) // executors.head is killed here with replace = true

SparkQA · 2015-11-19T00:43:22Z

Test build #46253 has finished for PR 9796 at commit 0d8e6c1.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-11-19T18:58:22Z

retest this please

andrewor14 · 2015-11-19T19:00:45Z

core/src/test/scala/org/apache/spark/deploy/StandaloneDynamicAllocationSuite.scala

I think this may be a bigger problem. This is supposed to be true because it's a new executor so we should be able to kill it. I think it's the way this test is set up that makes this false, because the driver doesn't wait for new executors to come up. We might need to do some more mocking here to simulate the new executor coming up.

@andrewor14 executors.head is assigned beforehand. For example, say you have two executors with IDs {27, 28}. The first one (ID 27) is killed with replacement, but I guess the newly created executor cannot have the same ID. If you then try to kill the head executor (ID 27) again, it should return an empty list (since 27 is already in the pendingToRemove list). Am I right?

no, the idea is more like the following:

you start with executors {27, 28}

you kill and replace 27, so you end up with executors {28, 29}

now you want to kill 28, this should succeed (but currently it doesn't in the tests)

If so, we should not kill executors.head (27); it should be executors(1). Am I right?

According to my understanding, the first case tries to kill 27 and the second one kills 28. That is why the first one causes nothing to happen, while the latter actually kills the executor successfully.

By the way, we do not change 'val executors' after the first assignment.

@andrewor14 You can find two test cases there. I guess the second one is the one you want.

let's move the discussion to the main thread

SparkQA · 2015-11-19T22:12:23Z

Test build #46340 has finished for PR 9796 at commit 4a6d06e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-25T03:14:29Z

Test build #46653 has finished for PR 9796 at commit 657849d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-11-25T20:11:43Z

@GraceH continuing our discussion on the "pending replacement" test:

The problem is actually how the test framework is set up. To speed up the tests we don't wait for new executors to register. Actually, this particular test was incorrectly written even before this patch. This is because:

First we start with 2 executors, let's say {27, 28}
Then we kill and replace 1. What we should end up with is {28, 29}, but in this test we don't wait for executor 29 to come up, so we still have {27, 28}.
Then we try to do sc.killExecutor(executors.head), and this kills 27 again, which fails
Instead, we should be killing 28 in that call. We could just do sc.killExecutor(executors(1)) to work around it as you suggest, but this is brittle and confusing to people who aren't familiar with the code.

The right fix would be to add an eventually block in the test to wait until we have 2 executors but with different IDs. We can do this by capturing the executor IDs before and after the call to sc.killAndReplaceExecutor and checking executorIdsBefore != executorIdsAfter. After that we need to fix some of the asserts that follow.
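As a rough illustration of the "wait until the IDs differ" idea, here is a self-contained toy sketch. Everything in it is an assumption made for illustration: `pollUntil` only mimics the role of ScalaTest's `eventually`, and `ToyCluster` stands in for the real test harness (its `executorIds` plays the part of getExecutorIds(sc)). The replacement ID "29" and the 50 ms delay are arbitrary.

```scala
object KillAndReplaceSketch {
  // Hypothetical helper: re-check `cond` until it holds or the timeout expires.
  def pollUntil(timeoutMs: Long, intervalMs: Long)(cond: => Boolean): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var ok = cond
    while (!ok && System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs)
      ok = cond
    }
    ok
  }

  // Hypothetical cluster whose replacement executor registers asynchronously.
  class ToyCluster {
    @volatile private var ids: Seq[String] = Seq("27", "28")

    def executorIds: Seq[String] = ids

    def killAndReplace(id: String): Boolean = {
      if (!ids.contains(id)) return false
      val worker = new Thread(new Runnable {
        def run(): Unit = {
          Thread.sleep(50) // the replacement takes a while to come up
          ids = ids.filterNot(_ == id) :+ "29"
        }
      })
      worker.setDaemon(true)
      worker.start()
      true
    }
  }

  def main(args: Array[String]): Unit = {
    val cluster = new ToyCluster
    val executorIdsBefore = cluster.executorIds
    assert(cluster.killAndReplace(executorIdsBefore.head))
    // Wait until there are 2 executors again, but with a different set of IDs.
    assert(pollUntil(2000, 10) {
      val executorIdsAfter = cluster.executorIds
      executorIdsAfter.size == 2 && executorIdsAfter != executorIdsBefore
    })
    // Later assertions should use the refreshed list, not the stale one.
    println(cluster.executorIds)
  }
}
```

The point of the sketch is that any kill issued after the wait should pick its target from the refreshed ID list rather than from the one captured before the replacement.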

Fixing this test may be fairly involved. Will you have time to look into this? If not, I can take over later after the 1.6 release.

GraceH · 2015-11-26T01:26:56Z

@andrewor14 Yes, you are right. Meanwhile, it seems the original implementation already waits for a while to check that the replacement is there. Following your suggestion, I can add the executor ID comparison here. I have tested it locally. What do you think?

    val executors = getExecutorIds(sc)
    assert(executors.size === 2)
    // kill executor 1, and replace it
    assert(sc.killAndReplaceExecutor(executors.head))
    eventually(timeout(10.seconds), interval(10.millis)) {
      val apps = getApplications()
      assert(apps.head.executors.size === 2)
+     // make sure the old executors head has been killedAndReplaced
+     assert(executors.head != getExecutorIds(sc).head)
    }

GraceH · 2015-11-26T01:41:45Z

I have added the test case GraceH@2e4884c. Please let me know your comments.

andrewor14 · 2015-11-26T01:47:31Z

core/src/test/scala/org/apache/spark/deploy/StandaloneDynamicAllocationSuite.scala

why not just compare the Seq? E.g. val executorIdsAfter = getExecutorIds(sc)

andrewor14 · 2015-11-26T01:48:11Z

Do the tests pass locally?

GraceH · 2015-11-26T01:50:26Z

Yes. The replacement is finished.

SparkQA · 2015-11-26T03:50:19Z

Test build #46726 has finished for PR 9796 at commit 2e4884c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-26T06:18:39Z

Test build #46728 has finished for PR 9796 at commit 154ab31.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

GraceH · 2015-11-26T06:49:20Z

retest this please

SparkQA · 2015-11-26T11:11:26Z

Test build #46752 has finished for PR 9796 at commit 154ab31.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2015-12-11T00:51:17Z

retest this please

SparkQA · 2015-12-11T02:51:51Z

Test build #47557 has finished for PR 9796 at commit 154ab31.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

GraceH · 2015-12-11T04:52:00Z

Thanks @zsxwing. The patch seems to pass all tests.

andrewor14 · 2015-12-16T02:44:45Z

This is not ready for merge yet, please see the unresolved comments thread.


andrewor14 · 2015-12-16T03:41:50Z

@GraceH The "pending replacement" test is still not correct; we never actually updated the executors list after killing and replacing an executor. The test currently passes but it's very brittle. I've opened GraceH#2 as a pull request against your branch as a suggestion on how to fix it.

GraceH · 2015-12-16T04:34:23Z

I left my thoughts under GraceH#2. Thanks.

Fix kill and replace test

SparkQA · 2015-12-18T03:07:08Z

Test build #47965 has finished for PR 9796 at commit fd5f435.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

1. add test case to check if it is false when killing non-existing executor(killed and replaced) 2. add test case to check if we can kill newly appended executor

SparkQA · 2015-12-18T05:29:07Z

Test build #47983 has finished for PR 9796 at commit bf2edd3.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-18T07:25:10Z

Test build #47984 has finished for PR 9796 at commit 0305815.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2015-12-18T21:11:40Z

core/src/test/scala/org/apache/spark/deploy/StandaloneDynamicAllocationSuite.scala

why not just kill newExecutors.head? It should still pass. Right now it's kind of arbitrary why we pick the second one.

andrewor14 · 2015-12-19T00:04:24Z

LGTM thanks @GraceH and @vanzin I'm merging this into master.

GraceH · 2015-12-19T07:26:29Z

@andrewor14 thanks.

GraceH force-pushed the emptyPendingToRemove branch from 589083b to 0d8e6c1 Compare November 18, 2015 07:13

return false is nothing to kill in killExecutors

0d8e6c1

GraceH changed the title ~~Return "false" is nothing to kill in killExecutors~~ Return "false" while nothing to kill in killExecutors Nov 18, 2015

GraceH changed the title ~~Return "false" while nothing to kill in killExecutors~~ [SPARK-9552] Return "false" while nothing to kill in killExecutors Nov 18, 2015

vanzin reviewed Nov 18, 2015

andrewor14 reviewed Nov 18, 2015

addressing the comments

de6e47b

addressing the comments

4a6d06e

andrewor14 reviewed Nov 19, 2015

GraceH force-pushed the emptyPendingToRemove branch from 657849d to 2e4884c Compare November 26, 2015 01:40

check if the old executor head is killed and replaced

2e4884c

andrewor14 reviewed Nov 26, 2015

to compare the seq instead of head only

154ab31

Fix kill and replace test

61f567e

Previously we continued to use the old executors list even after killing and adding a new executor to replace the old one. This commit ensures subsequent asserts refresh the list.

Merge pull request #2 from andrewor14/pr-9796-suggestion

fd5f435

Fix kill and replace test

Add kill test cases for killAndReplace.

bf2edd3

1. add test case to check if it is false when killing non-existing executor(killed and replaced) 2. add test case to check if we can kill newly appended executor

eliminate space in blank line

0305815

andrewor14 reviewed Dec 18, 2015

asfgit closed this in 60da0e1 Dec 19, 2015
