@GraceH
Contributor

@GraceH GraceH commented Nov 18, 2015

In the discussion on SPARK-9552, we proposed a force kill in killExecutors. However, if there is nothing to kill, it still returns true (acknowledgement). As a result, executors that are not eligible to be killed get added to the pendingToRemove list for further action.

In this patch, we'd like to change the return semantics. If there is nothing to kill, killExecutors will return "false", and therefore the non-eligible executors won't be added to the pendingToRemove list.
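
To make the proposed semantics concrete, here is a minimal sketch. It is a toy model, not Spark's implementation: `ToyExecutorManager`, its `alive` set, and the `pending` accessor are hypothetical names introduced only for illustration. The point is that the return value reflects whether anything was actually scheduled for removal.

```scala
import scala.collection.mutable

// Toy model only: class and member names are illustrative, not Spark's code.
final class ToyExecutorManager(initial: Set[String]) {
  private val alive = mutable.Set.empty[String]
  alive ++= initial
  private val pendingToRemove = mutable.Set.empty[String]

  /** Returns true only if at least one executor was actually scheduled for removal. */
  def killExecutors(ids: Seq[String]): Boolean = {
    // Only executors that are known and not already pending removal are eligible.
    val eligible = ids.filter(id => alive.contains(id) && !pendingToRemove.contains(id))
    pendingToRemove ++= eligible
    // Nothing eligible means nothing to kill, so report false.
    eligible.nonEmpty
  }

  /** Snapshot of the executors currently pending removal. */
  def pending: Set[String] = pendingToRemove.toSet
}
```

With this shape, a second request to kill the same executor, or a request for an unknown ID, returns false and leaves the pending set unchanged.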

@vanzin @andrewor14 This is a follow-up to PR#7888; please let me know your comments.

@GraceH GraceH force-pushed the emptyPendingToRemove branch from 589083b to 0d8e6c1 Compare November 18, 2015 07:13
@GraceH GraceH changed the title from 'Return "false" is nothing to kill in killExecutors' to 'Return "false" while nothing to kill in killExecutors' Nov 18, 2015
@andrewor14
Contributor

@GraceH please put `[SPARK-9552]` in the PR title

@SparkQA

SparkQA commented Nov 18, 2015

Test build #46186 has finished for PR 9796 at commit 589083b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@GraceH GraceH changed the title from 'Return "false" while nothing to kill in killExecutors' to '[SPARK-9552] Return "false" while nothing to kill in killExecutors' Nov 18, 2015
@SparkQA

SparkQA commented Nov 18, 2015

Test build #46187 has finished for PR 9796 at commit 0d8e6c1.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

s/should/will

Contributor Author

Thanks a lot. I will change that.

@vanzin
Contributor

vanzin commented Nov 18, 2015

retest this please

@SparkQA

SparkQA commented Nov 18, 2015

Test build #46224 has finished for PR 9796 at commit 0d8e6c1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 18, 2015

ah, pyspark. retest this please

Contributor

just do !sc.killExecutor

Contributor

why is there actually nothing to kill?

Contributor Author

Because this executor was already killed in the replacement step:

    assert(executors.size === 2)
    // kill executor 1, and replace it
    assert(sc.killAndReplaceExecutor(executors.head)) // executors.head is killed here with replace = true

@SparkQA

SparkQA commented Nov 19, 2015

Test build #46253 has finished for PR 9796 at commit 0d8e6c1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

Contributor

I think this may be a bigger problem. This is supposed to be true because it's a new executor so we should be able to kill it. I think it's the way this test is set up that makes this false, because the driver doesn't wait for new executors to come up. We might need to do some more mocking here to simulate the new executor coming up.

Contributor Author

@andrewor14 executors.head is assigned beforehand. For example, say you have two executor IDs, {27, 28}. The first one (ID 27) is killed with replacement, but I guess the newly created executor cannot have the same ID. After that, when you try to kill the head executor (ID 27) again, it should return an empty list (since 27 is already in the pendingToRemove list). Am I right?

Contributor

no, the idea is more like the following:

  • you start with executors {27, 28}
  • you kill and replace 27, so you end up with executors {28, 29}
  • now you want to kill 28, this should succeed (but currently it doesn't in the tests)
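
The scenario above can be sketched with a toy model. `ToyCluster` and `ReplacementScenario` are hypothetical stand-ins, not Spark APIs, and the model assumes the replacement registers immediately, which is exactly what the real test does not wait for.

```scala
// Toy model only: ToyCluster is a hypothetical stand-in, not a Spark class.
final class ToyCluster(firstId: Int, count: Int) {
  private var ids: Vector[Int] = Vector.range(firstId, firstId + count)
  private var nextId: Int = firstId + count

  def executorIds: Vector[Int] = ids

  /** Kills the executor if it is still registered; false means nothing to kill. */
  def killExecutor(id: Int): Boolean =
    if (ids.contains(id)) { ids = ids.filterNot(_ == id); true } else false

  /** Kills the executor and registers a replacement under a fresh ID. */
  def killAndReplaceExecutor(id: Int): Boolean = {
    val killed = killExecutor(id)
    if (killed) { ids = ids :+ nextId; nextId += 1 }
    killed
  }
}

object ReplacementScenario {
  def run(): (Boolean, Vector[Int], Boolean, Boolean) = {
    val cluster = new ToyCluster(27, 2)               // starts with 27 and 28
    val replaced = cluster.killAndReplaceExecutor(27) // 27 goes away, 29 comes up
    val afterReplace = cluster.executorIds            // now 28 and 29
    val killStale = cluster.killExecutor(27)          // stale ID: nothing to kill
    val killLive = cluster.killExecutor(28)           // live ID: should succeed
    (replaced, afterReplace, killStale, killLive)
  }
}
```

Calling `ReplacementScenario.run()` shows the difference between killing a stale ID (false) and killing a currently registered one (true).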

Contributor Author

If so, we should not kill executors.head (27); it should be executors(1). Am I right?

Contributor Author

According to my understanding, the first case tries to kill 27 and the second one tries to kill 28. That is why the first one causes nothing to happen, while the latter actually kills the executor successfully.

By the way, we do not change 'val executors' after the first assignment.

Contributor Author

@andrewor14 you can see there are two test cases. I guess the second one is the one you want.

Contributor

let's move the discussion to the main thread

@SparkQA

SparkQA commented Nov 19, 2015

Test build #46340 has finished for PR 9796 at commit 4a6d06e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2015

Test build #46653 has finished for PR 9796 at commit 657849d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@GraceH continuing our discussion on the "pending replacement" test:

The problem is actually how the test framework is set up. To speed up the tests we don't wait for new executors to register. Actually, this particular test was incorrectly written even before this patch. This is because:

  • First we start with 2 executors, let's say {27, 28}
  • Then we kill and replace one of them (27). What we should end up with is {28, 29}, but in this test we don't wait for executor 29 to come up, so we still have {27, 28}.
  • Then we try to do sc.killExecutor(executors.head), and this kills 27 again, which fails
  • Instead, we should be killing 28 in that call. We could just do sc.killExecutor(executors(1)) to work around it as you suggest, but this is brittle and confusing to people who aren't familiar with the code.

The right fix would be to add an eventually block in the test to wait until we have 2 executors but with different IDs. We can do this by comparing executorIdsBefore != executorIdsAfter before and after the call to sc.killAndReplaceExecutor. Then after that we need to fix some of the asserts that follow.
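
As a rough, dependency-free illustration of that wait-then-compare idea, the sketch below polls until the registered IDs differ from the snapshot taken before the replacement. The `waitUntil` helper, the `AtomicReference` standing in for the cluster state, and the simulated 50 ms registration delay are all assumptions made for this example; the actual test would presumably use ScalaTest's `eventually` and `getExecutorIds(sc)` instead.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

object WaitForReplacement {
  /** Polls `condition` until it holds or `timeoutMs` elapses; returns whether it held. */
  def waitUntil(timeoutMs: Long, intervalMs: Long)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    @tailrec def loop(): Boolean =
      if (condition) true
      else if (System.nanoTime() >= deadline) false
      else { Thread.sleep(intervalMs); loop() }
    loop()
  }

  def demo(): (Boolean, Seq[String], Seq[String]) = {
    // Hypothetical stand-in for the executor IDs the driver currently knows about.
    val registered = new AtomicReference[Seq[String]](Seq("27", "28"))
    val executorIdsBefore = registered.get()

    // Simulate the replacement executor registering a little later.
    val replacer = new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(50); registered.set(Seq("28", "29")) }
    })
    replacer.start()

    // Wait until there are 2 executors again, but with a different set of IDs.
    val changed = waitUntil(timeoutMs = 5000, intervalMs = 10) {
      val now = registered.get()
      now.size == 2 && now != executorIdsBefore
    }
    replacer.join()

    // Refresh the list so later assertions use the new IDs, not the stale ones.
    val executorIdsAfter = registered.get()
    (changed, executorIdsBefore, executorIdsAfter)
  }
}
```

The important design point is the refresh at the end: any kill issued afterwards should pick its target from `executorIdsAfter` rather than from the list captured before the replacement.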

Fixing this test may be fairly involved. Will you have time to look into this? If not, I can take over later after the 1.6 release.

@GraceH
Contributor Author

GraceH commented Nov 26, 2015

@andrewor14 Yes, you are right. Meanwhile, it seems the original implementation already waits for a while to check whether the replacement is there. Following your suggestion, I can add the executor ID comparison here. I have tested it locally. What do you think?

    val executors = getExecutorIds(sc)
    assert(executors.size === 2)
    // kill executor 1, and replace it
    assert(sc.killAndReplaceExecutor(executors.head))
    eventually(timeout(10.seconds), interval(10.millis)) {
      val apps = getApplications()
      assert(apps.head.executors.size === 2)
+     // make sure the old executors head has been killedAndReplaced
+     assert(executors.head != getExecutorIds(sc).head)
    }

@GraceH GraceH force-pushed the emptyPendingToRemove branch from 657849d to 2e4884c Compare November 26, 2015 01:40
@GraceH
Contributor Author

GraceH commented Nov 26, 2015

I have added the test case in GraceH@2e4884c. Please let me know your comments.

Contributor

why not just compare the Seq? E.g. val executorIdsAfter = getExecutorIds(sc)

@andrewor14
Contributor

Do the tests pass locally?

@GraceH
Contributor Author

GraceH commented Nov 26, 2015

Yes. The replacement is finished.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46726 has finished for PR 9796 at commit 2e4884c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46728 has finished for PR 9796 at commit 154ab31.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GraceH
Contributor Author

GraceH commented Nov 26, 2015

retest this please

@SparkQA

SparkQA commented Nov 26, 2015

Test build #46752 has finished for PR 9796 at commit 154ab31.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Dec 11, 2015

retest this please

@SparkQA

SparkQA commented Dec 11, 2015

Test build #47557 has finished for PR 9796 at commit 154ab31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GraceH
Contributor Author

GraceH commented Dec 11, 2015

Thanks @zsxwing. The patch seems to pass all tests.

@andrewor14
Contributor

This is not ready for merge yet, please see the unresolved comments thread.

Previously we continued to use the old executors list even after
killing and adding a new executor to replace the old one. This
commit ensures subsequent asserts refresh the list.
@andrewor14
Contributor

@GraceH The "pending replacement" test is still not correct; we never actually updated the executors list after killing and replacing an executor. The test currently passes but it's very brittle. I've opened GraceH#2 as a pull request against your branch as a suggestion on how to fix it.

@GraceH
Contributor Author

GraceH commented Dec 16, 2015

I left my thoughts under GraceH#2. Thanks.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47965 has finished for PR 9796 at commit fd5f435.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

1. Add a test case to check that killing a non-existent executor (already killed and replaced) returns false
2. Add a test case to check that we can kill a newly added executor
@SparkQA

SparkQA commented Dec 18, 2015

Test build #47983 has finished for PR 9796 at commit bf2edd3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47984 has finished for PR 9796 at commit 0305815.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

why not just kill newExecutors.head? It should still pass. Right now it's kind of arbitrary why we pick the second one.

@andrewor14
Contributor

LGTM, thanks @GraceH and @vanzin. I'm merging this into master.

@asfgit asfgit closed this in 60da0e1 Dec 19, 2015
@GraceH
Contributor Author

GraceH commented Dec 19, 2015

@andrewor14 thanks.
