-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9552] Return "false" while nothing to kill in killExecutors #9796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
589083b to
0d8e6c1
Compare
|
@GraceH please put `[SPARK-9552] in JIRA title |
|
Test build #46186 has finished for PR 9796 at commit
|
|
Test build #46187 has finished for PR 9796 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/should/will
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot. i will change that.
|
retest this please |
|
Test build #46224 has finished for PR 9796 at commit
|
|
ah, pyspark. retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just do !sc.killExecutor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is there actually nothing to kill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because this one is killed in replacement part.
assert(executors.size === 2)
// kill executor 1, and replace it
assert(sc.killAndReplaceExecutor(executors.head)) //executors.head is killed here with replace = true.|
Test build #46253 has finished for PR 9796 at commit
|
|
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be a bigger problem. This is supposed to be true because it's a new executor so we should be able to kill it. I think it's the way this test is set up that makes this false, because the driver doesn't wait for new executors to come up. We might need to do some more mocking here to simulate the new executor coming up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewor14 The executors.head is assigned beforehand. for example, you have two executor ID {27,28}. Then, the first one(id 27) is killed with replacement. But I guess the newly created executor cannot be with the same ID. After that, you try to kill the header executor (id 27), it should return empty list (since 27 has been in the pendingToRemove list). Am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, the idea is more like the following:
- you start with executors {27, 28}
- you kill and replace 27, so you end up with executors {28, 29}
- now you want to kill 28, this should succeed (but currently it doesn't in the tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if so, we should not kill excutors.head(27). it should be excutor(1). am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to my understanding, the 1st case tries to kill 27. the 2nd one is to kill 28. that is why the first one causes nothing to happen. the latter case actually kills the executor successfully.
btw, we donot change the 'val executors' after the first assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewor14 you can find there are two test cases. I guess the second one is that you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's move the discussion to the main thread
|
Test build #46340 has finished for PR 9796 at commit
|
|
Test build #46653 has finished for PR 9796 at commit
|
|
@GraceH continuing our discussion on the "pending replacement" test: The problem is actually how the test framework is set up. To speed up the tests we don't wait for new executors to register. Actually, this particular test was incorrectly written even before this patch. This is because:
The right fix would be to add an Fixing this test may be fairly involved. Will you have time to look into this? If not, I can take over later after the 1.6 release. |
|
@andrewor14 Yes. you are so right. Meanwhile it seems the original implementation has waited for a while to check if the replacement is there. According to you suggestion, I can add the val executors = getExecutorIds(sc)
assert(executors.size === 2)
// kill executor 1, and replace it
assert(sc.killAndReplaceExecutor(executors.head))
eventually(timeout(10.seconds), interval(10.millis)) {
val apps = getApplications()
assert(apps.head.executors.size === 2)
+ // make sure the old executors head has been killedAndReplaced
+ assert(executors.head != getExecutorIds(sc).head)
} |
657849d to
2e4884c
Compare
|
I have added the test case GraceH@2e4884c. Please let me know your comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just compare the Seq? E.g. val executorIdsAfter = getExecutorIds(sc)
|
Do the tests pass locally? |
|
Yes. The replacement is finished. |
|
Test build #46726 has finished for PR 9796 at commit
|
|
Test build #46728 has finished for PR 9796 at commit
|
|
retest this please |
|
Test build #46752 has finished for PR 9796 at commit
|
|
retest this please |
|
Test build #47557 has finished for PR 9796 at commit
|
|
Thanks @zsxwing. The patch seems to pass all tests. |
|
This is not ready for merge yet, please see the unresolved comments thread. |
Previously we continued to use the old executors list even after killing and adding a new executor to replace the old one. This commit ensures subsequent asserts refresh the list.
|
I leave my thoughts under GraceH#2. Thanks. |
Fix kill and replace test
|
Test build #47965 has finished for PR 9796 at commit
|
1. add test case to check if it is false when killing non-existing executor(killed and replaced) 2. add test case to check if we can kill newly appended executor
|
Test build #47983 has finished for PR 9796 at commit
|
|
Test build #47984 has finished for PR 9796 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just kill newExecutors.head? It should still pass. Right now it's kind of arbitrary why we pick the second one.
|
@andreor14 thanks. |
In discussion (SPARK-9552), we proposed a force kill in
killExecutors. But if there is nothing to kill, it will return back with true (acknowledgement). And then, it causes the certain executor(s) (which is not eligible to kill) adding to pendingToRemove list for further actions.In this patch, we'd like to change the return semantics. If there is nothing to kill, we will return "false". and therefore all those non-eligible executors won't be added to the pendingToRemove list.
@vanzin @andrewor14 As the follow up of PR#7888, please let me know your comments.