
[SPARK-24266][k8s] Restart the watcher when we receive a version changed from k8s #28423

Closed
wants to merge 4 commits into from
Conversation

@stijndehaes
Contributor

stijndehaes commented Apr 30, 2020

What changes were proposed in this pull request?

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information, see fabric8io/kubernetes-client#1075
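For illustration, a rough sketch in Scala of the retry loop this change describes, against the fabric8 client API. It assumes `watchOrStop` blocks until the application finishes and returns false when the watch was closed with HTTP 410 Gone; the method and type names follow the excerpt quoted further down, but the wiring here is illustrative rather than the exact patch:

```scala
import io.fabric8.kubernetes.client.{KubernetesClient, Watch}

// Keep re-creating the driver pod watch until the watcher reports that the
// application has finished.
def monitorDriverPod(
    client: KubernetesClient,
    namespace: String,
    driverPodName: String,
    watcher: LoggingPodStatusWatcher,  // Spark's pod status watcher, as in the excerpt below
    sId: String): Unit = {
  var applicationFinished = false
  while (!applicationFinished) {
    val watch: Watch = client.pods()
      .inNamespace(namespace)
      .withName(driverPodName)
      .watch(watcher)
    try {
      // Assumed contract: false means the watch was closed with HTTP_GONE (410) and
      // must be re-established; true means the application reached a terminal state.
      applicationFinished = watcher.watchOrStop(sId)
    } finally {
      watch.close()
    }
  }
}
```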

Why are the changes needed?

Does this PR introduce any user-facing change?

No

How was this patch tested?

Running spark-submit against a k8s cluster.

Not sure how to make an automated test for this; if someone can help me out, that would be great.

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

Thank you for your contribution, @stijndehaes .

@SparkQA

SparkQA commented Apr 30, 2020

Test build #122141 has finished for PR 28423 at commit a9ce548.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26813/

@SparkQA

SparkQA commented Apr 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26813/

@holdenk
Contributor

holdenk commented Apr 30, 2020

How do we feel about backporting this to Spark 2.4.6?

@stijndehaes
Contributor Author

stijndehaes commented Apr 30, 2020

How do we feel about backporting this to Spark 2.4.6?

I would very much like that. We ran into this using Spark 2.4.x and AWS EKS 1.15. I ran some preliminary tests, but will test this fix in production on some long-running Spark jobs on Monday.

@dongjoon-hyun
Member

Do you think we can have a unit test case for this, @stijndehaes?

.pods()
.withName(driverPodName)
.watch(watcher)
watcher.watchOrStop(sId)
Contributor

Would there be a race condition here if the application completes while we're cycling the pod watcher?

Contributor

If so, could we fix it by constructing the watcher then checking the pod status manually with the K8s client then calling watchOrStop?

Contributor Author

I think you get the latest version as the first message without anything needing to change, but I will double-check this.

Contributor Author

The explanation of how a watch works with the resource version set can be found here:

https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter

I am also checking whether we can automatically schedule the reconnect in the k8s client; then we wouldn't have to put this logic into Spark.
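For reference, a minimal sketch of how an expired resource version surfaces to a fabric8 `Watcher` (API shape roughly as in the fabric8 4.x client; the class and field names here are illustrative assumptions, not the patch itself):

```scala
import java.net.HttpURLConnection.HTTP_GONE

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

class DriverPodWatcher extends Watcher[Pod] {
  @volatile private var resourceVersionGone = false

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    // Track pod.getStatus.getPhase here; Succeeded or Failed means the app is done.
  }

  override def onClose(cause: KubernetesClientException): Unit = {
    // cause is null on a clean close; HTTP 410 (Gone) means the resourceVersion the
    // watch was resumed from has been compacted away and the watch must be re-created.
    if (cause != null && cause.getCode == HTTP_GONE) {
      resourceVersionGone = true
    }
  }

  def shouldRestart: Boolean = resourceVersionGone
}
```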

Contributor Author

I think you are right about the race condition: since the watch is started with the resource version from the Java k8s client, we only start watching changes from that resource version.
So we are in the case resourceVersion="{value other than 0}".

Contributor Author

OK, the code has been changed to send a MODIFIED event with the current pod state.
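A rough sketch of that resync idea, assuming that fetching the pod once and replaying it to the watcher as a MODIFIED event is enough to close the race; the helper name and wiring are illustrative, not the exact patch:

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, Watcher}

// After re-creating the watch, fetch the current pod once and feed it to the watcher
// so a completion that happened while the watch was down is not missed.
def resendCurrentPodState(
    client: KubernetesClient,
    namespace: String,
    driverPodName: String,
    watcher: Watcher[Pod]): Unit = {
  val currentPod: Pod = client.pods()
    .inNamespace(namespace)
    .withName(driverPodName)
    .get()
  if (currentPod != null) {
    watcher.eventReceived(Watcher.Action.MODIFIED, currentPod)
  }
}
```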

@SparkQA

SparkQA commented May 4, 2020

Test build #122248 has finished for PR 28423 at commit 2a4cbb6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2020

Test build #122249 has finished for PR 28423 at commit f832acf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26919/

@SparkQA

SparkQA commented May 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26920/

@SparkQA

SparkQA commented May 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26919/

@SparkQA

SparkQA commented May 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26920/

@stijndehaes
Contributor Author

Do you think we can have a unit test case for this, @stijndehaes?

The current tests completely mock out this behavior (see org.apache.spark.deploy.k8s.submit.ClientSuite). I think writing a test where we manipulate and fake the HTTP_GONE with a mock is not that useful. Maybe I can look into an integration test, but then I would have to be able to trigger a resource version change. Not sure if I will be able to.

@stijndehaes
Contributor Author

@holdenk Maybe we should refactor this behavior using shared informers.
See the comment made here: fabric8io/kubernetes-client#1075 (comment)

I can make an example implementation of this; it's maybe best to do that in another PR. What do you think?
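For context, a sketch of what the shared-informer approach might look like with the fabric8 client (API shape roughly as in fabric8 4.x; the method names and resync period are assumptions, and, as noted a few comments below, the informer at that time could only watch pods cluster-wide rather than a single named pod):

```scala
import io.fabric8.kubernetes.api.model.{Pod, PodList}
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import io.fabric8.kubernetes.client.informers.ResourceEventHandler

def watchDriverWithInformer(client: DefaultKubernetesClient, driverPodName: String): Unit = {
  val factory = client.informers()
  // Resync every 30s; the informer re-lists and replays state itself, hiding expired
  // resource versions (HTTP 410) from the caller.
  val informer = factory.sharedIndexInformerFor(classOf[Pod], classOf[PodList], 30 * 1000L)
  informer.addEventHandler(new ResourceEventHandler[Pod] {
    override def onAdd(pod: Pod): Unit = handle(pod)
    override def onUpdate(oldPod: Pod, newPod: Pod): Unit = handle(newPod)
    override def onDelete(pod: Pod, deletedFinalStateUnknown: Boolean): Unit = handle(pod)
    private def handle(pod: Pod): Unit = {
      if (pod.getMetadata.getName == driverPodName) {
        // Inspect pod.getStatus.getPhase and decide whether the application has finished.
      }
    }
  })
  factory.startAllRegisteredInformers()
}
```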

@stijndehaes
Contributor Author

stijndehaes commented May 5, 2020

OK, I have tested this in production and there is something wrong with the code, so I went ahead and tried the shared-informers approach. I will try that in production today. You can see the code here: https://github.com/stijndehaes/spark/tree/test/shared-informers

@stijndehaes
Contributor Author

OK, reverting back to the old approach; I found the missing piece, I think, and am testing that out.

Shared informers have the problem that you have to watch every pod in the whole cluster at the moment.

@SparkQA

SparkQA commented May 5, 2020

Test build #122321 has finished for PR 28423 at commit dad7ea2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2020

Test build #122322 has finished for PR 28423 at commit f05db8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26991/

@SparkQA

SparkQA commented May 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26991/

@SparkQA

SparkQA commented May 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26992/

@SparkQA

SparkQA commented May 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/26992/

@stijndehaes
Contributor Author

@holdenk @dongjoon-hyun I have tested this code in production and it works. I have a couple of jobs that take roughly 4 hours to finish; these all failed without the fix and are now succeeding.

Could you take the time to review the code again?

@jkleckner

+1 for this. Hit this in GKE today.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jul 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/30887/

@SparkQA

SparkQA commented Jul 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/30887/

@holdenk
Contributor

holdenk commented Jul 21, 2020

Let's focus on 3.1 and then explore backporting afterwards.

@holdenk
Contributor

holdenk commented Jul 21, 2020

LGTM pending Jenkins

@holdenk
Contributor

holdenk commented Jul 21, 2020

Oh wait, it has passed Jenkins, excellent. If @dongjoon-hyun is OK with this PR I'll merge it by the end of the week.

@dongjoon-hyun
Member

@holdenk. Thank you for pinging me. Feel free to merge if you think it's okay. I don't want to be a blocker for the community PR. ;)

@asfgit asfgit closed this in 0432379 Jul 21, 2020
@holdenk
Contributor

holdenk commented Jul 21, 2020

What's your JIRA username, @stijndehaes?

@stijndehaes
Contributor Author

@holdenk my JIRA username is sdehaes

@jkleckner

I took the commits from master and made a partial attempt to rebase this onto branch-2.4 [1].

However, the k8s API has evolved quite a bit since 2.4, so the watchOrStop function needs to be backported [2].

You can see the error message in this GitLab build [3].

Would it be useful to make a WIP pull request from [1]?

[1] https://github.com/jkleckner/spark/tree/SPARK-24266-on-branch2.4
[2] https://github.com/jkleckner/spark/blob/SPARK-24266-on-branch2.4/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala#L193
[3] https://gitlab.com/jkleckner/spark/-/jobs/651515950

@jkleckner

@stijndehaes In private discussions about the hang we are seeing, there appears to be another watcher [1] for the driver watching executors that also may lose notifications.

Have you run into any situations like this?

[1] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala

@stijndehaes stijndehaes deleted the bugfix/k8s-submit-resource-version-change branch August 5, 2020 05:38
@stijndehaes
Contributor Author

@jkleckner I have never had a problem with the driver watching the executors. I think there was already a fallback mechanism there, but I never looked into the code for that one.

@jkleckner

@liyinan926 Do you think there is an adequate existing fallback mechanism or do you still believe that there is a need to create a similar patch for ExecutorPodsWatchSnapshotSource ?

@puneetloya

I see this error a lot in the batch jobs:

{"level":"WARN","timestamp":"2020-08-10 19:17:35,985","thread":"OkHttp https://kubernetes.default.svc/...","source":"io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager", "line":"209","message":"Exec Failure"}
java.io.EOFException
	at okio.RealBufferedSource.require(RealBufferedSource.java:61)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I do think it's related to the above issue. The batch job starts, the driver is able to spin up new executors, communicate with them and get the job done, but it cannot clean them up.

This is with Spark 2.4.5 and Kubernetes versions 1.15 and 1.16, with multiple Kubernetes masters.

The above message repeats every 10 seconds. Let me know if it's not related.

@jkleckner

It looks a bit different from what I see. For me, it appears to get stuck at the very end of writing data to Bigtable in the very last task of a job. Our partner is working to backport the fix I mentioned and I will let you know if that addresses the hang.

jkleckner pushed a commit to jkleckner/spark that referenced this pull request Aug 30, 2020
…ged from k8s

### What changes were proposed in this pull request?

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
jkleckner pushed a commit to jkleckner/spark that referenced this pull request Sep 25, 2020
…ged from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Address review comment to fully qualify import scala.util.control
jkleckner pushed a commit to jkleckner/spark that referenced this pull request Sep 28, 2020
…ged from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Address review comment to fully qualify import scala.util.control

Rebase on branch-3.0 to fix SparkR integration test.
holdenk added a commit to holdenk/spark that referenced this pull request Oct 27, 2020
[SPARK-21040][CORE] Speculate tasks which are running on decommission executors

This PR adds functionality to consider the running tasks on decommissioned executors based on some config.
In spark-on-cloud, we sometimes already know that an executor won't be alive for more than a fixed amount of time. For example, with AWS Spot nodes, once we get the notification we know that a node will be gone in 120 seconds.
So if the running tasks on the decommissioning executors may run beyond currentTime+120 seconds, then they are candidates for speculation.

Currently, when an executor is decommissioned, we stop scheduling new tasks on it, but the already running tasks keep running on it. Based on the cloud, we might know beforehand that an executor won't be alive for more than a preconfigured time. Different cloud providers give different timeouts before they take away the nodes. For example, in the case of AWS Spot nodes, an executor won't be alive for more than 120 seconds. We can use this information in cloud environments to make better decisions about speculating the already running tasks on decommissioned executors.

Yes. This PR adds a new config "spark.executor.decommission.killInterval" which they can explicitly set based on the cloud environment where they are running.

Added UT.

Closes apache#28619 from prakharjain09/SPARK-21040-speculate-decommission-exec-tasks.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

This pull request adds the ability to migrate shuffle files during Spark's decommissioning. The design document associated with this change is at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE .

To allow this change the `MapOutputTracker` has been extended to allow the location of shuffle files to be updated with `updateMapOutput`. When a shuffle block is put, a block update message will be sent which triggers the `updateMapOutput`.

Instead of rejecting remote puts of shuffle blocks, `BlockManager` delegates the storage of shuffle blocks to its shuffle manager's resolver (if supported). A new, experimental, trait is added for shuffle resolvers to indicate they handle remote putting of blocks.

The existing block migration code is moved out into a separate file, and a producer/consumer model is introduced for migrating shuffle files from the host as quickly as possible while not overwhelming other executors.

Recomputing shuffle blocks can be expensive; we should take advantage of our decommissioning time to migrate these blocks.

This PR introduces two new config parameters, `spark.storage.decommission.shuffleBlocks.enabled` & `spark.storage.decommission.rddBlocks.enabled`, that control which blocks should be migrated during storage decommissioning.

New unit test & expansion of the Spark on K8s decom test to assert that decommissioning with shuffle block migration means that the results are not recomputed even when the original executor is terminated.

This PR is a cleaned-up version of the previous WIP PR I made apache#28331 (thanks to attilapiros for his very helpful reviewing on it :)).

Closes apache#28708 from holdenk/SPARK-20629-copy-shuffle-data-when-nodes-are-being-shutdown-cleaned-up.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Attila Zsolt Piros <attilazsoltpiros@apiros-mbp16.lan>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

This PR is a giant plumbing PR that plumbs an `ExecutorDecommissionInfo` along
with the DecommissionExecutor message.

The primary motivation is to know whether a decommissioned executor
would also be losing shuffle files -- and thus it is important to know
whether the host would also be decommissioned.

In the absence of this PR, the existing code assumes that decommissioning an executor does not lose the whole host with it, and thus does not clear the shuffle state if the external shuffle service is enabled. While this may hold in some cases (like K8s decommissioning an executor pod, or YARN container preemption), it does not hold in others, like when the cluster is managed by a Standalone Scheduler (Master). This is similar to the existing `workerLost` field in the `ExecutorProcessLost` message.

In the future, this `ExecutorDecommissionInfo` can be embellished for
knowing how long the executor has to live for scenarios like Cloud spot
kills (or Yarn preemption) and the like.

No

Tweaked an existing unit test in `AppClientSuite`

Closes apache#29032 from agrawaldevesh/plumb_decom_info.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

This PR reduces the prospect of a job loss during decommissioning. It
fixes two holes in the current decommissioning framework:

- (a) Loss of decommissioned executors is not treated as a job failure:
We know that the decommissioned executor would be dying soon, so its death is
clearly not caused by the application.

- (b) Shuffle files on the decommissioned host are cleared when the
first fetch failure is detected from a decommissioned host: This is a
bit tricky in terms of when to clear the shuffle state? Ideally you
want to clear it the millisecond before the shuffle service on the node
dies (or the executor dies when there is no external shuffle service) --
too soon and it could lead to some wastage and too late would lead to
fetch failures.

  The approach here is to do this clearing when the very first fetch
failure is observed on the decommissioned block manager, without waiting for
other blocks to also signal a failure.

Without them decommissioning a lot of executors at a time leads to job failures.

The task scheduler tracks the executors that were decommissioned along with their
`ExecutorDecommissionInfo`. This information is used (a) when handling an `ExecutorProcessLost` error, or (b) by the `DAGScheduler` when handling a fetch failure.

No

Added a new unit test `DecommissionWorkerSuite` to test the new behavior by exercising the Master-Worker decommissioning. I chose to add a new test since the setup logic was quite different from the existing `WorkerDecommissionSuite`. I am open to changing the name of the newly added test suite :-)

- Should I add a feature flag to guard these two behaviors? They seem safe to me, since they should only get triggered by decommissioning, but you never know :-).

Closes apache#29014 from agrawaldevesh/decom_harden.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

This test tries to fix the flakyness of BlockManagerDecommissionIntegrationSuite.

Make the block manager decommissioning test be less flaky

An interesting failure happens when migrateDuring = true (and persist or shuffle is true):
- We schedule the job with tasks on executors 0, 1, 2.
- We wait 300 ms and decommission executor 0.
- If the task is not yet done on executor 0, it will now fail because
   the block manager won't be able to save the block. This condition is
   easy to trigger on a loaded machine where the github checks run.
- The task will retry on a different executor (1 or 2) and its shuffle
   blocks will land there.
- No actual block migration happens here because the decommissioned
   executor technically failed before it could even produce a block.

To remove the above race, this change replaces the fixed 300 ms wait with waiting for an actual task to succeed. When a task has succeeded, we know its blocks would have been written for sure and thus its executor would certainly be forced to migrate those blocks when it is decommissioned.

The change always decommissions an executor on which a real task finished successfully instead of picking the first executor, because the system may choose to schedule nothing on the first executor and instead run the two tasks on one executor.

I have had bad luck with BlockManagerDecommissionIntegrationSuite and it has failed several times on my PRs. So fixing it.

No, unit test only change.

Github checks. Ran this test 100 times, 10 at a time in parallel in a script.

Closes apache#29226 from agrawaldevesh/block-manager-decom-flaky.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

[SPARK-31197][CORE] Shutdown executor once we are done decommissioning

Exit the executor when it has been asked to decommission and there is nothing left for it to do.

This is a rebase of apache#28817

If we want to use decommissioning in Spark's own scale down we should terminate the executor once finished.
Furthermore, in graceful shutdown it makes sense to release resources we no longer need if we've been asked to shutdown by the cluster manager instead of always holding the resources as long as possible.

The decommissioned executors will exit at the end of decommissioning. This is sort of a user-facing change; however, decommissioning hasn't been in any releases yet.

I changed the unit test to not send the executor exit message and still wait on the executor exited message.

Closes apache#29211 from holdenk/SPARK-31197-exit-execs-redone.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>

Connect decommissioning to dynamic scaling

Because the mock always says there is an RDD we may replicate more than once, and now that there are independent threads

Make Spark's dynamic allocation use decommissioning

Track the decommissioning executors in the core dynamic scheduler so we don't scale down too low, update the streaming ExecutorAllocationManager to also delegate to decommission

Fix up executor add for resource profile

Fix our exiting and cleanup thread for better debugging next time. Cleanup the locks we use in decommissioning and clarify some more bits.

Verify executors decommissioned, then killed by the external cluster manager, are re-launched

Verify some additional calls are not occurring in the executor allocation manager suite.

Don't close the watcher until the end of the test

Use decommissionExecutors and set adjustTargetNumExecutors to false so that we can match the pattern for killExecutor/killExecutors

bump numparts up to 6

Revert "bump numparts up to 6"

This reverts commit daf96dd.

Small comment & visibility cleanup

CR feedback/cleanup

Cleanup the merge

Fix up the merge

CR feedback, move adjustExecutors to a common utility function

Exclude some non-public APIs

Remove junk

More CR feedback

Fix adjustExecutors backport

This test fails for me locally and, from what I recall, it's because we use a different method of resolving the bind address than upstream, so disabling the test


Cleanup and drop watcher changes from the backport
jkleckner pushed a commit to jkleckner/spark that referenced this pull request Nov 3, 2020
…ged from k8s

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

No

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes apache#28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Address review comment to fully qualify import scala.util.control

Rebase on branch-3.0 to fix SparkR integration test.
dongjoon-hyun pushed a commit that referenced this pull request Nov 3, 2020
… changed from k8s

### What changes were proposed in this pull request?

This is a straight application of #28423 onto branch-3.0

Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means the resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was tested in #28423 by running spark-submit to a k8s cluster.

Closes #29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Nov 24, 2020
… changed from k8s

### What changes were proposed in this pull request?

This patch processes the HTTP Gone event and restarts the pod watcher.

### Why are the changes needed?

This is a backport of PR #28423 to branch-2.4.
This is a backport of PR #28423 to branch-2.4.
The reasons are explained in SPARK-24266: Spark jobs using the k8s resource scheduler may hang.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually.

Closes #30283 from jkleckner/shockdm-2.4.6-spark-submit-fix.

Lead-authored-by: Jim Kleckner <jim@cloudphysics.com>
Co-authored-by: Dmitriy Drinfeld <dmitriy.drinfeld@ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>