[SPARK-28487][k8s] More responsive dynamic allocation with K8S. #25236
Conversation
This change implements a few changes to the k8s pod allocator so that it behaves a little better when dynamic allocation is on.

(i) Allow the application to ramp up immediately when there's a change in the target number of executors. Without this change, scaling would only trigger when a change happened in the state of the cluster, e.g. an executor going down, or when the periodic snapshot was taken (default every 30s).

(ii) Get rid of pending pod requests, both acknowledged (i.e. Spark knows that a pod is pending resource allocation) and unacknowledged (i.e. Spark has requested the pod but the API server hasn't created it yet), when they're not needed anymore. This avoids starting those executors just to remove them after the idle timeout, wasting resources in the meantime.

(iii) Re-work some of the code to avoid unnecessary logging. While not bad without dynamic allocation, the existing logging was very chatty when dynamic allocation was on. With the changes, all the useful information is still there, but only when interesting changes happen.

(iv) Gracefully shut down executors when they become idle. Just deleting the pod causes a lot of ugly logs to show up, so it's better to ask pods to exit nicely. That also allows Spark to respect the "don't delete pods" option when dynamic allocation is on.

Tested on a small k8s cluster running different TPC-DS workloads.
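The graceful-shutdown-then-kill pattern in (iv) can be sketched as below. This is a minimal stand-in, not the PR's actual code: `GracefulKill`, `scheduleForcefulKill`, and the `kill` callback are invented for the example; in the PR the delay comes from `spark.kubernetes.dynamicAllocation.deleteGracePeriod`.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch of the idle-executor teardown pattern described above: the
// executor is first asked to exit gracefully (elided here), and a
// forceful kill only fires after a configurable grace period.
object GracefulKill {
  // `kill` stands in for the actual pod deletion.
  def scheduleForcefulKill(gracePeriodMs: Long)(kill: () => Unit): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val killTask = new Runnable { override def run(): Unit = kill() }
    scheduler.schedule(killTask, gracePeriodMs, TimeUnit.MILLISECONDS)
    // shutdown() still lets the already-scheduled delayed task run.
    scheduler.shutdown()
  }
}
```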
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #108062 has finished for PR 25236 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #108066 has finished for PR 25236 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #108113 has finished for PR 25236 at commit
Retest this please.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #108148 has finished for PR 25236 at commit
executorService.schedule(killTask, conf.get(KUBERNETES_DYN_ALLOC_KILL_GRACE_PERIOD),
  TimeUnit.MILLISECONDS)

// Return an immediate success, since we can't confirm or deny that executors have bee
bee -> been?
}
}

// Update the flag that helps the setTotalExpectedExecutors() callback avoid trigerring this
trigerring -> triggering
Kubernetes integration test starting
Kubernetes integration test status success
Test build #108415 has finished for PR 25236 at commit
lastSnapshot = snapshots.last
}

val currentRunningExecutors = lastSnapshot.executorPods.values.count {
minor: on a quick read, the naming of these variables makes it a bit unclear whether each one is a list of executors or just a count -- it would be nice to have the counts consistently use ...Count or num...
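The suggested convention can be illustrated with a small stand-in (the names and data below are invented for the example, not the PR's code): since `count` with a predicate returns an `Int`, a `...Count` suffix signals at the call site that the value is a number, not a collection.

```scala
// Illustrative only: a Count suffix (or num prefix) makes clear the
// value is a number of executors, not a collection of them.
object NamingDemo {
  case class Pod(phase: String)

  val executorPods: Map[Long, Pod] =
    Map(1L -> Pod("Running"), 2L -> Pod("Pending"), 3L -> Pod("Running"))

  // `count` with a predicate returns an Int, so name it as a count.
  val runningExecutorsCount: Int =
    executorPods.values.count(_.phase == "Running")
}
```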
.withField("status.phase", "Pending")
.withLabel(SPARK_APP_ID_LABEL, applicationId)
.withLabel(SPARK_ROLE_LABEL, SPARK_POD_EXECUTOR_ROLE)
.withLabelIn(SPARK_EXECUTOR_ID_LABEL, toDelete.sorted.map(_.toString): _*)
does sorting matter here? don't see it mentioned on the k8s api, and you're not doing it above.
Normally it doesn't matter, but it matters when using mocks in the tests (since this is a varargs call, not a parameter that takes a Set).
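The point about mocks can be sketched with a small stand-in (the object and method names below are invented for the example, not Spark's or fabric8's): a varargs call passes its arguments as an ordered Seq, so verifying the call element-by-element only works if the argument order is deterministic -- hence the sort.

```scala
// Stand-in for a varargs call like withLabelIn(...); `recorded`
// captures what a mock verification would compare against.
object VarargsOrderingDemo {
  var recorded: Seq[String] = Nil

  def withLabelIn(key: String, values: String*): Unit = {
    recorded = values.toSeq
  }

  // Sorting first makes the varargs order deterministic even though
  // the executor ids come from an unordered Set.
  def deleteByIds(toDelete: Set[Long]): Unit = {
    withLabelIn("spark-exec-id", toDelete.toSeq.sorted.map(_.toString): _*)
  }
}
```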
// Return an immediate success, since we can't confirm or deny that executors have been
// actually shut down without waiting too long and blocking the allocation thread.
Future.successful(true)
this seems bad. If we get the response wrong, then the ExecutorAllocationManager will mistakenly update its internal state to think the executors have been removed, when they haven't been:
spark/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
Lines 448 to 455 in b29829e
val executorsRemoved = if (testing) {
  executorIdsToBeRemoved
} else {
  // We don't want to change our target number of executors, because we already did that
  // when the task backlog decreased.
  client.killExecutors(executorIdsToBeRemoved, adjustTargetNumExecutors = false,
    countFailures = false, force = false)
}
which means we're expecting that call to kubernetes to delete the pods to be foolproof.
Why is it so bad to wait here? Is it because we are holding locks when making this call in CoarseGrainedSchedulerBackend? could that be avoided?
I added a longer comment explaining this.
The gist is:
- it's bad to wait because it blocks the EAM thread (in this case for a really long time)
- it's ok to return "true" because these executors will all die eventually, whether because of the shutdown message or because of the explicit kill.
The return value, to the best of my understanding, is not meant to say "yes all executors have been killed", but rather "an attempt has been made to remove all of these executors, and they'll die eventually".
(Otherwise there would be no need for the EAM to track which executors are pending removal, since it would know immediately from this return value.)
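The contract described here -- the return value acknowledges that removal was initiated, not that it completed -- can be sketched like this. It is a simplified stand-in, not the actual backend method signature:

```scala
import scala.concurrent.Future

object KillAck {
  // Simplified stand-in for the backend's kill path: the pod
  // deletions are issued asynchronously (elided), and the returned
  // Future only acknowledges that removal was initiated, so the
  // caller (the EAM thread) is never blocked on the Kubernetes API.
  def killExecutors(ids: Seq[String]): Future[Boolean] = {
    // ... issue asynchronous pod deletions for `ids` here ...
    Future.successful(true)
  }
}
```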
ok, thanks, I buy that explanation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #108678 has finished for PR 25236 at commit
lgtm, though I would really like somebody more familiar w/ k8s integration to take a look as well
Let's see if @mccheah has anything to add, otherwise I'll end up pushing before EOW.
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #109050 has finished for PR 25236 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #109062 has finished for PR 25236 at commit
Merging to master.
@@ -330,6 +330,12 @@ private[spark] object Config extends Logging {
    .booleanConf
    .createWithDefault(true)

  val KUBERNETES_DYN_ALLOC_KILL_GRACE_PERIOD =
    ConfigBuilder("spark.kubernetes.dynamicAllocation.deleteGracePeriod")
      .doc("How long to wait for executors to shut down gracefully before a forceful kill.")
@vanzin Does this only work in dynamic allocation mode? Is there any way to delete executors with a grace period in non-dynamic-allocation mode?
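For reference, setting the config added by this PR might look like the following. The master URL is a placeholder and the 30s value is just an example (value format follows Spark's usual time-string conventions):

```shell
# Placeholder invocation: give idle executors 30 seconds to exit
# gracefully before their pods are deleted outright.
spark-submit \
  --master k8s://https://<api-server-host>:443 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.kubernetes.dynamicAllocation.deleteGracePeriod=30s \
  ...
```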