[SPARK-24248][K8S] Use level triggering and state reconciliation in scheduling and lifecycle #21366
Conversation
…for scheduling

Previously, the scheduler backend was maintaining state in many places, not only for reading state but also writing to it. For example, state had to be managed in both the watch and in the executor allocator runnable. Furthermore, one had to keep track of multiple hash tables. We can do better here by:

1. Consolidating the places where we manage state. Here, we take inspiration from traditional Kubernetes controllers. These controllers tend to implement an event queue which is populated by two sources: a watch connection and a periodic poller. Controllers typically use both mechanisms for redundancy; the watch connection may drop, so the periodic polling serves as a backup. Both sources write pod updates to a single event queue, and a processor periodically processes the current state of pods as reported by the two sources.

2. Storing less specialized in-memory state in general. Previously we were creating hash tables to represent the state of executors. Instead, it's easier to represent state solely by the event queue, which has predictable read/write patterns and is more or less just a local up-to-date cache of the cluster's status.
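The controller pattern described above - a watch connection and a periodic poller both feeding one queue that a single processor drains - can be sketched in a few lines. The names below (`PodUpdate`, `PodEventQueue`) are illustrative stand-ins, not Spark's actual classes:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical, simplified model of the pattern: both the watch connection and
// the periodic poller write into one shared queue, and a single processor
// drains it on its own schedule.
case class PodUpdate(execId: Long, phase: String)

class PodEventQueue {
  private val queue = new LinkedBlockingQueue[PodUpdate]()

  // Called by the watch connection on every pod event.
  def onWatchEvent(update: PodUpdate): Unit = queue.put(update)

  // Called by the periodic poller, which serves as a backup if the watch drops.
  def onPollResult(updates: Seq[PodUpdate]): Unit = updates.foreach(queue.put)

  // Called periodically by the processor to reconcile against current state.
  def drain(): List[PodUpdate] = {
    var out = List.empty[PodUpdate]
    var next = queue.poll()
    while (next != null) {
      out = next :: out
      next = queue.poll()
    }
    out.reverse
  }
}
```

Because both sources converge on the same queue, the processor never has to care which mechanism reported an update.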
Needs tests. @foxish @liyinan926 for initial comments on the design.
Test build #90810 has finished for PR 21366 at commit
// We could use CoarseGrainedSchedulerBackend#totalRegisteredExecutors here for tallying the
// executors that are running. But, here we choose instead to maintain all state within this
// class from the perspective of the k8s API. Therefore whether or not this scheduler loop
// believes a scheduler is running is dictated by the K8s API rather than Spark's RPC events.
believes an executor is running*
private def findExitReason(pod: Pod, execId: Long): ExecutorExited = {
  val exitCode = findExitCode(pod)
  val (exitCausedByApp, exitMessage) = if (isDeleted(pod)) {
Not sure if this is 100% accurate - the pod may be evicted by the Kubernetes API if the pod misbehaves, so we should introspect whether Kubernetes kicked out the pod because the pod itself did something wrong, or whether the pod was just deleted by a user or by this Spark application.
import org.apache.spark.deploy.k8s.Constants._

private[spark] class ExecutorPodsPollingEventSource(
It's noteworthy that the resync polls could also be done in ExecutorPodsEventHandler#processEvents. The reason we don't is that we probably want the resync polls to occur on a different interval than the event-handling passes. You may, for example, want the event handler to trigger very frequently so that pod updates are dealt with promptly, but you don't want to be polling the API server every 5 seconds.
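The two-interval idea can be sketched minimally: the snapshot-processing passes and the API-server resync polls are scheduled independently, so one can be aggressive while the other stays infrequent. The class name, counters, and intervals below are all invented for illustration:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative sketch: two tasks on the same scheduler, but with independent
// cadences. Processing can run every few milliseconds while polling runs on
// the order of tens of seconds, so the API server is not hammered.
class DecoupledSchedules(processEveryMillis: Long, pollEverySeconds: Long) {
  private val scheduler = Executors.newScheduledThreadPool(2)
  @volatile var processed = 0
  @volatile var polled = 0

  def start(): Unit = {
    scheduler.scheduleWithFixedDelay(
      new Runnable { def run(): Unit = processed += 1 },
      0L, processEveryMillis, TimeUnit.MILLISECONDS)
    scheduler.scheduleWithFixedDelay(
      new Runnable { def run(): Unit = polled += 1 },
      0L, pollEverySeconds, TimeUnit.SECONDS)
  }

  def stop(): Unit = scheduler.shutdownNow()
}
```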
def start(applicationId: String): Unit = {
  require(pollingFuture == null, "Cannot start polling more than once.")
  pollingFuture = pollingExecutor.scheduleWithFixedDelay(
    new PollRunnable(applicationId), 0L, 30L, TimeUnit.SECONDS)
Should make these and other intervals like it configurable.
Agreed.
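One hedged sketch of what making the interval configurable might look like: a lookup with a default, so the hardcoded 30 seconds becomes only a fallback. The config key and the `SimpleConf` helper below are hypothetical illustrations, not Spark's actual configuration API:

```scala
// Hypothetical stand-in for a configuration object; Spark's real SparkConf
// offers similar typed getters with defaults.
class SimpleConf(settings: Map[String, String]) {
  def getTimeAsSeconds(key: String, default: Long): Long =
    settings.get(key).map(_.toLong).getOrElse(default)
}
```

With something like this, the call site would ask the conf for the interval (e.g. a key such as `spark.kubernetes.executor.apiPollingInterval`, a made-up name here) instead of passing the literal `30L` to `scheduleWithFixedDelay`.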
Test build #90813 has finished for PR 21366 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Test build #90818 has finished for PR 21366 at commit
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Test build #90820 has finished for PR 21366 at commit
Kubernetes integration test status failure
retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #90819 has finished for PR 21366 at commit
Test build #90815 has finished for PR 21366 at commit
Kubernetes integration test status success
Test build #91728 has finished for PR 21366 at commit
Test build #91727 has finished for PR 21366 at commit
Test build #91731 has finished for PR 21366 at commit
@mccheah could you add a design doc for future reference, and so that new contributors can better understand the rationale behind this? There is some description in the JIRA ticket, but not enough to describe the final solution.
kubernetesClient
  .pods()
  .withLabel(SPARK_EXECUTOR_ID_LABEL, execId.toString)
  .delete()
Shouldn't removeExecutorFromSpark be called here as well? Couldn't it be the case that the executor exists at a higher level but the K8s backend missed it?
That's handled by the lifecycle manager already, because the lifecycle manager looks at what the scheduler backend believes are its executors and reconciles them with what's in the snapshot.
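The reconciliation described here boils down to a set difference between the executors the scheduler backend believes it has and the executor ids present in the latest snapshot. A sketch with illustrative names (not the PR's actual method):

```scala
// Anything the scheduler backend still tracks but the cluster snapshot no
// longer contains is a candidate for being reported as lost/removed.
def findMissingExecutors(
    knownToScheduler: Set[Long],
    inLatestSnapshot: Set[Long]): Set[Long] =
  knownToScheduler -- inLatestSnapshot
```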
import org.apache.spark.util.{ThreadUtils, Utils}

private[spark] class ExecutorPodsSnapshotsStoreImpl(subscribersExecutor: ScheduledExecutorService)
Could you add a description of the class here.
.newDaemonSingleThreadScheduledExecutor("kubernetes-executor-snapshots-buffer")
val snapshotsStore = new ExecutorPodsSnapshotsStoreImpl(bufferSnapshotsExecutor)
val removedExecutorsCache = CacheBuilder.newBuilder()
  .expireAfterWrite(3, TimeUnit.MINUTES)
Why 3 minutes? Should this be configurable?
Don't think it has to be configurable. Basically we should only receive the removed executor events multiple times for a short period of time, then we should settle into steady state.
The cache is only a best-effort attempt to avoid removing the same executor from the scheduler backend multiple times; at the end of the day, even if we do accidentally remove it multiple times, the only noticeable result is noisy logs. The scheduler backend properly handles repeated removal attempts, but we'd prefer it if we didn't have to rely on that.
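The time-bounded dedup the Guava cache provides can be modeled without Guava; this sketch uses an injectable clock so the expiry is testable. Purely illustrative, not the PR's code:

```scala
import scala.collection.mutable

// Hand-rolled analogue of CacheBuilder's expireAfterWrite for a set of ids:
// an id is "new" only if it hasn't been seen within the TTL window.
class ExpiringSet(ttlMillis: Long, clock: () => Long = () => System.currentTimeMillis()) {
  private val entries = mutable.Map.empty[Long, Long]

  // Returns true only the first time an id is seen within the TTL window.
  def addIfAbsent(id: Long): Boolean = {
    val now = clock()
    // Drop entries whose TTL has elapsed before checking membership.
    val expired = entries.collect { case (id, insertedAt) if now - insertedAt >= ttlMillis => id }
    expired.foreach(entries.remove)
    if (entries.contains(id)) {
      false
    } else {
      entries(id) = now
      true
    }
  }
}
```

This mirrors the behavior discussed above: duplicate removal events within the window are suppressed, and after the window passes the state resets on its own.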
snapshots.foreach { snapshot =>
  snapshot.executorPods.foreach { case (execId, state) =>
    state match {
      case deleted@PodDeleted(pod) =>
s/succeeded@PodSucceeded(pod)/succeeded@PodSucceeded(_)
case deleted@PodDeleted(pod) =>
  removeExecutorFromSpark(schedulerBackend, deleted, execId)
  execIdsRemovedInThisRound += execId
case failed@PodFailed(pod) =>
same as above.
  execIdsRemovedInThisRound += execId
case failed@PodFailed(pod) =>
  onFinalNonDeletedState(failed, execId, schedulerBackend, execIdsRemovedInThisRound)
case succeeded@PodSucceeded(pod) =>
same as above.
  new LinkedBlockingQueue[ExecutorPodsSnapshot](), onNewSnapshots)
subscribers += newSubscriber
pollingTasks += subscribersExecutor.scheduleWithFixedDelay(
  toRunnable(() => callSubscriber(newSubscriber)),
toRunnable is not needed with lambdas in Java 8. Just pass () => callSubscriber(newSubscriber) there.
Just tried that and it doesn't work - I think that requires the scala-java8-compat module, which I don't think is worth pulling in for just this case.
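For context, a toRunnable adapter of the kind discussed here is a one-liner; it exists because Scala 2.11 does not automatically convert function literals to Java SAM interfaces like Runnable (Scala 2.12+ does). This is a sketch of what such a helper might look like, not necessarily the PR's exact code:

```scala
// Adapts a Scala function literal to java.lang.Runnable explicitly, which is
// necessary on Scala 2.11 where SAM conversion is not performed automatically.
def toRunnable(f: () => Unit): Runnable = new Runnable {
  override def run(): Unit = f()
}
```

With this adapter, `toRunnable(() => callSubscriber(newSubscriber))` can be handed to any executor API expecting a Runnable.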
private class PollRunnable(applicationId: String) extends Runnable {
  override def run(): Unit = {
    snapshotsStore.replaceSnapshot(kubernetesClient
Do you start with an empty state to trigger executor creation at the very beginning when the driver starts?
Not strictly why that's done here but a side-effect I suppose. Really the snapshots store should push an initial empty snapshot to all subscribers when it starts, and the unit tests do check for that - it's the responsibility of the snapshots store.
Yes, you need to trigger the initial creation of executors somehow, and yes, I saw that in the tests. My only concern is that this should be explicit, not implicit, to make the code more obvious.
I see - I think what we actually want is ExecutorPodsSnapshotStoreImpl to initialize the subscriber with its current snapshot. That creates the semantics where the new subscriber will first receive the most up-to-date state immediately.
And though we don't allow for this right now, the above would allow subscribers added midway through to receive the most recent snapshot immediately. But again, we don't do this right now - we set up all subscribers on startup before we start pushing snapshots.
You could add a comment saying this is where we create executors and by what means.
I mean, on Mesos you start executors when you get offers from agents, which is straightforward and makes sense. Here you want to start them ASAP, since you have no restrictions, so then you can send Spark tasks to them, right?
But polling isn't where we start to create executors - that's done on the subscriber rounds. Polling here populates the snapshots store, but processing the snapshots happens on the subscriber thread(s). Furthermore, with the scheme proposed above, you never even have to poll for snapshots once before we begin requesting executors, because the pods allocator subscriber will trigger immediately with an empty snapshot.
For example, if we changed the initialDelay here to stall before the first snapshots sync, then with the above scheme we'd still try to request executors immediately, because the subscriber thread kicks off an allocation round immediately.
Yeah I see. I guess this is done by ExecutorPodsAllocator as a subscriber when it gets the empty snapshot.
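The semantics settled on in this thread - deliver the current snapshot to a subscriber the moment it registers, so the allocator sees an (initially empty) snapshot without waiting for the first poll - can be sketched as follows. The `SnapshotStore` type here is a simplified stand-in for ExecutorPodsSnapshotsStoreImpl:

```scala
import scala.collection.mutable

// Simplified illustration: on registration, a subscriber immediately receives
// the store's current snapshot; subsequent replacements are pushed to all
// subscribers. Real code would do the pushing on an executor thread.
class SnapshotStore[S](initial: S) {
  private var current: S = initial
  private val subscribers = mutable.Buffer.empty[S => Unit]

  def addSubscriber(onSnapshot: S => Unit): Unit = {
    subscribers += onSnapshot
    onSnapshot(current) // deliver the most up-to-date state right away
  }

  def replaceSnapshot(s: S): Unit = {
    current = s
    subscribers.foreach(_(s))
  }
}
```

This is what lets the pods allocator kick off its first allocation round from the initial empty snapshot, independent of any polling delay.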
conf: SparkConf,
kubernetesClient: KubernetesClient,
snapshotsStore: ExecutorPodsSnapshotsStore,
pollingExecutor: ScheduledExecutorService) {
Could you add some debug logging here? In general it would be good to be able to trace what is happening in case of an issue with debug mode; this applies to all classes introduced for both watching and polling.
I can do that, but would we consider that blocking the merge of this PR? I'd like to get this in soon; it's been open for a while.
Agree with @mccheah on not blocking this on a design doc. This PR strictly improves the management of executor states in k8s compared to how it was done before. So we really should get this merged soon.
If last round's comments are addressed, LGTM from me. The important behavior to check is the snapshotting, and creating replacement executors based on the captured snapshot.
Ok, addressed comments. The latest patch also makes it so that the subscribers run in a thread pool instead of just on a single thread. We have two subscribers, so now they can run concurrently, if that ever comes up. Not much else besides addressing the comments. @skonto if you're +1 then I'll merge.
@mccheah thanks, a lot better with the comments. +1
Kubernetes integration test starting
Kubernetes integration test status failure
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Ok, I'm merging to master. Thanks everyone for contributing to the review - @foxish, @liyinan926, @skonto, @dvogelbacher, @erikerlandson. As discussed earlier, I will post a design document for how this all works to the JIRA ticket.
Test build #91866 has finished for PR 21366 at commit
Test build #91870 has finished for PR 21366 at commit
What changes were proposed in this pull request?
Previously, the scheduler backend was maintaining state in many places, not only for reading state but also writing to it. For example, state had to be managed in both the watch and in the executor allocator runnable. Furthermore, one had to keep track of multiple hash tables.
We can do better here by:
1. Consolidating the places where we manage state. Here, we take inspiration from traditional Kubernetes controllers. These controllers tend to follow a level-triggered mechanism: the controller continuously monitors the API server via watches and polling, and on periodic passes, the controller reconciles the current state of the cluster with the desired state. We implement this by introducing the concept of a pod snapshot, which is a given state of the executors in the Kubernetes cluster. We operate periodically on snapshots. To prevent overloading the API server with polling requests to get the state of the cluster (particularly for executor allocation, where we want to be checking frequently to get executors to launch without unbearably bad latency), we use watches to populate snapshots by applying observed events to a previous snapshot to get a new snapshot. Whenever we do poll the cluster, the polled state replaces any existing snapshot - this ensures eventual consistency and mirroring of the cluster, as is desired in a level-triggered architecture.

2. Storing less specialized in-memory state in general. Previously we were creating hash tables to represent the state of executors. Instead, it's easier to represent state solely by the snapshots.
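The snapshot semantics in the description - incremental updates applied from watch events, wholesale replacement from polls - might be sketched like this, with a simplified stand-in for the real ExecutorPodsSnapshot:

```scala
// Simplified model: a snapshot is an immutable view of executor pod states.
case class Snapshot(executorPods: Map[Long, String]) {
  // A watch event updates one executor's state on top of the prior snapshot.
  def withUpdate(execId: Long, phase: String): Snapshot =
    Snapshot(executorPods + (execId -> phase))
}

object Snapshot {
  // A poll result replaces whatever the watch-derived snapshot believed,
  // giving eventual consistency with the cluster's actual state.
  def fromFullListing(pods: Map[Long, String]): Snapshot = Snapshot(pods)
}
```

Watch events keep the snapshot fresh cheaply between polls; the poll acts as the authoritative correction if any watch events were missed.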
How was this patch tested?
Integration tests should verify that there are no end-to-end regressions. Unit tests are to be updated, in particular focusing on different orderings of events, accounting for events arriving in unexpected orders.