This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Spark on Kubernetes - basic submission client #545

Closed
wants to merge 37 commits

Conversation

@liyinan926 (Member) commented Nov 7, 2017

Second draft of the upstreaming PR, containing the basic submission client implementation and unit tests. Branch spark-kubernetes-3-updated is a clone of spark-kubernetes-3 with the latest changes from upstream/master merged in. spark-kubernetes-4 includes all our changes from spark-kubernetes-3.

cc @foxish @mccheah @apache-spark-on-k8s/contributors

@foxish (Member) commented Nov 8, 2017

This will be the follow-up to apache#19468

@mccheah commented Nov 8, 2017

This seems like a large diff, but a quick scan shows that everything included is necessary. We need the driver service bootstrap because of changes to master. I think we can reduce the fanciness of the credentials step, but that doesn't reduce the complexity by a significant amount.

@foxish (Member) commented Nov 8, 2017

One TODO: Add the unit test in #542 to this PR

@liyinan926 (Member Author): Changes from #542 merged in.

@kimoonkim self-requested a review November 8, 2017 22:24
private[spark] object ClientArguments {
  def fromCommandLineArgs(args: Array[String]): ClientArguments = {
    var mainAppResource: Option[MainAppResource] = None
    val otherPyFiles = Seq.empty[String]

Don't think we're using this here.

Member Author: Removed.

      addDriverOwnerReference(createdDriverPod, otherKubernetesResources)
      kubernetesClient.resourceList(otherKubernetesResources: _*).createOrReplace()
    }
  } catch {

Don't think we want to catch Throwable here - look into NonFatal.

Member Author: Changed to NonFatal(e).
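
For context, a minimal sketch of the pattern being asked for (the cleanup shown here is illustrative, not the PR's exact code): scala.util.control.NonFatal matches ordinary exceptions but lets genuinely fatal errors such as OutOfMemoryError propagate instead of being swallowed.

  import scala.util.control.NonFatal

  try {
    kubernetesClient.resourceList(otherKubernetesResources: _*).createOrReplace()
  } catch {
    case NonFatal(e) =>
      // Clean up the driver pod that was already created, then rethrow.
      kubernetesClient.pods().delete(createdDriverPod)
      throw e
  }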

      .addAllToEnv(driverCustomEnvs.asJava)
      .addToEnv(driverExtraClasspathEnv.toSeq: _*)
      .addNewEnv()
        .withName(ENV_DRIVER_MEMORY)

These environment variable keys don't make much sense without the Dockerfile that describes the contract the submission client must fulfill.

It might make sense to include the Dockerfile for the driver and the executor in this PR. We shouldn't add the poms that build them - that would make this diff unnecessarily large.

Member Author: Should we still put them under docker-minimum-bundle?

Yes. Though, for a while now I've been thinking that there's probably a better name for this submodule =)

Member Author: Should we just call it docker?

Think that should be fine.

Member Author: Done.

@mccheah commented Nov 8, 2017

A few comments but otherwise this captures the spirit of what we want to have upstream.

COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY conf /opt/spark/conf
COPY dockerfiles/spark-base/entrypoint.sh /opt/
Member Author: @mccheah @foxish dockerfiles/spark-base doesn't make sense if the distribution does not include the spark-k8s bits. Does this matter?

Think we still want to include it to show the projected contents of the image.

@@ -0,0 +1,43 @@
#

Think we want src/main/dockerfiles as the top level directory.

Member Author: Done.

      masterWithoutK8sPrefix
    } else {
      val resolvedURL = s"https://$masterWithoutK8sPrefix"
      logDebug(s"No scheme specified for kubernetes master URL, so defaulting to https. Resolved" +

Do not use the s"" interpolator on plain-text fragments (nothing is interpolated there).

perhaps logInfo? this sounds useful

Member Author: Done.
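
To illustrate the two comments above (the message text here paraphrases the original, it is not the PR's final wording): keep the interpolator only on the fragment that actually interpolates a variable, and log at info level.

  val resolvedURL = s"https://$masterWithoutK8sPrefix"
  logInfo("No scheme specified for kubernetes master URL, so defaulting to https. " +
    s"Resolved URL is $resolvedURL.")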

      submissionSparkConf,
      KUBERNETES_DRIVER_LABEL_PREFIX,
      "label")
    require(!driverCustomLabels.contains(SPARK_APP_ID_LABEL), s"Label with key " +

Do not use the s"" interpolator on plain-text fragments (nothing is interpolated there).

Member Author: Done.

"label")
require(!driverCustomLabels.contains(SPARK_APP_ID_LABEL), s"Label with key " +
s" $SPARK_APP_ID_LABEL is not allowed as it is reserved for Spark bookkeeping" +
s" operations.")

Do not use the s"" interpolator on plain-text fragments (nothing is interpolated there).

Member Author: Done.

"annotation")
require(!driverCustomAnnotations.contains(SPARK_APP_NAME_ANNOTATION),
s"Annotation with key $SPARK_APP_NAME_ANNOTATION is not allowed as it is reserved for" +
s" Spark bookkeeping operations.")

Do not use the s"" interpolator on plain-text fragments (nothing is interpolated there), and align this with the previous line.

Member Author: Done.

    }
    val caCertDataBase64 = safeFileConfToBase64(
      s"$KUBERNETES_AUTH_DRIVER_CONF_PREFIX.$CA_CERT_FILE_CONF_SUFFIX",
      s"Driver CA cert file provided at %s does not exist or is not a file.")

Do not use the s"" interpolator on plain-text fragments (nothing is interpolated there).

Member Author: Done.

import scala.collection.mutable
import scala.util.control.NonFatal

import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVar, EnvVarBuilder, HasMetadata, OwnerReferenceBuilder, Pod, PodBuilder}

More than 6 classes imported from the same package; use a wildcard import (._).

Member Author: Done.
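
For illustration, the fabric8 import shown above collapses to a wildcard once more than six members are pulled from the package:

  // Before: seven classes listed individually
  import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVar, EnvVarBuilder, HasMetadata, OwnerReferenceBuilder, Pod, PodBuilder}
  // After: wildcard import
  import io.fabric8.kubernetes.api.model._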

    val resolvedDriverPod = new PodBuilder(currentDriverSpec.driverPod)
      .editSpec()
        .addToContainers(resolvedDriverContainer)
        .endSpec()

align these?

Member Author: The indentation probably makes sense as editSpec returns an object different from PodBuilder.

          .build())
        .endEnv()
      .withNewResources()
        .addToRequests("cpu", driverCpuQuantity)

What happens when we request more than what the node has, e.g. more cores than the number of cores on the node?

Member Author: This will cause the driver to not be scheduled by the k8s scheduler onto a node until a node with that many cores becomes available in the cluster.

In general YARN doesn't handle this case nicely either, but I wonder if in k8s we could do better if it is hooked up to an autoscaler or something?
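
For context, a rough sketch of how a CPU request like the one above is typically built with the fabric8 model (the variable names are illustrative); if the request exceeds what any node can offer, the pod simply stays Pending until capacity appears.

  import io.fabric8.kubernetes.api.model.QuantityBuilder

  // The scheduler must find a node with this much unreserved CPU before placing the pod.
  val driverCpuQuantity = new QuantityBuilder(false)
    .withAmount(driverCpuCores)   // e.g. "1" or "500m"
    .build()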

        .addToAnnotations(allDriverAnnotations.asJava)
        .endMetadata()
      .withNewSpec()
        .withRestartPolicy("Never")

this might be configurable in the future? some sort of driver HA in cluster mode?

Member Author: Good point. I think we can make it configurable.

Member: We should have an issue to track this. An HA driver would be useful in a streaming context in the future.

+1000x

import org.apache.spark.deploy.k8s.constants._
import org.apache.spark.deploy.k8s.submit.steps.{DriverConfigurationStep, KubernetesDriverSpec}

private[spark] class ClientSuite extends SparkFunSuite with BeforeAndAfter {

Don't do private[spark] with the Suite classes - they won't be run by Jenkins.
see c052212

Member Author: Done.
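
A minimal before/after sketch of that change (the test body is a placeholder, not the PR's actual tests):

  import org.scalatest.BeforeAndAfter
  import org.apache.spark.SparkFunSuite

  // Before (not discovered by the Jenkins test runner):
  //   private[spark] class ClientSuite extends SparkFunSuite with BeforeAndAfter { ... }

  // After: public suite, discovered and run normally.
  class ClientSuite extends SparkFunSuite with BeforeAndAfter {
    test("placeholder") {
      assert(true)
    }
  }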

@liyinan926 (Member Author) commented Nov 9, 2017

If there's no objection, I will squash the commits and push to upstream for review by EOD today. @apache-spark-on-k8s/contributors

@foxish (Member) commented Nov 9, 2017

SGTM! We should see how we can make it less confusing for reviewers, since this PR also encompasses the changes from spark-kubernetes-3.

@liyinan926 (Member Author)

When pushing upstream, I'm going to remove the code from the first PR so this is less confusing.

@kimoonkim (Member) left a comment

Initial batch of comments, mostly about readability. I still need to look at the various submission/configuration steps.

      case (KUBERNETES, CLIENT) =>
        printErrorAndExit("Client mode is currently not supported for Kubernetes.")
      case (KUBERNETES, CLUSTER) if args.isPython =>
        printErrorAndExit("Cluster deploy mode is currently not supported for python " +
Member: I wonder if this message could mislead users into thinking python client mode is supported. Users will try python client mode and will get the error in line 302, which is not a good experience. Is it possible to pattern match case (KUBERNETES, _) if args.isPython instead, before line 301? Then the error message can say "python is not supported for Kubernetes".

Member Author: Done.

      case (KUBERNETES, CLUSTER) if args.isPython =>
        printErrorAndExit("Cluster deploy mode is currently not supported for python " +
          "applications on Kubernetes clusters.")
      case (KUBERNETES, CLUSTER) if args.isR =>
Member: Ditto. Wonder if we can match case (KUBERNETES, _) if args.isR before line 301 and say "R is not supported in Kubernetes".

Member Author: Done.

        childMainClass = "org.apache.spark.deploy.k8s.submit.Client"
        childArgs ++= Array("--primary-java-resource", args.primaryResource)
        childArgs ++= Array("--main-class", args.mainClass)
        args.childArgs.foreach { arg =>
Member: I see a null check on args.childArgs is done at line 695 and a few others:

  if (args.childArgs != null) {
    args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
  }

Maybe we should do the same?

Member Author: Done.
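
Presumably the Kubernetes branch ends up with the same guard, roughly as below (a sketch combining the two snippets above, not the exact diff):

  if (args.childArgs != null) {
    args.childArgs.foreach { arg =>
      childArgs ++= Array("--arg", arg)
    }
  }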

@@ -466,6 +473,9 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
      case USAGE_ERROR =>
        printUsageAndExit(1)

      case KUBERNETES_NAMESPACE =>
Member: Nit. I would expect a significant parameter like this to come before line 464. Lines 464-473 are mostly about help and usage errors, it seems.

Member Author: Done.

    mainClass: String,
    driverArgs: Array[String])

private[spark] object ClientArguments {
Member: I personally prefer one empty line between the class/object header and the body, but I don't know if this is standard.

Member Author: Done.


    val resolvedDriverJavaOpts = currentDriverSpec
      .driverSparkConf
      // We don't need this anymore since we just set the JVM options on the environment
Member: Where is the code that this comment is referring to?

Member Author: Good catch. This comment and the line below it should be removed. Cc @foxish @mccheah to confirm.

Member Author: Actually, it seems it was added in #365. I think we should keep the comment but rephrase it to make it clearer.

Comment still applies because the lines below take the driver JVM options and append them to the JVM options of the SparkConf. For example, we don't want SPARK_DRIVER_JAVA_OPTS to have a value of -Dspark.driver.extraJavaOptions=-XX:HeapDumpOnOutOfMemoryError - we just want the -XX:HeapDumpOn... to be set directly on the driver process.

Member: The comment should be rephrased to clarify the intent of removing extraJavaOptions, along the lines of what Matt said.

Member Author: Done.
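
A hedged sketch of the intent being discussed (the config key is the standard spark.driver.extraJavaOptions; the surrounding variable names are illustrative): the option is dropped from the conf serialized for the driver, and its raw value is appended directly to the driver's JVM options, so the driver JVM never sees a -Dspark.driver.extraJavaOptions=... system property.

  val driverJavaOptions = submissionSparkConf.getOption("spark.driver.extraJavaOptions")

  val resolvedDriverJavaOpts = currentDriverSpec
    .driverSparkConf
    // Drop it here so it is not also propagated as a -D system property below.
    .remove("spark.driver.extraJavaOptions")
    .getAll
    .map { case (confKey, confValue) => s"-D$confKey=$confValue" }
    .mkString(" ") + driverJavaOptions.map(" " + _).getOrElse("")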

}

private[spark] object Client {
  def run(sparkConf: SparkConf, clientArguments: ClientArguments): Unit = {
Member: Same comment about one empty line before the class body.

Member Author: Done.

    val waitForAppCompletion = sparkConf.get(WAIT_FOR_APP_COMPLETION)
    val appName = sparkConf.getOption("spark.app.name").getOrElse("spark")
    val master = getK8sMasterUrl(sparkConf.get("spark.master"))
    val loggingInterval = Option(sparkConf.get(REPORT_INTERVAL)).filter( _ => waitForAppCompletion)
Member: Nit. Don't need a space before _?

Member Author: Done.

      clientArguments.driverArgs,
      sparkConf)

    Utils.tryWithResource(SparkKubernetesClientFactory.createKubernetesClient(
Member: This particular line does not read well because the word "KubernetesClient" appears here twice, meaning two different things. The reader may fail to distinguish the "Spark client" (SparkKubernetesClientFactory) from the "K8s API client" (createKubernetesClient).

Member Author: Renamed SparkKubernetesClientFactory to KubernetesClientFactory and renamed the method to create.

Member: I see. I myself misread. The two clients meant the same thing :-)

@kimoonkim (Member): The latest commit seems to address my comments so far. Thanks!

@liyinan926 force-pushed the spark-kubernetes-4 branch 3 times, most recently from 3ad6d7b to 37c7ad6 on November 10, 2017 00:31
@liyinan926 (Member Author): Squashed the commits and removed the scheduler backend code and relevant changes in YARN-related code.

@liyinan926 (Member Author): @kimoonkim any more comments on the submission steps?

@liyinan926 (Member Author): This is under review at apache#19717.

@kimoonkim (Member) left a comment

Submission steps look good to me. Left a few minor comments. PTAL.


import org.apache.spark.SparkConf

private[spark] object ConfigurationUtils {
Member: Put an empty line before the body?

Member Author: Done.

        kubernetesClient,
        waitForAppCompletion,
        appName,
        loggingPodStatusWatcher).run()
Member: I almost missed .run() at the end. Maybe we can use a val here:

  kubernetesClient =>
    val sparkClient = new Client(
      configurationStepsOrchestrator.getAllConfigurationSteps(),
      sparkConf,
      kubernetesClient,
      waitForAppCompletion,
      appName,
      loggingPodStatusWatcher)
    sparkClient.run()

Member Author: Done.

mgaido91 and others added 2 commits December 4, 2017 11:07
## What changes were proposed in this pull request?

apache#19696 replaced the deprecated usages of `Date` and `Waiter`, but a few methods were missed. This PR fixes the remaining deprecated usages.

## How was this patch tested?

existing UTs

Author: Marco Gaido <mgaido@hortonworks.com>

Closes apache#19875 from mgaido91/SPARK-22473_FOLLOWUP.
…n the RDD commit protocol

I have modified SparkHadoopWriter so that executors and the driver always use consistent JobIds during the hadoop commit. Before SPARK-18191, spark always used the rddId; it just incorrectly named the variable stageId. After SPARK-18191, it used the rddId as the jobId on the driver's side, and the stageId as the jobId on the executors' side. With this change, executors and the driver consistently use the rddId as the jobId. Also with this change, during the hadoop commit protocol spark uses the actual stageId to check whether a stage can be committed, unlike before, when it used the executors' jobId for this check.
In addition to the existing unit tests, a test has been added to check whether executors and the driver are using the same JobId. The test failed before this change and passed after applying this fix.

Author: Reza Safi <rezasafi@cloudera.com>

Closes apache#19848 from rezasafi/stagerddsimple.
Marcelo Vanzin and others added 9 commits December 4, 2017 11:05
The main goal of this change is to allow multiple cluster-mode
submissions from the same JVM, without having them end up with
mixed configuration. That is done by extending the SparkApplication
trait, and doing so was reasonably trivial for standalone and
mesos modes.

For YARN mode, there was a complication. YARN used a "SPARK_YARN_MODE"
system property to control behavior indirectly in a whole bunch of
places, mainly in the SparkHadoopUtil / YarnSparkHadoopUtil classes.
Most of the changes here are removing that.

Since we removed support for Hadoop 1.x, some methods that lived in
YarnSparkHadoopUtil can now live in SparkHadoopUtil. The remaining
methods don't need to be part of the class, and can be called directly
from the YarnSparkHadoopUtil object, so now there's a single
implementation of SparkHadoopUtil.

There were two places in the code that relied on  SPARK_YARN_MODE to
make decisions about YARN-specific functionality, and now explicitly check
the master from the configuration for that instead:

* fetching the external shuffle service port, which can come from the YARN
  configuration.

* propagation of the authentication secret using Hadoop credentials. This also
  was cleaned up a little to not need so many methods in `SparkHadoopUtil`.

With those out of the way, actually changing the YARN client
to extend SparkApplication was easy.

Tested with existing unit tests, and also by running YARN apps
with auth and kerberos both on and off in a real cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#19631 from vanzin/SPARK-22372.
@liyinan926 force-pushed the spark-kubernetes-4 branch 3 times, most recently from 51844cc to 0936fbe on December 5, 2017 23:19
@liyinan926 (Member Author): The PR has been merged upstream. Closing this.

@liyinan926 closed this Dec 11, 2017
@foxish deleted the spark-kubernetes-4 branch December 13, 2017 16:48