
[SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies #20669

Closed
wants to merge 8 commits from ifilonenko:remove-init-container

Conversation

ifilonenko
Contributor

What changes were proposed in this pull request?

Removal of the init-container for downloading remote dependencies. Built off of the work done by @vanzin in an attempt to refactor driver/executor configuration, as elaborated in SPARK-22839.

How was this patch tested?

This patch was tested with unit and integration tests.

@SparkQA

SparkQA commented Feb 25, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1026/

@ssuchter
Contributor

retest this please

@SparkQA

SparkQA commented Feb 25, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1026/

@SparkQA

SparkQA commented Feb 25, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1027/

@ssuchter
Contributor

retest this please

@SparkQA

SparkQA commented Feb 25, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1027/

@SparkQA

SparkQA commented Feb 25, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1028/

@SparkQA

SparkQA commented Feb 25, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1028/

ifilonenko changed the title from "[SPARK-22839][K8S] Remove the use of init-container for downloading remote dependencies" to "[SPARK-22839][K8S][WIP] Remove the use of init-container for downloading remote dependencies" on Feb 26, 2018
@vanzin
Contributor

vanzin commented Feb 26, 2018

I'm not sure there was a consensus at the end of that discussion, and @mccheah has been working on some internal refactoring here also. @foxish

@mccheah
Contributor

mccheah commented Feb 26, 2018

My impression was that we did reach a consensus that this was the right idea from the mailing list discussion. Specifically http://apache-spark-developers-list.1001551.n3.nabble.com/Kubernetes-why-use-init-containers-tp23113p23140.html.

We want to get this in before we start refactoring because it reduces the size of the code that actually needs to be refactored by a sizable amount.

@vanzin
Contributor

vanzin commented Feb 26, 2018

If that's the consensus, then great! I thought some people still wanted to keep the init container around for some reason.

@foxish
Contributor

foxish commented Feb 26, 2018

+1, the benefits outweigh the desire to be k8s-like in the approach. We should have tests that validate that no behavior changes, though. I'm wondering if we should move the integration tests into the repo first, before indulging in any large-scale refactors.

@SparkQA

SparkQA commented Feb 26, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1061/

@SparkQA

SparkQA commented Feb 26, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1061/

@vanzin
Contributor

vanzin commented Feb 26, 2018

I think tests are failing because you're missing changes to run the driver through spark-submit in the entry point script:
vanzin@7e1dd8f#diff-c12f609c87a8093c2902284459fbe84b

…d by resolving the location of the downloaded file
@SparkQA

SparkQA commented Feb 27, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1066/

@SparkQA

SparkQA commented Feb 27, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1066/

@SparkQA

SparkQA commented Feb 27, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1105/

@SparkQA

SparkQA commented Feb 27, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1105/

@vanzin
Contributor

vanzin commented Feb 27, 2018

@ifilonenko you can't just remove the init container without replacing its functionality. In my patch that was done by using spark-submit to run the driver. But here you have nothing. That's why the tests are failing.

@ifilonenko
Contributor Author

ifilonenko commented Feb 27, 2018

@vanzin you are definitely right, and that is what needs to happen. However, one of the functions of the init-container was resolving the file location, which at the moment isn't supported unless the Spark application being tested uses SparkFiles.get(). This is something I am looking into right now, either on the integration-testing side or by finding some way to bring back this resolution (i.e. the file is currently being stored in /tmp).

@vanzin
Contributor

vanzin commented Feb 27, 2018

SparkFiles.get() is the official way of retrieving anything that is distributed using --files. So if the application is not using that, it's relying on undocumented behavior.

In the error logs from jenkins, the path of the file ("/var/spark-data/spark-files/pagerank_data.txt") seems to be provided by the test code as part of running spark-submit. That seems wrong, since the test code doesn't really have control of where files will show up. If the file is already in the image somehow, then it's probably a case of the test and the image not agreeing about what the path is. If the file is uploaded using --files, then providing an absolute path to the app is wrong (or, at worst, the app should be using just the file name and using SparkFiles.get() to find its actual location).
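A minimal sketch of the pattern being described (the submit line and file name are illustrative, not taken from the test code): ship the file with --files, pass only its name to the application, and let the application resolve the real location with SparkFiles.get().

    import scala.io.Source

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    // Submitted as, e.g.:
    //   spark-submit --files /local/path/pagerank_data.txt ... <app jar> pagerank_data.txt
    object FilesGetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("FilesGetExample").getOrCreate()
        val fileName = args(0)                    // just the name, never a hard-coded container path
        val localPath = SparkFiles.get(fileName)  // wherever --files actually placed the file
        val lineCount = Source.fromFile(localPath).getLines().size
        println(s"$fileName resolved to $localPath ($lineCount lines)")
        spark.stop()
      }
    }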

@vanzin
Contributor

vanzin commented Feb 27, 2018

BTW I wonder why are we still relying on integration tests that do not live in the Spark repo...

@ssuchter
Contributor

It's just because we haven't PR'd them yet. There are some cleanups I wanted to do before doing that.

@ssuchter
Contributor

@vanzin I just opened this PR for putting the integration tests in the Spark repo: #20697

@@ -89,26 +56,16 @@ private[spark] object KubernetesUtils {
  def getOnlyRemoteFiles(uris: Iterable[String]): Iterable[String] = {
    uris.filter { uri =>
      val scheme = Utils.resolveURI(uri).getScheme
-     scheme != "file" && scheme != "local"
+     scheme != "local"
Contributor

Actually is this function still needed?

@@ -202,6 +221,10 @@ private[spark] class KubernetesClientApplication extends SparkApplication {
    val launchTime = System.currentTimeMillis()
    val waitForAppCompletion = sparkConf.get(WAIT_FOR_APP_COMPLETION)
    val appName = sparkConf.getOption("spark.app.name").getOrElse("spark")
    val kubernetesResourceNamePrefix = {
      val uuid = UUID.nameUUIDFromBytes(Longs.toByteArray(launchTime)).toString.replaceAll("-", "")
Contributor

Using a uuid with - removed is what we already do.

        .addAllToEnv(driverJavaOptsEnvs.asJava)
        .addNewEnv()
          .withName(SPARK_CONF_DIR_ENV)
          .withValue(SPARK_CONF_PATH)
Contributor

I have one concern about using a different path for SPARK_CONF_DIR than /opt/spark/conf, which is the one under the installation path. Because the environment variable SPARK_CONF_DIR is set, people who want to customize the image, e.g. using a custom spark-env.sh, have to know to put the file into this custom path instead of /opt/spark/conf. Why don't we use /opt/spark/conf instead?

Contributor Author

I am flexible on that; does anyone else have thoughts?

Contributor

I'm unsure of what the behavior is if we try to mount to a directory that already exists on the image itself.

Contributor

Then I would suggest removing COPY conf /opt/spark/conf from the Dockerfile, as it has no effect anyway if SPARK_CONF_DIR points to somewhere else. Then set SPARK_CONF_DIR to point to /opt/spark/conf.

Contributor

+1 for this.

Contributor Author

Do the executors require a SPARK_CONF_DIR directory to be defined as well?

Contributor

liyinan926 commented Mar 15, 2018

Per the discussion on Slack, please remove COPY conf /opt/spark/conf in the Dockerfile, put the properties file under /opt/spark/conf, and then unset SPARK_CONF_DIR as it defaults to /opt/spark/conf anyway.

if (masterURL.startsWith("k8s") && sc.deployMode == "client") {
if (masterURL.startsWith("k8s") &&
sc.deployMode == "client" &&
!sc.conf.contains(KUBERNETES_EXECUTOR_POD_NAME_PREFIX)) {
Contributor

Is this sufficient to prevent end users from using client mode? What about adding a special key when calling spark-submit in the driver and testing that key here instead?

Contributor Author

I believe that logic might be beyond the scope of this PR. But I could add that if it seems appropriate.

Contributor

liyinan926 commented Mar 14, 2018

I think it's safer to have a new internal config key that is only used for this purpose. I'm not sure whether checking the presence of KUBERNETES_EXECUTOR_POD_NAME_PREFIX is sufficient, depending on whether the submission client code is called in client mode.
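A rough sketch of that internal-flag approach, matching the spark.kubernetes.submitInDriver entry that appears later in this diff; the object name, helper method, and exception message are illustrative, and the sketch assumes it lives inside Spark's Kubernetes cluster-manager code.

    import org.apache.spark.{SparkContext, SparkException}
    import org.apache.spark.internal.config.ConfigBuilder

    object SubmitInDriverCheck {
      // Internal flag set by the submission client when it re-invokes spark-submit
      // inside the driver pod; end users are not expected to set it themselves.
      val KUBERNETES_DRIVER_SUBMIT_CHECK =
        ConfigBuilder("spark.kubernetes.submitInDriver")
          .internal()
          .booleanConf
          .createOptional

      // Cluster-manager guard: reject genuine client-mode submissions by testing the
      // internal flag instead of probing for the executor pod name prefix.
      def checkClientMode(masterURL: String, sc: SparkContext): Unit = {
        if (masterURL.startsWith("k8s") &&
            sc.deployMode == "client" &&
            !sc.conf.get(KUBERNETES_DRIVER_SUBMIT_CHECK).getOrElse(false)) {
          throw new SparkException("Client mode is currently not supported for Kubernetes.")
        }
      }
    }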

@SparkQA

SparkQA commented Mar 14, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1495/

@SparkQA

SparkQA commented Mar 14, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1495/

import org.apache.spark.sql.SparkSession

/** Usage: SparkRemoteFileTest [file] */
object SparkRemoteFileTest {
Contributor

Hm, what is the purpose of this specific example?

Contributor Author

To test the presence of a remote file being mounted on the executors via the spark-submit run by the driver. Should I add a Javadoc? HDFSTest.scala didn't include one, but I can add one if necessary.
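For reference, a rough sketch of such a test application (not necessarily the exact code added in this PR): it receives the name of a file shipped with --files and verifies on an executor that SparkFiles.get() resolves it to an existing local file.

    import java.io.File

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    /** Usage: SparkRemoteFileTest [file] */
    object SparkRemoteFileTest {
      def main(args: Array[String]): Unit = {
        if (args.length < 1) {
          sys.error("Usage: SparkRemoteFileTest <file>")
        }
        val spark = SparkSession.builder()
          .appName("SparkRemoteFileTest")
          .getOrCreate()
        val sc = spark.sparkContext
        // Run the check inside a task so it executes on an executor, where the
        // file distributed through --files must also be present.
        val fileExists = sc.parallelize(Seq(1)).map { _ =>
          val location = SparkFiles.get(args(0))
          new File(location).isFile
        }.collect().head
        println(s"Mounting of ${args(0)} on the executors succeeded: $fileExists")
        spark.stop()
      }
    }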

      .withName(configMapName)
      .withNamespace(namespace)
      .endMetadata()
    .addToData(SPARK_CONF_FILE_NAME, propertiesWriter.toString + driverJavaOps)
Contributor

You want to write two config map data items:

  • one data item with the properties file name,
  • one data item with the JVM options.

Contributor

In other words I'd expect two calls to addToData here with the appropriate item names corresponding to the file names.
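A minimal sketch of that suggestion, using the same fabric8 builder shown above; the item names and the second data item (SPARK_JAVA_OPTS_FILE_NAME) are illustrative, not constants from this PR.

    import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder}

    object DriverConfigMap {
      // Illustrative file names for the two data items.
      val SPARK_CONF_FILE_NAME = "spark.properties"
      val SPARK_JAVA_OPTS_FILE_NAME = "java-opts"

      def build(
          configMapName: String,
          namespace: String,
          sparkProperties: String,
          driverJavaOpts: String): ConfigMap = {
        new ConfigMapBuilder()
          .withNewMetadata()
            .withName(configMapName)
            .withNamespace(namespace)
            .endMetadata()
          // One data item per file, as suggested above.
          .addToData(SPARK_CONF_FILE_NAME, sparkProperties)
          .addToData(SPARK_JAVA_OPTS_FILE_NAME, driverJavaOpts)
          .build()
      }
    }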

Contributor

We should probably also capture this with an integration test. Some test that has a call to System.getProperty(property) with a property that isn't a Spark-specific option should suffice.

Contributor

> One data item with the JVM options.

Why? spark-submit handles that already.

Contributor

This doesn't put spark.driver.extraJavaOptions on the launch of the spark-submit command in client mode, I think.

Contributor

Why does it have to be on the command line and not in the properties file?

Contributor

My understanding is that for spark-submit to pick it up, you want a jvm-opts file in the Spark configuration directory. This is specifically for JVM options that aren't spark.* options, e.g. GC settings, out-of-memory settings, YourKit agent loading.

Contributor

You really should try this out to see that it works the way I've been saying since the beginning. Just let spark-submit do its thing.

Contributor

Ah, I think I was misreading the code here - just putting spark.driver.extraJavaOptions in the properties works.

Sorry about that - @ifilonenko I think just adding spark.driver.extraJavaOptions as-is to the properties file will suffice.
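A tiny sketch of that outcome (the path and option value are illustrative): the submission client just writes spark.driver.extraJavaOptions into the properties file it already generates, and spark-submit in client mode picks it up from there.

    import java.io.FileWriter
    import java.util.Properties

    object DriverPropertiesFile {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.setProperty("spark.driver.extraJavaOptions", "-XX:+PrintGCDetails")
        // ...plus all of the other resolved spark.* settings...
        val writer = new FileWriter("/tmp/spark.properties")
        try props.store(writer, "driver properties") finally writer.close()
        // The driver container then runs something like:
        //   spark-submit --deploy-mode client --properties-file /tmp/spark.properties ...
      }
    }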

@ifilonenko
Contributor Author

Results from integration testing:

Discovery starting.
Discovery completed in 123 milliseconds.
Run starting. Expected test count is: 8
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Test extraJVMProprties being present on Driver
- Run FileCheck using a Remote Data File
Run completed in 3 minutes, 22 seconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@SparkQA

SparkQA commented Mar 16, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1558/

@SparkQA

SparkQA commented Mar 16, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1558/

Contributor

mccheah left a comment

An open question for us to consider, otherwise looks fine.

@@ -79,6 +79,12 @@ private[spark] object Config extends Logging {
    .stringConf
    .createOptional

  val KUBERNETES_DRIVER_SUBMIT_CHECK =
    ConfigBuilder("spark.kubernetes.submitInDriver")
      .internal()
Contributor

Nit: The indentation looks slightly off here.

"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
"$@"
Contributor

There's an argument that can be made for enforcing what arguments can be passed here. For example, we can instead have the following command:

      "$SPARK_HOME/bin/spark-submit"
      --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
      --deploy-mode client
      --properties-file $SPARK_DRIVER_PROPERTIES_FILE
      --class $SPARK_DRIVER_MAIN_CLASS
      spark-internal

And then instead of passing args to the Kubernetes container command, we pass everything as environment variables. I think this is clearer and makes the contract more obvious, in terms of what this driver container is expecting as input. Thoughts @vanzin @ifilonenko?

Contributor

What's the problem of passing arguments as arguments? I really dislike using environment variables in general, but this would be an especially nasty use for them. And I'm not sure how they make the contract more obvious.

Contributor

mccheah commented Mar 17, 2018

If one is going to override the docker image and provide their own implementation, the customizing developer would like to know all of the parameters that they are being passed up-front. Receiving "$@" makes it less obvious that we should be expecting a properties file and the SparkLauncher.NO_RESOURCE from spark-submit.

We could do argument parsing here to verify that --properties-file etc. are explicitly provided.

Contributor Author

+1 on explicit envs.

Contributor

> If one is going to override the docker image and provide their own implementation

There is no documentation about that, so how about defining what that contract is supposed to be first? Otherwise we're all just talking about hypotheticals.

My current understanding is that regardless of what's in the image, entrypoint.sh defines the entry point's contract, so this is an internal decision between spark-submit code and that shell script.

Contributor

We've seen some users override the entry point to do a bunch of extra bootstrapping before calling into the actual command to run the driver in custom images. I vaguely recall something related to Openshift doing this - @erikerlandson.

The point is, even if entrypoint.sh defines the contract, what arguments would a custom implementation pass it? Something like this would fail because we didn't pass the appropriate arguments to the entrypoint:

#!/bin/bash
# my-entrypoint.sh
source /opt/my-custom.script.sh
/opt/spark/entrypoint.sh # Notice we don't pass arguments

So the contract may be, that whatever arguments the user's entrypoint is passed, they must eventually call entrypoint.sh with all of those arguments forwarded through. That might be an acceptable contract. With environment variables we can say that it doesn't matter what arguments the downstream entrypoint is passed, just that you eventually need to run that code and it knows what to do if the application submission code set up the container's environment properly.

Having defined these ideas more precisely, I think either contract is acceptable here, with a slight preference towards having Spark interpret the environment rather than the arguments.

Contributor

> The point is, even if entrypoint.sh defines the contract, what arguments would a custom implementation pass it?

entrypoint.sh doesn't really define the contract, it implements the contract. Basically the entry point has to respect whatever the Spark submit code tells it to do.

So define the contract first. If you want users to be able to write their own custom entry points, then document that contract in a public doc, so that it cannot change in future versions of Spark. That was the main reason (well, one of them) why I asked this whole thing to be marked "experimental" in 2.3 - nothing is defined, but there seems to be a lot of tribal knowledge about what people want to do.

That doesn't really work in the long run.

Contributor

Think that makes sense re: defining specifically what a custom implementation has to do. We can follow up on that separately. Also agree that we haven't been too precise about what a custom implementation would look like. There's simple things like adding or modifying the existing content, but in terms of modifying logic we haven't given that enough thought.

For now I think the code we have here will suffice for the immediate need of removing the init-container, and we can follow up later on which path we want to take in this part of the discussion.

Contributor

as @mccheah mentioned, I included some logic on the current entrypoint.sh to allow Spark to work in cases such as an anonymous uid (another way to look at it is managing Spark's long-standing quirk of failing when it can't find a passwd entry). Putting it in entrypoint.sh was a good way to make sure this happened regardless of how the actual CMD evolved. A kind of defensive future-proofing, which is important for reference Dockerfiles. It also provides execution via tini as "pid 1", which is considered good standard practice.

All of this is done in part with the expectation that most users are liable to just want to customize their images by building the reference dockerfiles and using those as base images for their own, without modifying the CMD or entrypoint.sh

That said, I think that in terms of formally documenting a container API, entrypoint.sh may be a red herring. In theory, a user should be able to build their own custom container from the ground up, up to and including a different entrypoint, or default entrypoint, etc.

Part of the reason we went with an externalized CMD (instead of creating one in the backend code) was to allow maximum flexibility in how images were constructed. The back-end provides certain information to the pod. The "API" is a catalogue of this information, combined with any behaviors that the user's container must implement. API doc shouldn't assume the existence of entrypoint.sh

Contributor

liyinan926 left a comment

LGTM.

@mccheah
Contributor

mccheah commented Mar 19, 2018

@vanzin any other feedback before merging this?

@vanzin
Contributor

vanzin commented Mar 19, 2018

I probably won't have time to review this carefully, so if you're happy with it, don't wait for me.

@mccheah
Contributor

mccheah commented Mar 19, 2018

Think someone with permissions to merge has to do so here.

@vanzin
Contributor

vanzin commented Mar 19, 2018

@mccheah you're a committer...

@mccheah
Contributor

mccheah commented Mar 19, 2018

Merge button doesn't appear for me in the UI =( will need to look into that.

@erikerlandson
Contributor

@mccheah the workflow is to use dev/merge_spark_pr.py

@vanzin
Contributor

vanzin commented Mar 19, 2018

@mccheah you should have gotten an e-mail from Matei explaining the basics of how to merge PRs.

@foxish
Contributor

foxish commented Mar 19, 2018

There's a section explaining it at the bottom of https://spark.apache.org/committers.html

@mccheah
Contributor

mccheah commented Mar 19, 2018

Thanks - merging shortly.

asfgit closed this in f15906d Mar 19, 2018
ifilonenko deleted the remove-init-container branch March 19, 2018 21:08
mccheah pushed a commit to palantir/spark that referenced this pull request Mar 29, 2018
…emote dependencies

Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket.

This patch was tested with unit and integration tests.

Author: Ilan Filonenko <if56@cornell.edu>

Closes apache#20669 from ifilonenko/remove-init-container.
gitfy pushed a commit to gitfy/spark that referenced this pull request May 21, 2018
…emote dependencies

## What changes were proposed in this pull request?

Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket.

## How was this patch tested?

This patch was tested with unit and integration tests.

Author: Ilan Filonenko <if56@cornell.edu>

Closes apache#20669 from ifilonenko/remove-init-container.