
[SPARK-23153][K8s] Support client dependencies with a Hadoop Compatible File System #23546

Closed
wants to merge 1 commit

Conversation

@skonto (Contributor) commented Jan 15, 2019

What changes were proposed in this pull request?

  • Solves the current issue with --packages in cluster mode (there is no ticket for it). Also note some past issues here when Hadoop libs are used on the spark-submit side.
  • Supports spark.jars, spark.files, and the app jar.

It works as follows: spark-submit uploads the dependencies to the HCFS, and the driver then serves them via the Spark file server. No HCFS URIs are propagated.

The related design document is here. The next option to add is the RSS, but it has to be improved given the past discussion about it (Spark 2.3).
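The upload step described above can be sketched as follows; the bucket name and file paths are made up for illustration, and this is not the actual Spark code:

```shell
# Illustrative sketch: in cluster mode a local dependency is copied under
# spark.kubernetes.file.upload.path, and the rewritten URI is what the
# driver later serves via the Spark file server.
upload_path="s3a://my-bucket/spark-upload"   # example spark.kubernetes.file.upload.path
local_dep="/home/user/libs/my.jar"           # example file:// dependency

# The destination keeps only the file name:
target="${upload_path}/$(basename "$local_dep")"
echo "$target"   # prints s3a://my-bucket/spark-upload/my.jar
```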

How was this patch tested?

  • Run integration test suite.
  • Run an example using S3:
```
./bin/spark-submit \
...
 --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6 \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.memory=1G \
 --conf spark.kubernetes.namespace=spark \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
 --conf spark.driver.memory=1G \
 --conf spark.executor.instances=2 \
 --conf spark.sql.streaming.metricsEnabled=true \
 --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \
 --conf spark.kubernetes.container.image.pullPolicy=Always \
 --conf spark.kubernetes.container.image=skonto/spark:k8s-3.0.0 \
 --conf spark.kubernetes.file.upload.path=s3a://fdp-stavros-test \
 --conf spark.hadoop.fs.s3a.access.key=... \
 --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
 --conf spark.hadoop.fs.s3a.fast.upload=true \
 --conf spark.kubernetes.executor.deleteOnTermination=false \
 --conf spark.hadoop.fs.s3a.secret.key=... \
 --conf spark.files=client:///...resolv.conf \
 file:///my.jar
```

Added integration tests based on Ceph nano, which looks very active.
Unfortunately, minio needs Hadoop >= 2.8.

@skonto skonto changed the title [SPARK-23153][K8s] Support client dependencies for HCFS [SPARK-23153][K8s] Support client dependencies with a HCFS Jan 15, 2019

@SparkQA commented Jan 15, 2019

Test build #101225 has finished for PR 23546 at commit 20603a4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto skonto force-pushed the skonto:support-client-deps branch Jan 15, 2019

@SparkQA commented Jan 15, 2019

Test build #101228 has finished for PR 23546 at commit 7597e03.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto skonto force-pushed the skonto:support-client-deps branch Jan 15, 2019

@skonto (Contributor, Author) commented Jan 15, 2019

Integration tests failed with:

Run SparkRemoteFileTest using a remote data file *** FAILED ***
The code passed to eventually never returned normally. Attempted 70 times over 2.00060388465 minutes. Last failure message: false was not true. (KubernetesSuite.scala:276)

This is most likely due to DNS on the test node. I saw this with Ubuntu 18.04, where the contents of /etc/resolv.conf caused issues. The workaround was to start minikube with --extra-config=kubelet.resolv-conf=/run/systemd/resolve/resolv.conf

@skonto (Contributor, Author) commented Jan 15, 2019

@erikerlandson, @liyinan926, @felixcheung please review. If the approach is approved, I can add integration tests etc.

@SparkQA commented Jan 15, 2019

Test build #101229 has finished for PR 23546 at commit bf5f3b1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@skonto (Contributor, Author) commented Jan 15, 2019

jenkins test this please



@skonto (Contributor, Author) commented Jan 15, 2019

Integration tests still fail for the same issue.

Resolved review threads:
  • docs/running-on-kubernetes.md
  • ...es/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala (4 threads)
  • ...cala/org/apache/spark/deploy/k8s/features/DriverCommandFeatureStep.scala
  • core/src/main/scala/org/apache/spark/util/Utils.scala
  • core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (3 threads)
@SparkQA commented Jan 15, 2019

Test build #101254 has finished for PR 23546 at commit bf5f3b1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jan 15, 2019

Test build #101255 has finished for PR 23546 at commit bf5f3b1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@skonto (Contributor, Author) commented Jan 15, 2019

One Python test is failing, which is unrelated:

" File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 263, in condition
self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.25749106949322637 != 0.1 within 1 places"

core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala Outdated
@@ -330,6 +347,7 @@ private[spark] class SparkSubmit extends Logging {
}
}

// Fill in all spark properties of SparkConf

@skonto (Contributor, Author) commented Jan 15, 2019

I underline this here because it should be a few lines above, but I'm not sure if it matters. The issue is that a few lines above I have to use args.sparkProperties instead. The other problem is that at line 802 we re-fill the properties if missing; not sure why.

@vanzin (Contributor) commented Feb 27, 2019

I'd just move this initialization to the top of the function, then code can be consistent in always using sparkConf.

And yes, this code is very confusing and probably does a lot of redundant things, which is why I asked you to start with the plugin idea in the other PR, in the hope that we can clean this up little by little.

@skonto (Contributor, Author) commented Mar 20, 2019

I agree we should clean it up.

confKey = "spark.jars.repositories"),
OptionAssigner(args.ivyRepoPath, STANDALONE | MESOS, CLUSTER, confKey = "spark.jars.ivy"),
OptionAssigner(args.packagesExclusions, STANDALONE | MESOS,
OptionAssigner(args.packages, STANDALONE | MESOS | KUBERNETES,

@skonto (Contributor, Author) commented Jan 15, 2019

Note: resolving packages in containers may be slow if your network is slow, since the Ivy cache will be empty. In practice users should build their dependencies into the image or use a pre-populated cache.
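As a hedged illustration of the pre-populated-cache suggestion (the cache path and the idea of doing this at image build time are assumptions, not part of this PR):

```shell
# Hypothetical sketch: create an Ivy cache directory at image build time so
# that --packages resolution inside the container finds artifacts locally
# instead of downloading them on every submission.
IVY_CACHE="${TMPDIR:-/tmp}/spark-ivy-cache"
mkdir -p "$IVY_CACHE"
# At build time one would resolve the packages once, e.g. (shown, not run):
#   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.6 \
#     --conf spark.jars.ivy=$IVY_CACHE ...
# At run time the driver then reuses the populated cache.
echo "cache dir: $IVY_CACHE"
```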

@erikerlandson (Contributor) commented Feb 14, 2019

This point might be a good addition to the docs

@skonto (Contributor, Author) commented Feb 21, 2019

I will add it.

@@ -400,7 +415,6 @@ There are several Spark on Kubernetes features that are currently being worked o
Some of these include:

* Dynamic Resource Allocation and External Shuffle Service
* Local File Dependency Management

@erikerlandson (Contributor) commented Jan 15, 2019

Does the fact that this is a Hadoop (compatible) FS based solution imply there are use cases for local deps that aren't served by this PR?

@skonto (Contributor, Author) commented Jan 15, 2019

Yes, there might be cases like the RSS server where users want to upload to a file server within the cluster. I am covering the cases mentioned in the design document, which provide an API to use out of the box. The RSS implementation AFAIK needs improvements, so it's open for now, but we can work on it next. I could add a note there instead of removing that part of the doc, saying "partially done", but since it is a working solution, I thought I could remove that from future work.

@erikerlandson (Contributor) commented Jan 15, 2019

Out of curiosity, does any S3 object store fit the HCFS category?

@skonto (Contributor, Author) commented Jan 15, 2019

A quick example is minio; another is Swift.

@erikerlandson (Contributor) commented Jan 15, 2019

I'd say that's broadly applicable enough to call it "done"

@rvesse (Member) commented Jan 17, 2019

We have environments where there is nothing remotely HDFS like available and the systems are typically air-gapped so using external services like S3 isn't an option either. Primary storage is usually a high performance parallel file system (Lustre or IBM Spectrum Scale) which is just a POSIX compliant file system mounted to all nodes over the system interconnect.

Using hostPath volume mounts isn't a realistic option either because these environments have strict security requirements.

@skonto (Contributor, Author) commented Jan 22, 2019

In my opinion we need both options: a) upload to some DFS/object store service, b) a file server.
We cannot possibly support all the systems out there if they each have their own client libs.

@rvesse (Member) left a comment

Looks like a good first step to better supporting client dependencies, thanks for the hard work @skonto

Resolved review threads: core/src/main/scala/org/apache/spark/util/Utils.scala; docs/running-on-kubernetes.md
docs/running-on-kubernetes.md Outdated
The app jar file will be uploaded to the S3 and then when the driver is launched it will be downloaded
to the driver pod and will be added to its classpath.

The client scheme is supported for the application jar, and dependencies specified by proeprties `spark.jars` and `spark.files`.

@rvesse (Member) commented Jan 17, 2019

Typo: proeprties -> properties


Resolved review thread: core/src/main/scala/org/apache/spark/util/Utils.scala
```
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.hadoop.fs.s3a.secret.key=....
client:///full/path/to/app.jar
```

@liyinan926 (Contributor) commented Jan 18, 2019

How does the submission client know that the user's intention is to upload to S3 instead of, say, an HDFS cluster? I don't think this can be determined with 100% certainty based only on the presence of those s3a options.

@liyinan926 (Contributor) commented Jan 18, 2019

I saw you have spark.kubernetes.file.upload.path below, which should also be added here as an example.

@skonto (Contributor, Author) commented Jan 22, 2019

The code is agnostic of the protocol; I am just using S3 as an example in the docs. If users don't set the properties, submit will fail.
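To illustrate the protocol-agnostic point (example URIs only): the scheme of the upload path is what selects the Hadoop filesystem implementation, so nothing here is S3-specific.

```shell
# Sketch: the URI scheme of spark.kubernetes.file.upload.path determines
# which Hadoop FileSystem implementation handles the upload; s3a is only
# one example value.
for p in "s3a://bucket/deps" "hdfs://namenode:8020/deps" "file:///mnt/shared/deps"; do
  scheme="${p%%://*}"   # strip everything from "://" onward
  echo "scheme: $scheme"
done
```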

@skonto (Contributor, Author) commented Jan 22, 2019

ok

resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala Outdated
@@ -289,6 +289,12 @@ private[spark] object Config extends Logging {
.booleanConf
.createWithDefault(true)

val KUBERNETES_FILE_UPLOAD_PATH =
ConfigBuilder("spark.kubernetes.file.upload.path")
.doc("HCFS path to upload files to, using the client scheme:// in cluster mode.")

@liyinan926 (Contributor) commented Jan 18, 2019

HCFS path where files with the client:// scheme will be uploaded to in cluster mode.

...ce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala Outdated
if (fileScheme == "client") {
if (conf.get(KUBERNETES_FILE_UPLOAD_PATH).isDefined) {
val uploadPath = conf.get(KUBERNETES_FILE_UPLOAD_PATH).get
s"${uploadPath}/${fileUri.getPath.split("/").last}"

@liyinan926 (Contributor) commented Jan 18, 2019

So a file client://path/to/app1.jar will be uploaded to ${uploadPath}/app1.jar? What if two client-local files at different local paths have the same file name?

@skonto (Contributor, Author) commented Jan 22, 2019

I am currently not supporting that; I thought about it. People should make sure they don't create a conflict, otherwise I will have to create random paths.
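The collision being discussed can be shown with made-up paths: under a flat upload layout, two client-local files that differ only in their directory map to the same destination.

```shell
# Sketch of the flat-layout name clash (all paths are illustrative):
upload_path="s3a://bucket/deps"
dep_a="/project-a/lib/util.jar"
dep_b="/project-b/lib/util.jar"
dest_a="${upload_path}/$(basename "$dep_a")"
dest_b="${upload_path}/$(basename "$dep_b")"
if [ "$dest_a" = "$dest_b" ]; then
  echo "collision at $dest_a"   # both map to s3a://bucket/deps/util.jar
fi
```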

@liyinan926 (Contributor) commented Jan 22, 2019

Please document this clearly, i.e., all client-side dependencies will be uploaded to the given path with a flat directory structure.

@skonto (Contributor, Author) commented Feb 4, 2019

Will do, thanks.

...ce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala Outdated
val uploadPath = sConf.get(KUBERNETES_FILE_UPLOAD_PATH).get
val fs = getHadoopFileSystem(Utils.resolveURI(uploadPath), hadoopConf)
val storePath = new Path(s"${uploadPath}/${fileUri.getPath.split("/").last}")
log.info(s"Uploading file: ${fileUri.getPath}...")

@liyinan926 (Contributor) commented Jan 18, 2019

We should also mention the destination path in the log message.

...ce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala Outdated
def resolveFileUri(uri: String): String = {
/**
* Get the final path for a client file, if not return the uri as is.
*

@liyinan926 (Contributor) commented Jan 18, 2019

This empty comment line can be removed.

docs/running-on-kubernetes.md Outdated
@@ -1010,6 +1024,15 @@ See the below table for the full list of pod specifications that will be overwri
Spark will add additional labels specified by the spark configuration.
</td>
</tr>
<tr>

@liyinan926 (Contributor) commented Jan 18, 2019

This should be added to the table under the section Spark Properties.

core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala Outdated
@@ -373,6 +407,49 @@ private[spark] class SparkSubmit extends Logging {
localPyFiles = Option(args.pyFiles).map {
downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
}.orNull

if (isKubernetesClient &&
sparkConf.getBoolean("spark.kubernetes.submitInDriver", false)) {

@liyinan926 (Contributor) commented Jan 18, 2019

What's the purpose of checking spark.kubernetes.submitInDriver, which AFAIK is used to indicate cluster mode?

@skonto (Contributor, Author) commented Jan 22, 2019

I need to make sure I run this only in cluster mode, at the second submission, where we are using client mode.

@liyinan926 (Contributor) commented Jan 22, 2019

Got it.

@vanzin (Contributor) commented Feb 27, 2019

See my previous comment about these names not being optimal in the context of how k8s works.

@skonto skonto changed the title [SPARK-23153][K8s] Support client dependencies with a HCFS [SPARK-23153][K8s] Support client dependencies with a Hadoop Compatible File System Feb 4, 2019

@skonto skonto force-pushed the skonto:support-client-deps branch Feb 4, 2019

@SparkQA commented Feb 4, 2019

Test build #102043 has finished for PR 23546 at commit c885132.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@skonto skonto force-pushed the skonto:support-client-deps branch Feb 4, 2019

@srowen (Member) commented May 13, 2019

Yeah, I think you've recounted the difference of opinion here: should the user specify a unique subdirectory, or should Spark just make one? If you agree that subdirs are good, why not automate it, if there's no downside? On this minor technical point I'd agree with @vanzin, based on my limited understanding. Because I don't want to hold up the whole change, I'd suggest either implementing that approach or saying why it is harmful, because besides that this looks ready to go.

@skonto (Contributor, Author) commented May 14, 2019

@srowen there was also a debate about the deletion of the subdir. In my view, the user provides it and may want to re-use its contents, because in a consecutive submission they may not want to re-upload the jar and can just point to it in the S3 bucket, for example.
Only the user knows what to do with it, just as only the user knows when to delete a checkpointLocation in streaming, e.g. to start from scratch for whatever reason (the exception being temp locations via spark.sql.streaming.forceDeleteTempCheckpointLocation).
Creating the names in an automated fashion is possible, but I'm not sure that is what @vanzin was saying anyway. If that is the problem, then why doesn't Spark handle checkpoint dirs in an automated way? I don't see a difference. But again, ok, I will automate it.
So is spark.kubernetes.file.upload.path + auto-generated subdir ok as the final full path?
If that is the case I will proceed with that.
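A sketch of the auto-generated subdirectory idea under discussion (the naming scheme here is invented; the actual implementation chooses its own unique name):

```shell
# Hypothetical sketch: each submission uploads under a randomly named
# subdirectory of spark.kubernetes.file.upload.path, so parallel
# submissions with identically named jars cannot collide.
upload_path="s3a://bucket/deps"                 # example base path
subdir="spark-upload-$$-$(date +%s)"            # illustrative unique name
dest="${upload_path}/${subdir}/my.jar"
echo "$dest"
```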

@srowen (Member) commented May 14, 2019

I see, the point is that multiple jobs might want to share some files. That's the flip side of the problem that isolation solves: you don't worry about old files lying around. It's a use case though. Do we do that elsewhere in Spark? I really don't know. Like if you add .py files to a job and they get shipped off to YARN, are they reusable or referenceable? Consistency might be the best argument one way or the other. Well, I end up neutral on it. I'd probably suggest budging towards making the subdir to reach consensus here, absent any other information or input.

@skonto (Contributor, Author) commented May 14, 2019

@srowen ok will do that and someone can always change the code with a new PR.

@skonto (Contributor, Author) commented May 15, 2019

jenkins test this please



@SparkQA commented May 15, 2019

Test build #105426 has finished for PR 23546 at commit 29ecdeb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@skonto skonto force-pushed the skonto:support-client-deps branch from 29ecdeb to ec2d66c May 16, 2019

@skonto (Contributor, Author) commented May 16, 2019

jenkins test this please

@skonto (Contributor, Author) commented May 16, 2019

@erikerlandson @srowen I added support for random dirs, please let me know if this is ok to be merged now. I updated the doc to indicate that users will not see conflicts when Spark apps are run in parallel. Hope that helps.



@SparkQA commented May 16, 2019

Test build #105450 has finished for PR 23546 at commit ec2d66c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@erikerlandson (Contributor) commented May 17, 2019

@skonto that looks acceptable to me. Supporting additional directory semantics in a future PR is not precluded.
@srowen GitHub is showing unresolved reviews; are any of those still outstanding, or can we mark them resolved?

@srowen (Member) left a comment

Just a few nits here; I'd defer to @vanzin for comment on directory handling

if (isLocalAndResolvable(resource)) {
SparkLauncher.NO_RESOURCE
} else {
resource

@srowen (Member) commented May 17, 2019

Nit: indent

case e: Exception =>
throw new SparkException(s"Uploading file ${fileUri.getPath} failed...", e)
}
}

@srowen (Member) commented May 17, 2019

Nit: pull the else up to this line

@srowen srowen self-requested a review May 17, 2019

@srowen (Member) left a comment

I resolved my previous comments that were out of date

@skonto skonto force-pushed the skonto:support-client-deps branch from ec2d66c to 3c58f7b May 17, 2019

@skonto (Contributor, Author) commented May 17, 2019

@erikerlandson @srowen I fixed the two pending comments, please resolve the review.



@SparkQA commented May 17, 2019

Test build #105503 has finished for PR 23546 at commit 3c58f7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@skonto (Contributor, Author) commented May 20, 2019

@srowen @erikerlandson gentle ping

@srowen approved these changes May 20, 2019

@skonto (Contributor, Author) commented May 20, 2019

@erikerlandson @vanzin are you ok with the current status of things? Should it be merged?

@erikerlandson (Contributor) commented May 20, 2019

If there are no further requests I will merge.

@skonto (Contributor, Author) commented May 22, 2019

@erikerlandson from what I see there is no more activity, could you merge please?
