
Add Apache Spark Docker Official Image #13089

Merged: 2 commits merged into docker-library:master on Jul 19, 2023

Conversation

@Yikun (Contributor) commented Sep 2, 2022

This patch adds the Apache Spark Docker Official Image.

Checklist for Review

NOTE: This checklist is intended for the use of the Official Images maintainers both to track the status of your PR and to help inform you and others of where we're at. As such, please leave the "checking" of items to the repository maintainers. If there is a point below for which you would like to provide additional information or note completion, please do so by commenting on the PR. Thanks! (and thanks for staying patient with us ❤️)

@Yikun (Contributor Author) commented Sep 2, 2022

Finally, this will move from Yikun/spark-docker to apache/spark-docker as suggested in #13057 .

Currently, the Dockerfile is maintained by Apache Spark. This Dockerfile (https://github.com/Yikun/spark-docker/tree/master/3.3.0/scala2.12-java11-focal) is adapted from link.

We have made some improvements as requested by the official images guidelines, but we don't know how many gaps remain. As a next step, we will have further discussions in the Spark community to make it happen.

So, could you help with a preliminary review of the Dockerfile and entrypoint? @tianon

@Yikun (Contributor Author) commented Sep 9, 2022

Friendly ping @tianon @yosifkit

@Yikun changed the title from "Add spark Docker Official Image" to "Add Apache Spark Docker Official Image" on Sep 21, 2022
@Yikun (Contributor Author) commented Sep 21, 2022

@Yikun (Contributor Author) commented Oct 11, 2022

A quick update (TL;DR: the Docker Official Image vote passed in the Apache Spark community):

@tianon @yosifkit So, it's ready for review now.

I know approval of new images can be slow; would you mind giving a rough start time for the review? We (the Apache Spark community) will spare no effort to support DOI, many thanks!

also cc @HyukjinKwon @zhengruifeng

@Yikun (Contributor Author) commented Oct 17, 2022

Quick update (TL;DR: added more tests and scripts to ensure the quality of the official image Dockerfiles):

  1. Added a Spark on K8s integration test to ensure docker image quality: [SPARK-40783][INFRA] Enable Spark on K8s integration test apache/spark-docker#9
  2. Used a template to generate the Dockerfiles to avoid low-level typos: [SPARK-40528] Support dockerfile template apache/spark-docker#12
  3. Applied some suggestions from https://github.com/docker-library/official-images#review-guidelines

I believe we are ready. @tianon @yosifkit

BTW, review note: https://github.com/apache/spark-docker/tree/master/3.3.0/scala2.12-java11-python3-r-ubuntu
This is the main Dockerfile; the remaining 3 Dockerfiles are subsets of it.

@Yikun force-pushed the patch-1 branch 2 times, most recently from 0f3653c to eb180a2 on October 20, 2022 07:40
@emiliofernandes commented:

+1 for an official Docker image for Apache Spark!


@julien-faye commented:

Any ETA on when this could be reviewed and merged?
An official image for Apache Spark is much needed!

@yosifkit (Member) commented:

Hello! ✨

Thanks for your interest in contributing to the official images program. 💭

As you may have noticed, we've usually got a pretty decently sized queue of new images (not to mention image updates and maintenance of images under @docker-library which are maintained by the core official images team). As such, it may be some time before we get to reviewing this image (image updates get priority both because users expect them and because reviewing new images is a more involved process than reviewing updates), so we apologize in advance! Please be patient with us -- rest assured, we've seen your PR and it's in the queue. ❤️

We do try to proactively add and update the "new image checklist" on each PR, so if you haven't looked at it yet, that's a good use of time while you wait. ☔

Thanks! 💖 💙 💚 ❤️

@Yikun (Contributor Author) commented Jan 10, 2023

@yosifkit Thank you very much for your reply! We (the Apache Spark community) will actively address comments once the review starts! Looking forward to your review!

@gatorsmile commented Apr 15, 2023

@yosifkit Today we published the latest release, Spark 3.4: https://spark.apache.org/releases/spark-release-3-4-0.html . I am wondering when we can have our own official Docker image?

@yosifkit (Member) commented May 3, 2023

Ok, I finally have some feedback ready 🎉😻. Sorry for the delay 🙇 Let me know if you have any questions. And thank you for your patience 🙇🥰💖

  1. Would it be useful to save space by sharing layers by having one image FROM another? 🤔 Something like the *java11-ubuntu as the "base", with the r and python variants FROM that, and the r-python variant FROM, probably, the larger one of those?

    • Rough example Dockerfiles:

      ```dockerfile
      FROM eclipse-temurin:11-jre-focal
      # user stuff, install common deps, etc
      ...
      # download/extract spark (maybe keeping python and R files too? they seem relatively small compared to the rest)
      ```

      ```dockerfile
      # other images in separate Dockerfiles
      FROM spark:3.3.0-scala2.12-java11-ubuntu
      # get "/opt/spark/{python,R}/" contents if not kept in base
      # install python or R (and things like R_HOME)
      ```

  `rm /bin/sh && ln -sv /bin/bash /bin/sh`
  2. Does spark expect /bin/sh to be bash (in posix mode) and not just a generic POSIX-compliant shell? Is there a bug with using dash? Minimally, this should do a `dpkg-divert` to ensure that users in dependent images don't accidentally revert it with package updates.
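A minimal sketch of the diversion approach (assuming a Debian/Ubuntu base where the dash package provides /bin/sh; this is illustrative, not the PR's actual Dockerfile):

```dockerfile
# Divert the dash-provided /bin/sh aside so package upgrades can't silently
# restore it, then point /bin/sh at bash (a sketch only)
RUN set -eux; \
    dpkg-divert --add --rename --divert /bin/sh.orig /bin/sh; \
    ln -sv /bin/bash /bin/sh
```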

echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&
  1. I am unsure what this is for? 😕 As far as I can tell, this means that only members of the administrative group wheel (or 0 if there is no wheel) can switch to another user using the su command. That might make sense on a regular multi-user system, but I am unsure why it would matter for a container.

  `chgrp root /etc/passwd && chmod ug+rw /etc/passwd`
  4. Wider permissions on /etc/passwd are concerning. What use case is broken if the running user id doesn't exist?

  `echo ... >> /etc/passwd`
  5. Having the entrypoint itself modify /etc/passwd is fragile. Are there features that break if the user doesn't exist in /etc/passwd (like PostgreSQL's initdb, which refuses to run)? Minimally, this should probably use useradd and usermod rather than hand editing.

  `env | grep SPARK_JAVA_OPT_ | ...`
  6. This is susceptible to a few bugs, particularly around newlines in values. I see two ways around it:
    • ensuring the matching name is at the start, `-E '^SPARK_JAVA_OPT_'`, and running all the commands with null-terminated input and output: `env -0` and `-z` on the other commands
    • or bash variable prefix expansion:

      ```bash
      for v in "${!SPARK_JAVA_OPT_@}"; do
          SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
      done
      ```

  7. switch_spark_if_root and tini should only happen when running a spark driver or executor and not for something like `docker run -it spark bash`.

  8. To minimize duplication across layers, chmod's should be done in the layer that creates the file/folder (or, in the case of a file from the context via COPY, it should have the +x committed to git).
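For the COPY case, the executable bit can be recorded in git itself; a minimal sketch (assuming git ≥ 2.9, and the filename is just an example):

```bash
# Record +x in the git index so the checked-out file is already executable,
# letting the Dockerfile drop its post-COPY chmod layer
chmod +x entrypoint.sh
git add --chmod=+x entrypoint.sh
git commit -m "Make entrypoint.sh executable"
```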


non-blocking:

  9. Using `set -ex` means you can use `;` instead of `&&` (really only matters for complex expressions, like the `||` in the later RUN that does use `;`).
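To illustrate the point (a hedged sketch, not taken from the PR): with `set -e` active inside the RUN, any failing command aborts the whole instruction, so `;` separators behave like `&&` for simple sequences.

```dockerfile
RUN set -ex; \
    apt-get update; \
    apt-get install -y --no-install-recommends curl; \
    rm -rf /var/lib/apt/lists/*
```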

@Yikun (Contributor Author) commented May 4, 2023

@yosifkit Thanks for your reply! We will address all comments as soon as possible!

> 1. Would it be useful to save space by sharing layers by having one image FROM another? 🤔 Something like the *java11-ubuntu as the "base", with the r and python variants FROM that, and the r-python variant FROM, probably, the larger one of those?

Good suggestion!

  • Should we put the base image and the r / python images into separate PRs? (First the base, then python and r, because after that change, the r and python images depend on the base.)
  • We are also considering supporting Java 17. Just want to be sure: java11 / java17 should be two base images, right?

Fixing in: apache/spark-docker#36

> 2. Does spark expect /bin/sh to be bash (in posix mode) and not just a generic POSIX-compliant shell? Is there a bug with using dash? Minimally, this should do a `dpkg-divert` to ensure that users in dependent images don't accidentally revert it with package updates.

It was introduced together with question 6, https://github.com/apache-spark-on-k8s/spark/pull/444/files#r134075892 ; we will try to fix it after question 6 is addressed.

Regarding question 6 ("This is susceptible to a few bugs particularly around newlines in values. I see two ways around it."):

```bash
export SPARK_JAVA_OPT_0="foo=bar"
export SPARK_JAVA_OPT_1="foo1=bar1"
env -0 | grep -z -E "^SPARK_JAVA_OPT_" | sort -z -t_ -k4 -n | sed -z 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
# readarray seems to not support `\0`; even with `-d '\0'` it is also not working
readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt

for v in "${SPARK_EXECUTOR_JAVA_OPTS[@]}"; do
    echo "$v"
done

# foo=bar
```

For bash variable prefix expansion, it works, but can it also support sh? (I tried it and it works, but I'm not sure whether it's the best way or not.)
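(Side note, a hedged sketch: on bash ≥ 4.4, `mapfile -d ''` is documented to split on NUL delimiters, so feeding it through process substitution may avoid the temp file entirely; untested against this exact entrypoint.)

```bash
# Read the NUL-delimited values straight into an array (assumes bash >= 4.4)
mapfile -d '' -t SPARK_EXECUTOR_JAVA_OPTS < <(
  env -0 | grep -z -E '^SPARK_JAVA_OPT_' | sort -z -t_ -k4 -n | sed -z 's/[^=]*=//'
)
```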

> 3. As far as I can tell, this means that only members of the administrative group wheel (or 0 if there is no wheel) can switch to another user using the su command. That might make sense on a regular multi-user system, but I am unsure why it would matter for a container.

I found it was introduced in SPARK-25275 as a security improvement. It should also be there for OpenShift, so if it's not a blocker, we prefer to keep it.

> 4. Wider permissions on /etc/passwd are concerning. What use case is broken if the running user id doesn't exist?
> 5. Having the entrypoint itself modify /etc/passwd is fragile.

It was introduced for OpenShift, apache-spark-on-k8s/spark#404 . ~But I'm not sure it's still a problem after 6 years?~ cc @erikerlandson. If it's not a blocker, we prefer to keep it.

BTW, a possibly related case: would you mind sharing some ideas on how we could switch the username in entrypoint.sh (like apache/spark#40831) to meet the Spark on K8s user-switch requirement? For example, is something like `usermod -l` allowed in DOI? (How do we change the username when we have already run `useradd` in the Dockerfile?)

> 7. switch_spark_if_root and tini should only happen when running a spark driver or executor and not for something like `docker run -it spark bash`.

Actually, I tried other Apache official images like solr / flink:

```console
# docker run -ti solr bash
solr@0d356d1b66f7:/opt/solr-9.2.1$
# docker run -ti flink bash
flink@89bf6453d82e:~$
```

They use the specific username; did I miss something?

Questions 8 and 9: we will address them soon!

@erikerlandson commented May 4, 2023

Regarding the modification of /etc/passwd - OpenShift runs its pods with an anonymous random uid, so you cannot assume any particular uid in the image. And there will be no such entry in the passwd file, which Spark used to require in order to run.

Does Spark still require an entry to exist in /etc/passwd? If that is no longer required, you may be able to remove this logic. Otherwise, it would still be needed for OpenShift.

see: https://cloud.redhat.com/blog/a-guide-to-openshift-and-uids

@Yikun (Contributor Author) commented May 4, 2023

@erikerlandson Thanks very much for your reply! We added `useradd`, so it seems this is not a problem now?

https://github.com/apache/spark-docker/blob/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-ubuntu/Dockerfile#L21

And is there any more background on question 3 (/etc/pam.d/su), SPARK-25275? Is it also for OpenShift?

@erikerlandson commented:

@Yikun I don't think that will work. It adds an entry for uid=185, but the uid will be different when the image is run in the pod, and you will never be able to predict beforehand what the uid will be.

@erikerlandson commented:

Regarding SPARK-25275, I no longer recall what the underlying issue was. Unless you can run CI testing on OpenShift, I'd recommend you leave these things in.

Yikun added a commit to apache/spark-docker that referenced this pull request May 6, 2023
### What changes were proposed in this pull request?
This PR changes the Dockerfiles and workflow to be based on a shared base image, to save space by sharing layers by having one image FROM another.

After this PR:
- The Spark / PySpark / SparkR related files are extracted into the base image
- PySpark / SparkR deps are installed in the PySpark / SparkR images
- The base image build step is added
- The changes are applied to the template (`./add-dockerfiles.sh 3.4.0`) to make it work
- This PR doesn't contain changes to the 3.3.x Dockerfiles, to keep the PR clearer; the 3.3.x changes will be a separate PR once we address all comments for 3.4.0

[1] docker-library/official-images#13089

### Why are the changes needed?
Address DOI comments, and also save space by sharing layers by having one image FROM another.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed.

Closes #36 from Yikun/official.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Yikun added a commit to apache/spark-docker that referenced this pull request Jun 27, 2023
### What changes were proposed in this pull request?
This patch changes `apt` to `apt-get` and also removes the useless `rm -rf /var/cache/apt/*; \`.
It also applies the change to 3.4.0 and 3.4.1.

### Why are the changes needed?
Address comments from DOI:
- `apt install ...`: this should be `apt-get` (apt is not intended for unattended use, as the warning during build makes clear).
- `rm -rf /var/cache/apt/*; \`: this is harmless, but should be unnecessary (the base image configuration already makes sure this directory stays empty).

See more in:
[1] docker-library/official-images#13089 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #47 from Yikun/apt-get.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Yikun added a commit to apache/spark-docker that referenced this pull request Jun 27, 2023
### What changes were proposed in this pull request?
Add `set -eo pipefail` to the entrypoint and quote variables.

### Why are the changes needed?
Address DOI comments:
1. Have you considered a set -eo pipefail on the entrypoint script to help prevent any errors from being silently ignored?
2. You probably want to quote this (and many of the other variables in this execution); ala --driver-url "$SPARK_DRIVER_URL"

[1] docker-library/official-images#13089 (comment)
[2] docker-library/official-images#13089 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #49 from Yikun/quote.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Yikun added a commit to apache/spark-docker that referenced this pull request Jun 27, 2023
### What changes were proposed in this pull request?
Remove useless lib64 path

### Why are the changes needed?
Address comments: docker-library/official-images#13089 (comment)

It was introduced by apache/spark@f13ea15 to address an issue with snappy on Alpine, but we have already switched the OS to Ubuntu, so the `/lib64` hack can be cleaned up.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #48 from Yikun/rm-lib64-hack.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

@Yikun (Contributor Author) commented Jun 27, 2023

> Will those be added to the Docker Hub readme or maybe your Kubernetes page or configuration page?

Added the env descriptions and the configuration link to the Docker Hub readme doc PR.

> Have you considered a set -eo pipefail on the entrypoint script to help prevent any errors from being silently ignored?

Addressed: apache/spark-docker#49

+    ln -s /lib /lib64; \

Addressed: apache/spark-docker#48

+    apt install ...
+    rm -rf /var/cache/apt/*; \

Addressed: apache/spark-docker#47

+      --driver-url $SPARK_DRIVER_URL

Addressed: apache/spark-docker#49

All comments have been addressed! @yosifkit @tianon

@tianon (Member) commented Jun 27, 2023

Sorry, did one more pass and noticed one last thing 🙇 🙇 ❤️

+    GPG_KEY=34F0FC5C

Oops, this should be the full key fingerprint: F28C9C925C188C35E345614DEDA00CE834F0FC5C (generating a collision for such a short key ID is trivial 😅); see https://github.com/docker-library/official-images#security

@Yikun (Contributor Author) commented Jun 28, 2023

@tianon Oh, thanks for the catch, I didn't realize that before.

But it seems F28C9C925C188C35E345614DEDA00CE834F0FC5C is not working: apache/spark-docker#50

I got the key from the release manager's entry in https://dist.apache.org/repos/dist/dev/spark/KEYS . I only have the short key for v3.4.1 (https://github.com/apache/spark-docker/blob/master/tools/template.py#L34).

Maybe a stupid question: how do I map the short key to the full key fingerprint?

@Yikun (Contributor Author) commented Jun 28, 2023

@dongjoon-hyun Would you mind taking a look at why we can't validate the long key?

It seems the short key (which can be validated) and the long key (which fails to validate with a "no public key" error) are the same one. 😂 Does it need to be imported, or some other operation?

@tianon (Member) commented Jun 28, 2023

Oh fun -- the way I mapped it was by downloading and importing that full KEYS file (gpg --import), and then running gpg --fingerprint 34F0FC5C, which gives it to you with spaces (which is also technically valid in the input, so you could set your variable to the space-separated version as well if you're careful about quoting it properly).
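A sketch of that mapping (assuming the KEYS file from https://dist.apache.org/repos/dist/dev/spark/KEYS has already been downloaded):

```bash
# Import the release KEYS file, then ask gpg for the full fingerprint of the short ID
gpg --import KEYS
gpg --fingerprint 34F0FC5C
# prints: F28C 9C92 5C18 8C35 E345  614D EDA0 0CE8 34F0 FC5C

# the space-separated form is also accepted as input if quoted carefully
GPG_KEY='F28C 9C92 5C18 8C35 E345  614D EDA0 0CE8 34F0 FC5C'
gpg --batch --recv-keys "$GPG_KEY"
```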

I think what's happening in your PR is that previously, keys.openpgp.org was probably rejecting the short ID completely so your code was falling back to keyserver.ubuntu.com, but now it's accepting the full fingerprint, but because the email address isn't verified, it's not returning a UID (https://keys.openpgp.org/about), but GnuPG is quietly ignoring the key and returning a successful exit code, which is less than ideal.

I've followed the steps in https://keys.openpgp.org/about/usage#gnupg-upload which should've emailed @dongjoon-hyun with a verification link that once verified, should allow you to restart that build and see success (as the path of least resistance here). 🙇

@tianon (Member) commented Jun 28, 2023

(all your gpg invocations should also include --batch, which essentially puts GnuPG into "API mode" instead of "UI mode", but that's not related to or causing this issue 😅)

@Yikun (Contributor Author) commented Jun 29, 2023

@tianon Thanks, I have contacted dongjoon to help address this issue; after the above 2 PRs ^ pass and are merged, I will update.

Yikun added a commit to apache/spark-docker that referenced this pull request Jun 29, 2023
### What changes were proposed in this pull request?

Change the GPG key from `34F0FC5C` to `F28C9C925C188C35E345614DEDA00CE834F0FC5C` to avoid a potential collision.

The full fingerprint can be obtained via the commands below:
```
$ wget https://dist.apache.org/repos/dist/dev/spark/KEYS
$ gpg --import KEYS
$ gpg --fingerprint 34F0FC5C

pub   rsa4096 2015-05-05 [SC]
      F28C 9C92 5C18 8C35 E345  614D EDA0 0CE8 34F0 FC5C
uid           [ unknown] Dongjoon Hyun (CODE SIGNING KEY) <dongjoon@apache.org>
sub   rsa4096 2015-05-05 [E]

```

### Why are the changes needed?

- A short GPG key was added as the v3.4.0 GPG key in #46 .
- The short key `34F0FC5C` is from https://dist.apache.org/repos/dist/dev/spark/KEYS
- According to the DOI review comment, docker-library/official-images#13089 (comment): `this should be the full key fingerprint: F28C9C925C188C35E345614DEDA00CE834F0FC5C (generating a collision for such a short key ID is trivial)`
- We'd better switch from the short key to the full fingerprint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #50 from Yikun/gpg_key.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Yikun added a commit to apache/spark-docker that referenced this pull request Jun 29, 2023
### What changes were proposed in this pull request?
Add `--batch` to the gpg commands, which essentially puts GnuPG into "API mode" instead of "UI mode".
Apply the changes to the 3.4.x Dockerfiles.

### Why are the changes needed?
Address DOI comments: docker-library/official-images#13089 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #51 from Yikun/batch.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
@github-actions commented:

Diff for 1fb130f:
diff --git a/_bashbrew-arches b/_bashbrew-arches
index 8b13789..e85a97f 100644
--- a/_bashbrew-arches
+++ b/_bashbrew-arches
@@ -1 +1,2 @@
-
+amd64
+arm64v8
diff --git a/_bashbrew-cat b/_bashbrew-cat
index bdfae4a..6bd4235 100644
--- a/_bashbrew-cat
+++ b/_bashbrew-cat
@@ -1 +1,22 @@
-Maintainers: New Image! :D (@docker-library-bot)
+Maintainers: Apache Spark Developers <dev@spark.apache.org> (@ApacheSpark)
+GitRepo: https://github.com/apache/spark-docker.git
+
+Tags: 3.4.1-scala2.12-java11-python3-r-ubuntu
+Architectures: amd64, arm64v8
+GitCommit: 58d288546e8419d229f14b62b6a653999e0390f1
+Directory: 3.4.1/scala2.12-java11-python3-r-ubuntu
+
+Tags: 3.4.1-scala2.12-java11-python3-ubuntu, 3.4.1-python3, python3, 3.4.1, latest
+Architectures: amd64, arm64v8
+GitCommit: 58d288546e8419d229f14b62b6a653999e0390f1
+Directory: 3.4.1/scala2.12-java11-python3-ubuntu
+
+Tags: 3.4.1-scala2.12-java11-r-ubuntu, 3.4.1-r, r
+Architectures: amd64, arm64v8
+GitCommit: 58d288546e8419d229f14b62b6a653999e0390f1
+Directory: 3.4.1/scala2.12-java11-r-ubuntu
+
+Tags: 3.4.1-scala2.12-java11-ubuntu, 3.4.1-scala, scala
+Architectures: amd64, arm64v8
+GitCommit: 58d288546e8419d229f14b62b6a653999e0390f1
+Directory: 3.4.1/scala2.12-java11-ubuntu
diff --git a/_bashbrew-list b/_bashbrew-list
index e69de29..d4a584b 100644
--- a/_bashbrew-list
+++ b/_bashbrew-list
@@ -0,0 +1,12 @@
+spark:3.4.1
+spark:3.4.1-python3
+spark:3.4.1-r
+spark:3.4.1-scala
+spark:3.4.1-scala2.12-java11-python3-r-ubuntu
+spark:3.4.1-scala2.12-java11-python3-ubuntu
+spark:3.4.1-scala2.12-java11-r-ubuntu
+spark:3.4.1-scala2.12-java11-ubuntu
+spark:latest
+spark:python3
+spark:r
+spark:scala
diff --git a/_bashbrew-list-build-order b/_bashbrew-list-build-order
index e69de29..66dee52 100644
--- a/_bashbrew-list-build-order
+++ b/_bashbrew-list-build-order
@@ -0,0 +1,4 @@
+spark:scala
+spark:3.4.1-scala2.12-java11-python3-r-ubuntu
+spark:latest
+spark:r
diff --git a/spark_3.4.1-scala2.12-java11-python3-r-ubuntu/Dockerfile b/spark_3.4.1-scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 0000000..30e6b86
--- /dev/null
+++ b/spark_3.4.1-scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM spark:3.4.1-scala2.12-java11-ubuntu
+
+USER root
+
+RUN set -ex; \
+    apt-get update; \
+    apt-get install -y python3 python3-pip; \
+    apt-get install -y r-base r-base-dev; \
+    rm -rf /var/lib/apt/lists/*
+
+ENV R_HOME /usr/lib/R
+
+USER spark
diff --git a/spark_latest/Dockerfile b/spark_latest/Dockerfile
new file mode 100644
index 0000000..124ef71
--- /dev/null
+++ b/spark_latest/Dockerfile
@@ -0,0 +1,26 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM spark:3.4.1-scala2.12-java11-ubuntu
+
+USER root
+
+RUN set -ex; \
+    apt-get update; \
+    apt-get install -y python3 python3-pip; \
+    rm -rf /var/lib/apt/lists/*
+
+USER spark
diff --git a/spark_r/Dockerfile b/spark_r/Dockerfile
new file mode 100644
index 0000000..1c9fc38
--- /dev/null
+++ b/spark_r/Dockerfile
@@ -0,0 +1,28 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM spark:3.4.1-scala2.12-java11-ubuntu
+
+USER root
+
+RUN set -ex; \
+    apt-get update; \
+    apt-get install -y r-base r-base-dev; \
+    rm -rf /var/lib/apt/lists/*
+
+ENV R_HOME /usr/lib/R
+
+USER spark
diff --git a/spark_scala/Dockerfile b/spark_scala/Dockerfile
new file mode 100644
index 0000000..d8bba7e
--- /dev/null
+++ b/spark_scala/Dockerfile
@@ -0,0 +1,79 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+    useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex; \
+    apt-get update; \
+    apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu libnss-wrapper; \
+    mkdir -p /opt/spark; \
+    mkdir /opt/spark/python; \
+    mkdir -p /opt/spark/examples; \
+    mkdir -p /opt/spark/work-dir; \
+    chmod g+w /opt/spark/work-dir; \
+    touch /opt/spark/RELEASE; \
+    chown -R spark:spark /opt/spark; \
+    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su; \
+    rm -rf /var/lib/apt/lists/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz \
+    SPARK_TGZ_ASC_URL=https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz.asc \
+    GPG_KEY=F28C9C925C188C35E345614DEDA00CE834F0FC5C
+
+RUN set -ex; \
+    export SPARK_TMP="$(mktemp -d)"; \
+    cd $SPARK_TMP; \
+    wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+    wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+    export GNUPGHOME="$(mktemp -d)"; \
+    gpg --batch --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+    gpg --batch --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+    gpg --batch --verify spark.tgz.asc spark.tgz; \
+    gpgconf --kill all; \
+    rm -rf "$GNUPGHOME" spark.tgz.asc; \
+    \
+    tar -xf spark.tgz --strip-components=1; \
+    chown -R spark:spark .; \
+    mv jars /opt/spark/; \
+    mv bin /opt/spark/; \
+    mv sbin /opt/spark/; \
+    mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+    mv examples /opt/spark/; \
+    mv kubernetes/tests /opt/spark/; \
+    mv data /opt/spark/; \
+    mv python/pyspark /opt/spark/python/pyspark/; \
+    mv python/lib /opt/spark/python/lib/; \
+    mv R /opt/spark/; \
+    chmod a+x /opt/decom.sh; \
+    cd ..; \
+    rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+
+WORKDIR /opt/spark/work-dir
+
+USER spark
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/spark_scala/entrypoint.sh b/spark_scala/entrypoint.sh
new file mode 100755
index 0000000..2e3d2a8
--- /dev/null
+++ b/spark_scala/entrypoint.sh
@@ -0,0 +1,126 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Prevent any errors from being silently ignored
+set -eo pipefail
+
+attempt_setup_fake_passwd_entry() {
+  # Check whether there is a passwd entry for the container UID
+  local myuid; myuid="$(id -u)"
+  # If there is no passwd entry for the container UID, attempt to fake one
+  # You can also refer to the https://github.com/docker-library/official-images/pull/13089#issuecomment-1534706523
+  # It's to resolve OpenShift random UID case.
+  # See also: https://github.com/docker-library/postgres/pull/448
+  if ! getent passwd "$myuid" &> /dev/null; then
+      local wrapper
+      for wrapper in {/usr,}/lib{/*,}/libnss_wrapper.so; do
+        if [ -s "$wrapper" ]; then
+          NSS_WRAPPER_PASSWD="$(mktemp)"
+          NSS_WRAPPER_GROUP="$(mktemp)"
+          export LD_PRELOAD="$wrapper" NSS_WRAPPER_PASSWD NSS_WRAPPER_GROUP
+          local mygid; mygid="$(id -g)"
+          printf 'spark:x:%s:%s:${SPARK_USER_NAME:-anonymous uid}:%s:/bin/false\n' "$myuid" "$mygid" "$SPARK_HOME" > "$NSS_WRAPPER_PASSWD"
+          printf 'spark:x:%s:\n' "$mygid" > "$NSS_WRAPPER_GROUP"
+          break
+        fi
+      done
+  fi
+}
+
+if [ -z "$JAVA_HOME" ]; then
+  JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+for v in "${!SPARK_JAVA_OPT_@}"; do
+    SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
+done
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z "${PYSPARK_PYTHON+x}" ]; then
+    export PYSPARK_PYTHON
+fi
+if ! [ -z "${PYSPARK_DRIVER_PYTHON+x}" ]; then
+    export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}"  ] && [ -z "${SPARK_DIST_CLASSPATH}"  ]; then
+  export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z "${HADOOP_CONF_DIR+x}" ]; then
+  SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z "${SPARK_CONF_DIR+x}" ]; then
+  SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z "${SPARK_HOME+x}" ]; then
+  SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+  if [ $(id -u) -eq 0 ]; then
+    echo gosu spark
+  fi
+}
+
+case "$1" in
+  driver)
+    shift 1
+    CMD=(
+      "$SPARK_HOME/bin/spark-submit"
+      --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+      --deploy-mode client
+      "$@"
+    )
+    attempt_setup_fake_passwd_entry
+    # Execute the container CMD under tini for better hygiene
+    exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
+    ;;
+  executor)
+    shift 1
+    CMD=(
+      ${JAVA_HOME}/bin/java
+      "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+      -Xms"$SPARK_EXECUTOR_MEMORY"
+      -Xmx"$SPARK_EXECUTOR_MEMORY"
+      -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+      org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+      --driver-url "$SPARK_DRIVER_URL"
+      --executor-id "$SPARK_EXECUTOR_ID"
+      --cores "$SPARK_EXECUTOR_CORES"
+      --app-id "$SPARK_APPLICATION_ID"
+      --hostname "$SPARK_EXECUTOR_POD_IP"
+      --resourceProfileId "$SPARK_RESOURCE_PROFILE_ID"
+      --podName "$SPARK_EXECUTOR_POD_NAME"
+    )
+    attempt_setup_fake_passwd_entry
+    # Execute the container CMD under tini for better hygiene
+    exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
+    ;;
+
+  *)
+    # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+    exec "$@"
+    ;;
+esac

@Yikun (Contributor Author) commented Jun 29, 2023

All comments addressed. Ready for review again.

@tianon requested a review from yosifkit on June 30, 2023 22:20
@Yikun (Contributor Author) commented Jul 17, 2023

@yosifkit Would you mind taking another look when you get a chance? Thanks!

@yosifkit merged commit 46ec7b4 into docker-library:master on Jul 19, 2023 (6 checks passed)
@HyukjinKwon commented:

awesome!

@zhengruifeng commented:

congrats! @Yikun

@gatorsmile commented:

Thank you!

@Yikun (Contributor Author) commented Jul 20, 2023

@yosifkit @tianon Many thanks for your help! I also ran a post-merge test, and it passed all tests!


Thanks all cc @gatorsmile @erikerlandson @HyukjinKwon @zhengruifeng @pan3793 @emiliofernandes @julien-faye

