[ZEPPELIN-4154] Build docker image for each interpreter #3769
It looks like a really good improvement. I left a few comments.
@@ -115,7 +115,7 @@ spec:
 path: nginx.conf
 containers:
 - name: zeppelin-server
-image: apache/zeppelin:0.9.0-SNAPSHOT
+image: apache/zeppelin-server:0.9.0-SNAPSHOT
Currently, the official Docker images are built from `scripts/docker/zeppelin/bin/Dockerfile`.
Apache Infra configured that Dockerfile location for the automated build on release (see this comment).
Do you think we can release images based on the new Dockerfiles in this pull request and remove `scripts/docker/zeppelin/bin/Dockerfile`?
Since `/k8s/zeppelin-server.yaml` now points to the new Docker image names, I think it makes sense to release the new images as well.
I think we can delete `scripts/docker/zeppelin/bin/Dockerfile` for the next release. First we should check whether Docker and K8s are able to use a flexible interpreter image. At least K8s is currently not able to use one.
It makes sense to push these new images. How should we handle different compilation versions?
At the moment I'm compiling Zeppelin with the newest versions of Hadoop and Spark:
mvn -B package -DskipTests -Pbuild-distr -Pspark-3.0 -Phadoop3 -Pspark-scala-2.12 -Pweb-angular
At least the Spark interpreter is binary-compatible with different Spark (and Hadoop) versions. Once built, it works with different versions without being rebuilt.
How will it all work on non-K8s? Like, when using Docker just to avoid installing anything on the machine, where one image is more handy to work with.
> one image is more handy to work with
You are right, one or at least a small set of images is more practical for work. In fact, I currently have only three images (distribution image, server, (one large) interpreter image) in my K8s setup.
The title of ZEPPELIN-4154 and the first PR #3380 imply an image for each interpreter. This PR tries to solve the task.
In my opinion we should at least provide different images for the Zeppelin server and the Zeppelin interpreter. A distribution image is useful to build Zeppelin only once and copy the same version to the Zeppelin server and the Zeppelin interpreter.
My main goal for different images is to reduce the size and start time of images in a container cluster.
The size of my current Zeppelin image is only 410.95MB. The download and total startup time of the new instance is short.
My Zeppelin interpreter image is quite large (1.53 GB). The download time is quite long.
If we create an image for each interpreter, each image gets smaller. All interpreter images should use the same base image so that they can benefit from layers that may already be available.
> How it all will work on non-K8S? Like, using Docker just for not installing anything to machine
Docker can also set up a local network; in most cases this is done via a bridge network. The Zeppelin server needs access to create or modify the network via the Docker daemon's TCP interface, or at least needs to be informed via the TCP interface when new containers are created.
In my opinion, a docker cluster via a docker swarm should not fall within the scope of this project.
```
docker build -t zeppelin-interpreter-base -f Dockerfile_interpreter_base .
```

Build the image for Zeppelin interpreter `<interpreter_name>`. By default, we use `scripts/docker/zeppelin-interpreter/Dockerfile` to build the interpreter image, but we also have some customized Dockerfiles under `scripts/docker/zeppelin-interpreter/<interpreter_name>`. For example, official Apache Zeppelin provides 3 customized images for python, r, and spark.
In Kubernetes, would it be possible to use a customized interpreter image (`scripts/docker/zeppelin-interpreter/<interpreter_name>`) for particular interpreters and fall back to the default interpreter image (`scripts/docker/zeppelin-interpreter/Dockerfile`) for all other interpreters?
At least the interpreter image build process should use `scripts/docker/zeppelin-interpreter/Dockerfile`. For K8s you should have all interpreter images in the Docker registry. Fallback logic in the Zeppelin server should be avoided.
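The build-time choice discussed in this thread (a customized Dockerfile where one exists, the default Dockerfile otherwise) could be sketched in shell roughly as follows. This is only an illustration, not the PR's build script: the function names, the assumption that a customized directory contains a file named `Dockerfile`, and the `zeppelin-interpreter-<name>` image tag are all hypothetical. The command is printed rather than executed so the choice can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: print the Dockerfile that would be used for an interpreter.
# Assumes a customized Dockerfile, if present, lives at <docker_dir>/<interpreter>/Dockerfile.
select_dockerfile() {
  docker_dir=$1
  interpreter=$2
  custom="$docker_dir/$interpreter/Dockerfile"
  if [ -f "$custom" ]; then
    printf '%s\n' "$custom"
  else
    # No customized Dockerfile: use the default interpreter Dockerfile.
    printf '%s\n' "$docker_dir/Dockerfile"
  fi
}

# Hypothetical helper: print (not run) the docker build command for an interpreter.
# The image tag scheme is an assumption made for this sketch.
build_command() {
  docker_dir=$1
  interpreter=$2
  printf 'docker build -t zeppelin-interpreter-%s -f %s .\n' \
    "$interpreter" "$(select_dockerfile "$docker_dir" "$interpreter")"
}
```

Keeping the fallback in the build step means every interpreter still ends up with its own image in the registry, so the Zeppelin server itself needs no fallback logic.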
# Copy interpreter-settings for activate interpreter
# COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/spark/interpreter-setting.json ${Z_HOME}/interpreter/spark/interpreter-setting.json
# COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/jdbc/interpreter-setting.json ${Z_HOME}/interpreter/jdbc/interpreter-setting.json
# COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/md/interpreter-setting.json ${Z_HOME}/interpreter/md/interpreter-setting.json
Do you see any side effects if we just copy all `/opt/zeppelin/interpreter/*/interpreter-setting.json` files from zeppelin-distribution?
I see the following side effect: the Zeppelin server would show a long list of interpreters, and not all of them may work.
Does that mean editing these lines is a mandatory step before building an image with this Dockerfile?
It depends on how many images the Zeppelin project should maintain. In my opinion, Zeppelin should show how to extend an image and should only maintain a small set of interpreters.
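As a middle ground between the commented-out per-interpreter `COPY` lines and copying every `interpreter-setting.json`, the selection could be driven by an explicit list. The following is a minimal shell sketch under that assumption; the function name and the idea of running it in a build step are hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: copy interpreter-setting.json only for the named interpreters.
# Usage: copy_interpreter_settings <src_interpreter_dir> <dest_interpreter_dir> <name>...
copy_interpreter_settings() {
  src=$1    # e.g. the interpreter directory of the unpacked distribution
  dest=$2   # e.g. the interpreter directory of the server image
  shift 2
  for name in "$@"; do
    setting="$src/$name/interpreter-setting.json"
    if [ ! -f "$setting" ]; then
      # Report and skip interpreters that are not part of the distribution.
      echo "skipping $name: $setting not found" >&2
      continue
    fi
    mkdir -p "$dest/$name"
    cp "$setting" "$dest/$name/interpreter-setting.json"
  done
}
```

Called with, say, `spark jdbc md`, only those three interpreters would be listed by the server, which avoids the long list of possibly non-working interpreters mentioned above.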
Dockerfile
Outdated
find . -maxdepth 1 -type d ! -iname zeppelin-distribution -exec rm -rf {} \;

FROM ubuntu:18.04
COPY --from=builder /workspace/zeppelin/zeppelin-distribution/target/zeppelin-0.9.0-SNAPSHOT/zeppelin-0.9.0-SNAPSHOT /opt/zeppelin
Maybe try to make it version-independent? Either pass the version explicitly or infer it somehow.
Thanks for your comment. I fixed it.
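For illustration, one way the version could be inferred instead of hard-coded is to look for the single `zeppelin-<version>` directory in the distribution's `target` folder. This is only a sketch of the reviewer's suggestion and not necessarily how the fix was implemented; the function name is hypothetical, and it assumes exactly one such directory exists.

```shell
#!/bin/sh
# Hypothetical helper: print the Zeppelin version inferred from the build output.
# Assumes exactly one directory named zeppelin-<version> exists under target_dir.
infer_zeppelin_version() {
  target_dir=$1
  count=0
  found=
  # The trailing slash makes the glob match directories only (not the .tar.gz).
  for dir in "$target_dir"/zeppelin-*/; do
    [ -d "$dir" ] || continue
    count=$((count + 1))
    found=${dir%/}
  done
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one zeppelin-* directory in $target_dir, found $count" >&2
    return 1
  fi
  name=${found##*/}                  # strip the leading path
  printf '%s\n' "${name#zeppelin-}"  # strip the "zeppelin-" prefix
}
```

The printed value could then be passed explicitly with `docker build --build-arg`, assuming the Dockerfile declares a matching `ARG`.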
Closed in favour of #3859
### What is this PR for?
This PR provides new Dockerfiles to build a Zeppelin server and interpreter image. This PR tries to solve #3769 (comment)

I agree with alexott that multiple interpreter images are difficult to handle. Currently neither the docker launcher nor the K8s launcher support switching to other interpreter images.

Some motivations:
- conda is used to install all needed python and R packages
- conda is able to manage a whole environment and both languages
- files are used for installing python and R dependency
- it's easy to customize the content of files. No need to change the Dockerfile itself.
- a file has more readability
- reduce dependency calculation
- a zeppelin-distribution image is needed
- you need to build Zeppelin only once, if you want to build new images in parallel on a build cluster

### What type of PR is it?
Improvement

### Todos
* [ ] - Task

### What is the Jira issue?
* https://jira.apache.org/jira/browse/ZEPPELIN-4978

### How should this be tested?
* Build the images

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Philipp Dallig <philipp.dallig@gmail.com>

Closes #3859 from Reamer/three_images and squashes the following commits:

300fc0c [Philipp Dallig] Give the Zeppelin interpreter more rights for folders
175566c [Philipp Dallig] Correct Zeppelin home
d221181 [Philipp Dallig] Update miniconda to py38_4.8.3
9ffb77c [Philipp Dallig] Disallow the modification of conda packages
336ffd7 [Philipp Dallig] Use only one conda file to resolve dependencies only once
38c3d57 [Philipp Dallig] Using tini from ubuntu package repository
b8d3143 [Philipp Dallig] Correct typo
8fa6f80 [Philipp Dallig] Update to Ubuntu 20.04
e095587 [Philipp Dallig] Fix download of additional interpreter packages
f0585b9 [Philipp Dallig] Allow to edit conda packages
1a1211e [Philipp Dallig] Build only three images
What is this PR for?
This PR provides new Dockerfiles to build a separate Zeppelin server image and interpreter images.
Some motivations:
What type of PR is it?
Improvement
Todos
What is the Jira issue?
How should this be tested?
Questions: