[ZEPPELIN-4154] Build docker image for each interpreter #3769

Closed
wants to merge 1 commit into from

Contributor

@Reamer Reamer commented May 11, 2020

What is this PR for?

This PR provides new Dockerfiles to build a separate Zeppelin server image and interpreter images.

Some motivations:

  • conda is used to install all needed python and R packages
    • conda is able to manage a whole environment and both languages
  • files are used for installing python and R dependency
    • it's easy to customize the content of files. No need to change the Dockerfile itself.
    • a file has more readability
    • reduce dependency calculation
  • a zeppelin-distribution image is needed
    • you need to build Zeppelin only once, if you want to build new images in parallel on a build cluster

What type of PR is it?

Improvement

Todos

  • - Task

What is the Jira issue?

How should this be tested?

  • I built the images successfully

Questions:

  • Do the license files need an update? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? No

Member

@Leemoonsoo Leemoonsoo left a comment

It looks like a really good improvement. I left a few comments.

@@ -115,7 +115,7 @@ spec:
 path: nginx.conf
 containers:
 - name: zeppelin-server
-image: apache/zeppelin:0.9.0-SNAPSHOT
+image: apache/zeppelin-server:0.9.0-SNAPSHOT
Member

Currently, the official Docker images are built from scripts/docker/zeppelin/bin/Dockerfile.
Apache Infra configured that Dockerfile location for the automated build on release. (see this comment)

Do you think we can release images based on the new Dockerfiles in this pull request and remove scripts/docker/zeppelin/bin/Dockerfile?

Since /k8s/zeppelin-server.yaml points to the new Docker image names, I think it makes sense to release the new images as well.

Contributor Author

I think we can delete scripts/docker/zeppelin/bin/Dockerfile for the next release. First we should check if Docker and k8s are able to use a flexible interpreter image. At least k8s is currently not able to use a flexible interpreter image.

It makes sense to push these new images. How should we handle different compilation versions?
At the moment I'm compiling Zeppelin with the newest versions of Hadoop and Spark.

mvn -B package -DskipTests -Pbuild-distr -Pspark-3.0 -Phadoop3 -Pspark-scala-2.12 -Pweb-angular

Member

At least the Spark interpreter is binary-compatible with different Spark (and Hadoop) versions. Once built, it works with different versions without being rebuilt.

Contributor

How it all will work on non-K8S? Like, using Docker just for not installing anything to machine, and one image is more handy to work with

Contributor Author

> one image is more handy to work with

You are right, one or at least a small set of images is more practical for work. In fact, I currently have only three images (distribution image, server, (one large) interpreter image) in my K8s setup.

The title of ZEPPELIN-4154 and the first PR #3380 imply an image for each interpreter. This PR tries to solve the task.

In my opinion we should at least provide different images for the Zeppelin server and the Zeppelin interpreter. A distribution image is useful to build Zeppelin only once and copy the same version to the Zeppelin server and the Zeppelin interpreter.

My main goal for different images is to reduce the size and start time of images in a container cluster.
The size of my current Zeppelin image is only 410.95MB. The download and total startup time of the new instance is short.
My Zeppelin interpreter image is quite large (1.53 GB). The download time is quite long.

If we create an image for each interpreter, each image becomes smaller. All interpreter images should use the same base image so that they can share layers that may already be available locally.

> How it all will work on non-K8S? Like, using Docker just for not installing anything to machine

Docker can also set up a local network; in most cases this is done via a bridge network. The Zeppelin server needs access to the Docker daemon's TCP interface to create or modify the network, or at least needs to learn through that interface when new containers are created.
In my opinion, a Docker cluster via Docker Swarm should not fall within the scope of this project.

docker build -t zeppelin-interpreter-base -f Dockerfile_interpreter_base .
```

Build the image for the Zeppelin interpreter <interpreter_name>. By default, we use `scripts/docker/zeppelin-interpreter/Dockerfile` to build the interpreter image, but we also have some customized Dockerfiles under `scripts/docker/zeppelin-interpreter/<interpreter_name>`. For example, in official Apache Zeppelin, we provide 3 customized images for python, r, and spark.
Member

In Kubernetes, would it be possible to use a customized interpreter image (scripts/docker/zeppelin-interpreter/<interpreter_name>) for particular interpreters and fall back to the default interpreter image (scripts/docker/zeppelin-interpreter/Dockerfile) for all other interpreters?

Contributor Author

At least the interpreter image build process should use 'scripts/docker/zeppelin-interpreter/Dockerfile'. For K8s, all interpreter images should be available in the Docker registry. Fallback logic in the Zeppelin server should be avoided.
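
As a rough illustration of keeping the fallback in the build step rather than in the Zeppelin server, the choice between a customized and the default Dockerfile could be made while the images are built. This is only a sketch: the function name `pick_dockerfile`, the `DOCKER_DIR` variable and the image name in the usage comment are hypothetical, and the directory layout is assumed from the README text quoted in this thread.

```shell
#!/bin/sh
# Sketch only: pick_dockerfile and DOCKER_DIR are hypothetical names.
# The layout (one default Dockerfile plus optional per-interpreter
# subdirectories) is assumed from the README text under review.

DOCKER_DIR="${DOCKER_DIR:-scripts/docker/zeppelin-interpreter}"

# Print the customized Dockerfile for the given interpreter if one exists,
# otherwise print the default Dockerfile.
pick_dockerfile() {
    name="$1"
    if [ -f "${DOCKER_DIR}/${name}/Dockerfile" ]; then
        printf '%s\n' "${DOCKER_DIR}/${name}/Dockerfile"
    else
        printf '%s\n' "${DOCKER_DIR}/Dockerfile"
    fi
}

# Possible usage (not executed here; the image name is an assumption):
#   docker build -t zeppelin-interpreter-python -f "$(pick_dockerfile python)" .
```

With this approach every interpreter still ends up with its own image in the registry, so the server never has to decide which image to fall back to.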

# Copy interpreter-settings for activate interpreter
# COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/spark/interpreter-setting.json ${Z_HOME}/interpreter/spark/interpreter-setting.json
# COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/jdbc/interpreter-setting.json ${Z_HOME}/interpreter/jdbc/interpreter-setting.json
# COPY --from=zeppelin-distribution /opt/zeppelin/interpreter/md/interpreter-setting.json ${Z_HOME}/interpreter/md/interpreter-setting.json
Member

Do you see any side effects if we just copy all /opt/zeppelin/interpreter/*/interpreter-setting.json files from zeppelin-distribution?

Contributor Author

I see one side effect: the Zeppelin server then shows a long list of interpreters, and not all of them may work.

Member

Does this mean that editing these lines is a mandatory step before building an image from this Dockerfile?

Contributor Author

It depends on how many images the Zeppelin project should maintain. In my opinion, Zeppelin should show how to extend an image and should itself maintain only a small set of interpreters.
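
One way to keep the set of enabled interpreters explicit without hand-editing the Dockerfile might be to generate the COPY lines from a short list. The following is a minimal sketch, not what the PR does: the function name `emit_copy_lines` is hypothetical, and the paths simply mirror the commented-out lines in the Dockerfile under review.

```shell
#!/bin/sh
# Sketch only: emit_copy_lines is a hypothetical helper. The source and
# destination paths mirror the commented-out COPY lines shown above.

# Print one COPY instruction per interpreter name given as an argument.
emit_copy_lines() {
    for name in "$@"; do
        src="/opt/zeppelin/interpreter/${name}/interpreter-setting.json"
        # \$ keeps ${Z_HOME} literal so Docker expands it, not this script.
        dst="\${Z_HOME}/interpreter/${name}/interpreter-setting.json"
        printf '%s\n' "COPY --from=zeppelin-distribution ${src} ${dst}"
    done
}

# Print the lines for the three interpreters named in the Dockerfile.
emit_copy_lines spark jdbc md
```

Listing only the interpreters that an image really supports avoids the side effect mentioned above, where the server offers interpreters that cannot run.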

Dockerfile Outdated
find . -maxdepth 1 -type d ! -iname zeppelin-distribution -exec rm -rf {} \;

FROM ubuntu:18.04
COPY --from=builder /workspace/zeppelin/zeppelin-distribution/target/zeppelin-0.9.0-SNAPSHOT/zeppelin-0.9.0-SNAPSHOT /opt/zeppelin
Contributor

Maybe make it version-independent? Either pass the version explicitly or infer it somehow.

Contributor Author

Thanks for your comment. I fixed it.
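
The thread does not show how the fix was made. As a hedged sketch of the "infer somehow" option, the version could be derived from the name of the unpacked distribution directory. The function name `infer_version` is hypothetical, and the `zeppelin-<version>` directory naming is assumed from the COPY line above.

```shell
#!/bin/sh
# Sketch only: infer_version is a hypothetical helper. It assumes the build
# leaves exactly one directory named zeppelin-<version> in the target dir.

# Print the version taken from the first zeppelin-* directory found.
# Returns non-zero and prints nothing if no such directory exists.
infer_version() {
    target_dir="$1"
    # The trailing slash restricts the glob to directories, so archives
    # such as zeppelin-<version>.tar.gz are ignored.
    for dir in "${target_dir}"/zeppelin-*/; do
        [ -d "$dir" ] || continue
        base="$(basename "$dir")"
        printf '%s\n' "${base#zeppelin-}"
        return 0
    done
    return 1
}

# Possible usage (not executed here):
#   version="$(infer_version zeppelin-distribution/target)"
```

The other option raised in the comment, passing the version explicitly, would correspond to a build argument supplied with `docker build --build-arg`.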

Contributor Author

Reamer commented Jul 30, 2020

Closed in favour of #3859

@Reamer Reamer closed this Jul 30, 2020
asfgit pushed a commit that referenced this pull request Aug 24, 2020
### What is this PR for?
This PR provides new Dockerfiles to build a Zeppelin server and interpreter image. This PR tries to solve #3769 (comment)
I agree with alexott that multiple interpreter images are difficult to handle. Currently neither the Docker launcher nor the K8s launcher supports switching to other interpreter images.

Some motivations:
 - conda is used to install all needed python and R packages
   - conda is able to manage a whole environment and both languages
 - files are used for installing python and R dependency
   - it's easy to customize the content of files. No need to change the Dockerfile itself.
   - a file has more readability
   - reduce dependency calculation
 - a zeppelin-distribution image is needed
   - you need to build Zeppelin only once, if you want to build new images in parallel on a build cluster

### What type of PR is it?
Improvement

### Todos
* [ ] - Task

### What is the Jira issue?
* https://jira.apache.org/jira/browse/ZEPPELIN-4978

### How should this be tested?
* Build the images

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Philipp Dallig <philipp.dallig@gmail.com>

Closes #3859 from Reamer/three_images and squashes the following commits:

300fc0c [Philipp Dallig] Give the Zeppelin interpreter more rights for folders
175566c [Philipp Dallig] Correct Zeppelin home
d221181 [Philipp Dallig] Update miniconda to py38_4.8.3
9ffb77c [Philipp Dallig] Disallow the modification of conda packages
336ffd7 [Philipp Dallig] Use only one conda file to resolve dependencies only once
38c3d57 [Philipp Dallig] Using tini from ubuntu package repository
b8d3143 [Philipp Dallig] Correct typo
8fa6f80 [Philipp Dallig] Update to Ubuntu 20.04
e095587 [Philipp Dallig] Fix download of additional interpreter packages
f0585b9 [Philipp Dallig] Allow to edit conda packages
1a1211e [Philipp Dallig] Build only three images
asfgit pushed a commit that referenced this pull request Aug 24, 2020
(cherry picked from commit 90bfb72)
Signed-off-by: Philipp Dallig <philipp.dallig@gmail.com>
prabhjyotsingh pushed a commit to prabhjyotsingh/zeppelin that referenced this pull request Sep 1, 2020
@Reamer Reamer deleted the ZEPPELIN-4154 branch September 21, 2020 11:17
3 participants