
(WIP) Adds Dockerfile with containerized spark #55

Closed · wants to merge 7 commits

Conversation

aaronsteers (Contributor)

This came up as part of #31, as an idea to dockerize dbt and Spark together so that local testing and execution can be run from the container itself.

As noted in the title, this is still a Work-in-Progress but I wanted to open this PR here to start gathering feedback.

I believe the docker image is close to working, but there is a dependency on my external python library slalom.dataops, which is responsible for launching the spark cluster itself. Also, spark doesn't behave well when using the default derby database, so this image installs mysql for use as the hive metastore.

All feedback and ideas welcome.

Thanks!

jtcohen6 (Contributor)

@aaronsteers This is very cool! I'd love a better approach for folks to be able to run integration tests locally. This would put us in a similar vein to postgres and presto.

@beckjake Would you be up to take a look at this?

beckjake (Contributor)

This is a great idea! I've got some suggestions but if we can't do any of this, ship it anyway IMO - better to have any kind of local test env than none.

Can we base this on python:3.8.1-slim-buster? That's what I'm using in dbt-core and what the dbt-presto PR is using; I'd love to have them all based on the same thing. If you need things from the "full" install, it's also fine to leave it this way.

You might also want to set some env values, either in the dockerfile or the bootstrap script. They tend to minimize the risk of python losing its mind about weird file encoding things: PYTHONIOENCODING=utf-8 and LANG=C.UTF-8.
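For reference, a minimal sketch combining those two suggestions (using exactly the base tag and env values mentioned above; nothing else is implied):

```dockerfile
# Sketch only: the suggested base image plus the encoding-related env vars.
FROM python:3.8.1-slim-buster

# Reduce the chance of locale/encoding surprises inside the container.
ENV PYTHONIOENCODING=utf-8 \
    LANG=C.UTF-8
```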

Instead of running docker inside docker, would it make more sense to set up a docker-compose.yml file and put this container + the spark container in a network together, etc? If that makes no sense/sounds implausibly hard, no worries. I just think docker-within-docker is usually some kind of smell (at least it used to require elevated permissions for the host container, or is that fixed?)
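For illustration, a compose layout along those lines might look roughly like this (the service names, image name, and port are assumptions, not anything in this PR):

```yaml
# docker-compose.yml (sketch): the dbt container and a Spark container on one compose network.
version: "3.7"
services:
  dbt-spark:
    build: .                  # the image from this PR's Dockerfile
    depends_on:
      - spark
  spark:
    image: some-spark-image   # placeholder for any image that runs the Spark Thrift server
    ports:
      - "10000:10000"         # Thrift/JDBC port that dbt-spark would connect to
```

On the default compose network, the dbt container can then reach the Thrift server at the hostname spark instead of localhost.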

aaronsteers (Contributor, Author) commented Feb 27, 2020

@beckjake - Thanks for the detailed thoughts and suggestions. My comments are inline below.

> Can we base this on python:3.8.1-slim-buster? That's what I'm using in dbt-core and what the dbt-presto PR is using; I'd love to have them all based on the same thing. If you need things from the "full" install, it's also fine to leave it this way.

Spark has a bunch of dependencies and I'm not sure if it would work on the slim-buster base, but I will definitely try!

> You might also want to set some env values, either in the dockerfile or the bootstrap script. They tend to minimize the risk of python losing its mind about weird file encoding things: PYTHONIOENCODING=utf-8 and LANG=C.UTF-8.

No worries at all. Happy to add those two env vars - and any others you would suggest.

> Instead of running docker inside docker, would it make more sense to set up a docker-compose.yml file and put this container + the spark container in a network together, etc? If that makes no sense/sounds implausibly hard, no worries. I just think docker-within-docker is usually some kind of smell (at least it used to require elevated permissions for the host container, or is that fixed?)

Regarding a single docker image vs the docker-compose approach, I can see advantages on both sides. A single docker container seems much easier to manage for testing a 'standalone' execution, but I also see the added complexity. There are probably two decisions to make here: (1) whether we really need to call docker from inside docker (I think we don't), and (2) whether it's better to have a single fully-powered image or two slimmer-but-dependent images (or to support both options).

What I can reasonably commit to right away is to review the code and see if we really need to call docker from within docker. If this isn't directly needed, shall I remove the install of the docker library, or would it be helpful to still have it "onboard" the image? Any strong preference?

Thanks!

beckjake (Contributor)

> What I can reasonably commit to right away is to review the code and see if we really need to call docker from within docker.

Sure, that sounds totally reasonable

> If this isn't directly needed, shall I remove the install of the docker library, or would it be helpful to still have it "onboard" the image? Any strong preference?

I think if we don't need docker-within-docker, that would be ideal!

If we do need to use docker-within-docker, or if it's even just substantially easier, let's keep it as-is. I'm definitely not enough of a docker guru to have strong opinions about docker here!

aaronsteers (Contributor, Author) commented Feb 28, 2020

@beckjake - Done and done!

I've removed the docker reference from the Dockerfile and added the mentioned environment variables.

I'd also like to add at least a basic CI test, if that's okay, and to test with the base image python:3.8.1-slim-buster as discussed.

dmateusp (Contributor) left a comment


I have a couple of questions, but I like the direction this is taking. Thanks for taking the time to share it!

Review thread on Dockerfile:

```dockerfile
SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
PATH=$PATH:$SPARK_HOME/bin

# Install mysql server and drivers (for hive metastore)
```
dmateusp (Contributor)


Should we separate the mysql bit out into a docker-compose?

aaronsteers (Contributor, Author)


I would love to remove the mysql dependency, but I don't feel good about the default experience when it is omitted. I had previously leaned on the default 'derby' implementation for the spark metastore, but the problem is that derby falls down as soon as you try to run thrift, which is currently a hard requirement for dbt-spark.

dmateusp (Contributor)


But right now we have the MySQL instance running directly inside the same container, right? I'm just wondering if we should remove just the server part of it, use an existing MySQL docker image, and connect the two through docker-compose.

aaronsteers (Contributor, Author)


@dmateusp - Yes, we can certainly do that - remove the mysql server part and use a mysql docker image - but (rightly or wrongly) I was really hoping to have a single image "just work" without introducing the extra overhead of docker-compose and managing multiple container lifecycles. This might be misguided on my part, but it still seems that would be valuable for a number of use cases.

If it is important to remove the mysql dependency (moving it outside the container), what if we take a two-pronged approach here:

  1. As proposed, we remove the mysql-server aspect from this base Dockerfile and instead refactor it as a docker-compose.yml.
  2. To support any use cases where having a single standalone docker image is important, I can wrap the core image in a second downstream Dockerfile which comes with mysql bootstrapped into the image as the built-in metadata store and "just works" out of the box when executed directly via docker run (see the sketch below).

@dmateusp - Would this be an okay approach? We can provide a viable docker run experience (using the second image) while also honoring the fact that docker-compose.yml is "the right way" to support RDBMS backends. (In the long run, I would love to find a solution that can host thrift locally without requiring mysql.)
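A rough sketch of what that second, standalone image might look like (the image name, package choice, and entrypoint script below are all hypothetical, not part of this PR):

```dockerfile
# Sketch: a downstream "standalone" image that bundles a MySQL-compatible server
# so that `docker run` works without docker-compose. All names are illustrative.
FROM dbt-spark-core:latest

# Install a MySQL-compatible server to back the Hive metastore inside the container.
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-mysql-server && \
    rm -rf /var/lib/apt/lists/*

# A wrapper entrypoint would start the database and the Thrift server before handing off.
COPY entrypoint-standalone.sh /usr/local/bin/entrypoint-standalone.sh
ENTRYPOINT ["/usr/local/bin/entrypoint-standalone.sh"]
```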

dmateusp (Contributor)


Interesting! I've often found that docker-compose lets me hide complexity from users, because I can set environment variables and ports, and mount volumes, by default.

The reason I like having databases in a separate container (even for local dev environments) is that they can be parameterized easily. If you look at the mysql image, for example (https://hub.docker.com/_/mysql), you can see there's complete documentation telling you where volumes should be mounted to persist the data and what environment variables are available.
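For instance, a metastore database service along those lines could be roughly this small (the service name, credentials, and database name here are illustrative):

```yaml
# Sketch: the Hive-metastore backend as its own compose service, using the official
# mysql image's documented environment variables and data directory.
services:
  metastore-db:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: example       # illustrative credentials only
      MYSQL_DATABASE: hive_metastore
    volumes:
      - metastore-data:/var/lib/mysql    # documented data path of the mysql image
volumes:
  metastore-data:
```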

I think we might lack an actual example usage to compare set-ups. Do you think you could add just one integration test and/or a README example section?

aaronsteers (Contributor, Author)


@dmateusp - Absolutely, that sounds like a great next step. I'll focus on adding some type of integration test that calls dbt-spark, and a readme with usage example. Thanks!

Review thread on Dockerfile:

```dockerfile
chmod -R 777 $HADOOP_HOME && \
rm hadoop-$HADOOP_MINOR_VERSION.tar.gz

# Copy hadoop libraries for AWS & S3 to spark classpath
```
dmateusp (Contributor)


Should we make that optional? Maybe we could use a Dockerfile.template?

aaronsteers (Contributor, Author)


I'm not familiar with Dockerfile.template. I do think that both the AWS and Azure libraries (jars) are needed in the 'base/core' image. They enable core functionality that would otherwise be a blocker for even basic usage and testing.

One simplification I could make here, if we didn't want the rest of the hadoop binaries, is to use a multi-stage build: download the hadoop binaries in an ephemeral image and then copy only the files we need into the spark classpath.
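A minimal sketch of that multi-stage idea (the versions, paths, and SPARK_HOME value below are illustrative and would need to match whatever the final image actually pins):

```dockerfile
# Stage 1 (ephemeral): download and unpack the full Hadoop distribution.
FROM debian:buster-slim AS hadoop
ARG HADOOP_VERSION=2.8.5
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      | tar -xz -C /opt && \
    mv /opt/hadoop-${HADOOP_VERSION} /opt/hadoop

# Stage 2 (final): copy only the AWS/S3 client jars onto Spark's classpath;
# the Hadoop tarball itself never lands in the final image.
FROM python:3.8.1-slim-buster
# SPARK_HOME here is a placeholder; the real Dockerfile sets its own value.
ENV SPARK_HOME=/usr/local/spark
COPY --from=hadoop /opt/hadoop/share/hadoop/tools/lib/hadoop-aws-*.jar \
                   /opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-*.jar \
                   ${SPARK_HOME}/jars/
```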

Thoughts?

dmateusp (Contributor)


I just imagine that AWS users might not need the Azure / GCP jars and vice versa. Or they might not need any of them if they are just using the local filesystem / HDFS?

In past projects, I have seen people use tokens inside their Dockerfile that they replace with sed to customize a "base" Dockerfile. We could also have a base docker image and then aws, azure, and gcp images built from it.
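To make that second idea concrete, a per-cloud variant could be as thin as the following (the base image name and jar locations are assumptions):

```dockerfile
# Sketch: a thin AWS-only variant layered on a cloud-agnostic base image.
# An Azure or GCP variant would copy its own connector jars instead.
FROM dbt-spark-base:latest

# Relies on SPARK_HOME being set in the base image; jars/ is a build-context
# directory holding the connector jars for this variant.
COPY jars/hadoop-aws-*.jar jars/aws-java-sdk-*.jar ${SPARK_HOME}/jars/
```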

All that said, this PR should probably just focus on providing a local environment (through Docker) that is good enough for (integration) testing; we could look into making the image smaller later.

Anyways! I'm going to play around with this image over the weekend and give you more feedback :)

dmateusp (Contributor) commented Mar 1, 2020

@aaronsteers I made some suggestions in a PR to your branch: aaronsteers#3 - hopefully you find it useful!

aaronsteers (Contributor, Author)

> @aaronsteers I made some suggestions in a PR to your branch: aaronsteers#3 - hopefully you find it useful!

Absolutely - thanks very much. As mentioned on the new PR, I think it makes sense to consolidate efforts on #58. I'll close this PR, at least for now, in favor of #58, and we can re-open or resurface items from here as and when needed. Thanks!

aaronsteers closed this on Mar 9, 2020