
(WIP) Adds Dockerfile with containerized spark #55

Closed · wants to merge 7 commits

Conversation

aaronsteers (Contributor)

This came up as part of #31, as an idea to dockerize dbt and Spark together so that local testing and execution can be run from the container itself.

As noted in the title, this is still a Work-in-Progress but I wanted to open this PR here to start gathering feedback.

I believe the docker image is close to working, but there is a dependency on my external python library slalom.dataops, which is responsible for launching the spark cluster itself. Also, spark doesn't behave well when using the default derby database, so this image installs mysql for use as the hive metastore.

All feedback and ideas welcome.

Thanks!

jtcohen6 (Contributor)

@aaronsteers This is very cool! I'd love a better approach for folks to be able to run integration tests locally. This would put us in a similar vein to postgres and presto.

@beckjake Would you be up to take a look at this?

beckjake (Contributor)

This is a great idea! I've got some suggestions but if we can't do any of this, ship it anyway IMO - better to have any kind of local test env than none.

Can we base this on python:3.8.1-slim-buster? That's what I'm using in dbt-core and what the dbt-presto PR is using; I'd love to have them all based on the same thing. If you need things from the "full" install, it's also fine to leave it this way.

You might also want to set some env values, either in the dockerfile or the bootstrap script. They tend to minimize the risk of python losing its mind about weird file encoding things: PYTHONIOENCODING=utf-8 and LANG=C.UTF-8.
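For reference, a minimal sketch combining those two suggestions (using exactly the base tag and env values mentioned above; nothing else is implied):

```dockerfile
# Sketch only: the suggested base image plus the encoding-related env vars.
FROM python:3.8.1-slim-buster

# Reduce the chance of locale/encoding surprises inside the container.
ENV PYTHONIOENCODING=utf-8 \
    LANG=C.UTF-8
```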

Instead of running docker inside docker, would it make more sense to set up a docker-compose.yml file and put this container + the spark container in a network together, etc? If that makes no sense/sounds implausibly hard, no worries. I just think docker-within-docker is usually some kind of smell (at least it used to require elevated permissions for the host container, or is that fixed?)
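For illustration, a compose layout along those lines might look roughly like this (the service names, image name, and port are assumptions, not anything in this PR):

```yaml
# docker-compose.yml (sketch): the dbt container and a Spark container on one compose network.
version: "3.7"
services:
  dbt-spark:
    build: .                  # the image from this PR's Dockerfile
    depends_on:
      - spark
  spark:
    image: some-spark-image   # placeholder for any image that runs the Spark Thrift server
    ports:
      - "10000:10000"         # Thrift/JDBC port that dbt-spark would connect to
```

On the default compose network, the dbt container can then reach the Thrift server at the hostname spark instead of localhost.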

aaronsteers (Contributor, Author) commented Feb 27, 2020

@beckjake - Thanks for the detailed thoughts and suggestions. My comments are inline below.

> Can we base this on python:3.8.1-slim-buster? That's what I'm using in dbt-core and what the dbt-presto PR is using; I'd love to have them all based on the same thing. If you need things from the "full" install, it's also fine to leave it this way.

Spark has a bunch of dependencies and I'm not sure if it would work on the slim-buster base, but I will definitely try!

> You might also want to set some env values, either in the dockerfile or the bootstrap script. They tend to minimize the risk of python losing its mind about weird file encoding things: PYTHONIOENCODING=utf-8 and LANG=C.UTF-8.

No worries at all. Happy to add those two env vars - and any others you would suggest.

> Instead of running docker inside docker, would it make more sense to set up a docker-compose.yml file and put this container + the spark container in a network together, etc? If that makes no sense/sounds implausibly hard, no worries. I just think docker-within-docker is usually some kind of smell (at least it used to require elevated permissions for the host container, or is that fixed?)

Regarding a single docker image vs the docker-compose approach, I can see advantages on both sides. A single docker container seems much easier to manage for testing a 'standalone' execution, but I also see the added complexity. There are probably two decisions to make here: (1) whether we really need to call docker from inside docker (I think we don't), and (2) whether it's better to have a single fully-powered image or two slimmer-but-dependent images (or to support both options).

What I can reasonably commit to right away is to review the code and see if we really need to call docker from within docker. If this isn't directly needed, shall I remove the install of the docker library, or would it be helpful to still have it "onboard" the image? Any strong preference?

Thanks!

beckjake (Contributor)

> What I can reasonably commit to right away is to review the code and see if we really need to call docker from within docker.

Sure, that sounds totally reasonable

> If this isn't directly needed, shall I remove the install of the docker library, or would it be helpful to still have it "onboard" the image? Any strong preference?

I think if we don't need docker-within-docker, that would be ideal!

If we do need to use docker-within-docker, or if it's even just substantially easier, let's keep it as-is. I'm definitely not enough of a docker guru to have strong opinions about docker here!

aaronsteers (Contributor, Author) commented Feb 28, 2020

@beckjake - Done and done!

I've removed the docker reference from the Dockerfile and added the mentioned environment variables.

I'd also like to add at least a basic CI test, if that's okay, and to test with the base image python:3.8.1-slim-buster as discussed.

dmateusp (Contributor) left a comment


I have a couple of questions, but I like the direction this is taking. Thanks for taking the time to share it!

Review thread on Dockerfile:

```dockerfile
SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
PATH=$PATH:$SPARK_HOME/bin

# Install mysql server and drivers (for hive metastore)
```
dmateusp (Contributor)


Should we separate the mysql bit out into a docker-compose?

aaronsteers (Contributor, Author)


I would love to remove the mysql dependency, but I don't feel good about the default experience when it is omitted. I had previously leaned on the default 'derby' implementation for the spark metastore, but the problem is that derby falls down as soon as you try to run thrift, which is currently a hard requirement for dbt-spark.

dmateusp (Contributor)


But right now we have the MySQL instance running directly inside the same container, right? I'm just wondering if we should remove just the server part of it, use an existing MySQL docker image, and connect the two through docker-compose.

aaronsteers (Contributor, Author)


@dmateusp - Yes, we can certainly do that - remove the mysql server part and use a mysql docker image - but (rightly or wrongly) I was really hoping to have a single image "just work" without introducing the extra overhead of docker-compose and managing multiple container lifecycles. This might be misguided on my part, but it still seems that would be valuable for a number of use cases.

If it is important to remove the mysql dependency (moving it outside the container), what if we take a two-pronged approach here:

  1. As proposed, we remove the mysql-server aspect from this base Dockerfile and instead refactor it as a docker-compose.yml.
  2. To support any use cases where having a single standalone docker image is important, I can wrap the core image in a second downstream Dockerfile which comes with mysql bootstrapped into the image as the built-in metadata store and "just works" out of the box when executed directly via docker run (see the sketch below).

@dmateusp - Would this be an okay approach? We can provide a viable docker run experience (using the second image) while also honoring the fact that docker-compose.yml is "the right way" to support RDBMS backends. (In the long run, I would love to find a solution that can host thrift locally without requiring mysql.)
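A rough sketch of what that second, standalone image might look like (the image name, package choice, and entrypoint script below are all hypothetical, not part of this PR):

```dockerfile
# Sketch: a downstream "standalone" image that bundles a MySQL-compatible server
# so that `docker run` works without docker-compose. All names are illustrative.
FROM dbt-spark-core:latest

# Install a MySQL-compatible server to back the Hive metastore inside the container.
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-mysql-server && \
    rm -rf /var/lib/apt/lists/*

# A wrapper entrypoint would start the database and the Thrift server before handing off.
COPY entrypoint-standalone.sh /usr/local/bin/entrypoint-standalone.sh
ENTRYPOINT ["/usr/local/bin/entrypoint-standalone.sh"]
```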

dmateusp (Contributor)


Interesting! I've often found that docker-compose lets me hide complexity from users, because I can set environment variables and ports, and mount volumes, by default.

The reason I like having databases in a separate container (even for local dev environments) is that they can be parameterized easily. If you look at the mysql image, for example (https://hub.docker.com/_/mysql), you can see there's complete documentation telling you where volumes should be mounted to persist the data and what environment variables are available.
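For instance, a metastore database service along those lines could be roughly this small (the service name, credentials, and database name here are illustrative):

```yaml
# Sketch: the Hive-metastore backend as its own compose service, using the official
# mysql image's documented environment variables and data directory.
services:
  metastore-db:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: example       # illustrative credentials only
      MYSQL_DATABASE: hive_metastore
    volumes:
      - metastore-data:/var/lib/mysql    # documented data path of the mysql image
volumes:
  metastore-data:
```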

I think we might lack an actual example usage to compare set-ups. Do you think you could add just one integration test and/or a README example section?

aaronsteers (Contributor, Author)


@dmateusp - Absolutely, that sounds like a great next step. I'll focus on adding some type of integration test that calls dbt-spark, and a readme with usage example. Thanks!

Review thread on Dockerfile:

```dockerfile
chmod -R 777 $HADOOP_HOME && \
rm hadoop-$HADOOP_MINOR_VERSION.tar.gz

# Copy hadoop libraries for AWS & S3 to spark classpath
```
dmateusp (Contributor)


Should we make that optional? Maybe we could use a Dockerfile.template?

aaronsteers (Contributor, Author)


I'm not familiar with Dockerfile.template. I do think that both the AWS and Azure libraries (jars) are needed in the 'base/core' image. They enable core functionality that would otherwise be a blocker for even basic usage and testing.

One simplification I could make here, if we didn't want the rest of the hadoop binaries, is to use a multi-stage build: download the hadoop binaries in an ephemeral image and then copy only the files we need into the spark classpath.
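A minimal sketch of that multi-stage idea (the versions, paths, and SPARK_HOME value below are illustrative and would need to match whatever the final image actually pins):

```dockerfile
# Stage 1 (ephemeral): download and unpack the full Hadoop distribution.
FROM debian:buster-slim AS hadoop
ARG HADOOP_VERSION=2.8.5
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      | tar -xz -C /opt && \
    mv /opt/hadoop-${HADOOP_VERSION} /opt/hadoop

# Stage 2 (final): copy only the AWS/S3 client jars onto Spark's classpath;
# the Hadoop tarball itself never lands in the final image.
FROM python:3.8.1-slim-buster
# SPARK_HOME here is a placeholder; the real Dockerfile sets its own value.
ENV SPARK_HOME=/usr/local/spark
COPY --from=hadoop /opt/hadoop/share/hadoop/tools/lib/hadoop-aws-*.jar \
                   /opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-*.jar \
                   ${SPARK_HOME}/jars/
```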

Thoughts?

dmateusp (Contributor)


I just imagine that AWS users might not need the Azure / GCP jars and vice versa. Or they might not need any of them if they are just using the local filesystem / HDFS?

In past projects, I have seen people use tokens inside their Dockerfile that they replace with sed to customize a "base" Dockerfile. We could also have a base docker image and then aws, azure, and gcp images built from it.
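To make that second idea concrete, a per-cloud variant could be as thin as the following (the base image name and jar locations are assumptions):

```dockerfile
# Sketch: a thin AWS-only variant layered on a cloud-agnostic base image.
# An Azure or GCP variant would copy its own connector jars instead.
FROM dbt-spark-base:latest

# Relies on SPARK_HOME being set in the base image; jars/ is a build-context
# directory holding the connector jars for this variant.
COPY jars/hadoop-aws-*.jar jars/aws-java-sdk-*.jar ${SPARK_HOME}/jars/
```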

All that said, this PR should probably just focus on providing a local environment (through Docker) that is good enough for (integration) testing; we could look into making the image smaller later.

Anyways! I'm going to play around with this image over the weekend and give you more feedback :)

dmateusp (Contributor) commented Mar 1, 2020

@aaronsteers I made some suggestions in a PR to your branch: aaronsteers#3 - hopefully you find it useful!

aaronsteers (Contributor, Author)

> @aaronsteers I made some suggestions in a PR to your branch: aaronsteers#3 - hopefully you find it useful!

Absolutely - thanks very much. As mentioned on the new PR, I think it makes sense to consolidate efforts on #58. I'll close this PR, at least for now, in favor of #58, and we can re-open or resurface items from here as and when needed. Thanks!

aaronsteers closed this on Mar 9, 2020