[AIRFLOW-3673] Add official dockerfile #4483
Conversation
@ffinfo Thanks, it would be nice to have an image that represents master for quick testing
A lot of things here are inspired by @puckel ;)
.travis.yml
Outdated
- name: docker-push
  # require the branch name to be master (note for PRs this is the base branch name)
  if: branch = master
  script: docker build -t apache/airflow:master -t apache/airflow:$TRAVIS_COMMIT . && docker push apache/airflow:$TRAVIS_COMMIT && docker push apache/airflow:master
You will have to use docker pull followed by docker build --cache-from in order to utilise docker caching. The problem is that when you start builds on a fresh machine, the layers differ from what was built on another machine, so previously built layers are not reused - i.e. the image is rebuilt and all layers are pushed on every push.
There is a relatively new --cache-from feature of docker: when you pull an image that was built elsewhere, you can use it as a source of cache for subsequent builds. This means that only the final layers are rebuilt incrementally if, for example, just the sources or just setup.py change.
You can read more about it in this issue: https://stackoverflow.com/questions/37440705/how-to-setup-docker-to-use-cache-from-registry-on-every-build-step (see the second answer - highest voted)
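A sketch of how the Travis stage quoted above could be adapted to use --cache-from (the stage layout mirrors that snippet; the exact flags and ordering are an assumption, not tested CI config):

```yaml
- name: docker-push
  if: branch = master
  script:
    # Pull the previously published image so its layers can seed the cache;
    # "|| true" keeps the very first build from failing when no image exists yet.
    - docker pull apache/airflow:master || true
    # Reuse the pulled layers as cache - only layers after the first change are rebuilt.
    - docker build --cache-from apache/airflow:master -t apache/airflow:master -t apache/airflow:$TRAVIS_COMMIT .
    - docker push apache/airflow:$TRAVIS_COMMIT && docker push apache/airflow:master
```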
Another option - we could consider using Kaniko. Kaniko is a great tool provided by Google for building reproducible images, and you can use it inside another Docker container as well. It has much better support for building reproducible images (independent of the machine they are built on).
You do realise that this part of the code is already gone? The build will be done by Docker Hub itself once this is merged.
Ah. I see now. I was looking at commits rather than final change. Indeed in this case Dockerhub manages caching on its own. All good then. 👍
BTW, not sure if this is the right moment, but maybe we could incorporate some of the layering and concepts from what we've implemented in our GCP-operator-related Airflow development environment (we call it airflow-breeze).
It has a number of things implemented such as:
- nicer layering
- ability to only rebuild different layers separately rather than everything
- support for python 2.7, 3.5, 3.6 in one image
- faster build for cassandra drivers
It's much more cache friendly and I think some of the concepts from it might be incorporated in the official image as well:
https://github.com/PolideaInternal/airflow-breeze/blob/master/Dockerfile
Thanks @potiuk for the feedback. I think this Dockerfile is fine for a first step, after that we can improve and make it nicer. The idea is to have a build each time there is a commit on master, not on the other branches.
Right now this is done by the automatic building tool of Dockerhub: https://jira.apache.org/jira/browse/INFRA-17595
There is no way to use caching since it will kick off a fresh build every time.
Feel free to improve the image, highly appreciated.
Actually you can enable build caching in dockerhub automated builds: https://docs.docker.com/docker-hub/builds/ :
For each branch or tag, enable or disable the Build Caching toggle.
Build caching can save time if you are building a large image frequently or have many dependencies. You might want to leave build caching disabled to make sure all of your dependencies are resolved at build time, or if you have a large layer that is quicker to build locally.
I will propose a change shortly.
I wasn't aware of that, thanks for pointing it out.
Actually - I looked again at the final Dockerfile and I have to restate what I commented on before. The way it is implemented now is quite bad for users who want to pull the image and then keep up with future changes. Every time you push a new commit, pretty much the whole image is created from scratch. This is quite bad for someone who would like to pull the image regularly - every pull effectively fetches the whole image, since it is pretty much a single-layer image. Plus you do not tap into the benefit of docker downloading and decompressing layers in parallel when you have a multi-layered image. Layering the image - installing apt-get dependencies separately, then installing airflow via pip on top of that, then possibly updating dependencies/airflow (with the possibility to rebuild starting from any layer if needed) - could provide a much better experience, with users only pulling incremental layers for future updates. I am happy to help with that if you think it is a good idea.
ARG AIRFLOW_DEPS="all"
ARG PYTHON_DEPS=""
ARG buildDeps="freetds-dev libkrb5-dev libsasl2-dev libssl-dev libffi-dev libpq-dev git"
ARG APT_DEPS="$buildDeps libsasl2-dev freetds-bin build-essential default-libmysqlclient-dev apt-utils curl rsync netcat locales"
Having all of these in the "official" image would make it rather heavyweight for most users, wouldn't it? What is the size of this image?
I already brought it down to 1 GB; before it was 1.6 GB because of the non-slim python image.
I did this on purpose because I like the default image to support everything. But I made it with ARG, so it is possible to generate multiple images from the same Dockerfile in case a smaller version is also required.
Although having multiple official images may also be confusing. Nowadays 1 GB is not that much anymore.
Okay, after addressing some of the comments it dropped to 877MB.
scripts/docker-entrypoint.sh
Outdated
echo starting airflow with command:
echo airflow $@

exec airflow $@
It would be nice if this docker image also worked with other commands, such as bash or python:
https://github.com/puckel/docker-airflow/blob/master/script/entrypoint.sh#L70-L95
@ashb You can always do something like this: docker run --entrypoint bash -ti <image>
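For reference, a minimal sketch of an entrypoint that forwards known airflow subcommands to airflow but executes anything else (such as bash or python) directly - the subcommand list and the dispatch function are illustrative assumptions, not what this PR ships:

```shell
#!/bin/sh
# Decide what to run: known airflow subcommands are prefixed with `airflow`,
# everything else (bash, python, ...) is executed as-is.
# `echo` stands in for `exec` here so the dispatch logic is easy to test;
# a real entrypoint would use exec instead.
dispatch() {
  case "$1" in
    webserver|scheduler|worker|flower|version)
      echo "airflow $*"    # real script: exec airflow "$@"
      ;;
    *)
      echo "$*"            # real script: exec "$@"
      ;;
  esac
}

dispatch webserver   # -> airflow webserver
dispatch bash        # -> bash
```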
I created a Multi-Layered Dockerfile pull request from my private repo: ffinfo#1 (https://github.com/potiuk/incubator-airflow/commit/f7e3e2646823122c05f0075e5b019b21426a90fa). This improves the original Dockerfile in the following ways:
This should work as follows (if caching is enabled in dockerhub):
As a result - if we just push code to the airflow repo, the image built on Dockerhub will be prepared in an optimal way and users will only download incremental updates to the base image they already have. We also have a way to force rebuilding parts of the image, or the whole image, if we choose to.
Dockerfile
Outdated
&& export SLUGIFY_USES_TEXT_UNIDECODE=yes \
&& apt update \
&& if [ -n "${APT_DEPS}" ]; then apt install -y $APT_DEPS; fi \
&& if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi
I suggest disabling the cache for pip:
pip install --no-cache-dir
The cache serves no purpose in an isolated container environment and only increases the image size.
Another option: remove the cache at the end of the build. It will probably be in ~/.cache/pip.
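Both options can be sketched in Dockerfile form (the package name is a placeholder for illustration):

```dockerfile
# Option 1: tell pip not to populate its cache at all.
RUN pip install --no-cache-dir apache-airflow

# Option 2: keep the cache during the install but delete it in the same
# RUN instruction, so it never becomes part of the layer.
RUN pip install apache-airflow && rm -rf ~/.cache/pip
```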
Hej @mik-laj. You are right if this is something done in a single RUN call of the Dockerfile. But if we decide to go in the direction I proposed in ffinfo#1 and have multiple layers in the image, it makes sense to keep the cache dir, because there you have subsequent calls to pip install - both using the same cache.
In the solution I proposed in my pull request, you run a second 'pip install' of the airflow sources after installing the dependencies - which results in an optimised experience for users: they only download/build incremental layers. Without it, the whole image has to be downloaded/rebuilt, including all dependencies, after every single change to the airflow sources.
I now added --no-cache-dir. This made the image 150MB smaller 👍
A warm hello @potiuk,
I think that even in this case the cache behavior is not what we want. Cache is a mechanism used to speed up building, but I think the image download time matters more than the build time. We can accept a one-off build that takes two minutes longer, because users will then download the image a few seconds faster - and users will not download this image once but hundreds of times, so the profit outweighs the losses.
I also have doubts whether the pip cache is idempotent. You have to check whether it changes over time, which would force building the image from scratch. For example, the apt cache should never be placed in the image, because it contains information about every package in the repository, not just the software we install. An update of any package (unrelated to our project) in the repository creates a new variant of the cache, and therefore an entirely new docker layer.
@Fokko @ffinfo -> what do you think about the layering approach I proposed? Do you see any reason why it might cause problems? I think it's really bad to build the whole image as one layer. It's a best practice to leverage layer caching: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache . It used to be that you should minimize the number of layers (https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#minimize-the-number-of-layers), but that dates from the old days when every single line in a Dockerfile created a new layer (which is no longer the case). With my proposal you not only minimize the one-time image download but, most of all, you think about your users and minimize the size of all future versions/incremental downloads. Even if you shrink the image to 500 MB, it doesn't really help users, because with the current approach every time you pull a new airflow image another 500 MB is taken in your local docker storage (unless you prune old images).
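The layering argument can be sketched roughly like this (base image, paths, and package lists are illustrative assumptions, not the PR's actual Dockerfile):

```dockerfile
FROM python:3.6-slim

# Layer 1: system dependencies - changes very rarely, so it is almost
# always served from cache and never re-downloaded by users.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Layer 2: python dependencies - invalidated only when setup.py changes.
COPY setup.py /airflow/setup.py
WORKDIR /airflow
RUN pip install --no-cache-dir -e .

# Layer 3: sources - the only layer rebuilt (and re-pulled) on a typical commit.
COPY . /airflow
RUN pip install --no-cache-dir -e .
```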
@potiuk I'm not really convinced right now. Having caching on Dockerhub will speed up the build process, but also: for how long will this be cached? If it builds on top of an old cached layer and, for example, an underlying package has been updated in the meantime, we would keep shipping the stale layer. While looking at other official images, I see a lot of commands combined into a few RUN statements. In the end, the layer that you will re-download is 60% (600MB) of the image. The current image by @ffinfo is 877MB:
Codecov Report
@@ Coverage Diff @@
## master #4483 +/- ##
==========================================
- Coverage 74.73% 74.72% -0.01%
==========================================
Files 429 429
Lines 29620 29620
==========================================
- Hits 22136 22135 -1
- Misses 7484 7485 +1
Continue to review full report at Codecov.
I'm going to merge this for now so that the CI is okay again. @potiuk Feel free to open up a PR to make the Dockerfile nicer.
@Fokko You took very specific cases. Your examples are base images that are selected once and used for a very long time without updating. No one updates the execution environment every month, so those images are not optimised for the update experience. I suggest looking at real application images. Sentry is a good example.
@Fokko @ffinfo -> I see this is merged now. I will still try to convince you anyway :) I will soon open a PR to the main apache line with some detailed calculations. I will prepare some numbers and analysis, including download times and a simulation of usage by the users, in order to show it rather than simply 'state' it. But just for now, to answer your concerns @Fokko:
I've added the lines below in my Docker image. Currently they need to be manually triggered, but if keeping the apt-get installed packages up to date is a concern, this can be moved. These lines can always be run after setup.py changes or sources are added - this way the latest versions will be upgraded frequently (or very frequently), and you will get the same result as if you installed from scratch:
RUN apt-get update
This layer will grow over time - of course - but then it is worth rebuilding it from scratch periodically (and that's what I also implemented - if you increase the FORCE_REINSTALL_APT_GET_DEPENDENCIES env value, the apt-get dependencies are installed from scratch). This way the apt-get install layer is rebuilt from scratch with the latest dependencies (say, every official release or every quarter).
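The FORCE_REINSTALL_APT_GET_DEPENDENCIES trick described above can be sketched like this (the ARG name comes from the comment; everything else is an assumed illustration):

```dockerfile
# Bumping this build argument changes the RUN instruction below, which
# invalidates docker's layer cache from that point on - forcing a fresh
# apt-get install with the latest package versions.
ARG FORCE_REINSTALL_APT_GET_DEPENDENCIES=1

RUN echo "cache bust: ${FORCE_REINSTALL_APT_GET_DEPENDENCIES}" \
    && apt-get update \
    && apt-get install -y --no-install-recommends curl locales netcat \
    && rm -rf /var/lib/apt/lists/*
```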
With every commit where the sources change, you will only re-download the '68bc9494193a' layer from your example (10.4MB), which is just over 1% of the full image. This is because the 600MB layer is only rebuilt when setup.py changes - otherwise it is taken from cache. Setup.py changes far less frequently than the sources: the sources change with every commit, but setup.py only about once every two weeks, sometimes remaining unchanged for three weeks. This means that during those two to three weeks, anyone syncing the latest image saves a significant amount of time/bandwidth. Later on we could split it even further, using the already existing setup.py split into core (more stable) and non-core (changing more frequently) dependencies. This way the big pip-install layer would be split further, and the core part would be downloaded even less frequently.
I think I agree with @potiuk and I'd like to see something like:
Edit: thinking a bit more about the way docker layers work, having the two images based off each other wouldn't be sensible (the goal being that a small code change should only result in a small layer download) - so I think two tags would be the way I'd like to go.
I think we can have some base image with minimal apt-get dependencies and then build both Dockers "forking" from that - maybe we can use multi-stage builds for that. Though I will have to check if there is a way to have dependencies between images built via DockerHub. If not, it would make sense to have separate Dockerfiles for every Docker. I am just finalizing the PR for optional project id for GCP operators, so this week I can build a POC and perform the calculations to show the savings from multiple layers. We would probably also want to incorporate a third image into a similar process - the one used by Travis CI (currently it is in a separate project, but it probably makes sense to build it in the main airflow project). I actually think it would be great to have more than one image for that - separate images for different python versions are probably the way to go for Travis. See the discussion/comments in AIP-7: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-7+Simplified+development+workflow I would also love to be able to use the images as a base for the Google Cloud Build we are running for automated System Tests with GCP (see AIP-4: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-4+Support+for+System+Tests+for+external+systems). I don't think we will be far from that as long as the layered approach is implemented.
I think we use the "other" version of python in the tests of the PythonVirtualEnvOperator?
Ah. I thought more about installing dependencies for Airflow. A bare Python installation is not as huge as one with full dependencies. I think it's enough for the PythonVirtualEnvOperator tests to just have the other python versions installed (without all the airflow deps).
There is already a PR for having a Python2 and Python3 image for the CI: apache/airflow-ci#4 I do think we should start simple and go from there. Having a lot of dependencies also makes it harder to maintain, etc.
Ah. That's nice. What do you think @Fokko about bringing the airflow-ci Dockerfile into the main airflow project? I think it makes more sense for it to be in the airflow repo than in a separate project. Now that we have Dockerhub enabled, it is just a matter of bringing it in and defining a separate Dockerfile location as described in apache/airflow-ci#4. It was already discussed (I made a comment about it) in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-7+Simplified+development+workflow, and it seems that being able to change the Dockerfiles together with the code you are working on in your fork would be a nice thing. Then anyone could set up their own DockerHub + Travis CI account and modify the CI scripts to use their image rather than the apache ones. I would also like to use the same approach (I am already doing it in our fork) to run system tests for GCP with Cloud Build (which is much better integrated for GCP system tests) - as described in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-4+Support+for+System+Tests+for+external+systems I just finished a big piece of work, and over the next couple of days I can make some PR(s) to bring the modularisation in and to add the Travis CI (and possibly later Cloud Build) images to the 'airflow' repo.
Let me know if you think it's a good idea, then I can proceed with it :)
One more comment - in our system tests we ran into a lot of trouble with differences between python 3.5 and 3.6. There are some subtle differences (the way Popen.stdout returns bytes or strings, etc.).
On Python 3.5 vs 3.6? With otherwise identical environments?!
Hi @potiuk, that sounds like a great idea. Like I said earlier, we should keep stuff simple. Having the CI image inside the Airflow repository itself makes it easier to maintain. Right now we have to push changes to two repositories, which does not really work in practice. I like the idea you propose. Maybe it would be better to first strip out stuff like tox, which will also impact the image. However, I'm fine with either approach. Let me know if I can help.
@ashb. Right - not Popen. The Popen difference was 2.7 vs. 3.x of course, but there were others we hit. One 3.5 vs. 3.6 difference I documented as a bug (the one that gave me the most headaches). You will understand when you look at it why it was a headache and why I needed to test it: there is a very, very subtle difference in urlparse behaviour (the hostname is not lowercased for some url types) between python 3.6 and 3.5 (see: https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-3615). I tested it in 3.6 and it did not work in 3.5. The other one, which we solved before it hit the repo, was dict insertion ordering (which works in 3.6 as an implementation detail but not in 3.5; in 3.7 it became a language feature). There was a comparison of keys in Bigtable column families vs. the spec, and the initial implementation implicitly relied on dict ordering (it did not use OrderedDict initially).
@Fokko - right. I will work on it next after some other small PRs :).
And yes, stripping tox is definitely what should happen: once you have several docker images with different virtualenvs/python versions, tox is really just overhead.
@ashb. I tracked down the root cause of the "Popen" differences I had in mind. Now I know why I thought it was related to Popen - it was in fact json.loads(). In python3 (both 3.5 and 3.6) the Popen output is bytes, but json.loads() in python 2.7 and 3.5 can only consume str (python 3.6 accepts bytes as well). We often (for setting up/tearing down the test environment) run gcloud commands that produce json output, which we then parse and use in python to do something (for example kill all running compute instances). So the problem is that parsing the raw output works in 3.6 but not in 3.5. It can be easily fixed by decoding the output to str before passing it to json.loads().
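To illustrate (a standalone sketch, not the project's test code): subprocess output arrives as bytes on python 3, json.loads(bytes) works only on 3.6+, and decoding first is portable across versions:

```shell
# gcloud-style JSON output arrives from a subprocess as bytes on python 3.
# json.loads(bytes) works on 3.6+ but raises TypeError on 3.5;
# decoding to str first behaves the same on 2.7, 3.5 and 3.6.
python3 - <<'EOF'
import json
import subprocess

out = subprocess.check_output(["echo", '{"zone": "us-central1-a"}'])  # bytes
data = json.loads(out.decode("utf-8"))  # decode before parsing - portable
print(data["zone"])
EOF
```

Running this prints us-central1-a on any python 3.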
@ffinfo @Fokko @ashb -> the multi-layered image seems to be even slightly smaller than the mono-layered one, and when I calculated some assumed usage scenarios I got 5 GB (multi-layered) vs 16 GB (mono-layered) downloaded by a user over the course of 8 weeks. PTAL and let me know what you think, but I think it's a no-brainer. I opened PR #4543 against the main repo and wrote a proposal describing my calculations and assumptions in detail in AIP-10: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+official+Airflow+image I am happy to discuss my findings tomorrow, and I hope I can convince you that the multi-layered approach is far better in general.
Make sure you have checked all steps below.
Jira
Description
Adding a docker build step to travis and push to docker hub
Tests
This should only happen when tests are successful and on the master branch
Commits
Documentation
To finish the documentation, the push step should first be confirmed to work. This can be done in a follow-up PR
Code Quality
flake8