
[AIRFLOW-3673] Add official dockerfile #4483

Merged · 12 commits · Jan 12, 2019

Conversation

ffinfo
Contributor

@ffinfo ffinfo commented Jan 11, 2019

Make sure you have checked all steps below.

Jira

Description

Adding a Docker build step to Travis and a push to Docker Hub.

Tests

This should only happen when tests are successful and only on the master branch.

Commits

Documentation

To finish the documentation, the push step should first be confirmed to work. This can be done in a follow-up PR.

Code Quality

  • Passes flake8

@Fokko Fokko changed the title [AIRFLOW-3673] Adding docker build step to travis [AIRFLOW-3673] Add docker build step to travis Jan 11, 2019
@ffinfo ffinfo changed the title [AIRFLOW-3673] Add docker build step to travis [AIRFLOW-3673] Add official dockerfile Jan 11, 2019
@Fokko
Contributor

Fokko commented Jan 11, 2019

@ffinfo Thanks, it would be nice to have an image that represents master for quick testing.

@ffinfo
Contributor Author

ffinfo commented Jan 11, 2019

A lot of things here were inspired by @puckel ;)

.travis.yml Outdated
- name: docker-push
# require the branch name to be master (note for PRs this is the base branch name)
if: branch = master
script: docker build -t apache/airflow:master -t apache/airflow:$TRAVIS_COMMIT . && docker push apache/airflow:$TRAVIS_COMMIT && docker push apache/airflow:master
Member

You will have to use docker pull followed by docker build --cache-from in order to utilise Docker caching. The problem is that when you start builds on a fresh machine, the layers are different from what was built on another machine, and effectively you do not reuse previously built layers - i.e. the image is rebuilt and all layers are pushed every time you push.

There is a relatively new --cache-from feature of Docker: when you pull an image that was built elsewhere, you can use it as a source of cache for subsequent builds. This means that you will only incrementally rebuild the final layers if, for example, just the sources or just setup.py change.

You can read more about it in this Stack Overflow question: https://stackoverflow.com/questions/37440705/how-to-setup-docker-to-use-cache-from-registry-on-every-build-step (see the second answer, the highest voted one).
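
For illustration, a rough sketch of how that could look in the Travis script (the tags reuse the ones from the snippet above; the exact wiring is an assumption, not something taken from this PR):

# pull the previously published image (ignore failure on the very first build),
# then reuse its layers as cache for the new build
docker pull apache/airflow:master || true
docker build --cache-from apache/airflow:master \
  -t apache/airflow:master -t apache/airflow:$TRAVIS_COMMIT .
docker push apache/airflow:$TRAVIS_COMMIT && docker push apache/airflow:master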

Member

Another option - we could consider using Kaniko. Kaniko is a great tool provided by Google that you can also run inside another Docker container. It has much better support for building reproducible images (independent of the machine they are built on):

https://github.com/GoogleContainerTools/kaniko
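
As a sketch only (the executor image and flags below are Kaniko's documented ones, but this particular invocation is illustrative and assumes registry credentials are already available in ~/.docker/config.json):

# run the Kaniko executor against the current checkout and push the result
docker run \
  -v "$PWD":/workspace \
  -v "$HOME/.docker/config.json":/kaniko/.docker/config.json:ro \
  gcr.io/kaniko-project/executor:latest \
  --context=dir:///workspace \
  --dockerfile=/workspace/Dockerfile \
  --destination=apache/airflow:master \
  --cache=true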

Contributor Author

You do realise that this part of the code is already gone? The building will be done by Docker Hub itself once this is merged.

Member

Ah, I see now. I was looking at the commits rather than the final change. Indeed, in this case Dockerhub manages caching on its own. All good then. 👍

Member

@potiuk potiuk Jan 11, 2019

BTW, not sure if this is the right moment, but maybe we could incorporate some of the layering and concepts from what we've implemented in our GCP-operator-related Airflow development environment (we call it airflow-breeze).
It has a number of things implemented such as:

  • nicer layering
  • ability to only rebuild different layers separately rather than everything
  • support for python 2.7, 3.5, 3.6 in one image
  • faster build for cassandra drivers

It's much more cache-friendly, and I think some of the concepts from it might be incorporated in the official image as well:

https://github.com/PolideaInternal/airflow-breeze/blob/master/Dockerfile

Contributor

Thanks @potiuk for the feedback. I think this Dockerfile is fine as a first step; after that we can improve it and make it nicer. The idea is to have a build each time there is a commit on master, not on the other branches.
Right now this is done by the automated building tool of Dockerhub: https://jira.apache.org/jira/browse/INFRA-17595
There is no way to use caching since it will kick off a fresh build every time.

Feel free to improve the image, highly appreciated.

Member

Actually you can enable build caching in dockerhub automated builds: https://docs.docker.com/docker-hub/builds/ :

For each branch or tag, enable or disable the Build Caching toggle.

Build caching can save time if you are building a large image frequently or have many dependencies. You might want to leave build caching disabled to make sure all of your dependencies are resolved at build time, or if you have a large layer that is quicker to build locally.

I will propose a change shortly.

Contributor

I wasn't aware of that, thanks for pointing it out.

@potiuk
Member

potiuk commented Jan 11, 2019

Actually - I looked again at the final Dockerfile and I have to restate what I commented on before. The way it is implemented now is quite bad for users who want to pull the image and then keep up to date with future changes. Every time a new commit is pushed, pretty much the whole image will be created from scratch. This is quite bad for someone who would like to pull the image regularly, because every time they pull it they will effectively pull the whole image, as it will be pretty much a single-layer image. Plus you do not tap into the benefit of Docker downloading and decompressing layers in parallel, which you get with a multi-layered image.

Layering the image - installing the apt-get dependencies separately, then installing Airflow via pip on top of that, and then possibly updating the dependencies / Airflow (with the possibility to rebuild starting from any layer if needed) - could provide a much better experience, with users only pulling incremental layers for future updates. I am happy to help with that if you think it is a good idea.

scripts/docker-entrypoint.sh Outdated
Dockerfile Outdated
scripts/docker-entrypoint.sh Outdated
Dockerfile Outdated
ARG AIRFLOW_DEPS="all"
ARG PYTHON_DEPS=""
ARG buildDeps="freetds-dev libkrb5-dev libsasl2-dev libssl-dev libffi-dev libpq-dev git"
ARG APT_DEPS="$buildDeps libsasl2-dev freetds-bin build-essential default-libmysqlclient-dev apt-utils curl rsync netcat locales"
Member

Having all of these in the "official" image would make it rather heavyweight for most users, wouldn't it? What is the size of this image?

Contributor Author

I already brought it down to 1 GB now; before it was 1.6 GB because of the non-slim Python image.
I did this on purpose because I like that the default image supports everything. But I made it with ARGs so that it would be possible to generate multiple images from the same Dockerfile in case a smaller version is also required.

Contributor Author

Although, having multiple official images might also be confusing. Nowadays 1 GB is not that much anymore.

Contributor Author

OK, after addressing some of the comments it dropped to 877 MB now.

echo starting airflow with command:
echo airflow $@

exec airflow $@
Member

It would be nice if this Docker image also worked with other commands, such as bash or python: https://github.com/puckel/docker-airflow/blob/master/script/entrypoint.sh#L70-L95
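
Something along these lines, for example (a minimal sketch loosely following the linked puckel entrypoint, not the script in this PR; the list of recognised subcommands is illustrative):

#!/usr/bin/env bash
set -e

case "$1" in
  ""|webserver|scheduler|worker|flower|version|initdb|upgradedb)
    # no argument or a known airflow subcommand: run it through the airflow CLI
    exec airflow "$@"
    ;;
  *)
    # anything else (bash, python, a custom script, ...) is executed as-is
    exec "$@"
    ;;
esac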

Contributor Author

@ashb You can always do something like this: docker run --entrypoint bash -ti <image>

@potiuk
Member

potiuk commented Jan 11, 2019

I created a multi-layered Dockerfile pull request from my private repo: ffinfo#1 (https://github.com/potiuk/incubator-airflow/commit/f7e3e2646823122c05f0075e5b019b21426a90fa). This improves the original Dockerfile in the following ways:

  • there are several layers of the image:
    • two apt-get layers for basic/complex dependencies
    • an upgrade apt-get layer for future upgrades of apt-get dependencies
    • layer with pip airflow dependencies installed
    • layer with source changes of airflow
    • layer with additional pip dependencies
  • Cassandra driver install time is vastly decreased - no Cython compilation is done, which shortens the build time by 10 minutes or so. Also, a multi-processor build (8 processors) is enabled automatically. All of this can be changed by simply changing the default values; it can also be overridden with the --build-arg flag of docker build

This should work as follows (if caching is enabled in dockerhub):

  • if only the airflow sources change, without setup.py dependency changes (which is the most common case), only the last layer is rebuilt and all the other layers are taken from the cache. This should not only speed up the build time but also reduce the amount of data downloaded by users/developers. This can also be forced by increasing the value of the FORCE_REINSTALL_AIRFLOW_SOURCES env in the Dockerfile
  • if dependencies are changed in setup.py, then the last two layers are invalidated and rebuilt - all dependencies will be installed from scratch. This can also be forced by increasing the value of FORCE_REINSTALL_ALL_PIP_DEPENDENCIES
  • if we want to upgrade all apt-get dependencies, we can increase the value of the FORCE_UPGRADE_OF_APT_GET variable in the Dockerfile - this will invalidate the cache and force "apt-get upgrade"
  • if we want to force reinstalling everything from scratch, we can increase the value of the FORCE_REINSTALL_APT_GET_DEPENDENCIES variable in the Dockerfile

As a result, if we just push code to the airflow repo, the image built in Dockerhub will be prepared in an optimal way and users will only download incremental updates to the base image they already have. We also have a way to force rebuilding parts of the image, or the whole image, if we choose to.
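
To make the layering idea concrete, here is a heavily simplified sketch of that structure (the FORCE_* and extras names are taken from the description above; the base image, the package lists and the trick of pre-installing a released Airflow to warm the dependency layer are assumptions for illustration, not the content of the linked PR):

FROM python:3.6-slim

ENV SLUGIFY_USES_TEXT_UNIDECODE=yes

# apt layer: changes rarely; bumping this ENV value invalidates it and everything after it
ENV FORCE_REINSTALL_APT_GET_DEPENDENCIES=1
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libpq-dev \
    && apt-get clean

# dependency layer: pre-install a released Airflow with the chosen extras so that the pip
# dependencies land in their own cached layer (one way to get the effect described above)
ENV FORCE_REINSTALL_ALL_PIP_DEPENDENCIES=1
ARG AIRFLOW_EXTRAS="postgres"
RUN pip install --no-cache-dir "apache-airflow[${AIRFLOW_EXTRAS}]"

# source layer: the only layer that is rebuilt on a typical commit
ENV FORCE_REINSTALL_AIRFLOW_SOURCES=1
COPY . /opt/airflow
RUN pip install --no-cache-dir /opt/airflow

WORKDIR /opt/airflow
ENTRYPOINT ["airflow"]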

Dockerfile Outdated
&& export SLUGIFY_USES_TEXT_UNIDECODE=yes \
&& apt update \
&& if [ -n "${APT_DEPS}" ]; then apt install -y $APT_DEPS; fi \
&& if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi \
Member

I suggest disabling the cache for pip:
pip install --no-cache-dir
The cache is of no use in an isolated container environment and only increases the image size.

Another option: remove the cache at the end. It will probably be in ~/.cache/pip.

Member

Hi @mik-laj. You are right if this all happens in a single RUN call of the Dockerfile. But if we decide to go in the direction I proposed in ffinfo#1 and have multiple layers in the image, it makes sense to keep the cache dir, because then you have subsequent calls to pip install, both using the same cache.

In the solution I proposed in my pull request, you run a second 'pip install' of the Airflow sources after installing the dependencies - which results in an optimised experience for users: they will only download/build incremental layers. Without it, the whole image will have to be downloaded/rebuilt, including all dependencies, after every single change to the Airflow sources.

Contributor Author

I now added --no-cache-dir. This made the image 150 MB smaller 👍

Member

Warm greetings @potiuk.
I think that even in this case the cache behavior is not what we want. The cache is a mechanism used to speed up building, but I think the image download time matters more than the build time. We can accept the image taking two minutes longer to build once, if users download the image a few seconds faster. Users will not download this image once, but hundreds of times, so the profit outweighs the losses.

I also have doubts whether the pip cache is idempotent. You have to check that it does not change over time, which would force building the image from scratch. For example, the apt cache should never be placed in the image, because it contains information about every package, not just the software we install. An update to any package in the repository (even one not associated with our project) creates a new variant of the cache, and therefore an entirely new Docker layer.

@potiuk
Member

potiuk commented Jan 12, 2019

@Fokko @ffinfo -> what do you think about the layering approach I proposed?

Can you think of any reason why it might cause problems?

I think it's really bad to build the whole image as one layer. It's a best practice to leverage layer caching: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache . It used to be that you should minimize the number of layers (https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#minimize-the-number-of-layers), but that was back in the old days when every single line in a Dockerfile created a new layer (which is not the case any more).

With the proposal I have, you not only minimize the one-time image download but, most of all, you think about your users and minimize the size of any future versions/incremental downloads.

Even if you minimize the image to 500 MB, it's not really helping users, because with the current approach every time you pull a new Airflow image you will have an extra 500 MB taken up in your local Docker storage (unless you run docker system prune, this space will not be reclaimed).

@Fokko
Contributor

Fokko commented Jan 12, 2019

@potiuk I'm not really convinced right now. Having caching in Dockerhub will speed up the build process, but for how long will this be cached? If it builds on top of an old cached layer and, for example, an apt package has been updated, we won't notice it. Therefore the Dockerhub CI might run fine, but on people's local machines it might break.

While looking at other official images, I see a lot of commands combined in a few RUN statements:
PHP: https://github.com/docker-library/php/blob/b4aeb948e2e240c732d78890ff03285b16e8edda/5.6/Dockerfile
Python: https://github.com/docker-library/python/blob/3e5826ad0c6e29f07f6dc7ff8f30b4c54385d1bb/3.4/Dockerfile
Ruby: https://github.com/docker-library/ruby/blob/e34b201a0f0b49818fc8373f6a9148e13d546bdf/2.2/Dockerfile

In the end, the layer that you will re-download is 60% (600 MB) of the image. The current image by @ffinfo is 877 MB:

MacBook-Pro-van-Fokko:incubator-airflow fokkodriesprong$ docker images airflow-jarek
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
airflow-jarek       latest              e8ae4090d00d        10 minutes ago      1.04GB
MacBook-Pro-van-Fokko:incubator-airflow fokkodriesprong$ docker history airflow-jarek
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
e8ae4090d00d        10 minutes ago      /bin/bash -c #(nop)  CMD ["--help"]             0B                  
6ae4a2a2e1f4        10 minutes ago      /bin/bash -c #(nop)  ENTRYPOINT ["/docker-en…   0B                  
c3cb4448969f        10 minutes ago      /bin/bash -c #(nop) COPY file:0d53d67240ab83…   105B                
c3180143551f        11 minutes ago      |4 ADDITIONAL_PYTHON_DEPS= AIRFLOW_EXTRAS=al…   0B                  
001f197fc49b        11 minutes ago      /bin/bash -c #(nop)  ARG ADDITIONAL_PYTHON_D…   0B                  
8abe93e681f3        11 minutes ago      |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca…   50.5kB              
68bc9494193a        11 minutes ago      /bin/bash -c #(nop) COPY dir:3864bf4d7954e65…   10.4MB              
c79384f5054c        11 minutes ago      |3 AIRFLOW_EXTRAS=all AIRFLOW_HOME=/usr/loca…   607MB               
b0e483941ae8        15 minutes ago      /bin/bash -c #(nop) WORKDIR /opt/airflow        0B                  
d11de3d9d9eb        15 minutes ago      /bin/bash -c #(nop)  ENV FORCE_REINSTALL_AIR…   0B                  
2aa8a9f441d1        15 minutes ago      /bin/bash -c #(nop) COPY file:143db2e76b8f16…   1.26kB              
92be30663ae7        15 minutes ago      /bin/bash -c #(nop) COPY file:590340f7066102…   3.04kB              
42a5d2e39a70        15 minutes ago      /bin/bash -c #(nop) COPY file:3e78814fb55a47…   838B                
21e6013ecff7        15 minutes ago      /bin/bash -c #(nop) COPY multi:9a7d03f641d42…   15kB                
56085757672b        15 minutes ago      /bin/bash -c #(nop)  ENV SLUGIFY_USES_TEXT_U…   0B                  
2e170dabdab7        15 minutes ago      /bin/bash -c #(nop)  ENV CASS_DRIVER_NO_CYTH…   0B                  
7557c5518ec7        15 minutes ago      /bin/bash -c #(nop)  ENV CASS_DRIVER_BUILD_C…   0B                  
b979faff6f0c        15 minutes ago      /bin/bash -c #(nop)  ARG CASS_DRIVER_NO_CYTH…   0B                  
85bbf27ce200        15 minutes ago      /bin/bash -c #(nop)  ENV FORCE_REINSTALL_ALL…   0B                  
be7abae62e4b        15 minutes ago      /bin/bash -c #(nop)  ARG AIRFLOW_EXTRAS=all     0B                  
9c913c49db39        15 minutes ago      |1 AIRFLOW_HOME=/usr/local/airflow /bin/bash…   0B                  
a5b0a864d7ce        15 minutes ago      /bin/bash -c #(nop)  ARG AIRFLOW_HOME=/usr/l…   0B                  
e2135a58f7aa        15 minutes ago      /bin/bash -c apt-get update     && apt-get u…   5.04MB              
09d11dcdd0a5        16 minutes ago      /bin/bash -c #(nop)  ENV FORCE_UPGRADE_OF_AP…   0B                  
a5347df6b818        16 minutes ago      /bin/bash -c apt-get update     && apt-get i…   155MB               
17642cef8b50        16 minutes ago      /bin/bash -c apt-get update     && apt-get i…   127MB               
4d583a89cee6        16 minutes ago      /bin/bash -c #(nop)  ENV FORCE_REINSTALL_APT…   0B                  
fc78f0a6a9ad        16 minutes ago      /bin/bash -c #(nop)  ENV DEBIAN_FRONTEND=non…   0B                  
6f4b3293aa89        16 minutes ago      /bin/bash -c #(nop)  SHELL [/bin/bash -c]       0B                  
ea57895cf3f9        8 weeks ago         /bin/sh -c #(nop)  CMD ["python3"]              0B                  
<missing>           8 weeks ago         /bin/sh -c set -ex;   savedAptMark="$(apt-ma…   7.13MB              
<missing>           8 weeks ago         /bin/sh -c #(nop)  ENV PYTHON_PIP_VERSION=18…   0B                  
<missing>           8 weeks ago         /bin/sh -c cd /usr/local/bin  && ln -s idle3…   32B                 
<missing>           8 weeks ago         /bin/sh -c set -ex   && savedAptMark="$(apt-…   69.1MB              
<missing>           8 weeks ago         /bin/sh -c #(nop)  ENV PYTHON_VERSION=3.6.7     0B                  
<missing>           8 weeks ago         /bin/sh -c #(nop)  ENV GPG_KEY=0D96DF4D4110E…   0B                  
<missing>           8 weeks ago         /bin/sh -c apt-get update && apt-get install…   6.45MB              
<missing>           8 weeks ago         /bin/sh -c #(nop)  ENV LANG=C.UTF-8             0B                  
<missing>           8 weeks ago         /bin/sh -c #(nop)  ENV PATH=/usr/local/bin:/…   0B                  
<missing>           8 weeks ago         /bin/sh -c #(nop)  CMD ["bash"]                 0B                  
<missing>           8 weeks ago         /bin/sh -c #(nop) ADD file:dab9baf938799c515…   55.3MB 

@codecov-io

Codecov Report

Merging #4483 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4483      +/-   ##
==========================================
- Coverage   74.73%   74.72%   -0.01%     
==========================================
  Files         429      429              
  Lines       29620    29620              
==========================================
- Hits        22136    22135       -1     
- Misses       7484     7485       +1
Impacted Files Coverage Δ
airflow/models/__init__.py 92.5% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 486b3a2...f867daa.

@Fokko
Contributor

Fokko commented Jan 12, 2019

I'm going to merge this for now so that the CI is okay again.

@potiuk Feel free to open up a PR to make the Dockerfile nicer.

@Fokko Fokko merged commit e2c22fe into apache:master Jan 12, 2019
@mik-laj
Member

mik-laj commented Jan 12, 2019

@Fokko You picked very specific cases. Those are base images that are selected once and used for a very long time without updating. No one updates such an execution environment every month, so they are not optimised for the experience of users who update frequently.

I suggest looking at real application images. Sentry is a good example:
https://github.com/getsentry/docker-sentry/blob/f648178adf33bec68be47550a82850e18deb0cd8/9.0/Dockerfile

@potiuk
Member

potiuk commented Jan 12, 2019

@Fokko @ffinfo -> I see this is merged now. I will still try to convince you anyway :)

I will soon open a PR to the main apache repo with some detailed calculations. I will prepare some numbers and analysis, including download times and a simulation of usage by users, in order to show it rather than simply 'state' it.

But just for now to answer your concerns @Fokko:

  1. Docker caching lasts basically forever. It won't 'expire' until you change a line in the Dockerfile or a part of the context that is added in the Dockerfile (for example the sources of Airflow). What DockerHub does (well, at least this is best practice and I am sure we can double-check with them) is pull the latest image from the same branch, use it as --cache-from, and this way always reuse the layers that were already published. This is not local machine caching - it is using published image layers as cache.

  2. Updates to apt-get packages:

I've added the lines below in my Docker image. Currently they need to be triggered manually, but if updating to the latest versions of apt-get installed packages is a concern, this can be moved. Those lines can always be added after setup.py changes or sources are added - this way the latest versions will be upgraded frequently (or very frequently), and you will get the same result as if you installed from scratch.

RUN apt-get update \
    && apt-get upgrade -y --no-install-recommends \
    && apt-get clean

This layer will grow over time, of course, but then it's worth rebuilding it from scratch periodically (and that's what I also implemented - if you increase the FORCE_REINSTALL_APT_GET_DEPENDENCIES env value, the apt-get dependencies will be installed from scratch). This way the apt-get install layer will be rebuilt from scratch with the latest dependencies (say every official release or every quarter).

  3. Other images: PHP/Python/Ruby are actually "baseline" kinds of images. They have almost no "external" dependencies other than the "core OS" - unlike Airflow (which has quite a lot of them in fact).

  4. I think you looked at a different layer than the one I was vocal about. The 600 MB layer is exactly the one that will NOT be re-downloaded frequently in my solution. Those are the installed dependencies of Airflow (without the actual Airflow sources!). They don't change with every commit.

With every commit where the sources are changed you will only re-download the '68bc9494193a' layer from your example (10.4 MB), which is only about 1% of the full image. This is because the 600 MB layer is only rebuilt when setup.py changes - otherwise it is taken from the cache. setup.py changes far less frequently than the sources: the sources change with every commit, but setup.py roughly once every two weeks or so; sometimes it stays unchanged for three weeks. This means that during those two to three weeks, anyone syncing the latest image will save a significant amount of time/bandwidth.

Later on we could split it even further, using the already existing setup.py split into core (more stable) and non-core (changing more frequently) dependencies. This way the big pip-install layer would be split further and the core part would be downloaded even less frequently.

@ashb
Member

ashb commented Jan 14, 2019

I think I agree with @potiuk and I'd like to see something like:

  • apache/airflow:1.10-slim, which just contains the absolute minimum to run Airflow (python, core deps, not much else)
  • apache/airflow:1.10-full, which builds upon the -slim one and adds in everything else (cassandra, kerberos, pq et al.)

Edit: thinking a bit more about the way Docker layers work, having the two images based off each other wouldn't be sensible (the goal being that a small code change should only result in a small layer download) - so I think two tags would be the way I'd like to go.
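
Purely as an illustration of how two such tags could fall out of the single parameterised Dockerfile in this PR (reusing the AIRFLOW_DEPS build arg quoted earlier; the exact values, and whether an empty value is handled, are assumptions):

# hypothetical: one Dockerfile, two tags driven by build args
docker build --build-arg AIRFLOW_DEPS="" -t apache/airflow:1.10-slim .
docker build --build-arg AIRFLOW_DEPS="all" -t apache/airflow:1.10-full .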

@potiuk
Member

potiuk commented Jan 14, 2019

I think we can have some base image with minimal apt-get dependencies and then build both Docker images "forking" from that - maybe we can use multi-stage builds for that. Though I will have to check if there is a way to have dependencies between images built via DockerHub.

If not, then it would make sense to have a separate Dockerfile for each image. I am just finalising the PR for optional project id for GCP operators, so this week I can build a POC and perform the calculations to show the savings from multiple layers.

Actually, we would probably like to incorporate a third image in a similar process - the one used for Travis CI (currently it is in a separate project, but it probably makes sense to build it in the main airflow project). I actually think it would be great to have more than one image for that - separate images for different versions of python is probably the way to go for Travis. See the discussion/comments in AIP-7: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-7+Simplified+development+workflow

I would also love to be able to use the images as a base for the Google Cloud Build we are running for automated system tests with GCP (see AIP-4: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-4+Support+for+System+Tests+for+external+systems)

I don't think we will be far from that as long as the layered approach is implemented.

@ashb
Member

ashb commented Jan 14, 2019

separate images for different versions of python is probably the way to go for Travis

I think we use the "other" version of python in the tests of the PythonVirtualEnvOperator?

@potiuk
Member

potiuk commented Jan 15, 2019

Ah. I was thinking more about installing the dependencies for Airflow. A bare Python installation is not as huge as one with full dependencies. I think it's enough for the Python virtualenv operators to just have the other Python versions installed (without all the Airflow deps).

@Fokko
Contributor

Fokko commented Jan 15, 2019

There is already a PR for having a Python2 and Python3 image for the CI: apache/airflow-ci#4

I do think we should start simple, and go from there. Having a lot of dependencies also makes it harder to maintain etc.

@potiuk
Member

potiuk commented Jan 15, 2019

Ah. That's nice.

What do you think, @Fokko, about bringing the airflow-ci Dockerfile into the main airflow project? I think it makes sense for it to be in the airflow repo rather than in a separate project.

Now that we have Dockerhub enabled, it will just be a matter of bringing it in and defining a separate Dockerfile location as described in apache/airflow-ci#4. It was already discussed (I made a comment about it) in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-7+Simplified+development+workflow and it seems that being able to change the Dockerfiles together with the code you are working on in your fork would be a nice thing.

Then anyone could set up their own DockerHub + Travis CI account and modify the CI scripts to use their image rather than the apache ones. I would also like to use the same approach (I am already doing it in our fork) to run system tests for GCP with Cloud Build (which is much better integrated for GCP system tests) - as described in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-4+Support+for+System+Tests+for+external+systems

I just finished a big piece of work, so over the next couple of days I can make some PR(s) to bring the modularisation in and to add Travis CI (and possibly later Cloud Build) images to the 'airflow' repo.

@potiuk
Member

potiuk commented Jan 15, 2019

Let me know if you think it's a good idea, then I can proceed with it :)

@potiuk
Member

potiuk commented Jan 15, 2019

One more comment: in our system tests we encountered a lot of trouble with differences between Python 3.5 and 3.6. There are some subtle differences (the way Popen.stdout returns bytes or strings, etc.) that make code behave differently, and those were often only caught by the system tests. I think there is a bit of confusion with officially supported versions: in tox we have 3.5 and 2.7, but in Google Cloud Composer the default Python 3 version is 3.6. So I guess having three images (2.7, 3.5, 3.6) could be better than two (2/3) :).

@ashb
Member

ashb commented Jan 15, 2019

3.5 vs. 3.6. There are some subtle differences (the way how Popen.stdout returns bytes or strings etc.)

On Python 3.5 vs 3.6? With otherwise identical environments?!

@Fokko
Contributor

Fokko commented Jan 15, 2019

Hi @potiuk, that sounds like a great idea. Like I said earlier, we should keep stuff simple. Having the CI image inside the Airflow repository itself makes it easier to maintain. Right now we have to push changes to two repositories, which does not really work in practice.

I like the idea you propose. Maybe it would be better to first strip out stuff like tox, which will also impact the image. However, I'm fine with any approach. Let me know if I can help.

@potiuk
Member

potiuk commented Jan 15, 2019

@ashb. Right - not Popen. The Popen difference was 2.7 vs. 3.x of course, but there were others we hit.

One 3.5 vs. 3.6 difference I documented as a bug (the one that gave me the most headaches). You will understand when you look at it why it was a headache and why I needed to test it. There is a very, very subtle difference in urlparse behaviour (the hostname is not lowercased for some URL types) in Python 3.6 vs. 3.5 (see https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-3615). I tested in 3.6 and it did not work in 3.5.

The other one, which we solved before it hit the repo, was dict insertion ordering (which works in 3.6 as an implementation detail but not in 3.5; in 3.7 it became a language feature). There was a comparison of keys in Bigtable column families vs. the spec, and the initial implementation was implicitly relying on dict ordering (it did not use OrderedDict initially).

@potiuk
Member

potiuk commented Jan 15, 2019

@Fokko - right. I will work on it next after some other small PRs :).

@potiuk
Member

potiuk commented Jan 15, 2019

And yes, stripping tox is definitely what should happen; once you have several Docker images with different virtualenvs/Python versions, tox is really just overhead.

@potiuk
Member

potiuk commented Jan 16, 2019

@ashb. I tracked down the root cause of the "Popen" differences I had in mind. Now I know why I thought it was related to Popen. It was in fact a json.loads() behaviour difference that often caused us problems when we developed in 3.6 and ran in 3.5.

In python3 (both 3.5 and 3.6) check_output() always produces bytes (in 2.7 it was str of course).

But json.loads() in Python 2.7 and 3.5 can only consume str. Fortunately (or unfortunately), in Python 3.6 json.loads() can also consume "utf-8" encoded bytes (https://docs.python.org/3/whatsnew/3.6.html#json).

We often run a gcloud command (for setting up / tearing down the test environment) that produces JSON output, which we then parse and use in Python to do something (for example kill all running compute instances).

So then the problem is that json.loads(subprocess.check_output()) works in python 2.7 (with 'str') and in python 3.6 (with bytes) but does not work for 3.5 (json.loads() cannot consume bytes).

It can easily be fixed with json.loads(subprocess.check_output().decode("utf-8")), which works on all versions. But it is super easy to forget about when your main development environment is 3.6 :)
Here is an easily reproducible example, run in Python 2.7, 3.5 and 3.6:

Python 2.7

[potiuk:~/code/airflow-breeze/workspaces/polidea/airflow] AIRFLOW-3710-fix-warnings-in-Google-Cloud-Base-Hook(+1/-1)+ 27m45s ± python2.7
Python 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json; import subprocess; x=json.loads(subprocess.check_output(['gcloud','compute','instances','list','--format=json']));x
[{u'cpuPlatform': u'Unknown CPU Platform', u'deletionProtection': False, u'kind': u'compute#instance', u'canIpForward': False, u'description': u'', u'zone': u'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b', u'tags': {u'fingerprint': u'42WmSpB8rSM='}, u'labelFingerprint': u'42WmSpB8rSM=', u'disks': [{u'deviceName': u'pa-1', u'kind': u'compute#attachedDisk', u'autoDelete': True, u'index': 0, u'boot': True, u'guestOsFeatures': [{u'type': u'VIRTIO_SCSI_MULTIQUEUE'}], u'licenses': [u'https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-9-stretch'], u'mode': u'READ_WRITE', u'interface': u'SCSI', u'type': u'PERSISTENT', u'source': u'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b/disks/pa-1'}], u'startRestricted': False, u'name': u'pa-1', u'status': u'TERMINATED', u'scheduling': {u'automaticRestart': True, u'preemptible': False, u'onHostMaintenance': u'MIGRATE'}, u'machineType': u'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b/machineTypes/n1-standard-1', u'serviceAccounts': [{u'scopes': [u'https://www.googleapis.com/auth/cloud-platform'], u'email': u'uberdarek@polidea-airflow.iam.gserviceaccount.com'}], u'networkInterfaces': [{u'kind': u'compute#networkInterface', u'network': u'https://www.googleapis.com/compute/v1/projects/polidea-airflow/global/networks/default', u'accessConfigs': [{u'networkTier': u'PREMIUM', u'kind': u'compute#accessConfig', u'type': u'ONE_TO_ONE_NAT', u'name': u'External NAT'}], u'networkIP': u'10.156.0.2', u'fingerprint': u'ah2_E5U9DlI=', u'subnetwork': u'https://www.googleapis.com/compute/v1/projects/polidea-airflow/regions/europe-west3/subnetworks/default', u'name': u'nic0'}], u'creationTimestamp': u'2018-09-25T01:29:48.779-07:00', u'id': u'2480086944131075860', u'selfLink': u'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b/instances/pa-1', u'metadata': {u'kind': u'compute#metadata', u'fingerprint': u'GDPUYxlwHe4='}}]
>>> exit()

Python 3.6:

[potiuk:~/code/airflow-breeze/workspaces/polidea/airflow] AIRFLOW-3710-fix-warnings-in-Google-Cloud-Base-Hook(+1/-1)+ 28m9s ± python3.6
Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json; import subprocess; x=json.loads(subprocess.check_output(['gcloud','compute','instances','list','--format=json']));x
[{'canIpForward': False, 'cpuPlatform': 'Unknown CPU Platform', 'creationTimestamp': '2018-09-25T01:29:48.779-07:00', 'deletionProtection': False, 'description': '', 'disks': [{'autoDelete': True, 'boot': True, 'deviceName': 'pa-1', 'guestOsFeatures': [{'type': 'VIRTIO_SCSI_MULTIQUEUE'}], 'index': 0, 'interface': 'SCSI', 'kind': 'compute#attachedDisk', 'licenses': ['https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-9-stretch'], 'mode': 'READ_WRITE', 'source': 'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b/disks/pa-1', 'type': 'PERSISTENT'}], 'id': '2480086944131075860', 'kind': 'compute#instance', 'labelFingerprint': '42WmSpB8rSM=', 'machineType': 'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b/machineTypes/n1-standard-1', 'metadata': {'fingerprint': 'GDPUYxlwHe4=', 'kind': 'compute#metadata'}, 'name': 'pa-1', 'networkInterfaces': [{'accessConfigs': [{'kind': 'compute#accessConfig', 'name': 'External NAT', 'networkTier': 'PREMIUM', 'type': 'ONE_TO_ONE_NAT'}], 'fingerprint': 'ah2_E5U9DlI=', 'kind': 'compute#networkInterface', 'name': 'nic0', 'network': 'https://www.googleapis.com/compute/v1/projects/polidea-airflow/global/networks/default', 'networkIP': '10.156.0.2', 'subnetwork': 'https://www.googleapis.com/compute/v1/projects/polidea-airflow/regions/europe-west3/subnetworks/default'}], 'scheduling': {'automaticRestart': True, 'onHostMaintenance': 'MIGRATE', 'preemptible': False}, 'selfLink': 'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b/instances/pa-1', 'serviceAccounts': [{'email': 'uberdarek@polidea-airflow.iam.gserviceaccount.com', 'scopes': ['https://www.googleapis.com/auth/cloud-platform']}], 'startRestricted': False, 'status': 'TERMINATED', 'tags': {'fingerprint': '42WmSpB8rSM='}, 'zone': 'https://www.googleapis.com/compute/v1/projects/polidea-airflow/zones/europe-west3-b'}]
>>> quit()

Python 3.5

[potiuk:~/code/airflow-breeze/workspaces/polidea/airflow] AIRFLOW-3710-fix-warnings-in-Google-Cloud-Base-Hook(+1/-1)+ 28m26s ± python3.5
Python 3.5.4 (v3.5.4:3f56838976, Aug  7 2017, 12:56:33)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json; import subprocess; x=json.loads(subprocess.check_output(['gcloud','compute','instances','list','--format=json']));x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
>>>

@potiuk
Member

potiuk commented Jan 16, 2019

@ffinfo @Fokko @ashb -> the multi-layered image seems to be even slightly smaller than the mono-layered one, and when I calculated some assumed usage scenarios I got 5 GB (multi-layered) vs 16 GB (mono-layered) downloaded by a user over the course of 8 weeks. PTAL and let me know what you think, but I think it's a no-brainer. I opened PR #4543 to the main repo and wrote a proposal describing my calculations and assumptions in detail in AIP-10: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+official+Airflow+image

I am happy to discuss my findings tomorrow, and I hope I can convince you that the multi-layered approach is far better in general.

wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019