-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Adds ADRs resulting from discussions during internship #21549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
151 changes: 151 additions & 0 deletions
151
dev/breeze/doc/adr/0008-fixing-group-permissions-before-build.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,151 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| <!-- START doctoc generated TOC please keep comment here to allow auto update --> | ||
| <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> | ||
| **Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* | ||
|
|
||
| - [8. Fixing group permissions before build](#8-fixing-group-permissions-before-build) | ||
| - [Status](#status) | ||
| - [Context](#context) | ||
| - [Decision](#decision) | ||
| - [Consequences](#consequences) | ||
|
|
||
| <!-- END doctoc generated TOC please keep comment here to allow auto update --> | ||
|
|
||
| # 8. Fixing group permissions before build | ||
|
|
||
| Date: 2022-02-13 | ||
|
|
||
| ## Status | ||
|
|
||
| Draft | ||
|
|
||
| ## Context | ||
|
|
||
| The problem to solve is the fact that depending on configuration of the local | ||
| distribution of Linux, remote cache can be invalidated resulting in | ||
| rebuilding most of the image, even if the files that are used for the | ||
| build have not changed. | ||
|
|
||
|
|
||
| It is a result of a few combined issues. | ||
|
|
||
| 1) Git does not store "group write" permissions for files you put in the | ||
| repository. This is a historical decision also coming from the fact that | ||
| "group" permissions only make sense for POSIX compliant systems. This | ||
| Is not available for Windows, for example. | ||
|
|
||
| 3) When the files are checked out, by default git uses "umask" of the current | ||
| system to determine whether the group permissions to write are | ||
| [set or not](https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresharedRepository). | ||
| That's the default behaviour when 'false' is set. | ||
|
|
||
| 4) Unfortunately POSIX does not define what should be the default umask, It | ||
| is usually either `0022` or `0002` - depending on the system. This basically | ||
| mean that your file when you check it out with git will be "group read" when | ||
| umask is `0022` but will be "group read-write" when umask is `0002`. More about | ||
| umask and difference between those two settings can be found here: | ||
| https://www.cyberciti.biz/tips/understanding-linux-unix-umask-value-usage.html | ||
|
|
||
|
|
||
| As the result, when you check out airflow project on `0022` system you will get this: | ||
|
|
||
|
|
||
| ``` | ||
| ls -la scripts/ci | ||
| total 132 | ||
| drwxr-xr-x 19 jarek jarek 4096 Feb 5 20:49 . | ||
| drwxr-xr-x 8 jarek jarek 4096 Feb 5 20:49 .. | ||
| drwxr-xr-x 2 jarek jarek 4096 Feb 5 20:49 build_airflow | ||
| ``` | ||
|
|
||
| But when you check out airflow project on `0002` umask system, you will get | ||
| this: (note "w" in the group permission part): | ||
|
|
||
| ``` | ||
| ls -la scripts/ci | ||
| total 132 | ||
| drwxrwxr-x 19 jarek jarek 4096 Feb 5 20:49 . | ||
| drwxrwxr-x 8 jarek jarek 4096 Feb 5 20:49 .. | ||
| drwxrwxr-x 2 jarek jarek 4096 Feb 5 20:49 build_airflow | ||
| ``` | ||
|
|
||
| Pretty much all Linux distributions use `0022` umask for root. But when it | ||
| comes to "regular" users, it is different - Ubuntu up until recently used | ||
| "0002" for regular users, and Debian/Mint "0022". The result is - we cannot | ||
| rely on the "w" bit being set for files when users check it out - it might or | ||
| might not be set, and it's beyond of our control when `git checkout` is | ||
| executed. | ||
|
|
||
| On its own - this is not a problem, but it is a HUGE problem when it comes to | ||
| caching Docker builds. The problem is, that when you build Docker image, docker | ||
| actually uses the permissions on the file to determine whether a file changed | ||
| or not. In the case above. those two "build_airflow" files will have identical | ||
| content but Docker finds them "different" (because of the "w" permission). | ||
|
|
||
| In this case whenever in Dockerfile we will use: | ||
|
|
||
| ``` | ||
| COPY scripts/ci/build_airlfow /opt/airflow/scripts/ci/build | ||
| ``` | ||
|
|
||
| Docker "thinks" that the file has changed and will simply invalidate the cache. | ||
| So when we use remote cache (as we do) this means that docker will always | ||
| rebuild the docker image from that step, because it will not realise that this | ||
| is in fact the same file. This is a huge problem in our case, because in order | ||
| to speed up local rebuilds of Breeze image. we use remote cache, and we have to | ||
| make sure that the file that was used to build airflow will be properly seen as | ||
| "unchanged" by anyone who builds the image locally - in order to heavily | ||
| optimize the build time. Pulling cached layers is usually much faster than | ||
| rebuilding them - especially that rebuild often involves pulling `pip` packages | ||
| and `apt` packages. | ||
|
|
||
| There is no other way to fix it but to make sure that write permissions for | ||
| all the files that can be potentially used during the Docker image build should | ||
| have the `write` group permission removed. | ||
|
|
||
| ## Decision | ||
|
|
||
| Breeze should fix the problem just before building. We should find all files | ||
| and directories which are present in git repository and that have "w" bit set | ||
| and remove the bit. After this operation none of the files that are going to be | ||
| used for docker build will have the "w" bit set for them - and the cache is | ||
| invalidated. | ||
|
|
||
| Note! We cannot "blank" remove all the files in airflow directory, because | ||
| it can take a very long time when the user develops/builds Airflow. When you | ||
| develop and run airflow locally, there are a lot of files created and | ||
| generated - log files, build artifacts, but also node modules (a lot of) and | ||
| generated static files. Those files are not needed/used in the Docker build | ||
| (we make sure of that by removing all files by default and only selectively | ||
| adding files that we need). On a number of systems (especially those where | ||
| the filesystem is mounted to docker container via user filesystem - MacOS | ||
| or Windows) just changing permission of all those files can change tens of | ||
| seconds - so instead we opted to change only permissions of those files | ||
| that are added to Git. | ||
|
|
||
| ## Consequences | ||
|
|
||
| As a consequence of this change, no matter what umask your system is configured | ||
| with, when you use remote cache and the "scripts" to build image have not been | ||
| modified, you will be able to use the remote cache and pull the layers that | ||
| were stored in the cache rather than rebuild it, which might save multiple minutes | ||
| on rebuilding the image (depending on the ratio of your network vs. disk/cpu | ||
| speed). |
109 changes: 109 additions & 0 deletions
109
dev/breeze/doc/adr/0009-exclude-all-files-from-dockerignore-by-default.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| <!-- START doctoc generated TOC please keep comment here to allow auto update --> | ||
| <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> | ||
| **Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* | ||
|
|
||
| - [9. Exclude all files from dockerignore by default](#9-exclude-all-files-from-dockerignore-by-default) | ||
| - [Status](#status) | ||
| - [Context](#context) | ||
| - [Decision](#decision) | ||
| - [Consequences](#consequences) | ||
|
|
||
| <!-- END doctoc generated TOC please keep comment here to allow auto update --> | ||
|
|
||
| # 9. Exclude all files from dockerignore by default | ||
|
|
||
| Date: 2022-02-13 | ||
|
|
||
| ## Status | ||
|
|
||
| Draft | ||
|
|
||
| ## Context | ||
|
|
||
| Building Docker container always starts with sending the build | ||
| context first - depending on a number of files in the context. | ||
| The context might have even many hundreds of MB, which - | ||
| depending on where your docker builder is and how fast your | ||
| system is - might lead to even tens of seconds of delays before | ||
| the docker build command is run and the actual build starts. | ||
|
|
||
| The context has to be compressed, sent, decompressed, so it | ||
| takes CPU, networking and I/O. | ||
|
|
||
| Airflow - unfortunately has Dockerfiles, and sources | ||
| directly in the top-level of the project. There is no "src" | ||
| folder and by default the docker commands use the folder | ||
| where the "Dockerfile" is placed and the context files cannot | ||
| be taken from outside the context. Thus - Dockerfiles have | ||
| to be put at the "top-level" of the airflow project. | ||
|
|
||
| By default, all files in the current context should be sent | ||
| as context unless you ignore them via .dockerfile - in a | ||
| concept that is similar to .gitignore. Airflow has many | ||
| files that are huge and generated during running and building | ||
| (for example node_modules) but also `.egginfo` and plenty of | ||
| other files in many folders. | ||
|
|
||
| It sounds like a reasonable approach to do to ignore specific | ||
| folders, however it has one drawback. You might simply not | ||
| realise that some newly generated files have been added and | ||
| increase the context - thus increase the overhead needed to | ||
| build the docker images. There is no way to prevent or check | ||
| such accidental additions - for example when refactoring | ||
| files, or adding new functionalities. | ||
|
|
||
| ## Decision | ||
|
|
||
| There are a number of strategies that can address the problem, | ||
| ranging by convention and automated checks but in our case, | ||
| we have multiple independent contributors and committers | ||
| reviewing the code, such a change might easily slip-through. | ||
| So the solution should be "self-managing". Luckily, there is | ||
| a way that has been discussed in a number of places but | ||
| [notably here](https://stackoverflow.com/questions/28097064/dockerignore-ignore-everything-except-a-file-and-the-dockerfile) | ||
|
|
||
| The strategy involves ignoring all files by default and only | ||
| selectively excluding certain folders and patterns that should | ||
| be allowed to be part of the context. | ||
|
|
||
| The `.dockerignore` file has | ||
| [appropriate functionality](https://docs.docker.com/engine/reference/builder/#dockerignore-file) | ||
|
|
||
| This is what we decided to use for our Dockerfile and Dockerfile.ci. | ||
|
|
||
| ## Consequences | ||
|
|
||
| There are two consequences of this decision: | ||
|
|
||
| * whenever new files (that do not follow the "approved" patterns | ||
| will be added to the airflow repository, they will not increase | ||
| the size of the context | ||
| * we have to still regularly monitor the context to see whether | ||
| the approved patterns did not - by accident approved some | ||
| unnecessary files, but that should be a rather rare event | ||
| * whenever someone wants to add something to our container images, | ||
| and it is not a part of already "approved" patterns, the file | ||
| will be missing during the build, which might lead to a little | ||
| surprise, but it is explained in the `.dockerignore` what to do | ||
| in this case, and `.dockerignore` is the place where you would | ||
| look for a problem anyway. The users should be guided to add | ||
| new pattern to the `.dockerignore` in this case. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it here because the sequence Licence -> TOC ended up with broken licence always when I added new .md file.