Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 11 additions & 11 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,17 @@ repos:
hooks:
- id: identity
- id: check-hooks-apply
- repo: https://github.com/thlorenz/doctoc.git
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it here because the sequence Licence -> TOC ended up with broken licence always when I added new .md file.

rev: v2.1.0
hooks:
- id: doctoc
name: Add TOC for md files
files:
^README\.md$|^CONTRIBUTING\.md$|^UPDATING.*\.md$|^chart/UPDATING.*\.md$|^dev/.*\.md$|^dev/.*\.rst$
exclude: ^airflow/_vendor/
args:
- "--maxlevel"
- "2"
- repo: https://github.com/Lucas-C/pre-commit-hooks
rev: v1.1.10
hooks:
Expand Down Expand Up @@ -142,17 +153,6 @@ repos:
- --fuzzy-match-generates-todo
files: \.mermaid$
exclude: ^airflow/_vendor/
- repo: https://github.com/thlorenz/doctoc.git
rev: v2.1.0
hooks:
- id: doctoc
name: Add TOC for md files
files:
^README\.md$|^CONTRIBUTING\.md$|^UPDATING.*\.md$|^chart/UPDATING.*\.md$|^dev/.*\.md$|^dev/.*\.rst$
exclude: ^airflow/_vendor/
args:
- "--maxlevel"
- "2"
- repo: https://github.com/psf/black
rev: 21.12b0
hooks:
Expand Down
151 changes: 151 additions & 0 deletions dev/breeze/doc/adr/0008-fixing-group-permissions-before-build.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [8. Fixing group permissions before build](#8-fixing-group-permissions-before-build)
- [Status](#status)
- [Context](#context)
- [Decision](#decision)
- [Consequences](#consequences)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

# 8. Fixing group permissions before build

Date: 2022-02-13

## Status

Draft

## Context

The problem to solve is the fact that depending on configuration of the local
distribution of Linux, remote cache can be invalidated resulting in
rebuilding most of the image, even if the files that are used for the
build have not changed.


It is a result of a few combined issues.

1) Git does not store "group write" permissions for files you put in the
repository. This is a historical decision also coming from the fact that
"group" permissions only make sense for POSIX compliant systems. This
Is not available for Windows, for example.

3) When the files are checked out, by default git uses "umask" of the current
system to determine whether the group permissions to write are
[set or not](https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresharedRepository).
That's the default behaviour when 'false' is set.

4) Unfortunately POSIX does not define what should be the default umask, It
is usually either `0022` or `0002` - depending on the system. This basically
mean that your file when you check it out with git will be "group read" when
umask is `0022` but will be "group read-write" when umask is `0002`. More about
umask and difference between those two settings can be found here:
https://www.cyberciti.biz/tips/understanding-linux-unix-umask-value-usage.html


As the result, when you check out airflow project on `0022` system you will get this:


```
ls -la scripts/ci
total 132
drwxr-xr-x 19 jarek jarek 4096 Feb 5 20:49 .
drwxr-xr-x 8 jarek jarek 4096 Feb 5 20:49 ..
drwxr-xr-x 2 jarek jarek 4096 Feb 5 20:49 build_airflow
```

But when you check out airflow project on `0002` umask system, you will get
this: (note "w" in the group permission part):

```
ls -la scripts/ci
total 132
drwxrwxr-x 19 jarek jarek 4096 Feb 5 20:49 .
drwxrwxr-x 8 jarek jarek 4096 Feb 5 20:49 ..
drwxrwxr-x 2 jarek jarek 4096 Feb 5 20:49 build_airflow
```

Pretty much all Linux distributions use `0022` umask for root. But when it
comes to "regular" users, it is different - Ubuntu up until recently used
"0002" for regular users, and Debian/Mint "0022". The result is - we cannot
rely on the "w" bit being set for files when users check it out - it might or
might not be set, and it's beyond of our control when `git checkout` is
executed.

On its own - this is not a problem, but it is a HUGE problem when it comes to
caching Docker builds. The problem is, that when you build Docker image, docker
actually uses the permissions on the file to determine whether a file changed
or not. In the case above. those two "build_airflow" files will have identical
content but Docker finds them "different" (because of the "w" permission).

In this case whenever in Dockerfile we will use:

```
COPY scripts/ci/build_airlfow /opt/airflow/scripts/ci/build
```

Docker "thinks" that the file has changed and will simply invalidate the cache.
So when we use remote cache (as we do) this means that docker will always
rebuild the docker image from that step, because it will not realise that this
is in fact the same file. This is a huge problem in our case, because in order
to speed up local rebuilds of Breeze image. we use remote cache, and we have to
make sure that the file that was used to build airflow will be properly seen as
"unchanged" by anyone who builds the image locally - in order to heavily
optimize the build time. Pulling cached layers is usually much faster than
rebuilding them - especially that rebuild often involves pulling `pip` packages
and `apt` packages.

There is no other way to fix it but to make sure that write permissions for
all the files that can be potentially used during the Docker image build should
have the `write` group permission removed.

## Decision

Breeze should fix the problem just before building. We should find all files
and directories which are present in git repository and that have "w" bit set
and remove the bit. After this operation none of the files that are going to be
used for docker build will have the "w" bit set for them - and the cache is
invalidated.

Note! We cannot "blank" remove all the files in airflow directory, because
it can take a very long time when the user develops/builds Airflow. When you
develop and run airflow locally, there are a lot of files created and
generated - log files, build artifacts, but also node modules (a lot of) and
generated static files. Those files are not needed/used in the Docker build
(we make sure of that by removing all files by default and only selectively
adding files that we need). On a number of systems (especially those where
the filesystem is mounted to docker container via user filesystem - MacOS
or Windows) just changing permission of all those files can change tens of
seconds - so instead we opted to change only permissions of those files
that are added to Git.

## Consequences

As a consequence of this change, no matter what umask your system is configured
with, when you use remote cache and the "scripts" to build image have not been
modified, you will be able to use the remote cache and pull the layers that
were stored in the cache rather than rebuild it, which might save multiple minutes
on rebuilding the image (depending on the ratio of your network vs. disk/cpu
speed).
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [9. Exclude all files from dockerignore by default](#9-exclude-all-files-from-dockerignore-by-default)
- [Status](#status)
- [Context](#context)
- [Decision](#decision)
- [Consequences](#consequences)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

# 9. Exclude all files from dockerignore by default

Date: 2022-02-13

## Status

Draft

## Context

Building Docker container always starts with sending the build
context first - depending on a number of files in the context.
The context might have even many hundreds of MB, which -
depending on where your docker builder is and how fast your
system is - might lead to even tens of seconds of delays before
the docker build command is run and the actual build starts.

The context has to be compressed, sent, decompressed, so it
takes CPU, networking and I/O.

Airflow - unfortunately has Dockerfiles, and sources
directly in the top-level of the project. There is no "src"
folder and by default the docker commands use the folder
where the "Dockerfile" is placed and the context files cannot
be taken from outside the context. Thus - Dockerfiles have
to be put at the "top-level" of the airflow project.

By default, all files in the current context should be sent
as context unless you ignore them via .dockerfile - in a
concept that is similar to .gitignore. Airflow has many
files that are huge and generated during running and building
(for example node_modules) but also `.egginfo` and plenty of
other files in many folders.

It sounds like a reasonable approach to do to ignore specific
folders, however it has one drawback. You might simply not
realise that some newly generated files have been added and
increase the context - thus increase the overhead needed to
build the docker images. There is no way to prevent or check
such accidental additions - for example when refactoring
files, or adding new functionalities.

## Decision

There are a number of strategies that can address the problem,
ranging by convention and automated checks but in our case,
we have multiple independent contributors and committers
reviewing the code, such a change might easily slip-through.
So the solution should be "self-managing". Luckily, there is
a way that has been discussed in a number of places but
[notably here](https://stackoverflow.com/questions/28097064/dockerignore-ignore-everything-except-a-file-and-the-dockerfile)

The strategy involves ignoring all files by default and only
selectively excluding certain folders and patterns that should
be allowed to be part of the context.

The `.dockerignore` file has
[appropriate functionality](https://docs.docker.com/engine/reference/builder/#dockerignore-file)

This is what we decided to use for our Dockerfile and Dockerfile.ci.

## Consequences

There are two consequences of this decision:

* whenever new files (that do not follow the "approved" patterns
will be added to the airflow repository, they will not increase
the size of the context
* we have to still regularly monitor the context to see whether
the approved patterns did not - by accident approved some
unnecessary files, but that should be a rather rare event
* whenever someone wants to add something to our container images,
and it is not a part of already "approved" patterns, the file
will be missing during the build, which might lead to a little
surprise, but it is explained in the `.dockerignore` what to do
in this case, and `.dockerignore` is the place where you would
look for a problem anyway. The users should be guided to add
new pattern to the `.dockerignore` in this case.