Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlay2 mount fails for larger Dockerfiles when length of data-root is 24 #1012

Closed
2 of 3 tasks
maresb opened this issue May 20, 2020 · 11 comments · Fixed by moby/moby#41830
Closed
2 of 3 tasks

Overlay2 mount fails for larger Dockerfiles when length of data-root is 24 #1012

maresb opened this issue May 20, 2020 · 11 comments · Fixed by moby/moby#41830

Comments

@maresb
Copy link

maresb commented May 20, 2020

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Expected behavior

Regardless of data-root when building a large Dockerfile such as

FROM alpine:3.9.6
RUN echo 1 > 1
RUN echo 2 > 2
RUN echo 3 > 3
⋮
RUN echo 999 > 999
RUN echo 1000 > 1000

I would expect either a successful build, or the error

Step 126/1001 : RUN echo 125 > 125
 ---> Running in 2f2e56da89e0
max depth exceeded

Actual behavior

Step 61/1001 : RUN echo 60 > 60
 ---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
 ---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory

Steps to reproduce the behavior

  • Install Ubuntu 18.04.4 Desktop, selecting all the default options, including downloading updates while installing. (This also selects the universe apt source, from which Docker version 19.03.6, build 369ce74a3c can be installed.)
  • Run the following commands:
sudo apt-get install -y docker.io

cat <<'EOF' >> make_dockerfile.py
with open("Dockerfile", "w") as d:
     d.write("FROM alpine:3.9.6\n")
     for n in range(1, 1 + 1000):
         d.write(f"RUN echo {n} > {n}\n")
EOF

python3 make_dockerfile.py 

cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "data-root": "/23456789112345678921234"
}
EOF

sudo service docker stop
sudo service docker start
sudo docker build --no-cache .

Output is

Sending build context to Docker daemon  4.457MB
Step 1/1001 : FROM alpine:3.9.6
 ---> 78a2ce922f86
Step 2/1001 : RUN echo 1 > 1
 ---> Running in 74bb7c8e4d8e
Removing intermediate container 74bb7c8e4d8e
 ---> 410d62b102d3
Step 3/1001 : RUN echo 2 > 2
⋮
Step 61/1001 : RUN echo 60 > 60
 ---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
 ---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory

Notes:

  1. Performing docker run on the layer 80a12a18a241 produces the same error.
  2. From my testing, this bug seems to arise when and only when len(data-root) is exactly 24.
  3. The bug persists even after upgrading the Ubuntu-packaged Docker version 19.03.6, build 369ce74a3c to the latest stable Docker version 19.03.9, build 9d988398e7 following these instructions.
  4. This bug also presents on my desktop, and has caused me tremendous grief over the last several months. Thanks to this comment I managed to construct the above Dockerfile which builds sufficiently quickly that I could trace the problem to my data-root and reproduce it consistently in a VM.

Related:
#791
docker/for-mac#1396
docker/for-mac#1974

Environment details

Output of docker version:

Docker version 19.03.9, build 9d988398e7

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 181
 Server Version: 19.03.9
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.3.0-28-generic
 Operating System: Ubuntu 18.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 1.941GiB
 Name: b-Standard-PC-i440FX-PIIX-1996
 ID: A7L6:GOT6:FNE3:AA2G:FJTB:NWWB:GGHA:RAS4:VIA3:FIZ7:OJW6:ZVCI
 Docker Root Dir: /23456789112345678921234
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.)
Present in both my physical machine and in QEMU/KVM virtualization.
Processor: Core i7-6700 BX80662I76700

@maresb maresb changed the title Overlay2 mount fails for larger Dockerfiles with data-root length 24 Overlay2 mount fails for larger Dockerfiles when length of data-root is 24 May 20, 2020
@thaJeztah
Copy link
Member

I suspect you're hitting ARG_MAX (which limits the number of arguments that can be passed to the linux mount to mount the overlay layers) early because /23456789112345678921234/ is longer than the default /var/lib/docker/

@maresb
Copy link
Author

maresb commented Jun 11, 2020

So what's weird is that if length is 23 or 25 it works. I only see the problem when the length is exactly 24. I suspect that there is a faulty edge case somewhere.

(Note that the length of /var/lib/docker/ is 16, which does not seem related to the number 24.)

@thaJeztah
Copy link
Member

That's definitely ..... interesting 🤔

No direct ideas 😂

@maresb
Copy link
Author

maresb commented Jun 11, 2020

After I lost several days of productive time to this bug, I decided it was worth dedicating a full day to making it reproducible inside a VM. I succeeded, so if this piques the interest of a curious developer, it would be great if they could follow my reproducing instructions and track down the bug. (Unfortunately I can't do it myself since I'm only really proficient in Python atm.)

A lot of other people seem to be afflicted by similar bugs (see references), so it's not just some theoretical curiosity.

@ob
Copy link

ob commented Dec 19, 2020

I can repro this on RedHat 7.6. One clue is that the output of dmesg -T is:

[Sat Dec 19 02:35:40 2020] overlayfs: failed to resolve '/23456789112345678921234/overlay2/89e435a1b24583c463abb73e8abfad8bf8a88312ef8253455390c5fa0a765517-init/wor': -2

Note that the pathname is truncated.

I suspect the combination of many layers (61) and a docker-root of exactly 24 bytes might be triggering some bug in the overlay filesystem in the kernel.

@ob
Copy link

ob commented Dec 21, 2020

So the problem wasn't in the kernel, it was an off-by-one bug in docker's overlay2 driver.

ob added a commit to ob/moby that referenced this issue Dec 21, 2020
This is a fix for docker/for-linux#1012.

The code was not considering that C strings are NULL-terminated so
we need to leave one extra byte.

Without this fix, the testcase in docker/for-linux#1012
fails with

```
Step 61/1001 : RUN echo 60 > 60
 ---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
 ---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory
```

with the output of `dmesg -T` as:

```
[Sat Dec 19 02:35:40 2020] overlayfs: failed to resolve '/23456789112345678921234/overlay2/89e435a1b24583c463abb73e8abfad8bf8a88312ef8253455390c5fa0a765517-init/wor': -2
```

with this fix, you get the expected:

```
Step 126/1001 : RUN echo 125 > 125
 ---> Running in 2f2e56da89e0
max depth exceeded
```

Signed-off-by: Oscar Bonilla <6f6231@gmail.com>
@ob
Copy link

ob commented Dec 21, 2020

@maresb Thanks a lot for providing a script for reproducing. It was super-helpful in debugging this.

@maresb
Copy link
Author

maresb commented Dec 21, 2020

Wow, @ob, that's amazing work!!! I am so glad that you were able to fix this, and that my report was helpful, especially since it took me so much time.

I didn't notice the dmesg error. I'll have to keep this in mind for the next time I have some kernel-ish level problem.

For my own knowledge, I'm trying to understand the source code around this point. From what I can tell, this mountData string builds up the arguments for the mount command, and it must fit into a single page (in my case, pageSize=4096 bytes). In case mountData is too big, it tries to regenerate mountData using relative paths. But whenever mountData is exactly the pageSize, the shortening is not triggered due to > instead of , and because of the null terminator it gets truncated, leading to the error. Is this more or less correct?

I feel relieved that you managed to fix this, since I had a bit the feeling that I was alone and that nobody would be able to reproduce this. Thank you!!!

@ob
Copy link

ob commented Dec 21, 2020

@maresb Yeah, your description of the bug is spot-on, that was exactly what was happening.

docker-jenkins pushed a commit to docker/docker-ce that referenced this issue Dec 22, 2020
This is a fix for docker/for-linux#1012.

The code was not considering that C strings are NULL-terminated so
we need to leave one extra byte.

Without this fix, the testcase in docker/for-linux#1012
fails with

```
Step 61/1001 : RUN echo 60 > 60
 ---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
 ---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory
```

with the output of `dmesg -T` as:

```
[Sat Dec 19 02:35:40 2020] overlayfs: failed to resolve '/23456789112345678921234/overlay2/89e435a1b24583c463abb73e8abfad8bf8a88312ef8253455390c5fa0a765517-init/wor': -2
```

with this fix, you get the expected:

```
Step 126/1001 : RUN echo 125 > 125
 ---> Running in 2f2e56da89e0
max depth exceeded
```

Signed-off-by: Oscar Bonilla <6f6231@gmail.com>
Upstream-commit: c923f6ac3bf61c8eb369a978b55a5d3f1fad0fbb
Component: engine
@thaJeztah
Copy link
Member

moby/moby#41830 was merged, and is included in docker 20.10.2, so this should be fixed in that version

@kbulygin
Copy link

@maresb Thank you for the description!

For those who cannot upgrade: adding a RUN true to Dockerfile solved the problem for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants