Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

docker run is flaky #3824

Closed
farcaller opened this Issue · 17 comments

5 participants

@farcaller

Running docker 0.7.6 on arm host (RPi) is flaky. Sometimes containers fail to start:

% docker run -t -i -volumes-from test.data0 cellofellow/rpi-arch /bin/bash
2014/01/29 14:15:15 operation not permitted
% docker run -t -i -volumes-from test.data0 cellofellow/rpi-arch /bin/bash
[root@220221543b0c /]# exit
% docker run -t -i -volumes-from test.data0 cellofellow/rpi-arch /bin/bash
[root@04439b91eac8 /]# exit
% docker run -t -i -volumes-from test.data0 cellofellow/rpi-arch /bin/bash
2014/01/29 14:15:46 operation not permitted
% docker run -t -i -volumes-from test.data0 cellofellow/rpi-arch /bin/bash
2014/01/29 14:16:33 operation not permitted

FWIW, it might be somehow related to volumes usage, I'm not sure.

docker output for successful start:

[debug] api.go:1038 Calling POST /containers/create
2014/01/29 14:16:40 POST /v1.8/containers/create
[/var/lib/docker|cc1301e3] +job create()
[/var/lib/docker|cc1301e3] -job create() = OK (0)
[debug] api.go:1038 Calling POST /containers/{name:.*}/attach
2014/01/29 14:16:41 POST /v1.8/containers/60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80/attach?stderr=1&stdin=1&stdout=1&stream=1
[debug] container.go:381 attach: stdin: begin
[debug] container.go:416 attach: stdout: begin
[debug] container.go:454 attach: stderr: begin
[debug] container.go:497 attach: waiting for job 1/3
[debug] api.go:1038 Calling POST /containers/{name:.*}/start
2014/01/29 14:16:41 POST /v1.8/containers/60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80/start
[/var/lib/docker|cc1301e3] +job start(60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80)
[debug] container.go:329 startPty: begin of stdout pipe
[debug] container.go:340 startPty: begin of stdin pipe
[debug] container.go:1187 monitor: waiting for container 60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80 using cmd.Wait
[debug] container.go:790 Container running: true
[/var/lib/docker|cc1301e3] -job start(60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80) = OK (0)
[debug] api.go:1038 Calling POST /containers/{name:.*}/resize
2014/01/29 14:16:43 POST /v1.8/containers/60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80/resize?h=33&w=140
[/var/lib/docker|cc1301e3] +job resize(60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80, 33, 140)
[/var/lib/docker|cc1301e3] -job resize(60bd187778530ef5424f3e6b087e3d5c30d6d75061a8ab2ee88c166ff995bd80, 33, 140) = OK (0)

For failed:

[debug] api.go:1038 Calling POST /containers/create
2014/01/29 14:15:43 POST /v1.8/containers/create
[/var/lib/docker|cc1301e3] +job create()
[/var/lib/docker|cc1301e3] -job create() = OK (0)
[debug] api.go:1038 Calling POST /containers/{name:.*}/attach
2014/01/29 14:15:44 POST /v1.8/containers/5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc/attach?stderr=1&stdin=1&stdout=1&stream=1
[debug] api.go:1038 Calling POST /containers/{name:.*}/start
[debug] container.go:381 attach: stdin: begin
[debug] container.go:497 attach: waiting for job 1/3
[debug] container.go:416 attach: stdout: begin
[debug] container.go:454 attach: stderr: begin
2014/01/29 14:15:44 POST /v1.8/containers/5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc/start
[/var/lib/docker|cc1301e3] +job start(5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc)
[debug] container.go:329 startPty: begin of stdout pipe
[debug] container.go:340 startPty: begin of stdin pipe
[debug] container.go:1187 monitor: waiting for container 5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc using cmd.Wait
[debug] container.go:790 Container running: true
[/var/lib/docker|cc1301e3] -job start(5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc) = OK (0)
[debug] api.go:1038 Calling POST /containers/{name:.*}/resize
2014/01/29 14:15:46 POST /v1.8/containers/5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc/resize?h=33&w=140
[/var/lib/docker|cc1301e3] +job resize(5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc, 33, 140)
[/var/lib/docker|cc1301e3] -job resize(5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc, 33, 140) = OK (0)
[debug] container.go:331 startPty: end of stdout pipe
[debug] container.go:433 attach: stdout: end
[debug] container.go:502 attach: job 1 completed successfully
[debug] container.go:406 attach: stdin: end
[debug] container.go:471 attach: stderr: end
[debug] container.go:497 attach: waiting for job 2/3
[debug] container.go:502 attach: job 2 completed successfully
[debug] container.go:497 attach: waiting for job 3/3
[debug] container.go:502 attach: job 3 completed successfully
[debug] container.go:504 attach: all jobs completed successfully
[debug] server.go:1947 Closing buffered stdin pipe
[debug] api.go:1038 Calling GET /containers/{name:.*}/json
[debug] container.go:1191 monitor: cmd.Wait reported exit status exit status 1 for container 5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc
[debug] container.go:1194 monitor: container 5224b3345cdf49c6dcbd3e9c002a9fefd3c1ab1c8f151bcd8c3f9aafb1cb95fc finished
[debug] container.go:342 startPty: end of stdin pipe
@shykes
Owner
@farcaller

The docker is from archlinuxarm.org package.

@farcaller

FWIW, the problem lies within lxc-start:

logger[2030]: lxc-start -n 9500bca3a0ce8bc57a866e021ad6cacf942868c5e03adbe9488851ea6e99c5ed /.dockerinit -g 172.17.42.1 -i 172.17.0.22/16 -mtu 1500 -- /bin/bash
logger[2045]: lxc-start died: 1

Unfortunately, adding -l DEBUG to lxc-start seems to "fix" the issue, cannot trigger it anymore. Looks like some kind of a race condition to me (and adding debug output slows lxc-start down).

@farcaller

Looking deeper, seems that in the end that's docker fault:

lxc-start 1391449848.247 NOTICE   lxc_start - '/.dockerinit' started with pid '12539'
lxc-start 1391449848.351 INFO     lxc_error - child <12539> ended on error (1)

Any reason why docker might return 1 with no error message whatsoever?

@farcaller

Looking around the logs, an interesting finding:

lxc-start -n 4c0889dfee34db7eaa805885571dd052f987941a71893bb4689bec444b90e80a /.dockerinit -g 172.17.42.1 -i 172.17.0.27/16 -mtu 1500 -- /bin/true
kernel: device vethXSouLM entered promiscuous mode
kernel: IPv6: ADDRCONF(NETDEV_UP): vethXSouLM: link is not ready
docker[136]: [debug] container.go:790 Container running: true
docker[136]: [/var/lib/docker|2df2d673] -job start(4c0889dfee34db7eaa805885571dd052f987941a71893bb4689bec444b90e80a) = OK (0)
kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethXSouLM: link becomes ready
kernel: docker0: port 3(vethXSouLM) entered forwarding state
kernel: docker0: port 3(vethXSouLM) entered forwarding state
logger[12544]: lxc-start died: 1
docker[136]: [debug] container.go:331 startPty: end of stdout pipe

I guess there's a race condition in setupNetworking, docker tries to set up interface which isn't yet ready. recompiling docker binary to confirm.

@crosbymichael

@farcaller

We just made a lot of changes around networking and exec driver in master. Could you try to reproduce by building master and testing?

@farcaller

Will do (seems that I need to get access to some i386 machine to build docker in "official" way).

@farcaller

So far cannot reproduce the problem on 88d6660.

@crosbymichael

@farcaller awesome

@farcaller

I will now proceed to run a good number of containers under load, if it doesn't fail in a few hours, I guess we can consider the bug as fixed.

Thanks for your hints.

@crosbymichael

Thanks for taking the time to test and verify

@farcaller

Closing, cannot reproduce anymore (as of 88d6660).

@farcaller farcaller closed this
@farcaller

Managed to reproduce it on 88d6660.

      lxc-start 1391510496.031 NOTICE   lxc_start - '/.dockerinit' started with pid '27675'
      lxc-start 1391510496.031 INFO     lxc_console - no console will be used
      lxc-start 1391510496.031 WARN     lxc_start - invalid pid for SIGCHLD
      lxc-start 1391510496.042 INFO     lxc_start - forwarded signal 28 to pid 27675
      lxc-start 1391510496.811 INFO     lxc_conf - running to reset 0 nic names
      lxc-start 1391510496.811 INFO     lxc_error - child <27675> ended on error (1)
      lxc-start 1391510496.811 WARN     lxc_conf - failed to remove interface 'eth0'
@farcaller farcaller reopened this
@unclejack
Collaborator

We don't support ARM hosts right now and figuring these problems out on Linux distributions customized and recompiled by third parties for custom platforms is quite difficult.

The kernel might be missing essential modules and options from its configuration, thus breaking it for Docker.

Since we don't support ARM, Raspberry Pi and such a custom setup we can't verify that it meets all requirements, I'll close this issue now. Please feel free to comment.

@unclejack unclejack closed this
@farcaller

My position on this is that it happens sometimes, so it's not the problem with kernel, modules or something. I.e docker is running ok in this configuration, but sometimes it doesn't umount volumes.

I guess the problem is that my test boards are much, much slower than the usual thing you'd be running docker on, thus they are more susceptible to races. Thus, this is a real docker problem that someone should take a look at.

@farcaller

Err.. sorry, for some reason I thought I'm commenting on #4303. FWIW, my comment is mostly relevant :wink:

As I cannot reliably confirm this issue now, I'm ok with this one closed, #4298 is more interesting (and reproducible).

@jkp
jkp commented

I see this consistently on circle-ci if I use the --rm switch to docker run. This is certainly a real issue and I think the author of this issue is onto something - it's a race of some sort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.