Fix for Docker 1.11.2 startup issue #604

lingmann · 2016-08-29T18:17:20Z

On a percentage of DC/OS agents (~5%) running Docker 1.11.2, the Docker
daemon fails to start with the following error:

Error starting daemon: Error initializing network controller: Error
creating default "bridge" network: failed to allocate gateway
(172.17.0.1): Address already in use

This seems to be related to a Docker bug in network controller
initialization, where the controller has allocated an IP pool and
persisted some, but not all, of its state. See:

* moby/moby#22834: daemon ungraceful shutdown during starting make daemon failed to start next time with Error initializing network controller
* moby/moby#23078: Issue #20312 still open with 1.11.1

This fix simply removes the docker0 interface, if it exists, before
starting the Docker daemon. The fix will need to be re-evaluated if we
want to enable the live-restore-like Docker options available with
containerd in 1.12+, as discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658

cmaloney · 2016-08-29T19:42:45Z

gen/azure/cloud-config.yaml

@@ -19,6 +19,7 @@ root:
      Restart=always
      StartLimitInterval=0
      RestartSec=15
+      ExecStartPre=-/sbin/ip link del docker0


`/usr/bin/env ip link del docker0`? Also, do we need this for AWS hosts?

Yup, my plan is to also land this fix on the AWS EL7 AMIs that we ship by default. That will involve changing the cloud_images AMI script as well as the AWS templates. It does not happen with Docker 1.10, so we don't see the problem on CoreOS.

See the linked cloud_images/centos7/install_prereqs.sh changes in this PR for the corresponding AWS EL7 AMI updates.

On a percentage of DC/OS agents (~5%) with Docker 1.11.2, the Daemon will fail to start up with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug around the network controller initialization, where the controller has allocated an ip pool and persisted some state but not all of it. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface if it exists before starting the Docker daemon. This fix will need to be re-evaluated if we want to enable the 1.12+ containerd live-restore like Docker options as discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
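The "remove docker0 if it exists" behavior described above can be sketched as a small shell helper. This is only an illustration, not code from this PR: `remove_stale_bridge` and the `IP_CMD` override are hypothetical names introduced here. The PR itself uses a single `ExecStartPre=-/sbin/ip link del docker0` line, where the leading `-` tells systemd to ignore a non-zero exit status (for example, when docker0 is absent).

```shell
#!/bin/sh
# Hypothetical helper illustrating the pre-start cleanup.
# IP_CMD is an assumption added so the logic can be exercised without root;
# it defaults to the /sbin/ip path used in this PR.
IP_CMD="${IP_CMD:-/sbin/ip}"

remove_stale_bridge() {
    iface="${1:-docker0}"
    # "ip link show IFACE" exits non-zero when the interface does not exist.
    if "$IP_CMD" link show "$iface" >/dev/null 2>&1; then
        # Delete the leftover bridge so the daemon can recreate it cleanly.
        "$IP_CMD" link del "$iface" && echo "removed $iface"
    else
        echo "$iface not present, nothing to do"
    fi
}
```

Running `remove_stale_bridge docker0` as root before the daemon starts should have roughly the same effect as the `ExecStartPre` line, with the difference that it checks for the interface first instead of ignoring the failure.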

Reduce the size of our cloud-config by replacing stop/disable/mask with equivalent behavior by using the `--now` flag. This is helpful since we have reached the limit of what ARM can handle around concat expression length.
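As a rough sketch of the size saving, the snippet below compares a three-command stop/disable/mask sequence with a single `--now` invocation. The unit name `example.service` is hypothetical, and the exact commands changed in the cloud-config are not shown in this thread, so treat this as one plausible reading. Per the systemctl documentation, `--now` used with `disable` or `mask` also stops the unit.

```shell
#!/bin/sh
# Hypothetical unit name; the real units touched by the PR are not listed here.
unit='example.service'

# Long form: three separate invocations.
long_form="systemctl stop $unit && systemctl disable $unit && systemctl mask $unit"

# Short form: --now makes "mask" stop the unit as well. A masked unit cannot
# be started, which makes a separate "disable" largely redundant.
short_form="systemctl mask --now $unit"

# Print each command with its character count (nothing is executed).
printf '%s\n  (%d characters)\n' "$long_form" "${#long_form}"
printf '%s\n  (%d characters)\n' "$short_form" "${#short_form}"
```

The strings are only printed, so the script is safe to run on a machine without systemd.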

lingmann · 2016-09-01T19:26:51Z

ext/dcos-installer/dcos_installer/action_lib.py

+Restart=always
+StartLimitInterval=0
+RestartSec=15
+ExecStartPre=-/sbin/ip link del docker0


@malnick or @mellenburg do we have integration test coverage for this code in the deploy-vpc tests?

Yes and no. Yes, we have an option in test_installer_ccm.py called test_install_prereqs that will cover this, but no, it's not running anywhere (as it is a very slow and expensive job). But these Docker issues are killer, so maybe it's time to turn it on for real.

I'm pretty confident that this change will result in a more robust docker install for anyone using the web installer against EL7 hosts... I've also tested this change extensively against our AWS EL7 builds running Docker 1.11.2 and it's working great.

But it would still be nice to have a CI button we can press to do a full test of the web installer. Not a priority right now, since that will likely be a flaky test (it depends on a lot of external things)... but @mellenburg can you add it to a wish list to keep track of?

lingmann · 2016-09-01T21:18:57Z

@cmaloney I believe this is good to go. In addition to our standard integration tests, I ran the following scale tests:

Azure 5Mx62A cluster: (0% failure rate)
Azure 5Mx62A cluster: (0% failure rate)
AWS EL7 1Mx60A cluster: (0% failure rate)
AWS EL7 3Mx80A cluster: (0% failure rate)

Note that I stuck with /sbin/ip because it exists in this location on both platforms where we need the fix (CentOS and Ubuntu), and it uses fewer characters than wrapping the command in bash. Characters are unfortunately a precious cloud-config resource right now.
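To make the character-count argument concrete, the sketch below compares the line used in this PR with one possible bash-wrapped equivalent. The wrapped form is an assumption for illustration only; the thread does not say what the bash variant would have looked like.

```shell
#!/bin/sh
# Line used in this PR: the leading "-" makes systemd ignore a failing exit code.
direct='ExecStartPre=-/sbin/ip link del docker0'

# Hypothetical bash-wrapped alternative that swallows the failure in the shell.
wrapped="ExecStartPre=/bin/bash -c 'ip link del docker0 || true'"

# Report the length of each variant.
echo "direct:  ${#direct} characters"
echo "wrapped: ${#wrapped} characters"
```

The direct form also avoids depending on `PATH` lookup inside the wrapper shell, since `/sbin/ip` is given as an absolute path.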

cmaloney · 2016-09-01T22:33:32Z

🚢

lingmann added the Work In Progress label Aug 29, 2016

lingmann added this to the 1.8 GA milestone Aug 29, 2016

lingmann self-assigned this Aug 29, 2016

cmaloney reviewed Aug 29, 2016

lingmann force-pushed the jeremy/docker0-startup-fix branch 2 times, most recently from ce3552a to 211d639 Compare August 29, 2016 22:51

lingmann changed the title ~~Workaround for Docker 1.11.2 startup issue~~ Fix for Docker 1.11.2 startup issue Aug 30, 2016

lingmann force-pushed the jeremy/docker0-startup-fix branch 3 times, most recently from f5cebea to b617316 Compare September 1, 2016 03:58

Jeremy Lingmann added 2 commits September 1, 2016 12:22

Reduce size of cloud-config

13d002f

lingmann force-pushed the jeremy/docker0-startup-fix branch from b617316 to 13d002f Compare September 1, 2016 19:22

lingmann reviewed Sep 1, 2016

lingmann added Ready For Review and removed Work In Progress labels Sep 1, 2016

lingmann assigned cmaloney and unassigned lingmann Sep 1, 2016

cmaloney added a commit to mesosphere/dcos that referenced this pull request Sep 1, 2016

Merge dcos#604

1098dab

cmaloney mentioned this pull request Sep 1, 2016

Train 65 #642

Merged

cmaloney merged commit 13d002f into dcos:master Sep 2, 2016
