Fix for Docker 1.11.2 startup issue #604

Merged
merged 2 commits into from
Sep 2, 2016

Conversation

lingmann

On a percentage of DC/OS agents (~5%) running Docker 1.11.2, the daemon
fails to start with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug in network controller
initialization, where the controller has allocated an IP pool and
persisted some, but not all, of its state. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface, if it exists, before
starting the Docker daemon. It will need to be re-evaluated if we want
to enable the containerd-based live-restore options available in
Docker 1.12+, as discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
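
For readers who want to try the same cleanup by hand, the following is a minimal shell sketch of what the fix does, not the change shipped in this PR. The function name `remove_stale_bridge` and the `IP_CMD` override are hypothetical, introduced only so the logic can be exercised without root. The trailing `|| true` mirrors the `-` prefix on `ExecStartPre`, which tells systemd to ignore a non-zero exit status.

```shell
#!/bin/sh
# IP_CMD is a hypothetical override for illustration; it defaults to the
# path used in this PR.
IP_CMD="${IP_CMD:-/sbin/ip}"

# remove_stale_bridge is a hypothetical helper name, not part of the PR.
remove_stale_bridge() {
    iface="${1:-docker0}"
    # Try to delete the interface. A failure (for example, because the
    # interface does not exist) is ignored, like the "-" prefix in the unit.
    "$IP_CMD" link del "$iface" 2>/dev/null || true
}

# Example (requires root):
#   remove_stale_bridge docker0
```

Run this only while the Docker daemon is stopped, since the daemon recreates docker0 on startup.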

@lingmann lingmann added this to the 1.8 GA milestone Aug 29, 2016
@lingmann lingmann self-assigned this Aug 29, 2016
@@ -19,6 +19,7 @@ root:
Restart=always
StartLimitInterval=0
RestartSec=15
ExecStartPre=-/sbin/ip link del docker0
Contributor

Should this be `/usr/bin/env ip link del docker0`? Also, do we need this for AWS hosts?

Contributor Author

Yup, my plan is to also land this fix on the AWS EL7 AMIs that we ship by default. That will involve changing the cloud_images AMI script as well as the AWS templates. The problem does not occur with Docker 1.10, so we don't see it on CoreOS.

Contributor Author

See the linked cloud_images/centos7/install_prereqs.sh changes in this PR for the corresponding AWS EL7 AMI updates.

@lingmann lingmann force-pushed the jeremy/docker0-startup-fix branch 2 times, most recently from ce3552a to 211d639 Compare August 29, 2016 22:51
@lingmann lingmann changed the title Workaround for Docker 1.11.2 startup issue Fix for Docker 1.11.2 startup issue Aug 30, 2016
@lingmann lingmann force-pushed the jeremy/docker0-startup-fix branch 3 times, most recently from f5cebea to b617316 Compare September 1, 2016 03:58
Jeremy Lingmann added 2 commits September 1, 2016 12:22
On a percentage of DC/OS agents (~5%) running Docker 1.11.2, the daemon
fails to start with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug in network controller
initialization, where the controller has allocated an IP pool and
persisted some, but not all, of its state. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface, if it exists, before
starting the Docker daemon. It will need to be re-evaluated if we want
to enable the containerd-based live-restore options available in
Docker 1.12+, as discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658

Reduce the size of our cloud-config by replacing the separate
stop/disable/mask calls with the equivalent `--now` flag. This helps
because we have reached the limit of what ARM can handle for concat
expression length.

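
As a rough illustration of the space saving from `--now`, the sketch below compares a two-step stop and disable with the single-call form. The unit name `example.service` is hypothetical, since the units actually touched by the commit are not shown in this conversation. It relies on the documented `systemctl` behavior that `--now`, when combined with `disable` or `mask`, also stops the unit. The script only builds and measures the command strings; it does not run `systemctl`.

```shell
#!/bin/sh
# Hypothetical unit name, used only for illustration.
unit="example.service"

# Long form: stop and disable as two separate invocations.
long_form="systemctl stop $unit && systemctl disable $unit"

# Short form: with --now, "disable" also stops the unit.
short_form="systemctl disable --now $unit"

# Print both command strings with their character counts; nothing is executed.
printf '%3d chars: %s\n' "${#long_form}" "$long_form"
printf '%3d chars: %s\n' "${#short_form}" "$short_form"
```

The saving per unit is small, but it adds up when the whole cloud-config has to fit within a template expression limit.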
Restart=always
StartLimitInterval=0
RestartSec=15
ExecStartPre=-/sbin/ip link del docker0
Contributor Author

@malnick or @mellenburg do we have integration test coverage for this code in the deploy-vpc tests?

Yes and no. Yes, we have an option in test_installer_ccm.py called test_install_prereqs that will cover this, but no, it's not running anywhere (as it is a very slow and expensive job). But these Docker issues are killer, so maybe it's time to turn it on for real.

Contributor Author

I'm pretty confident that this change will result in a more robust docker install for anyone using the web installer against EL7 hosts... I've also tested this change extensively against our AWS EL7 builds running Docker 1.11.2 and it's working great.

But it would still be nice to have a CI button we can press to do a full test of the web installer. Not a priority right now, since that will likely be a flaky test (it depends on a lot of external things)... but @mellenburg can you add it to a wish list to keep track of?

lingmann commented Sep 1, 2016

@cmaloney I believe this is good to go. In addition to our standard integration tests, I ran the following scale tests:

  • Azure 5Mx62A cluster: (0% failure rate)
  • Azure 5Mx62A cluster: (0% failure rate)
  • AWS EL7 1Mx60A cluster: (0% failure rate)
  • AWS EL7 3Mx80A cluster: (0% failure rate)

Note that I stuck with /sbin/ip because it exists in this location on both platforms where we need the fix (CentOS and Ubuntu), and it uses fewer characters than wrapping in bash, which is unfortunately a precious cloud-config resource right now.
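
To make the character-count trade-off concrete, here is a small sketch comparing three ways the `ExecStartPre` line could be written. Only the first form appears in this PR. The `/usr/bin/env` form comes from the review suggestion above, and the `bash -c` form is an assumption about what "wrapping in bash" would look like. The script only measures the strings.

```shell
#!/bin/sh
# Form used in this PR: absolute path to ip.
direct="ExecStartPre=-/sbin/ip link del docker0"

# Form suggested in review: resolve ip through env.
env_form="ExecStartPre=-/usr/bin/env ip link del docker0"

# Assumed shape of a bash wrapper; not taken from the PR.
wrapped="ExecStartPre=-/bin/bash -c 'ip link del docker0'"

# Print each candidate with its length in characters.
for line in "$direct" "$env_form" "$wrapped"; do
    printf '%3d chars: %s\n' "${#line}" "$line"
done
```

The direct form is the shortest of the three, at the cost of depending on `ip` living at `/sbin/ip` on every target platform.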

cmaloney commented Sep 1, 2016

🚢

cmaloney added a commit to mesosphere/dcos that referenced this pull request Sep 1, 2016
@cmaloney cmaloney mentioned this pull request Sep 1, 2016
@cmaloney cmaloney merged commit 13d002f into dcos:master Sep 2, 2016