Fix for Docker 1.11.2 startup issue #604
@@ -19,6 +19,7 @@ root:
Restart=always
StartLimitInterval=0
RestartSec=15
ExecStartPre=-/sbin/ip link del docker0
`/usr/bin/env ip link del docker0` / do we need this for AWS hosts?
Yup, my plan is to also land this fix on the AWS EL7 AMIs that we ship by default. That will involve changing the cloud_images AMI script as well as the AWS templates. It does not happen with Docker 1.10, so we don't see the problem on CoreOS.
See the linked cloud_images/centos7/install_prereqs.sh changes in this PR for the corresponding AWS EL7 AMI updates.
On a percentage of DC/OS agents (~5%) with Docker 1.11.2, the daemon will fail to start up with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug around the network controller initialization, where the controller has allocated an IP pool and persisted some state but not all of it. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface if it exists before starting the Docker daemon. It will need to be re-evaluated if we want to enable the 1.12+ containerd live-restore-like Docker options, as discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
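For anyone wanting to try the same cleanup outside of systemd, the pre-start step described above can be sketched as a small shell helper. This is only an illustration: `remove_stale_bridge` and the `IP_BIN` override are hypothetical names that do not appear in this PR, which simply runs `/sbin/ip link del docker0` from `ExecStartPre` with a leading `-` so that a failure is ignored.

```shell
#!/bin/sh
# Sketch only: IP_BIN and remove_stale_bridge are hypothetical names, not part of the PR.
IP_BIN="${IP_BIN:-/sbin/ip}"

remove_stale_bridge() {
    iface="${1:-docker0}"
    # `ip link show DEV` exits non-zero when the device does not exist.
    if "$IP_BIN" link show "$iface" >/dev/null 2>&1; then
        # Delete the leftover bridge; warn but do not fail if deletion errors out.
        "$IP_BIN" link del "$iface" || echo "warning: could not delete $iface" >&2
    fi
    # Always succeed, mirroring the "-" prefix on ExecStartPre in the unit file.
    return 0
}
```

Calling `remove_stale_bridge docker0` as root before starting the daemon should behave roughly like the `ExecStartPre` line in the diff; the explicit existence check only avoids a noisy error message when there is nothing to delete.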
Reduce the size of our cloud-config by replacing the separate stop/disable/mask steps with equivalent behavior using the `--now` flag. This helps because we have reached the limit of what ARM can handle for concat expression length.
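As a rough illustration of the `--now` shortening, here is a minimal sketch of the idea. The exact commands used in the cloud-config are not shown in this conversation, so `retire_unit` and the `SYSTEMCTL` override are hypothetical, and the sketch assumes a systemd version recent enough to support `--now` on `mask`.

```shell
#!/bin/sh
# Sketch only: SYSTEMCTL and retire_unit are hypothetical names, not taken from the PR.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

retire_unit() {
    unit="$1"
    # One call instead of separate `stop` and `mask` steps:
    # `--now` also stops the unit as part of masking it.
    "$SYSTEMCTL" mask --now "$unit"
}
```

A masked unit cannot be started even if it is still enabled, which is why a single `mask --now` can plausibly stand in for the longer sequence; whether that matches the behavior needed on a given host is worth checking.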
Restart=always
StartLimitInterval=0
RestartSec=15
ExecStartPre=-/sbin/ip link del docker0
@malnick or @mellenburg do we have integration test coverage for this code in the deploy-vpc tests?
Yes and no. Yes, we have an option in test_installer_ccm.py called test_install_prereqs that will cover this, but no, it's not running anywhere (as it is a very slow and expensive job). But these Docker issues are killer, so maybe it's time to turn it on for real.
I'm pretty confident that this change will result in a more robust docker install for anyone using the web installer against EL7 hosts... I've also tested this change extensively against our AWS EL7 builds running Docker 1.11.2 and it's working great.
But it would still be nice to have a CI button we can press to do a full test of the web installer. Not a priority right now, since that will likely be a flaky test (it depends on a lot of external things)... but @mellenburg can you add it to a wish list to keep track of?
@cmaloney I believe this is good to go. In addition to our standard integration tests, I ran the following scale tests:
Note I stuck with
🚢