No network connectivity in some docker containers after upgrade to 1153.0.0 #1554

Closed
cdwertmann opened this Issue Sep 2, 2016 · 19 comments

@cdwertmann

cdwertmann commented Sep 2, 2016

Issue Report

Bug

After upgrading from stable (1068.10.0) to alpha (1153.0.0), some freshly submitted fleet services start containers that do not have network connectivity. From within the container I cannot ping the docker bridge (the default gateway address) or any other containers or host interfaces. A simple "docker restart <container>" resolves the issue.

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1153.0.0
VERSION_ID=1153.0.0
BUILD_ID=2016-08-27-0408
PRETTY_NAME="CoreOS 1153.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

QEMU/KVM on OpenStack

Expected Behavior

All docker containers have network connectivity, as it has always been in the past.

Actual Behavior

A few containers cannot reach any hosts; they are not even able to ping the default gateway (docker0).

Reproduction Steps

  1. upgrade to CoreOS alpha 1153.0.0
  2. submit your usual containers
  3. find the container that does not have network access (e.g. using the quick check sketched after this list)
  4. restart the container and see that it now does have access
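
A quick way to check a single container (the busybox image is only an example; the gateway address matches the --bip value shown under Other Information below):

docker run --rm busybox ping -c 3 -W 2 172.17.42.1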

Other Information

This happens across different docker images that are based on different distributions, so I don't think it is related to the image. docker inspect shows no difference before (when networking is down) and after a restart of the container (when networking works again).

I'm passing these options to docker:

cat /etc/systemd/system/docker.service.d/50-insecure-registry.conf
[Service]
Environment='DOCKER_OPTS=--bip 172.17.42.1/16 --dns 172.17.42.1 --dns-search=service.consul --insecure-registry="0.0.0.0/0"'
@crawford

Member

crawford commented Sep 8, 2016

I was able to reproduce this with Docker 1.11.2 on Linux 4.7.1 and 4.6.3, but was unable to reproduce with Docker 1.10.3.

@bryanlatten

bryanlatten commented Sep 9, 2016

@crawford did you reproduce this generically, or only in QEMU/KVM?

@crawford

Member

crawford commented Sep 9, 2016

@bryanlatten I've been reproducing it with QEMU. I haven't tried other platforms.

@PiotrProkop

PiotrProkop commented Sep 12, 2016

I have the same issue: docker doesn't attach the veth interface to the docker0 bridge. Restarting the daemon helps, as does manually attaching the interface to the bridge with brctl addif docker0 <veth>. We started to see this issue after upgrading to docker 1.12.1 on a bare-metal CoreOS server.
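
For reference, a rough sketch of the manual fix (vethXXXX is a placeholder; find the interface belonging to the affected container first):

brctl show docker0            # interfaces currently attached to the bridge
ip -o link show type veth     # all host-side veth interfaces
brctl addif docker0 vethXXXX  # attach the orphaned veth to docker0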

docker info:

Containers: 38
 Running: 20
 Paused: 0
 Stopped: 18
Images: 60
Server Version: 1.12.1
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: selinux
Kernel Version: 4.7.3-coreos
Operating System: CoreOS 1164.1.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 377.9 GiB
Name: host-1
ID: 655L:2CC7:MOFH:H2NQ:UKZG:DAEO:BVIR:3IAY:HHSN:UPBF:6EMS:NT7M
Docker Root Dir: /var/lib/docker

os-release:

NAME=CoreOS
ID=coreos
VERSION=1164.1.0
VERSION_ID=1164.1.0
BUILD_ID=2016-09-10-0834
PRETTY_NAME="CoreOS 1164.1.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
@Quentin-M

Quentin-M commented Sep 13, 2016

Linking moby/moby#26492

@dm0-

Member

dm0- commented Sep 14, 2016

This is what I have observed so far from trying to test and narrow it down to a problematic component: Docker containers' network links randomly fail to have their master set. This happens with Docker in CoreOS alpha and beta. The ip link command can be used on the host to set the master and restore networking for the containers.
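
For anyone who needs to recover a running container without restarting it, a minimal sketch of that ip link step, where veth1234abc stands in for the affected container's host-side interface:

ip link show master docker0              # veths correctly enslaved to the bridge
ip link set veth1234abc master docker0   # set the missing master; the container's networking recovers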

It continues to fail when booting a kernel from stable and the user space from alpha or beta. It does not fail with alpha or beta kernels and stable user spaces.

It fails whether Docker is built with Go 1.6 or 1.7. It fails with all Project Atomic patches applied.

It fails when patching libnetwork to just use an ioctl instead of netlink to set the master. The contents of the netlink request in LinkSetMasterByIndex are essentially identical between working and failing containers. Calling LockOSThread around the syscalls and logging the thread's network namespace shows no indication of the Go runtime leaking namespaces.

@matthewdfuller

matthewdfuller commented Sep 14, 2016

I'll add that we're experiencing the same issue using CoreOS Beta (1153.4.0) running in AWS:

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1153.4.0
VERSION_ID=1153.4.0
BUILD_ID=2016-09-10-0107
PRETTY_NAME="CoreOS 1153.4.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

I used a container that simply curls the AWS metadata service (http://169.254.169.254/latest/meta-data) and wrote a script to run it over and over. About 60% of the time the container could not curl the endpoint; the remaining 40% of the time it could. I could not detect any pattern, as identical docker run commands issued seconds apart could produce completely different results.
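
For reference, a rough sketch of the kind of loop I mean (the alpine image and run count are illustrative, not my exact script):

for i in $(seq 1 50); do
  docker run --rm alpine wget -q -T 5 -O- http://169.254.169.254/latest/meta-data >/dev/null \
    && echo "run $i: OK" || echo "run $i: FAILED"
done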

@dm0-

Member

dm0- commented Sep 15, 2016

Proof-of-concept fix is here: dm0-/libnetwork@4343ba4. Waiting for confirmation from upstream.

@dm0-

Member

dm0- commented Sep 20, 2016

I believe I have a (rather unfortunate) workaround for people who can't run a patched Docker: stop/mask systemd-networkd.service. Obviously, keep in mind the implications of stopping your network manager. You'd only want to do this after it's initialized your real interfaces. I'll see if there is a less destructive way to work around this.
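
Concretely, the workaround amounts to something like this, run only after networkd has finished configuring your real interfaces:

sudo systemctl stop systemd-networkd.service
sudo systemctl mask systemd-networkd.service

It can be reverted later with systemctl unmask systemd-networkd.service.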

@crawford

Member

crawford commented Sep 26, 2016

This was fixed with coreos/coreos-overlay@874c1b8 and coreos/docker#29 and should roll out in the next Alpha. Assuming nothing goes wrong, we'll backport this to Docker 1.11.2 in Beta in the coming weeks.

crawford closed this Sep 26, 2016

@crawford

Member

crawford commented Nov 2, 2016

This is now available in Stable. /cc @bryanlatten

@bboreham

bboreham commented Nov 7, 2016

I'm interested in why you don't modify the systemd-networkd config to avoid matching on these interfaces.

@crawford

Member

crawford commented Nov 7, 2016

@bboreham that is currently not possible with networkd. We have a proposal to add this functionality to networkd.

@bboreham

bboreham commented Nov 7, 2016

Well, it's not possible to specify exactly what you don't want, but you could write rules like 'Name=eth*' to match what you do want. Maybe too hard to cover all bases?
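
For instance, something along these lines in a .network unit (the file name and the DHCP setting here are only placeholders):

/etc/systemd/network/10-eth.network
[Match]
Name=eth*

[Network]
DHCP=yes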

@crawford

Member

crawford commented Nov 7, 2016

There isn't a way to match all ethernet devices. The names use the persistent naming scheme (so they won't be eth*).

@SpComb

SpComb commented Nov 7, 2016

At the moment this seems to be a regression going from CoreOS stable 1122.3.0 to 1185.3.0 that breaks Weave for some users.

Should we have a separate issue to track that somewhere?

@bboreham

bboreham commented Dec 5, 2016

I note that systemd/systemd#4228 has now been merged and the upstream Docker PR docker/libnetwork#1450 was rejected.

Do you have a CoreOS issue to make use of the new systemd feature?

@dm0-

Member

dm0- commented Dec 5, 2016

@crawford

Member

crawford commented Dec 5, 2016
