flanneld unable to talk to etcd following a restart #1445

Closed
george-angel opened this Issue Jul 6, 2016 · 4 comments

george-angel commented Jul 6, 2016

flanneld runs happily on first boot; I'm able to get an hour's worth of work done, then update-engine restarts the machine, and following the restart flanneld is no longer able to communicate with etcd.

Running on AWS.

Logs from flanneld:

Jul 06 14:49:24 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:49:24.266983 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:49:25 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:49:25.268970 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:49:29 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:49:29.374473 00001 vxlan.go:345] L3 miss: 10.2.58.2
Jul 06 14:49:29 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:49:29.374531 00001 device.go:187] calling NeighSet: 10.2.58.2, d2:1a:24:40:ce:2d
Jul 06 14:49:29 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:49:29.374722 00001 vxlan.go:356] AddL3 succeeded
Jul 06 14:50:47 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:50:47.794955 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:51:09 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:09.058987 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:51:12 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:12.064845 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:51:13 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:13.066805 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:51:14 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:14.068979 00001 vxlan.go:340] Ignoring not a miss: d2:1a:24:40:ce:2d, 10.2.58.2
Jul 06 14:51:34 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:34.130837 00001 vxlan.go:345] L3 miss: 10.2.58.2
Jul 06 14:51:34 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:34.130896 00001 device.go:187] calling NeighSet: 10.2.58.2, d2:1a:24:40:ce:2d
Jul 06 14:51:34 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:51:34.131099 00001 vxlan.go:356] AddL3 succeeded
Jul 06 14:52:20 ip-10-66-10-140.eu-west-1.compute.internal systemd[1]: Stopping Network fabric for containers...
Jul 06 14:52:20 ip-10-66-10-140.eu-west-1.compute.internal sdnotify-proxy[952]: I0706 14:52:20.323567 00001 main.go:292] Exiting...
Jul 06 14:52:20 ip-10-66-10-140.eu-west-1.compute.internal systemd[1]: Stopped Network fabric for containers.
-- Reboot --
Jul 06 14:52:51 ip-10-66-10-140.eu-west-1.compute.internal systemd[1]: Starting Network fabric for containers...
Jul 06 14:52:51 ip-10-66-10-140.eu-west-1.compute.internal curl[878]: {"errorCode":105,"message":"Key already exists","cause":"/coreos.com/network/config","index":266187}
Jul 06 14:52:51 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: image: using image from file /usr/lib/rkt/stage1-images/stage1-fly.aci
Jul 06 14:52:52 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: image: searching for app image quay.io/coreos/flannel
Jul 06 14:52:53 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: image: remote fetching from URL "https://quay.io/c1/aci/quay.io/coreos/flannel/0.5.5/aci/linux/amd64/"
Jul 06 14:52:54 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: Downloading ACI:  0 B/8.86 MB
Jul 06 14:52:54 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: Downloading ACI:  16.4 KB/8.86 MB
Jul 06 14:52:55 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: Downloading ACI:  2.61 MB/8.86 MB
Jul 06 14:52:55 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: Downloading ACI:  8.86 MB/8.86 MB
Jul 06 14:52:58 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: I0706 14:52:58.049585 00884 main.go:275] Installing signal handlers
Jul 06 14:52:58 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: I0706 14:52:58.050507 00884 main.go:188] Using 10.66.10.140 as external interface
Jul 06 14:52:58 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: I0706 14:52:58.051057 00884 main.go:189] Using 10.66.10.140 as external endpoint
Jul 06 14:52:58 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: E0706 14:52:58.054006 00884 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jul 06 14:52:59 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: E0706 14:52:59.054563 00884 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jul 06 14:53:00 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: E0706 14:53:00.055151 00884 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jul 06 14:53:01 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: E0706 14:53:01.055803 00884 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jul 06 14:53:02 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: E0706 14:53:02.056373 00884 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Jul 06 14:53:03 ip-10-66-10-140.eu-west-1.compute.internal rkt[884]: E0706 14:53:03.057036 00884 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured

CoreOS version:

core@ip-10-66-10-140 ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1097.0.0
VERSION_ID=1097.0.0
BUILD_ID=2016-07-02-0145
PRETTY_NAME="CoreOS 1097.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

flanneld.service:

core@ip-10-66-10-140 ~ $ systemctl cat flanneld
# /usr/lib64/systemd/system/flanneld.service
[Unit]
Description=Network fabric for containers
Documentation=https://github.com/coreos/flannel
After=etcd.service etcd2.service
Before=docker.service

[Service]
Type=notify
Restart=always
RestartSec=5
Environment="TMPDIR=/var/tmp/"
Environment="FLANNEL_VER=0.5.5"
Environment="FLANNEL_IMG=quay.io/coreos/flannel"
Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
EnvironmentFile=-/run/flannel/options.env
LimitNOFILE=40000
LimitNPROC=1048576
ExecStartPre=/sbin/modprobe ip_tables
ExecStartPre=/usr/bin/mkdir -p /run/flannel
ExecStartPre=/usr/bin/mkdir -p ${ETCD_SSL_DIR}

ExecStart=/usr/bin/rkt run --net=host \
   --stage1-path=/usr/lib/rkt/stage1-images/stage1-fly.aci \
   --insecure-options=image \
   --set-env=NOTIFY_SOCKET=/run/systemd/notify \
   --inherit-env=true \
   --volume runsystemd,kind=host,source=/run/systemd,readOnly=false \
   --volume runflannel,kind=host,source=/run/flannel,readOnly=false \
   --volume ssl,kind=host,source=${ETCD_SSL_DIR},readOnly=true \
   --mount volume=runsystemd,target=/run/systemd \
   --mount volume=runflannel,target=/run/flannel \
   --mount volume=ssl,target=${ETCD_SSL_DIR} \
   ${FLANNEL_IMG}:${FLANNEL_VER} \
   -- --ip-masq=true

# Update docker options
ExecStartPost=/usr/bin/rkt run --net=host \
   --stage1-path=/usr/lib/rkt/stage1-images/stage1-fly.aci \
   --insecure-options=image \
   --volume runvol,kind=host,source=/run,readOnly=false \
   --mount volume=runvol,target=/run \
   ${FLANNEL_IMG}:${FLANNEL_VER} \
   --exec /opt/bin/mk-docker-opts.sh -- -d /run/flannel_docker_opts.env -i

ExecStopPost=/usr/bin/rkt gc --mark-only

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/flanneld.service.d/10-etcd.conf
[Service]
Environment="FLANNELD_IFACE=10.66.10.140"
Environment="FLANNELD_ETCD_ENDPOINTS=http://etcd.k8s.dev.uw.systems:2379"
ExecStartPre=/usr/bin/curl --silent -X PUT -d \
"value={\"Network\" : \"10.2.0.0/16\", \"Backend\" : {\"Type\" : \"vxlan\"}}" \
http://etcd.k8s.dev.uw.systems:2379/v2/keys/coreos.com/network/config?prevExist=false

The etcd cluster is there and working, as can be seen from the output of the curl in ExecStartPre.
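Incidentally, the errorCode 105 in the boot log above is itself evidence that etcd was reachable at that moment: with ?prevExist=false, the etcd v2 keys API returns "Key already exists" when the config was already written on an earlier boot. A minimal sketch (hypothetical helper, not part of flannel) of how that ExecStartPre response can be interpreted:

```python
import json

def config_put_ok(response_body: str) -> bool:
    """Interpret the body returned by the etcd v2 keys API after a
    PUT with ?prevExist=false.

    A plain success carries no "errorCode"; errorCode 105 ("Key already
    exists") just means the network config was written on an earlier
    boot. Either way, the request reached etcd, so ExecStartPre is fine.
    """
    body = json.loads(response_body)
    return "errorCode" not in body or body["errorCode"] == 105

# The exact body logged by curl on this boot:
logged = ('{"errorCode":105,"message":"Key already exists",'
          '"cause":"/coreos.com/network/config","index":266187}')
print(config_put_ok(logged))  # True: etcd was reachable at that point
```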


crawford commented Jul 6, 2016

@ajeddeloh

ajeddeloh commented Jul 6, 2016

This looks like a similar issue to #1439 and #1436. Can you try adding this as a drop-in unit for flanneld:

[Service]
ExecStart=
ExecStart=/usr/bin/rkt run --net=host \
   --stage1-path=/usr/lib/rkt/stage1-images/stage1-fly.aci \
   --insecure-options=image \
   --set-env=NOTIFY_SOCKET=/run/systemd/notify \
   --inherit-env=true \
   --volume runsystemd,kind=host,source=/run/systemd,readOnly=false \
   --volume runflannel,kind=host,source=/run/flannel,readOnly=false \
   --volume ssl,kind=host,source=${ETCD_SSL_DIR},readOnly=true \
   --volume certs,kind=host,source=/usr/share/ca-certificates,readOnly=true \
   --volume resolv,kind=host,source=/etc/resolv.conf,readOnly=true \
   --mount volume=runsystemd,target=/run/systemd \
   --mount volume=runflannel,target=/run/flannel \
   --mount volume=ssl,target=${ETCD_SSL_DIR} \
   --mount volume=certs,target=/etc/ssl/certs \
   --mount volume=resolv,target=/etc/resolv.conf \
   ${FLANNEL_IMG}:${FLANNEL_VER} \
   --exec /opt/bin/flanneld \
   -- --ip-masq=true
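The important difference from the stock ExecStart is the two extra mounts (/usr/share/ca-certificates and /etc/resolv.conf), which give the containerized flanneld the host's CA bundle and DNS configuration so it can resolve and reach the etcd endpoint. Assuming the stock unit name, the override could be installed like this (the local filename `flanneld-override.conf` and the drop-in filename are just examples):

```shell
# Install the [Service] override above as a drop-in (any *.conf name works),
# then reload systemd and restart flanneld to pick it up:
mkdir -p /etc/systemd/system/flanneld.service.d
cp flanneld-override.conf /etc/systemd/system/flanneld.service.d/10-override.conf
systemctl daemon-reload
systemctl restart flanneld
```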

george-angel commented Jul 6, 2016

Booted well the first time, and every time after that, across 4 restarts. I'm reluctant to say "it's fixed", but it seems to be working.

Will check back tomorrow morning. Thank you @ajeddeloh

ajeddeloh commented Jul 6, 2016

Closed via coreos/coreos-overlay#2043. Will be included in the next alpha.
