New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker daemon not getting options from flannel_docker_opts.env #1659

Closed
zihaoyu opened this Issue Nov 13, 2016 · 16 comments

Comments

@zihaoyu

zihaoyu commented Nov 13, 2016

Issue Report

Bug

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1221.0.0
VERSION_ID=1221.0.0
BUILD_ID=2016-11-03-0452
PRETTY_NAME="CoreOS 1221.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

CoreOS on AWS.

Expected Behavior

Docker daemon gets the --bip and other command line options that are generated by flanneld and are available in /run/flannel/flannel_docker_opts.env.

Actual Behavior

Docker daemon starts without --bip and other command line flags. Below is my cloud-config YAML passed as user-data:

#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

And I can see the generated flannel_docker_opts.env:

ip-10-72-98-227 core # cat /run/flannel/flannel_docker_opts.env  
DOCKER_OPT_BIP="--bip=10.100.23.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=false"
DOCKER_OPT_MTU="--mtu=8973"

However, docker daemon starts without these flags:

ip-10-72-98-227 core # systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-11-13 02:00:07 UTC; 5min ago
     Docs: http://docs.docker.com
 Main PID: 1121 (dockerd)
    Tasks: 8
   Memory: 40.6M
      CPU: 207ms
   CGroup: /system.slice/docker.service
           └─1121 dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock --selinux-enabled

Nov 13 01:59:52 ip-10-72-98-227.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...

Interestingly, if I restart docker.service, Docker daemon is able to pick up the flags:

ip-10-72-98-227 core # systemctl restart docker.service
ip-10-72-98-227 core # systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-11-13 02:10:12 UTC; 3s ago
     Docs: http://docs.docker.com
 Main PID: 1294 (dockerd)
    Tasks: 8
   Memory: 10.4M
      CPU: 117ms
   CGroup: /system.slice/docker.service
           └─1294 dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock --bip=10.100.23.1/24 --mtu=8973 --ip-masq=false --selinux-enabled

Nov 13 02:10:12 ip-10-72-98-227.us-west-2.compute.internal systemd[1]: Stopped Docker Application Container Engine.
Nov 13 02:10:12 ip-10-72-98-227.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...

I wonder if there is a problem in the order in which services are started by systemd? I looked through the unit files of flanneld.service, flannel-docker-opts.service, and docker.service, they all seem to be in correct order: flanneld -> flannel-docker-opts -> docker.

Reproduction Steps

  1. Launch an instance with CoreOS Alpha 1221.0.0.
  2. Use the following user-data.
#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

Other Information

CoreOS Stable and Beta AMIs work correctly at the moment. Alpha is broken. I noticed that Alpha ships with Docker 1.12, wondering if that's the problem.

@crawford

This comment has been minimized.

Member

crawford commented Nov 14, 2016

Is docker.service enabled? It sounds like it is running before coreos-cloudinit has a change to start flannel.

@zihaoyu

This comment has been minimized.

zihaoyu commented Nov 14, 2016

@crawford Is there anything special for enabling it? All I have is this piece in the cloud-config:

    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

Also, the same cloud-config content works in Stable and Beta AMIs.

@crawford

This comment has been minimized.

Member

crawford commented Nov 14, 2016

Enabling the service allows systemd to manage the startup (which usually results in that service starting before coreos-cloudinit). You can check to see if it's enabled with systemctl is-enabled docker.

@zihaoyu

This comment has been minimized.

zihaoyu commented Nov 14, 2016

@crawford Here is the output

ip-10-72-98-58 core # systemctl is-enabled docker.service
disabled
ip-10-72-98-58 core # systemctl is-enabled flanneld.service
disabled
@crawford

This comment has been minimized.

Member

crawford commented Nov 14, 2016

Okay, that looks fine. I tried reproducing the issue with the config you provided and was unable to do so. I watched as systemd waited for flanneld.service to start before starting docker.service.

Can you attach the output of journalctl --boot --unit oem-cloudinit --unit docker? This will help me understand the boot ordering.

@zihaoyu

This comment has been minimized.

zihaoyu commented Nov 14, 2016

@crawford Thanks. Here is the output:

ip-10-72-98-58 core # journalctl --boot --unit oem-cloudinit --unit docker
-- Logs begin at Mon 2016-11-14 21:26:32 UTC, end at Mon 2016-11-14 21:35:24 UTC. --
Nov 14 21:27:22 localhost systemd[1]: Starting Cloudinit from EC2-style metadata...
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Checking availability of "cloud-drive"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Checking availability of "ec2-metadata-service"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching user-data from datasource of type "ec2-metadata-service"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/user-data. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching meta-data from datasource of type "ec2-metadata-service"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/public-keys. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Found SSH key for "hurley20141209"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/hostname. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/local-ipv4. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/public-ipv4. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Parsing user-data as cloud-config
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Merging cloud-config from meta-data and user-data
Nov 14 21:27:22 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:22 Set hostname to ip-10-72-98-58.us-west-2.compute.internal
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Authorized SSH keys for core user
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing file to "/etc/environment"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote file to "/etc/environment"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Updated /etc/environment
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing drop-in unit "50-network-config.conf" to filesystem
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing file to "/etc/systemd/system/flanneld.service.d/50-network-config.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote file to "/etc/systemd/system/flanneld.service.d/50-network-config.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote drop-in unit "50-network-config.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing drop-in unit "50-require-flannel.conf" to filesystem
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing file to "/etc/systemd/system/docker.service.d/50-require-flannel.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote file to "/etc/systemd/system/docker.service.d/50-require-flannel.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote drop-in unit "50-require-flannel.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "etcd.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "etcd2.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "fleet.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "locksmithd.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Calling unit command "start" on "etcd2.service"'
Nov 14 21:27:28 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:28 Result of "start" on "etcd2.service": done
Nov 14 21:27:28 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:28 Calling unit command "start" on "flanneld.service"'
Nov 14 21:27:44 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:44 Result of "start" on "flanneld.service": done
Nov 14 21:27:44 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:44 Calling unit command "start" on "docker.service"'
Nov 14 21:27:44 ip-10-72-98-58.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...
Nov 14 21:27:57 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:57.031456095Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Nov 14 21:27:57 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:57.032463237Z" level=info msg="Loading containers: start."
Nov 14 21:27:57 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:57.632025873Z" level=info msg="Firewalld running: false"
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.171497401Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/1
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.327286213Z" level=info msg="Loading containers: done."
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.327398052Z" level=info msg="Daemon has completed initialization"
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.327417111Z" level=info msg="Docker daemon" commit=34a2ead graphdriver=overlay version=1.12.3
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal systemd[1]: Started Docker Application Container Engine.
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:58 Result of "start" on "docker.service": done
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal systemd[1]: Started Cloudinit from EC2-style metadata.
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.382285055Z" level=info msg="API listen on /var/run/docker.sock"

Just out of curiosity, did you test on Alpha 1221 version?

@crawford

This comment has been minimized.

Member

crawford commented Nov 14, 2016

I was actually able to reproduce the failure.

@crawford

This comment has been minimized.

Member

crawford commented Nov 14, 2016

To work around the issue, you can use the following config (I made flannel-docker-opts.service a oneshot):

#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flannel-docker-opts.service
      drop-ins:
        - name: oneshot.conf
          content: |
            [Service]
            Type=oneshot
            Restart=no
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

@crawford crawford added this to the CoreOS Alpha 1235.0.0 milestone Nov 14, 2016

@zihaoyu

This comment has been minimized.

zihaoyu commented Nov 14, 2016

@crawford Yup that worked. Would you mind explaining the oneshot part?

@crawford

This comment has been minimized.

Member

crawford commented Nov 14, 2016

flannel-docker-opts.service actually creates /run/flannel/flannel_docker_opts.env which docker.service sources. By default, it is a simple service. These services are considered "started" as soon as the ExecStart is exec'd. This service actually takes a while to finish and there is enough time for systemd to think that flannel-docker-opts.service is "started" and it therefore starts docker.service. When I make the service a oneshot, systemd doesn't consider the service "started" until the ExecStart returns. That means docker.service is not allowed to start before /run/flannel/flannel_docker_opts.env is written.

@zihaoyu

This comment has been minimized.

zihaoyu commented Nov 14, 2016

@crawford Great. Thanks!

@zihaoyu

This comment has been minimized.

zihaoyu commented Dec 5, 2016

@crawford Can you please consider re-opening this? I'm still able to re-produce this bug in Beta 1235.1.0. Please see this gist.

  • flanneld.service finished starting at 16:40:28.293241
  • docker.service started starting at 16:40:28.293519
  • The timestamp on /run/flannel/flannel_docker_opts.env is 16:40:29.055128842 which is after flanneld finished starting so docker still was not able to source that file
@crawford

This comment has been minimized.

Member

crawford commented Dec 6, 2016

@zihaoyu docker needs to wait for flannel-docker-opts.service to finish executing since that it the service which creates /run/flannel/flannel_docker_opts.env. It looks like this isn't correct on your system. Can you run systemctl cat docker.service and systemctl cat flanneld.service?

@zihaoyu

This comment has been minimized.

zihaoyu commented Dec 6, 2016

@crawford

ip-10-72-113-68 core # systemctl cat docker.service
# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=containerd.service docker.socket early-docker.target network.target
Requires=containerd.service docker.socket early-docker.target

[Service]
Type=notify
EnvironmentFile=-/run/flannel/flannel_docker_opts.env

# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/lib/coreos/dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock $DOCKER_OPTS $DOCKER_CGROUPS $DOCKER_OPT_BIP $DOCKER_OPT_MTU $DOCKER_OPT_IPMASQ
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

[Install]
WantedBy=multi-user.target
ip-10-72-113-68 core # systemctl cat flanneld.service
# /usr/lib/systemd/system/flanneld.service
[Unit]
Description=flannel - Network fabric for containers (System Application Container)
Documentation=https://github.com/coreos/flannel
After=etcd.service etcd2.service etcd-member.service
Before=docker.service flannel-docker-opts.service
Requires=flannel-docker-opts.service

[Service]
Type=notify
Restart=always
RestartSec=10s
LimitNOFILE=40000
LimitNPROC=1048576

Environment="FLANNEL_IMAGE_TAG=v0.6.2"
Environment="FLANNEL_OPTS=--ip-masq=true"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid"
EnvironmentFile=-/run/flannel/options.env

ExecStartPre=/sbin/modprobe ip_tables
ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos /run/flannel
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper.uuid
ExecStart=/usr/lib/coreos/flannel-wrapper $FLANNEL_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/flannel-wrapper.uuid

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/flanneld.service.d/40-no-local-etcd.conf
[Unit]
After=network.target
# /etc/systemd/system/flanneld.service.d/50-network-config.conf
[Service]
Environment=FLANNELD_ETCD_PREFIX=/coreos.com/network
Environment=FLANNELD_ETCD_ENDPOINTS=http://etcd.mycompany.com:80
ExecStartPre=/usr/bin/etcdctl --endpoint http://etcd.mycompany.com:80 set /coreos.com/network/config "{\"Network\":\"10.4.0.0/16\", \"Backend\": {\"Type\": \"vxlan\"}}"
@crawford

This comment has been minimized.

Member

crawford commented Dec 6, 2016

Oh, I forgot you were using coreos-cloudinit. I don't have a good solution for you. Your two options are either switch to Ignition (preferred) or restructure your cloud-config to ensure the options are created. Something like this:

#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: flannel-docker-opts.service
      command: start
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service
@zihaoyu

This comment has been minimized.

zihaoyu commented Dec 6, 2016

@crawford OK I will check out Ignition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment