Skip to content
This repository has been archived by the owner. It is now read-only.

Docker daemon not getting options from flannel_docker_opts.env #1659

Closed
zihaoyu opened this issue Nov 13, 2016 · 16 comments
Closed

Docker daemon not getting options from flannel_docker_opts.env #1659

zihaoyu opened this issue Nov 13, 2016 · 16 comments

Comments

@zihaoyu
Copy link

@zihaoyu zihaoyu commented Nov 13, 2016

Issue Report

Bug

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1221.0.0
VERSION_ID=1221.0.0
BUILD_ID=2016-11-03-0452
PRETTY_NAME="CoreOS 1221.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

CoreOS on AWS.

Expected Behavior

Docker daemon gets the --bip and other command line options that are generated by flanneld and are available in /run/flannel/flannel_docker_opts.env.

Actual Behavior

Docker daemon starts without --bip and other command line flags. Below is my cloud-config YAML passed as user-data:

#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

And I can see the generated flannel_docker_opts.env:

ip-10-72-98-227 core # cat /run/flannel/flannel_docker_opts.env  
DOCKER_OPT_BIP="--bip=10.100.23.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=false"
DOCKER_OPT_MTU="--mtu=8973"

However, docker daemon starts without these flags:

ip-10-72-98-227 core # systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-11-13 02:00:07 UTC; 5min ago
     Docs: http://docs.docker.com
 Main PID: 1121 (dockerd)
    Tasks: 8
   Memory: 40.6M
      CPU: 207ms
   CGroup: /system.slice/docker.service
           └─1121 dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock --selinux-enabled

Nov 13 01:59:52 ip-10-72-98-227.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...

Interestingly, if I restart docker.service, Docker daemon is able to pick up the flags:

ip-10-72-98-227 core # systemctl restart docker.service
ip-10-72-98-227 core # systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-11-13 02:10:12 UTC; 3s ago
     Docs: http://docs.docker.com
 Main PID: 1294 (dockerd)
    Tasks: 8
   Memory: 10.4M
      CPU: 117ms
   CGroup: /system.slice/docker.service
           └─1294 dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock --bip=10.100.23.1/24 --mtu=8973 --ip-masq=false --selinux-enabled

Nov 13 02:10:12 ip-10-72-98-227.us-west-2.compute.internal systemd[1]: Stopped Docker Application Container Engine.
Nov 13 02:10:12 ip-10-72-98-227.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...

I wonder if there is a problem in the order in which services are started by systemd? I looked through the unit files of flanneld.service, flannel-docker-opts.service, and docker.service, they all seem to be in correct order: flanneld -> flannel-docker-opts -> docker.

Reproduction Steps

  1. Launch an instance with CoreOS Alpha 1221.0.0.
  2. Use the following user-data.
#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

Other Information

CoreOS Stable and Beta AMIs work correctly at the moment. Alpha is broken. I noticed that Alpha ships with Docker 1.12, wondering if that's the problem.

@crawford
Copy link
Member

@crawford crawford commented Nov 14, 2016

Is docker.service enabled? It sounds like it is running before coreos-cloudinit has a change to start flannel.

@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Nov 14, 2016

@crawford Is there anything special for enabling it? All I have is this piece in the cloud-config:

    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service

Also, the same cloud-config content works in Stable and Beta AMIs.

@crawford
Copy link
Member

@crawford crawford commented Nov 14, 2016

Enabling the service allows systemd to manage the startup (which usually results in that service starting before coreos-cloudinit). You can check to see if it's enabled with systemctl is-enabled docker.

@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Nov 14, 2016

@crawford Here is the output

ip-10-72-98-58 core # systemctl is-enabled docker.service
disabled
ip-10-72-98-58 core # systemctl is-enabled flanneld.service
disabled
@crawford
Copy link
Member

@crawford crawford commented Nov 14, 2016

Okay, that looks fine. I tried reproducing the issue with the config you provided and was unable to do so. I watched as systemd waited for flanneld.service to start before starting docker.service.

Can you attach the output of journalctl --boot --unit oem-cloudinit --unit docker? This will help me understand the boot ordering.

@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Nov 14, 2016

@crawford Thanks. Here is the output:

ip-10-72-98-58 core # journalctl --boot --unit oem-cloudinit --unit docker
-- Logs begin at Mon 2016-11-14 21:26:32 UTC, end at Mon 2016-11-14 21:35:24 UTC. --
Nov 14 21:27:22 localhost systemd[1]: Starting Cloudinit from EC2-style metadata...
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Checking availability of "cloud-drive"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Checking availability of "ec2-metadata-service"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching user-data from datasource of type "ec2-metadata-service"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/user-data. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching meta-data from datasource of type "ec2-metadata-service"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/public-keys. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Found SSH key for "hurley20141209"
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/hostname. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/local-ipv4. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Fetching data from http://169.254.169.254/2009-04-04/meta-data/public-ipv4. Attempt #1
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Parsing user-data as cloud-config
Nov 14 21:27:22 localhost coreos-cloudinit[917]: 2016/11/14 21:27:22 Merging cloud-config from meta-data and user-data
Nov 14 21:27:22 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:22 Set hostname to ip-10-72-98-58.us-west-2.compute.internal
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Authorized SSH keys for core user
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing file to "/etc/environment"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote file to "/etc/environment"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Updated /etc/environment
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing drop-in unit "50-network-config.conf" to filesystem
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing file to "/etc/systemd/system/flanneld.service.d/50-network-config.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote file to "/etc/systemd/system/flanneld.service.d/50-network-config.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote drop-in unit "50-network-config.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing drop-in unit "50-require-flannel.conf" to filesystem
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Writing file to "/etc/systemd/system/docker.service.d/50-require-flannel.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote file to "/etc/systemd/system/docker.service.d/50-require-flannel.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Wrote drop-in unit "50-require-flannel.conf"
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "etcd.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "etcd2.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "fleet.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Ensuring runtime unit file "locksmithd.service" is unmasked
Nov 14 21:27:23 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:23 Calling unit command "start" on "etcd2.service"'
Nov 14 21:27:28 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:28 Result of "start" on "etcd2.service": done
Nov 14 21:27:28 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:28 Calling unit command "start" on "flanneld.service"'
Nov 14 21:27:44 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:44 Result of "start" on "flanneld.service": done
Nov 14 21:27:44 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:44 Calling unit command "start" on "docker.service"'
Nov 14 21:27:44 ip-10-72-98-58.us-west-2.compute.internal systemd[1]: Starting Docker Application Container Engine...
Nov 14 21:27:57 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:57.031456095Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Nov 14 21:27:57 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:57.032463237Z" level=info msg="Loading containers: start."
Nov 14 21:27:57 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:57.632025873Z" level=info msg="Firewalld running: false"
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.171497401Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/1
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.327286213Z" level=info msg="Loading containers: done."
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.327398052Z" level=info msg="Daemon has completed initialization"
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.327417111Z" level=info msg="Docker daemon" commit=34a2ead graphdriver=overlay version=1.12.3
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal systemd[1]: Started Docker Application Container Engine.
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal coreos-cloudinit[917]: 2016/11/14 21:27:58 Result of "start" on "docker.service": done
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal systemd[1]: Started Cloudinit from EC2-style metadata.
Nov 14 21:27:58 ip-10-72-98-58.us-west-2.compute.internal dockerd[1117]: time="2016-11-14T21:27:58.382285055Z" level=info msg="API listen on /var/run/docker.sock"

Just out of curiosity, did you test on Alpha 1221 version?

@crawford
Copy link
Member

@crawford crawford commented Nov 14, 2016

I was actually able to reproduce the failure.

@crawford
Copy link
Member

@crawford crawford commented Nov 14, 2016

To work around the issue, you can use the following config (I made flannel-docker-opts.service a oneshot):

#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flannel-docker-opts.service
      drop-ins:
        - name: oneshot.conf
          content: |
            [Service]
            Type=oneshot
            Restart=no
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service
@crawford crawford added this to the CoreOS Alpha 1235.0.0 milestone Nov 14, 2016
@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Nov 14, 2016

@crawford Yup that worked. Would you mind explaining the oneshot part?

@crawford
Copy link
Member

@crawford crawford commented Nov 14, 2016

flannel-docker-opts.service actually creates /run/flannel/flannel_docker_opts.env which docker.service sources. By default, it is a simple service. These services are considered "started" as soon as the ExecStart is exec'd. This service actually takes a while to finish and there is enough time for systemd to think that flannel-docker-opts.service is "started" and it therefore starts docker.service. When I make the service a oneshot, systemd doesn't consider the service "started" until the ExecStart returns. That means docker.service is not allowed to start before /run/flannel/flannel_docker_opts.env is written.

@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Nov 14, 2016

@crawford Great. Thanks!

@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Dec 5, 2016

@crawford Can you please consider re-opening this? I'm still able to re-produce this bug in Beta 1235.1.0. Please see this gist.

  • flanneld.service finished starting at 16:40:28.293241
  • docker.service started starting at 16:40:28.293519
  • The timestamp on /run/flannel/flannel_docker_opts.env is 16:40:29.055128842 which is after flanneld finished starting so docker still was not able to source that file
@crawford
Copy link
Member

@crawford crawford commented Dec 6, 2016

@zihaoyu docker needs to wait for flannel-docker-opts.service to finish executing since that it the service which creates /run/flannel/flannel_docker_opts.env. It looks like this isn't correct on your system. Can you run systemctl cat docker.service and systemctl cat flanneld.service?

@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Dec 6, 2016

@crawford

ip-10-72-113-68 core # systemctl cat docker.service
# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=containerd.service docker.socket early-docker.target network.target
Requires=containerd.service docker.socket early-docker.target

[Service]
Type=notify
EnvironmentFile=-/run/flannel/flannel_docker_opts.env

# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/lib/coreos/dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock $DOCKER_OPTS $DOCKER_CGROUPS $DOCKER_OPT_BIP $DOCKER_OPT_MTU $DOCKER_OPT_IPMASQ
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

[Install]
WantedBy=multi-user.target
ip-10-72-113-68 core # systemctl cat flanneld.service
# /usr/lib/systemd/system/flanneld.service
[Unit]
Description=flannel - Network fabric for containers (System Application Container)
Documentation=https://github.com/coreos/flannel
After=etcd.service etcd2.service etcd-member.service
Before=docker.service flannel-docker-opts.service
Requires=flannel-docker-opts.service

[Service]
Type=notify
Restart=always
RestartSec=10s
LimitNOFILE=40000
LimitNPROC=1048576

Environment="FLANNEL_IMAGE_TAG=v0.6.2"
Environment="FLANNEL_OPTS=--ip-masq=true"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid"
EnvironmentFile=-/run/flannel/options.env

ExecStartPre=/sbin/modprobe ip_tables
ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos /run/flannel
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper.uuid
ExecStart=/usr/lib/coreos/flannel-wrapper $FLANNEL_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/flannel-wrapper.uuid

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/flanneld.service.d/40-no-local-etcd.conf
[Unit]
After=network.target
# /etc/systemd/system/flanneld.service.d/50-network-config.conf
[Service]
Environment=FLANNELD_ETCD_PREFIX=/coreos.com/network
Environment=FLANNELD_ETCD_ENDPOINTS=http://etcd.mycompany.com:80
ExecStartPre=/usr/bin/etcdctl --endpoint http://etcd.mycompany.com:80 set /coreos.com/network/config "{\"Network\":\"10.4.0.0/16\", \"Backend\": {\"Type\": \"vxlan\"}}"
@crawford
Copy link
Member

@crawford crawford commented Dec 6, 2016

Oh, I forgot you were using coreos-cloudinit. I don't have a good solution for you. Your two options are either switch to Ignition (preferred) or restructure your cloud-config to ensure the options are created. Something like this:

#cloud-config

coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.100.0.0/16" }'
    - name: flannel-docker-opts.service
      command: start
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-require-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service
@zihaoyu
Copy link
Author

@zihaoyu zihaoyu commented Dec 6, 2016

@crawford OK I will check out Ignition.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.