This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Fleet can not get response from etcd and CPU utilization is very high #1119

@eric-nuaa

I set up a CoreOS cluster (3 VMs) on top of my OpenStack environment, and etcd seems to be running well (I tuned etcd with "peer-election-timeout: 3000" and "peer-heartbeat-interval: 600" to avoid heartbeat timeouts):

core@coreos-1 ~ $ systemctl status etcd -l 
● etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/etcd.service.d
           └─10-oem.conf, 20-cloudinit.conf
   Active: active (running) since Sun 2015-02-08 15:07:43 UTC; 11min ago
 Main PID: 623 (etcd)
   CGroup: /system.slice/etcd.service
           └─623 /usr/bin/etcd

Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:48.754 INFO      | Discovery found peers [http://10.0.0.43:7001 http://10.0.0.42:7001]
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:48.775 INFO      | Discovery fetched back peer list: [10.0.0.43:7001 10.0.0.42:7001]
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:48.932 INFO      | Send Join Request to http://10.0.0.43:7001/join
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.155 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af joined the cluster via peer 10.0.0.43:7001
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.450 INFO      | etcd server [name 6de45d8f7ed64103a2e3ef4e2eef87af, listen on :4001, advertised url http://10.0.0.41:4001]
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.473 INFO      | peer server [name 6de45d8f7ed64103a2e3ef4e2eef87af, listen on :7001, advertised url http://10.0.0.41:7001]
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.512 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af starting in peer mode
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.532 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af: state changed from 'initialized' to 'follower'.
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.844 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af: peer added: '18ac2280bac94a7ead141eaaa0f89740'
Feb 08 15:07:51 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:51.242 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af: peer added: 'f9f13445a5464c4f87d7d5436f305cdb'
core@coreos-1 ~ $ systemctl cat etcd       
# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd 
Restart=always
RestartSec=10s

# /run/systemd/system/etcd.service.d/10-oem.conf
[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

# /run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=10.0.0.41:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/87c03e951672c6368f43aef8e0e800d7"
Environment="ETCD_PEER_ADDR=10.0.0.41:7001"
Environment="ETCD_PEER_ELECTION_TIMEOUT=3000"
Environment="ETCD_PEER_HEARTBEAT_INTERVAL=600"
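
Note that ETCD_PEER_ELECTION_TIMEOUT is set twice above (1200 in 10-oem.conf, 3000 in 20-cloudinit.conf). systemd applies drop-ins in lexical order of file name and a later assignment overrides an earlier one, so 3000 should be the value etcd actually sees. The following is a minimal Python sketch of that resolution, using only the lines copied from the output above. The `DROP_INS` list, `ENV_LINE` pattern, and `effective_environment` helper are hypothetical names for illustration, and the parser assumes one assignment per `Environment=` line, which is a simplification of what systemd accepts.

```python
import re

# Drop-in contents copied from the `systemctl cat etcd` output above.
DROP_INS = [
    ("20-cloudinit.conf", '''
Environment="ETCD_ADDR=10.0.0.41:4001"
Environment="ETCD_PEER_ADDR=10.0.0.41:7001"
Environment="ETCD_PEER_ELECTION_TIMEOUT=3000"
Environment="ETCD_PEER_HEARTBEAT_INTERVAL=600"
'''),
    ("10-oem.conf", "Environment=ETCD_PEER_ELECTION_TIMEOUT=1200"),
]

# Assumption: a single NAME=value pair per line, optionally double-quoted.
ENV_LINE = re.compile(r'^Environment="?([A-Za-z_][A-Za-z0-9_]*)=([^"]*)"?$')


def effective_environment(drop_ins):
    """Return {variable: (value, source_file)}, letting later files win."""
    env = {}
    # Sorting by file name mimics systemd's lexical drop-in ordering.
    for name, text in sorted(drop_ins):
        for line in text.splitlines():
            match = ENV_LINE.match(line.strip())
            if match:
                env[match.group(1)] = (match.group(2), name)
    return env


env = effective_environment(DROP_INS)
election_ms = int(env["ETCD_PEER_ELECTION_TIMEOUT"][0])
heartbeat_ms = int(env["ETCD_PEER_HEARTBEAT_INTERVAL"][0])
print("election timeout:", env["ETCD_PEER_ELECTION_TIMEOUT"])
print("election / heartbeat ratio:", election_ms / heartbeat_ms)
```

The sketch only reports the effective values and their ratio; whether that ratio suits this network is a separate tuning question.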

But fleet does not seem to be running well:

core@coreos-1 ~ $ systemctl status fleet -l 
● fleet.service - fleet daemon
   Loaded: loaded (/usr/lib64/systemd/system/fleet.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/fleet.service.d
           └─20-cloudinit.conf
   Active: active (running) since Sun 2015-02-08 15:07:44 UTC; 17min ago
 Main PID: 627 (fleetd)
   CGroup: /system.slice/fleet.service
           └─627 /usr/bin/fleetd

Feb 08 15:23:52 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:56 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR server.go:185: Server monitor triggered: Monitor timed out before successful heartbeat
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO server.go:153: Establishing etcd connectivity
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:58 coreos-1.novalocal fleetd[627]: INFO server.go:164: Starting server components
Feb 08 15:24:14 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
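
To see which messages dominate the fleetd journal, the lines can be grouped by log level and source location. The following Python sketch does that for the ten lines shown above; `LOG_SAMPLE` and `summarize` are hypothetical names, and the parsing assumes the `<date> <host> fleetd[pid]: LEVEL file.go:line: text` layout seen in this output.

```python
from collections import Counter

# The ten journal lines copied from the `systemctl status fleet -l` output above.
LOG_SAMPLE = """\
Feb 08 15:23:52 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:56 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR server.go:185: Server monitor triggered: Monitor timed out before successful heartbeat
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO server.go:153: Establishing etcd connectivity
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:58 coreos-1.novalocal fleetd[627]: INFO server.go:164: Starting server components
Feb 08 15:24:14 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
"""


def summarize(log_text):
    """Count journal lines by (level, source location)."""
    counts = Counter()
    for line in log_text.splitlines():
        # Drop the journal prefix, which ends with "fleetd[<pid>]: ".
        _, sep, message = line.partition("]: ")
        if not sep:
            continue
        parts = message.split(" ", 2)
        if len(parts) < 3:
            continue
        level, location = parts[0], parts[1].rstrip(":")
        counts[(level, location)] += 1
    return counts


counts = summarize(LOG_SAMPLE)
for (level, location), n in counts.most_common():
    print(f"{n:2d}  {level:5s} {location}")
```

In this sample, half of the lines are the "Failed getting response" message from client.go:291, and the monitor timeout is followed by a reconnect ("Establishing etcd connectivity", then "Starting server components").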

The CPU utilization of my CoreOS VMs is also very high:

Tasks:  89 total,   2 running,  87 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.6 us, 18.8 sy,  0.0 ni,  1.1 id,  0.0 wa, 23.9 hi,  6.6 si,  0.0 st
KiB Mem:   2054224 total,  1146488 used,   907736 free,   895760 buffers
KiB Swap:        0 total,        0 used,        0 free.   149880 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                             
  627 root      20   0  478252  16160   6296 S 29.2  0.8   4:27.52 fleetd                                                                              
  476 root      20   0  590472  22920  11892 R 19.8  1.1   3:23.40 update_engine                                                                       
  623 etcd      20   0  294676  15312   8816 S  8.1  0.7   1:09.43 etcd
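
The `%Cpu(s)` summary line can be broken down per field to see where the time goes. The following Python sketch parses the line exactly as it appears above; `CPU_LINE` and `parse_cpu_line` are hypothetical names, and the field meanings in the comments follow the usual procps `top` abbreviations.

```python
import re

# Summary line copied from the top output above.
CPU_LINE = "%Cpu(s): 49.6 us, 18.8 sy,  0.0 ni,  1.1 id,  0.0 wa, 23.9 hi,  6.6 si,  0.0 st"


def parse_cpu_line(line):
    """Return {field: percent} from a top '%Cpu(s)' summary line."""
    # Each field is a number followed by a two-letter code:
    # us=user, sy=system, ni=nice, id=idle, wa=I/O wait,
    # hi=hardware interrupts, si=software interrupts, st=steal.
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


cpu = parse_cpu_line(CPU_LINE)
interrupts = cpu["hi"] + cpu["si"]
print(f"idle: {cpu['id']:.1f}%")
print(f"user + system: {cpu['us'] + cpu['sy']:.1f}%")
print(f"interrupt handling (hi + si): {interrupts:.1f}%")
```

Besides user and system time, a large share here is spent servicing hardware and software interrupts, which the per-process list (fleetd, update_engine, etcd) does not account for.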

And "fleet list-machines" cannot list machines:

core@coreos-1 ~ $ fleet list-machines 
INFO fleet.go:58: Starting fleetd version 0.9.0
INFO fleet.go:162: No provided or default config file found - proceeding without
INFO server.go:153: Establishing etcd connectivity
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO server.go:164: Starting server components
INFO engine.go:80: Engine leader is 18ac2280bac94a7ead141eaaa0f89740
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled

I am using CoreOS 557.1.0:

core@coreos-1 ~ $ cat /etc/lsb-release        
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=557.1.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 557.1.0"
