This repository was archived by the owner on Jan 30, 2020. It is now read-only.
Fleet cannot get a response from etcd and CPU utilization is very high #1119
Closed
Description
I set up a CoreOS cluster (3 VMs) on top of my OpenStack environment. etcd seems to be running well (I tuned it with "peer-election-timeout: 3000" and "peer-heartbeat-interval: 600" to avoid heartbeat timeouts):
core@coreos-1 ~ $ systemctl status etcd -l
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; static; vendor preset: disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: active (running) since Sun 2015-02-08 15:07:43 UTC; 11min ago
Main PID: 623 (etcd)
CGroup: /system.slice/etcd.service
└─623 /usr/bin/etcd
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:48.754 INFO | Discovery found peers [http://10.0.0.43:7001 http://10.0.0.42:7001]
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:48.775 INFO | Discovery fetched back peer list: [10.0.0.43:7001 10.0.0.42:7001]
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:48.932 INFO | Send Join Request to http://10.0.0.43:7001/join
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:50.155 INFO | 6de45d8f7ed64103a2e3ef4e2eef87af joined the cluster via peer 10.0.0.43:7001
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:50.450 INFO | etcd server [name 6de45d8f7ed64103a2e3ef4e2eef87af, listen on :4001, advertised url http://10.0.0.41:4001]
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:50.473 INFO | peer server [name 6de45d8f7ed64103a2e3ef4e2eef87af, listen on :7001, advertised url http://10.0.0.41:7001]
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:50.512 INFO | 6de45d8f7ed64103a2e3ef4e2eef87af starting in peer mode
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:50.532 INFO | 6de45d8f7ed64103a2e3ef4e2eef87af: state changed from 'initialized' to 'follower'.
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:50.844 INFO | 6de45d8f7ed64103a2e3ef4e2eef87af: peer added: '18ac2280bac94a7ead141eaaa0f89740'
Feb 08 15:07:51 coreos-1.novalocal etcd[623]: [etcd] Feb 8 15:07:51.242 INFO | 6de45d8f7ed64103a2e3ef4e2eef87af: peer added: 'f9f13445a5464c4f87d7d5436f305cdb'
core@coreos-1 ~ $ systemctl cat etcd
# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd
[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
# /run/systemd/system/etcd.service.d/10-oem.conf
[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200
# /run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=10.0.0.41:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/87c03e951672c6368f43aef8e0e800d7"
Environment="ETCD_PEER_ADDR=10.0.0.41:7001"
Environment="ETCD_PEER_ELECTION_TIMEOUT=3000"
Environment="ETCD_PEER_HEARTBEAT_INTERVAL=600"
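Note that ETCD_PEER_ELECTION_TIMEOUT appears twice in the output above (1200 in 10-oem.conf, 3000 in 20-cloudinit.conf). systemd applies drop-ins in file-name order after the main unit, and a later assignment of the same variable overrides an earlier one, so 3000 should be the effective value. A minimal sketch of that resolution follows; the `effective_environment` helper is hypothetical, and its parsing is deliberately simplified (one assignment per `Environment=` line, which is all this unit uses):

```python
# Fragments in the order systemd applies them: main unit first,
# then drop-ins sorted by file name. Lines are copied from the
# `systemctl cat etcd` output above (discovery URL omitted for brevity).
FRAGMENTS = [
    ("etcd.service", [
        "Environment=ETCD_DATA_DIR=/var/lib/etcd",
        "Environment=ETCD_NAME=%m",
    ]),
    ("10-oem.conf", [
        "Environment=ETCD_PEER_ELECTION_TIMEOUT=1200",
    ]),
    ("20-cloudinit.conf", [
        'Environment="ETCD_ADDR=10.0.0.41:4001"',
        'Environment="ETCD_PEER_ADDR=10.0.0.41:7001"',
        'Environment="ETCD_PEER_ELECTION_TIMEOUT=3000"',
        'Environment="ETCD_PEER_HEARTBEAT_INTERVAL=600"',
    ]),
]


def effective_environment(fragments):
    """Return {name: (value, source_file)}; later fragments win.

    Simplified on purpose: assumes a single, optionally quoted
    NAME=VALUE per Environment= line.
    """
    env = {}
    for source, lines in fragments:
        for line in lines:
            if not line.startswith("Environment="):
                continue
            assignment = line[len("Environment="):].strip().strip('"')
            name, _, value = assignment.partition("=")
            env[name] = (value, source)  # overwrite any earlier setting
    return env


if __name__ == "__main__":
    for name, (value, source) in sorted(effective_environment(FRAGMENTS).items()):
        print(f"{name}={value}  (from {source})")
```

On the running host, `systemctl show etcd -p Environment` reports the merged result directly.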
But fleet does not seem to be running well:
core@coreos-1 ~ $ systemctl status fleet -l
● fleet.service - fleet daemon
Loaded: loaded (/usr/lib64/systemd/system/fleet.service; static; vendor preset: disabled)
Drop-In: /run/systemd/system/fleet.service.d
└─20-cloudinit.conf
Active: active (running) since Sun 2015-02-08 15:07:44 UTC; 17min ago
Main PID: 627 (fleetd)
CGroup: /system.slice/fleet.service
└─627 /usr/bin/fleetd
Feb 08 15:23:52 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:56 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR server.go:185: Server monitor triggered: Monitor timed out before successful heartbeat
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO server.go:153: Establishing etcd connectivity
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:58 coreos-1.novalocal fleetd[627]: INFO server.go:164: Starting server components
Feb 08 15:24:14 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
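To separate "etcd is slow to answer" from "fleet gives up too early", one option is to time plain HTTP requests against the same client address that fleet uses. The sketch below is only an illustration: the `probe` helper is hypothetical, and the `/version` and `/v2/stats/self` paths are assumptions about what this etcd version serves on port 4001.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Issue one GET and return (ok, elapsed_seconds, detail).

    `opener` is injectable so the function can be exercised without
    a live etcd; by default it is urllib.request.urlopen.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read(200)  # a short prefix is enough to confirm a reply
        return True, time.monotonic() - start, body.decode("utf-8", "replace")
    except (OSError, http.client.HTTPException) as exc:
        # URLError, connection errors and socket timeouts are OSError subclasses.
        return False, time.monotonic() - start, str(exc)
```

For example, calling `probe("http://localhost:4001/version")` and `probe("http://localhost:4001/v2/stats/self")` a few times on each VM shows whether the local etcd answers within the timeout at all, and how long it takes when it does.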
I also found that CPU utilization on my CoreOS VMs is very high:
Tasks: 89 total, 2 running, 87 sleeping, 0 stopped, 0 zombie
%Cpu(s): 49.6 us, 18.8 sy, 0.0 ni, 1.1 id, 0.0 wa, 23.9 hi, 6.6 si, 0.0 st
KiB Mem: 2054224 total, 1146488 used, 907736 free, 895760 buffers
KiB Swap: 0 total, 0 used, 0 free. 149880 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
627 root 20 0 478252 16160 6296 S 29.2 0.8 4:27.52 fleetd
476 root 20 0 590472 22920 11892 R 19.8 1.1 3:23.40 update_engine
623 etcd 20 0 294676 15312 8816 S 8.1 0.7 1:09.43 etcd
And "fleet list-machines" cannot list the machines:
core@coreos-1 ~ $ fleet list-machines
INFO fleet.go:58: Starting fleetd version 0.9.0
INFO fleet.go:162: No provided or default config file found - proceeding without
INFO server.go:153: Establishing etcd connectivity
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO server.go:164: Starting server components
INFO engine.go:80: Engine leader is 18ac2280bac94a7ead141eaaa0f89740
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
I am using CoreOS 557.1.0:
core@coreos-1 ~ $ cat /etc/lsb-release
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=557.1.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 557.1.0"