Skip to content
This repository has been archived by the owner. It is now read-only.

Fleet can not get response from etcd and CPU utilization is very high #1119

Closed
eric-nuaa opened this issue Feb 8, 2015 · 9 comments
Closed

Fleet can not get response from etcd and CPU utilization is very high #1119

eric-nuaa opened this issue Feb 8, 2015 · 9 comments

Comments

@eric-nuaa
Copy link

@eric-nuaa eric-nuaa commented Feb 8, 2015

I set up a CoreOS cluster (3 VMs) on top of my OpenStack environment, and I found etcd seems running well (I tuned etcd with "peer-election-timeout: 3000" and "peer-heartbeat-interval: 600" to avoid heartbeat timeout:):

core@coreos-1 ~ $ systemctl status etcd -l 
â—� etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/etcd.service.d
           └─10-oem.conf, 20-cloudinit.conf
   Active: active (running) since Sun 2015-02-08 15:07:43 UTC; 11min ago
 Main PID: 623 (etcd)
   CGroup: /system.slice/etcd.service
           └─623 /usr/bin/etcd

Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:48.754 INFO      | Discovery found peers [http://10.0.0.43:7001 http://10.0.0.42:7001]
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:48.775 INFO      | Discovery fetched back peer list: [10.0.0.43:7001 10.0.0.42:7001]
Feb 08 15:07:48 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:48.932 INFO      | Send Join Request to http://10.0.0.43:7001/join
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.155 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af joined the cluster via peer 10.0.0.43:7001
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.450 INFO      | etcd server [name 6de45d8f7ed64103a2e3ef4e2eef87af, listen on :4001, advertised url http://10.0.0.41:4001]
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.473 INFO      | peer server [name 6de45d8f7ed64103a2e3ef4e2eef87af, listen on :7001, advertised url http://10.0.0.41:7001]
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.512 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af starting in peer mode
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.532 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af: state changed from 'initialized' to 'follower'.
Feb 08 15:07:50 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:50.844 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af: peer added: '18ac2280bac94a7ead141eaaa0f89740'
Feb 08 15:07:51 coreos-1.novalocal etcd[623]: [etcd] Feb  8 15:07:51.242 INFO      | 6de45d8f7ed64103a2e3ef4e2eef87af: peer added: 'f9f13445a5464c4f87d7d5436f305cdb'
core@coreos-1 ~ $ systemctl cat etcd       
# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd 
Restart=always
RestartSec=10s

# /run/systemd/system/etcd.service.d/10-oem.conf
[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

# /run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=10.0.0.41:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/87c03e951672c6368f43aef8e0e800d7"
Environment="ETCD_PEER_ADDR=10.0.0.41:7001"
Environment="ETCD_PEER_ELECTION_TIMEOUT=3000"
Environment="ETCD_PEER_HEARTBEAT_INTERVAL=600"

But it seems fleet is not running well:

core@coreos-1 ~ $ systemctl status fleet -l 
â—� fleet.service - fleet daemon
   Loaded: loaded (/usr/lib64/systemd/system/fleet.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/fleet.service.d
           └─20-cloudinit.conf
   Active: active (running) since Sun 2015-02-08 15:07:44 UTC; 17min ago
 Main PID: 627 (fleetd)
   CGroup: /system.slice/fleet.service
           └─627 /usr/bin/fleetd

Feb 08 15:23:52 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:56 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR server.go:185: Server monitor triggered: Monitor timed out before successful heartbeat
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO server.go:153: Establishing etcd connectivity
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Feb 08 15:23:57 coreos-1.novalocal fleetd[627]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: cancelled
Feb 08 15:23:58 coreos-1.novalocal fleetd[627]: INFO server.go:164: Starting server components
Feb 08 15:24:14 coreos-1.novalocal fleetd[627]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled

And I found the CPU utilization of my CoreOS VMs is very high:

Tasks:  89 total,   2 running,  87 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.6 us, 18.8 sy,  0.0 ni,  1.1 id,  0.0 wa, 23.9 hi,  6.6 si,  0.0 st
KiB Mem:   2054224 total,  1146488 used,   907736 free,   895760 buffers
KiB Swap:        0 total,        0 used,        0 free.   149880 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                             
  627 root      20   0  478252  16160   6296 S 29.2  0.8   4:27.52 fleetd                                                                              
  476 root      20   0  590472  22920  11892 R 19.8  1.1   3:23.40 update_engine                                                                       
  623 etcd      20   0  294676  15312   8816 S  8.1  0.7   1:09.43 etcd

And "fleet list-machines" can not list machines:

core@coreos-1 ~ $ fleet list-machines 
INFO fleet.go:58: Starting fleetd version 0.9.0
INFO fleet.go:162: No provided or default config file found - proceeding without
INFO server.go:153: Establishing etcd connectivity
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
INFO server.go:164: Starting server components
INFO engine.go:80: Engine leader is 18ac2280bac94a7ead141eaaa0f89740
INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled

I am using 557.1.0:

core@coreos-1 ~ $ cat /etc/lsb-release        
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=557.1.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 557.1.0"
@bcwaldon
Copy link
Contributor

@bcwaldon bcwaldon commented Feb 9, 2015

The client tool is fleetctl, not fleet. What is the output of fleetctl list-machines?

@eric-nuaa
Copy link
Author

@eric-nuaa eric-nuaa commented Feb 10, 2015

Sorry, my bad, the output of fleetctl list-machines is good.

core@coreos-1 ~ $ fleetctl list-machines 
MACHINE         IP              METADATA
7bc5032d...     10.0.0.53       -
80f7ecd7...     10.0.0.55       -
b9daca72...     10.0.0.54       -

and just now, I added the following in my cloud-config:

fleet:
    etcd-request-timeout: 5

And then it seems fleet can get response from etcd now, but it has other errors:

core@coreos-1 ~ $ journalctl -u fleet -f         
-- Logs begin at Mon 2015-02-09 13:46:24 UTC. --
Feb 10 02:41:54 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 10 02:42:24 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 10 02:46:55 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input

Are these errors harmful? And I still see the CPU utilization of fleet is very high:

%Cpu(s): 20.2 us,  9.3 sy,  0.0 ni, 59.1 id,  1.0 wa,  0.0 hi, 10.4 si,  0.0 st
KiB Mem:   4052432 total,   331584 used,  3720848 free,     4264 buffers
KiB Swap:        0 total,        0 used,        0 free.   212700 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                            
  638 root      20   0  552952  14960   6404 S  51.9  0.4 337:06.42 fleetd                                                                             
  635 etcd      20   0  300612  30296   8888 S   8.9  0.7  80:54.71 etcd 
@bcwaldon
Copy link
Contributor

@bcwaldon bcwaldon commented Feb 10, 2015

@eric-nuaa I believe the "etcd watcher" errors you see are a known issue (#888) and can be ignored.

@bcwaldon
Copy link
Contributor

@bcwaldon bcwaldon commented Feb 10, 2015

@eric-nuaa I asked for the fleet logs in https://groups.google.com/forum/#!topic/coreos-user/52VxrfhtRTw - could you please drop those logs here?

@eric-nuaa
Copy link
Author

@eric-nuaa eric-nuaa commented Feb 11, 2015

The logs of update-engine and fleet are too big, and it seems I can not attach files here ...

Currently in my CoreOS VMs, I still see the CPU utilization of fleet and etcd are very high but not update-engine, and there are a lot of error in fleet log. However, it seems these error can be ignored, so I do not know why fleet is so busy.

Feb 11 05:02:38 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:04:10 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:07:38 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:09:11 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:12:38 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:14:11 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:17:38 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:19:12 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:22:39 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:24:13 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:27:40 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:29:14 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:32:40 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:34:03 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:37:41 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:38:57 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:42:22 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input
Feb 11 05:43:58 coreos-1.novalocal fleetd[638]: ERROR event.go:107: etcd watcher {Watch /_coreos.com/fleet/job} returned error: unexpected end of JSON input

and for etcd, its log seems OK:

Feb 11 00:42:35 coreos-3.novalocal etcd[635]: [etcd] Feb 11 00:42:35.211 INFO      | 80f7ecd7721f463fbf7d1f5dfe04d018: snapshot of 10004 events at index 290133 completed
Feb 11 01:55:01 coreos-3.novalocal etcd[635]: [etcd] Feb 11 01:55:01.470 INFO      | 80f7ecd7721f463fbf7d1f5dfe04d018: snapshot of 10009 events at index 300142 completed
Feb 11 03:07:11 coreos-3.novalocal etcd[635]: [etcd] Feb 11 03:07:11.026 INFO      | 80f7ecd7721f463fbf7d1f5dfe04d018: snapshot of 10004 events at index 310146 completed
Feb 11 04:19:11 coreos-3.novalocal etcd[635]: [etcd] Feb 11 04:19:11.807 INFO      | 80f7ecd7721f463fbf7d1f5dfe04d018: snapshot of 10003 events at index 320149 completed
Feb 11 05:31:27 coreos-3.novalocal etcd[635]: [etcd] Feb 11 05:31:27.123 INFO      | 80f7ecd7721f463fbf7d1f5dfe04d018: snapshot of 10006 events at index 330155 completed

So I do not know why etcd is so busy too.

%Cpu(s): 35.5 us, 23.5 sy,  0.0 ni, 27.8 id,  1.1 wa,  0.0 hi, 12.1 si,  0.0 st
KiB Mem:   4052432 total,   341788 used,  3710644 free,     4252 buffers
KiB Swap:        0 total,        0 used,        0 free.   209580 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                            
  642 root      20   0  618616  16096   6436 S  62.2  0.4   1242:27 fleetd                                                                             
  635 etcd      20   0  336296  41368   8800 S  46.1  1.0 622:03.42 etcd 

And it seems my CoreOS VMs have been automatically updated from 557.1.0 to 557.2.0.

@bcwaldon
Copy link
Contributor

@bcwaldon bcwaldon commented Feb 11, 2015

@eric-nuaa What type of workload do you have running on fleet? Lots of units?

@eric-nuaa
Copy link
Author

@eric-nuaa eric-nuaa commented Feb 12, 2015

@bcwaldon Actually no any units.

core@coreos-1 ~ $ fleetctl list-units
UNIT MACHINE ACTIVE SUB

@bcwaldon
Copy link
Contributor

@bcwaldon bcwaldon commented May 1, 2015

@eric-nuaa When you shared the above logs, it seems like you pulled the etcd logs from the wrong machine. Are you still experiencing this problem?

@bcwaldon
Copy link
Contributor

@bcwaldon bcwaldon commented Jul 9, 2015

No update in 2 months

@bcwaldon bcwaldon closed this Jul 9, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.