Fleet engine stops units when etcd leadership changes or etcd has connectivity issues #1289

Open
victorcampos opened this Issue Jul 3, 2015 · 22 comments

victorcampos commented Jul 3, 2015

EDIT:
fleet is at version 0.10.2 from the stable channel
etcd is at version 0.4.9

I'm running a 5-node etcd 0.4 cluster in a production environment. Currently, whenever the leader changes (or a node has connectivity issues), some of our fleet units are stopped, with messages like these in the fleet.service logs:

Jul 02 08:02:29 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: too many redirects
Jul 02 08:02:29 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a2.<OUR_CLUSTER_DOMAIN>:4001/: too many redirects
Jul 02 08:02:29 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: too many redirects
Jul 02 08:02:29 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a2.<OUR_CLUSTER_DOMAIN>:4001/: too many redirects
Jul 02 08:02:30 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: ERROR reconcile.go:120: Failed fetching Units from Registry: timeout reached
Jul 02 08:02:30 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: ERROR reconcile.go:73: Unable to determine agent's desired state: timeout reached
Jul 02 08:02:30 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-b1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 02 08:02:30 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: ERROR engine.go:149: Unable to determine cluster engine version
Jul 02 08:02:30 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-b1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 02 08:02:31 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO engine.go:81: Engine leadership changed from e9fc4f64602545aca6fbab041d4abae9 to 4ddd4e13813f4e11bbf96622bf3aefe0
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: ERROR reconcile.go:120: Failed fetching Units from Registry: timeout reached
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: ERROR reconcile.go:73: Unable to determine agent's desired state: timeout reached
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 02 10:19:55 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 03 13:41:09 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: WARN job.go:272: No Unit found in Registry for Job(spark-tmp.mount)
Jul 03 13:41:09 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: ERROR job.go:109: Failed to parse Unit from etcd: unable to parse Unit in Registry at key /_coreos.com/fleet/job/spark-tmp.mount/object
Jul 03 13:41:09 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO client.go:292: Failed getting response from http://etcd-a1.<OUR_CLUSTER_DOMAIN>:4001/: cancelled
Jul 03 13:41:09 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:138: Triggered systemd unit spark-tmp.mount stop: job=1723248
Jul 03 13:41:09 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:259: Removing systemd unit spark-tmp.mount
Jul 03 13:41:09 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:182: Instructing systemd to reload units
Jul 03 13:41:10 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=spark-tmp.mount reason="unit loaded but not scheduled here"
Jul 03 13:41:10 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Jul 03 13:41:13 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:246: Writing systemd unit spark-tmp.mount (119b)
Jul 03 13:41:13 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:182: Instructing systemd to reload units
Jul 03 13:41:14 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:127: Triggered systemd unit spark-tmp.mount start: job=1723332
Jul 03 13:41:14 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=spark-tmp.mount reason="unit scheduled here but not loaded"
Jul 03 13:41:14 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Jul 03 13:41:14 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=spark-tmp.mount reason="unit currently loaded but desired state is
Jul 03 14:22:11 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:138: Triggered systemd unit spark-master-production@1.service stop: job=1726870
Jul 03 14:22:11 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO reconcile.go:330: AgentReconciler completed task: type=StopUnit job=spark-master-production@1.service reason="unit currently launched bu
Jul 03 14:22:26 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:138: Triggered systemd unit spark-master-production@1.service stop: job=1726871
Jul 03 14:22:26 spark-master-a.<OUR_CLUSTER_DOMAIN> fleetd[615]: INFO manager.go:259: Removing systemd unit spark-master-production@1.service

I had to disable auto-update on our etcd instances because whenever an update was available and machines rebooted, we would have outages in the systems whose units fleet stopped.

I haven't yet set up a testing ground for this failure (i.e., another 5-node cluster where I force leadership changes or connectivity issues), but it's beginning to impact us badly: my ops team is quite small for the size of our infrastructure, and we try to avoid manual tasks such as restarts and reprovisioning.

I could not find an existing bug/issue related to this; if someone can point me to something similar or suggest a solution, it would be greatly appreciated.

Contributor

philk commented Jul 9, 2015

We're seeing the same behavior with fleet 0.10.1 and etcd 2.0.10. I thought it was the etcd proxy's fault ( coreos/etcd#3007 ) and worked around it by specifying more than just the local etcd proxy, but #1231 and #1250 make me wonder if I'm going to run into different issues when I deploy that change. We just had an etcd election that caused a stampeding herd of units being scheduled and re-scheduled around the cluster due to fleet deciding it couldn't communicate with etcd.

Here are some logs I captured (this was before adding all the etcd instances to fleet, so it was still using just the local etcd proxy):

fleet:

Jul 09 05:45:44  fleetd[17011]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jul 09 05:45:44  fleetd[17011]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jul 09 05:45:44  fleetd[17011]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jul 09 05:45:44  fleetd[17011]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jul 09 05:45:44  fleetd[17011]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Jul 09 06:23:08  fleetd[17011]: INFO engine.go:81: Engine leadership changed from 738002bf41ac44b9b93f123e972b7732 to 8157948988ab4ab1a3cf6fe2b35df866
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit hermod@f160512-1.service (911b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit vor@311bd3c-2.service (1166b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit hermod@f160512-3.service (911b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit tyr@b638bbd-1.service (1022b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit skadi@3477f1a-1.service (1025b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit syn@89bbb32-4.service (1195b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:246: Writing systemd unit skadi@3477f1a-4.service (1025b)
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:182: Instructing systemd to reload units
Jul 09 06:23:12  fleetd[17011]: INFO manager.go:127: Triggered systemd unit skadi@3477f1a-1.service start: job=2603018
Jul 09 06:23:13  fleetd[17011]: INFO manager.go:127: Triggered systemd unit syn@89bbb32-4.service start: job=2603109
Jul 09 06:23:13  fleetd[17011]: INFO manager.go:127: Triggered systemd unit hermod@f160512-3.service start: job=2603200
Jul 09 06:23:13  fleetd[17011]: INFO manager.go:127: Triggered systemd unit tyr@b638bbd-1.service start: job=2603291
Jul 09 06:23:13  fleetd[17011]: INFO manager.go:127: Triggered systemd unit hermod@f160512-1.service start: job=2603382
Jul 09 06:23:13  fleetd[17011]: INFO manager.go:127: Triggered systemd unit vor@311bd3c-2.service start: job=2603473
Jul 09 06:23:13  fleetd[17011]: INFO manager.go:127: Triggered systemd unit skadi@3477f1a-4.service start: job=2603564
Jul 09 06:23:13  fleetd[17011]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=hermod@f160512-1.service reason="unit scheduled here but not loaded"
Jul 09 06:23:13  fleetd[17011]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=vor@311bd3c-2.service reason="unit scheduled here but not loaded"

etcd0:

Jul 09 06:22:29  etcd2[16125]: 2015/07/09 06:22:29 etcdserver: compacted log at index 38016358
Jul 09 06:22:29  etcd2[16125]: 2015/07/09 06:22:29 wal: segmented wal file /var/lib/etcd2/member/wal/0000000000000ed8-0000000002441567.wal is created
Jul 09 06:22:29  etcd2[16125]: 2015/07/09 06:22:29 etcdserver: saved snapshot at index 38016358
Jul 09 06:22:41  etcd2[16125]: 2015/07/09 06:22:41 filePurge: successfully removed file /var/lib/etcd2/member/snap/0000000000000c3d-00000000024351f0.snap
Jul 09 06:22:41  etcd2[16125]: 2015/07/09 06:22:41 filePurge: successfully removed file /var/lib/etcd2/member/wal/0000000000000ed3-00000000024351f1.wal
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 is starting a new election at term 3133
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 became candidate at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 received vote from 3a6998a9e8713ee2 at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 [logterm: 3133, index: 38017078] sent vote request to ff4749355ad8af87 at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 [logterm: 3133, index: 38017078] sent vote request to fc70c169cc1f5739 at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft.node: 3a6998a9e8713ee2 lost leader fc70c169cc1f5739 at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 rafthttp: encountered error reading the client log stream: read tcp 10.0.59.127:2380: use of closed network connection
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 rafthttp: client streaming to fc70c169cc1f5739 at term 3133 has been stopped
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 received vote from ff4749355ad8af87 at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 [q:2] has received 2 votes and 0 vote rejections
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft: 3a6998a9e8713ee2 became leader at term 3134
Jul 09 06:23:08  etcd2[16125]: 2015/07/09 06:23:08 raft.node: 3a6998a9e8713ee2 elected leader 3a6998a9e8713ee2 at term 3134
Jul 09 06:29:13  etcd2[16125]: 2015/07/09 06:29:13 etcdserver: start to snapshot (applied: 38026364, lastsnap: 38016358)
Jul 09 06:29:13  etcd2[16125]: 2015/07/09 06:29:13 etcdserver: compacted log at index 38026364
Jul 09 06:29:13  etcd2[16125]: 2015/07/09 06:29:13 wal: segmented wal file /var/lib/etcd2/member/wal/0000000000000ed9-0000000002443c81.wal is created

etcd1:

Jul 09 06:19:41  etcd2[15951]: 2015/07/09 06:19:41 etcdserver: start to snapshot (applied: 38016258, lastsnap: 38006257)
Jul 09 06:19:41  etcd2[15951]: 2015/07/09 06:19:41 etcdserver: compacted log at index 38016258
Jul 09 06:19:41  etcd2[15951]: 2015/07/09 06:19:41 wal: segmented wal file /var/lib/etcd2/member/wal/0000000000000ed8-0000000002441503.wal is created
Jul 09 06:19:41  etcd2[15951]: 2015/07/09 06:19:41 etcdserver: saved snapshot at index 38016258
Jul 09 06:19:48  etcd2[15951]: 2015/07/09 06:19:48 filePurge: successfully removed file /var/lib/etcd2/member/snap/0000000000000c3d-00000000024351ac.snap
Jul 09 06:19:49  etcd2[15951]: 2015/07/09 06:19:49 filePurge: successfully removed file /var/lib/etcd2/member/wal/0000000000000ed3-00000000024351ad.wal
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raft: fc70c169cc1f5739 [term: 3133] received a MsgVote message with higher term from 3a6998a9e8713ee2 [term: 3134]
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raft: fc70c169cc1f5739 became follower at term 3134
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raft: fc70c169cc1f5739 [logterm: 3133, index: 38017080, vote: 0] rejected vote from 3a6998a9e8713ee2 [logterm: 3133, index: 38017078] at term 3134
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raft.node: fc70c169cc1f5739 lost leader fc70c169cc1f5739 at term 3134
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raftlog: found conflict at index 38017079 [existing term: 3133, conflicting term: 3134]
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raftlog: replace the unstable entries from index 38017079
Jul 09 06:20:21  etcd2[15951]: 2015/07/09 06:20:21 raft.node: fc70c169cc1f5739 elected leader 3a6998a9e8713ee2 at term 3134
Jul 09 06:20:22  etcd2[15951]: 2015/07/09 06:20:22 rafthttp: server streaming to 3a6998a9e8713ee2 at term 3133 has been stopped
Jul 09 06:20:22  etcd2[15951]: 2015/07/09 06:20:22 rafthttp: starting client stream to 3a6998a9e8713ee2 at term 3134
Jul 09 06:20:26  etcd2[15951]: 2015/07/09 06:20:26 etcdhttp: unexpected error: etcdserver: request timed out
Jul 09 06:26:25  etcd2[15951]: 2015/07/09 06:26:25 etcdserver: start to snapshot (applied: 38026260, lastsnap: 38016258)
Jul 09 06:26:25  etcd2[15951]: 2015/07/09 06:26:25 etcdserver: compacted log at index 38026260
Jul 09 06:26:25  etcd2[15951]: 2015/07/09 06:26:25 wal: segmented wal file /var/lib/etcd2/member/wal/0000000000000ed9-0000000002443c1b.wal is created
Jul 09 06:26:25  etcd2[15951]: 2015/07/09 06:26:25 etcdserver: saved snapshot at index 38026260
Jul 09 06:26:49  etcd2[15951]: 2015/07/09 06:26:48 filePurge: successfully removed file /var/lib/etcd2/member/snap/0000000000000c3d-00000000024378bd.snap
Jul 09 06:26:49  etcd2[15951]: 2015/07/09 06:26:49 filePurge: successfully removed file /var/lib/etcd2/member/wal/0000000000000ed4-00000000024378bf.wal

etcd2:

Jul 09 06:22:43 etcd2[1361]: 2015/07/09 06:22:43 wal: segmented wal file /var/lib/etcd2/member/wal/00000000000002de-000000000244166a.wal is created
Jul 09 06:22:43 etcd2[1361]: 2015/07/09 06:22:43 etcdserver: saved snapshot at index 38016617
Jul 09 06:22:51 etcd2[1361]: 2015/07/09 06:22:51 filePurge: successfully removed file /var/lib/etcd2/member/snap/0000000000000c3d-00000000024352e8.snap
Jul 09 06:22:51 etcd2[1361]: 2015/07/09 06:22:51 filePurge: successfully removed file /var/lib/etcd2/member/wal/00000000000002d9-00000000024352ee.wal
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft: ff4749355ad8af87 [term: 3133] received a MsgVote message with higher term from 3a6998a9e8713ee2 [term: 3134]
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft: ff4749355ad8af87 became follower at term 3134
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft: ff4749355ad8af87 [logterm: 3133, index: 38017078, vote: 0] voted for 3a6998a9e8713ee2 [logterm: 3133, index: 38017078] at term 3134
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft.node: ff4749355ad8af87 lost leader fc70c169cc1f5739 at term 3134
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft.node: ff4749355ad8af87 elected leader 3a6998a9e8713ee2 at term 3134
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 rafthttp: starting client stream to 3a6998a9e8713ee2 at term 3134
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft: ff4749355ad8af87 [term: 3134] ignored a MsgApp message with lower term from fc70c169cc1f5739 [term: 3133]
Jul 09 06:23:08 etcd2[1361]: 2015/07/09 06:23:08 raft: ff4749355ad8af87 [term: 3134] ignored a MsgApp message with lower term from fc70c169cc1f5739 [term: 3133]
Jul 09 06:23:12 etcd2[1361]: 2015/07/09 06:23:12 etcdhttp: unexpected error: etcdserver: request timed out
Jul 09 06:23:13 etcd2[1361]: 2015/07/09 06:23:13 rafthttp: encountered error reading the client log stream: read tcp 10.0.59.127:2380: i/o timeout
Jul 09 06:23:13 etcd2[1361]: 2015/07/09 06:23:13 rafthttp: client streaming to fc70c169cc1f5739 at term 3133 has been stopped
Jul 09 06:29:27 etcd2[1361]: 2015/07/09 06:29:27 etcdserver: start to snapshot (applied: 38026618, lastsnap: 38016617)
Jul 09 06:29:27 etcd2[1361]: 2015/07/09 06:29:27 etcdserver: compacted log at index 38026618

Contributor

bcwaldon commented Jul 9, 2015

This is actually a design decision we made early on. If fleet cannot contact etcd, it cannot tell what work is scheduled to it.


odain2 commented Jul 20, 2015

@bcwaldon I can understand the reasoning behind the design decision, but this has significant negative side effects. For us, running a 7-node cluster on AWS, we see etcd connectivity issues from otherwise unloaded hosts about once every 2 days. This causes jobs to move all over the cluster, creating significant "churn". It can also cause jobs to consolidate on just a few hosts, leading to poor cluster resource usage. For example, right now we have 6 hosts and 4 scheduled jobs. Initially all 4 jobs were scheduled on different machines, as you'd expect. However, over the last 4 days we've had a few very short network hiccups as described above. As a result, the 4 jobs are now running on just 2 hosts with 2 jobs/host, which doesn't make very efficient use of our resources. What options do we have? For example, can we ask fleet to be more tolerant of network issues, or disable the default behavior of moving work during a brief etcd outage?

Telvary commented Jul 27, 2015

+1 on keeping the server in its current state when fleet loses connectivity with etcd.
We should have an option to either kill everything (the current default) or stay in the current state until fleet reconnects to etcd and reconciles.

A workaround is to raise the fleet timeout (agent_ttl).
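
For reference, a minimal cloud-config sketch of that workaround (the 120s value is only an illustration; the same setting lives in fleet.conf as agent_ttl, and a fuller example is posted further down this thread):

coreos:
  fleet:
    # Give the agent longer to reach the registry before it is considered dead (default is "30s").
    agent-ttl: 120s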

victorcampos commented Jul 27, 2015

@bcwaldon if it's a design decision, is it possible to raise the fleet timeout per @Telvary's suggestion?

Our problem is more than churn: we run some internal services without HA provisioning because they are not worth the cost of doubling the infrastructure, although it's a pain in the ass when they become unavailable because they were rescheduled. One of them runs a Vault service (from http://vaultproject.io), and every time it's restarted we have to unseal it through a not-so-trivial process (for security we don't store unseal keys on the same computers we use as workstations, and we need at least 3 of them to unseal the instance and regain access to shared secrets, etc.).

If @Telvary's suggestion isn't possible, is there any other workaround?

odain2 commented Jul 27, 2015

For some services, running more than one copy at a time is a problem (e.g. a database). For others (e.g. a horizontally scaled web service), it's preferable to err on the side of running too many copies rather than not enough. So the "correct" behavior when etcd status is unknown is job-dependent. I suggest making this configurable by adding something like a StopOnUnknownEtcdStatus option to the X-Fleet section of the unit files.

Contributor

philk commented Jul 27, 2015

An excellent example is global services. If I'm running kubelet as a global service, I don't want fleet to kill it just because it thinks etcd is down. Another example is a system-level monitoring app that might tell me why fleet thinks there's a problem. Instead of using fleet and all the niceties that come with it (easy service listing, easier updates, etc.), we're having to put these services in the cloud-config.

tadas-subonis commented Sep 21, 2015

FYI, we are experiencing the same here - high load causes lost connectivity to etcd and makes fleet restart units.

jacobgreenleaf commented Oct 5, 2015

This is a deal breaker for me. I second @odain2's suggestion.

Contributor

jonboulle commented Oct 5, 2015

We should have an option to either kill everything (default), or stay in the current state until fleet reconnects to etcd and reconciles.

I would personally be open to a patch implementing this kind of semantics, but I'm afraid we do not have the bandwidth to add it ourselves right now.

brandonkruse commented Oct 7, 2015

Having a problem with this as well. Trying to find a workaround for systems under high load.

tadas-subonis commented Oct 8, 2015

@brandonkruse

For me, it helped to set fleet parameters to

coreos:
  fleet:
    public-ip: $private_ipv4
    metadata: 
    engine-reconcile-interval: 10
    etcd-request-timeout: 5
    agent-ttl: 120s

Increasing the reconcile interval and the etcd request timeout should help. In the docs these values are explained as:

agent_ttl

An Agent will be considered dead if it exceeds this amount of time to communicate with the Registry. The agent will attempt a heartbeat at half of this value.

Default: "30s"

engine_reconcile_interval

Interval in seconds at which the engine should reconcile the cluster schedule in etcd.

Default: 2

etcd_request_timeout

Amount of time in seconds to allow a single etcd request before considering it failed.

Default: 1.0

brandonkruse commented Oct 8, 2015

@tadas-subonis thank you sir! I will try these now.

tadas-subonis commented Oct 8, 2015

Also, in my case I needed to bump the etcd heartbeat values (Azure needed them badly, as it doesn't cope well with high CPU usage, but on Google Cloud it was fine without them):

coreos:
  etcd2:
    heartbeat-interval: 600
    election-timeout: 6000

kou9 commented Oct 12, 2015

We're experiencing the exact same problem with etcd 2.1.2 and fleet 0.10.2 (on GCE).
Once a crash occurs on one machine, it impacts the whole cluster: the CPU spikes on one instance, which breaks the etcd cluster; fleet then restarts the containers on a second instance, which itself gets a high CPU load while restarting fleet units, and so on...

I will try @tadas-subonis 's tuning numbers for etcd2 and fleet.

Btw @tadas-subonis thanks once again for sharing the configuration tuning numbers (you already helped me on another problem).

brandonkruse commented Oct 12, 2015

@tadas-subonis Thank you for the tweaked numbers! I am running some load tests today to see if that worked. I had previously made my ExecStop not actually stop the running container. That was a hack!

@kou9 Let me know your results and I'll be sure to post back here. Now trying to come up with the best way to update cloud-config settings on 128 boxes that were PXE booted :P

kou9 commented Oct 20, 2015

I tested @tadas-subonis 's tuning and indeed, it helps greatly (thanks). Even under high load on some of the members of the ETCD/fleet cluster, ETCD is stable and fleet doesn't kill units anymore.
I'm quite satisfied with this setup. We haven't specifically checked the impact on read/write performance on ETCD, but we haven't noticed anything suspicious. Take into account that the load on ETCD is very light in our setup (mainly used by fleet and occasionally by the containerized applications).

I would definitely advise using these heartbeat/timeout numbers by default, even on clusters running on a local network.
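
For convenience, here is the tuning from this thread collected into a single cloud-config sketch (the values are exactly the ones @tadas-subonis posted above, not recommendations tuned for any particular environment):

coreos:
  fleet:
    # Reconcile less often and tolerate slower etcd responses.
    engine-reconcile-interval: 10
    etcd-request-timeout: 5
    # Allow the agent longer before it is considered dead (default "30s").
    agent-ttl: 120s
  etcd2:
    # Slower heartbeats and a longer election timeout make spurious leader elections less likely under load.
    heartbeat-interval: 600
    election-timeout: 6000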

tadas-subonis commented Oct 20, 2015

Thanks @kou9 . I figured it might be useful for others as well, so I've written it up here: http://blog.feedpresso.com/2015/10/16/tuning-fleet-and-etcd-on-coreos-to-avoid-unit-failures.html . If anyone notices a place where I've got something wrong, please let me know :)

brandonkruse commented Feb 8, 2016

@tadas-subonis just had this plague us again today. Thanks again for the writeup!

Did you ever figure out a way to avoid the rescheduling?

We are testing setting up smaller etcd clusters with more proxies to see if this reduces the overall load of etcd. With 64 etcd nodes in a cluster, keeping consensus can be difficult.

Feb 08 17:04:15 hsv-10a fleetd[795]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=blah@12.service reason="unit loaded but not scheduled here"
Feb 08 17:04:15 hsv-10a fleetd[795]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"

In a perfect world, I'd like to unschedule the discovery (sidekick) but not the underlying service (in my case, a database).

tadas-subonis commented Feb 9, 2016

@brandonkruse I haven't figured out how to do that. However, we have not been experiencing the issue since. Also, for our database we've introduced a replica on a regular Ubuntu node that takes over if something happens, so we have that covered.

dongsupark pushed a commit to dongsupark/fleet that referenced this issue Jul 8, 2016

Dongsu Park
functional: tune intervals in etcd config to avoid connectivity errors
Increase heartbeat-interval from 100 to 600, also election-timeout from
1000 to 6000. This is a workaround for avoiding occasional failures of
TestSingleNodeConnectivityLoss like below:

===
--- FAIL: TestSingleNodeConnectivityLoss (41.41s)
    connectivity-loss_test.go:98: Failed listing unit files:
        stdout:
        stderr: Error retrieving list of units from repository:
googleapi: Error 503: fleet server unable to  communicate with etcd

        err: exit status 1
===

See also coreos#1289 (comment)
Workaround for coreos#1250

klausenbusk commented Jul 29, 2016

Have any of you tried allocating more CPU time to etcd/fleet with the CPUShares= systemd parameter?

We have the same problem, which I think is caused by high load, but I'm not sure yet.
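
If anyone wants to try that, a cloud-config sketch using systemd drop-ins might look like the one below. This is only a sketch: it assumes your CoreOS version supports drop-ins in cloud-config and that the stock etcd2.service and fleet.service unit names are in use, and the 2048 value is an arbitrary example (systemd's default share is 1024).

coreos:
  units:
    - name: etcd2.service
      drop-ins:
        - name: 10-cpushares.conf
          content: |
            [Service]
            # Give etcd a larger share of CPU time when the host is under contention.
            CPUShares=2048
    - name: fleet.service
      drop-ins:
        - name: 10-cpushares.conf
          content: |
            [Service]
            CPUShares=2048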

Contributor

onlyjob commented Jul 30, 2016

It may help to increase etcd_request_timeout from the default of 1.0 to something like 5.0 in fleet.conf.

etcd_request_timeout controls "amount of time in seconds to allow a single etcd request before considering it failed".
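
For anyone provisioning through cloud-config rather than editing fleet.conf directly, the same knob is spelled with dashes there (a sketch; 5.0 mirrors the value suggested above):

coreos:
  fleet:
    # Allow each etcd request up to 5 seconds before fleet considers it failed (default 1.0).
    etcd-request-timeout: 5.0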
