fleet engine leadership losting trigger all units restarting on centos7 #1181

Closed
holmes86 opened this Issue Apr 9, 2015 · 10 comments


holmes86 commented Apr 9, 2015

Hi, I run a fleet 0.8 cluster on CentOS 7. Recently I found that fleet restarts all units on one machine in the cluster. There is no pattern: sometimes it happens once every two days, sometimes once every twenty minutes. Another machine in the cluster is fine. The log is as follows:

Apr 09 21:22:46 machine1 fleetd[21543]: INFO engine.go:208: Waiting 26s for previous lease to expire before continuing reconciliation
Apr 09 21:22:46 machine1 fleetd[21543]: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr 09 21:22:44 machine1 fleetd[21543]: ERROR engine.go:218: Engine leadership lost, renewal failed: 101: Compare failed ([13679370 != 1
Apr 09 21:22:20 machine1 fleetd[21543]: INFO engine.go:208: Waiting 24s for previous lease to expire before continuing reconciliation
Apr 09 21:22:20 machine1 fleetd[21543]: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr 09 21:22:18 machine1 fleetd[21543]: ERROR engine.go:218: Engine leadership lost, renewal failed: 101: Compare failed ([13678688 != 1
Apr 09 21:21:54 machine1 fleetd[21543]: INFO engine.go:208: Waiting 24s for previous lease to expire before continuing reconciliation
Apr 09 21:21:54 machine1 fleetd[21543]: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr 09 21:21:52 machine1 fleetd[21543]: ERROR engine.go:218: Engine leadership lost, renewal failed: 101: Compare failed ([13677937 != 1
Apr 09 21:21:27 machine1 fleetd[21543]: INFO engine.go:208: Waiting 25s for previous lease to expire before continuing reconciliation
Apr 09 21:21:27 machine1 fleetd[21543]: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr 09 21:21:25 machine1 fleetd[21543]: ERROR engine.go:218: Engine leadership lost, renewal failed: 101: Compare failed ([13677236 != 1
Apr 09 21:20:59 machine1 fleetd[21543]: INFO engine.go:208: Waiting 26s for previous lease to expire before continuing reconciliation
Apr 09 21:20:59 machine1 fleetd[21543]: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr 09 21:20:57 machine1 fleetd[21543]: ERROR engine.go:218: Engine leadership lost, renewal failed: 101: Compare failed ([13676575 != 1
Apr 09 21:20:31 machine1 fleetd[21543]: INFO engine.go:208: Waiting 25s for previous lease to expire before continuing reconciliation
Apr 09 21:20:31 machine1 fleetd[21543]: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr 09 21:20:30 machine1 fleetd[21543]: ERROR engine.go:218: Engine leadership lost, renewal failed: 101: Compare failed ([13675908 != 1
Apr 09 21:20:29 machine1 fleetd[21543]: WARN engine.go:116:
Apr 09 21:20:29 machine1 fleetd[21543]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: 
Apr 09 21:20:29 machine1 fleetd[21543]: ERROR engine.go:268: Failed scheduling Unit(login.service) to Machine(21abf58f4
Apr 09 21:20:29 machine1 fleetd[21543]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: 
Apr 09 21:20:29 machine1 fleetd[21543]: ERROR engine.go:268: Failed scheduling Unit(user.service) to Machine(21abf58f49294
Apr 09 21:20:29 machine1 fleetd[21543]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: 
Apr 09 21:20:29 machine1 fleetd[21543]: ERROR engine.go:268: Failed scheduling Unit(tradecenter.service) to Machine(21abf58f49294
Apr 09 21:20:29 machine1 fleetd[21543]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: 
Apr 09 21:20:29 machine1 fleetd[21543]: ERROR engine.go:268: Failed scheduling Unit(user.service) to Machine(21abf58f49294
Apr 09 21:20:28 machine1 fleetd[21543]: INFO reconciler.go:163: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: 
Apr 09 21:20:28 machine1 fleetd[21543]: ERROR engine.go:268: Failed scheduling Unit(mysql.service) to Machine(21abf58f49
Apr 09 21:20:28 machine1 fleetd[21543]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=vendor.s
Apr 09 21:20:28 machine1 fleetd[21543]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=home.ser
.......
.......

thanks

@holmes86 holmes86 changed the title from fleet trigger all units restarting on centos7 to fleet engine leadership losting trigger all units restarting on centos7 Apr 9, 2015

holmes86 Apr 9, 2015

I checked the /var/log/messages log. Whenever a leadership steal fails, the units get restarted.

Apr  9 21:43:00 machine01 fleetd: ERROR engine.go:198: Engine leadership steal failed: 101: Compare failed ([13711129 != 13711141]) [13711142]
Apr  9 21:43:02 machine01 fleetd: ERROR engine.go:198: Engine leadership steal failed: 101: Compare failed ([13711141 != 13711207]) [13711207]
Apr  9 21:43:04 machine01 fleetd: INFO engine.go:205: Stole engine leadership from Machine(9c33fade6ddf46448fcbacd8ed8495a5)
Apr  9 21:43:04 machine01 fleetd: INFO engine.go:208: Waiting 24s for previous lease to expire before continuing reconciliation
Apr  9 21:43:04 machine01 systemd: Stopped xxxxxxxxxxx.
Apr  9 21:43:04 machine01 fleetd: INFO manager.go:145: Triggered systemd unit xxxxxxxxxxx.service stop: job=84151
Apr  9 21:43:04 machine01 fleetd: INFO manager.go:275: Removing systemd unit xxxxxxxxxxx.service
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=xxxxxxxxxxx.service reason="unit loaded but not scheduled here"
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight
Apr  9 21:43:04 machine01 fleetd: INFO reconcile.go:314: AgentReconciler task chain failed: chain={xxxxxxxxxxx.service (UnloadUnit, "unit loaded but not scheduled here")} err=task already in flight

bcwaldon May 1, 2015

Contributor

Losing engine leadership is indicative of being unable to communicate with etcd. If fleet cannot communicate with etcd, it assumes it is in the minority of a netsplit and stops work. Are you able to identify any correlation between these events in fleet and events in the local etcd node?
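
The "101: Compare failed ([X != Y])" lines in the logs above have the shape of a rejected compare-and-swap: the writer supplied an index that no longer matched the stored one. As a rough illustration only, here is a toy in-memory model of that pattern. `ToyStore`, `compare_and_swap`, and the engine names are hypothetical; this is neither fleet's engine code nor the etcd client API.

```python
class CompareFailed(Exception):
    """Raised when the caller's expected index does not match the stored one."""


class ToyStore:
    """Hypothetical in-memory stand-in for a store with a modification index."""

    def __init__(self):
        self.index = 0   # bumped on every successful write
        self.data = {}   # key -> (value, index at which it was written)

    def compare_and_swap(self, key, value, prev_index):
        # Write only if the key was last modified at prev_index.
        _, current = self.data.get(key, (None, 0))
        if current != prev_index:
            raise CompareFailed(f"[{prev_index} != {current}]")
        self.index += 1
        self.data[key] = (value, self.index)
        return self.index


store = ToyStore()
idx_a = store.compare_and_swap("lease", "engine-A", prev_index=0)      # A acquires
idx_b = store.compare_and_swap("lease", "engine-B", prev_index=idx_a)  # B takes over

try:
    # A tries to renew with the index it remembers, which is now stale.
    store.compare_and_swap("lease", "engine-A", prev_index=idx_a)
    renewal_error = None
except CompareFailed as err:
    renewal_error = str(err)
```

In this model a renewal fails whenever some other writer has touched the key since the holder last wrote it, whether or not the holder can still reach the store.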

kelseyhightower Jun 5, 2015

Contributor

I'm seeing this issue as well. Seems fleet is able to connect to etcd:

$ /opt/bin/fleetctl --endpoint=http://10.20.132.53:2379 list-machines
MACHINE     IP      METADATA
3f44ca14... 10.20.132.53    -
core@localhost /opt/bin $ /opt/bin/fleetctl --endpoint=http://10.20.132.53:2379 list-machines
MACHINE     IP      METADATA
3f44ca14... 10.20.132.56    -
core@localhost /opt/bin $ /opt/bin/fleetctl --endpoint=http://10.20.132.53:2379 list-machines
MACHINE     IP      METADATA
3f44ca14... 10.20.132.57    -
core@localhost /opt/bin $ /opt/bin/fleetctl --endpoint=http://10.20.132.53:2379 list-machines
MACHINE     IP      METADATA
3f44ca14... 10.20.132.55    -
core@localhost /opt/bin $ /opt/bin/fleetctl --endpoint=http://10.20.132.53:2379 list-machines
MACHINE     IP      METADATA
3f44ca14... 10.20.132.55    -

kelseyhightower Jun 5, 2015

Contributor

Found the root cause. Looks like I got a bunch of machines with the same machine-id:

core@localhost /opt/bin $ cat /etc/machine-id 
3f44ca144f5148a6b704fc3d4fa45220
core@localhost /opt/bin $ exit
logout

Connection to 10.20.132.53 closed.
FIPS mode initialized
Last login: Fri Jun  5 13:38:08 2015 from 10.20.132.29
CoreOS stable (647.2.0)
core@localhost ~ $ cat /etc/machine-id 
3f44ca144f5148a6b704fc3d4fa45220
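
A check like the one above can be automated once the contents of /etc/machine-id have been collected from each host. The following is a minimal sketch, assuming the IDs are already gathered into a dict; `find_duplicate_machine_ids` is a hypothetical helper, the IPs are the ones from the list-machines output earlier in the thread, and the second ID is made up for illustration.

```python
from collections import defaultdict


def find_duplicate_machine_ids(ids_by_host):
    """Return {machine_id: [hosts]} for every ID shared by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, machine_id in sorted(ids_by_host.items()):
        # Strip the trailing newline a file read would leave behind.
        hosts_by_id[machine_id.strip().lower()].append(host)
    return {mid: hosts for mid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Illustrative input; how the IDs are collected (ssh, config management) is left open.
collected = {
    "10.20.132.53": "3f44ca144f5148a6b704fc3d4fa45220\n",
    "10.20.132.55": "3f44ca144f5148a6b704fc3d4fa45220\n",
    "10.20.132.56": "3f44ca144f5148a6b704fc3d4fa45220\n",
    "10.20.132.99": "0123456789abcdef0123456789abcdef\n",  # made-up host and ID
}
duplicates = find_duplicate_machine_ids(collected)
```

An empty result means every host reported a distinct ID.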

bcwaldon Jul 9, 2015

Contributor

No update in 2 months, closing

star1989 May 6, 2016

@kelseyhightower Did you find out why the machines had the same machine ID? I am getting the same error.

dongsupark May 6, 2016

Contributor

@star1989 Isn't it already fixed by #1561? Can you please try the latest git, which already contains the fix?

star1989 May 8, 2016

I use the fleet that ships with the latest CoreOS stable (899.17.0), not a build from the git repo. Several servers have the same machine ID in /etc/machine-id. The fleet service still works, but it detects only one of the servers that share a machine ID. So I would like to know the reason, and how the machine ID is generated. Thanks, @dongsupark.

kayrus May 8, 2016

Contributor

@star1989 Unfortunately, the latest CoreOS doesn't contain the latest fleet release. I'd recommend compiling fleet from the latest git master branch. Identical machine IDs can result from machine cloning, rsync, etc.
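
For context on what a cloned image carries over: /etc/machine-id holds a single 128-bit ID written as 32 lowercase hexadecimal characters, so a copied disk image or an rsync of /etc copies the ID along with everything else. The sketch below only illustrates that format; `new_machine_id` and `is_valid_machine_id` are hypothetical helpers, and on a real systemd host the ID would normally be regenerated by removing the file and running `systemd-machine-id-setup`, not by hand-rolled code.

```python
import re
import uuid

# 32 lowercase hex characters, the format seen in /etc/machine-id above.
MACHINE_ID_RE = re.compile(r"[0-9a-f]{32}")


def new_machine_id():
    """Return a random 128-bit ID rendered as 32 lowercase hex characters."""
    return uuid.uuid4().hex


def is_valid_machine_id(text):
    """Check the format of a machine ID; an all-zero ID is rejected."""
    candidate = text.strip()
    return bool(MACHINE_ID_RE.fullmatch(candidate)) and candidate != "0" * 32


fresh_id = new_machine_id()
```

Two independently generated IDs are expected to differ, which is the property a cloned image breaks.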

star1989 May 9, 2016

@kayrus Thanks. Could you give me some links about machine cloning? I want to know more.
