This repository was archived by the owner on Jan 30, 2020. It is now read-only.
Install a 3-node cluster running fleet v0.6.2 (CoreOS v410)
Start 3 units that all conflict and spread out across the cluster
Note that the 3 units are spread across all 3 nodes
Upgrade a single node to fleet v0.8.1 (CoreOS v444)
It is likely that the unit is no longer scheduled in the cluster, while you would expect it to be running on the node that just upgraded to v444.
So here's what's happening... All three nodes are trying to acquire a lock in etcd. When the first 3 nodes were deployed, one of the nodes acquired this lock and has not let it go. While it holds this lock, it acts as the Engine, offering jobs and accepting bids (scheduling work). When a machine is upgraded that does not hold this lock, it no longer participates in the job offering mechanism. At this point, since it isn't bidding on any jobs, the engine will not schedule any work back to this machine.
The only workaround right now is to force the lock to transfer ownership to the upgraded machine. This can be done by calling `etcdctl rm /_coreos.com/fleet/lease/engine-leader; sudo systemctl restart fleet; etcdctl get /_coreos.com/fleet/lease/engine-leader` from that upgraded machine. Only once the output of the `etcdctl get` shows the machine-id of the upgraded machine can you move forward with upgrading the other machines.
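For anyone scripting this, here is a rough sketch of the same three commands wrapped in a check-and-retry loop. The function names, the 30-attempt limit, and the one-second poll interval are my own choices, not anything fleet provides. It also assumes the lease value is exactly the bare machine-id string, which is what the description above suggests but which I have not verified against every fleet version.

```shell
#!/bin/sh
# Sketch only: hand the fleet engine lease to the machine this runs on.
# Assumes etcdctl and systemd are available, as on a CoreOS host.

LEASE_KEY="/_coreos.com/fleet/lease/engine-leader"

# Succeeds when the lease value equals the given machine-id.
# ASSUMPTION: the lease value is the bare machine-id string.
holds_lease() {
    [ "$(etcdctl get "$LEASE_KEY" 2>/dev/null)" = "$1" ]
}

# Remove the lease, restart fleet, then poll until this machine holds it.
# The helper name and the retry limits are hypothetical.
transfer_engine_lease() {
    machine_id="$1"

    etcdctl rm "$LEASE_KEY"          # drop the current lease
    sudo systemctl restart fleet     # let the local fleet try to take it

    attempts=0
    while [ "$attempts" -lt 30 ]; do
        if holds_lease "$machine_id"; then
            echo "engine lease held by $machine_id"
            return 0
        fi
        attempts=$((attempts + 1))
        sleep 1
    done

    echo "engine lease not acquired by $machine_id" >&2
    return 1
}
```

On the upgraded machine this might be invoked as `transfer_engine_lease "$(cat /etc/machine-id)"`, assuming fleet takes its machine-id from that file. A non-zero exit means another node may have grabbed the lease first, in which case running it again is the same as repeating the manual steps.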
A better fix is in the works.