This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Poor experience when upgrading from v0.6.2 past v0.7.0 #924

@bcwaldon

Description

  1. Install a 3-node cluster running fleet v0.6.2 (CoreOS v410)
  2. Start 3 units that all conflict and spread out across the cluster
  3. Note that the 3 units are spread across all 3 nodes
  4. Upgrade a single node to fleet v0.8.1 (CoreOS v444)

You will likely find that the unit which had been running on the upgraded node is no longer scheduled anywhere in the cluster, while you would expect it to be running on the node that just upgraded to v444.

So here's what's happening... All three nodes attempt to acquire a lock in etcd. When the cluster was first deployed, one of the nodes acquired this lock and has not released it. While it holds the lock, that node acts as the Engine, offering jobs and accepting bids (i.e. scheduling work). When a node that does not hold the lock is upgraded, it stops participating in the job-offering mechanism. Since it is no longer bidding on any jobs, the Engine will never schedule work back to that machine.
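The engine lock described above is essentially a TTL'd key in etcd acquired with an atomic create-if-absent operation. Here is a minimal sketch of that acquire/renew logic against an in-memory stand-in for etcd; the names (`LeaseStore`, `try_acquire`) are illustrative, not fleet's actual API:

```python
import time

class LeaseStore:
    """In-memory stand-in for etcd: maps key -> (owner, expiry timestamp)."""
    def __init__(self):
        self.leases = {}

    def try_acquire(self, key, owner, ttl, now=None):
        """Atomic create-if-absent (etcd's prevExist=false semantics).

        Returns True if `owner` holds the lease after this call.
        """
        now = time.time() if now is None else now
        holder = self.leases.get(key)
        if holder is not None and holder[1] > now:
            # Lease is live: only the current holder can renew it.
            return holder[0] == owner
        # Lease is absent or expired: acquire it with a fresh TTL.
        self.leases[key] = (owner, now + ttl)
        return True

store = LeaseStore()
assert store.try_acquire("engine-leader", "machine-a", ttl=30, now=0)
# Another machine cannot take the lease while it is live...
assert not store.try_acquire("engine-leader", "machine-b", ttl=30, now=10)
# ...but can acquire it once the TTL lapses without renewal.
assert store.try_acquire("engine-leader", "machine-b", ttl=30, now=40)
```

This illustrates why the lock never moves on its own in the bug above: the old Engine keeps renewing within the TTL, so no other node ever wins the acquire.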

The only workaround right now is to force the lock to transfer ownership to the upgraded machine. From that machine, run `etcdctl rm /_coreos.com/fleet/lease/engine-leader`, then `sudo systemctl restart fleet`, then `etcdctl get /_coreos.com/fleet/lease/engine-leader`. Only once the output of the `etcdctl get` shows the machine-id of the upgraded machine can you move forward with upgrading the other machines.
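The "wait until `etcdctl get` shows the right machine-id" check above can be scripted. A hypothetical helper follows; the reader of the lease key is passed in as a callable (in practice it would shell out to `etcdctl get`, which is assumed rather than shown) so the decision logic itself is self-contained:

```python
def safe_to_continue(read_lease_owner, upgraded_machine_id):
    """Return True once the engine-leader lease is held by the upgraded machine.

    read_lease_owner: zero-arg callable returning the current value of
    /_coreos.com/fleet/lease/engine-leader (e.g. via `etcdctl get`),
    or None if the key does not exist.
    """
    owner = read_lease_owner()
    return owner is not None and owner.strip() == upgraded_machine_id

# Stubbed readers stand in for the real etcdctl call:
assert not safe_to_continue(lambda: "old-machine-id", "new-machine-id")
assert safe_to_continue(lambda: "new-machine-id\n", "new-machine-id")
```

Polling this in a loop before upgrading each remaining node would avoid proceeding while the lease is still held (or missing).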

A better fix is in the works.
