This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Poor experience when upgrading from v0.6.2 past v0.7.0 #924

@bcwaldon

Description

  1. Install a 3-node cluster running fleet v0.6.2 (CoreOS v410)
  2. Start 3 units that all conflict and spread out across the cluster
  3. Note that the 3 units are spread across all 3 nodes
  4. Upgrade a single node to fleet v0.8.1 (CoreOS v444)

You will likely find that the unit which had been running on the upgraded node is no longer scheduled anywhere in the cluster, while you would expect it to be running on the node that just upgraded to v444.

So here's what's happening... All three nodes attempt to acquire a lock in etcd. When the cluster was first deployed, one of the nodes acquired this lock and has not released it. While it holds the lock, that node acts as the Engine, offering jobs and accepting bids (i.e. scheduling work). When a node that does not hold the lock is upgraded, it stops participating in the job-offering mechanism. Since it is no longer bidding on any jobs, the Engine will never schedule work back to that machine.
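The engine lock described above is essentially a TTL'd key in etcd acquired with an atomic create-if-absent operation. Here is a minimal sketch of that acquire/renew logic against an in-memory stand-in for etcd; the names (`LeaseStore`, `try_acquire`) are illustrative, not fleet's actual API:

```python
import time

class LeaseStore:
    """In-memory stand-in for etcd: maps key -> (owner, expiry timestamp)."""
    def __init__(self):
        self.leases = {}

    def try_acquire(self, key, owner, ttl, now=None):
        """Atomic create-if-absent (etcd's prevExist=false semantics).

        Returns True if `owner` holds the lease after this call.
        """
        now = time.time() if now is None else now
        holder = self.leases.get(key)
        if holder is not None and holder[1] > now:
            # Lease is live: only the current holder can renew it.
            return holder[0] == owner
        # Lease is absent or expired: acquire it with a fresh TTL.
        self.leases[key] = (owner, now + ttl)
        return True

store = LeaseStore()
assert store.try_acquire("engine-leader", "machine-a", ttl=30, now=0)
# Another machine cannot take the lease while it is live...
assert not store.try_acquire("engine-leader", "machine-b", ttl=30, now=10)
# ...but can acquire it once the TTL lapses without renewal.
assert store.try_acquire("engine-leader", "machine-b", ttl=30, now=40)
```

This illustrates why the lock never moves on its own in the bug above: the old Engine keeps renewing within the TTL, so no other node ever wins the acquire.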

The only workaround right now is to force the lock to transfer ownership to the upgraded machine. From that machine, run `etcdctl rm /_coreos.com/fleet/lease/engine-leader`, then `sudo systemctl restart fleet`, then `etcdctl get /_coreos.com/fleet/lease/engine-leader`. Only once the output of the `etcdctl get` shows the machine-id of the upgraded machine can you move forward with upgrading the other machines.
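The "wait until `etcdctl get` shows the right machine-id" check above can be scripted. A hypothetical helper follows; the reader of the lease key is passed in as a callable (in practice it would shell out to `etcdctl get`, which is assumed rather than shown) so the decision logic itself is self-contained:

```python
def safe_to_continue(read_lease_owner, upgraded_machine_id):
    """Return True once the engine-leader lease is held by the upgraded machine.

    read_lease_owner: zero-arg callable returning the current value of
    /_coreos.com/fleet/lease/engine-leader (e.g. via `etcdctl get`),
    or None if the key does not exist.
    """
    owner = read_lease_owner()
    return owner is not None and owner.strip() == upgraded_machine_id

# Stubbed readers stand in for the real etcdctl call:
assert not safe_to_continue(lambda: "old-machine-id", "new-machine-id")
assert safe_to_continue(lambda: "new-machine-id\n", "new-machine-id")
```

Polling this in a loop before upgrading each remaining node would avoid proceeding while the lease is still held (or missing).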

A better fix is in the works.
