Multi-node decommissioning does not converge #41561
Labels
C-investigation
Further steps needed to qualify. C-label will change.
O-community
Originated from the community
Describe the problem
When decommissioning is started for two nodes at the same time, their replicas are never fully moved to other nodes, so neither decommissioning finishes.
To Reproduce
What did you do? Describe in your own words.
If possible, provide steps to reproduce the behavior:
Set up a CockroachDB cluster of 5 nodes with a 3-way replication factor in Docker.
Insert a few million records, so that there are 240 ranges in total.
Check the stats and confirm that replicas are evenly distributed across the nodes.
Open two terminal windows.
Terminal 1:
docker exec -it roach5 ./cockroach --decommission --insecure
Terminal 2:
docker exec -it roach4 ./cockroach --decommission --insecure
Run these commands concurrently, so that roach5 is still being decommissioned when you start decommissioning roach4.
Watch the replica count graphs for the nodes.
The replica counts on nodes 5 and 4 both start to decrease quickly as replicas are moved away, but they never reach zero. The graph appears to converge asymptotically to 19 replicas on each node, and both terminals appear to hang.
After recommissioning, the cluster looks healthy, but I cannot decommission any node. Every decommissioning attempt gets down to 19 replicas and stops there.
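To tell a real stall apart from slow progress, the replica counts read off the graph can be checked mechanically. The following is only a sketch: `is_stalled` is a hypothetical helper, not part of the `cockroach` CLI, and the sample values are illustrative placeholders rather than measured data (only the plateau of 19 comes from this report).

```shell
# Hypothetical helper: given a window size followed by successive replica-count
# samples for one node, succeed (exit 0) if the last `window` samples are all
# identical and non-zero, i.e. the count has plateaued before reaching zero.
is_stalled() {
  local window=$1
  shift
  # Not enough samples yet to decide.
  [ "$#" -ge "$window" ] || return 1
  # Drop everything except the last `window` samples.
  shift $(( $# - window ))
  local first=$1 sample
  # A count of zero means decommissioning finished, not stalled.
  [ "$first" -gt 0 ] || return 1
  for sample in "$@"; do
    [ "$sample" -eq "$first" ] || return 1
  done
  return 0
}

# Illustrative samples only (not measured), ending at the plateau of 19.
if is_stalled 3 96 60 31 19 19 19; then
  echo "replica count has stalled above zero"
fi
```

The window size is a judgment call: a larger window makes it less likely that a briefly paused rebalance is reported as a stall.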
After re-commissioning every node,
Expected behavior
I can start any number of decommissioning jobs at once, as long as the remaining cluster has enough resources and meets the replication factor criteria.
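The replication factor criterion can be written down as a simple node-count check. This is a minimal sketch, assuming the only requirement is that at least as many nodes remain as there are replicas per range. `can_decommission` is a hypothetical helper, not a `cockroach` command, and it ignores disk capacity and any zone or locality constraints.

```shell
# Hypothetical helper: succeed (exit 0) if the nodes left after decommissioning
# are at least as numerous as the replication factor.
# Arguments: total node count, nodes being decommissioned, replication factor.
can_decommission() {
  local total_nodes=$1 decommissioning=$2 replication_factor=$3
  local remaining=$(( total_nodes - decommissioning ))
  [ "$remaining" -ge "$replication_factor" ]
}

# The setup from this report: 5 nodes, 2 being decommissioned, 3-way replication.
if can_decommission 5 2 3; then
  echo "feasible by node count"
else
  echo "not feasible by node count"
fi
```

By this check the scenario in this report is feasible: three nodes remain for three replicas per range. The check is a necessary condition only, so passing it does not guarantee that rebalancing will finish.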
Additional data / screenshots
Environment:
cockroach node, cockroach quit, terminal
Proof of concept for server decommissioning.