
Multi-node decommissioning does not converge #41561

Closed
george-polevoy opened this issue Oct 14, 2019 · 3 comments
Labels: C-investigation (Further steps needed to qualify. C-label will change.), O-community (Originated from the community)

@george-polevoy

Describe the problem

When decommission commands are started for two nodes at the same time, replica migration off the draining nodes never completes.

To Reproduce

If possible, provide steps to reproduce the behavior:

  1. Set up a 5-node CockroachDB cluster in Docker with a 3-way replication factor.

  2. Insert a few million records, so there are 240 ranges in total.
    Look at your stats and confirm that replicas are evenly distributed.
    Open two terminal windows (see the sketch after this list):
    Terminal 1:
    docker exec -it roach5 ./cockroach quit --decommission --insecure
    Terminal 2:
    docker exec -it roach4 ./cockroach quit --decommission --insecure
    Run both commands simultaneously, so that roach5 is still in the process of decommissioning when you start decommissioning roach4.

  3. Watch the replica count graphs for the nodes.

  4. Replica counts on nodes 4 and 5 drop quickly at first as replicas are moved off, but never reach zero. The graphs appear to converge asymptotically to 19 replicas on each node, and both terminals appear to hang.
    After recommissioning both nodes, the cluster looks healthy, but I cannot decommission any node: every subsequent decommission attempt stalls at the same 19-replica floor.
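
For concreteness, here is a minimal sketch of the setup in steps 1–2, assuming the standard CockroachDB Docker walkthrough; the network name, port mappings, and the data-loading table are illustrative assumptions, not part of the original report:

    # Start a 5-node insecure cluster; container names match the report.
    docker network create roachnet
    docker run -d --name roach1 --hostname roach1 --net roachnet \
      -p 26257:26257 -p 8080:8080 \
      cockroachdb/cockroach:v19.1.5 start --insecure \
      --join=roach1,roach2,roach3,roach4,roach5
    for n in 2 3 4 5; do
      docker run -d --name "roach$n" --hostname "roach$n" --net roachnet \
        cockroachdb/cockroach:v19.1.5 start --insecure \
        --join=roach1,roach2,roach3,roach4,roach5
    done
    # One-time cluster initialization, run against any node.
    docker exec -it roach1 ./cockroach init --insecure

    # Load a few million rows so the cluster splits into a few hundred ranges
    # (hypothetical table; any sufficiently large insert works).
    docker exec -it roach1 ./cockroach sql --insecure -e \
      "CREATE TABLE t AS SELECT g AS id, repeat('x', 100) AS payload FROM generate_series(1, 5000000) AS g;"

    # Then, in two terminals at roughly the same time:
    docker exec -it roach5 ./cockroach quit --decommission --insecure   # terminal 1
    docker exec -it roach4 ./cockroach quit --decommission --insecure   # terminal 2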

Expected behavior
I can start as many decommission jobs at once as I like, as long as the remaining cluster has enough resources and still meets the replication-factor criteria. (Here, decommissioning 2 of 5 nodes leaves 3 nodes, which still satisfies the 3-way replication factor.)

Additional data / screenshots

Environment:

  • CockroachDB version: Docker cockroachdb/cockroach:v19.1.5
  • Server OS: Docker on macOS, 19.03.0-beta3
  • Client app: cockroach node, cockroach quit, terminal
  • Impact: proof of concept for server decommissioning.

[Screenshots: replica count graphs, 2019-10-14 at 16:05 and 16:13]
@ricardocrdb ricardocrdb added this to To do in Support via automation Oct 14, 2019
@ricardocrdb ricardocrdb added O-community Originated from the community C-investigation Further steps needed to qualify. C-label will change. labels Oct 14, 2019
@ricardocrdb ricardocrdb self-assigned this Oct 14, 2019
@mattcrdb

mattcrdb commented Oct 15, 2019

Hi George, this looks like the second decommission is creating a loss of quorum for a particular set of replicas.

With that said, we've introduced a feature called atomic rebalancing in our upcoming 19.2 release that should help with this.

The best practice is to not decommission another node until the first one has completed its process.
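
For reference, a sketch of that sequential approach (the node IDs are assumptions based on the roach4/roach5 container names; double-check the exact flags for your version):

    # Decommission one node at a time, issuing the command from a live node.
    docker exec -it roach1 ./cockroach node decommission 5 --insecure

    # Wait until node 5 reports zero replicas before starting the next one:
    docker exec -it roach1 ./cockroach node status --decommission --insecure

    docker exec -it roach1 ./cockroach node decommission 4 --insecure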

@mattcrdb mattcrdb moved this from To do to Triaging in Support Oct 15, 2019
@ricardocrdb

Hey @george-polevoy, do you have any other questions about this issue? Feel free to let us know.

@ricardocrdb ricardocrdb moved this from Triaging to Pending in Support Nov 12, 2019
@ricardocrdb

Closing due to inactivity. If you are still having the issue, please feel free to respond to this thread. We want to help!

Support automation moved this from Pending to Done Nov 21, 2019