
Multi-node decommissioning does not converge #41561

Closed
george-polevoy opened this issue Oct 14, 2019 · 3 comments
Labels: C-investigation (Further steps needed to qualify. C-label will change.), O-community (Originated from the community)

@george-polevoy

Describe the problem

When decommission commands are started for two nodes at the same time, replica migration off the draining nodes never completes.

To Reproduce

If possible, provide steps to reproduce the behavior:

  1. Set up a 5-node CockroachDB cluster in Docker with a 3-way replication factor.

  2. Insert a few million records, so there are 240 ranges in total.
    Look at your stats and confirm that replicas are evenly distributed.
    Open two terminal windows (see the sketch after this list):
    Terminal 1:
    docker exec -it roach5 ./cockroach quit --decommission --insecure
    Terminal 2:
    docker exec -it roach4 ./cockroach quit --decommission --insecure
    Run both commands simultaneously, so that roach5 is still in the process of decommissioning when you start decommissioning roach4.

  3. Watch the replica count graphs for the nodes.

  4. Replica counts on nodes 4 and 5 drop quickly at first as replicas are moved off, but never reach zero. The graphs appear to converge asymptotically to 19 replicas on each node, and both terminals appear to hang.
    After recommissioning both nodes, the cluster looks healthy, but I cannot decommission any node: every subsequent decommission attempt stalls at the same 19-replica floor.
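
For concreteness, here is a minimal sketch of the setup in steps 1–2, assuming the standard CockroachDB Docker walkthrough; the network name, port mappings, and the data-loading table are illustrative assumptions, not part of the original report:

    # Start a 5-node insecure cluster; container names match the report.
    docker network create roachnet
    docker run -d --name roach1 --hostname roach1 --net roachnet \
      -p 26257:26257 -p 8080:8080 \
      cockroachdb/cockroach:v19.1.5 start --insecure \
      --join=roach1,roach2,roach3,roach4,roach5
    for n in 2 3 4 5; do
      docker run -d --name "roach$n" --hostname "roach$n" --net roachnet \
        cockroachdb/cockroach:v19.1.5 start --insecure \
        --join=roach1,roach2,roach3,roach4,roach5
    done
    # One-time cluster initialization, run against any node.
    docker exec -it roach1 ./cockroach init --insecure

    # Load a few million rows so the cluster splits into a few hundred ranges
    # (hypothetical table; any sufficiently large insert works).
    docker exec -it roach1 ./cockroach sql --insecure -e \
      "CREATE TABLE t AS SELECT g AS id, repeat('x', 100) AS payload FROM generate_series(1, 5000000) AS g;"

    # Then, in two terminals at roughly the same time:
    docker exec -it roach5 ./cockroach quit --decommission --insecure   # terminal 1
    docker exec -it roach4 ./cockroach quit --decommission --insecure   # terminal 2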

Expected behavior
I can start as many decommission jobs at once as I like, as long as the remaining cluster has enough resources and still meets the replication-factor criteria. (Here, decommissioning 2 of 5 nodes leaves 3 nodes, which still satisfies the 3-way replication factor.)

Additional data / screenshots

Environment:

  • CockroachDB version: Docker cockroachdb/cockroach:v19.1.5
  • Server OS: Docker on macOS, 19.03.0-beta3
  • Client app: cockroach node, cockroach quit, terminal
  • Impact: proof of concept for server decommissioning.

[Screenshots: replica count graphs, 2019-10-14 at 16:05 and 16:13]
@ricardocrdb ricardocrdb added this to To do in Support via automation Oct 14, 2019
@ricardocrdb ricardocrdb added O-community Originated from the community C-investigation Further steps needed to qualify. C-label will change. labels Oct 14, 2019
@ricardocrdb ricardocrdb self-assigned this Oct 14, 2019
@mattcrdb

mattcrdb commented Oct 15, 2019

Hi George, this looks like the second decommission is creating a loss of quorum for a particular set of replicas.

With that said, we've introduced a feature called atomic rebalancing in our upcoming 19.2 release that should help with this.

The best practice is to not decommission another node until the first one has completed its process.
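
For reference, a sketch of that sequential approach (the node IDs are assumptions based on the roach4/roach5 container names; double-check the exact flags for your version):

    # Decommission one node at a time, issuing the command from a live node.
    docker exec -it roach1 ./cockroach node decommission 5 --insecure

    # Wait until node 5 reports zero replicas before starting the next one:
    docker exec -it roach1 ./cockroach node status --decommission --insecure

    docker exec -it roach1 ./cockroach node decommission 4 --insecure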

@mattcrdb mattcrdb moved this from To do to Triaging in Support Oct 15, 2019
@ricardocrdb

Hey @george-polevoy, do you have any other questions about this issue? Feel free to let us know.

@ricardocrdb ricardocrdb moved this from Triaging to Pending in Support Nov 12, 2019
@ricardocrdb

Closing due to inactivity. If you are still having the issue, please feel free to respond to this thread. We want to help!

Support automation moved this from Pending to Done Nov 21, 2019