
server: fix node decommissioning itself #61356

Merged
3 commits merged into cockroachdb:master from the decommission-self branch on Mar 4, 2021

Conversation


@erikgrinaker erikgrinaker commented Mar 2, 2021

Previously, if a node was asked to decommission itself, the
decommissioning process would sometimes hang or fail since the node
would become decommissioned and lose RPC access to the rest of the
cluster while it was building the response.

This patch returns an empty response from the
adminServer.Decommission() call when setting the final
DECOMMISSIONED status, thereby avoiding further use of cluster RPC
after decommissioning itself. It also defers self-decommissioning until
the end if multiple nodes are being decommissioned.
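For illustration only, the new ordering and the empty final response amount to something like the sketch below; `decommission`, `NodeID`, and `mark` are hypothetical stand-ins for the actual `adminServer` plumbing:

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// NodeID and the mark callback are stand-ins for the real CockroachDB types
// and RPC plumbing; this sketch only shows the ordering introduced by the fix.
type NodeID int

// decommission marks every target node as decommissioned, handling the local
// node last, so that no further cluster RPCs are needed once the local node
// has lost access.
func decommission(ctx context.Context, self NodeID, targets []NodeID,
	mark func(context.Context, NodeID) error) error {
	// Defer self-decommissioning until the end.
	sort.SliceStable(targets, func(i, j int) bool {
		return targets[i] != self && targets[j] == self
	})
	for _, id := range targets {
		if err := mark(ctx, id); err != nil {
			return err
		}
	}
	// An empty response is returned to the client at this point (omitted in
	// this sketch): building a full status response would require cluster
	// RPCs that the now-decommissioned local node can no longer make.
	return nil
}

func main() {
	targets := []NodeID{1, 3, 2} // node 1 is the node serving the request
	_ = decommission(context.Background(), 1, targets,
		func(_ context.Context, id NodeID) error {
			fmt.Println("decommissioned node", id)
			return nil
		})
	// Prints 3, 2, 1 — the local node goes last.
}
```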

This change is backwards-compatible with old CLI versions, which will
output the now-empty status result set before completing with
"No more data reported on target nodes". The CLI has been updated to
omit the empty response entirely.

Resolves #56718.

A separate commit ensures that errors with `codes.PermissionDenied` abort
RPC retries and are returned immediately, to prevent operations from hanging
in indefinite internal retries when RPC access is lost.

Release justification: bug fixes and low-risk updates to new functionality

Release note (bug fix): Fixed a bug from 21.1-alpha where a node
decommissioning process could sometimes hang or fail when the
decommission request was submitted via the node being decommissioned.

@erikgrinaker erikgrinaker added the A-kv-server Relating to the KV-level RPC server label Mar 2, 2021
@erikgrinaker erikgrinaker requested a review from tbg March 2, 2021 18:28
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 2, 2021 18:28
@erikgrinaker erikgrinaker self-assigned this Mar 2, 2021
@cockroach-teamcity

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the decommission-self branch 4 times, most recently from e02e899 to fe702c0 on March 3, 2021 10:39

`IsAuthenticationError` is often used to return RPC errors immediately
instead of retrying them. Previously, this only considered
`codes.Unauthenticated` but not `codes.PermissionDenied`. This could
cause various operations to run indefinitely due to internal RPC
retries, e.g. after a node has been decommissioned and lost access to
the cluster.

Since permission errors should also return immediately, this patch
changes the function to handle `codes.PermissionDenied` as well.
However, since a missing permission is an authorization failure rather
than an authentication failure, the function was renamed to
`IsAuthError` such that it can cover both authn and authz failures.
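A minimal sketch of what the renamed helper amounts to, assuming the gRPC code is extracted via `google.golang.org/grpc/status` (the real helper in the codebase also handles wrapped errors):

```go
package grpcutil

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// IsAuthError reports whether err is an authentication (Unauthenticated) or
// authorization (PermissionDenied) failure. Callers use it to abort internal
// RPC retry loops immediately instead of retrying forever, e.g. after the
// local node has been decommissioned and lost cluster access.
func IsAuthError(err error) bool {
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unauthenticated, codes.PermissionDenied:
			return true
		}
	}
	return false
}
```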

Release justification: low risk, high benefit changes to existing functionality

Release note (bug fix): Operations no longer hang when a node loses access
to cluster RPC (e.g. after it has been decommissioned); they now return an
error immediately.

The test `TestDecommissionNodeStatus` was initially placed in a separate
file `decommission_test.go`, even though it belonged in
`server_test.go`. This was done to avoid import cycles between the
`testcluster` package and the `server` package, as the
`decommission_test.go` file could use the `server_test` package instead.

This commit creates the cluster with `serverutils.StartNewTestCluster()`
instead of `testcluster.StartTestCluster()`, which avoids the import
cycle and allows it to be moved into `server_test.go`. This required
exposing `Server.Decommission()` via `serverutils.TestServerInterface`
as well.
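Roughly, this lets the test be written against `serverutils` alone, along the lines of the sketch below; the exact `Decommission` signature exposed on `TestServerInterface` is an assumption here, shown for illustration:

```go
import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/livenesspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
	"github.com/stretchr/testify/require"
)

func TestDecommissionNodeStatus(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()

	// serverutils.StartNewTestCluster avoids importing the testcluster
	// package from server_test.go, which would create an import cycle.
	tc := serverutils.StartNewTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	// Decommission node 3 via the method newly exposed on
	// TestServerInterface (signature assumed for illustration).
	decomNodeID := tc.Server(2).NodeID()
	require.NoError(t, tc.Server(0).Decommission(
		ctx, livenesspb.MembershipStatus_DECOMMISSIONING, []roachpb.NodeID{decomNodeID}))
}
```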

Release justification: non-production code changes
Release note: None

@knz knz left a comment


:lgtm: nice thanks

Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 12 of 12 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@erikgrinaker erikgrinaker force-pushed the decommission-self branch 3 times, most recently from 14d3b88 to 1aa1de3 on March 3, 2021 15:20

@tbg tbg left a comment


We never released this bug as part of a major release (it is in the 21.1 alphas), so I don't think it needs a release note? Let's keep the release note but add explicitly that this problem was only present in 21.1 alpha releases, and not in 20.2 and earlier.

:lgtm: very nice work.

Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 11 of 12 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)


pkg/server/admin_test.go, line 2017 at r4 (raw file):

func TestDecommissionSelf(t *testing.T) {
	skip.UnderStress(t) // can't handle 7-node clusters

Was this one necessary as well? I think we run other 7node tests under stress. The one to really avoid is race, which you do below. Check out

func TestMergeQueueSeesNonVoters(t *testing.T) {
for an example of a test that also runs under stress and seems to be doing ok (it's even run under stressrace, which, I don't know, bad idea honestly)

This isn't to say that you can't blow your local machine - try make stress STRESSFLAGS='-p 4' to limit the concurrency. This is also what we do in CI when we stress packages.


pkg/testutils/testcluster/testcluster.go, line 1427 at r4 (raw file):

	ctx context.Context, t testing.TB, serverIdx int,
) *grpc.ClientConn {
	srv := tc.Server(serverIdx)

Don't you just want srv.GRPCDialRaw(srv.RPCAddr())?


@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/server/admin_test.go, line 2017 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Was this one necessary as well? I think we run other 7node tests under stress. The one to really avoid is race, which you do below. Check out

func TestMergeQueueSeesNonVoters(t *testing.T) {
for an example of a test that also runs under stress and seems to be doing ok (it's even run under stressrace, which, I don't know, bad idea honestly)

This isn't to say that you can't blow your local machine - try make stress STRESSFLAGS='-p 4' to limit the concurrency. This is also what we do in CI when we stress packages.

You're right, stress runs just fine. 👍


pkg/testutils/testcluster/testcluster.go, line 1427 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't you just want srv.GRPCDialRaw(srv.RPCAddr())?

Maybe, but turns out we no longer need this method at all, since we're not testing error propagation anymore.

@erikgrinaker erikgrinaker force-pushed the decommission-self branch 4 times, most recently from 45d58fb to a55267d on March 3, 2021 18:54
@erikgrinaker

bors r=knz,tbg

@erikgrinaker

bors r-

Seeing a CI hang in TestDecommissionedNodeCannotConnect, investigating.

@craig

craig bot commented Mar 3, 2021

Canceled.

@erikgrinaker

bors r=knz,tbg

@craig

craig bot commented Mar 4, 2021

Build succeeded:

@craig craig bot merged commit 0cea9dc into cockroachdb:master Mar 4, 2021
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this pull request Mar 4, 2021
In cockroachdb#61356, `TestDecommissionedNodeCannotConnect` was extended to check
error propagation for a `Scan` operation after a node was decommissioned.
However, because of cockroachdb#61470, not all failure modes propagate
errors properly, causing the test to sometimes hang. This patch disables
this check until these problems are addressed.

Release justification: non-production code changes
Release note: None
craig bot pushed a commit that referenced this pull request Mar 4, 2021
59288: rfc: declarative, state-based schema changer r=lucy-zhang a=lucy-zhang

https://github.com/lucy-zhang/cockroach/blob/schema-changer-rfc/docs/RFCS/20210115_new_schema_changer.md

Release note: none

61459: tracing: cap registry size at 5k r=irfansharif a=tbg

This caps the size of the span registry at 5k, evicting "random" (map
order) existing entries when at the limit.
The purpose of this is to serve as a guardrail against leaked spans,
which would otherwise lead to unbounded memory growth.
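For the tracing change (#61459), the guardrail is essentially a size-capped map that evicts an arbitrary entry when full; a self-contained sketch of the idea, not the actual registry code:

```go
package main

import (
	"fmt"
	"sync"
)

const maxRegistrySize = 5000

type span struct{ /* fields elided */ }

// registry is an illustrative size-capped span registry. When the cap is
// reached, one arbitrary existing entry is evicted; Go map iteration order
// is unspecified, which gives the "random (map order)" eviction described
// in the commit message above.
type registry struct {
	mu    sync.Mutex
	spans map[uint64]*span
}

func (r *registry) add(id uint64, s *span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.spans == nil {
		r.spans = make(map[uint64]*span)
	}
	if len(r.spans) >= maxRegistrySize {
		for evictID := range r.spans {
			delete(r.spans, evictID) // evict one arbitrary entry
			break
		}
	}
	r.spans[id] = s
}

func main() {
	r := &registry{}
	for i := uint64(0); i < maxRegistrySize+10; i++ {
		r.add(i, &span{})
	}
	fmt.Println(len(r.spans)) // stays at the cap: 5000
}
```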

Touches #59188.

Release justification: low risk, high benefit changes to existing
functionality
Release note: None


61471: server: disable flaky check in TestDecommissionedNodeCannotConnect r=tbg a=erikgrinaker

In #61356, `TestDecommissionedNodeCannotConnect` was extended to check
error propagation for a `Scan` operation after a node was decommissioned.
However, because of #61470, not all failure modes propagate errors
properly, causing the test to sometimes hang. This patch disables this
check until these problems are addressed.

Resolves #61452.

Release justification: non-production code changes
Release note: None

Co-authored-by: Lucy Zhang <lucy@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@erikgrinaker erikgrinaker deleted the decommission-self branch March 10, 2021 13:04