
server: fix node decommissioning itself #61356

Merged
3 commits merged into cockroachdb:master from the decommission-self branch on Mar 4, 2021

Conversation


@erikgrinaker erikgrinaker commented Mar 2, 2021

Previously, if a node was asked to decommission itself, the
decommissioning process would sometimes hang or fail since the node
would become decommissioned and lose RPC access to the rest of the
cluster while it was building the response.

This patch returns an empty response from the
adminServer.Decommission() call when setting the final
DECOMMISSIONED status, thereby avoiding further use of cluster RPC
after decommissioning itself. It also defers self-decommissioning until
the end if multiple nodes are being decommissioned.
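For illustration only, the new ordering and the empty final response amount to something like the sketch below; `decommission`, `NodeID`, and `mark` are hypothetical stand-ins for the actual `adminServer` plumbing:

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// NodeID and the mark callback are stand-ins for the real CockroachDB types
// and RPC plumbing; this sketch only shows the ordering introduced by the fix.
type NodeID int

// decommission marks every target node as decommissioned, handling the local
// node last, so that no further cluster RPCs are needed once the local node
// has lost access.
func decommission(ctx context.Context, self NodeID, targets []NodeID,
	mark func(context.Context, NodeID) error) error {
	// Defer self-decommissioning until the end.
	sort.SliceStable(targets, func(i, j int) bool {
		return targets[i] != self && targets[j] == self
	})
	for _, id := range targets {
		if err := mark(ctx, id); err != nil {
			return err
		}
	}
	// An empty response is returned to the client at this point (omitted in
	// this sketch): building a full status response would require cluster
	// RPCs that the now-decommissioned local node can no longer make.
	return nil
}

func main() {
	targets := []NodeID{1, 3, 2} // node 1 is the node serving the request
	_ = decommission(context.Background(), 1, targets,
		func(_ context.Context, id NodeID) error {
			fmt.Println("decommissioned node", id)
			return nil
		})
	// Prints 3, 2, 1 — the local node goes last.
}
```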

This change is backwards-compatible with old CLI versions, which will
output the now-empty status result set before completing with
"No more data reported on target nodes". The CLI has been updated to
omit the empty response entirely.

Resolves #56718.

A separate commit ensures that errors with `codes.PermissionDenied` abort
RPC retries and are returned immediately, to prevent operations from hanging
in indefinite internal retries when RPC access is lost.

Release justification: bug fixes and low-risk updates to new functionality

Release note (bug fix): Fixed a bug from 21.1-alpha where a node
decommissioning process could sometimes hang or fail when the
decommission request was submitted via the node being decommissioned.

@erikgrinaker erikgrinaker added the A-kv-server Relating to the KV-level RPC server label Mar 2, 2021
@erikgrinaker erikgrinaker requested a review from tbg March 2, 2021 18:28
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 2, 2021 18:28
@erikgrinaker erikgrinaker self-assigned this Mar 2, 2021
@cockroach-teamcity

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the decommission-self branch 4 times, most recently from e02e899 to fe702c0 on March 3, 2021 10:39

`IsAuthenticationError` is often used to return RPC errors immediately
instead of retrying them. Previously, this only considered
`codes.Unauthenticated` but not `codes.PermissionDenied`. This could
cause various operations to run indefinitely due to internal RPC
retries, e.g. after a node has been decommissioned and lost access to
the cluster.

Since permission errors should also return immediately, this patch
changes the function to handle `codes.PermissionDenied` as well.
However, since a missing permission is an authorization failure rather
than an authentication failure, the function was renamed to
`IsAuthError` such that it can cover both authn and authz failures.
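A minimal sketch of what the renamed helper amounts to, assuming the gRPC code is extracted via `google.golang.org/grpc/status` (the real helper in the codebase also handles wrapped errors):

```go
package grpcutil

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// IsAuthError reports whether err is an authentication (Unauthenticated) or
// authorization (PermissionDenied) failure. Callers use it to abort internal
// RPC retry loops immediately instead of retrying forever, e.g. after the
// local node has been decommissioned and lost cluster access.
func IsAuthError(err error) bool {
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unauthenticated, codes.PermissionDenied:
			return true
		}
	}
	return false
}
```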

Release justification: low risk, high benefit changes to existing functionality

Release note (bug fix): Operations no longer hang when a node loses access
to cluster RPC (e.g. after it has been decommissioned); they now return an
error immediately.

The test `TestDecommissionNodeStatus` was initially placed in a separate
file `decommission_test.go`, even though it belonged in
`server_test.go`. This was done to avoid import cycles between the
`testcluster` package and the `server` package, as the
`decommission_test.go` file could use the `server_test` package instead.

This commit creates the cluster with `serverutils.StartNewTestCluster()`
instead of `testcluster.StartTestCluster()`, which avoids the import
cycle and allows it to be moved into `server_test.go`. This required
exposing `Server.Decommission()` via `serverutils.TestServerInterface`
as well.
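Roughly, this lets the test be written against `serverutils` alone, along the lines of the sketch below; the exact `Decommission` signature exposed on `TestServerInterface` is an assumption here, shown for illustration:

```go
import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/livenesspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
	"github.com/stretchr/testify/require"
)

func TestDecommissionNodeStatus(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()

	// serverutils.StartNewTestCluster avoids importing the testcluster
	// package from server_test.go, which would create an import cycle.
	tc := serverutils.StartNewTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	// Decommission node 3 via the method newly exposed on
	// TestServerInterface (signature assumed for illustration).
	decomNodeID := tc.Server(2).NodeID()
	require.NoError(t, tc.Server(0).Decommission(
		ctx, livenesspb.MembershipStatus_DECOMMISSIONING, []roachpb.NodeID{decomNodeID}))
}
```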

Release justification: non-production code changes
Release note: None

@knz knz left a comment


:lgtm: nice thanks

Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 12 of 12 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@erikgrinaker erikgrinaker force-pushed the decommission-self branch 3 times, most recently from 14d3b88 to 1aa1de3 on March 3, 2021 15:20

@tbg tbg left a comment


We never released this bug as part of a major release (it is in the 21.1 alphas), so I don't think it needs a release note? Let's keep the release note but add explicitly that this problem was only present in 21.1 alpha releases, and not in 20.2 and earlier.

:lgtm: very nice work.

Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 11 of 12 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)


pkg/server/admin_test.go, line 2017 at r4 (raw file):

func TestDecommissionSelf(t *testing.T) {
	skip.UnderStress(t) // can't handle 7-node clusters

Was this one necessary as well? I think we run other 7node tests under stress. The one to really avoid is race, which you do below. Check out

func TestMergeQueueSeesNonVoters(t *testing.T) {
for an example of a test that also runs under stress and seems to be doing ok (it's even run under stressrace, which, I don't know, bad idea honestly)

This isn't to say that you can't blow your local machine - try make stress STRESSFLAGS='-p 4' to limit the concurrency. This is also what we do in CI when we stress packages.


pkg/testutils/testcluster/testcluster.go, line 1427 at r4 (raw file):

	ctx context.Context, t testing.TB, serverIdx int,
) *grpc.ClientConn {
	srv := tc.Server(serverIdx)

Don't you just want srv.GRPCDialRaw(srv.RPCAddr())?


@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/server/admin_test.go, line 2017 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Was this one necessary as well? I think we run other 7node tests under stress. The one to really avoid is race, which you do below. Check out

func TestMergeQueueSeesNonVoters(t *testing.T) {
for an example of a test that also runs under stress and seems to be doing ok (it's even run under stressrace, which, I don't know, bad idea honestly)

This isn't to say that you can't blow your local machine - try make stress STRESSFLAGS='-p 4' to limit the concurrency. This is also what we do in CI when we stress packages.

You're right, stress runs just fine. 👍


pkg/testutils/testcluster/testcluster.go, line 1427 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't you just want srv.GRPCDialRaw(srv.RPCAddr())?

Maybe, but turns out we no longer need this method at all, since we're not testing error propagation anymore.

@erikgrinaker erikgrinaker force-pushed the decommission-self branch 4 times, most recently from 45d58fb to a55267d on March 3, 2021 18:54
@erikgrinaker

bors r=knz,tbg

@erikgrinaker

bors r-

Seeing a CI hang in TestDecommissionedNodeCannotConnect, investigating.

@craig

craig bot commented Mar 3, 2021

Canceled.

@erikgrinaker

bors r=knz,tbg

@craig

craig bot commented Mar 4, 2021

Build succeeded:

@craig craig bot merged commit 0cea9dc into cockroachdb:master Mar 4, 2021
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this pull request Mar 4, 2021
In cockroachdb#61356, `TestDecommissionedNodeCannotConnect` was extended to check
error propagation for a `Scan` operation after a node was decommissioned.
However, because of cockroachdb#61470, not all failure modes propagate
errors properly, causing the test to sometimes hang. This patch disables
this check until these problems are addressed.

Release justification: non-production code changes
Release note: None
craig bot pushed a commit that referenced this pull request Mar 4, 2021
59288: rfc: declarative, state-based schema changer r=lucy-zhang a=lucy-zhang

https://github.com/lucy-zhang/cockroach/blob/schema-changer-rfc/docs/RFCS/20210115_new_schema_changer.md

Release note: none

61459: tracing: cap registry size at 5k r=irfansharif a=tbg

This caps the size of the span registry at 5k, evicting "random" (map
order) existing entries when at the limit.
The purpose of this is to serve as a guardrail against leaked spans,
which would otherwise lead to unbounded memory growth.
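For the tracing change (#61459), the guardrail is essentially a size-capped map that evicts an arbitrary entry when full; a self-contained sketch of the idea, not the actual registry code:

```go
package main

import (
	"fmt"
	"sync"
)

const maxRegistrySize = 5000

type span struct{ /* fields elided */ }

// registry is an illustrative size-capped span registry. When the cap is
// reached, one arbitrary existing entry is evicted; Go map iteration order
// is unspecified, which gives the "random (map order)" eviction described
// in the commit message above.
type registry struct {
	mu    sync.Mutex
	spans map[uint64]*span
}

func (r *registry) add(id uint64, s *span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.spans == nil {
		r.spans = make(map[uint64]*span)
	}
	if len(r.spans) >= maxRegistrySize {
		for evictID := range r.spans {
			delete(r.spans, evictID) // evict one arbitrary entry
			break
		}
	}
	r.spans[id] = s
}

func main() {
	r := &registry{}
	for i := uint64(0); i < maxRegistrySize+10; i++ {
		r.add(i, &span{})
	}
	fmt.Println(len(r.spans)) // stays at the cap: 5000
}
```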

Touches #59188.

Release justification: low risk, high benefit changes to existing
functionality
Release note: None


61471: server: disable flaky check in TestDecommissionedNodeCannotConnect r=tbg a=erikgrinaker

In #61356, `TestDecommissionedNodeCannotConnect` was extended to check
error propagation for a `Scan` operation after a node was decommissioned.
However, because of #61470, not all failure modes propagate errors
properly, causing the test to sometimes hang. This patch disables this
check until these problems are addressed.

Resolves #61452.

Release justification: non-production code changes
Release note: None

Co-authored-by: Lucy Zhang <lucy@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@erikgrinaker erikgrinaker deleted the decommission-self branch March 10, 2021 13:04