server: fix node decommissioning itself #61356
Conversation
Force-pushed from e02e899 to fe702c0.
`IsAuthenticationError` is often used to return RPC errors immediately instead of retrying them. Previously, it only considered `codes.Unauthenticated` but not `codes.PermissionDenied`. This could cause various operations to run indefinitely due to internal RPC retries, e.g. after a node has been decommissioned and lost access to the cluster.

Since permission errors should also be returned immediately, this patch changes the function to handle `codes.PermissionDenied` as well. However, since a missing permission is an authorization failure rather than an authentication failure, the function has been renamed to `IsAuthError` so that it covers both authn and authz failures.

Release justification: low risk, high benefit changes to existing functionality

Release note (bug fix): Fixed operations hanging when a node loses access to cluster RPC (e.g. after it has been decommissioned); they now return an error immediately instead.
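A minimal sketch of the check described above, assuming the standard grpc-go `status`/`codes` packages; the function name follows the commit message, but the body here is illustrative rather than the actual implementation:

```go
package grpcutil

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// IsAuthError reports whether err carries an authentication
// (Unauthenticated) or authorization (PermissionDenied) gRPC status code,
// in which case callers should fail fast instead of retrying the RPC.
func IsAuthError(err error) bool {
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Unauthenticated, codes.PermissionDenied:
			return true
		}
	}
	return false
}
```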
Force-pushed from fe702c0 to 1aa3377.
The test `TestDecommissionNodeStatus` was initially placed in a separate file, `decommission_test.go`, even though it belonged in `server_test.go`. This was done to avoid an import cycle between the `testcluster` package and the `server` package, since `decommission_test.go` could use the `server_test` package instead.

This commit creates the cluster with `serverutils.StartNewTestCluster()` instead of `testcluster.StartTestCluster()`, which avoids the import cycle and allows the test to be moved into `server_test.go`. This required exposing `Server.Decommission()` via `serverutils.TestServerInterface` as well.

Release justification: non-production code changes

Release note: None
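For illustration, roughly what the relocated test setup could look like. The exact `Decommission()` signature on `serverutils.TestServerInterface` and the test body are assumptions based on the commit message, not the actual code:

```go
package server_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/livenesspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
	"github.com/stretchr/testify/require"
)

// Sketch only: the Decommission signature below is assumed from the commit
// message, not copied from the real TestServerInterface.
func TestDecommissionNodeStatusSketch(t *testing.T) {
	ctx := context.Background()

	// StartNewTestCluster lives in serverutils, so the test can stay in the
	// server_test package without importing testcluster (no import cycle).
	tc := serverutils.StartNewTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	// Ask node 1 to mark node 3 as decommissioning via the exposed method.
	err := tc.Server(0).Decommission(ctx,
		livenesspb.MembershipStatus_DECOMMISSIONING,
		[]roachpb.NodeID{tc.Server(2).NodeID()})
	require.NoError(t, err)
}
```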
Force-pushed from 1aa3377 to 156a011.
Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 12 of 12 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
Force-pushed from 14d3b88 to 1aa1de3.
We never released this bug as part of a major release (it is in the 21.1 alphas), so I don't think it needs a release note? Let's keep the release note but add explicitly that this problem was only present in 21.1 alpha releases, and not in 20.2 and earlier.
Reviewed 7 of 7 files at r1, 5 of 5 files at r2, 11 of 12 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)
pkg/server/admin_test.go, line 2017 at r4 (raw file):
func TestDecommissionSelf(t *testing.T) {
	skip.UnderStress(t) // can't handle 7-node clusters
Was this one necessary as well? I think we run other 7-node tests under stress. The one to really avoid is race, which you do below. Check out
cockroach/pkg/kv/kvserver/client_merge_test.go, line 4110 in 9fbbb1a:
func TestMergeQueueSeesNonVoters(t *testing.T) {
for an example of a test that also runs under stress and seems to be doing ok (it's even run under stressrace, which, I don't know, bad idea honestly).
This isn't to say that you can't blow your local machine - try `make stress STRESSFLAGS='-p 4'` to limit the concurrency. This is also what we do in CI when we stress packages.
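A sketch of the skip pattern being suggested, keeping only the race guard and assuming the `pkg/testutils/skip` helpers; the test body is omitted:

```go
package server_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestDecommissionSelfSketch(t *testing.T) {
	// Skipping under race is the important guard for a 7-node cluster;
	// plain stress runs are expected to cope (see the discussion above),
	// and `make stress STRESSFLAGS='-p 4'` limits local concurrency.
	skip.UnderRace(t, "7-node cluster is too slow under race")
	// ... test body ...
}
```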
pkg/testutils/testcluster/testcluster.go, line 1427 at r4 (raw file):
	ctx context.Context, t testing.TB, serverIdx int,
) *grpc.ClientConn {
	srv := tc.Server(serverIdx)
Don't you just want `srv.GRPCDialRaw(srv.RPCAddr())`?
Force-pushed from 1aa1de3 to 6c69b25.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/server/admin_test.go, line 2017 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Was this one necessary as well? I think we run other 7-node tests under stress. The one to really avoid is race, which you do below. Check out cockroach/pkg/kv/kvserver/client_merge_test.go, line 4110 in 9fbbb1a (`func TestMergeQueueSeesNonVoters(t *testing.T) {`), for an example of a test that also runs under stress and seems to be doing ok (it's even run under stressrace, which, I don't know, bad idea honestly). This isn't to say that you can't blow your local machine - try `make stress STRESSFLAGS='-p 4'` to limit the concurrency. This is also what we do in CI when we stress packages.
You're right, stress runs just fine. 👍
pkg/testutils/testcluster/testcluster.go, line 1427 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Don't you just want `srv.GRPCDialRaw(srv.RPCAddr())`?
Maybe, but turns out we no longer need this method at all, since we're not testing error propagation anymore.
Force-pushed from 45d58fb to a55267d.
bors r=knz,tbg
bors r-

Seeing a CI hang in …
Canceled.
Force-pushed from a55267d to 516beb8.
Previously, if a node was asked to decommission itself, the decommissioning process would sometimes hang or fail since the node would become decommissioned and lose RPC access to the rest of the cluster while it was building the response.

This patch returns an empty response from the `adminServer.Decommission()` call when setting the final `DECOMMISSIONED` status, thereby avoiding further use of cluster RPC after decommissioning itself. It also defers self-decommissioning until the end if multiple nodes are being decommissioned.

This change is backwards-compatible with old CLI versions, which will simply output the now-empty status result set before completing with "No more data reported on target nodes". The CLI has been updated to omit the empty response.

Release justification: bug fixes and low-risk updates to new functionality

Release note (bug fix): Fixed a bug from 21.1-alpha where a node decommissioning process could sometimes hang or fail when the decommission request was submitted via the node being decommissioned.
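A simplified sketch of the "self last" ordering and empty-response behavior described above. The function shape, the `setStatus` callback, and the names are assumptions for illustration, not the real `adminServer.Decommission()` signature:

```go
package server

import (
	"context"
	"sort"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/livenesspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// decommissionSelfLast applies the target status to each node, saving the
// local node for last so that this node retains RPC access to the cluster
// while the other nodes are processed. Purely illustrative.
func decommissionSelfLast(
	ctx context.Context,
	self roachpb.NodeID,
	nodeIDs []roachpb.NodeID,
	setStatus func(context.Context, roachpb.NodeID, livenesspb.MembershipStatus) error,
) error {
	// Stable-sort the targets so that the local node moves to the end.
	sort.SliceStable(nodeIDs, func(i, j int) bool {
		return nodeIDs[i] != self && nodeIDs[j] == self
	})
	for _, id := range nodeIDs {
		if err := setStatus(ctx, id, livenesspb.MembershipStatus_DECOMMISSIONED); err != nil {
			return err
		}
	}
	// When the final target status is DECOMMISSIONED, the RPC described above
	// returns an empty response rather than re-reading node statuses, since
	// the local node may have just lost RPC access to the rest of the cluster.
	return nil
}
```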
Force-pushed from 516beb8 to badefde.
bors r=knz,tbg
Build succeeded:
In cockroachdb#61356, `TestDecommissionedNodeCannotConnect` was extended to check error propagation for a `Scan` operation after a node was decommissioned. However, because of cockroachdb#61470 not all failure modes propagate errors properly, causing the test to sometimes hang. This patch disables this check until these problems are addressed.

Release justification: non-production code changes

Release note: None
59288: rfc: declarative, state-based schema changer r=lucy-zhang a=lucy-zhang

https://github.com/lucy-zhang/cockroach/blob/schema-changer-rfc/docs/RFCS/20210115_new_schema_changer.md

Release note: none

61459: tracing: cap registry size at 5k r=irfansharif a=tbg

This caps the size of the span registry at 5k, evicting "random" (map order) existing entries when at the limit. The purpose of this is to serve as a guardrail against leaked spans, which would otherwise lead to unbounded memory growth.

Touches #59188.

Release justification: low risk, high benefit changes to existing functionality

Release note: None

61471: server: disable flaky check in TestDecommissionedNodeCannotConnect r=tbg a=erikgrinaker

In #61356, `TestDecommissionedNodeCannotConnect` was extended to check error propagation for a `Scan` operation after a node was decommissioned. However, because of #61470 not all failure modes propagate errors properly, causing the test to sometimes hang. This patch disables this check until these problems are addressed.

Resolves #61452.

Release justification: non-production code changes

Release note: None

Co-authored-by: Lucy Zhang <lucy@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
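As a rough illustration of the span-registry guardrail summarized in 61459 (cap at 5k, evict an arbitrary map-order entry when full); the type and method names here are assumptions, not the actual tracing package API:

```go
package tracing

import "sync"

// maxSpanRegistrySize mirrors the 5k cap mentioned in the change summary.
const maxSpanRegistrySize = 5000

type span struct {
	// ...
}

// spanRegistry is a bounded map of in-flight spans. When the cap is reached,
// an arbitrary existing entry is evicted; Go map iteration order is
// unspecified, which is good enough for a guardrail against leaked spans.
type spanRegistry struct {
	mu    sync.Mutex
	spans map[uint64]*span // keyed by span ID
}

func newSpanRegistry() *spanRegistry {
	return &spanRegistry{spans: make(map[uint64]*span)}
}

func (r *spanRegistry) register(id uint64, s *span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.spans) >= maxSpanRegistrySize {
		// Evict one arbitrary entry to make room for the new span.
		for evictID := range r.spans {
			delete(r.spans, evictID)
			break
		}
	}
	r.spans[id] = s
}
```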
Previously, if a node was asked to decommission itself, the
decommissioning process would sometimes hang or fail since the node
would become decommissioned and lose RPC access to the rest of the
cluster while it was building the response.
This patch returns an empty response from the `adminServer.Decommission()` call when setting the final `DECOMMISSIONED` status, thereby avoiding further use of cluster RPC after decommissioning itself. It also defers self-decommissioning until the end if multiple nodes are being decommissioned.
This change is backwards-compatible with old CLI versions, which will
simply output the now-empty status result set before completing with
"No more data reported on target nodes". The CLI has been updated to
simply omit the empty response.
Resolves #56718.
A separate commit ensures that errors with `codes.PermissionDenied` abort RPC retries and return immediately, to prevent operations hanging with indefinite internal retries when RPC access is lost.
Release justification: bug fixes and low-risk updates to new functionality
Release note (bug fix): Fixed a bug from 21.1-alpha where a node
decommissioning process could sometimes hang or fail when the
decommission request was submitted via the node being decommissioned.