
server: TestAddNewStoresToExistingNodes failed #106706

Closed
cockroach-teamcity opened this issue Jul 12, 2023 · 8 comments · Fixed by #107265 or #107124
Labels
A-kv-server: Relating to the KV-level RPC server · A-server-start-drain: Pertains to server startup and shutdown sequences · branch-master: Failures on the master branch. · C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · C-test-failure: Broken test (automatically or manually discovered). · O-robot: Originated from a bot. · T-kv-replication: KV Replication Team
Milestone
23.2

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 12, 2023

server.TestAddNewStoresToExistingNodes failed with artifacts on master @ 665084ed379a25fcde92ee8b4c9dc48e192876e5:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 131489 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestAddNewStoresToExistingNodes
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestAddNewStoresToExistingNodes363450179
    test_log_scope.go:81: use -show-logs to present logs inline

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-29688

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team T-storage Storage Team labels Jul 12, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 12, 2023
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV Jul 12, 2023
@blathers-crl blathers-crl bot added this to Incoming in Storage Jul 12, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jul 12, 2023
@shralex
Contributor

shralex commented Jul 12, 2023

Seems related to #106355, which got closed.

@andrewbaptist
Collaborator

@tbg did you have further thoughts on this one? This failure included the fix for #106402 (ffa0a74), but it still seemed to fail. However, it only happened once, so I'm inclined to close this out.

I ran this on my GCE worker for 6 hours (unintentionally, as I forgot it was running :)) and it never failed for me on the latest master.

@tbg tbg added T-kv-replication KV Replication Team and removed T-kv KV Team T-storage Storage Team labels Jul 17, 2023
@blathers-crl

blathers-crl bot commented Jul 17, 2023

cc @cockroachdb/replication

@cockroach-teamcity
Member Author

server.TestAddNewStoresToExistingNodes failed with artifacts on master @ b3ae46d39098088ca33844b3cb87d18c60cde0d5:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 117612 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestAddNewStoresToExistingNodes
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/dbd7076bfe20e3183065a44fa4f7e406/logTestAddNewStoresToExistingNodes3219261074
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:104: 
        Test server was configured to route SQL queries to a secondary tenant (virtual cluster).
        If you are only seeing a test failure when this message appears, there may be a problem
        specific to cluster virtualization or multi-tenancy.
        
        To investigate, consider using "COCKROACH_TEST_TENANT=true" to force-enable just
        the secondary tenant in all runs (or, alternatively, "false" to force-disable), or use
        "COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=false" to disable all random test variables altogether.
*
* ERROR: test tenant requested by configuration, but code organization prevents start!
* OSS binaries do not include enterprise features
*
*
* ERROR: test tenant requested by configuration, but code organization prevents start!
* OSS binaries do not include enterprise features
*
*
* ERROR: test tenant requested by configuration, but code organization prevents start!
* OSS binaries do not include enterprise features
*
    test_server_shim.go:104: 
        Test server was configured to route SQL queries to a secondary tenant (virtual cluster).
        If you are only seeing a test failure when this message appears, there may be a problem
        specific to cluster virtualization or multi-tenancy.
        
        To investigate, consider using "COCKROACH_TEST_TENANT=true" to force-enable just
        the secondary tenant in all runs (or, alternatively, "false" to force-disable), or use
        "COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=false" to disable all random test variables altogether.
    testcluster.go:415: 
        Test server was configured to route SQL queries to a secondary tenant (virtual cluster).
        If you are only seeing a test failure when this message appears, there may be a problem
        specific to cluster virtualization or multi-tenancy.
        
        To investigate, consider using "COCKROACH_TEST_TENANT=true" to force-enable just
        the secondary tenant in all runs (or, alternatively, "false" to force-disable), or use
        "COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=false" to disable all random test variables altogether.

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Jul 18, 2023

Yep, looks like the test is hanging during start-up. I'll take a look.

TestCluster.Start
01:40:26     goroutine 38686 [chan receive, 14 minutes]:
01:40:26     github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start(0x4002a84700, {0x5d1c250?, 0x40145409c0?})
01:40:26       github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:420 +0xa98
01:40:26     github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster({_, _}, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
01:40:26       github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:236 +0x58
01:40:26     github.com/cockroachdb/cockroach/pkg/server_test.TestAddNewStoresToExistingNodes(0x40145409c0)
01:40:26       github.com/cockroachdb/cockroach/pkg/server_test/pkg/server/multi_store_test.go:95 +0x324
01:40:26     testing.tRunner(0x40145409c0, 0x4bc8988)
01:40:26       GOROOT/src/testing/testing.go:1446 +0x10c
01:40:26     created by testing.(*T).Run
01:40:26       GOROOT/src/testing/testing.go:1493 +0x328

waits for

Node.waitForAdditionalStoreInit
goroutine 51538 [chan receive, 14 minutes]:
github.com/cockroachdb/cockroach/pkg/server.(*Node).waitForAdditionalStoreInit(...)
	github.com/cockroachdb/cockroach/pkg/server/node.go:736
github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart(0x4002f1f500, {0x5cb2bd0?, 0x400014e000?})
	github.com/cockroachdb/cockroach/pkg/server/server.go:1911 +0x1b9c
github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start(0x4002a85500, {0x5cb2bd0, 0x400014e000})
	github.com/cockroachdb/cockroach/pkg/server/testserver.go:626 +0x30
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer(_, _, {{{0x5c6da20, 0x40075da700}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:608 +0x60
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func1(0x0?)
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:401 +0x84
created by github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:400 +0x4b0

which is stuck in DistSender trying to send the Increment to get a StoreID

Node.initializeAdditionalStores
goroutine 51921 [select]:
google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0x4004fcb9e0)
        google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:328 +0x6c
google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:343
google.golang.org/grpc.(*csAttempt).recvMsg(0x40037b5110, {0x46e72a0?, 0x40075f0b80}, 0xcc001c?)
        google.golang.org/grpc/external/org_golang_google_grpc/stream.go:1046 +0x78
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x400675cdc8?)
        google.golang.org/grpc/external/org_golang_google_grpc/stream.go:900 +0x28
google.golang.org/grpc.(*clientStream).withRetry(0x4004fcb7a0, 0x400675cdf8, 0x400675cdc8)
        google.golang.org/grpc/external/org_golang_google_grpc/stream.go:751 +0x1a0
google.golang.org/grpc.(*clientStream).RecvMsg(0x4004fcb7a0, {0x46e72a0?, 0x40075f0b80?})
        google.golang.org/grpc/external/org_golang_google_grpc/stream.go:899 +0xbc
google.golang.org/grpc.invoke({0x5cb2c40?, 0x40074fa870?}, {0x48ecaf1?, 0x48ecaf1?}, {0x4804ee0, 0x4004fcb200}, {0x46e72a0, 0x40075f0b80}, 0xffffa58c35b8?, {0x4009586640?, ...})
        google.golang.org/grpc/external/org_golang_google_grpc/call.go:73 +0xb4
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ClientInterceptor.func2({0x5cb2c40, 0x40074fa870}, {0x48ecaf1, 0x21}, {0x4804ee0, 0x4004fcb200}, {0x46e72a0, 0x40075f0b80}, 0x0?, 0x4bd08b8, ...)
        github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:248 +0x30c
google.golang.org/grpc.(*ClientConn).Invoke(0x0?, {0x5cb2c40?, 0x40074fa870?}, {0x48ecaf1?, 0x276538f0d7b40ecc?}, {0x4804ee0?, 0x4004fcb200?}, {0x46e72a0?, 0x40075f0b80?}, {0x0?, ...})
        google.golang.org/grpc/external/org_golang_google_grpc/call.go:35 +0x164
github.com/cockroachdb/cockroach/pkg/kv/kvpb.(*internalClient).Batch(0x4000bbb3f8, {0x5cb2c40, 0x40074fa870}, 0x0?, {0x0, 0x0, 0x0})
        github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/aarch64-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:9866 +0x8c
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.TracingInternalClient.Batch({{0x5d0b438?, 0x4000bbb3f8?}}, {0x5cb2c40, 0x40074fa870}, 0x4004fcb0e0, {0x0, 0x0, 0x0})
        github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:282 +0x180
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0x4008f315c0, {0x5cb2c40, 0x40074fa870}, 0x6c18cd8?, {0x5cb0900, 0x4005b91840?}, 0x4004fcb0e0)
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:210 +0xc8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0x4008f315c0, {0x5cb2c40, 0x40074fa870}, 0x1?)
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:189 +0x90
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0x4008106a00, {0x5cb2c40, 0x40074fa870}, 0x4004a15200?, {0x4000fafc20, 0x40037b5040, 0x40037b50b0, 0xffffffff, 0x0}, 0x0)
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2283 +0xd58
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0x4008106a00, {0x5cb2c40?, 0x40074fa870}, 0x4004a15200, {{0x4000040a30, 0xc, 0xc}, {0x4005470010, 0xd, 0xd}}, ...)
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1801 +0x588
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0x4008106a00, {0x5cb2c40, 0x40074fa870}, 0x4004a15200, {{0x4000040a30, 0xc, 0xc}, {0x4005470010, 0xd, 0xd}}, ...)
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1372 +0x2a4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0x4008106a00, {0x5cb2c40, 0x4013ed4e40}, 0x4004a15200)
        github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:992 +0x49c
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0x4002efa578, {0x5cb2c40, 0x4013ed4e40}, 0x4004a15200)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:223 +0x94
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0x4002efa510, {0x5cb2c40, 0x4013ed4e40}, 0x4004a15200, {0x5c6cfc0, 0x4002efa578})
        github.com/cockroachdb/cockroach/pkg/kv/db.go:1033 +0x98
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(0x110?, {0x5cb2c40?, 0x4013ed4e40?}, 0x125a1fc?)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:1016 +0x34
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x5cb2c40, 0x4013ed4e40}, 0x4002988de8, 0x400ac12000)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:863 +0xe4
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0x4002efa510, {0x5cb2c40, 0x4013ed4e40}, 0x2?)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:886 +0x6c
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Inc(0x4000040a30?, {0x5cb2c40, 0x4013ed4e40}, {0x45fd060, 0x400d9861e0}, 0x0?)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:468 +0x6c
github.com/cockroachdb/cockroach/pkg/kv.IncrementValRetryable({0x5cb2c40, 0x4013ed4e40}, 0x4?, {0x4000040a30, 0xc, 0xc}, 0xcd0000e17c?)
        github.com/cockroachdb/cockroach/pkg/kv/db.go:1087 +0x174
github.com/cockroachdb/cockroach/pkg/server.allocateStoreIDs({0x5cb2c40?, 0x4013ed4e40?}, 0x1?, 0x2, 0x0?)
        github.com/cockroachdb/cockroach/pkg/server/node.go:400 +0x64
github.com/cockroachdb/cockroach/pkg/server.(*Node).initializeAdditionalStores(0x4002a21000, {0x5cb2c40, 0x4013ed4e40}, {0x40056c0540, 0x2, 0x2?}, 0x400113ed20?)

There are a few server-side Node.Batch goroutines; they're all in

`redirectOnOrAcquireLease`
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).redirectOnOrAcquireLeaseForRequest.func2(0x4009db3e00, 0x0, 0x4008d45200, 0x4009db3dd0, 0x400b5a5198, {0xffff7a770f28, 0x4005df9380}, 0x400b5a5060)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_range_lease.go:1365 +0x1cc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).redirectOnOrAcquireLeaseForRequest(0x4008d45200, {0x5cb2c40, 0x4013331860}, {0x400b5a5518?, 0x3480ac0?, 0x0?}, {_, _})
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_range_lease.go:1439 +0x414
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleInvalidLeaseError(0x4009b9d3c0?, {0x5cb2c40, 0x4013331860}, 0x4004448fc0)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:847 +0x60
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0x4008d45200, {0x5cb2c40, 0x4013331860}, 0x4004448fc0, 0x4bc7448)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:590 +0x454
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0x4008d45200, {0x5cb2c40?, 0x4013331830?}, 0x4004448fc0)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:189 +0x4f0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes(0x40047c7500, {0x5cb2c40?, 0x4013331800?}, 0x4004448fc0)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:193 +0x44c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes(0x400b131ce0?, {0x5cb2c40, 0x4013331800}, 0x4004448fc0)
        github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:202 +0x9c
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0x400a082c00, {0x5cb2c40?, 0x40133317d0?}, {0x4008bbab40?}, 0x4004448fc0)
        github.com/cockroachdb/cockroach/pkg/server/node.go:1296 +0x2e4
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0x400a082c00, {0x5cb2c40, 0x4013331710}, 0x4004448fc0)
[...]

@tbg tbg removed this from Incoming in Storage Jul 18, 2023
@tbg
Member

tbg commented Jul 18, 2023

I think I see a problem.

In (*Server).PreStart, we wait for additional store init here

s.node.waitForAdditionalStoreInit()

before starting liveness, which happens a few lines further down:

https://github.com/cockroachdb/cockroach/blob/0b2f1cce5e22a4faa35addb243bbd0a55fa32cc1/pkg/server/server.go#L1959-L1958

That seems unwise, since the following can now happen (a toy sketch of this cycle follows the list):

  • all nodes restart (in parallel) with two additional stores
  • they all block in waitForAdditionalStoreInit
  • the goroutine that should unblock it needs to send an Increment
  • but nobody has started their liveness heartbeat yet (since that only happens after unblocking)
  • nobody acquires a lease:

    W230718 01:27:05.307050 51797 kv/kvclient/kvcoord/dist_sender.go:1808 ⋮ [T1,n3] 540 slow range RPC: have been waiting 65.21s (9 attempts) for RPC Increment [/System‹/›‹"store-idgen›‹"›,/Min) to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: [NotLeaseHolderError] failed to manipulate liveness record: not incrementing epoch on n1 because next leaseholder (n2) not live; r3: replica (n2,s2):2 not lease holder; lease holder unknown›

  • stuck forever
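
Here is that toy sketch. This is illustration only, not CockroachDB code, and all names in it are made up: every "node" blocks on its additional-store-init channel before its liveness loop would start, while the goroutine that would unblock it first needs something (the `livenessStarted` signal, standing in for a lease acquisition backed by a liveness heartbeat) that only becomes available after some node gets past the wait. Running the program trips Go's runtime deadlock detector.

```go
// Toy model of the hang, not CockroachDB code. All names are made up.
package main

import "sync"

func main() {
    const nodes = 3
    // Closed once any node's liveness heartbeat loop is running. Stands in
    // for "a lease can be acquired, so the KV Increment can succeed".
    livenessStarted := make(chan struct{})
    var closeOnce sync.Once

    var wg sync.WaitGroup
    for i := 0; i < nodes; i++ {
        additionalStoreInit := make(chan struct{})

        // Stands in for Node.initializeAdditionalStores: it must first get a
        // StoreID via a KV Increment, which needs a lease, which needs some
        // node's liveness heartbeat loop to be running.
        go func() {
            <-livenessStarted // never fires
            close(additionalStoreInit)
        }()

        // Stands in for (*Server).PreStart: wait for additional store init
        // *before* starting the liveness loop.
        wg.Add(1)
        go func() {
            defer wg.Done()
            <-additionalStoreInit // every node parks here
            closeOnce.Do(func() { close(livenessStarted) })
        }()
    }
    wg.Wait() // fatal error: all goroutines are asleep - deadlock!
}
```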

@tbg tbg removed this from roachtest/unit test backlog in KV Jul 18, 2023
@tbg
Member

tbg commented Jul 18, 2023

I see why this isn't happening frequently, though. I added a 10s sleep before the Increment, but the test still mostly passes. But with this (the `started` check is something I added):

[screenshot of the modified code with the added `started` check]

it does hang. So without that check we usually get away with it. For the hang to occur "naturally", we have to reliably hit the second branch on all (or at least two) nodes, since only the second branch below requires node liveness to have been started on the other node:

if status.Lease.Type() == roachpb.LeaseEpoch && status.State == kvserverpb.LeaseState_EXPIRED {
    var err error
    // If this replica is previous & next lease holder, manually heartbeat to become live.
    if status.OwnedBy(nextLeaseHolder.StoreID) &&
        p.repl.store.StoreID() == nextLeaseHolder.StoreID {
        if err = p.repl.store.cfg.NodeLiveness.Heartbeat(ctx, status.Liveness); err != nil {
            log.Errorf(ctx, "failed to heartbeat own liveness record: %s", err)
        }
    } else if status.Liveness.Epoch == status.Lease.Epoch {
        // If not owner, increment epoch if necessary to invalidate lease.
        // However, we only do so in the event that the next leaseholder is
        // considered live at this time. If not, there's no sense in
        // incrementing the expired leaseholder's epoch.
        if !p.repl.store.cfg.NodeLiveness.GetNodeVitalityFromCache(nextLeaseHolder.NodeID).IsLive(livenesspb.EpochLease) {
            err = errors.Errorf("not incrementing epoch on n%d because next leaseholder (n%d) not live",
                status.Liveness.NodeID, nextLeaseHolder.NodeID)
            log.VEventf(ctx, 1, "%v", err)

@tbg
Member

tbg commented Jul 18, 2023

Now that I have a deterministic way of introducing the hang, I can see about rearranging the startup sequence so that we wait for additional stores only after starting liveness.
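
To make the intended reordering concrete, here is a compilable stand-in. The types and method names are stand-ins, not the real `server` package, and the real `PreStart` does far more than this.

```go
// Stand-in types only; illustrates the ordering fix, not the real server code.
package main

import (
    "context"
    "fmt"
)

type nodeLiveness struct{}

// Start launches the heartbeat loop (here it just prints).
func (nodeLiveness) Start(ctx context.Context) { fmt.Println("liveness heartbeat loop running") }

type node struct{ additionalStoresInitialized chan struct{} }

func (n node) waitForAdditionalStoreInit() { <-n.additionalStoresInitialized }

type server struct {
    nodeLiveness nodeLiveness
    node         node
}

// preStart sketches the fix: start liveness *before* blocking on additional
// store init, so the StoreID-allocating Increment can acquire a lease.
// Previously the two calls were effectively in the opposite order.
func (s server) preStart(ctx context.Context) {
    s.nodeLiveness.Start(ctx)
    s.node.waitForAdditionalStoreInit()
}

func main() {
    s := server{node: node{additionalStoresInitialized: make(chan struct{})}}
    close(s.node.additionalStoresInitialized) // pretend the stores finished bootstrapping
    s.preStart(context.Background())
}
```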

@tbg tbg added A-server-start-drain Pertains to server startup and shutdown sequences and removed A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jul 18, 2023
@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server labels Jul 19, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 20, 2023
I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix cockroachdb#106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: cockroachdb#106706 (comment)

Epic: None
Release note: None
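
To illustrate the behavior this commit describes (callbacks may be registered at any time, and late registrants immediately get a synchronous initial notification of the currently-known state), here is a toy registry. It is not the actual NodeLiveness code and the names are invented.

```go
// Toy registry, not the NodeLiveness implementation; names are invented.
package main

import (
    "fmt"
    "sync"
)

type IsLiveCallback func(nodeID int)

type registry struct {
    mu        sync.Mutex
    liveNodes map[int]bool
    callbacks []IsLiveCallback
}

func newRegistry() *registry { return &registry{liveNodes: map[int]bool{}} }

// RegisterCallback may be called at any point. If nodes are already known to
// be live, the callback is notified synchronously so it doesn't miss past events.
func (r *registry) RegisterCallback(cb IsLiveCallback) {
    r.mu.Lock()
    r.callbacks = append(r.callbacks, cb)
    var initial []int
    for id, live := range r.liveNodes {
        if live {
            initial = append(initial, id)
        }
    }
    r.mu.Unlock()
    for _, id := range initial {
        cb(id) // synchronous initial notification
    }
}

func (r *registry) markLive(id int) {
    r.mu.Lock()
    r.liveNodes[id] = true
    cbs := append([]IsLiveCallback(nil), r.callbacks...)
    r.mu.Unlock()
    for _, cb := range cbs {
        cb(id)
    }
}

func main() {
    r := newRegistry()
    r.markLive(1) // liveness already running
    // A replicateQueue created later (e.g. for a newly bootstrapped store)
    // still learns that n1 is live:
    r.RegisterCallback(func(id int) { fmt.Println("callback: n", id, "is live") })
}
```
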
tbg added a commit to tbg/cockroach that referenced this issue Jul 20, 2023
Otherwise, we can end up in a situation where each node is sitting on the
channel and nobody has started their liveness yet. The sender to the channel
will first have to get an Increment through KV, but if nobody acquires the
lease (since nobody's heartbeat loop is running), this will never happen.

In practice, *most of the time*, there is no deadlock because the lease
acquisition path performs a synchronous heartbeat to the own entry in most
cases (ignoring the fact that liveness hasn't been started yet). But there is
also another path where someone else's epoch needs to be incremented, and this
path also checks if the node itself is live - which it won't necessarily be
(liveness loop is not running yet).

Fixes cockroachdb#106706

Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck
during start-up was addressed. This is unlikely to have been encountered by
production users. If so, it would manifest itself through a stack frame
sitting on a select in `waitForAdditionalStoreInit` for extended periods of
time (i.e. minutes).
craig bot pushed a commit that referenced this issue Jul 26, 2023
107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None


107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, and the leader will be unable to acquire epoch leases in this case.

This patch ignores the RPC connection state when deciding whether to campaign, using only the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this not only depends on the node's liveness, but also its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire an epoch lease (which the current leader can't).

This patch ignores the RPC connection state when deciding whether to forget the leader, using only the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**
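
A minimal sketch of the new decision criterion described above, with invented names (the real `shouldCampaignOnWake` logic has more inputs): whether to campaign now depends only on whether the leader's liveness record has expired, whereas the previous check also folded RPC connection health into `IsLiveMapEntry.IsLive`.

```go
// Invented names; sketches only the liveness-only criterion, not kvserver code.
package main

import (
    "fmt"
    "time"
)

type livenessRecord struct {
    Epoch      int64
    Expiration time.Time
}

// shouldCampaignOnWakeSketch: campaign for Raft leadership only if the
// leader's liveness record has expired. RPC connection health is deliberately
// no longer consulted.
func shouldCampaignOnWakeSketch(leader livenessRecord, now time.Time) bool {
    leaderLive := leader.Expiration.After(now)
    return !leaderLive
}

func main() {
    now := time.Now()
    expired := livenessRecord{Epoch: 5, Expiration: now.Add(-3 * time.Second)}
    fmt.Println("campaign?", shouldCampaignOnWakeSketch(expired, now)) // true
}
```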

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate.

This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache.
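
As a quick sanity check of that arithmetic: the 1/256 ratio and the 32 MiB floor come from the description above, while the exact order in which the floor and the per-store split are applied is an assumption here.

```go
// Back-of-the-envelope only; constants are taken from the PR description above.
package main

import "fmt"

// raftEntryCacheBudget returns the total entry cache budget: 1/256 of system
// memory, floored at 32 MiB. The result is then split evenly across stores
// (whether the floor applies before or after the split is an assumption).
func raftEntryCacheBudget(systemMemBytes int64) int64 {
    const minBytes = 32 << 20 // 32 MiB
    if b := systemMemBytes / 256; b > minBytes {
        return b
    }
    return minBytes
}

func main() {
    total := raftEntryCacheBudget(32 << 30) // 32 GiB node
    fmt.Printf("total: %d MiB, per store (1 store): %d MiB\n", total>>20, total>>20)
    // Output: total: 128 MiB, per store (1 store): 128 MiB
}
```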

This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from 87% to 99% hit rate).

Resolves #98666.
Epic: none

Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to
avoid running for too long. As a result, it was flaky since liveness wasn't
reliably pinned in the way the test wanted.

The hybrid manual clock allows time to jump forward at an opportune moment.
Use it here to avoid running with a tight lease interval.

On my gceworker, previously flaked within a few minutes. As of this commit, I
ran it for double-digit minutes without issue.

Fixes #107200.

Epic: None
Release note: None


107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is
that a lease transfer that was supposed to be caught by an interceptor never
showed up in the interceptor. The most likely explanation is that it errored
out before it got to evaluation. It then signaled a channel the test was only
prepared to check later, so the test hung (waiting for a channel that was now
never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach)
to write tests like it in a deterministic framework[^1].

[^1]: see #105177.

For now, fix the test so that when the (so far unknown) error rears its
head again, it will fail properly, so we get to see the error and can
take another pass at fixing the test (separately). Stressing
this commit[^2], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected
> error: refusing to transfer lease to (n3,s3):3 because target may need a Raft
> snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but
the above-raft mechanism also exists, and while we didn't want to interact
with it, we sometimes do[^3].

The second commit introduces a testing knob that disables the above-raft
mechanism selectively. I've stressed the test for 15 minutes without issues
after this change.

[^3]: somewhat related to #107524
[^2]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s

Fixes #106383.

Epic: None
Release note: None


107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other
tests susceptible to similar issues, see:
#107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.

Release note: None
Epic: none


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@craig craig bot closed this as completed in f7e72f9 Jul 26, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 26, 2023
Otherwise, we can end up in a situation where each node is sitting on the
channel and nobody has started their liveness yet. The sender to the channel
will first have to get an Increment through KV, but if nobody acquires the
lease (since nobody's heartbeat loop is running), this will never happen.

In practice, *most of the time*, there is no deadlock because the lease
acquisition path performs a synchronous heartbeat to the own entry in most
cases (ignoring the fact that liveness hasn't been started yet). But there is
also another path where someone else's epoch needs to be incremented, and this
path also checks if the node itself is live - which it won't necessarily be
(liveness loop is not running yet).

Fixes cockroachdb#106706

Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck
during start-up was addressed. This is unlikely to have been encountered by
production users. If so, it would manifest itself through a stack frame
sitting on a select in `waitForAdditionalStoreInit` for extended periods of
time (i.e. minutes).
tbg added a commit to tbg/cockroach that referenced this issue Jul 26, 2023
If we rely on sync heartbeats, there's an issue: they were very effective at
hiding the problem in cockroachdb#106706. So, at least in our testing, allow
sync heartbeats only once async heartbeats are also running.

Epic: none
Release note: None
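
A toy illustration of that gate, with invented names (not the real liveness package): in tests, refuse the synchronous heartbeat path until the asynchronous heartbeat loop has actually started.

```go
// Invented names; shows the gating idea only, not the real liveness package.
package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

type liveness struct {
    heartbeatLoopStarted atomic.Bool
}

// startHeartbeatLoop records that the async heartbeat loop is running.
func (l *liveness) startHeartbeatLoop() { l.heartbeatLoopStarted.Store(true) }

// syncHeartbeat refuses to run until the async loop has started, so the sync
// path cannot paper over startup-ordering bugs like the one in this issue.
func (l *liveness) syncHeartbeat() error {
    if !l.heartbeatLoopStarted.Load() {
        return errors.New("sync heartbeat attempted before liveness was started")
    }
    // ... perform the synchronous heartbeat ...
    return nil
}

func main() {
    var l liveness
    fmt.Println(l.syncHeartbeat()) // error: not started yet
    l.startHeartbeatLoop()
    fmt.Println(l.syncHeartbeat()) // <nil>
}
```
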
craig bot pushed a commit that referenced this issue Jul 26, 2023
96144: server: honor and validate the service mode for SQL pods r=yuzefovich,stevendanna a=knz

Rebased on top of #105441.
Fixes #93145.
Fixes #83650.

Epic: CRDB-26691

TLDR: this makes SQL servers refuse to start unless their data state
is "ready" and their deployment type (e.g. separate process) matches
the configured service mode in the tenant record.
Additionally, the SQL server spontaneously terminates if its
service mode or data state is no longer valid (e.g.
as a result of DROP TENANT or ALTER TENANT STOP SERVICE).

----

Prior to this patch, there wasn't a good story about the lifecycle
of separate-process SQL servers ("SQL pods"):

- if a SQL server was started against a non-existent tenant,
  an obscure error would be raised (`database "[1]" does not exist`)
- if a SQL server was started while a tenant was being added
  (keyspace not yet valid), no check would be performed and
  data corruption could ensue.
- if the tenant record/keyspace was dropped while the SQL server was
  running, SQL clients would start encountering obscure errors.

This commit fixes the situation by checking the tenant metadata:

- once, during SQL server startup, at which point server startup
  is prevented if the service check fails;
- then, asynchronously, whenever the metadata is updated, such that
  any service check failure results in a graceful shutdown of the SQL
  service.

The check proper validates (a rough sketch follows the list):
- that the tenant record exists;
- that the data state is "ready";
- that the configured service mode matches that requested by the SQL
  server.
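
A rough sketch of those three validations: the types, field names, and constants below are invented for illustration; only the error strings mirror the example outputs shown next.

```go
// Invented types and names; mirrors the three checks listed above.
package main

import (
    "errors"
    "fmt"
)

type tenantRecord struct {
    Exists      bool
    DataState   string // e.g. "ready", "add", "dropping"
    ServiceMode string // e.g. "external", "shared", "none"
}

func checkTenantService(rec tenantRecord, wantMode string) error {
    if !rec.Exists {
        return errors.New("tenant service check failed: missing record")
    }
    if rec.DataState != "ready" {
        return fmt.Errorf("tenant service check failed: data state is %q, not ready", rec.DataState)
    }
    if rec.ServiceMode != wantMode {
        return fmt.Errorf(
            "tenant service check failed: service mode check failed: expected %s, record says %s",
            wantMode, rec.ServiceMode)
    }
    return nil
}

func main() {
    // e.g. a separate-process SQL server after ALTER TENANT STOP SERVICE:
    rec := tenantRecord{Exists: true, DataState: "ready", ServiceMode: "none"}
    fmt.Println(checkTenantService(rec, "external"))
}
```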

Example outputs upon error:

- non-existent tenant:
  ```
  tenant service check failed: missing record
  ```

- attempting to start separate-process server while
  tenant is running as shared-process:
  ```
  tenant service check failed: service mode check failed: expected external, record says shared
  ```

- after ALTER TENANT STOP SERVICE:
  ```
  tenant service check failed: service mode check failed: expected external, record says none
  ```

- after DROP TENANT:
  ```
  tenant service check failed: service mode check failed: expected external, record says dropping
  ```

Release note: None



107124: server: avoid deadlock when initing additional stores r=erikgrinaker a=tbg

We need to start node liveness before waiting for additional store init.

Otherwise, we can end up in a situation where each node is sitting on the
channel and nobody has started their liveness yet. The sender to the channel
will first have to get an Increment through KV, but if nobody acquires the
lease (since nobody's heartbeat loop is running), this will never happen.

In practice, *most of the time*, there is no deadlock because the lease
acquisition path performs a synchronous heartbeat to the own entry in most
cases (ignoring the fact that liveness hasn't been started yet). But there is
also another path where someone else's epoch needs to be incremented, and this
path also checks if the node itself is live - which it won't necessarily be
(liveness loop is not running yet).

Fixes #106706

Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck
during start-up was addressed. This is unlikely to have been encountered by
production users. If so, it would manifest itself through a stack frame
sitting on a select in `waitForAdditionalStoreInit` for extended periods of
time (i.e. minutes).


107216: sql: disallow schema changes for READ COMMITTED transactions r=michae2 a=rafiss

Due to how descriptor leasing works, schema changes are not safe in weaker isolation transactions. Until they are safe, we disallow them.

fixes #100143
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>