roachtest: add test for authentication under network partition #66768
Conversation
973fb0a to 2ba0fc7
Here are the node logs from the repro (1 is the node with all the leases originally, that got shut off): Immediately after blocking traffic with
that last failure failed with the customer's reported error message
then:
and then after that, all further connection attempts were resolved quickly (in less than 0.2s). As detailed in the timeline above, both n2 and n3 were encountering a >4.5s delay for up to 30 seconds. This strongly confirms the customer's observation. So the period of interest is from the iptables command at 14:16:07 to when connections became normal again at 14:16:48. Here are the "interesting" heartbeat failures from n2:
The log paste above separates the two phases of slow heartbeats, one "while the client is experiencing delays" and another "after this point the client seems fine". What's different between the two?
What are these ranges?
I only looked at the heartbeat failures. There are more events in the logs about the circuit breaker, updating node descriptors etc. but I haven't investigated those yet.
This is great! We should also break this down into two categories of slow heartbeats:
yes you're right but even the 4.5s timeouts can be a problem if they are cascading. I'm thinking that there's likely a cascading failure: first a timeout on meta, then a timeout on liveness, then a timeout on system.users, which will force a >15s timeout on at least one client conn.
Definitely agree that it would be ideal to improve the 4.5s behavior and fail faster if we can. Sharing one snippet from these gRPC docs that may be relevant:
So it seems like gRPC is supposed to fail more quickly. Would it be worth investigating this further? And also possibly related issues grpc/grpc-go#1443 and grpc/grpc-go#2663
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):
"--args=--accept-sql-without-tls", "--env=COCKROACH_SCAN_INTERVAL=200ms", "--env=COCKROACH_SCAN_MAX_IDLE_TIME=20ms",
consider commenting why we use these environment variables
pkg/cmd/roachtest/network.go, line 243 at r5 (raw file):
url := fmt.Sprintf("postgres://testuser:password@%s/defaultdb?sslmode=require", serverAddrs[i-1])
b, err := c.RunWithBuffer(ctx, t.l, clientNode, "time", "-p", "./cockroach", "sql",
ah interesting, so you've changed it to connect from another GCE instance to all the server nodes. any particular reason why? the testing done by the customer was just logging directly onto the server and connecting to localhost
pkg/cmd/roachtest/network.go, line 267 at r5 (raw file):
sudo iptables -P OUTPUT ACCEPT; # Drop any node-to-node crdb traffic.
i had set up more aggressive rules in my draft, since i thought we'd want to drop all ICMP traffic as well. i guess it's not necessary to repro though, but maybe still useful to have the more aggressive rule? blocking all traffic (or as much as we can) matches more closely with the kind of testing done by the customer
pkg/cmd/roachtest/network.go, line 277 at r5 (raw file):
// (attempt to) restore iptables when the test ends, so that the cluster
// can be investigated afterwards.
well, we'd still be able to SSH into the node without undoing these rules. or did you mean something else?
pkg/cmd/roachtest/network.go, line 291 at r5 (raw file):
t.l.Printf("waiting for client to do its thing...")
wg.Wait()
do we still need wg if we move the m.Wait() call to here?
pkg/cmd/roachtest/network.go, line 292 at r5 (raw file):
wg.Wait()
if errorCount >= 1 /*c.Spec().NodeCount*/ {
i had left this commented as a TODO/reminder for myself. i don't really know what the expected number of failures is. should it be 0? (if so, change to errorCount > 0 and remove the commented code)
pkg/cmd/roachtest/network.go, line 450 at r5 (raw file):
})
r.Add(testSpec{
	Name: fmt.Sprintf("network/authentication/nodes=%d", numNodes),
it wasn't entirely clear to me why these tests use 4 nodes and not 3. i suppose it's fine though.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cmd/roachtest/network.go, line 450 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
it wasn't entirely clear to me why these tests use 4 nodes and not 3. i suppose it's fine though.
retracting this -- i see it's because one of the nodes is only used for trying to connect to the others.
As a data point, I've been running this with the settings from #66550 (comment) to see the full extent of one of these failover delays. I have yet to see any login attempt take longer than 22 seconds on v20.2. Unfortunately, I am running into trouble in a few areas with trying to get this running with v20.1. @rafiss, have you had any luck with that?
I upgraded from
For this issue (https://github.com/cockroachlabs/support/issues/1022) the customer reported the behavior on 20.2.10 as well as 20.1.8; so I've been focusing on 20.2.10 only so far. I just tried with 20.1.8 now, and did get an error:
I found that this error goes away if you use a roachprod binary from the release-20.1 branch. But I then found that the
I can confirm. When running 0fcebbf, I see a single delay of about 9 to 10 seconds. When running on its parent SHA, I see a single delay that lasts somewhere between 20 and 24 seconds.
After reverting #66550 (comment) and running with 0fcebbf, I still do see one login attempt hit the 10s timeout, as expected. However, I then see no subsequent slowdown at all on any other connection attempt. I'll see how stable the test is when I bump
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)
pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
consider commenting why we use these environment variables
will do
pkg/cmd/roachtest/network.go, line 243 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ah interesting, so you've changed it to connect from another GCE instance to all the server nodes. any particular reason why? the testing done by the customer was just logging directly onto the server and connecting to localhost
Yes, it's to exclude any variability due to the network distance between my computer, where roachtest is running, and the GCE instance.
pkg/cmd/roachtest/network.go, line 267 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i had set up more aggressive rules in my draft, since i thought we'd want to drop all ICMP traffic as well. i guess it's not necessary to repro though, but maybe still useful to have the more aggressive rule? blocking all traffic (or as much as we can) matches more closely with the kind of testing done by the customer
We should focus on reducing the scope of the test as much as possible to get to what the customer observes.
Once we have that, we can fix that issue, and then step over to a wider test scope to see if other issues remain.
Also:
- I see your point wrt ICMP but I'm pretty sure crdb does not use ICMP.
- regardless of which rules you use we should be careful to only block the traffic between the server nodes and not the traffic from the roachtest process (otherwise the monitor will bark, and it won't be possible to properly interact with the gce instance from the gce console and roachprod)
pkg/cmd/roachtest/network.go, line 277 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
well, we'd still be able to SSH into the node without undoing these rules. or did you mean something else?
I want the ability to restart the test using the same GCE instance, to save on reinitialization costs.
pkg/cmd/roachtest/network.go, line 291 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
do we still need wg if we move the m.Wait() call to here?
the monitor code is generally unreliable. I wouldn't trust it to organize synchronization between phases of the test code.
pkg/cmd/roachtest/network.go, line 292 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i had left this commented as a TODO/reminder for myself. i don't really know what the expected number of failures is. should it be 0? (if so, change to errorCount > 0 and remove the commented code)
I'll reply to this in a next comment.
Here are a few things I'd like to see happen:
The last two points are important to bring this test code into production:
@nvanbenschoten did you mean bump it down to 5s? Presumably a longer timeout will tolerate more errors, not less. |
66919: sql: cache hashed password and auth info during authentication r=knz,RichardJCai a=rafiss

See individual commits. Fixes #58869

This improves authentication performance in multiregion clusters and improves availability when the system range leaseholder goes down. I ran the test from #66768 and observed that there was no period of authentication unavailability. Previously, each authentication attempt would require one read from each of system.users and system.role_options. In multiregion clusters, this adds quite a bit of excess latency. Also, if those tables' leaseholders were not available, then authentication to all nodes would be affected. This cache is meant to alleviate both problems. It's kept up to date by updating the table descriptors' versions whenever authentication-related information is changed with CREATE/ALTER/DROP ROLE.

Release note (performance improvement): The latency of authenticating a user has been improved by adding a cache for lookups of authentication-related information.

Release note (ops change): There is now a server.authentication_cache.enabled cluster setting. The setting defaults to true. When enabled, this cache stores authentication-related data and will improve the latency of authentication attempts. Keeping the cache up to date adds additional overhead when using the CREATE, UPDATE, and DELETE ROLE commands. To minimize the overhead, any bulk ROLE operations should be run inside a transaction. To make the cache more effective, any regularly scheduled ROLE updates should be done all together, rather than occurring throughout the day at all times.

Release note (security update): There is now a cache for per-user authentication-related information. The data in the cache is always kept up to date because it checks whether any change to the underlying authentication tables has been made since the last time the cache was updated. The cached data includes the user's hashed password, the NOLOGIN role option, and the VALID UNTIL role option.

67135: sql/pgwire: fix statement buffer memory leak when using suspended portals r=rafiss a=joesankey

The connection statement buffer grows indefinitely when the client uses the execute portal with limit feature of the Postgres protocol, eventually causing the node to crash out of memory. Any long-running query that uses the limit feature will cause this memory leak, such as the `EXPERIMENTAL CHANGEFEED FOR` statement. The execute portal with limit feature of the Postgres protocol is used by the JDBC Postgres driver to fetch a limited number of rows at a time. The leak is caused by commands accumulating in the buffer and never getting cleared out. The client sends 2 commands every time it wants to receive more rows:

- `Execute {"Portal": "C_1", "MaxRows": 1}`
- `Sync`

The server processes the commands and leaves them in the buffer; every iteration causes 2 more commands to leak.

A similar memory leak was fixed by #48859; however, the execute with limit feature is implemented by a side state machine in limitedCommandResult. The cleanup routine added by #48859 is never executed for suspended portals, as they never return to the main conn_executor loop. After this change, the statement buffer gets trimmed to reclaim memory after each client command is processed in the limitedCommandResult side state machine. The StmtBuf.Ltrim function was changed to public visibility to enable this. While this is not ideal, it does scope the fix to the limitedCommandResult side state machine and could be addressed when the limitedCommandResult functionality is refactored into the conn_executor.

Added a unit test which causes the leak; it uses the PGWire client, as neither the pg nor pgx clients use execute with limit, so they can't be used to demonstrate the leak. Also tested the fix in a cluster by following the steps outlined in #66849.

Resolves: #66849
See also: #48859

Release note (bug fix): fix statement buffer memory leak when using suspended portals.

@asubiotto @andreimatei

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: joesankey <joesankey@gmail.com>
2ba0fc7 to c71d8f6
This is ready for review now. RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)
pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):
Previously, knz (kena) wrote…
will do
Removed them.
this is wonderful. thanks for taking this over and bringing it to a finish and all the cleanup+enhancements in this test. lgtm!
i wonder two things:
- now that i've merged the AuthInfoCache PR, would there be value in running this test once with the cache enabled, and once disabled? the setting is server.authentication_cache.enabled. i'd be able to add that additional test myself, so we can merge this PR now
- is there still value in adding the logging you mentioned in #66550 (comment)? i'm not familiar enough with that part of the code to say. if the logging is useful, then i think it'd be better to spin that off into its own issue and keep this PR as is
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)
@knz friendly ping -- can this be merged? the CI failure is
c71d8f6 to ce6a330
fixed the lint issue and merging after checking with knz bors r=rafiss,knz
Build failed (retrying...):
Seems like this needs an update. bors r-
Canceled.
i don't think it needs an update. GitHub CI is passing after I retriggered. I'll give it another shot bors r=rafiss,knz
I don't think so - it calls a method that doesn't exist in the interface anymore:
and I couldn't immediately figure out which one to call, so I didn't update the PR myself. bors r-
Canceled.
Customers have raised major complaints about this situation. If a node disappears under a network partition (or the host is hard-killed), then it is possible that no connections can be established to any other node. Specifically, if the system.users leaseholder is the node that disappears, authentication can fail. We expect on the order of ~10 seconds of authentication unavailability, but customers have shown us an outage period of up to 40 seconds. This test tries to repro what our customers have shown us so that we can iterate on the root cause more quickly. Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
ce6a330 to ffc4ef9
I see, thanks for the explanation. Fixed
bors r=rafiss,knz
Build succeeded:
Supersedes #66550