
roachtest: add test for authentication under network partition #66768

Merged: 1 commit merged into cockroachdb:master on Sep 15, 2021

Conversation

@knz (Contributor) commented Jun 23, 2021

Supersedes #66550

@knz knz requested review from rafiss and tbg June 23, 2021 14:19
@cockroach-teamcity (Member) commented:

This change is Reviewable

@knz (Contributor, Author) commented Jun 23, 2021

@rafiss with your test and my fixes I am able to readily reproduce the customer's issue. I'm now looking into the log files to see if there's anything interesting.

cc @tbg

@knz (Contributor, Author) commented Jun 23, 2021

Here are the node logs from the repro (n1 is the node that originally held all the leases and was shut off):

logs.tar.gz

Immediately after blocking traffic with iptables at 14:16:07, the test saw:

  • 14:16:09: 9.58s connection delay from n3 ending at 14:16:20
    (too long, but not enough to trigger the error)
  • 14:16:20: 0.14s connection delay from n2 ending at 14:16:22
  • 14:16:22: 6.51s connection delay from n3 ending at 14:16:31
    (again too long, but not enough to trigger the error)
  • 14:16:31: 10.07s connection delay from n2 ending at 14:16:43

That last attempt failed with the customer's reported error message:

ERROR: internal error while retrieving user account

then:

  • 14:16:43: 3.31s connection delay from n3 ending at 14:16:48

After that, all further connection attempts resolved quickly (in less than 0.2s).

As detailed in the timeline above, both n2 and n3 were encountering a >4.5s delay for up to 30 seconds. This strongly confirms the customer's observation.

So the period of interest is from the iptables command at 14:16:07 to when connections became normal again at 14:16:48.

Here are the "interesting" heartbeat failures from n2:

cockroach.log:W210623 14:15:53.240839 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 3.606109623s; err=<nil>
cockroach.log:W210623 14:16:16.635103 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500253127s; err=context deadline exceeded
cockroach.log:W210623 14:16:20.604874 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 3.9696735s; err=<nil>
cockroach.log:W210623 14:16:25.635146 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500288899s; err=context deadline exceeded
cockroach.log:W210623 14:16:30.135478 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500233573s; err=‹aborted in DistSender: context deadline exceeded›
cockroach.log:W210623 14:16:30.189462 33594 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r8/3:‹/Table/1{2-3}›] slow heartbeat took 4.549790544s; err=<nil>
cockroach.log:W210623 14:16:30.189555 33604 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r6/3:‹/Table/{SystemCon…-11}›] slow heartbeat took 4.51348012s; err=<nil>
cockroach.log:W210623 14:16:39.135067 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500236273s; err=context deadline exceeded
cockroach.log:W210623 14:16:43.635375 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500234435s; err=‹aborted in DistSender: context deadline exceeded›
cockroach.log:W210623 14:16:48.135586 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500118935s; err=context deadline exceeded
cockroach.log:W210623 14:16:48.561551 45860 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r5/3:‹/{Systemtse-Table/System…}›] slow heartbeat took 9.415041492s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:16:48.562160 46402 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r6/3:‹/Table/{SystemCon…-11}›] slow heartbeat took 8.849017769s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:16:48.562830 50141 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r3/2:‹/System/{NodeLive…-tsd}›] slow heartbeat took 4.653593527s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:16:48.563274 50205 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r4/2:‹/System{/tsd-tse}›] slow heartbeat took 4.514427495s; err=heartbeat failed on epoch increment

-- after this point, the client doesn't experience login delay any more --

cockroach.log:W210623 14:16:57.135077 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500238113s; err=context deadline exceeded
cockroach.log:W210623 14:17:01.635347 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500200418s; err=‹aborted in DistSender: context deadline exceeded›
cockroach.log:W210623 14:17:05.578698 63614 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r6/3:‹/Table/{SystemCon…-11}›] slow heartbeat took 7.007826896s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:17:05.579189 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 3.943750846s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:17:05.579565 68377 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r3/2:‹/System/{NodeLive…-tsd}›] slow heartbeat took 1.671525872s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:17:05.579989 68424 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r4/2:‹/System{/tsd-tse}›] slow heartbeat took 1.528752237s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:17:15.135057 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500228331s; err=context deadline exceeded
cockroach.log:W210623 14:17:19.635231 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 3.802752131s; err=context deadline exceeded
cockroach.log:W210623 14:17:24.135498 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500160088s; err=context deadline exceeded
cockroach.log:W210623 14:17:25.034981 79866 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r6/3:‹/Table/{SystemCon…-11}›] slow heartbeat took 9.206951399s; err=heartbeat failed on epoch increment
cockroach.log:W210623 14:17:33.135094 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500235161s; err=context deadline exceeded
cockroach.log:W210623 14:17:35.237710 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 2.102542378s; err=<nil>
cockroach.log:W210623 14:17:35.237950 96823 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r4/2:‹/System{/tsd-tse}›] slow heartbeat took 1.11815483s; err=<nil>
cockroach.log:W210623 14:17:35.238176 96869 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r3/2:‹/System/{NodeLive…-tsd}›] slow heartbeat took 1.118141729s; err=<nil>
cockroach.log:W210623 14:17:46.635080 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.50018725s; err=context deadline exceeded
cockroach.log:W210623 14:17:55.635182 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500354615s; err=context deadline exceeded
cockroach.log:W210623 14:18:00.135594 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500244238s; err=‹aborted in DistSender: context deadline exceeded›
cockroach.log:W210623 14:18:04.635823 182 kv/kvserver/node_liveness.go:748 ⋮ [n2,liveness-hb] slow heartbeat took 4.500120754s; err=context deadline exceeded
cockroach.log:W210623 14:18:04.980352 116520 kv/kvserver/node_liveness.go:748 ⋮ [n2,s2,r6/3:‹/Table/{SystemCon…-11}›] slow heartbeat took 9.314270655s; err=<nil>

The log paste above separates the slow heartbeats into two phases: one while the client is experiencing delays, and another after which the client seems fine.

What's different between the two?

  • both phases see heartbeat failures on r3, r4, r6
  • only the first phase (the one with the problem) sees heartbeat failures on r5 and r8
  • later, the second phase (the one without a problem) also sees failures on r12, r20, r33

What are these ranges?

Range  Keys                                            Note
r3     /System/NodeLivenessMax - /System/tsd           empty range?
r4     /System/tsd - /System/"tse"                     timeseries
r5     /System/"tse" - /Table/SystemConfigSpan/Start   empty range?
r6     /Table/SystemConfigSpan/Start - /Table/11       system config
r8     /Table/12 - /Table/13                           system.eventlog
r12    /Table/16 - /Table/17                           ??
r20    /Table/20 - /Table/21                           system.table_statistics
r33    /Table/37 - /Table/38                           system.scheduled_jobs

@knz (Contributor, Author) commented Jun 23, 2021

I only looked at the heartbeat failures. There are more events in the logs about the circuit breaker, updating node descriptors, etc., but I haven't investigated those yet.

@rafiss (Collaborator) commented Jun 23, 2021

This is great!

We should also break this down into two categories of slow heartbeats:

  • The ones that take 4.5 seconds make sense to me, since the heartbeat is run with a context timeout of 4.5 seconds (see the sketch after this list):
    if err := contextutil.RunWithTimeout(ctx, "node liveness heartbeat", nl.livenessThreshold-nl.heartbeatInterval,
  • There are also heartbeats that take 8-9 seconds, and those are much more confusing and concerning to me. It indicates we aren't respecting the context timeout somewhere.
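For reference, here is a minimal standalone sketch (standard library only, not the actual crdb contextutil helper or liveness code) of how a bound like that is supposed to work, assuming the timeout is the default 9s liveness threshold minus the 4.5s heartbeat interval. Heartbeats observed at 8-9s suggest some step is escaping a bound like this:

package main

import (
	"context"
	"fmt"
	"time"
)

// runWithTimeout mimics the shape of contextutil.RunWithTimeout: it runs fn
// under a derived context that is canceled after the given timeout.
func runWithTimeout(ctx context.Context, timeout time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return fn(ctx)
}

func main() {
	// Assumed values: livenessThreshold (9s) - heartbeatInterval (4.5s) = 4.5s.
	const hbTimeout = 4500 * time.Millisecond

	err := runWithTimeout(context.Background(), hbTimeout, func(ctx context.Context) error {
		// Simulated slow heartbeat RPC that would take 10s on its own.
		select {
		case <-time.After(10 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err() // returns "context deadline exceeded" after ~4.5s
		}
	})
	fmt.Println(err) // context deadline exceeded
}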

@knz (Contributor, Author) commented Jun 23, 2021

yes you're right

but even the 4.5s timeouts can be a problem if they cascade. I'm thinking there's likely a cascading failure: first a timeout on meta, then a timeout on liveness, then a timeout on system.users, which together will force a >15s delay on at least one client conn.

@rafiss (Collaborator) commented Jun 23, 2021

Definitely agree that it would be ideal to improve the 4.5s behavior and fail faster if we can.

Sharing one snippet from these gRPC docs that may be relevant:

WaitForReady configures the action to take when an RPC is attempted on broken connections or unreachable servers. If waitForReady is false, the RPC will fail immediately. Otherwise, the RPC client will block the call until a connection is available (or the call is canceled or times out) and will retry the call if it fails due to a transient error. gRPC will not retry if data was written to the wire unless the server indicates it did not process the data. Please refer to https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md.

By default, RPCs don't "wait for ready".

So it seems like gRPC is supposed to fail more quickly. Would it be worth investigating this further?

And also possibly related issues grpc/grpc-go#1443 and grpc/grpc-go#2663
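To make the default concrete, here is a small hypothetical grpc-go sketch (not crdb code; the address and the use of the standard health service are placeholders). With the default fail-fast behavior, an RPC against a connection that is not ready returns Unavailable as soon as the connection attempt fails, while grpc.WaitForReady(true) makes it block until the connection recovers or the context deadline expires:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Lazy, non-blocking dial to an address assumed to be unreachable.
	conn, err := grpc.Dial("10.0.0.1:26257", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Default ("fail fast"): the RPC returns Unavailable once the underlying
	// connection attempt fails, instead of waiting out ctx.
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{})
	log.Printf("fail-fast call: %v", err)

	// Opt-in wait-for-ready: the RPC blocks until the connection becomes
	// READY or ctx hits its deadline.
	_, err = client.Check(ctx, &healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	log.Printf("wait-for-ready call: %v", err)
}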

@rafiss (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):

		"--args=--accept-sql-without-tls",
		"--env=COCKROACH_SCAN_INTERVAL=200ms",
		"--env=COCKROACH_SCAN_MAX_IDLE_TIME=20ms",

consider commenting why we use these environment variables


pkg/cmd/roachtest/network.go, line 243 at r5 (raw file):

				url := fmt.Sprintf("postgres://testuser:password@%s/defaultdb?sslmode=require", serverAddrs[i-1])
				b, err := c.RunWithBuffer(ctx, t.l, clientNode, "time", "-p", "./cockroach", "sql",

ah interesting, so you've changed it to connect from another GCE instance to all the server nodes. any particular reason why? the testing done by the customer was just logging directly onto the server and connecting to localhost


pkg/cmd/roachtest/network.go, line 267 at r5 (raw file):

sudo iptables -P OUTPUT ACCEPT;

# Drop any node-to-node crdb traffic.

i had set up more aggressive rules in my draft, since i thought we'd want to drop all ICMP traffic as well. i guess it's not necessary to repro though, but maybe still useful to have the more aggressive rule? blocking all traffic (or as much as we can) matches more closely with the kind of testing done by the customer


pkg/cmd/roachtest/network.go, line 277 at r5 (raw file):

	// (attempt to) restore iptables when test end, so that cluster
	// can be investigated afterwards.

well, we'd still be able to SSH into the node without undoing these rules. or did you mean something else?


pkg/cmd/roachtest/network.go, line 291 at r5 (raw file):

	t.l.Printf("waiting for client to do its thing...")

	wg.Wait()

do we still need wg if we move the m.Wait() call to here?


pkg/cmd/roachtest/network.go, line 292 at r5 (raw file):

	wg.Wait()
	if errorCount >= 1 /*c.Spec().NodeCount*/ {

i had left this commented as a TODO/reminder for myself. i don't really know what the expected number of failures is. should it be 0? (if so, change to errorCount > 0 and remove the commented code)


pkg/cmd/roachtest/network.go, line 450 at r5 (raw file):

	})
	r.Add(testSpec{
		Name:    fmt.Sprintf("network/authentication/nodes=%d", numNodes),

it wasn't entirely clear to me why these tests use 4 nodes and not 3. i suppose it's fine though.

@rafiss (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cmd/roachtest/network.go, line 450 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

it wasn't entirely clear to me why these tests use 4 nodes and not 3. i suppose it's fine though.

retracting this -- i see it's because one of the nodes is only used for trying to connect to the others.

@nvanbenschoten (Member) commented:

As a data point, I've been running this with the settings from #66550 (comment) to see the full extent of one of these failover delays. I have yet to see any login attempt take longer than 22 seconds on v20.2.

Unfortunately, I am running into trouble in a few areas with trying to get this running with v20.1. @rafiss, have you had any luck with that?

@nvanbenschoten (Member) commented Jun 23, 2021

I upgraded from v20.2.10 to release-20.2's HEAD and saw this delay drop down to about 9 seconds on each attempt. I believe (but have not confirmed) that this is because of @lunevalex's #66552. While that change only shaves off 4.5 seconds during the recovery of a single range, it has the potential to compound across the multiple dependent ranges at play here. I'll try to narrow the improvement down to just this commit.

@rafiss (Collaborator) commented Jun 23, 2021

For this issue (https://github.com/cockroachlabs/support/issues/1022) the customer reported the behavior on 20.2.10 as well as 20.1.8; so I've been focusing on 20.2.10 only so far.

I just tried with 20.1.8 now, and did get an error:

22:00:18 cluster.go:388: > /Users/rafiss/go/src/github.com/cockroachdb/cockroach/bin/roachprod start --secure --args=--locality=node=1 --args=--accept-sql-without-tls --env=COCKROACH_SCAN_INTERVAL=200ms --env=COCKROACH_SCAN_MAX_IDLE_TIME=20ms rafi-1624485452-01-n4cpu4:1
rafi-1624485452-01-n4cpu4: checking certs
rafi-1624485452-01-n4cpu4: starting nodes
0: ~ #!/bin/bash: exit status 1
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*crdbInstallHelper).startNode
  | 	/Users/rafiss/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:373
  | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func1
  | 	/Users/rafiss/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:160
  | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).ParallelE.func1.1
  | 	/Users/rafiss/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1690
  | runtime.goexit
  | 	/usr/local/opt/go@1.15/libexec/src/runtime/asm_amd64.s:1374

  | sudo systemd-run --unit cockroach \
  |   --same-dir --uid $(id -u) --gid $(id -g) \
  |   --service-type=notify -p NotifyAccess=all \
  |   -p MemoryMax=95% \
  |   -p LimitCORE=infinity \
  |   -p LimitNOFILE=65536 \
  | 	bash $0 run
  | EOF
  | Job for cockroach.service failed because the control process exited with error code.
  | See "systemctl status cockroach.service" and "journalctl -xe" for details.
$ journalctl -xe

Jun 23 22:00:17 rafi-1624485452-01-n4cpu4-0001 bash[10414]: /home/ubuntu/cockroach-helper.sh: line 20: 10422 Killed                  ${HOME}/cockroach start --certs-dir=${HOME>
Jun 23 22:00:17 rafi-1624485452-01-n4cpu4-0001 bash[10748]: cockroach exited with code 137: Wed Jun 23 22:00:17 UTC 2021
Jun 23 22:00:17 rafi-1624485452-01-n4cpu4-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=137/n/a

@nvanbenschoten (Member) commented:

I just tried with 20.1.8 now, and did get an error:

I found that this error goes away if you use a roachprod binary from the release-20.1 branch. But I then found that the --accept-sql-without-tls flag was unsupported on v20.1.

I'll try to narrow the improvement down to just this commit.

I can confirm. When running 0fcebbf, I see a single delay of about 9 to 10 seconds. When running on its parent SHA, I see a single delay that lasts somewhere between 20 and 24 seconds.

@nvanbenschoten (Member) commented:

After reverting #66550 (comment) and running with 0fcebbf, I still do see one login attempt hit the 10s timeout, as expected. However, I then see no subsequent slowdown at all on any other connection attempt.

I'll see how stable the test is when I bump server.user_login.timeout from 10s to 15s.

@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)


pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

consider commenting why we use these environment variables

will do


pkg/cmd/roachtest/network.go, line 243 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

ah interesting, so you've changed it to connect from another GCE instance to all the server nodes. any particular reason why? the testing done by the customer was just logging directly onto the server and connecting to localhost

Yes, it's to exclude any variability due to the network distance between my computer (where roachtest is running) and the GCE instance.


pkg/cmd/roachtest/network.go, line 267 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i had set up more aggressive rules in my draft, since i thought we'd want to drop all ICMP traffic as well. i guess it's not necessary to repro though, but maybe still useful to have the more aggressive rule? blocking all traffic (or as much as we can) matches more closely with the kind of testing done by the customer

We should focus on reducing the scope of the test as much as possible, to get to exactly what the customer observes.
Once we have that, we can fix that issue, and then widen the test scope to see if other issues remain.

Also:

  • I see your point wrt ICMP but I'm pretty sure crdb does not use ICMP.
  • regardless of which rules you use we should be careful to only block the traffic between the server nodes and not the traffic from the roachtest process (otherwise the monitor will bark, and it won't be possible to properly interact with the gce instance from the gce console and roachprod)

pkg/cmd/roachtest/network.go, line 277 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

well, we'd still be able to SSH into the node without undoing these rules. or did you mean something else?

I want the ability to re-start the test using the same GCE instance, to save on reinitialization costs.


pkg/cmd/roachtest/network.go, line 291 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

do we still need wg if we move the m.Wait() call to here?

the monitor code is generally unreliable. I wouldn't trust it to organize synchronization between phases of the test code.


pkg/cmd/roachtest/network.go, line 292 at r5 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i had left this commented as a TODO/reminder for myself. i don't really know what the expected number of failures is. should it be 0? (if so, change to errorCount > 0 and remove the commented code)

I'll reply to this in a next comment.

@knz (Contributor, Author) commented Jun 24, 2021

Here are a few things I'd like to see happen:

  • do the additional logging inside the crdb code as described in this comment. This will also come in useful for #66777 (rpc: add metrics for client-side and server-side RPC request latencies and log events for latency anomalies).

  • explore how much the test result changes when we remove the env vars during server initialization. Right now I'm using these env vars because they dramatically reduce the time needed to wait for the initial rebalancing and lease migration (from ~1 minute to <5 seconds).

  • change the structure of the async client goroutine. Right now, the code does a round-robin connection attempt to all the non-partitioned nodes. This is not what the customer is doing. Instead, I want to have 1 goroutine per server node, with each goroutine attempting to connect to the same node in sequence, retrying with no delay if it observes a connection error (see the sketch after this comment). I want to see this confirm that:

    1. it can establish connections without delay, OR
    2. if there is a delay/error once, there is no delay afterwards.
  • evolve the test to run longer, and try partitioning multiple nodes over time (e.g. perhaps a round-robin partition, one node split from the network at a time), and each time verify the client connection behavior.

The last two points are important to bring this test code into production:

  • we'll want to have it running in nightlies
  • in the long term, I believe it's OK for clients to experience a connection delay once after a network partition, during the time needed by crdb to heal. The test should thus tolerate at least one failure in the short amount of time after the network partition occurs. Not more than one failure though; that's what the customer doesn't want.
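A rough Go sketch of the per-node client structure described in the third bullet above. Everything here (runPerNodeClients, the connect callback, the 500ms slowness threshold, the addresses) is a hypothetical placeholder, not the actual roachtest helpers:

// Hypothetical sketch (not the actual roachtest code) of the proposed client
// structure: one goroutine per server node, each repeatedly connecting to the
// same node, retrying with no delay, and counting slow or failed attempts.
package main

import (
	"context"
	"errors"
	"log"
	"sync"
	"time"
)

// runPerNodeClients launches one client goroutine per server address.
func runPerNodeClients(ctx context.Context, serverAddrs []string, attempts int,
	connect func(ctx context.Context, addr string) error, // e.g. open a SQL conn, SELECT 1, close
) {
	var wg sync.WaitGroup
	for _, addr := range serverAddrs {
		addr := addr
		wg.Add(1)
		go func() {
			defer wg.Done()
			slow := 0
			for i := 0; i < attempts && ctx.Err() == nil; i++ {
				start := time.Now()
				err := connect(ctx, addr)
				if err != nil || time.Since(start) > 500*time.Millisecond {
					slow++
				}
				// No sleep between attempts: the expectation is at most one
				// slow or failed attempt per node right after the partition.
			}
			if slow > 1 {
				log.Printf("node %s: %d slow or failed connection attempts", addr, slow)
			}
		}()
	}
	wg.Wait()
}

func main() {
	// Dummy connect function standing in for a real SQL connection attempt.
	connect := func(ctx context.Context, addr string) error {
		if addr == "10.0.0.2:26257" {
			return errors.New("connection refused") // simulate a partitioned node
		}
		return nil
	}
	runPerNodeClients(context.Background(), []string{"10.0.0.2:26257", "10.0.0.3:26257"}, 10, connect)
}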

@knz (Contributor, Author) commented Jun 24, 2021

I'll see how stable the test is when I bump server.user_login.timeout from 10s to 15s.

@nvanbenschoten did you mean bump it down to 5s? Presumably a longer timeout will tolerate more errors, not less.

craig bot pushed a commit that referenced this pull request Jul 7, 2021
66919: sql: cache hashed password and auth info during authentication r=knz,RichardJCai a=rafiss

See individual commits
fixes #58869

---

This improves authentication performance in multiregion clusters and
improves availability when the system range leaseholder goes down.

I ran the test from #66768
and observed that there was no period of authentication unavailability.

Previously, each authentication attempt would require one read from each
of system.users and system.role_options. In multiregion clusters, this
adds quite a bit of excess latency. Also, if those tables' leaseholders
were not available, then authentication to all nodes would be affected.

This cache is meant to alleviate both problems. It's kept up to date by
updating the table descriptors' versions whenever authentication related
information is changed with CREATE/ALTER/DROP ROLE.

Release note (performance improvement): The latency of authenticating a
user has been improved by adding a cache for lookups of authentication
related information.

Release note (ops change): There is now a
server.authentication_cache.enabled cluster setting. The setting
defaults to true. When enabled, this cache stores authentication-related
data and will improve the latency of authentication attempts. Keeping
the cache up to date adds additional overhead when using the CREATE,
UPDATE, and DELETE ROLE commands. To minimize the overhead, any bulk
ROLE operations should be run inside of a transaction. To make the cache
more effective, any regularly-scheduled ROLE updates should be done all
together, rather than occurring throughout the day at all times.

Release note (security update): There is now a cache for per-user
authentication related information. The data in the cache is always kept
up-to-date because it checks if any change to the underlying
authentication tables has been made since the last time the cache was
updated. The cached data includes the user's hashed password, the
NOLOGIN role option, and the VALID UNTIL role option.

67135: sql/pgwire: fix statement buffer memory leak when using suspended portals r=rafiss a=joesankey

The connection statement buffer grows indefinitely when the client uses the
execute portal with limit feature of the Postgres protocol, eventually causing
the node to crash out of memory. Any long running query that uses the limit
feature will cause this memory leak such as the `EXPERIMENTAL CHANGEFEED FOR`
statement. The execute portal with limit feature of the Postgres protocol is
used by the JDBC Postgres driver to fetch a limited number of rows at a time.

The leak is caused by commands accumulating in the buffer and never getting
cleared out. The client sends 2 commands every time it wants to receive more
rows:

- `Execute {"Portal": "C_1", "MaxRows": 1}`
- `Sync`

The server processes the commands and leaves them in the buffer, every
iteration causes 2 more commands to leak.

A similar memory leak was fixed by #48859, however the execute with limit
feature is implemented by a side state machine in limitedCommandResult. The
cleanup routine added by #48859 is never executed for suspended portals as
they never return to the main conn_executor loop.

After this change the statement buffer gets trimmed to reclaim memory after
each client command is processed in the limitedCommandResult side state
machine. The StmtBuf.Ltrim function was changed to be public visibility to
enable this. While this is not ideal, it does scope the fix to the
limitedCommandResult side state machine and could be addressed when the
limitedCommandResult functionality is refactored into the conn_executor.

Added a unit test which causes the leak; the test uses the PGWire client, as
neither the pg nor pgx clients use execute with limit, so they can't be used to
demonstrate the leak. Also tested the fix in a cluster by following the steps
outlined in #66849.

Resolves: #66849

See also: #48859

Release note (bug fix): fix statement buffer memory leak when using suspended
portals.

@asubiotto @andreimatei 


Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: joesankey <joesankey@gmail.com>
craig bot pushed a commit that referenced this pull request Jul 7, 2021

66919: sql: cache hashed password and auth info during authentication r=knz,RichardJCai a=rafiss (same commit message as above)

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@knz knz changed the title [WIP] roachtest: add test for authentication under network partition roachtest: add test for authentication under network partition Jul 26, 2021
@knz knz marked this pull request as ready for review July 26, 2021 15:03
@knz (Contributor, Author) commented Jul 26, 2021

This is ready for review now. RFAL

@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)


pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):

Previously, knz (kena) wrote…

will do

Removed them.

@rafiss (Collaborator) left a comment

this is wonderful. thanks for taking this over and bringing it to a finish and all the cleanup+enhancements in this test. lgtm!

i wonder two things:

  • now that i've merged the AuthInfoCache PR, would there be value in running this test once with the cache enabled, and once disabled? the setting is server.authentication_cache.enabled. i'd be able to add that additional test myself, so we can merge this PR now
  • is there still value in adding the logging you mentioned in roachtest: add test for authentication under network partition #66550 (comment) ? i'm not familiar enough with that part of the code to say. if the logging is useful, then i think it'd be better to spin that off into its own issue and keep this PR as is

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)

@rafiss (Collaborator) commented Aug 24, 2021

@knz friendly ping -- can this be merged?

the CI failure is

pkg/cmd/roachtest/tests/network.go:32:2: a blank import should be only in a main or test package, or have a comment justifying it
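For context, that lint message is the standard Go style check for blank imports; it is satisfied by a justifying comment next to the import. A generic illustration follows (the package name and lib/pq path are examples of a driver imported for side effects, not necessarily the actual import in network.go):

// Generic example of a blank import with a justifying comment, which is what
// the linter asks for outside main/test packages. The imported path is
// illustrative only.
package tests

import (
	// Imported for its side effect of registering the "postgres" driver
	// with database/sql.
	_ "github.com/lib/pq"
)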

@tbg tbg removed their request for review August 26, 2021 08:25
@rafiss (Collaborator) commented Sep 14, 2021

fixed the lint issue and merging after checking with knz

bors r=rafiss,knz

@craig (bot) commented Sep 14, 2021

Build failed (retrying...):

@yuzefovich (Member) commented:

Seems like this needs an update.

bors r-

@craig (bot) commented Sep 14, 2021

Canceled.

@rafiss (Collaborator) commented Sep 14, 2021

i don't think it needs an update. Github CI is passing after I retriggered. I'll give it another shot

bors r=rafiss,knz

@yuzefovich (Member) commented:

I don't think so - it calls a method that doesn't exist in the interface anymore:

db, err := c.ConnSecure(ctx, 1, "root", localCertsDir, 26257)

and I couldn't immediately figure out which one to call, so I didn't update the PR myself.

bors r-

@craig (bot) commented Sep 14, 2021

Canceled.

Customers have raised major complaints about this situation. If a node
disappears under a network partition (or the host is hard-killed), then
it is possible that no connections can be established to any other node.
Specifically, if the system.users leaseholder is the node that
disappears, authentication can fail. We expect on the order of ~10
seconds of authentication unavailability, but customers have shown us
an outage period of up to 40 seconds.

This test tries to repro what our customers have shown us so that we can
iterate on the root cause more quickly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@rafiss (Collaborator) commented Sep 14, 2021

I see, thanks for the explanation. Fixed

@rafiss (Collaborator) commented Sep 14, 2021

bors r=rafiss,knz

@craig (bot) commented Sep 15, 2021

Build succeeded:

@craig craig bot merged commit a043a34 into cockroachdb:master Sep 15, 2021