roachtest: add test for authentication under network partition #66768
Conversation
973fb0a to 2ba0fc7
Here are the node logs from the repro (1 is the node with all the leases originally, that got shut off): Immediately after blocking traffic with
that last failure failed with the customer's reported error message
then:
and then after that, all further connection attempts were resolved quickly (in less than 0.2s). As detailed in the timeline above, both n2 and n3 were encountering a >4.5s delay for up to 30 seconds. This strongly confirms the customer's observation. So the period of interest is from the iptables command at 14:16:07 to when connections became normal again at 14:16:48. Here are the "interesting" heartbeat failures from n2:
The log paste above separates the two phases of slow heartbeats, one "while the client is experiencing delays" and another "after this point the client seems fine". What's different between the two?
What are these ranges?
I only looked at the heartbeat failures. There are more events in the logs about the circuit breaker, updating node descriptors etc. but I haven't investigated those yet.
This is great! We should also break this down into two categories of slow heartbeats:
yes you're right but even the 4.5s timeouts can be a problem if they are cascading. I'm thinking that there's likely a cascading failure: first a timeout on meta, then a timeout on liveness, then a timeout on system.users, which will force a >15s timeout on at least one client conn.
Definitely agree that it would be ideal to improve the 4.5s behavior and fail faster if we can. Sharing one snippet from these gRPC docs that may be relevant:
So it seems like gRPC is supposed to fail more quickly. Would it be worth investigating this further? And also possibly related issues grpc/grpc-go#1443 and grpc/grpc-go#2663
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):
"--args=--accept-sql-without-tls", "--env=COCKROACH_SCAN_INTERVAL=200ms", "--env=COCKROACH_SCAN_MAX_IDLE_TIME=20ms",
consider commenting why we use these environment variables
pkg/cmd/roachtest/network.go, line 243 at r5 (raw file):
url := fmt.Sprintf("postgres://testuser:password@%s/defaultdb?sslmode=require", serverAddrs[i-1])
b, err := c.RunWithBuffer(ctx, t.l, clientNode, "time", "-p", "./cockroach", "sql",
ah interesting, so you've changed it to connect from another GCE instance to all the server nodes. any particular reason why? the testing done by the customer was just logging directly onto the server and connecting to localhost
pkg/cmd/roachtest/network.go, line 267 at r5 (raw file):
sudo iptables -P OUTPUT ACCEPT; # Drop any node-to-node crdb traffic.
i had set up more aggressive rules in my draft, since i thought we'd want to drop all ICMP traffic as well. i guess it's not necessary to repro though, but maybe still useful to have the more aggressive rule? blocking all traffic (or as much as we can) matches more closely with the kind of testing done by the customer
pkg/cmd/roachtest/network.go, line 277 at r5 (raw file):
// (attempt to) restore iptables when the test ends, so that the cluster
// can be investigated afterwards.
well, we'd still be able to SSH into the node without undoing these rules. or did you mean something else?
pkg/cmd/roachtest/network.go, line 291 at r5 (raw file):
t.l.Printf("waiting for client to do its thing...")
wg.Wait()
do we still need wg if we move the m.Wait() call to here?
pkg/cmd/roachtest/network.go, line 292 at r5 (raw file):
wg.Wait()
if errorCount >= 1 /*c.Spec().NodeCount*/ {
i had left this commented as a TODO/reminder for myself. i don't really know what the expected number of failures is. should it be 0? (if so, change to errorCount > 0 and remove the commented code)
pkg/cmd/roachtest/network.go, line 450 at r5 (raw file):
})
r.Add(testSpec{
	Name: fmt.Sprintf("network/authentication/nodes=%d", numNodes),
it wasn't entirely clear to me why these tests use 4 nodes and not 3. i suppose it's fine though.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cmd/roachtest/network.go, line 450 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
it wasn't entirely clear to me why these tests use 4 nodes and not 3. i suppose it's fine though.
retracting this -- i see it's because one of the nodes is only used for trying to connect to the others.
As a data point, I've been running this with the settings from #66550 (comment) to see the full extent of one of these failover delays. I have yet to see any login attempt take longer than 22 seconds on v20.2. Unfortunately, I am running into trouble in a few areas with trying to get this running with v20.1. @rafiss, have you had any luck with that?
I upgraded from
For this issue (https://github.com/cockroachlabs/support/issues/1022) the customer reported the behavior on 20.2.10 as well as 20.1.8; so I've been focusing on 20.2.10 only so far. I just tried with 20.1.8 now, and did get an error:
I found that this error goes away if you use a roachprod binary from the release-20.1 branch. But I then found that the
I can confirm. When running 0fcebbf, I see a single delay of about 9 to 10 seconds. When running on its parent SHA, I see a single delay that lasts somewhere between 20 and 24 seconds.
After reverting #66550 (comment) and running with 0fcebbf, I still do see one login attempt hit the 10s timeout, as expected. However, I then see no subsequent slowdown at all on any other connection attempt. I'll see how stable the test is when I bump
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)
pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
consider commenting why we use these environment variables
will do
pkg/cmd/roachtest/network.go, line 243 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ah interesting, so you've changed it to connect from another GCE instance to all the server nodes. any particular reason why? the testing done by the customer was just logging directly onto the server and connecting to localhost
Yes, it's to exclude any variability due to the network distance between my computer, where roachtest is running, and the GCE instance.
pkg/cmd/roachtest/network.go, line 267 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i had set up more aggressive rules in my draft, since i thought we'd want to drop all ICMP traffic as well. i guess it's not necessary to repro though, but maybe still useful to have the more aggressive rule? blocking all traffic (or as much as we can) matches more closely with the kind of testing done by the customer
We should focus on reducing the scope of the test as much as possible to get to what the customer observes.
Once we have that, we can fix that issue, and then step over to a wider test scope to see if other issues remain.
Also:
- I see your point wrt ICMP but I'm pretty sure crdb does not use ICMP.
- regardless of which rules you use we should be careful to only block the traffic between the server nodes and not the traffic from the roachtest process (otherwise the monitor will bark, and it won't be possible to properly interact with the gce instance from the gce console and roachprod)
pkg/cmd/roachtest/network.go, line 277 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
well, we'd still be able to SSH into the node without undoing these rules. or did you mean something else?
I want the ability to restart the test using the same GCE instance, to save on reinitialization costs.
pkg/cmd/roachtest/network.go, line 291 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
do we still need wg if we move the m.Wait() call to here?
the monitor code is generally unreliable. I wouldn't trust it to organize synchronization between phases of the test code.
pkg/cmd/roachtest/network.go, line 292 at r5 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i had left this commented as a TODO/reminder for myself. i don't really know what the expected number of failures is. should it be 0? (if so, change to errorCount > 0 and remove the commented code)
I'll reply to this in a next comment.
Here are a few things I'd like to see happen:
The last two points are important to bring this test code into production:
@nvanbenschoten did you mean bump it down to 5s? Presumably a longer timeout will tolerate more errors, not less. |
66919: sql: cache hashed password and auth info during authentication r=knz,RichardJCai a=rafiss

See individual commits. Fixes #58869

This improves authentication performance in multiregion clusters and improves availability when the system range leaseholder goes down. I ran the test from #66768 and observed that there was no period of authentication unavailability. Previously, each authentication attempt would require one read from each of system.users and system.role_options. In multiregion clusters, this adds quite a bit of excess latency. Also, if those tables' leaseholders were not available, then authentication to all nodes would be affected. This cache is meant to alleviate both problems. It's kept up to date by updating the table descriptors' versions whenever authentication-related information is changed with CREATE/ALTER/DROP ROLE.

Release note (performance improvement): The latency of authenticating a user has been improved by adding a cache for lookups of authentication-related information.

Release note (ops change): There is now a server.authentication_cache.enabled cluster setting. The setting defaults to true. When enabled, this cache stores authentication-related data and will improve the latency of authentication attempts. Keeping the cache up to date adds additional overhead when using the CREATE, UPDATE, and DELETE ROLE commands. To minimize the overhead, any bulk ROLE operations should be run inside a transaction. To make the cache more effective, any regularly scheduled ROLE updates should be done all together, rather than occurring throughout the day at all times.

Release note (security update): There is now a cache for per-user authentication-related information. The data in the cache is always kept up to date because it checks whether any change to the underlying authentication tables has been made since the last time the cache was updated. The cached data includes the user's hashed password, the NOLOGIN role option, and the VALID UNTIL role option.

67135: sql/pgwire: fix statement buffer memory leak when using suspended portals r=rafiss a=joesankey

The connection statement buffer grows indefinitely when the client uses the execute portal with limit feature of the Postgres protocol, eventually causing the node to crash out of memory. Any long-running query that uses the limit feature will cause this memory leak, such as the `EXPERIMENTAL CHANGEFEED FOR` statement. The execute portal with limit feature of the Postgres protocol is used by the JDBC Postgres driver to fetch a limited number of rows at a time. The leak is caused by commands accumulating in the buffer and never getting cleared out. The client sends 2 commands every time it wants to receive more rows:

- `Execute {"Portal": "C_1", "MaxRows": 1}`
- `Sync`

The server processes the commands and leaves them in the buffer; every iteration causes 2 more commands to leak.

A similar memory leak was fixed by #48859; however, the execute with limit feature is implemented by a side state machine in limitedCommandResult. The cleanup routine added by #48859 is never executed for suspended portals, as they never return to the main conn_executor loop. After this change, the statement buffer gets trimmed to reclaim memory after each client command is processed in the limitedCommandResult side state machine. The StmtBuf.Ltrim function was changed to public visibility to enable this. While this is not ideal, it does scope the fix to the limitedCommandResult side state machine and could be addressed when the limitedCommandResult functionality is refactored into the conn_executor.

Added a unit test which causes the leak; it uses the PGWire client, as neither the pg nor pgx clients use execute with limit, so they can't be used to demonstrate the leak. Also tested the fix in a cluster by following the steps outlined in #66849.

Resolves: #66849
See also: #48859

Release note (bug fix): fix statement buffer memory leak when using suspended portals.

@asubiotto @andreimatei

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: joesankey <joesankey@gmail.com>
2ba0fc7 to c71d8f6
This is ready for review now. RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)
pkg/cmd/roachtest/network.go, line 128 at r5 (raw file):
Previously, knz (kena) wrote…
will do
Removed them.
this is wonderful. thanks for taking this over and bringing it to a finish and all the cleanup+enhancements in this test. lgtm!
i wonder two things:
- now that i've merged the AuthInfoCache PR, would there be value in running this test once with the cache enabled, and once disabled? the setting is server.authentication_cache.enabled. i'd be able to add that additional test myself, so we can merge this PR now
- is there still value in adding the logging you mentioned in #66550 (comment)? i'm not familiar enough with that part of the code to say. if the logging is useful, then i think it'd be better to spin that off into its own issue and keep this PR as is
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @tbg)
@knz friendly ping -- can this be merged? the CI failure is
c71d8f6 to ce6a330
fixed the lint issue and merging after checking with knz bors r=rafiss,knz
Build failed (retrying...):
Seems like this needs an update. bors r-
Canceled.
i don't think it needs an update. GitHub CI is passing after I retriggered. I'll give it another shot bors r=rafiss,knz
I don't think so - it calls a method that doesn't exist in the interface anymore:
and I couldn't immediately figure out which one to call, so I didn't update the PR myself. bors r-
Canceled.
Customers have raised major complaints about this situation. If a node disappears under a network partition (or the host is hard-killed), then it is possible that no connections can be established to any other node. Specifically, if the system.users leaseholder is the node that disappears, authentication can fail. We expect on the order of ~10 seconds of authentication unavailability, but customers have shown us an outage period of up to 40 seconds. This test tries to repro what our customers have shown us so that we can iterate on the root cause more quickly. Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
ce6a330 to ffc4ef9
I see, thanks for the explanation. Fixed
bors r=rafiss,knz
Build succeeded:
Supersedes #66550