
roachtest: ensure reasonable time-to-recovery after node outage #21536

Closed
nvanbenschoten opened this issue Jan 18, 2018 · 9 comments

Labels: A-kv-server (Relating to the KV-level RPC server), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

@nvanbenschoten (Member)

While playing with https://github.com/rystsov/perseus I observed that the loss of a leaseholder has varying effects on throughput. In some cases, a partitioned leaseholder caused only a few seconds of unavailability; in others, it caused upwards of 45 seconds. During the testing, the load generator was performing a single write at a time in a closed loop, which matters because any stuck write was immediately visible: throughput dropped to 0.

I did some digging through traces to try to find out why writes were getting stuck for so long. The following trace is representative of most of the long traces I saw.

root@:26258/crdb_internal> select duration, SUBSTRING(message, 0, 140) message from session_trace where txn_idx = 1;
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|      duration      |                                                                   message                                                                   |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
| 42s756ms271µs564ns | === SPAN START: sql txn implicit ===                                                                                                        |
| NULL               | executing 1/1: UPDATE lily.storage SET value = '1'::TEXT WHERE key = '1'::TEXT                                                              |
| NULL               | added table 'lily.storage' to table collection                                                                                              |
| NULL               | found table in table collection for table 'lily.storage'                                                                                    |
| NULL               | query not supported for distSQL: mutations not supported                                                                                    |
| NULL               | Scan /Table/51/1/"1"{-/#}                                                                                                                   |
| NULL               | querying next range at /Table/51/1/"1"                                                                                                      |
| NULL               | r32: sending batch 1 Scan to (n3,s3):3                                                                                                      |
| NULL               | sending request to 172.18.0.2:26257                                                                                                         |
| 2s742ms797µs215ns  | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | error                                                                                                                                       |
| NULL               | RPC error: rpc error: code = Unavailable desc = transport is closing                                                                        |
| NULL               | error: {<nil> rpc error: code = Unavailable desc = transport is closing}; trying next peer (n1,s1):1                                        |
| NULL               | sending request to local server                                                                                                             |
| 302µs249ns         | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | 1 Scan                                                                                                                                      |
| NULL               | read has no clock uncertainty                                                                                                               |
| NULL               | executing 1 requests                                                                                                                        |
| NULL               | read-only path                                                                                                                              |
| NULL               | replica.Send got error: [NotLeaseHolderError] r32: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=5 start=15162507 |
| NULL               | *roachpb.NotLeaseHolderError                                                                                                                |
| NULL               | application error: [NotLeaseHolderError] r32: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=5 start=1516250793.44 |
| NULL               | error: {(err: [NotLeaseHolderError] r32: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=5 start=1516250793.4407557 |
| NULL               | sending request to 172.18.0.2:26257                                                                                                         |
| 40s1ms119µs130ns   | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | error                                                                                                                                       |
| NULL               | RPC error: rpc error: code = Unavailable desc = grpc: the connection is unavailable                                                         |
| NULL               | error: {<nil> rpc error: code = Unavailable desc = grpc: the connection is unavailable}; trying next peer (n2,s2):2                         |
| NULL               | sending request to 172.18.0.3:26257                                                                                                         |
| 10ms156µs219ns     | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | 1 Scan                                                                                                                                      |
| NULL               | executing 1 requests                                                                                                                        |
| NULL               | read-only path                                                                                                                              |
| NULL               | command queue                                                                                                                               |
| NULL               | waiting for read lock                                                                                                                       |
| NULL               | read completed                                                                                                                              |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------+

Everything happening here takes place in the context of DistSender and grpcTransport:

  1. a first request is sent to the leaseholder
  2. the request fails with a transport is closing error after a few seconds
  3. a second request is sent to the next replica in line
  4. the request fails with a NotLeaseHolderError, pointing back to the first replica
  5. a third request is sent to the original leaseholder. This time, the transport is closed so it attempts to re-establish a connection
  6. the request can take 30 or more seconds before it eventually fails with a connection is unavailable error
  7. a fourth request is sent to the next replica in line
  8. the request succeeds

In the past we've had debates over whether we should continue to iterate through candidate replicas indefinitely (or until the context is canceled) or whether we should iterate a maximum number of times. I'm going to avoid this debate for now. My question is whether we should be more aggressive in failing fast on the attempted reconnection to a node we've already observed to be down (step 5 above). This would allow us to avoid the long delay we see in step 6. I can see us being more aggressive in a few different ways:

  1. using gRPC's FailFast call option
  2. setting a deadline for the attempted connection to the expiration time of the lease returned in the NotLeaseHolderError
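
For concreteness, here's a rough sketch of what those two options could look like at a single call site, assuming a raw grpc-go connection. The function name, the lease-expiration plumbing, and the request/response types are placeholders for illustration, not anything from the actual DistSender/grpcTransport code:

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

func sendBatchFailFast(
	ctx context.Context,
	conn *grpc.ClientConn,
	req, resp interface{},
	leaseExpiration time.Time, // assumed to come from the NotLeaseHolderError
) error {
	// Option 2: don't wait on this replica past the point at which some other
	// replica could have acquired the lease.
	ctx, cancel := context.WithDeadline(ctx, leaseExpiration)
	defer cancel()

	// Option 1: fail fast instead of queueing the RPC while gRPC tries to
	// re-establish the connection. In newer grpc-go this is spelled
	// grpc.WaitForReady(false); grpc.FailFast(true) is the older equivalent.
	return conn.Invoke(ctx, "/cockroach.roachpb.Internal/Batch", req, resp,
		grpc.WaitForReady(false))
}
```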

cc. @andreimatei @bdarnell - you two have had strong opinions on this in the past

@andreimatei (Contributor)

:) What happened to your "why would you ever give up if the client is still waiting" stance?
As you know, I think it's hard to do worse than what we do currently, so I think either of your suggestions would be good (although I'm not sure what fail-fast technically means for establishing connections; I'd be curious what timeouts are involved).

But, of course, your attempt at side-stepping the million-dollar question is shallow. You have observed that your read takes too long, and you're essentially proposing that we fail it (since you're not proposing we increase the number of replica attempts to infinity). So, then, consider doing it right and don't try any replica twice. Except in a "we're chasing the lease that's moving around" scenario, for which we can introduce detection.

@nvanbenschoten (Member, Author)

> What happened to your "why would you ever give up if the client is still waiting" stance?

To clarify, since my original post may have been unclear and misinterpreted: I'm not advocating that we give up while the client is still waiting. I still think we should continue until the client tells us to stop trying, either through their own fail-fast flag or a deadline. All of that is separate from the root of this issue, though.

What I'm basically saying is that while iterating through different replicas, we should make sure we never waste too much time trying to send a request to any one replica. In the case of a partition, gRPC seems to do a good job of detecting failures on an established connection and quickly returning a "transport is closing" error; this usually takes a few seconds. However, it does a poor job of deciding that a new connection will not succeed and quickly returning a "the connection is unavailable" error; I was seeing this take 30-40 seconds. It seems pretty bad that we would notice a connection fail and then still spend so much time trying to re-establish it when we could go back to trying other nodes.
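
A minimal sketch of that "never spend too long on any one replica" idea, assuming a hypothetical sendOnce closure for a single replica attempt; the 3s value is illustrative, not a setting from the codebase:

```go
package main

import (
	"context"
	"time"
)

// perReplicaAttemptTimeout is an illustrative bound, roughly matching how long
// gRPC takes to report "transport is closing" on an established connection.
const perReplicaAttemptTimeout = 3 * time.Second

// trySendBounded wraps one replica attempt in its own timeout so that a slow
// reconnection attempt (the 30-40s case above) can't stall the whole retry
// loop. The caller's ctx still bounds the overall operation.
func trySendBounded(ctx context.Context, sendOnce func(context.Context) error) error {
	attemptCtx, cancel := context.WithTimeout(ctx, perReplicaAttemptTimeout)
	defer cancel()
	return sendOnce(attemptCtx) // on error, the caller moves to the next replica
}
```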

@andreimatei (Contributor)

OK, but then you're also saying that we ought to change the way the transport currently gets exhausted after trying every replica, and make the retries infinite. If you do that blindly, you get into a hot loop where you keep ping-ponging between two replicas. So, as we discussed, what you'd really want is a new mechanism in DistSender for waiting for a lease to expire, so you can start anew once someone else can grab a new lease.

@nvanbenschoten (Member, Author)

> So, as we discussed, what you'd really want is a new mechanism in DistSender for waiting for a lease to expire, so you can start anew once someone else can grab a new lease.

Yeah, that's kind of what I was getting at with the idea of:

> setting a deadline for the attempted connection to the expiration time of the lease returned in the NotLeaseHolderError

@bdarnell (Member)

What build were you testing? Shouldn't #21376 take care of this?

> using gRPC's FailFast call option

I thought we already did use FailFast in all gRPC calls from the DistSender (I claimed we did in grpc/grpc-go#1443). If not, we probably should.

> setting a deadline for the attempted connection to the expiration time of the lease returned in the NotLeaseHolderError

Yes, we discussed making better use of this expiration information way back in #3196 (comment)

@nvanbenschoten (Member, Author)

> What build were you testing? Shouldn't #21376 take care of this?

1.1.3 and v2.0-alpha.20180116. #21376 was not included in either of these, so it may help the situation. I'll take a look.

> I thought we already did use FailFast in all gRPC calls from the DistSender

I'm having trouble finding evidence of that:

Nathans-MBP:cockroach$ git grep 'FailFast' pkg/ | wc -l
       0

The other way I see failfast being enabled is through the waitForReady option specified in https://github.com/grpc/grpc/blob/master/doc/service_config.md, but it doesn't look like we use that either.

Nathans-MBP:cockroach$ git grep 'waitForReady' pkg/ | wc -l
       0

That said, it looks like the only way errConnUnavailable can ever be returned is if we either use a balancer or have failfast enabled. We don't use a gRPC Balancer, so I must be missing something.
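
For reference, a sketch of what the service-config route mentioned above might look like; grpc.WithDefaultServiceConfig is a newer grpc-go dial option and nothing in the codebase sets this today, so treat it purely as an illustration of the mechanism:

```go
package main

import (
	"context"

	"google.golang.org/grpc"
)

// Illustrative only: request waitForReady=false (i.e. fail-fast) for the
// Internal service via a default service config supplied at dial time.
const failFastServiceConfig = `{
  "methodConfig": [{
    "name": [{"service": "cockroach.roachpb.Internal"}],
    "waitForReady": false
  }]
}`

func dialWithFailFastConfig(ctx context.Context, target string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, target,
		grpc.WithInsecure(), // illustration only; real connections use TLS
		grpc.WithDefaultServiceConfig(failFastServiceConfig),
	)
}
```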

@bdarnell (Member)

> I thought we already did use FailFast in all gRPC calls from the DistSender

> I'm having trouble finding evidence of that:

The API is weird: IIRC we never set an option named FailFast, but the fail-fast mode was implied by the absence of the WithBlock option. It's possible that when they introduced the actual FailFast option this assumption no longer holds.
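
In other words, roughly (a hypothetical illustration in plain grpc-go; the target and request/response types are placeholders, not our rpc.Context configuration):

```go
package main

import (
	"context"

	"google.golang.org/grpc"
)

func dialAndCall(ctx context.Context, target string, req, resp interface{}) error {
	// Without grpc.WithBlock, Dial returns immediately and connects in the
	// background; adding grpc.WithBlock would make it wait until the
	// connection is ready.
	conn, err := grpc.DialContext(ctx, target, grpc.WithInsecure())
	if err != nil {
		return err
	}
	defer conn.Close()

	// Per call, fail-fast is the default. grpc.WaitForReady(false) states it
	// explicitly (the older API spelled this grpc.FailFast(true));
	// grpc.WaitForReady(true) would instead queue the RPC until the
	// connection becomes ready.
	return conn.Invoke(ctx, "/cockroach.roachpb.Internal/Batch", req, resp,
		grpc.WaitForReady(false))
}
```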

@petermattis added the A-kv-server and C-enhancement labels Jul 21, 2018
@nvanbenschoten added this to the 2.1 milestone Jul 22, 2018
@nvanbenschoten added this to Milestone D - Planning in Core Performance Jul 22, 2018
@tbg added this to Backlog in KV Jul 31, 2018
@nvanbenschoten self-assigned this Jul 31, 2018
@petermattis removed this from the 2.1 milestone Oct 5, 2018
@tbg (Member) commented Oct 11, 2018

@nvanbenschoten we use circuit breakers in the nodeDialer. This doesn't solve the problem at all, but if we told the circuit breaker about the connection closing (as a "fail" event) we could rely on the counters for the next connection attempt and use a more restrictive connection timeout.

But shouldn't we generally use a connection timeout that's ~0.5*liveness timeout?
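
A toy sketch of that combination: count connection-closed events as breaker failures and, once a node looks suspect, dial it with a timeout derived from the liveness duration. The breaker type and the 9s liveness figure are made up for illustration; the real nodeDialer uses a proper circuit-breaker package and different plumbing.

```go
package main

import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc"
)

// breaker is a hand-rolled stand-in for the nodeDialer's circuit breaker.
type breaker struct {
	mu       sync.Mutex
	failures int
}

func (b *breaker) Fail()    { b.mu.Lock(); b.failures++; b.mu.Unlock() }
func (b *breaker) Success() { b.mu.Lock(); b.failures = 0; b.mu.Unlock() }
func (b *breaker) Tripped() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures > 0
}

// livenessDuration is an illustrative value, not an actual cluster setting.
const livenessDuration = 9 * time.Second

func dialNode(ctx context.Context, b *breaker, target string) (*grpc.ClientConn, error) {
	// If we've already recorded a failure (e.g. "transport is closing"),
	// don't spend longer than ~0.5*liveness duration re-establishing the
	// connection before moving on to other replicas.
	if b.Tripped() {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, livenessDuration/2)
		defer cancel()
	}
	conn, err := grpc.DialContext(ctx, target, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		b.Fail()
		return nil, err
	}
	b.Success()
	return conn, nil
}
```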

I think this should also be a roachtest. I should finally get #23141 over the finish line and use it for this test.

@tbg self-assigned this Oct 11, 2018
@tbg moved this from Backlog to On the horizon in KV Oct 11, 2018
@tbg moved this from On the horizon to Testing in KV Oct 11, 2018
@tbg changed the title from "rpc: previously closed transports should fail-fast on attempted reconnection" to "roachtest: ensure reasonable time-to-recovery after node outage" Oct 11, 2018
@nvanbenschoten moved this from Milestone D - In Progress to Experiments / Proposals in Core Performance Oct 12, 2018
@nvanbenschoten moved this from Experiments to Backlog in Core Performance Oct 12, 2018
@nvanbenschoten removed this from Backlog in Core Performance Apr 16, 2019
@nvanbenschoten (Member, Author)

The focus of this issue is still noble, but the issue itself is unactionable. Closing.

KV automation moved this from Ideas to improve Testing to Closed Aug 12, 2020