
roachtest: ensure reasonable time-to-recovery after node outage #21536

Closed
nvanbenschoten opened this issue Jan 18, 2018 · 9 comments

Labels: A-kv-server (Relating to the KV-level RPC server), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

@nvanbenschoten (Member)

While playing with https://github.com/rystsov/perseus I observed that the loss of a leaseholder has varying effects on throughput. In some cases, a partitioned leaseholder caused only a few seconds of unavailability; in others, it caused upwards of 45 seconds. During the testing, the load generator was performing a single write at a time in a closed loop, which matters because any stuck write was immediately visible: throughput dropped to 0.

I did some digging through traces to try to find out why writes were getting stuck for so long. The following trace is representative of most of the long traces I saw.

root@:26258/crdb_internal> select duration, SUBSTRING(message, 0, 140) message from session_trace where txn_idx = 1;
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|      duration      |                                                                   message                                                                   |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
| 42s756ms271µs564ns | === SPAN START: sql txn implicit ===                                                                                                        |
| NULL               | executing 1/1: UPDATE lily.storage SET value = '1'::TEXT WHERE key = '1'::TEXT                                                              |
| NULL               | added table 'lily.storage' to table collection                                                                                              |
| NULL               | found table in table collection for table 'lily.storage'                                                                                    |
| NULL               | query not supported for distSQL: mutations not supported                                                                                    |
| NULL               | Scan /Table/51/1/"1"{-/#}                                                                                                                   |
| NULL               | querying next range at /Table/51/1/"1"                                                                                                      |
| NULL               | r32: sending batch 1 Scan to (n3,s3):3                                                                                                      |
| NULL               | sending request to 172.18.0.2:26257                                                                                                         |
| 2s742ms797µs215ns  | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | error                                                                                                                                       |
| NULL               | RPC error: rpc error: code = Unavailable desc = transport is closing                                                                        |
| NULL               | error: {<nil> rpc error: code = Unavailable desc = transport is closing}; trying next peer (n1,s1):1                                        |
| NULL               | sending request to local server                                                                                                             |
| 302µs249ns         | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | 1 Scan                                                                                                                                      |
| NULL               | read has no clock uncertainty                                                                                                               |
| NULL               | executing 1 requests                                                                                                                        |
| NULL               | read-only path                                                                                                                              |
| NULL               | replica.Send got error: [NotLeaseHolderError] r32: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=5 start=15162507 |
| NULL               | *roachpb.NotLeaseHolderError                                                                                                                |
| NULL               | application error: [NotLeaseHolderError] r32: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=5 start=1516250793.44 |
| NULL               | error: {(err: [NotLeaseHolderError] r32: replica (n1,s1):1 not lease holder; current lease is repl=(n3,s3):3 seq=5 start=1516250793.4407557 |
| NULL               | sending request to 172.18.0.2:26257                                                                                                         |
| 40s1ms119µs130ns   | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | error                                                                                                                                       |
| NULL               | RPC error: rpc error: code = Unavailable desc = grpc: the connection is unavailable                                                         |
| NULL               | error: {<nil> rpc error: code = Unavailable desc = grpc: the connection is unavailable}; trying next peer (n2,s2):2                         |
| NULL               | sending request to 172.18.0.3:26257                                                                                                         |
| 10ms156µs219ns     | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | === SPAN START: /cockroach.roachpb.Internal/Batch ===                                                                                       |
| NULL               | 1 Scan                                                                                                                                      |
| NULL               | executing 1 requests                                                                                                                        |
| NULL               | read-only path                                                                                                                              |
| NULL               | command queue                                                                                                                               |
| NULL               | waiting for read lock                                                                                                                       |
| NULL               | read completed                                                                                                                              |
+--------------------+---------------------------------------------------------------------------------------------------------------------------------------------+

Everything happening here takes place in the context of DistSender and grpcTransport:

  1. a first request is sent to the leaseholder
  2. the request fails with a transport is closing error after a few seconds
  3. a second request is sent to the next replica in line
  4. the request fails with a NotLeaseHolderError, pointing back to the first replica
  5. a third request is sent to the original leaseholder. This time, the transport is closed so it attempts to re-establish a connection
  6. the request can take 30 or more seconds before it eventually fails with a connection is unavailable error
  7. a fourth request is sent to the next replica in line
  8. the request succeeds

In the past we've had debates over whether we should continue to iterate through candidate replicas indefinitely (or until the context is canceled) or whether we should iterate a maximum number of times. I'm going to avoid this debate for now. My question is whether we should be more aggressive in failing fast on the attempted reconnection to a node we've already observed to be down (step 5 above). This would allow us to avoid the long delay we see in step 6. I can see us being more aggressive in a few different ways:

  1. using gRPC's FailFast call option
  2. setting a deadline for the attempted connection to the expiration time of the lease returned in the NotLeaseHolderError
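
For concreteness, here's a rough sketch of what those two options could look like at a single call site, assuming a raw grpc-go connection. The function name, the lease-expiration plumbing, and the request/response types are placeholders for illustration, not anything from the actual DistSender/grpcTransport code:

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

func sendBatchFailFast(
	ctx context.Context,
	conn *grpc.ClientConn,
	req, resp interface{},
	leaseExpiration time.Time, // assumed to come from the NotLeaseHolderError
) error {
	// Option 2: don't wait on this replica past the point at which some other
	// replica could have acquired the lease.
	ctx, cancel := context.WithDeadline(ctx, leaseExpiration)
	defer cancel()

	// Option 1: fail fast instead of queueing the RPC while gRPC tries to
	// re-establish the connection. In newer grpc-go this is spelled
	// grpc.WaitForReady(false); grpc.FailFast(true) is the older equivalent.
	return conn.Invoke(ctx, "/cockroach.roachpb.Internal/Batch", req, resp,
		grpc.WaitForReady(false))
}
```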

cc. @andreimatei @bdarnell - you two have had strong opinions on this in the past

@andreimatei (Contributor)

:) What happened to your "why would you ever give up if the client is still waiting" stance?
As you know, I think it's hard to do worse than what we do currently, so I think either of your suggestions would be good (although I'm not sure what fail-fast technically means for establishing connections; I'd be curious what timeouts are involved).

But, of course, your attempt at side-stepping the million-dollar question is shallow. You have observed that your read takes too long, and you're essentially proposing that we fail it (since you're not proposing we increase the number of replica attempts to infinity). So, then, consider doing it right and don't try any replica twice. Except in a "we're chasing the lease that's moving around" scenario, for which we can introduce detection.

@nvanbenschoten (Member, Author)

> What happened to your "why would you ever give up if the client is still waiting" stance?

To clarify, since my original post may have been unclear and misinterpreted: I'm not advocating that we give up while the client is still waiting. I still think we should continue until the client tells us to stop trying, either through their own fail-fast flag or a deadline. All of that is separate from the root of this issue, though.

What I'm basically saying is that while iterating through different replicas, we should make sure we never waste too much time trying to send a request to any one replica. In the case of a partition, gRPC seems to do a good job of detecting failures on an established connection and quickly returning a "transport is closing" error; this usually takes a few seconds. However, it does a poor job of deciding that a new connection will not succeed and quickly returning a "the connection is unavailable" error; I was seeing this take 30-40 seconds. It seems pretty bad that we would notice a connection fail and then still spend so much time trying to re-establish it when we could go back to trying other nodes.
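
A minimal sketch of that "never spend too long on any one replica" idea, assuming a hypothetical sendOnce closure for a single replica attempt; the 3s value is illustrative, not a setting from the codebase:

```go
package main

import (
	"context"
	"time"
)

// perReplicaAttemptTimeout is an illustrative bound, roughly matching how long
// gRPC takes to report "transport is closing" on an established connection.
const perReplicaAttemptTimeout = 3 * time.Second

// trySendBounded wraps one replica attempt in its own timeout so that a slow
// reconnection attempt (the 30-40s case above) can't stall the whole retry
// loop. The caller's ctx still bounds the overall operation.
func trySendBounded(ctx context.Context, sendOnce func(context.Context) error) error {
	attemptCtx, cancel := context.WithTimeout(ctx, perReplicaAttemptTimeout)
	defer cancel()
	return sendOnce(attemptCtx) // on error, the caller moves to the next replica
}
```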

@andreimatei (Contributor)

OK, but then you're also saying that we ought to change the way the transport currently gets exhausted after trying every replica, and make the retries infinite. If you do that blindly, you get into a hot loop where you keep ping-ponging between two replicas. So, as we discussed, what you'd really want is a new mechanism in DistSender for waiting for a lease to expire, so you can start anew once someone else can grab a new lease.

@nvanbenschoten (Member, Author)

> So, as we discussed, what you'd really want is a new mechanism in DistSender for waiting for a lease to expire, so you can start anew once someone else can grab a new lease.

Yeah, that's kind of what I was getting at with the idea of:

> setting a deadline for the attempted connection to the expiration time of the lease returned in the NotLeaseHolderError

@bdarnell (Member)

What build were you testing? Shouldn't #21376 take care of this?

> using gRPC's FailFast call option

I thought we already did use FailFast in all gRPC calls from the DistSender (I claimed we did in grpc/grpc-go#1443). If not, we probably should.

> setting a deadline for the attempted connection to the expiration time of the lease returned in the NotLeaseHolderError

Yes, we discussed making better use of this expiration information way back in #3196 (comment)

@nvanbenschoten (Member, Author)

> What build were you testing? Shouldn't #21376 take care of this?

1.1.3 and v2.0-alpha.20180116. #21376 was not included in either of these, so it may help the situation. I'll take a look.

> I thought we already did use FailFast in all gRPC calls from the DistSender

I'm having trouble finding evidence of that:

Nathans-MBP:cockroach$ git grep 'FailFast' pkg/ | wc -l
       0

The other way I see failfast being enabled is through the waitForReady option specified in https://github.com/grpc/grpc/blob/master/doc/service_config.md, but it doesn't look like we use that either.

Nathans-MBP:cockroach$ git grep 'waitForReady' pkg/ | wc -l
       0

That said, it looks like the only way errConnUnavailable can ever be returned is if we either use a balancer or have failfast enabled. We don't use a gRPC Balancer, so I must be missing something.
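
For reference, a sketch of what the service-config route mentioned above might look like; grpc.WithDefaultServiceConfig is a newer grpc-go dial option and nothing in the codebase sets this today, so treat it purely as an illustration of the mechanism:

```go
package main

import (
	"context"

	"google.golang.org/grpc"
)

// Illustrative only: request waitForReady=false (i.e. fail-fast) for the
// Internal service via a default service config supplied at dial time.
const failFastServiceConfig = `{
  "methodConfig": [{
    "name": [{"service": "cockroach.roachpb.Internal"}],
    "waitForReady": false
  }]
}`

func dialWithFailFastConfig(ctx context.Context, target string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, target,
		grpc.WithInsecure(), // illustration only; real connections use TLS
		grpc.WithDefaultServiceConfig(failFastServiceConfig),
	)
}
```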

@bdarnell (Member)

> I thought we already did use FailFast in all gRPC calls from the DistSender

> I'm having trouble finding evidence of that:

The API is weird: IIRC we never set an option named FailFast, but the fail-fast mode was implied by the absence of the WithBlock option. It's possible that when they introduced the actual FailFast option this assumption no longer holds.
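
In other words, roughly (a hypothetical illustration in plain grpc-go; the target and request/response types are placeholders, not our rpc.Context configuration):

```go
package main

import (
	"context"

	"google.golang.org/grpc"
)

func dialAndCall(ctx context.Context, target string, req, resp interface{}) error {
	// Without grpc.WithBlock, Dial returns immediately and connects in the
	// background; adding grpc.WithBlock would make it wait until the
	// connection is ready.
	conn, err := grpc.DialContext(ctx, target, grpc.WithInsecure())
	if err != nil {
		return err
	}
	defer conn.Close()

	// Per call, fail-fast is the default. grpc.WaitForReady(false) states it
	// explicitly (the older API spelled this grpc.FailFast(true));
	// grpc.WaitForReady(true) would instead queue the RPC until the
	// connection becomes ready.
	return conn.Invoke(ctx, "/cockroach.roachpb.Internal/Batch", req, resp,
		grpc.WaitForReady(false))
}
```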

@petermattis added the A-kv-server and C-enhancement labels Jul 21, 2018
@nvanbenschoten added this to the 2.1 milestone Jul 22, 2018
@nvanbenschoten added this to Milestone D - Planning in Core Performance Jul 22, 2018
@tbg added this to Backlog in KV Jul 31, 2018
@nvanbenschoten self-assigned this Jul 31, 2018
@petermattis removed this from the 2.1 milestone Oct 5, 2018
@tbg (Member) commented Oct 11, 2018

@nvanbenschoten we use circuit breakers in the nodeDialer. This doesn't solve the problem at all, but if we told the circuit breaker about the connection closing (as a "fail" event) we could rely on the counters for the next connection attempt and use a more restrictive connection timeout.

But shouldn't we generally use a connection timeout that's ~0.5*liveness timeout?
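
A toy sketch of that combination: count connection-closed events as breaker failures and, once a node looks suspect, dial it with a timeout derived from the liveness duration. The breaker type and the 9s liveness figure are made up for illustration; the real nodeDialer uses a proper circuit-breaker package and different plumbing.

```go
package main

import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc"
)

// breaker is a hand-rolled stand-in for the nodeDialer's circuit breaker.
type breaker struct {
	mu       sync.Mutex
	failures int
}

func (b *breaker) Fail()    { b.mu.Lock(); b.failures++; b.mu.Unlock() }
func (b *breaker) Success() { b.mu.Lock(); b.failures = 0; b.mu.Unlock() }
func (b *breaker) Tripped() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures > 0
}

// livenessDuration is an illustrative value, not an actual cluster setting.
const livenessDuration = 9 * time.Second

func dialNode(ctx context.Context, b *breaker, target string) (*grpc.ClientConn, error) {
	// If we've already recorded a failure (e.g. "transport is closing"),
	// don't spend longer than ~0.5*liveness duration re-establishing the
	// connection before moving on to other replicas.
	if b.Tripped() {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, livenessDuration/2)
		defer cancel()
	}
	conn, err := grpc.DialContext(ctx, target, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		b.Fail()
		return nil, err
	}
	b.Success()
	return conn, nil
}
```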

I think this should also be a roachtest. I should finally get #23141 over the finish line and use it for this test.

@tbg self-assigned this Oct 11, 2018
@tbg moved this from Backlog to On the horizon in KV Oct 11, 2018
@tbg moved this from On the horizon to Testing in KV Oct 11, 2018
@tbg changed the title from "rpc: previously closed transports should fail-fast on attempted reconnection" to "roachtest: ensure reasonable time-to-recovery after node outage" Oct 11, 2018
@nvanbenschoten moved this from Milestone D - In Progress to Experiments / Proposals in Core Performance Oct 12, 2018
@nvanbenschoten moved this from Experiments to Backlog in Core Performance Oct 12, 2018
@nvanbenschoten removed this from Backlog in Core Performance Apr 16, 2019
@nvanbenschoten (Member, Author)

The focus of this issue is still noble, but the issue itself is unactionable. Closing.

KV automation moved this from Ideas to improve Testing to Closed Aug 12, 2020