roachtest: ensure reasonable time-to-recovery after node outage #21536
Comments
:) What happened to your "why would you ever give up if the client is still waiting" stance? But, of course, your attempt at side-stepping the million-dollar question is shallow. You have observed that your read takes too long, and you're essentially proposing that we fail it (since you're not proposing we increase the number of replica attempts to infinity). So, then, consider doing it right and don't try any replica twice. Except in a "we're chasing the lease that's moving around" scenario, for which we can introduce detection.
To clarify, because my original post may have been unclear and misinterpreted: I'm not advocating that we give up while the client is still waiting. I still think we should continue until the client tells us to stop trying, either through their own fail-fast flag or a deadline. All of that is separate from the root of this issue, though. What I'm basically saying here is that, in the process of iterating through different replicas, we should make sure we never waste too much time trying to send a request to any one replica. In the case of a partition, gRPC seems to do a good job of detecting connection failures and throwing a `transport is closing` error.
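To make the per-replica bound concrete: a minimal sketch, assuming hypothetical `Replica`/`Request`/`Response` types, a hypothetical `sendToReplica` helper, and an illustrative 3s budget. It is not the actual `DistSender` code, just the shape of the idea.

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// Hypothetical stand-ins for the real types; illustrative only.
type Replica struct{ NodeID int }
type Request struct{}
type Response struct{}

// sendToReplica is a hypothetical helper standing in for the real gRPC transport.
func sendToReplica(ctx context.Context, r Replica, req Request) (Response, error) {
	// ... issue the RPC with ctx so the per-attempt deadline applies ...
	return Response{}, errors.New("not implemented in this sketch")
}

// perReplicaAttemptTimeout bounds how long we spend on any one replica; the
// value is illustrative, not what CockroachDB actually uses.
const perReplicaAttemptTimeout = 3 * time.Second

// sendToReplicas caps the time spent on each replica attempt while the
// caller's ctx still governs how long we keep trying overall.
func sendToReplicas(ctx context.Context, replicas []Replica, req Request) (Response, error) {
	var lastErr error
	for _, r := range replicas {
		attemptCtx, cancel := context.WithTimeout(ctx, perReplicaAttemptTimeout)
		resp, err := sendToReplica(attemptCtx, r, req)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if ctx.Err() != nil {
			// The client gave up (deadline or cancellation); stop trying.
			return Response{}, ctx.Err()
		}
	}
	return Response{}, lastErr
}
```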
OK, but then you're also saying that we ought to change the way the transport currently gets exhausted after trying every replica, and make the retries infinite. And if you do that blindly, you get into a hot loop where you keep ping-ponging between two replicas. So, as we discussed, what you'd really want is a new mechanism in DistSender for waiting for a lease to expire, so you can start anew once someone else can grab a new lease.
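A rough sketch of that "wait out the lease before the next pass" mechanism, with hypothetical `trySendToAllReplicas` and `sendResult` names; this is not a claim about how `DistSender` actually does (or should do) it.

```go
package sketch

import (
	"context"
	"time"
)

type Response struct{}

// sendResult is a hypothetical summary of one pass over the replicas.
type sendResult struct {
	resp     Response
	ok       bool
	leaseExp time.Time // expiration of the lease we kept getting pointed at, if known
}

// trySendToAllReplicas is a hypothetical helper that tries every replica once.
func trySendToAllReplicas(ctx context.Context) sendResult {
	// ... one pass through the transport ...
	return sendResult{}
}

// sendWithLeaseWait avoids hot-looping between replicas that point at each
// other: after exhausting the transport, it waits until the current lease can
// have expired before starting a new pass, or until the client gives up.
func sendWithLeaseWait(ctx context.Context) (Response, error) {
	for {
		res := trySendToAllReplicas(ctx)
		if res.ok {
			return res.resp, nil
		}
		// Every replica has been tried; wait until the lease we were chasing
		// can have expired, so a new leaseholder can emerge, then try again.
		wait := time.Until(res.leaseExp)
		if wait < 0 {
			wait = 0
		}
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return Response{}, ctx.Err()
		}
	}
}
```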
Yeah, that's kind of what I was getting at with the idea of:
What build were you testing? Shouldn't #21376 take care of this?
I thought we already did use FailFast in all gRPC calls from the DistSender (I claimed we did in grpc/grpc-go#1443). If not, we probably should.
Yes, we discussed making better use of this expiration information way back in #3196 (comment).
I'm having trouble finding evidence of that:
The other way I see failfast being enabled is through the
That said, it looks like the only way |
The API is weird: IIRC we never set an option named FailFast, but fail-fast mode was implied by the absence of the WithBlock option. It's possible that when they introduced the actual FailFast option, this assumption no longer holds.
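For anyone reading this later, a small sketch of the two knobs being discussed, as they look in more recent grpc-go (where the FailFast call option has been superseded by WaitForReady): WithBlock is a dial option that only affects the initial connection attempt, while fail-fast vs. wait-for-ready is chosen per call. The address and timeout below are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Dial option: WithBlock makes Dial wait for the initial connection to be
	// established (or fail). It says nothing about how later RPCs behave when
	// the connection breaks.
	conn, err := grpc.DialContext(ctx, "node1:26257", // placeholder address
		grpc.WithInsecure(), // plaintext only for the sketch
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Call options: fail-fast vs. wait-for-ready is decided per RPC.
	// WaitForReady(false) (the default, i.e. fail-fast) makes an RPC on a
	// connection in TRANSIENT_FAILURE return an error immediately;
	// WaitForReady(true) queues it until the connection is READY again.
	failFast := grpc.WaitForReady(false)
	waitForReady := grpc.WaitForReady(true)
	_ = failFast
	_ = waitForReady
}
```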
@nvanbenschoten we use circuit breakers in the nodeDialer. This doesn't solve the problem at all, but if we told the circuit breaker about the connection closing (as a "fail" event), we could rely on the counters for the next connection attempt and use a more restrictive connection timeout. But shouldn't we generally use a connection timeout that's ~0.5× the liveness timeout? I think this should also be a roachtest. I should finally get #23141 over the finish line and use it for this test.
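A toy sketch of feeding connection failures into a breaker and tightening the dial timeout on the next attempt. The "breaker" here is a bare failure counter rather than the real nodeDialer/circuit-breaker machinery, and the 10s/2s values are placeholders for whatever we'd derive from the liveness timeout.

```go
package sketch

import (
	"context"
	"sync/atomic"
	"time"

	"google.golang.org/grpc"
)

// breaker is a stand-in for a real circuit breaker: it only tracks whether the
// last connection attempt to this node failed.
type breaker struct{ consecutiveFailures int64 }

func (b *breaker) fail()         { atomic.AddInt64(&b.consecutiveFailures, 1) }
func (b *breaker) success()      { atomic.StoreInt64(&b.consecutiveFailures, 0) }
func (b *breaker) tripped() bool { return atomic.LoadInt64(&b.consecutiveFailures) > 0 }

// dialNode uses a generous timeout for a node we believe is healthy and a much
// stricter one once the breaker has seen a failure. The concrete values are
// illustrative; the suggestion in this thread is to derive the timeout from
// the liveness duration (~0.5 × liveness timeout).
func dialNode(ctx context.Context, b *breaker, addr string) (*grpc.ClientConn, error) {
	timeout := 10 * time.Second
	if b.tripped() {
		timeout = 2 * time.Second
	}
	dialCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	conn, err := grpc.DialContext(dialCtx, addr, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		// Tell the breaker, so the next attempt to this node is stricter.
		b.fail()
		return nil, err
	}
	b.success()
	return conn, nil
}
```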
The focus of this issue is still noble, but the issue itself is unactionable. Closing.
While playing with https://github.com/rystsov/perseus, I observed that the loss of a leaseholder had varying effects on throughput. In some cases, a leaseholder being partitioned would cause only a few seconds of unavailability; in others, it would cause upwards of 45 seconds of unavailability. During the testing, the load generator was performing a single write at a time in a closed loop. This is important because it meant that if any write got stuck, it was blatantly obvious: throughput would drop to 0.
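For reference, that load pattern is easy to reproduce with a closed-loop, single-statement writer. This sketch assumes a locally reachable cluster and an existing `kv (k INT PRIMARY KEY, v INT)` table; the connection string and table are placeholders, not part of perseus itself.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

func main() {
	// Placeholder connection string; adjust for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One write at a time in a closed loop: if a single write stalls,
	// throughput visibly drops to zero, which is what made the long
	// outages obvious.
	for i := 0; ; i++ {
		start := time.Now()
		if _, err := db.Exec(`UPSERT INTO kv (k, v) VALUES (1, $1)`, i); err != nil {
			log.Printf("write %d failed after %s: %v", i, time.Since(start), err)
			continue
		}
		if d := time.Since(start); d > time.Second {
			log.Printf("write %d took %s", i, d)
		}
	}
}
```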
I did some digging through traces to try to find out why writes were getting stuck for so long. The following trace is representative of most of the long traces I saw.
What we have happening here all takes place in the context of `DistSender` and `grpcTransport`:

1. The write is sent to the first replica, the leaseholder, which has just been partitioned away.
2. The RPC fails with a `transport is closing` error after a few seconds.
3. `DistSender` moves on to the next replica.
4. That replica returns a `NotLeaseHolderError`, pointing back to the first replica.
5. `DistSender` attempts to reconnect to the first replica, the node we've already observed to be down.
6. The reconnection attempt eventually fails with a `connection is unavailable` error, but only after a long delay.

In the past we've had debates over whether we should continue to iterate through candidate replicas indefinitely (or until a context cancels) or whether we should iterate a maximum number of times. I'm going to avoid this debate for now. My question is whether we should be more aggressive in failing fast on the attempted reconnection to a node we've already observed to be down (step 5 above). This would allow us to avoid the long delay we see in step 6. I can see us being more aggressive in a few different ways:

- using the gRPC `FailFast` call option (one way to fail fast on a known-broken connection is sketched after this list)
- doing something smarter with the `NotLeaseHolderError` that points us back to the replica we just failed to reach
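One concrete shape the "fail fast on a node we've already observed to be down" idea could take is to consult the gRPC connection state before re-dialing the replica an error pointed us back to. This is a sketch against grpc-go's public connectivity API, not the actual DistSender/transport code.

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// worthRetrying reports whether it is worth sending another RPC on conn right
// now. If the connection is already known to be broken, the caller can skip
// the replica (or use a much shorter timeout) instead of waiting out a long
// reconnection attempt.
func worthRetrying(conn *grpc.ClientConn) bool {
	switch conn.GetState() {
	case connectivity.TransientFailure, connectivity.Shutdown:
		// We've already observed this connection fail; don't burn the
		// client's patience on it again until it recovers.
		return false
	default:
		// Idle, Connecting, or Ready: give it a chance.
		return true
	}
}
```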
cc. @andreimatei @bdarnell - you two have had strong opinions on this in the past