kvserver: avoid need for manual tuning of rebalance rate setting #14768
Perhaps something like weighted fair queueing. Here's a Go implementation.
Would an ideal solution have the limiting sit on the incoming bandwidth rather than the outgoing? I'm thinking of the case in which nodes 1 and 2 both send to node 3, but node 1 streams a snapshot while node 2 has foreground traffic. It would really have to be node 3 that backs off the other two; nodes 1 and 2 would each try to go at full speed with no way of determining their relative priorities. Naively, I'd think that WFQ for reading from the connection should work, since there's flow control and the sender has to wait once it has filled up its window. I've searched around for precedent, which I'm sure must exist somewhere, but haven't really been successful.
I think we'd want foreground/background traffic prioritization on both the sender and receiver. The receiver might not have any foreground traffic, but the sender might. And vice versa.
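For illustration, here is a minimal sketch of the weighted-fair-queueing idea, not the linked implementation: each flow gets a weight, and items drain in order of virtual finish time, so a backlogged snapshot flow gets only a small share of the link while foreground work is queued but the full link when it's alone. The flow names and the 8:1 weighting are made-up examples.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is one unit of work (e.g. a chunk to send) tagged with its flow.
type item struct {
	flow   string
	finish float64 // virtual finish time assigned at enqueue
}

type itemHeap []*item

func (h itemHeap) Len() int            { return len(h) }
func (h itemHeap) Less(i, j int) bool  { return h[i].finish < h[j].finish }
func (h itemHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *itemHeap) Push(x interface{}) { *h = append(*h, x.(*item)) }
func (h *itemHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// wfq is a toy weighted fair queue: a flow with weight 8 drains roughly 8x as
// fast as one with weight 1 while both are backlogged, and an uncontended
// flow is never held back.
type wfq struct {
	weights map[string]float64
	last    map[string]float64 // last virtual finish time handed out per flow
	h       itemHeap
}

func (q *wfq) enqueue(flow string, size int) {
	// Simplification: no global virtual clock, so a flow that went idle
	// "catches up" instantly. Enough to show the mechanism.
	f := q.last[flow] + float64(size)/q.weights[flow]
	q.last[flow] = f
	heap.Push(&q.h, &item{flow: flow, finish: f})
}

func (q *wfq) dequeue() *item {
	if q.h.Len() == 0 {
		return nil
	}
	return heap.Pop(&q.h).(*item)
}

func main() {
	q := &wfq{
		weights: map[string]float64{"foreground": 8, "snapshot": 1},
		last:    map[string]float64{},
	}
	for i := 0; i < 4; i++ {
		q.enqueue("foreground", 64<<10) // 64 KiB chunks
		q.enqueue("snapshot", 64<<10)
	}
	for it := q.dequeue(); it != nil; it = q.dequeue() {
		fmt.Println(it.flow, it.finish)
	}
}
```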
@m-schneider Can you take this on during 1.2? The issue is a little vague now, and this will require an RFC. So, it'd be helpful if you worked with @tschottdorf on making this specific enough to be actionable as a first step.
I'd estimate that the coding portion of this is a small fraction of the work. The bigger task is to test the prioritization mechanism under a variety of rebalancing and recovery scenarios on real clusters.
@cuongdo Sure sounds like a great project for 1.2!
Toby and I discussed this and looked into what we can do with gRPC. Prioritizing traffic on the sender is fairly straightforward: we can use gRPC interceptors and the WFQ implementation that @petermattis linked. However, on the receiver there doesn't seem to be any straightforward way to do this. Before a recent optimization in gRPC (https://grpc.io/2017/08/22/grpc-go-perf-improvements.html) we could have blocked connections by stalling reads on connections that we wished to deprioritize via interceptors. Now, by the time the interceptor is invoked, the message has already been consumed and its connection window quota released, so there is an off-by-one that becomes an off-by-infinity for non-streaming RPCs (which we think use a new stream each time). If we fork gRPC, we can modify the connection and interceptor code to give us everything we need to block on a connection and throttle on the receiver side. We're following issue #17370 because it touches many of the same pathways.
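To make the sender-side half concrete, here's a hedged sketch of what an interceptor-based approach could look like: a stream client interceptor that wraps the returned stream so every SendMsg on a snapshot RPC waits on a shared rate limiter. The method-name match, the one-token-per-message pacing, and the package name are assumptions for illustration, not the actual CockroachDB wiring.

```go
package rpcprio

import (
	"context"
	"strings"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

// throttledStream wraps a ClientStream so every SendMsg waits on a limiter,
// letting low-priority streams be paced without touching gRPC internals.
type throttledStream struct {
	grpc.ClientStream
	lim *rate.Limiter
}

func (s *throttledStream) SendMsg(m interface{}) error {
	// Block until the limiter grants a token; note this paces per message,
	// not per byte, which is a simplification.
	if err := s.lim.Wait(s.Context()); err != nil {
		return err
	}
	return s.ClientStream.SendMsg(m)
}

// NewSnapshotThrottlingInterceptor paces messages on snapshot RPCs and leaves
// everything else untouched. The "RaftSnapshot" substring match is an assumed
// way to identify snapshot streams, not the real method routing.
func NewSnapshotThrottlingInterceptor(lim *rate.Limiter) grpc.StreamClientInterceptor {
	return func(
		ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption,
	) (grpc.ClientStream, error) {
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			return nil, err
		}
		if !strings.Contains(method, "RaftSnapshot") {
			return cs, nil // foreground traffic: no throttling
		}
		return &throttledStream{ClientStream: cs, lim: lim}, nil
	}
}
```

Wiring it in would be along the lines of grpc.Dial(addr, grpc.WithStreamInterceptor(NewSnapshotThrottlingInterceptor(lim))). The receiver side is where this falls apart, as described above: by the time an interceptor runs, gRPC has already read the message and returned its flow-control quota, so there's nothing left to push back on without a fork.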
Do we need to prioritize traffic at both the sender and recipient? I was imagining that we'd only prioritize traffic on the sender side, though I haven't thought this through in depth.
Change of heart from #14768 (comment)? :-) The basic problem is that if we only throttle on the sender, a node with nothing else going on will send snapshots at full volume regardless of what's happening on the recipients, and the recipients' foreground traffic will be impacted. There's also a question of scope here: is this specifically about putting something in place that lets snapshots go fast when nothing else is going on, or are we next going to want prioritization of Raft traffic vs. foreground traffic as well? If there's a relatively clean, workable solution for snapshots only that's not as invasive, that might be something to consider, but to "really" solve the problem it seems we'd be adding hooks to gRPC's internals where we need them and living with a fork.
cc @rytaft, who may have some wisdom to share |
Heh, I knew I had thought about this before.
The initial scope was all about snapshots: allowing snapshots to be sent as fast as possible as long as there is nothing else going on. Prioritizing Raft traffic vs. foreground traffic seems trickier, as sometimes that Raft traffic is necessary to service the foreground traffic.
@a-robinson Happy to help, but I think I need a bit more context. Is the main issue that multiple nodes are sending snapshots to the same recipient simultaneously? If so, would it be a problem to have them coordinate? Also, is the bottleneck happening at the RPC layer, network, CPU utilization or something else? I could also talk about this offline with someone if that would be easier... |
@rytaft the TL;DR is that we currently have two restrictions in place:
The main goal here is to relax 2) by allowing snapshots to be transferred faster, but without impacting foreground traffic (as in, use the bandwidth if nobody else is using it, but don't try to compete, at least not too hard).
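For concreteness, here is a rough sketch of what a fixed-rate cap like restriction 2) amounts to on the send path, using golang.org/x/time/rate. The chunking, burst size, and sendFn helper are assumptions for illustration; this shows the behavior being relaxed, not a proposal.

```go
package snapthrottle

import (
	"context"

	"golang.org/x/time/rate"
)

// sendFn stands in for whatever actually writes one snapshot chunk to the
// outgoing stream; it's a placeholder for this sketch.
type sendFn func(ctx context.Context, chunk []byte) error

// sendSnapshot pushes chunks at a fixed byte rate, e.g. rate.Limit(2 << 20)
// for roughly 2 MB/sec. The cap applies whether or not there is competing
// foreground traffic, which is exactly the blunt-instrument behavior
// discussed in this thread.
func sendSnapshot(ctx context.Context, chunks [][]byte, limit rate.Limit, send sendFn) error {
	// Burst of 1 MiB; chunks are assumed to be no larger than that, since
	// WaitN errors out if n exceeds the burst.
	lim := rate.NewLimiter(limit, 1<<20)
	for _, c := range chunks {
		if err := lim.WaitN(ctx, len(c)); err != nil {
			return err
		}
		if err := send(ctx, c); err != nil {
			return err
		}
	}
	return nil
}
```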
Makes sense. I was talking about this with Cuong at lunch, and the main question I have is: are you sure it's the network bandwidth that is the bottleneck, or could it be the processing of RPC calls? In the latter case, you could just create a different channel/port for snapshot traffic...
We're somewhat certain that it's the network bandwidth, but @m-schneider is running more experiments now, as the original issue #10972 didn't conclusively prove that. By a different channel, do you mean using a different port and then letting the OS throttle that against the rest? That's probably unlikely to happen, for two reasons: a) we only have one IANA-assigned port (26257), and b) we don't want to burden the operator with setting up said limits.
The other thing I'm noticing is that it looks like we're using the same connection for snapshots and for "general" traffic (cockroach/pkg/kv/kvserver/raft_transport.go, lines 656 to 664 at 8c5253b). I wonder if changing this alone can produce any benefits.
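If this turns out to be worth trying, a hedged sketch of the split would be to dial a second, dedicated gRPC connection per peer for snapshot streams, so a bulk transfer can't tie up the HTTP/2 connection-level flow-control window shared with foreground RPCs. The type and function names below are made up, not the actual raft_transport code.

```go
package rpcsep

import "google.golang.org/grpc"

// connPair holds two connections to the same peer: one for latency-sensitive
// Raft/foreground RPCs and one dedicated to snapshot streams. Splitting them
// keeps a bulk snapshot from exhausting the HTTP/2 connection-level
// flow-control window that all streams on a shared connection compete for.
type connPair struct {
	system   *grpc.ClientConn // Raft + foreground traffic
	snapshot *grpc.ClientConn // snapshot streams only
}

// dialPair dials both connections with the same options; real code would
// reuse the node's TLS and dial configuration rather than raw grpc.Dial.
func dialPair(target string, opts ...grpc.DialOption) (*connPair, error) {
	sys, err := grpc.Dial(target, opts...)
	if err != nil {
		return nil, err
	}
	snap, err := grpc.Dial(target, opts...)
	if err != nil {
		sys.Close()
		return nil, err
	}
	return &connPair{system: sys, snapshot: snap}, nil
}
```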
+cc @shralex.
x-ref #63728.
There are likely a few things we should do here:
|
#14718 limited the bandwidth for preemptive snapshots (i.e. rebalancing) to 2 MB/sec. This is a blunt instrument. @bdarnell says:
Jira issue: CRDB-6099