sql: no inbound stream connection error on TPCC running on large cluster #55297
Did you manage to grab the logs? This error usually happens because node A is trying to connect to node B when running a distributed query, but the connection is not happening for some reason, which times out the goroutine on node B waiting for this connection. We've seen this happen a lot because of GRPC issues, so #52624 added retries to this connection attempt from node A. It'd be helpful to look at the logs to see what the state of GRPC connections was and whether connection attempts were happening. Overload can definitely play a part in this. Also, what's the RTT between these nodes?
From the error, there was no way to tell which particular node(s) timed out (aside from maybe narrowing it down to ~18 candidate nodes based on which workload generator had crashed). Since this was an 81-node cluster, debug zip collection was also wildly slow (we started the collection, but after about 10 minutes it had only collected info from ~3 of the nodes, so the whole thing would've taken 4-5 hours just to generate). We were also not running with any vmodules since this was a benchmarking run (do we need to do that to get the information you're talking about?).

I looked through some of the conversation on the PR that adds retries, but interestingly, it seems like we only expect the situation solved/mitigated by that PR to occur after a node restart. Is that correct? In our case there were no node restarts. Are there ways the same problem as #44101 could poke its head out even without a node restart?

My hunch is that there is another failure mode, one that doesn't correspond to overload, that (perhaps spuriously) causes this error. I say this because we were able to pass with 135K warehouses very comfortably, but we couldn't with 140K (or 150K) warehouses (due to these errors). The CPU headroom on the 140K run looked pretty similar to the 135K run.
They were all in the same AZ, and max RTT was less than 1ms.
No, if there's a GRPC connection issue, that's usually printed out without a verbosity check, since it can lead to these easy-to-observe failures. Usually, if you grep for "timed out", there is some flow ID that you can associate across nodes (i.e. look for some connection error message decorated with that flow ID).
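For example, a rough sketch of that kind of log spelunking (the log directory below is an assumption, not this cluster's actual layout):

```sh
# Assumed log location; point this at wherever the cockroach logs live on each node.
LOG_DIR=/mnt/data1/cockroach/logs

# 1. Find the timeout messages and note the flow ID each one mentions.
grep -R "timed out" "$LOG_DIR"/cockroach.log*

# 2. Search the other nodes' logs for that flow ID (placeholder UUID here)
#    to see what happened to the corresponding connection attempt.
FLOW_ID="00000000-0000-0000-0000-000000000000"
grep -R "$FLOW_ID" "$LOG_DIR"/cockroach.log*
```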
Did anything else change? It's definitely possible that this is not an overload problem, but I don't want to rule it out. It might be interesting to add a metric to see how long these inbound streams are waiting for a connection.
I don't think that this is what we're seeing if there weren't any node restarts. The retries were supposed to help in general as well, so it's interesting that the `no inbound stream connection` error occurred despite these retries. Did you take a look at the
👍
Describe the problem
When running TPCC 150K on a cluster of 81 `c5d.9xlarge` CRDB nodes, along with 5 workload generator nodes of the same type (the same configuration we used for the TPCC 100K run for 19.2, per our performance page), our workload generators frequently crashed with a `no inbound stream connection` error right around spikes in p95 latencies.

To Reproduce
This is the script we used to set up the cluster:
and this is the script we used to run the workload:
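The scripts themselves aren't reproduced here; purely to illustrate the shape of such a run, a hypothetical invocation might look roughly like this (every host name, ramp, and duration below is an assumption, not the actual script):

```sh
# Hypothetical sketch only -- not the actual script from this run.
# Node addresses and timings are placeholders.
PGURLS=$(for i in $(seq 1 81); do
  printf 'postgres://root@crdb-node-%d:26257?sslmode=disable ' "$i"
done)

./cockroach workload run tpcc \
  --warehouses=150000 \
  --active-warehouses=150000 \
  --ramp=30m \
  --duration=2h \
  $PGURLS
```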
Expected behavior
No `no inbound stream connection` errors when running this workload.

Additional context
The cluster settings we were running with were as follows:

We had started the CRDB processes with a `--cache` value of `0.5`.
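For reference, a minimal sketch of how that flag gets passed at startup (the store path, join list, and `--insecure` flag here are placeholders, not our actual setup script):

```sh
# Only --cache=0.5 reflects this issue; everything else is a placeholder.
./cockroach start \
  --insecure \
  --store=/mnt/data1/cockroach \
  --cache=0.5 \
  --join=crdb-node-1:26257,crdb-node-2:26257,crdb-node-3:26257 \
  --background
```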
The CPU utilization throughout the cluster was fairly even, albeit high:
We hit these errors on our run with 120K `active-warehouses` as well as on our run with 135K `active-warehouses`. However, increasing `sql.distsql.max_running_flows` from its default of `500` to `1000` got us to a passing run with 135K warehouses.

We initially thought we were hitting the timeout that led to this error because we were running with a low `sql.distsql.max_running_flows`, which we had set to `1000` in order to get a successful TPCC 135K run. So we tried bumping this limit up to `1500` and then to `2000`, but that didn't help. We then tried bumping `kv.dist_sender.concurrency_limit` from its default of `1152` to `4608`, since we noticed that `cr.node.distsender.batches.async.throttled` would skyrocket to tens of thousands of batch requests every time there was a latency spike in the workload. This did not help either. Notably, we did not try bumping the timeout that triggers this error, which was an oversight.

Interestingly, we never ran into this error on our smaller TPCC runs on smaller clusters, even though they were similarly overloaded. This might indicate that cluster size has a role to play in this, perhaps.
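For concreteness, the setting bumps and the metric check described above amount to something like the following (only the setting names/values and the metric come from what's written above; the host and `--insecure` flag are placeholders):

```sh
# Setting values are the ones we tried above; connection details are placeholders.
./cockroach sql --insecure --host=crdb-node-1 -e "
  SET CLUSTER SETTING sql.distsql.max_running_flows = 2000;
  SET CLUSTER SETTING kv.dist_sender.concurrency_limit = 4608;
"

# Spot-check the throttling metric (assuming crdb_internal exposes it without the cr.node. prefix).
./cockroach sql --insecure --host=crdb-node-1 -e "
  SELECT name, value FROM crdb_internal.node_metrics
  WHERE name LIKE 'distsender.batches.async%';
"
```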
EDIT: It looks like we thought we had fixed/mitigated this error in 20.2 via #52624 (also see #53656).
cc @yuzefovich @nvanbenschoten @asubiotto @adityamaru