
sql: no inbound stream connection error on TPCC running on large cluster #55297

Closed
aayushshah15 opened this issue Oct 7, 2020 · 5 comments
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior.


aayushshah15 (Contributor) commented Oct 7, 2020

Describe the problem

When running TPCC 150K on a cluster of 81 c5d.9xlarge CRDB nodes, along with 5 workload generator nodes of the same type (the same configuration we used for the TPCC 100K run for 19.2, per our performance page), our workload generators frequently crashed with a "no inbound stream connection" error right around spikes in p95 latencies.

To Reproduce

This is the script we used to set up the cluster:

set -e
set -x
export SSDTPCC150="aayushs-ssd-tpcc150";

build/builder.sh mkrelease amd64-linux-gnu bin/workload;
build/builder.sh mkrelease amd64-linux-gnu;
make bin/roachprod

roachprod create $SSDTPCC150 --local-ssd=true --aws-machine-type-ssd=c5d.9xlarge --aws-zones=us-west-2b --nodes=86 --clouds=aws --lifetime=48h;
echo "NOTE: putting workload and cockroach binaries onto their respective nodes..."
roachprod put $SSDTPCC150:82-86 bin.docker_amd64/workload workload;
roachprod put $SSDTPCC150:1-81 cockroach-linux-2.6.32-gnu-amd64 ./cockroach;

echo "NOTE: starting the cockroach nodes....."
roachprod start $SSDTPCC150:1-81 --args="--cache=0.25" --racks=27
echo "NOTE: configuring the cluster for fast import..."
roachprod run $SSDTPCC150:1  "./cockroach sql --insecure -e \"alter range default configure zone using num_replicas=1; SET CLUSTER SETTING schemachanger.backfiller.max_buffer_size = '5 GiB'; SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '128 MiB'; SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128 MiB';\" ";

echo "NOTE: importing..."
roachprod run $SSDTPCC150:82 './workload fixtures import tpcc --warehouses=150000 --partitions=27 {pgurl:1}' 2>&1 | tee tpcc150kimport-localssd.out;
echo "NOTE: upping the replication factor back to 3";
roachprod run $SSDTPCC150:1  "./cockroach sql --insecure -e \"alter range default configure zone using num_replicas=3;\"";
echo "NOTE: done"

and this is the script we used to run the workload:

echo "NOTE: starting on workload generators"

export NUMWARE=150000
export SSDTPCC150="aayushs-ssd-tpcc150"
roachprod run $SSDTPCC150:82 "./workload run tpcc --partitions 27 --warehouses 150000 --active-warehouses $NUMWARE --partition-affinity 0,5,10,15,20,25 --ramp 4m --duration 30m --histograms workload1.histogram.ndjson {pgurl:1-81}" 2>&1 | tee workload1_${NUMWARE}.out &
roachprod run $SSDTPCC150:83 "./workload run tpcc --partitions 27 --warehouses 150000 --active-warehouses $NUMWARE --partition-affinity 1,6,11,16,21,26 --ramp 4m --duration 30m --histograms workload2.histogram.ndjson {pgurl:1-81}" 2>&1 > workload2_${NUMWARE}.out &
roachprod run $SSDTPCC150:84 "./workload run tpcc --partitions 27 --warehouses 150000 --active-warehouses $NUMWARE --partition-affinity 2,7,12,17,22 --ramp 4m --duration 30m --histograms workload3.histogram.ndjson {pgurl:1-81}" 2>&1 > workload3_${NUMWARE}.out &
roachprod run $SSDTPCC150:85 "./workload run tpcc --partitions 27 --warehouses 150000 --active-warehouses $NUMWARE --partition-affinity 3,8,13,18,23 --ramp 4m --duration 30m --histograms workload4.histogram.ndjson {pgurl:1-81}" 2>&1 > workload4_${NUMWARE}.out &
roachprod run $SSDTPCC150:86 "./workload run tpcc --partitions 27 --warehouses 150000 --active-warehouses $NUMWARE --partition-affinity 4,9,14,19,24 --ramp 4m --duration 30m --histograms workload5.histogram.ndjson {pgurl:1-81}" 2>&1 > workload5_${NUMWARE}.out &

Expected behavior

No "no inbound stream connection" errors when running this workload.

Additional context

The cluster settings we were running with were as follows (a sketch of applying them is included after the list):

kv.bulk_io_write.concurrent_addsstable_requests = 81
sql.distsql.max_running_flows = 2000
kv.dist_sender.concurrency_limit = 4608
kv.snapshot_recovery.max_rate = '128 MiB'
kv.snapshot_rebalance.max_rate = '128 MiB'
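
For reference, a minimal sketch of applying these from a SQL shell on any node, the same way the setup script above does (values mirror the list above; adjust as needed):

SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 81;
SET CLUSTER SETTING sql.distsql.max_running_flows = 2000;
SET CLUSTER SETTING kv.dist_sender.concurrency_limit = 4608;
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '128 MiB';
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128 MiB';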

We had started the CRDB processes with a --cache value of 0.5.

The CPU utilization throughout the cluster was fairly even, albeit high:
[screenshot from 2020-10-02 16:58:51: CPU utilization across the cluster]

We hit these errors on our run with 120K active warehouses as well as on our run with 135K active warehouses. However, increasing sql.distsql.max_running_flows from its default of 500 to 1000 got us to a passing run with 135K warehouses.

We initially thought we were hitting the timeout that leads to this error because we were running with a low sql.distsql.max_running_flows, which we had set to 1000 in order to get a successful TPCC 135K run. So we tried bumping this limit up to 1500 and then to 2000, but that didn't help. We then tried bumping kv.dist_sender.concurrency_limit from its default of 1152 to 4608, since we noticed that cr.node.distsender.batches.async.throttled would skyrocket to tens of thousands of batch requests every time there was a latency spike in the workload. That did not help either. Notably, we did not try bumping the timeout that triggers this error, which was an oversight.
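
For reference, a hedged sketch of bumping that timeout, assuming it is controlled by the sql.distsql.flow_stream_timeout cluster setting (10s by default):

-- assumption: the timeout behind "no inbound stream connection" is sql.distsql.flow_stream_timeout
SET CLUSTER SETTING sql.distsql.flow_stream_timeout = '30s';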

Interestingly, we never ran into this error on our smaller TPCC runs on smaller clusters, even though they were similarly overloaded. This might indicate that cluster size plays a role here.

EDIT: It looks like we thought we had fixed/mitigated this error in 20.2 via #52624 (also see #53656).

cc @yuzefovich @nvanbenschoten @asubiotto @adityamaru

@aayushshah15 aayushshah15 changed the title from "sql: no inbound stream connection error on large cluster" to "sql: no inbound stream connection error on TPCC running on large cluster" Oct 7, 2020
@yuzefovich yuzefovich added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Oct 7, 2020
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Oct 7, 2020
@yuzefovich yuzefovich added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Oct 7, 2020
asubiotto (Contributor) commented

Did you manage to grab the logs?

This error usually happens because node A is trying to connect to node B when running a distributed query, but the connection is not happening for some reason, which times out the goroutine on node B that is waiting for this connection. We've seen this happen a lot because of gRPC issues, so #52624 added retries to this connection attempt from node A. It'd be helpful to look at the logs to see what the state of the gRPC connections was and whether connection attempts were happening. Overload can definitely play a part in this.

Also what's the RTT between these nodes?

aayushshah15 (Contributor, Author) commented

From the error, there was no way to tell which particular node(s) timed out (aside from maybe narrowing it down to ~18 candidate nodes based on which workload generator had crashed). Since this was an 81-node cluster, debug zip collection was also wildly slow (we started the collection, but after about ~10 minutes it had only collected info from ~3 of the nodes, so the whole thing would've taken 4-5 hours just to generate). We were also not running with any vmodules since this was a benchmarking run (do we need to do that to get the information you're talking about?).

I looked through some of the conversation on the PR that adds retries, but interestingly, it seems like we only expect the situation solved/mitigated by that PR to occur after a node restart. Is that correct? In our case there were no node restarts. Are there ways the same problem as #44101 could crop up even without a node restart?

My hunch is that there is another failure mode, not related to overload, that (perhaps spuriously) causes this error. I say this because we were able to pass with 135K warehouses very comfortably, but we couldn't with 140K (or 150K) warehouses (due to these errors). The CPU headroom on the 140K run looked pretty similar to the 135K run.

Also what's the RTT between these nodes?

They were all in the same AZ, and max RTT was less than 1ms.

asubiotto (Contributor) commented

We were also not running with any vmodules since this was a benchmarking run (do we need to do that to get the information you're talking about?).

No, if there's a gRPC connection issue, that's usually printed out without a verbosity check since it can lead to these easy-to-observe failures. Usually, if you grep for "timed out", there is some flow ID that you can associate across nodes (i.e. look for some connection error message decorated with that flow ID).
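
Roughly along these lines (a sketch only; the log path assumes the default roachprod layout, and <flow-id> is a placeholder for whatever ID the first grep surfaces):

roachprod run $SSDTPCC150:1-81 "grep -i 'timed out' logs/cockroach.log"
# take the flow ID from a matching line, then look for connection errors mentioning it on every node:
roachprod run $SSDTPCC150:1-81 "grep '<flow-id>' logs/cockroach.log"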

My hunch is that there is another failure mode, not related to overload, that (perhaps spuriously) causes this error. I say this because we were able to pass with 135K warehouses very comfortably, but we couldn't with 140K (or 150K) warehouses (due to these errors). The CPU headroom on the 140K run looked pretty similar to the 135K run.

Did anything else change? It's definitely possible that this is not an overload problem, but I don't want to rule overload out. It might be interesting to add a metric to see how long these inbound streams are waiting for a connection.

I looked through some of the conversation on the PR that adds retries, but interestingly, it seems like we only expect the situation solved/mitigated by that PR to occur after a node restart.

I don't think that this is what we're seeing if there weren't any node restarts. The retries were supposed to help in general as well, so it's interesting that the "no inbound stream connection" error occurred despite these retries.

Did you take a look at the sql.distsql.flow.* metrics? These would allow us to rule out sql.distsql.max_running_flows being too low.
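
One quick, hedged way to eyeball them without the Admin UI is the Prometheus endpoint on each node (assuming the default HTTP port of 8080; metric names there use underscores):

curl -s http://<node>:8080/_status/vars | grep '^sql_distsql_flows'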

@asubiotto asubiotto moved this from Triage to [GENERAL BACKLOG] Enhancements/Features/Investigations in BACKLOG, NO NEW ISSUES: SQL Execution Oct 13, 2020
aayushshah15 (Contributor, Author) commented

We attempted TPCC 150K again on the same cluster setup this past Friday. This time, we did not hit these errors at all, even though at various points the cluster was fairly overloaded, and for sustained durations.

Did you take a look at the sql.distsql.flow.* metrics?

I also now think that we vastly misdiagnosed the problem as being due to the max_running_flows setting, because the number of active flows per node seemed really low (does this seem right for 150K warehouses on an 81-node cluster?). The metric that measures the number of flows queued up also basically stayed at zero the whole time.
[screenshot from 2020-10-17 at 8:16:30 PM: sql.distsql flow metrics]

What we observed last Friday, combined with our observations from the smaller TPCC 15K runs, tells me that we perhaps just got unlucky with some "bad" nodes in the cluster on our first attempt at 150K, since that's the only time we ran into these errors.

I'd be okay closing this issue for now and perhaps reviving it if someone else runs into something similar again; I don't know if there's enough here to dig into further at the moment.

asubiotto (Contributor) commented

👍
