kv: clients without retries backoffs can cause metastable failure #123304

Open
andrewbaptist opened this issue Apr 30, 2024 · 5 comments
Labels
C-bug (Code not up to spec/doc; specs & docs deemed correct; solution expected to change code/behavior), O-support (Originated from a customer), P-3 (Issues/test failures with no fix SLA), T-admission-control (Admission Control)

Comments

andrewbaptist (Collaborator) commented Apr 30, 2024

Describe the problem

In situations where clients set low SQL timeouts and retry without backoff, we can enter a state of metastable failure where the only way out is to completely stop the workload and then gradually restart it.
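
For illustration only (this sketch is not from the issue or the workload tool; the function names, driver choice, and backoff constants are invented), the difference between the problematic client pattern and one that backs off looks roughly like this in Go. The first loop re-offers a timed-out statement to the cluster immediately, so offered load never drops during an overload; the second sheds load with exponential backoff and jitter.

package clientretry

import (
	"context"
	"database/sql"
	"math/rand"
	"time"

	_ "github.com/lib/pq" // driver choice is illustrative, not prescribed by the issue
)

// retryNoBackoff mirrors the problematic pattern: a low statement_timeout
// (250ms in the repro) plus an immediate retry, analogous to the
// --retry-errors=0ns flag used in the repro below.
func retryNoBackoff(ctx context.Context, db *sql.DB, stmt string) error {
	for {
		if _, err := db.ExecContext(ctx, stmt); err == nil {
			return nil
		} else if ctx.Err() != nil {
			return err // caller gave up
		}
		// No sleep: the failed attempt is immediately re-offered to the cluster.
	}
}

// retryWithBackoff is the pattern the issue title asks clients to use:
// exponential backoff with full jitter, so offered load drops while the
// cluster is overloaded and the cluster can recover on its own.
func retryWithBackoff(ctx context.Context, db *sql.DB, stmt string) error {
	backoff := 50 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for {
		if _, err := db.ExecContext(ctx, stmt); err == nil {
			return nil
		} else if ctx.Err() != nil {
			return err
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter in [0, backoff)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}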

To Reproduce
Use a modified version of the workload tool that retries errors when --tolerate-errors is set rather than just ignoring them. Note that a different binary with this behavior is put on node 13.

Create a 13-node cluster (12 CockroachDB nodes plus one workload node):

roachprod create -n 13 --gce-machine-type n2-standard-16 $CLUSTER
roachprod stage $CLUSTER:1-12 release v23.1.17
roachprod put $CLUSTER:13 artifacts/cockroach
roachprod start $CLUSTER:1-12
roachprod ssh $CLUSTER:1 "./cockroach workload init kv $(roachprod pgurl $CLUSTER:1) --splits 1000"

Set up the SQL user and permissions:

USE kv;
CREATE USER testuser;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA PUBLIC TO testuser;
ALTER USER testuser SET statement_timeout='250ms';
ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;

Run this command twice; note that the cluster runs at ~15% CPU usage (ideally we could run it once, but sometimes it fails to start).

roachprod ssh $CLUSTER:13 "./cockroach-short workload run kv $(roachprod pgurl $CLUSTER:1-6 | sed 's/root/testuser/g') --concurrency=50000 --max-rate=40000 --retry-errors=0ns --ramp=10s"

Let it run for ~1 minute to generate some data.
Add a write-heavy workload to a different DB for a few seconds to create LSM inversion.

roachprod ssh $CLUSTER:13 "./cockroach workload run kv $(roachprod pgurl $CLUSTER:1-12) --tolerate-errors --concurrency=1000 --max-block-bytes=1000000 --db=kv2 --drop --init --splits=100 --max-ops=5000"

Notice that the system enters a failure state where the CPU is pegged and it is only processing a fraction of the QPS it handled before.

Stop the workload jobs, wait 10 seconds, and restart them. Notice that the cluster is now stable again and handles the workload without issue.

Expected behavior
Ideally, no errors would occur during this test. Given that older versions of the software hit errors due to overload during index creation, the errors themselves are not surprising, but the system's failure to recover is.

Additional data / screenshots
Timeline

  • 17:29 - Cluster started with the 10K ops
  • 17:30:20 - Create index - takes ~1.5s.
  • 17:32 - stop workload, wait 10s and restart

Environment:
CRDB 23.1.17, see commands above for exact configuration.

Additional context
We have seen customers with similar configurations and setups that have hit this issue.

Jira issue: CRDB-38280

andrewbaptist added the C-bug label Apr 30, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Apr 30, 2024
Previously if tolerate errors was set, the workload would log and drop
errors. Now it will retry the errors. This more closely simulates how
clients would use our system.

Informs: cockroachdb#123304

Epic: none

Release note: None
andrewbaptist (Collaborator, Author) commented:

Running on v24.1-beta3, the system experiences similar behavior; however, it is harder to tip it into the unstable regime. Letting the system fill for ~10 minutes first will do it.

lyang24 (Collaborator) commented Apr 30, 2024

nit: suggest retitling to 'kv: clients without retries backoffs can cause metastable failure'

andrewbaptist changed the title from "kv: clients without retries can cause metastable failures" to "kv: clients without retries backoffs can cause metastable failure" Apr 30, 2024
andrewbaptist (Collaborator, Author) commented:

On v24.1-beta3 with admission.kv.enabled=false or admission.kv.bulk_only.enabled=true, the workload does not enter the unstable regime.

Setting server.max_open_transactions_per_gateway = 100 also prevents it from becoming unstable.
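
As a minimal illustration only (the SET CLUSTER SETTING statement uses the setting named above; the Go helper and its name are invented for this sketch and are not part of CockroachDB or the workload tool), the second mitigation can be applied like this:

package clientlimit

import "database/sql"

// applyGatewayTxnLimit caps open transactions per gateway node, the
// mitigation mentioned in the comment above.
func applyGatewayTxnLimit(db *sql.DB) error {
	_, err := db.Exec(`SET CLUSTER SETTING server.max_open_transactions_per_gateway = 100`)
	return err
}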

andrewbaptist added the T-admission-control label Apr 30, 2024
andrewbaptist added further commits to andrewbaptist/cockroach that referenced this issue on Apr 30, May 1, and May 2, 2024, with the same change as above.
sumeerbhola (Collaborator) commented:

  • Disabling parts of AC: Setting admission.kv.enabled=false or admission.kv.bulk_only.enabled=true is not a desirable tradeoff, since it skips necessary AC queueing and further inverts the store. With #123509 (admission,replication: replication admission control for regular work), there will be fairness across leaseholder and follower writes, and we should embrace that.

  • Throughput decrease: We have a known problem that without server.max_open_transactions_per_gateway = 100, throughput drops to about 50% -- this is documented in #91536 (admission: reduce over-admission into the goroutine scheduler for latency isolation). We need some improvements in txn heartbeating to improve this 50% number; I think that should be tracked elsewhere.

  • Latency increase and massive throughput decrease: If the number of open txns grows without bound, which is what this scenario with timeouts and immediate retries causes, a system with FIFO queueing will eventually have latency close to the timeout (or, with no timeout, unbounded latency, since eventually very old txns get their turn to execute) and effective throughput close to 0 (since every txn starts executing close to its timeout and hits the timeout before it completes). The default queueing order in AC for txns from the same tenant and same qos is txn start time, i.e. FIFO. We have implemented epoch-LIFO, but I don't think it is "ready" as a solution. The practical systems I have seen do FIFO scheduling and manage to get good throughput with a combination of (a) limiting admission into the system (a rate limit or concurrency limit, like server.max_open_transactions_per_gateway) -- the queueing delay outside the system does not count towards the system's latency -- and (b) having a large enough timeout that admitted work can complete before the timeout. For example, if the desired latency is < 100ms, the timeout might be 10s, and (a) set high enough that admitted txns complete within 2s (assuming we can live with 2s under periods of high load until provisioning is changed). We have knobs to control both (a) and (b), so it is unclear what more to do here; a client-side sketch of (a) and (b) follows this list.
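
A minimal client-side sketch of (a) and (b), purely illustrative and not from this issue or the CockroachDB codebase (LimitedRunner, maxInFlight, and attemptTimeout are invented names): excess work waits on a client-side semaphore, so queueing delay accrues outside the database, and each admitted attempt gets a timeout comfortably above the expected completion time.

package clientlimit

import (
	"context"
	"database/sql"
	"time"
)

// LimitedRunner bounds in-flight statements on the client side (point (a))
// and applies a generous per-attempt timeout (point (b)).
type LimitedRunner struct {
	db             *sql.DB
	sem            chan struct{} // capacity = max in-flight statements, e.g. 100
	attemptTimeout time.Duration // e.g. 10s, well above the target steady-state latency
}

func NewLimitedRunner(db *sql.DB, maxInFlight int, attemptTimeout time.Duration) *LimitedRunner {
	return &LimitedRunner{
		db:             db,
		sem:            make(chan struct{}, maxInFlight),
		attemptTimeout: attemptTimeout,
	}
}

// Exec waits for an admission slot, then runs the statement under the
// per-attempt timeout. The caller's ctx bounds time spent queueing; the
// attempt timeout bounds time spent executing.
func (r *LimitedRunner) Exec(ctx context.Context, stmt string, args ...interface{}) error {
	select {
	case r.sem <- struct{}{}: // acquire a slot; the wait happens outside the database
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-r.sem }() // release the slot

	attemptCtx, cancel := context.WithTimeout(ctx, r.attemptTimeout)
	defer cancel()
	_, err := r.db.ExecContext(attemptCtx, stmt, args...)
	return err
}

With maxInFlight sized so that admitted statements complete within a couple of seconds and attemptTimeout around 10s, admitted work rarely hits its deadline; server.max_open_transactions_per_gateway enforces the same shape on the server side.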

I am inclined to close this as a combination of (1) a configuration problem and (2) known issues tracked elsewhere.

sumeerbhola closed this as not planned May 7, 2024
andrewbaptist (Collaborator, Author) commented:

I don't think we should close this issue until the default "out of the box" configuration no longer enters the unstable mode. I agree that we should not say this is addressed by disabling AC. I think we should consider making server.max_open_transactions_per_gateway a default configured setting, though ideally this could be tied into AC in the future as well; I also agree this is the approach most other production systems take.

We don't need to schedule this for an upcoming release, but we have seen this exact behavior at a customer and will likely see other customers submit similar issues. Having this open allows us to attach other customer cases to it and decide whether fixing this is something we want to do.

It is also worth automating this scenario as a failing roachtest so that, if we do come up with a solution, we can verify that it addresses the problem.

andrewbaptist reopened this May 8, 2024
andrewbaptist added the O-support label May 8, 2024
sumeerbhola added the P-3 label Jun 4, 2024