kv: clients without retry backoffs can cause metastable failure #123304
Comments
Previously, if --tolerate-errors was set, the workload would log and drop errors. Now it retries them. This more closely simulates how clients use our system. Informs: cockroachdb#123304. Epic: none. Release note: None
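As a rough illustration of that change (not the actual workload code; runWithTolerateErrors and runOp are invented names), the difference is retrying a failed operation instead of logging and dropping it. Whether the real change pauses between attempts is not stated above; this sketch retries immediately, matching the no-backoff client behavior the issue describes.

```go
// Hypothetical sketch of the --tolerate-errors change; runOp and this loop
// are illustrative stand-ins, not the real workload implementation.
package workloadretry

import (
	"context"
	"log"
)

func runWithTolerateErrors(ctx context.Context, runOp func(context.Context) error) error {
	for {
		err := runOp(ctx)
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			return ctx.Err()
		}
		// Old behavior: log the error and drop the operation.
		//   log.Printf("error: %v", err)
		//   return nil
		// New behavior: log and retry the same operation, which is closer to
		// how real clients (which reissue failed requests) drive the system.
		log.Printf("retrying after error: %v", err)
	}
}
```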
Running on v24.1-beta3, the system experiences similar behavior; however, it is harder to tip it into the unstable regime. Letting the system fill for ~10 minutes first will do it.
nit: 'kv: clients without retries backoffs can cause metastable failure'
On v24.1-beta3 with Setting
I am inclined to close this as a combination of (1) a configuration problem and (2) known issues tracked elsewhere.
I don't think we should close this issue until the default "out of the box" configuration no longer enters the unstable mode. I also agree that we should not say this is addressed by disabling AC. I think we should consider making
We don't need to schedule this for an upcoming release, but we have seen this exact behavior at a customer and will likely see similar issues from future customers. Having this open allows us to attach other customer cases to it and decide whether fixing this is something we want to do.
It is also worth automating this test as a failing roachtest to ensure that, if we do come up with a solution, we can verify it actually addresses the problem.
Describe the problem
In situations where clients set low SQL timeouts and retry without backoff, we can enter a state of metastable failure where the only way out is to completely stop the workload and then gradually restart it.
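For concreteness, here is a minimal sketch of the client pattern described above versus one that backs off. This is illustrative only: the kv table, the 500ms timeout, and the backoff constants are placeholders, not values taken from this report.

```go
// Illustrative client patterns only; not code from this report. Assumes a
// Postgres-wire-compatible driver and a table kv(k INT PRIMARY KEY, v BYTES).
package clientpatterns

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/lib/pq" // any Postgres-wire-compatible driver
)

// readNoBackoff models the problematic clients: a low per-statement timeout
// plus an immediate retry on every error. Under overload, each timed-out
// attempt is instantly replaced by a new one, so the offered load never
// drops and the cluster cannot work through its backlog.
func readNoBackoff(db *sql.DB, key int) error {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
		var v []byte
		err := db.QueryRowContext(ctx, "SELECT v FROM kv WHERE k = $1", key).Scan(&v)
		cancel()
		if err == nil {
			return nil
		}
		// No backoff: retry immediately.
	}
}

// readWithBackoff sheds load when the cluster is struggling by waiting an
// exponentially growing (capped) interval between attempts.
func readWithBackoff(db *sql.DB, key int) error {
	delay := 50 * time.Millisecond
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
		var v []byte
		err := db.QueryRowContext(ctx, "SELECT v FROM kv WHERE k = $1", key).Scan(&v)
		cancel()
		if err == nil {
			return nil
		}
		time.Sleep(delay)
		if delay *= 2; delay > 2*time.Second {
			delay = 2 * time.Second
		}
	}
}
```

Adding jitter to the backoff (and capping total retries) would further reduce the chance of synchronized retry storms; even the capped exponential backoff above gives an overloaded cluster a chance to drain its queue.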
To Reproduce
Use a modified version of the workload tool that retries errors when --tolerate-errors is set rather than just ignoring them. Note the different binary put on node 13, which has this behavior.
Create a 13-node cluster (12 nodes plus workload).
Set up the SQL user and permissions correctly
Run this command 2x; note that the cluster runs at ~15% CPU usage (ideally we could run it once, but sometimes it fails to start).
Let it run for ~1 minute to generate some data.
Add a write-heavy workload to a different database for a few seconds to create LSM inversion (a hedged sketch of such a burst follows this list).
Notice that the system enters a failure state where the CPU is pegged and it is only processing a fraction of the QPS it was handling before.
Stop the workload jobs, wait 10 seconds, and restart them. Notice that the cluster is now stable again and handles the workload without issue.
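A hedged sketch of the kind of short write-heavy burst referred to above; the burst.kv table, the value size, and the single-goroutine loop are invented for illustration and are not the actual command from this reproduction.

```go
// Hypothetical write burst against a separate database; not the actual
// command used in the reproduction above.
package burst

import (
	"context"
	"crypto/rand"
	"database/sql"
	"time"
)

// writeBurst inserts large random values as fast as possible for duration d.
// The intent is to mimic a short write-heavy spike that drives up L0 file
// counts; the real reproduction would typically run many such workers.
func writeBurst(ctx context.Context, db *sql.DB, d time.Duration) error {
	payload := make([]byte, 64<<10) // 64 KiB values (placeholder size)
	deadline := time.Now().Add(d)
	for i := 0; time.Now().Before(deadline); i++ {
		if _, err := rand.Read(payload); err != nil {
			return err
		}
		if _, err := db.ExecContext(ctx,
			"INSERT INTO burst.kv (k, v) VALUES ($1, $2)", i, payload); err != nil {
			return err
		}
	}
	return nil
}
```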
Expected behavior
Ideally, no errors would occur during this test. Given that older versions of the software hit errors due to overload during index creation, the errors are not surprising, but the system's failure to recover is.
Additional data / screenshots
Timeline
Environment:
CRDB 23.1.17, see commands above for exact configuration.
Additional context
We have seen customers with similar configurations and setups that have hit this issue.
Jira issue: CRDB-38280