kvserver: ambiguous result under high latencies when running TPCC #54899
Labels: A-kv-transactions, C-enhancement, S-3-ux-surprise, T-kv
Describe the problem
When running TPCC 15K on a 20.2 cluster with 15 c5.4xlarge nodes (and, in general, when running large TPCC workloads on clusters that are perhaps too weak to handle the load), the workload generator receives a "result is ambiguous (intent missing and record aborted)" error after p95 latencies degrade into the dozens of seconds.
To Reproduce
Reproduction steps are identical to the ones in cockroachdb/pebble#934, except that running with 15000 active warehouses will reliably reproduce the error.
Additional context
Related to #53156. After that PR, the only case where we should get these errors is when the transaction record of the errant transaction has been garbage collected. By default, this should happen only after 10 minutes. However, that does not appear to be the case here (see the attached workload logs).
pebble15.txt
Jira issue: CRDB-3705