kvserver: ambiguous result under high latencies when running TPCC #54899

aayushshah15 · 2020-09-29T01:37:13Z

Describe the problem

When running TPCC 15K on a 20.2 cluster with 15 c5.4xlarge nodes (and, more generally, when running large TPCC workloads on clusters that are too weak to handle the load), the workload generator receives a "result is ambiguous (intent missing and record aborted)" error after p95 latencies degrade into the dozens of seconds.

To Reproduce

Reproduction steps are identical to the ones in cockroachdb/pebble#934, except that running with 15000 active warehouses will reliably reproduce the error.

roachprod run $CLUSTER:16 './workload run tpcc --warehouses=15000 --partitions=5 --ramp=1m --duration=50m {pgurl:1-15}'

Additional context

Related to #53156. After that PR, the only case where we should be getting these errors is when the transaction record of the errant transaction has been garbage collected. By default, this should only happen after 10 minutes. However, that does not seem to be the case here (see the attached workload logs).
pebble15.txt

Jira issue: CRDB-3705


blathers-crl · 2020-09-29T01:37:16Z

Hi @aayushshah15, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

nvanbenschoten · 2020-10-06T15:41:59Z

This is unexpected. As you mentioned, #53156 was supposed to place some harder guarantees around this ambiguous error case. I bet we could figure out what's going wrong with a few well-placed log statements around EndTxn, QueryTxn, PushTxn, and RecoverTxn requests.

aayushshah15 · 2020-10-06T15:57:05Z

My repro steps mention a large cluster since that's what I was playing with at the time, but I think we should be able to repro this on a TPCC 3K run on 3 of those c5d.4xlarge nodes, for example.

blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 29, 2020

aayushshah15 closed this as completed Sep 29, 2020

aayushshah15 changed the title ~~Ambiguous result when running TOCC~~ Ambiguous result under high latencies when running TPCC Sep 29, 2020

aayushshah15 reopened this Sep 29, 2020

aayushshah15 added A-kv-transactions Relating to MVCC and the transactional model. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Sep 29, 2020

yuzefovich added this to Incoming in KV via automation Sep 29, 2020

aayushshah15 changed the title ~~Ambiguous result under high latencies when running TPCC~~ kvserver: ambiguous result under high latencies when running TPCC Oct 7, 2020

aayushshah15 self-assigned this Oct 20, 2020

nvanbenschoten mentioned this issue Oct 20, 2020

roachtest: add tpcc/overload test #37122

Closed

jlinder added the T-kv KV Team label Jun 16, 2021

mwang1026 moved this from Incoming to On Hold in KV Jul 16, 2021

exalate-issue-sync bot unassigned aayushshah15 Jan 30, 2023
