kvserver: ambiguous result under high latencies when running TPCC #54899

Open
aayushshah15 opened this issue Sep 29, 2020 · 3 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-kv KV Team
Comments

@aayushshah15
Contributor

aayushshah15 commented Sep 29, 2020

Describe the problem

When running TPCC 15K on a 20.2 cluster with 15 c5.4xlarge nodes (and, more generally, when running large TPCC workloads on clusters that are under-provisioned for the load), the workload generator receives a "result is ambiguous (intent missing and record aborted)" error after p95 latencies degrade into the dozens of seconds.

To Reproduce

Reproduction steps are identical to the ones in cockroachdb/pebble#934, except that running with 15000 active warehouses will reliably reproduce the error.

roachprod run $CLUSTER:16 './workload run tpcc --warehouses=15000 --partitions=5 --ramp=1m --duration=50m {pgurl:1-15}'

Additional context

Related to #53156. After that PR, the only case where we should be getting these errors is when the transaction record of the errant transaction has been garbage collected. By default, this should only happen after 10 minutes. However, that does not seem to be the case here (see the attached workload logs).
pebble15.txt
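
To find where the failure shows up in an attached workload log such as pebble15.txt, a small script can pull out the lines that carry the error text. The following is a minimal Python sketch, not part of the original report: the helper name `find_ambiguous_errors` is made up for illustration, and it assumes only that the error text quoted in the description appears verbatim on a log line.

```python
from typing import Iterable, List, Tuple

# Error text as quoted in the issue description (assumed to appear verbatim in the log).
AMBIGUOUS_MARKER = "result is ambiguous"


def find_ambiguous_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, stripped line) for each line containing the marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if AMBIGUOUS_MARKER in line:
            hits.append((lineno, line.strip()))
    return hits
```

Passing an open file handle for the log (for example `open("pebble15.txt", errors="replace")`) yields the matching lines with their positions, which can then be compared against the surrounding latency output.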

Jira issue: CRDB-3705

@blathers-crl

blathers-crl bot commented Sep 29, 2020

Hi @aayushshah15, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 29, 2020
@aayushshah15 aayushshah15 changed the title from "Ambiguous result when running TOCC" to "Ambiguous result under high latencies when running TPCC" Sep 29, 2020
@aayushshah15 aayushshah15 reopened this Sep 29, 2020
@aayushshah15 aayushshah15 added A-kv-transactions Relating to MVCC and the transactional model. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Sep 29, 2020
@yuzefovich yuzefovich added this to Incoming in KV via automation Sep 29, 2020
@nvanbenschoten
Member

This is unexpected. As you mentioned, #53156 was supposed to place some harder guarantees around this ambiguous error case. I bet we could figure out what's going wrong with a few well-placed log statements around EndTxn, QueryTxn, PushTxn, and RecoverTxn requests.
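
As a rough illustration of the targeted logging suggested above, the sketch below filters a stream of requests down to the four request types named in the comment for a single transaction and logs them in arrival order. It is a Python sketch under stated assumptions, not CockroachDB code: the `Request` class, its fields, and `trace_requests` are hypothetical stand-ins, and only the four request type names come from the comment.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, List

logger = logging.getLogger("txn_trace")

# Request types named in the comment above.
TRACED_METHODS = {"EndTxn", "QueryTxn", "PushTxn", "RecoverTxn"}


@dataclass
class Request:
    """Hypothetical stand-in for a KV request; not a CockroachDB type."""
    method: str
    txn_id: str
    node: int


def trace_requests(requests: Iterable[Request], txn_id: str) -> List[str]:
    """Log and return one message per traced request that touches txn_id, in arrival order."""
    messages = []
    for req in requests:
        if req.method in TRACED_METHODS and req.txn_id == txn_id:
            msg = f"n{req.node}: {req.method} for txn {req.txn_id}"
            logger.info(msg)
            messages.append(msg)
    return messages
```

Reading the resulting sequence for the errant transaction (for instance, a PushTxn or RecoverTxn arriving before the EndTxn) is the kind of ordering evidence such log statements would be meant to surface.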

@aayushshah15
Contributor Author

My repro steps mention a large cluster since that's what I was playing with at the time, but I think we should be able to repro this on a TPCC 3K run on 3 of those c5d.4xlarge nodes, for example.

@aayushshah15 aayushshah15 changed the title from "Ambiguous result under high latencies when running TPCC" to "kvserver: ambiguous result under high latencies when running TPCC" Oct 7, 2020
@aayushshah15 aayushshah15 self-assigned this Oct 20, 2020
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@mwang1026 mwang1026 moved this from Incoming to On Hold in KV Jul 16, 2021
Projects
KV
On Hold
Development

No branches or pull requests

3 participants