roachtest: update tpcc overload #121833

Merged

andrewbaptist
Collaborator

This test was created previously but was never run because it always failed. It still doesn't pass in the default configuration; however, after the addition of #118781, it is at least possible to configure a cluster so that it passes.

Informs: #110272
Informs: #89142

Release note: None

@andrewbaptist andrewbaptist force-pushed the 2024-04-05-tpcc-overload-update branch from 9f0df24 to a32d8c4 Compare April 5, 2024 16:29
@andrewbaptist
Collaborator Author

Results from a run with this PR and server.max_open_transactions_per_gateway = 100:

image

I'm running again without that parameter to see the difference.
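
For anyone reproducing the comparison, the two runs differ only in one cluster setting. Below is a minimal Python sketch that builds the SQL statements for each configuration. The helper names are hypothetical and not part of the roachtest, and it assumes the statements are then executed through an ordinary SQL client (for example `cockroach sql -e`).

```python
# Hypothetical helpers (not part of the roachtest): build the SQL used to
# toggle the per-gateway open-transaction limit between runs.
SETTING = "server.max_open_transactions_per_gateway"


def set_limit_stmt(limit: int) -> str:
    """Statement that sets the per-gateway open-transaction limit to `limit`."""
    return f"SET CLUSTER SETTING {SETTING} = {limit};"


def reset_limit_stmt() -> str:
    """Statement that returns the setting to its default value."""
    return f"RESET CLUSTER SETTING {SETTING};"


if __name__ == "__main__":
    print(set_limit_stmt(100))   # configuration used for the first run
    print(reset_limit_stmt())    # back to the default configuration
```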

@andrewbaptist
Collaborator Author

This is the default behavior (without the parameter set). I "unset" the parameter at 15:35. Note that throughput without the parameter is roughly half of what it was with the parameter set, and latency is more than 100x higher.

image

@andrewbaptist andrewbaptist marked this pull request as ready for review April 9, 2024 20:28
@andrewbaptist andrewbaptist requested a review from a team as a code owner April 9, 2024 20:28
@andrewbaptist andrewbaptist requested review from srosenberg and renatolabs and removed request for a team April 9, 2024 20:28
@andrewbaptist andrewbaptist force-pushed the 2024-04-05-tpcc-overload-update branch from a32d8c4 to a4cdb4e Compare April 10, 2024 17:37
@rafiss left a comment
Collaborator

I think this is great, and a nice experiment!

@andrewbaptist
Collaborator Author

TFTR!

bors r=rafiss

@craig craig bot merged commit c664210 into cockroachdb:master Apr 19, 2024
22 checks passed
@andrewbaptist andrewbaptist deleted the 2024-04-05-tpcc-overload-update branch April 19, 2024 15:32
@sumeerbhola
Collaborator

Thanks for the test. I am running this now. This runs hourly full backups and 15 min incremental backups, which also consume significant resources. Was that intentional?
Screenshot 2024-04-26 at 10 05 16 AM

Also, what is the motivation for the 4h ramp? I was planning to speed it up to 30m to get to the overload quicker.

@sumeerbhola
Collaborator

There is some uneven load distribution. n4 and n2 have higher CPU and higher KV read and write requests.
Screenshot 2024-04-26 at 10 16 14 AM
Screenshot 2024-04-26 at 10 16 55 AM

@andrewbaptist
Collaborator Author

The reason for the 4h ramp is to allow a more gradual change from "sustainable" to "unsustainable" load and to watch the impact as it changes. The backups were not really intentional, but when I noticed them I left them in, since the customer case was also somewhat impacted by backups. It would probably be cleaner to remove them.

From a rebalancing perspective, it would be good to understand whether the per-node loads converge over time and, if not, why not.
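
To make the ramp trade-off concrete, here is a rough sketch comparing how quickly a 4h ramp and the proposed 30m ramp approach full load. It assumes the ramp is linear in time, which this thread does not state. `ramp_fraction` is a hypothetical helper, not the workload's implementation.

```python
def ramp_fraction(elapsed_min: float, ramp_min: float) -> float:
    """Fraction of peak load applied after elapsed_min, ASSUMING a linear ramp."""
    if ramp_min <= 0:
        return 1.0  # no ramp: full load immediately
    # Clamp to [0, 1] so times past the end of the ramp stay at full load.
    return max(0.0, min(1.0, elapsed_min / ramp_min))


if __name__ == "__main__":
    for t in (15, 30, 120, 240):
        slow = ramp_fraction(t, 240)  # 4h ramp
        fast = ramp_fraction(t, 30)   # 30m ramp
        print(f"t={t:>3}m  4h ramp: {slow:.1%}  30m ramp: {fast:.1%}")
```

Under this linearity assumption, 30 minutes into the run the 4h ramp is at 12.5% of peak load while the 30m ramp has already reached full load. That is the trade: the shorter ramp reaches overload sooner but leaves less time to observe the transition.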

@sumeerbhola
Collaborator

My run flaked out after about 75 min of the tpcc workload running, due to what looks like an SSH flake (log below). This was with server.max_open_transactions_per_gateway = -1.

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 4617.0s        0           48.9           33.6    436.2   1409.3   1543.5   1543.5 delivery
 4617.0s        0          425.5          334.8   1543.5   2147.5   2415.9   2550.1 newOrder
 4617.0s        0           94.9           33.6    104.9   1006.6   1208.0   1208.0 orderStatus
run_132808.222807000_n7_cockroach-workload-r: 14:47:20 cluster.go:2382: > result: _potential_ SSH flake (`ssh -vvv` log retained in artifacts/admission-control/tpcc-severe-overload/run_1/ssh/ssh_132808.262145000_n7_cockroach-workload-r.log): TRANSIENT_ERROR(ssh_problem): exit status 255

Throughput was looking ok
Screenshot 2024-04-26 at 1 34 09 PM
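
For comparing runs like this one, the per-operation lines in the workload output above can be parsed into structured records. The following is a minimal sketch, assuming each data line has exactly the nine whitespace-separated columns shown in the header. `WorkloadSample` and `parse_sample` are hypothetical names, not part of the workload tool.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    elapsed_s: float
    errors: int
    ops_inst: float
    ops_cum: float
    p50_ms: float
    p95_ms: float
    p99_ms: float
    pmax_ms: float
    op: str


def parse_sample(line: str) -> WorkloadSample:
    """Parse one data line; assumes the nine-column layout from the log above."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    elapsed, errors, *nums, op = fields
    ops_inst, ops_cum, p50, p95, p99, pmax = (float(x) for x in nums)
    # The elapsed column carries a trailing "s" (seconds), e.g. "4617.0s".
    return WorkloadSample(float(elapsed.rstrip("s")), int(errors),
                          ops_inst, ops_cum, p50, p95, p99, pmax, op)


# Data lines copied from the log above.
LOG = """\
 4617.0s        0           48.9           33.6    436.2   1409.3   1543.5   1543.5 delivery
 4617.0s        0          425.5          334.8   1543.5   2147.5   2415.9   2550.1 newOrder
 4617.0s        0           94.9           33.6    104.9   1006.6   1208.0   1208.0 orderStatus
"""

samples = [parse_sample(l) for l in LOG.splitlines() if l.strip()]

if __name__ == "__main__":
    for s in samples:
        print(f"{s.op:<12} {s.ops_inst:>7.1f} ops/s  p99={s.p99_ms:.1f} ms")
    print(f"elapsed: {samples[0].elapsed_s / 60:.1f} min")
```

The elapsed value of 4617.0s works out to roughly 77 minutes, which matches the "about 75 min" mentioned above.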
