robustness: reduce concurrency for HighTraffic scenario #18208

Open
MadhavJivrajani opened this issue Jun 20, 2024 · 10 comments

@MadhavJivrajani
Contributor

Which GitHub Action / Prow Jobs are flaking?

ci-etcd-robustness-main-amd64

Which tests are flaking?

TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize1

GitHub Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-main-amd64/1803129487966081024

Reason for failure (if possible)

Linearization validation is timing out! We should decrease the number of concurrent clients; you can see a lot of concurrent LISTs being issued:
[screenshot: a large number of concurrent LIST requests in the linearization visualization]

Anything else we need to know?

Let's try changing concurrency to 10 here:
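For reference, the proposed change amounts to something like the following (a sketch based on the HighTrafficProfile definition quoted later in this thread, not a verbatim patch):

HighTrafficProfile = Profile{
	MinimalQPS:                     200,
	MaximalQPS:                     1000,
	ClientCount:                    10, // proposed: reduced from 12 to shorten linearization validation
	MaxNonUniqueRequestConcurrency: 3,
}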

@fykaa
Contributor

fykaa commented Jun 20, 2024

I understand that reducing the number of concurrent clients might help with the timeouts. I'm curious, though: Is there a specific reason we initially set the ClientCount to 12? Could there be potential trade-offs or impacts on the test coverage and robustness if we reduce it to 10? I'm keen to understand the broader context before making changes.

@MadhavJivrajani
Contributor Author

GREAT question @fykaa 💯

Here's my understanding of it, but I'll let @serathius chime in to provide additional context that I may be missing.

I think 12 was arbitrary, I don't recall there being a specific reason for it.

I think w.r.t. coverage we should be good in terms of the properties we want to test. It would've been a concern if we chose failpoints at random per client, but we do that before we start simulating traffic, so we are good even in terms of the code we are hitting:

if s.failpoint == nil {

Having higher concurrency is beneficial because it helps us test more realistic and complex histories from a linearization perspective, as well as how etcd handles the load. However, if we can make do with 10 clients, that validation is also good enough. Plus, the upside is we can get a clearer signal out of robustness tests if we can reduce the flake rate caused by this specific timeout.

@fykaa
Contributor

fykaa commented Jun 21, 2024

Thanks, @MadhavJivrajani! That makes a lot of sense. I appreciate the explanation about the trade-offs between higher concurrency and reducing flakiness.

Also, I am trying to understand: are there any other areas in the robustness tests where we might need to adjust configurations to balance test coverage and reliability?

@MadhavJivrajani
Contributor Author

@fykaa these are most of the knobs that exist:

var (
	DefaultLeaseTTL   int64 = 7200
	RequestTimeout          = 200 * time.Millisecond
	WatchTimeout            = time.Second
	MultiOpTxnOpCount       = 4

	LowTraffic = Profile{
		MinimalQPS:                     100,
		MaximalQPS:                     200,
		ClientCount:                    8,
		MaxNonUniqueRequestConcurrency: 3,
	}
	HighTrafficProfile = Profile{
		MinimalQPS:                     200,
		MaximalQPS:                     1000,
		ClientCount:                    12,
		MaxNonUniqueRequestConcurrency: 3,
	}
)

var (
	KubernetesTraffic Traffic = kubernetesTraffic{
		averageKeyCount: 10,
		resource:        "pods",
		namespace:       "default",
		writeChoices: []random.ChoiceWeight[KubernetesRequestType]{
			{Choice: KubernetesUpdate, Weight: 85},
			{Choice: KubernetesDelete, Weight: 5},
			{Choice: KubernetesCreate, Weight: 5},
			{Choice: KubernetesCompact, Weight: 5},
		},
	}
)

Feel free to assign the PR to me once you have it up.
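To make the ClientCount knob concrete, here is a simplified, self-contained sketch (not etcd's actual traffic generator) of how a client count turns into concurrent goroutines issuing requests; the more concurrent clients, the larger and more interleaved the operation history that linearization validation has to check:

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// runTraffic is a toy stand-in for the robustness traffic generator: it starts
// clientCount goroutines that each issue fake requests until the context ends.
// More concurrent clients produce a larger, more interleaved operation history,
// which is what makes linearization validation slower.
func runTraffic(ctx context.Context, clientCount int) int {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		ops int
	)
	for i := 0; i < clientCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				default:
					time.Sleep(10 * time.Millisecond) // stand-in for a PUT/LIST against etcd
					mu.Lock()
					ops++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return ops
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("operations recorded with 10 clients:", runTraffic(ctx, 10))
}

In those terms, dropping ClientCount from 12 to 10 shrinks the number of interleaved operations the checker has to order, at the cost of somewhat less concurrency stress on etcd.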

@serathius
Member

Hey @MadhavJivrajani, I have been looking at the recent failures and found the reason for the linearization timeout. #18214 should already help a lot (it reduced linearization time from timing out to 15s on my machine). I also have a draft that should help reduce it back to 6s. If everything works well, we might not need to reduce the concurrency at all.

@serathius
Member

Or, based on today's robustness test results, I'm wrong. Sorry for the confusion, but results cannot be guaranteed until we make a change and wait a day for the CI to get confirmation.

Linearization timed out in CI for 2 tests at the 5 minute limit:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-amd64/1805934408956383232

On my local machine it succeeded after 118s and 141s. Even with my latest ideas (main...serathius:etcd:robustness-history-patching) I only improve the time for one of the cases (141s -> 1s). The other case's 118s is just lowered to 116s, which is still expected to time out in CI :(

Based on the linearization results, I'm not surprised. Compaction caused watch events to not be delivered, so we don't know a lot of the revisions.
[screenshot: linearization results]

@serathius
Member

Might try another approach of using a full-on etcd model to generate the revisions. I don't like the idea too much, as if the model has a bug this will make it much harder to debug.

@MadhavJivrajani
Contributor Author

@serathius

full-on etcd model to generate the revisions

Wouldn't this also increase the work porcupine has to do?

model has a bug this will make it much harder to debug.

+1

@fykaa do you want to send in a PR? Maybe we can try letting that bake in the CI for a couple of days to see how it works out, and then take a call on whether we should revert, keep it that way, or change something.

@serathius
Member

serathius commented Jun 28, 2024

One request: before we make changes to the concurrency, I would like to remove all other flake sources. #18240

Then we can measure whether we are happy with the timeout rate and, if not, reduce the concurrency.

@fykaa
Contributor

fykaa commented Jun 29, 2024

Thanks for the detailed insights, everyone!

@MadhavJivrajani, I will go ahead and create a PR to reduce the client concurrency from 12 to 10 as discussed.

@serathius, I understand that other flake sources need to be addressed as well. I'll raise the PR now, and once you've resolved the other flakes, we can re-run the tests to get a clearer picture of the impact.

Thanks again for all the guidance!
