robustness: reduce concurrency for HighTraffic scenario #18208
Comments
I understand that reducing the number of concurrent clients might help with the timeouts. I'm curious, though: is there a specific reason we initially set the concurrency to 12?
GREAT question @fykaa 💯 Here's my understanding of it, but I'll let @serathius chime in to provide additional context that I may be missing. I think 12 was arbitrary; I don't recall there being a specific reason for it. With regard to coverage, we should be good in terms of the properties we want to test. It would have been a concern if we chose failpoints at random per client, but we do that before we start simulating traffic, so we are good even in terms of the code we are hitting: etcd/tests/robustness/main_test.go Line 77 in 00a6097
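For illustration only, here is a minimal sketch of the point above, assuming a much simplified structure rather than the actual main_test.go code: the failpoint is selected once per scenario, before any traffic clients start, so lowering the client count does not change which failpoints get exercised. The failpoint names and the `runScenario` helper below are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Placeholder failpoint names for illustration; not etcd's real failpoint list.
var failpoints = []string{"beforeCommit", "afterCommit", "raftBeforeSave"}

// runScenario is a hypothetical helper: the failpoint is picked once, before
// traffic starts, so every client in the run hits the same failure scenario
// regardless of how many clients there are.
func runScenario(clientCount int) {
	fp := failpoints[rand.Intn(len(failpoints))]
	fmt.Printf("injecting failpoint %q, then simulating traffic with %d clients\n", fp, clientCount)

	for i := 0; i < clientCount; i++ {
		// start traffic client i (omitted); failpoint coverage is
		// unaffected by this count.
	}
}

func main() {
	runScenario(10)
}
```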
Having higher concurrency is beneficial because it helps us test more realistic and complex histories from a linearization perspective, and it exercises how etcd handles the load. However, if we can make do with 10 clients, that validation is also good enough. Plus, the upside is that we can get a clearer signal out of the robustness tests if we reduce the flake rate caused by this specific timeout.
Thanks, @MadhavJivrajani! That makes a lot of sense. I appreciate the explanation of the trade-offs between higher concurrency and reducing flakiness. Also, I'm trying to understand: are there any other areas in the robustness tests where we might need to adjust configurations to balance test coverage and reliability?
@fykaa these are most of the knobs that exist: etcd/tests/robustness/traffic/traffic.go Lines 33 to 51 in 9314ef7
etcd/tests/robustness/traffic/kubernetes.go Lines 35 to 47 in 9314ef7
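For orientation, here is a rough sketch of what such a traffic profile looks like; the struct, field names, and QPS numbers below are assumptions for illustration, not the actual definitions in tests/robustness/traffic.

```go
package main

import "fmt"

// Profile bundles the knobs a scenario can tune: how many concurrent clients
// to run and how hard they may push the cluster. Field names are hypothetical.
type Profile struct {
	Name        string
	ClientCount int // the knob discussed in this issue: 12 today, 10 proposed
	MinimalQPS  int
	MaximalQPS  int
}

// HighTraffic mirrors the idea of the HighTraffic scenario; the QPS values
// here are made up for the sketch.
var HighTraffic = Profile{
	Name:        "HighTraffic",
	ClientCount: 12,
	MinimalQPS:  200,
	MaximalQPS:  1000,
}

func main() {
	fmt.Printf("%s runs %d concurrent clients\n", HighTraffic.Name, HighTraffic.ClientCount)
}
```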
Feel free to assign the PR to me once you have it up.
Hey @MadhavJivrajani, I have been looking at the recent failures and found the reason for the linearization timeout. #18214 should already help a lot (it reduced linearization time from a timeout to 15s on my machine). I also have a draft that should help reduce it back to 6s. If everything works well, we might not need to reduce the concurrency at all.
Or, based on today's robustness test results, I'm wrong. Sorry for the confusion, but results cannot be guaranteed until we make a change and wait a day for the CI to confirm. Linearization timed out in CI in 2 tests at the 5 minute limit. On my local machine it succeeded after 118s and 141s. Even with my latest ideas (main...serathius:etcd:robustness-history-patching) I only improve the time for one of the cases (141s -> 1s). The other case only drops from 118s to 116s, which is still expected to time out on CI :( Based on the linearization results, I'm not surprised. Compaction caused watch events to not be delivered, so we don't know a lot of revisions.
I might try another approach of using the full etcd model to generate the revisions. I don't like the idea too much, because if the model has a bug, this will make it much harder to debug.
Wouldn't this also increase the work porcupine has to do?
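For context on the timeout discussed above, here is a minimal sketch of a linearizability check with a deadline, using the porcupine library as I understand its API; the register model and operations are invented for illustration, and a result of Unknown corresponds to the "linearization timed out" failure mode.

```go
package main

import (
	"fmt"
	"time"

	"github.com/anishathalye/porcupine"
)

// registerInput is a toy operation type: either a write of value, or a read.
type registerInput struct {
	write bool
	value int
}

func main() {
	// A tiny single-register model; etcd's real model is far richer.
	model := porcupine.Model{
		Init: func() interface{} { return 0 },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			st := state.(int)
			in := input.(registerInput)
			if in.write {
				return true, in.value // writes always succeed
			}
			return output.(int) == st, st // reads must observe the current state
		},
		Equal: func(a, b interface{}) bool { return a == b },
	}

	// Two overlapping operations from two clients. More concurrent clients
	// mean more overlapping operations and more interleavings to explore,
	// which is what makes the check expensive.
	ops := []porcupine.Operation{
		{ClientId: 0, Input: registerInput{write: true, value: 1}, Call: 0, Output: 0, Return: 10},
		{ClientId: 1, Input: registerInput{write: false}, Call: 5, Output: 1, Return: 15},
	}

	res, _ := porcupine.CheckOperationsVerbose(model, ops, 5*time.Minute)
	switch res {
	case porcupine.Ok:
		fmt.Println("history is linearizable")
	case porcupine.Illegal:
		fmt.Println("history is not linearizable")
	case porcupine.Unknown:
		fmt.Println("check hit the deadline (the timeout failure mode seen in CI)")
	}
}
```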
+1 @fykaa do you want to send in a PR? Maybe we can try letting that bake in the CI for a couple of days to see how it works out, and then take a call on whether we should revert, keep it that way, or change something.
One request: before we make changes to the concurrency, I would like to remove all other flake sources. #18240 Then we can measure whether we are happy with the timeout chance and, if not, reduce the concurrency.
Thanks for the detailed insights, everyone! @MadhavJivrajani, I will go ahead and create a PR to reduce the client concurrency from 12 to 10 as discussed. @serathius, I understand that other flake sources need to be addressed as well. I will raise the PR now, and once you've resolved the other flakes, we can re-run the tests to get a clearer picture of the impact. Thanks again for all the guidance!
Which Github Action / Prow Jobs are flaking?
ci-etcd-robustness-main-amd64
Which tests are flaking?
TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize1
Github Action / Prow Job link
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-main-amd64/1803129487966081024
Reason for failure (if possible)
Linearization validation is timing out! We should decrease the number of concurrent clients; you can see a lot of concurrent LISTs being issued:
Anything else we need to know?
Let's try changing concurrency to 10 here:
etcd/tests/robustness/traffic/traffic.go
Line 48 in 00a6097
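A sketch of the kind of one-line change being proposed; the Profile type and ClientCount field reuse the hypothetical names from the sketch earlier in the thread, not the real code at traffic.go line 48.

```go
package main

import "fmt"

// Profile is the same hypothetical stand-in used above; the real knob lives
// in tests/robustness/traffic/traffic.go.
type Profile struct {
	Name        string
	ClientCount int
}

func main() {
	p := Profile{
		Name:        "HighTraffic",
		ClientCount: 10, // was 12; fewer clients means shorter histories for porcupine to check
	}
	fmt.Printf("%s: %d concurrent clients\n", p.Name, p.ClientCount)
}
```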