robustness: reduce concurrency for HighTraffic scenario #18208
Comments
I understand that reducing the number of concurrent clients might help with the timeouts. I'm curious, though: is there a specific reason we initially set the concurrency to 12?
GREAT question @fykaa 💯 Here's my understanding of it, but I'll let @serathius chime in to provide additional context that I may be missing. I think 12 was arbitrary; I don't recall there being a specific reason for it. With regard to coverage, we should be good in terms of the properties we want to test. It would have been a concern if we chose failpoints at random per client, but we do that before we start simulating traffic, so we are good even in terms of the code we are hitting: etcd/tests/robustness/main_test.go Line 77 in 00a6097
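For illustration only, here is a minimal sketch of the point above, assuming a much simplified structure rather than the actual main_test.go code: the failpoint is selected once per scenario, before any traffic clients start, so lowering the client count does not change which failpoints get exercised. The failpoint names and the `runScenario` helper below are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Placeholder failpoint names for illustration; not etcd's real failpoint list.
var failpoints = []string{"beforeCommit", "afterCommit", "raftBeforeSave"}

// runScenario is a hypothetical helper: the failpoint is picked once, before
// traffic starts, so every client in the run hits the same failure scenario
// regardless of how many clients there are.
func runScenario(clientCount int) {
	fp := failpoints[rand.Intn(len(failpoints))]
	fmt.Printf("injecting failpoint %q, then simulating traffic with %d clients\n", fp, clientCount)

	for i := 0; i < clientCount; i++ {
		// start traffic client i (omitted); failpoint coverage is
		// unaffected by this count.
	}
}

func main() {
	runScenario(10)
}
```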
Having higher concurrency is beneficial because it helps us test more realistic and complex histories from a linearization perspective, and it exercises how etcd handles the load. However, if we can make do with 10 clients, that validation is also good enough. Plus, the upside is that we can get a clearer signal out of the robustness tests if we reduce the flake rate caused by this specific timeout.
Thanks, @MadhavJivrajani! That makes a lot of sense. I appreciate the explanation of the trade-offs between higher concurrency and reducing flakiness. Also, I'm trying to understand: are there any other areas in the robustness tests where we might need to adjust configurations to balance test coverage and reliability?
@fykaa these are most of the knobs that exist: etcd/tests/robustness/traffic/traffic.go Lines 33 to 51 in 9314ef7
etcd/tests/robustness/traffic/kubernetes.go Lines 35 to 47 in 9314ef7
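For orientation, here is a rough sketch of what such a traffic profile looks like; the struct, field names, and QPS numbers below are assumptions for illustration, not the actual definitions in tests/robustness/traffic.

```go
package main

import "fmt"

// Profile bundles the knobs a scenario can tune: how many concurrent clients
// to run and how hard they may push the cluster. Field names are hypothetical.
type Profile struct {
	Name        string
	ClientCount int // the knob discussed in this issue: 12 today, 10 proposed
	MinimalQPS  int
	MaximalQPS  int
}

// HighTraffic mirrors the idea of the HighTraffic scenario; the QPS values
// here are made up for the sketch.
var HighTraffic = Profile{
	Name:        "HighTraffic",
	ClientCount: 12,
	MinimalQPS:  200,
	MaximalQPS:  1000,
}

func main() {
	fmt.Printf("%s runs %d concurrent clients\n", HighTraffic.Name, HighTraffic.ClientCount)
}
```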
Feel free to assign the PR to me once you have it up.
Hey @MadhavJivrajani, I have been looking at the recent failures and found the reason for the linearization timeout. #18214 should already help a lot (it reduced linearization time from a timeout to 15s on my machine). I also have a draft that should help reduce it back to 6s. If everything works well, we might not need to reduce the concurrency at all.
Or, based on today's robustness test results, I'm wrong. Sorry for the confusion, but results cannot be guaranteed until we make a change and wait a day for the CI to confirm. Linearization timed out in CI in 2 tests at the 5 minute limit. On my local machine it succeeded after 118s and 141s. Even with my latest ideas (main...serathius:etcd:robustness-history-patching) I only improve the time for one of the cases (141s -> 1s). The other case only drops from 118s to 116s, which is still expected to time out on CI :( Based on the linearization results, I'm not surprised. Compaction caused watch events to not be delivered, so we don't know a lot of revisions.
I might try another approach of using the full etcd model to generate the revisions. I don't like the idea too much, because if the model has a bug, this will make it much harder to debug.
Wouldn't this also increase the work porcupine has to do?
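For context on the timeout discussed above, here is a minimal sketch of a linearizability check with a deadline, using the porcupine library as I understand its API; the register model and operations are invented for illustration, and a result of Unknown corresponds to the "linearization timed out" failure mode.

```go
package main

import (
	"fmt"
	"time"

	"github.com/anishathalye/porcupine"
)

// registerInput is a toy operation type: either a write of value, or a read.
type registerInput struct {
	write bool
	value int
}

func main() {
	// A tiny single-register model; etcd's real model is far richer.
	model := porcupine.Model{
		Init: func() interface{} { return 0 },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			st := state.(int)
			in := input.(registerInput)
			if in.write {
				return true, in.value // writes always succeed
			}
			return output.(int) == st, st // reads must observe the current state
		},
		Equal: func(a, b interface{}) bool { return a == b },
	}

	// Two overlapping operations from two clients. More concurrent clients
	// mean more overlapping operations and more interleavings to explore,
	// which is what makes the check expensive.
	ops := []porcupine.Operation{
		{ClientId: 0, Input: registerInput{write: true, value: 1}, Call: 0, Output: 0, Return: 10},
		{ClientId: 1, Input: registerInput{write: false}, Call: 5, Output: 1, Return: 15},
	}

	res, _ := porcupine.CheckOperationsVerbose(model, ops, 5*time.Minute)
	switch res {
	case porcupine.Ok:
		fmt.Println("history is linearizable")
	case porcupine.Illegal:
		fmt.Println("history is not linearizable")
	case porcupine.Unknown:
		fmt.Println("check hit the deadline (the timeout failure mode seen in CI)")
	}
}
```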
+1 @fykaa do you want to send in a PR? Maybe we can try letting that bake in the CI for a couple of days to see how it works out, and then take a call on whether we should revert, keep it that way, or change something.
One request: before we make changes to the concurrency, I would like to remove all other flake sources. #18240 Then we can measure whether we are happy with the timeout chance and, if not, reduce the concurrency.
Thanks for the detailed insights, everyone! @MadhavJivrajani, I will go ahead and create a PR to reduce the client concurrency from 12 to 10 as discussed. @serathius, I understand that other flake sources need to be addressed as well. I will raise the PR now, and once you've resolved the other flakes, we can re-run the tests to get a clearer picture of the impact. Thanks again for all the guidance!
Which Github Action / Prow Jobs are flaking?
ci-etcd-robustness-main-amd64
Which tests are flaking?
TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize1
Github Action / Prow Job link
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-main-amd64/1803129487966081024
Reason for failure (if possible)
Linearization validation is timing out! We should decrease the number of concurrent clients; you can see a lot of concurrent LISTs being issued:
Anything else we need to know?
Let's try changing concurrency to 10 here:
etcd/tests/robustness/traffic/traffic.go
Line 48 in 00a6097
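A sketch of the kind of one-line change being proposed; the Profile type and ClientCount field reuse the hypothetical names from the sketch earlier in the thread, not the real code at traffic.go line 48.

```go
package main

import "fmt"

// Profile is the same hypothetical stand-in used above; the real knob lives
// in tests/robustness/traffic/traffic.go.
type Profile struct {
	Name        string
	ClientCount int
}

func main() {
	p := Profile{
		Name:        "HighTraffic",
		ClientCount: 10, // was 12; fewer clients means shorter histories for porcupine to check
	}
	fmt.Printf("%s: %d concurrent clients\n", p.Name, p.ClientCount)
}
```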