
[BUG] create job times out when resources are not created after repeated errors #399

Closed
afcollins opened this issue Aug 1, 2023 · 5 comments

@afcollins
Contributor

Bug Description

If there are cluster issues, create.go will log an error and retry the creation.
However, if it fails enough times, it appears to stop retrying, and then waiters.go stalls indefinitely because the resources were never actually created.

Output of kube-burner version

1.7.2@910b28640fb28fbee93c923caf43e52ea4895fae

To Reproduce

Steps to reproduce the behavior:

  1. Run kube-burner ocp cluster-density-v2 --qps=80 --burst=80 --iterations 1800 on a 120-node cluster.

Expected behavior

The create job should keep retrying object creation until the job timeout is reached.

Additional context

I am raising this issue now that we have seen it in two completely different environments (a 120-node self-managed AWS cluster and a 500-node HCP cluster).

afcollins added the bug label on Aug 1, 2023
@afcollins
Contributor Author

Added the kube-burner output from when I saw this, along with a couple of notes, here: https://gist.github.com/afcollins/e72a8e937f9ee8b8ff4fec911715db06

@vishnuchalla
Collaborator

Acknowledged! Duplicate Issue: #381

@vishnuchalla
Collaborator

vishnuchalla commented Aug 6, 2023

A few observations here:

  • Those client-side throttling logs are entirely caused by server-side issues (i.e. the cluster under test). If the cluster goes down, requests get throttled on the client side (i.e. kube-burner) and we start seeing logs like the one below.
    I0712 20:57:51.308618 2717685 request.go:696] Waited for 1.984728666s due to client-side throttling, not priority and fairness, request: GET:https://api.perf-985.ufkf.s1.devshift.org:6443/apis/apps/v1/namespaces/cluster-density-v2-1177/deployments
    These logs are not under our control; they come from our upstream library, client-go. I think we should keep them as is, since they tell us about the responsiveness of the cluster under test.

  • As for the retries: yes, we have steps=3 hard-coded, so that many retries are performed with exponential backoff. Related snippet:

// RetryWithExponentialBackOff a utility for retrying the given function with exponential backoff.
func RetryWithExponentialBackOff(fn wait.ConditionFunc, duration time.Duration, factor, jitter float64, steps int) error {
	backoff := wait.Backoff{
		Duration: duration, // Value is 1 second
		Factor:   factor, // Value is 3
		Jitter:   jitter, // Value is 0
		Steps:    steps, // Value is 3
	}
	return wait.ExponentialBackoff(backoff, fn)
}

It follows the equation below to calculate the retry delay for each step, using the values above:

nextRetry = duration * factor^stepCount * (1 ± jitter)

So in total there are 3 retries for each creation request, with delays of 1, 3, and 9 seconds respectively (these values can vary slightly depending on other factors). Since we want the retries to keep happening until the job timeout, we can calculate the number of steps dynamically so that the retries span the whole timeout. Our new equation would be

timeout = duration * factor^stepCount * (1 ± jitter)

Simplifying, the number of steps comes out as follows, which makes the app keep retrying until the job times out (a quick sketch of this calculation follows below):

stepCount = log(timeout / (duration * (1 ± jitter))) / log(factor)
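
To make this concrete, here is a minimal, self-contained sketch (not the actual kube-burner code) of deriving the step count from the job timeout and feeding it into a helper with the same signature as RetryWithExponentialBackOff above. The jobTimeout value and the succeed-on-the-third-attempt condition are hypothetical placeholders.

package main

import (
	"fmt"
	"math"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Mirrors the helper quoted above, just so this sketch compiles on its own.
func RetryWithExponentialBackOff(fn wait.ConditionFunc, duration time.Duration, factor, jitter float64, steps int) error {
	return wait.ExponentialBackoff(wait.Backoff{Duration: duration, Factor: factor, Jitter: jitter, Steps: steps}, fn)
}

func main() {
	duration := 1 * time.Second // same values as in the snippet above
	factor := 3.0
	jitter := 0.0
	jobTimeout := 2 * time.Hour // hypothetical job timeout

	// stepCount = log(timeout / (duration * (1 + jitter))) / log(factor), rounded up
	steps := int(math.Ceil(math.Log(jobTimeout.Seconds()/(duration.Seconds()*(1+jitter))) / math.Log(factor)))
	fmt.Println("steps:", steps)

	attempts := 0
	err := RetryWithExponentialBackOff(func() (bool, error) {
		attempts++
		// Placeholder for the real creation call; pretend it succeeds on the 3rd attempt.
		// If every attempt failed, wait.ErrWaitTimeout would be returned once the steps run out.
		return attempts >= 3, nil
	}, duration, factor, jitter, steps)
	fmt.Println("attempts:", attempts, "err:", err)
}

With duration = 1s, factor = 3, and jitter = 0, a 2-hour timeout works out to ceil(log(7200) / log(3)) = 9 steps.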

Please feel free to comment on this approach, or to suggest any other thoughts/opinions. Thank you!

@afcollins
Contributor Author

afcollins commented Aug 8, 2023

I see. Great work figuring out the formula!

As I read about it, I wonder whether we need to use ExponentialBackoff at all, or whether something like Until would make more sense here?

I am testing the PR changes now.

@vishnuchalla
Collaborator

> I see. Great work figuring out the formula!
>
> As I read about it, I wonder whether we need to use ExponentialBackoff at all, or whether something like Until would make more sense here?
>
> I am testing the PR changes now.

Yes, we can do that, but keeping the retries exponential would be better IMHO: once a failure has occurred, it is nice to wait an increasing amount of time before retrying again. For that, the built-in functions are a better fit than writing our own custom logic! A rough comparison of the two approaches is sketched below.
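
For illustration only, here is a rough sketch of the two options being weighed, fixed-interval polling versus exponential backoff, using existing helpers from k8s.io/apimachinery/pkg/util/wait. The tryCreate function is a hypothetical stand-in for the real creation call, and the intervals and timeout are arbitrary.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryCreate is a hypothetical stand-in for the real resource-creation call.
func tryCreate() (bool, error) {
	return false, nil // pretend the cluster is still unhealthy
}

func main() {
	// Option 1: fixed-interval polling until a timeout ("Until"-style): the API server
	// is hit at the same rate on every attempt, no matter how unhealthy it is.
	_ = wait.PollImmediate(1*time.Second, 30*time.Second, tryCreate)

	// Option 2: exponential backoff (current approach): delays grow 1s, 3s, 9s, ...,
	// giving a struggling cluster progressively more breathing room between attempts.
	backoff := wait.Backoff{Duration: 1 * time.Second, Factor: 3, Jitter: 0, Steps: 3}
	err := wait.ExponentialBackoff(backoff, tryCreate)
	fmt.Println(err) // wait.ErrWaitTimeout after the steps are exhausted
}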
