
[BUG] create job times out when resources are not created after repeated errors #399

Closed
afcollins opened this issue Aug 1, 2023 · 5 comments

@afcollins
Contributor

Bug Description

If there are cluster issues, create.go will log an error and retry the creation.
However, if it fails enough times, it appears to stop retrying, and then waiters.go stalls indefinitely because the resources were never actually created.

Output of kube-burner version

1.7.2@910b28640fb28fbee93c923caf43e52ea4895fae

To Reproduce

Steps to reproduce the behavior:

  1. Run kube-burner ocp cluster-density-v2 --qps=80 --burst=80 --iterations 1800 on a 120-node cluster.

Expected behavior

The create job should keep retrying object creation until the job timeout is reached.

Additional context

I am raising this issue now that we have seen it in two completely different environments (a 120-node self-managed AWS cluster and a 500-node HCP cluster).

afcollins added the bug label on Aug 1, 2023
@afcollins
Contributor Author

Added the kube-burner output from when I saw this, along with a couple of notes, here: https://gist.github.com/afcollins/e72a8e937f9ee8b8ff4fec911715db06

@vishnuchalla
Collaborator

Acknowledged! Duplicate Issue: #381

@vishnuchalla
Collaborator

vishnuchalla commented Aug 6, 2023

A few observations here:

  • Those client-side throttling logs are entirely caused by server-side issues (i.e. the cluster under test). If the cluster goes down, requests get throttled on the client side (i.e. kube-burner) and we start seeing logs like the one below.
    I0712 20:57:51.308618 2717685 request.go:696] Waited for 1.984728666s due to client-side throttling, not priority and fairness, request: GET:https://api.perf-985.ufkf.s1.devshift.org:6443/apis/apps/v1/namespaces/cluster-density-v2-1177/deployments
    These logs are not under our control; they come from our upstream library, client-go. I think we should keep them as is, since they tell us about the responsiveness of the cluster under test.

  • As for the retries: yes, we have steps=3 hard-coded, so that many retries are performed with exponential backoff. Related snippet:

// RetryWithExponentialBackOff a utility for retrying the given function with exponential backoff.
func RetryWithExponentialBackOff(fn wait.ConditionFunc, duration time.Duration, factor, jitter float64, steps int) error {
	backoff := wait.Backoff{
		Duration: duration, // Value is 1 second
		Factor:   factor, // Value is 3
		Jitter:   jitter, // Value is 0
		Steps:    steps, // Value is 3
	}
	return wait.ExponentialBackoff(backoff, fn)
}

It follows the equation below to calculate the retry delay for each step, using the values above:

nextRetry = duration * factor^stepCount * (1 ± jitter)

So in total there are 3 retries for each creation request, with delays of 1, 3, and 9 seconds respectively (these values can vary slightly depending on other factors). Since we want the retries to keep happening until the job timeout, we can calculate the number of steps dynamically so that the retries span the whole timeout. Our new equation would be

timeout = duration * factor^stepCount * (1 ± jitter)

Simplifying, the number of steps comes out as follows, which makes the app keep retrying until the job times out (a quick sketch of this calculation follows below):

stepCount = log(timeout / (duration * (1 ± jitter))) / log(factor)
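
To make this concrete, here is a minimal, self-contained sketch (not the actual kube-burner code) of deriving the step count from the job timeout and feeding it into a helper with the same signature as RetryWithExponentialBackOff above. The jobTimeout value and the succeed-on-the-third-attempt condition are hypothetical placeholders.

package main

import (
	"fmt"
	"math"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Mirrors the helper quoted above, just so this sketch compiles on its own.
func RetryWithExponentialBackOff(fn wait.ConditionFunc, duration time.Duration, factor, jitter float64, steps int) error {
	return wait.ExponentialBackoff(wait.Backoff{Duration: duration, Factor: factor, Jitter: jitter, Steps: steps}, fn)
}

func main() {
	duration := 1 * time.Second // same values as in the snippet above
	factor := 3.0
	jitter := 0.0
	jobTimeout := 2 * time.Hour // hypothetical job timeout

	// stepCount = log(timeout / (duration * (1 + jitter))) / log(factor), rounded up
	steps := int(math.Ceil(math.Log(jobTimeout.Seconds()/(duration.Seconds()*(1+jitter))) / math.Log(factor)))
	fmt.Println("steps:", steps)

	attempts := 0
	err := RetryWithExponentialBackOff(func() (bool, error) {
		attempts++
		// Placeholder for the real creation call; pretend it succeeds on the 3rd attempt.
		// If every attempt failed, wait.ErrWaitTimeout would be returned once the steps run out.
		return attempts >= 3, nil
	}, duration, factor, jitter, steps)
	fmt.Println("attempts:", attempts, "err:", err)
}

With duration = 1s, factor = 3, and jitter = 0, a 2-hour timeout works out to ceil(log(7200) / log(3)) = 9 steps.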

Please feel free to comment on this approach, or to suggest any other thoughts/opinions. Thank you!

@afcollins
Contributor Author

afcollins commented Aug 8, 2023

I see. Great work figuring out the formula!

As I read about it, I wonder whether we need to use ExponentialBackoff at all, or whether something like Until would make more sense here?

I am testing the PR changes now.

@vishnuchalla
Collaborator

> I see. Great work figuring out the formula!
>
> As I read about it, I wonder whether we need to use ExponentialBackoff at all, or whether something like Until would make more sense here?
>
> I am testing the PR changes now.

Yes, we can do that, but keeping the retries exponential would be better IMHO: once a failure has occurred, it is nice to wait an increasing amount of time before retrying again. For that, the built-in functions are a better fit than writing our own custom logic! A rough comparison of the two approaches is sketched below.
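
For illustration only, here is a rough sketch of the two options being weighed, fixed-interval polling versus exponential backoff, using existing helpers from k8s.io/apimachinery/pkg/util/wait. The tryCreate function is a hypothetical stand-in for the real creation call, and the intervals and timeout are arbitrary.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryCreate is a hypothetical stand-in for the real resource-creation call.
func tryCreate() (bool, error) {
	return false, nil // pretend the cluster is still unhealthy
}

func main() {
	// Option 1: fixed-interval polling until a timeout ("Until"-style): the API server
	// is hit at the same rate on every attempt, no matter how unhealthy it is.
	_ = wait.PollImmediate(1*time.Second, 30*time.Second, tryCreate)

	// Option 2: exponential backoff (current approach): delays grow 1s, 3s, 9s, ...,
	// giving a struggling cluster progressively more breathing room between attempts.
	backoff := wait.Backoff{Duration: 1 * time.Second, Factor: 3, Jitter: 0, Steps: 3}
	err := wait.ExponentialBackoff(backoff, tryCreate)
	fmt.Println(err) // wait.ErrWaitTimeout after the steps are exhausted
}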
