[BUG] create job times out when resources are not created after repeated errors #399
Comments
Added kube-burner output when I saw this and a couple of notes here: https://gist.github.com/afcollins/e72a8e937f9ee8b8ff4fec911715db06
Acknowledged! Duplicate Issue: #381
A few observations here:
and it follows the below equation to calculate the retry delay for each step using the above values
So in total there are 3 retries for each creation request, with delays of 1, 3, and 9 seconds respectively (these values may vary somewhat depending on other factors). Since we want the retries to continue until the job timeout, we can calculate the number of steps dynamically so that retrying keeps going until the timeout. So our new equation would be
Simplifying it gives the number of steps, as follows, that keeps the app retrying until the job times out.
Please feel free to comment on this approach or to suggest any other thoughts/opinions you may have. Thank you!
I see. Great work figuring out the formula! As I read about it, I wonder if we need to use ExponentialBackoff at all, or whether something like Until would make more sense here? I am testing the PR changes now.
Yes, we can do that, but keeping the delays exponential would be better IMHO: once a failure has occurred, it is nice to wait for an increasing amount of time before retrying again. To achieve that, the built-in functions are a better choice than writing our own custom logic!
Bug Description
If there are cluster issues, create.go will error and retry creation.
However, if it fails enough times, it appears to stop retrying, and then waiters.go stalls indefinitely because the resources were never actually created.
Output of kube-burner version
1.7.2@910b28640fb28fbee93c923caf43e52ea4895fae
To Reproduce
Steps to reproduce the behavior:
kube-burner ocp cluster-density-v2 --qps=80 --burst=80 --iterations 1800
on a 120-node cluster.
Expected behavior
The create job should keep retrying the creation of all objects until the timeout is reached.
Additional context
I am raising this issue now because we have seen it in two completely different environments (a 120-node self-managed AWS cluster and a 500-node HCP cluster).