Slow / Unresponsive Kubernetes API #4339
Comments
There is code in place to mitigate update failures caused by high load on the Kubernetes API (5 attempts over 500 ms), so I believe you must be under extreme load. I think this can be improved, however (say, retrying every 1 s with exponential back-off over 10 s).
Thanks a lot for that @alexec!
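The mitigation described above — a bounded retry loop with exponential back-off — can be sketched as follows. This is an illustrative Python sketch, not Argo's actual controller code (which is Go); the function name and parameters are hypothetical, with the defaults matching the suggested 1 s initial delay:

```python
import time

def retry_with_backoff(op, attempts=5, initial=1.0, factor=2.0, sleep=time.sleep):
    """Call op() until it succeeds, backing off exponentially between attempts.

    Hypothetical helper for illustration only; Argo's current behaviour is
    described above as 5 attempts over 500 ms.
    """
    delay = initial
    last_exc = None
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:  # in practice, catch only retryable API errors
            last_exc = exc
            if attempt < attempts - 1:
                sleep(delay)       # 1 s, then 2 s, then 4 s, ... (~10 s total over 5 tries)
                delay *= factor
    raise last_exc
```

The `sleep` parameter is injected so the schedule can be tested without real delays; in a controller you would also cap `delay` and abort early on non-retryable errors.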
FYI EKS Cluster load:
…roj#4340) Signed-off-by: Alex Capras <alexcapras@gmail.com>
Summary
Not sure if this is Argo or AWS Kubernetes related... I would label it more as a "Problem" than a bug.
If we run ~300 Argo workflows in parallel (about 8 non-parallel steps each) on our EKS cluster, which scales to ~70 nodes, the Kubernetes API starts to get slow and, in the worst case, we experience API timeouts.
The worst thing is that it also makes the workflows fail with different error messages:
We have scaled up the workflow-controller with:
--workflow-workers 1024 --pod-workers 64 --qps 200 --burst 50
We would expect Argo / Kubernetes to handle such a workload without breaking.
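For reference, flags like these are typically passed as container args on the workflow-controller Deployment. A sketch of that fragment (the container name matches the default install, but verify against your own manifests):

```yaml
# Illustrative workflow-controller Deployment fragment; adjust to your install.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          args:
            - --workflow-workers=1024
            - --pod-workers=64
            - --qps=200
            - --burst=50
```

As an aside, `--burst` is the client-go rate limiter's bucket size and is often set at or above `--qps`; here it is lower, which limits how many requests can be sent in a short spike.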
Diagnostics
We are using AWS EKS 1.17 and Argo 2.11.5.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.