
Handle idle timeouts more gracefully #4524

Closed
dcherman opened this issue Nov 13, 2020 · 7 comments · Fixed by #4535

@dcherman (Member)

Summary

When you're using an ingress controller that has an idle timeout configured, it's possible that no events occur within that period, which results in the UI throwing an error because the workflow-events stream has been closed. In the case of my cluster, I use ingress-nginx, which has a default idle timeout of 60s.

Since these streams are expected to be very long-lived connections, we should consider one of the following:

  1. Send a piece of data periodically if none has been sent. This is not optimal IMO, since we'd need to filter it out on the client, and it still may not solve the problem if the user configures an idle timeout shorter than the interval at which we send data.

  2. Retry the connection on the front-end at least once (see the sketch after this list). If the connection is successfully re-established, then it's a candidate to be retried again if/when the error recurs. If the connection fails to be re-established, throw our existing error, since that might indicate a loss of network connectivity or a problem with the argo-server pod.

In both cases, we should also provide a nicer way to retry when this occurs rather than reloading the page.
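
A minimal sketch of option 2, assuming the UI consumes the workflow-events stream as server-sent events via a plain `EventSource` (the real UI has its own streaming layer; `streamUrl`, `onEvent`, and `onFatalError` are illustrative names, not existing Argo UI APIs):

```typescript
// Illustrative only: reconnect silently once before surfacing the existing error.
function watchWorkflowEvents(
    streamUrl: string,
    onEvent: (e: MessageEvent) => void,
    onFatalError: (err: Event) => void
): void {
    let retried = false;

    const connect = () => {
        const source = new EventSource(streamUrl);

        // A successful (re-)connection makes us eligible to retry again later.
        source.onopen = () => (retried = false);

        source.onmessage = e => onEvent(e);

        source.onerror = err => {
            source.close();
            if (!retried) {
                retried = true;
                connect(); // silent retry, e.g. after an ingress idle timeout
            } else {
                onFatalError(err); // likely a real outage: show the existing error UI
            }
        };
    };

    connect();
}
```

The key point is that a single clean reconnect covers the common low-traffic case, while a second consecutive failure still falls through to the current error handling.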

Diagnostics

What Kubernetes provider are you using?

EKS 1.17 with Ingress NGINX

What version of Argo Workflows are you running?

v2.11.1


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec (Contributor) commented Nov 13, 2020

I think we have addressed some of this in v2.11.7 and v2.12. Could you try latest?

@dcherman (Member, Author)

Yep, looks like you're right - v2.12.0-rc2 retries automatically after 10s and provides a way to explicitly reload.

I wonder if we can improve the UX on that, though, since for low-traffic instances this might be a relatively common error to run into. Maybe we can retry seamlessly without interrupting the UI, or make the error not completely replace the workflow listing?

@alexec (Contributor) commented Nov 16, 2020

Do you want to suggest something?

@dcherman (Member, Author)

Sure - I think a pretty straightforward change to make when a disconnect occurs would be:

  • Don't remove the existing workflow listing, since the existing data is all still valid. For users who don't necessarily understand the error, the current behaviour causes them to reach out thinking that Argo is broken, when it's just a normally occurring situation caused by the ingress controller's idle timeout on low-traffic instances.
  • Re-word the error message from Unknown error to something like Disconnected from workflow streaming. Reconnecting in 10s, which gives a better idea of what is currently non-functional and what action will be taken when the timer completes (a rough sketch of both changes follows below).
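
To make that concrete, here is a rough sketch of the state handling, assuming a list component that keeps the listing and a transient banner in its state (all names here are illustrative, not the actual Argo UI code):

```typescript
// Illustrative only: on a stream error, keep the last known-good listing and
// show a countdown banner instead of replacing the page with "Unknown error".
const RECONNECT_DELAY_SECONDS = 10;

interface ListState {
    workflows: string[]; // last known-good listing; still valid after a disconnect
    banner?: string;     // transient message rendered above the listing
}

function onStreamError(state: ListState, reconnect: () => void): ListState {
    setTimeout(reconnect, RECONNECT_DELAY_SECONDS * 1000);
    return {
        ...state, // note: workflows are left untouched
        banner: `Disconnected from workflow streaming. Reconnecting in ${RECONNECT_DELAY_SECONDS}s`,
    };
}

function onStreamReconnected(state: ListState): ListState {
    return {...state, banner: undefined}; // clear the banner; the listing never went away
}
```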

@alexec (Contributor) commented Nov 16, 2020

This makes sense. Each page is different and needs different disconnect logic.

@alexec alexec linked a pull request Nov 16, 2020 that will close this issue
@alexec alexec reopened this Nov 16, 2020
@alexec alexec self-assigned this Nov 16, 2020
alexec added a commit to alexec/argo-workflows that referenced this issue Nov 16, 2020
@alexec (Contributor) commented Nov 17, 2020

I'm fixing this in the v3 UI.

@alexec alexec closed this as completed Nov 30, 2020
@pires commented Feb 9, 2021

@alexec can you please provide a link to the PR/issue to track the fix?
