
Handle idle timeouts more gracefully #4524

Closed
dcherman opened this issue Nov 13, 2020 · 7 comments · Fixed by #4535

@dcherman (Member)

Summary

When you're using an ingress controller that has an idle timeout configured, it's possible that no events occur within that period, which results in the UI throwing an error because the workflow-events stream has been closed. In the case of my cluster, I use ingress-nginx, which has a default idle timeout of 60s.

Since these streams are expected to be very long-lived connections, we should consider one of the following:

  1. Send a piece of data periodically if none has been sent. This is not optimal IMO, since we'd need to filter it out on the client, and it still may not solve the problem if the user configures an idle timeout shorter than the interval at which we send data.

  2. Retry the connection on the front-end at least once (see the sketch after this list). If the connection is successfully re-established, then it's a candidate to be retried again if/when the error recurs. If the connection fails to be re-established, throw our existing error, since that might indicate a loss of network connectivity or a problem with the argo-server pod.

In both cases, we should also provide a nicer way to retry when this occurs rather than reloading the page.
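
A minimal sketch of option 2, assuming the UI consumes the workflow-events stream as server-sent events via a plain `EventSource` (the real UI has its own streaming layer; `streamUrl`, `onEvent`, and `onFatalError` are illustrative names, not existing Argo UI APIs):

```typescript
// Illustrative only: reconnect silently once before surfacing the existing error.
function watchWorkflowEvents(
    streamUrl: string,
    onEvent: (e: MessageEvent) => void,
    onFatalError: (err: Event) => void
): void {
    let retried = false;

    const connect = () => {
        const source = new EventSource(streamUrl);

        // A successful (re-)connection makes us eligible to retry again later.
        source.onopen = () => (retried = false);

        source.onmessage = e => onEvent(e);

        source.onerror = err => {
            source.close();
            if (!retried) {
                retried = true;
                connect(); // silent retry, e.g. after an ingress idle timeout
            } else {
                onFatalError(err); // likely a real outage: show the existing error UI
            }
        };
    };

    connect();
}
```

The key point is that a single clean reconnect covers the common low-traffic case, while a second consecutive failure still falls through to the current error handling.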

Diagnostics

What Kubernetes provider are you using?

EKS 1.17 with Ingress NGINX

What version of Argo Workflows are you running?

v2.11.1


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec (Contributor) commented Nov 13, 2020

I think we have addressed some of this in v2.11.7 and v2.12. Could you try latest?

@dcherman (Member, Author)

Yep, looks like you're right - v2.12.0-rc2 retries automatically after 10s and provides a way to explicitly reload.

I wonder if we can improve the UX on that, though, since for low-traffic instances this might be a relatively common error to run into. Maybe we can retry seamlessly without interrupting the UI, or make the error not completely replace the workflow listing?

@alexec (Contributor) commented Nov 16, 2020

Do you want to suggest something?

@dcherman (Member, Author)

Sure - I think a pretty straightforward change to make when a disconnect occurs would be:

  • Don't remove the existing workflow listing, since the existing data is all still valid. For users who don't necessarily understand the error, the current behaviour causes them to reach out thinking that Argo is broken, when it's just a normally occurring situation caused by the ingress controller's idle timeout on low-traffic instances.
  • Re-word the error message from Unknown error to something like Disconnected from workflow streaming. Reconnecting in 10s, which gives a better idea of what is currently non-functional and what action will be taken when the timer completes (a rough sketch of both changes follows below).
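
To make that concrete, here is a rough sketch of the state handling, assuming a list component that keeps the listing and a transient banner in its state (all names here are illustrative, not the actual Argo UI code):

```typescript
// Illustrative only: on a stream error, keep the last known-good listing and
// show a countdown banner instead of replacing the page with "Unknown error".
const RECONNECT_DELAY_SECONDS = 10;

interface ListState {
    workflows: string[]; // last known-good listing; still valid after a disconnect
    banner?: string;     // transient message rendered above the listing
}

function onStreamError(state: ListState, reconnect: () => void): ListState {
    setTimeout(reconnect, RECONNECT_DELAY_SECONDS * 1000);
    return {
        ...state, // note: workflows are left untouched
        banner: `Disconnected from workflow streaming. Reconnecting in ${RECONNECT_DELAY_SECONDS}s`,
    };
}

function onStreamReconnected(state: ListState): ListState {
    return {...state, banner: undefined}; // clear the banner; the listing never went away
}
```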

@alexec (Contributor) commented Nov 16, 2020

This makes sense. Each page is different and needs different disconnect logic.

@alexec alexec linked a pull request Nov 16, 2020 that will close this issue
@alexec alexec reopened this Nov 16, 2020
@alexec alexec self-assigned this Nov 16, 2020
alexec added a commit to alexec/argo-workflows that referenced this issue Nov 16, 2020
@alexec (Contributor) commented Nov 17, 2020

I'm fixing this in the v3 UI.

@alexec alexec closed this as completed Nov 30, 2020
@pires commented Feb 9, 2021

@alexec can you please provide a link to the PR/issue to track the fix?
