
View Workflow Page Not Receiving Progress Updates #4301

Closed
terev opened this issue Oct 16, 2020 · 20 comments · Fixed by #4378
terev (Contributor) commented Oct 16, 2020

Summary

It seems like the view workflow page is no longer continuously receiving events. On 2.10.2 this worked: the progress of the workflow could be viewed without refreshing. The SSE endpoint used to retrieve these events, /workflow-events, seems to be called once; then, after the request times out, a new watch request is not started. This may not be the only affected SSE endpoint.

Update:
This also appears to occur on the workflow list page, where the SSE request eventually terminates with a 504 Gateway Timeout.
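The expected behaviour, for reference: when one SSE request ends (e.g. because a proxy times it out), the client should open a new watch rather than go silent. A minimal TypeScript sketch of that restart loop, with a fake `connect` standing in for the real EventSource (all names here are illustrative, not Argo's actual code):

```typescript
// Illustrative sketch, not Argo's actual code: when one SSE request ends
// (proxy timeout, network drop), re-establish the watch instead of freezing.
type StreamHandlers = { onItem: (x: string) => void; onEnd: () => void };

// connect() stands in for opening one EventSource request; it must call
// onEnd when the server or proxy closes the stream.
function watchWithRestart(
  connect: (h: StreamHandlers) => void,
  maxRestarts: number,
  onItem: (x: string) => void,
): number {
  let restarts = 0;
  const open = () =>
    connect({
      onItem,
      onEnd: () => {
        if (restarts < maxRestarts) {
          restarts++;
          open(); // re-establish the watch instead of going silent
        }
      },
    });
  open();
  return restarts;
}
```

With this shape, a proxy-imposed 60-second timeout costs only a reconnect instead of leaving the page static.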

Diagnostics

GKE 1.15

On 2.11.4 I've noticed that the workflow graph isn't updating anymore. I upgraded to 2.11.4 from 2.10.2, where this was working before.

This has occurred when viewing any in-progress workflow I've tried.

Paste the logs from the workflow controller:
These don't seem relevant.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@terev terev added the type/bug label Oct 16, 2020
alexec (Contributor) commented Oct 16, 2020

Would you like to see if you can identify the offending change?

@alexec alexec added this to the v2.11 milestone Oct 16, 2020
terev (Contributor, Author) commented Oct 16, 2020

Sure, I'll have a look.

terev (Contributor, Author) commented Oct 16, 2020

I have a feeling it has something to do with the changes here: 6e3c5be#diff-884cea1f0a354998c12cefc2b8097a881fe6d621fa91c1c549d241d6023fa5b5L42.

In my browser I added breakpoints in two places in ui/src/app/shared/services/requests.ts. In 2.10 it seems like the error (where the error is actually undefined) is caught and received by the chained then, but in 2.11 it's passed to the onerror handler. I can't tell what would be causing the EventSource to not be re-established, though.

alexec (Contributor) commented Oct 16, 2020

What is the UI behaviour? Does it show an error?

terev (Contributor, Author) commented Oct 16, 2020

No error in the UI that I can see. It just stays static in the state it was in when loading the page (workflow-details).

terev (Contributor, Author) commented Oct 17, 2020

Locally I reverted just these changes 6e3c5be#diff-0670537605e4212d6dada675c6498eabe38d45eb0d3594d59c0779905a83f18bR319-R322, and after the first request times out, another one is established.

I have an nginx proxy set up locally to mimic the ingress in my deployment, with configuration like:

events { }

http {
  server {
    proxy_http_version 1.1;

    listen 2345;

    set $argo_upstream http://127.0.0.1:2746;
    location / {
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding off;
        proxy_pass $argo_upstream;
    }
  }
}

nginx -c ./nginx.conf

And I modified the webpack dev config to proxy through the nginx proxy:

devServer: {
  historyApiFallback: {
    disableDotRule: true
  },
  proxy: {
    "/api": {
      "target": isProd ? "" : "http://localhost:2345",
      "secure": false
    }
  }
}

alexec (Contributor) commented Oct 19, 2020

Can you test with argocli:latest please?

terev (Contributor, Author) commented Oct 21, 2020

@alexec I had to modify my proxy config, since the CLI only works with a gRPC connection:

events { }

error_log /Users/trevorfoster/tempproxy/error.log info;

http {
  access_log /Users/trevorfoster/tempproxy/access.log;
  server {
    resolver 127.0.0.1;

    server_name localhost;
    listen 2345 http2;

    set $argo_upstream grpc://127.0.0.1:2746;
    location / {
        grpc_pass $argo_upstream;
    }
  }
}

However, the watch still times out after one minute because of nginx's default keepalive setting of one minute. This keepalive setting is also the reason the UI request times out. If the request times out, the client should restart the watch, in my opinion.

Name:                fantastic-whale
Namespace:           argo
ServiceAccount:      default
Status:              Running
Created:             Wed Oct 21 17:44:05 -0400 (1 minute ago)
Started:             Wed Oct 21 17:44:05 -0400 (1 minute ago)
Duration:            1 minute 5 seconds
Progress:            0/1
Parameters:
  message:           hello argo

STEP                TEMPLATE  PODNAME          DURATION  MESSAGE
 ● fantastic-whale  argosay   fantastic-whale  1m
FATA[0060] rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR
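As a workaround at the proxy layer (deployment tuning only, not a fix for the client behaviour discussed here), the nginx timeouts can be raised so the watch survives longer; `keepalive_timeout` and `grpc_read_timeout` are the relevant directives. A sketch against the test config above:

```nginx
# Hypothetical tuning, not a client-side fix: lengthen the timeouts so nginx
# does not close the idle watch after its ~60s defaults.
events { }

http {
  server {
    listen 2345 http2;
    keepalive_timeout 1h;        # default is 75s on the client connection
    location / {
      grpc_pass grpc://127.0.0.1:2746;
      grpc_read_timeout 1h;      # default 60s; closes quiet upstream streams
      grpc_send_timeout 1h;
    }
  }
}
```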

alexec (Contributor) commented Oct 21, 2020

@alexmt any thoughts on what Argo CD does about this? Does it send some kind of keep-alive?

alexec (Contributor) commented Oct 22, 2020

The way the UI is coded, it should present an error notice if there is an error, and then automatically reconnect.

terev (Contributor, Author) commented Oct 22, 2020

That's what I thought as well, since the EventSource handles that, but it doesn't seem to work.

In the event of an upstream timeout, the request does appear to end with a response code of 200, so it's possible the logic re-establishing the connection isn't triggered in this case. However, the same response is returned on 2.10, and it does establish a new request.

[Screenshot: Screen Shot 2020-10-22 at 10 03 32 PM]

alexec (Contributor) commented Oct 26, 2020

So we think that the request ends with 200, but we do not reconnect like we should.

alexec added a commit to alexec/argo-workflows that referenced this issue Oct 26, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec (Contributor) commented Oct 26, 2020

Can you please test argoproj/argocli:ui-prog and see if this fixes your issue?

terev (Contributor, Author) commented Oct 26, 2020

I checked out your changes locally, and it seems like a new call still isn't being established in the details UI. I added a breakpoint around the code called on subscribe finish:

[Screenshot: Screen Shot 2020-10-26 at 4 20 39 PM]

It seems like the error handler is being called instead of the third handler, where you added the recursive call.
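To illustrate the failure mode (a hypothetical sketch, not the actual requests.ts code): with an RxJS-style three-handler subscribe, a reconnect placed only in the complete handler never runs when the stream terminates via the error path.

```typescript
// Hypothetical model of a three-handler subscribe: the stream ends either
// via the error path or via the complete path.
type Handlers = { next: () => void; error: () => void; complete: () => void };

function subscribeOnce(endViaError: boolean, h: Handlers): void {
  h.next();
  if (endViaError) { h.error(); } else { h.complete(); }
}

// The reconnect lives only in the complete handler; if the source ends via
// the error path instead, no new watch is ever opened.
function watch(
  endViaError: boolean,
  maxReconnects: number,
): { reconnects: number; sawError: boolean } {
  let reconnects = 0;
  let sawError = false;
  const open = () =>
    subscribeOnce(endViaError, {
      next: () => {},
      error: () => { sawError = true; },
      complete: () => { if (reconnects < maxReconnects) { reconnects++; open(); } },
    });
  open();
  return { reconnects, sawError };
}
```

So if the timeout is being delivered down the error path, the recursive call in the third handler is simply never reached.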

alexec (Contributor) commented Oct 26, 2020

I've made another change. Can you pull the image again and try again?

terev (Contributor, Author) commented Oct 26, 2020

That seems to work! Very neat. Should we make this change for the other EventSource subscribers?

alexec (Contributor) commented Oct 26, 2020

@terev would you be interested in taking over on this fix?

alexec (Contributor) commented Oct 26, 2020

I think we should determine whether the error is a disconnect (so reconnecting is a good idea) rather than a 403 or 500 (which would result in going into a hot-loop).
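Sketched in TypeScript (helper names are hypothetical, not the actual fix that landed in the linked PR): classify how the stream ended before reconnecting, and add backoff so even reconnectable failures don't hammer the server.

```typescript
// Hypothetical sketch of the idea above: reconnect on plain disconnects, but
// not on real HTTP errors, which would otherwise hot-loop the client.
function shouldReconnect(status: number | null): boolean {
  // null/0 means the connection dropped with no response; 200 means the
  // stream simply ended (e.g. a proxy timeout). Both are safe to reconnect.
  if (status === null || status === 0 || status === 200) {
    return true;
  }
  // 403, 500, etc. are real errors; retrying immediately would hot-loop.
  return false;
}

// Exponential backoff so even reconnectable failures are paced.
function backoffMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

A caller would check `shouldReconnect` on stream end and, if true, schedule the next watch after `backoffMs(attempt)` rather than immediately.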

alexec (Contributor) commented Oct 26, 2020

terev (Contributor, Author) commented Oct 26, 2020

Sweet, that makes sense.

alexec linked a pull request Oct 27, 2020 that will close this issue
alexec added a commit that referenced this issue Oct 27, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec added a commit that referenced this issue Oct 27, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexcapras pushed a commit to alexcapras/argo that referenced this issue Nov 12, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Capras <alexcapras@gmail.com>