
View Workflow Page Not Receiving Progress Updates #4301

Closed
terev opened this issue Oct 16, 2020 · 20 comments · Fixed by #4378
terev (Contributor) commented Oct 16, 2020

Summary

It seems like the view workflow page is no longer continuously receiving events. On 2.10.2 this worked: the progress of the workflow could be viewed without refreshing. The SSE endpoint used to retrieve these events, /workflow-events, seems to be called once; then, after the request times out, a new watch request is not started. This may not be the only affected SSE endpoint.

Update:
This also appears to occur on the workflow list page, where the SSE request eventually terminates with a 504 Gateway Timeout.
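The expected behaviour, for reference: when one SSE request ends (e.g. because a proxy times it out), the client should open a new watch rather than go silent. A minimal TypeScript sketch of that restart loop, with a fake `connect` standing in for the real EventSource (all names here are illustrative, not Argo's actual code):

```typescript
// Illustrative sketch, not Argo's actual code: when one SSE request ends
// (proxy timeout, network drop), re-establish the watch instead of freezing.
type StreamHandlers = { onItem: (x: string) => void; onEnd: () => void };

// connect() stands in for opening one EventSource request; it must call
// onEnd when the server or proxy closes the stream.
function watchWithRestart(
  connect: (h: StreamHandlers) => void,
  maxRestarts: number,
  onItem: (x: string) => void,
): number {
  let restarts = 0;
  const open = () =>
    connect({
      onItem,
      onEnd: () => {
        if (restarts < maxRestarts) {
          restarts++;
          open(); // re-establish the watch instead of going silent
        }
      },
    });
  open();
  return restarts;
}
```

With this shape, a proxy-imposed 60-second timeout costs only a reconnect instead of leaving the page static.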

Diagnostics

GKE 1.15

On 2.11.4 I've noticed that the workflow graph isn't updating anymore. I upgraded to 2.11.4 from 2.10.2, where this was working before.

This has occurred when viewing any in-progress workflow I've tried.

Paste the logs from the workflow controller:
These don't seem relevant.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@terev terev added the type/bug label Oct 16, 2020
alexec (Contributor) commented Oct 16, 2020

Would you like to see if you can identify the offending change?

@alexec alexec added this to the v2.11 milestone Oct 16, 2020
terev (Contributor, Author) commented Oct 16, 2020

Sure, I'll have a look.

terev (Contributor, Author) commented Oct 16, 2020

I have a feeling it has something to do with the changes here: 6e3c5be#diff-884cea1f0a354998c12cefc2b8097a881fe6d621fa91c1c549d241d6023fa5b5L42.

In my browser I added breakpoints in two places in ui/src/app/shared/services/requests.ts. In 2.10 it seems like the error (where the error is actually undefined) is caught and received by the chained then, but in 2.11 it's passed to the onerror handler. I can't tell what would be causing the EventSource to not be re-established, though.

alexec (Contributor) commented Oct 16, 2020

What is the UI behaviour? Does it show an error?

terev (Contributor, Author) commented Oct 16, 2020

No error in the UI that I can see. It just stays static in the state it was in when loading the page (workflow-details).

terev (Contributor, Author) commented Oct 17, 2020

Locally I reverted just these changes 6e3c5be#diff-0670537605e4212d6dada675c6498eabe38d45eb0d3594d59c0779905a83f18bR319-R322, and after the first request times out, another one is established.

I have an nginx proxy set up locally to mimic the ingress in my deployment, with configuration like:

events { }

http {
  server {
    proxy_http_version 1.1;

    listen 2345;

    set $argo_upstream http://127.0.0.1:2746;
    location / {
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding off;
        proxy_pass $argo_upstream;
    }
  }
}

nginx -c ./nginx.conf

And I modified the webpack dev config to proxy through the nginx proxy:

devServer: {
  historyApiFallback: {
    disableDotRule: true
  },
  proxy: {
    "/api": {
      "target": isProd ? "" : "http://localhost:2345",
      "secure": false
    }
  }
}

alexec (Contributor) commented Oct 19, 2020

Can you test with argocli:latest please?

terev (Contributor, Author) commented Oct 21, 2020

@alexec I had to modify my proxy config, since the CLI only works with a gRPC connection:

events { }

error_log /Users/trevorfoster/tempproxy/error.log info;

http {
  access_log /Users/trevorfoster/tempproxy/access.log;
  server {
    resolver 127.0.0.1;

    server_name localhost;
    listen 2345 http2;

    set $argo_upstream grpc://127.0.0.1:2746;
    location / {
        grpc_pass $argo_upstream;
    }
  }
}

However, the watch still times out after one minute because of nginx's default keepalive setting of one minute. This keepalive setting is also the reason the UI request times out. If the request times out, the client should restart the watch, in my opinion.

Name:                fantastic-whale
Namespace:           argo
ServiceAccount:      default
Status:              Running
Created:             Wed Oct 21 17:44:05 -0400 (1 minute ago)
Started:             Wed Oct 21 17:44:05 -0400 (1 minute ago)
Duration:            1 minute 5 seconds
Progress:            0/1
Parameters:
  message:           hello argo

STEP                TEMPLATE  PODNAME          DURATION  MESSAGE
 ● fantastic-whale  argosay   fantastic-whale  1m
FATA[0060] rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: INTERNAL_ERROR
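As a workaround at the proxy layer (deployment tuning only, not a fix for the client behaviour discussed here), the nginx timeouts can be raised so the watch survives longer; `keepalive_timeout` and `grpc_read_timeout` are the relevant directives. A sketch against the test config above:

```nginx
# Hypothetical tuning, not a client-side fix: lengthen the timeouts so nginx
# does not close the idle watch after its ~60s defaults.
events { }

http {
  server {
    listen 2345 http2;
    keepalive_timeout 1h;        # default is 75s on the client connection
    location / {
      grpc_pass grpc://127.0.0.1:2746;
      grpc_read_timeout 1h;      # default 60s; closes quiet upstream streams
      grpc_send_timeout 1h;
    }
  }
}
```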

alexec (Contributor) commented Oct 21, 2020

@alexmt any thoughts on what Argo CD does about this? Does it send some kind of keep-alive?

alexec (Contributor) commented Oct 22, 2020

The way the UI is coded, it should present an error notice if there is an error, and then automatically reconnect.

terev (Contributor, Author) commented Oct 22, 2020

That's what I thought as well, since the EventSource handles that, but it doesn't seem to work.

In the event of an upstream timeout, the request does appear to end with a response code of 200, so it's possible the logic re-establishing the connection isn't triggered in this case. However, the same response is returned on 2.10, and it does establish a new request.

[Screenshot: Screen Shot 2020-10-22 at 10 03 32 PM]

alexec (Contributor) commented Oct 26, 2020

So we think that the request ends with 200, but we do not reconnect like we should.

alexec added a commit to alexec/argo-workflows that referenced this issue Oct 26, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec (Contributor) commented Oct 26, 2020

Can you please test argoproj/argocli:ui-prog and see if this fixes your issue?

terev (Contributor, Author) commented Oct 26, 2020

I checked out your changes locally, and it seems like a new call still isn't being established in the details UI. I added a breakpoint around the code called on subscribe finish:

[Screenshot: Screen Shot 2020-10-26 at 4 20 39 PM]

It seems like the error handler is being called instead of the third handler, where you added the recursive call.
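To illustrate the failure mode (a hypothetical sketch, not the actual requests.ts code): with an RxJS-style three-handler subscribe, a reconnect placed only in the complete handler never runs when the stream terminates via the error path.

```typescript
// Hypothetical model of a three-handler subscribe: the stream ends either
// via the error path or via the complete path.
type Handlers = { next: () => void; error: () => void; complete: () => void };

function subscribeOnce(endViaError: boolean, h: Handlers): void {
  h.next();
  if (endViaError) { h.error(); } else { h.complete(); }
}

// The reconnect lives only in the complete handler; if the source ends via
// the error path instead, no new watch is ever opened.
function watch(
  endViaError: boolean,
  maxReconnects: number,
): { reconnects: number; sawError: boolean } {
  let reconnects = 0;
  let sawError = false;
  const open = () =>
    subscribeOnce(endViaError, {
      next: () => {},
      error: () => { sawError = true; },
      complete: () => { if (reconnects < maxReconnects) { reconnects++; open(); } },
    });
  open();
  return { reconnects, sawError };
}
```

So if the timeout is being delivered down the error path, the recursive call in the third handler is simply never reached.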

alexec (Contributor) commented Oct 26, 2020

I've made another change. Can you pull the image again and try again?

terev (Contributor, Author) commented Oct 26, 2020

That seems to work! Very neat. Should we make this change for the other EventSource subscribers?

alexec (Contributor) commented Oct 26, 2020

@terev would you be interested in taking over on this fix?

alexec (Contributor) commented Oct 26, 2020

I think we should determine whether the error is a disconnect (so reconnecting is a good idea) rather than a 403 or 500 (which would result in going into a hot-loop).
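Sketched in TypeScript (helper names are hypothetical, not the actual fix that landed in the linked PR): classify how the stream ended before reconnecting, and add backoff so even reconnectable failures don't hammer the server.

```typescript
// Hypothetical sketch of the idea above: reconnect on plain disconnects, but
// not on real HTTP errors, which would otherwise hot-loop the client.
function shouldReconnect(status: number | null): boolean {
  // null/0 means the connection dropped with no response; 200 means the
  // stream simply ended (e.g. a proxy timeout). Both are safe to reconnect.
  if (status === null || status === 0 || status === 200) {
    return true;
  }
  // 403, 500, etc. are real errors; retrying immediately would hot-loop.
  return false;
}

// Exponential backoff so even reconnectable failures are paced.
function backoffMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

A caller would check `shouldReconnect` on stream end and, if true, schedule the next watch after `backoffMs(attempt)` rather than immediately.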

alexec (Contributor) commented Oct 26, 2020

terev (Contributor, Author) commented Oct 26, 2020

Sweet, that makes sense.

alexec linked a pull request Oct 27, 2020 that will close this issue
alexec added a commit that referenced this issue Oct 27, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec added a commit that referenced this issue Oct 27, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexcapras pushed a commit to alexcapras/argo that referenced this issue Nov 12, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
Signed-off-by: Alex Capras <alexcapras@gmail.com>