
While watching a job, we get a " You are not authorized to view the details of this pipeline" error #3745

Closed
aeijdenberg opened this issue Apr 17, 2019 · 21 comments · Fixed by #3833

@aeijdenberg (Contributor) commented Apr 17, 2019

Bug Report

After updating from 5.0.1 to 5.1.0, we've observed that when using the UI to watch a running job, after waiting a period of time (one user reported 30 seconds; I just observed 180 seconds) the UI shows a "401 Unauthorized - You are not authorized to view the details of this pipeline" error message.

Steps to Reproduce

View a running job and watch it.

Expected Results

To watch it uninterrupted by error messages.

Actual Results

Error message appears after 180 seconds (though we've had varying reports of this time length).

Additional Context

We log in to our Concourse with UAA.
We have an HAProxy in front of our Concourse.

Version Info

  • Concourse version: 5.1.0 - this previously worked fine with 5.0.1.
  • Deployment type (BOSH/Docker/binary): bosh
  • Infrastructure/IaaS: AWS
  • Browser (if applicable): Chrome
  • Did this used to work? Yes
@aeijdenberg aeijdenberg added the bug label Apr 17, 2019
@aeijdenberg (Contributor) commented Apr 17, 2019

I'll note a few more observations: the time appears a little random. This morning I watched a job and it gave me the error after 2.5 minutes, then, after a page refresh, after 1:45.

Not that I'd expect it to make a difference, but quite some time ago we moved our Concourse instance off the public internet, making it only accessible via a SOCKS5 proxy. At about that time we also noticed that sometimes when watching a job, the output streaming from the server (if not changing frequently) would appear to stall and we wouldn't get any updates. A page refresh would fix it.

I'm wondering if the new "401 Unauthorized" message is perhaps a different manifestation of an issue that previously failed a bit more silently?

@aeijdenberg (Contributor) commented Apr 17, 2019

Hmm, at the same time the error appears on screen, Chrome console shows:

    GET https://concourse.example.com/api/v1/builds/2113698/events net::ERR_INCOMPLETE_CHUNKED_ENCODING 200 (OK)

Then, a little while later, after the error has already appeared:

    GET https://concourse.example.com/api/v1/builds/2113698/events 504 (Gateway Time-out)

@jfisher84 commented Apr 18, 2019

We're getting the same error

@nterry commented Apr 22, 2019

Same here...

@aeijdenberg (Contributor) commented Apr 22, 2019

@vito for visibility - looks like a bunch of us are seeing this issue since the v5.1.0 release last week.

@pivotal-jamie-klassen (Contributor) commented Apr 23, 2019

I'm marking this as web-ui, and I suspect it is related to changes in the
build events code -- I'm not certain, but I believe that back when we used an
effect module (a feature that was removed in Elm 0.19) there was a slightly
more resilient behaviour. I think the browser would attempt to re-open the
stream of build events on an error. Probably worth investigating, anyway.

@aeijdenberg (Contributor) commented Apr 23, 2019

I wonder if it's related to the HAProxy we have in front. I note it has the following timeouts:

    timeout connect         5000ms
    timeout client          30000ms
    timeout server          30000ms
    timeout tunnel          3600000ms
    timeout http-keep-alive 500ms
    timeout http-request    5000ms
    timeout queue           30000ms

I wonder if 30 seconds of inactivity (e.g. long running job with no new log lines in that period) is causing HAProxy to kill the connection?

I wonder if we could convince HAProxy that this connection is a tunnel (and thus apply the longer timeout): https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#4.2-timeout%20tunnel
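For illustration, one way to express that idea in HAProxy is to route the build-event streams to a backend with a much longer server timeout. This is a sketch only; the frontend/backend/ACL names and addresses below are made up, not from any real Concourse deployment:

```
# Sketch — names and addresses are illustrative, adjust to your config.
frontend concourse_fe
    bind :443 ssl crt /etc/haproxy/certs/concourse.pem
    # Build-event streams look like /api/v1/builds/<id>/events
    acl is_event_stream path_end /events
    use_backend concourse_events if is_event_stream
    default_backend concourse_web

backend concourse_web
    timeout server 30000ms
    server web1 10.0.0.10:8080

backend concourse_events
    # allow idle SSE streams to stay open for up to an hour
    timeout server 3600000ms
    server web1 10.0.0.10:8080
```

A per-backend `timeout server` overrides the defaults, so only the long-lived streams get the relaxed limit.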

@ralekseenkov (Contributor) commented Apr 24, 2019

We are hitting the same problem.

@farukhkhan123 commented Apr 25, 2019

We have noticed the same problem when viewing logs of a running pipeline job. The timing of the error fluctuates between 1 and 2 minutes.

@aeijdenberg (Contributor) commented Apr 25, 2019

By way of update, we changed the timeout server 30000ms setting in our HAProxy to be 1 hour, and that seems to have worked around the issue we were seeing.

@avanier (Contributor) commented Apr 30, 2019

Yeah, we're getting the same issue over here.

Our web nodes run in Kubernetes behind an Amazon ELB Classic.

cc @Typositoire

@nterry commented Apr 30, 2019

> By way of update, we changed the timeout server 30000ms setting in our HAProxy to be 1 hour, and that seems to have worked around the issue we were seeing.

Great, how do we do that? Is there a new Docker image in the works to address this?

@aeijdenberg (Contributor) commented May 1, 2019

In our case our HAProxy config was fairly custom to begin with, so no shared Docker image to update.

My feeling is that a more correct fix would be to make the stream look more like a tunnel (e.g. a websocket, which is hinted at in the HAProxy documentation and by default attracts a much higher timeout), and/or to add some kind of heartbeat so that long periods of expected inactivity don't drop the connection, and/or to try reconnecting if the stream drops.

@thecpdubguy commented May 1, 2019

Adding on to this: same here. We bumped from 4.2.2 to 5.1.0. When looking at a specific build in a job, after some idle time the screen changes to "401 unauthorized". I haven't dug into exactly when it occurs. A simple page refresh goes back to watching the build for another stretch of time. It appears to only happen while idle.

As per request below:
Firefox Quantum - 66.0.3 (64-bit)
(Also happened on Chrome but I am unable to access that machine so I don't have specific version)

@vito (Member) commented May 1, 2019

Saw this myself a couple of times just now when viewing PR builds. The console logs showed ERR_NETWORK_CHANGED in my case, which is odd because I'm on ethernet. But maybe this was happening before too and the front-end code was just more resilient to it, so I didn't notice? 🤔 Haven't seen it happen since I started paying more attention to it...

We don't have any HAProxy and for me the error happened pretty much immediately after opening the page, so my scenario sounds slightly different. If anyone can get this down to a reliable repro case that'd help a lot! 👍

Also everyone please report your exact browser version just in case there's been some behavior change in recent Chrome versions or something.

@avanier (Contributor) commented May 1, 2019

I'm reporting from Mozilla Firefox 66.0.3.

@vito (Member) commented May 1, 2019

@pivotal-jamie-klassen Looks like it might be this bit of code?:

    newModel =
        if
            List.map .data envelopes
                |> List.member STModels.NetworkError
        then
            { model | authorized = False }

        else
            model

Maybe it needs to be made conditional on eventSourceOpened like it is here?:

    ( if eventSourceOpened then
        -- connection could have dropped out of the blue;
        -- just let the browser handle reconnecting
        model

      else
        -- assume request was rejected because auth is required;
        -- no way to really tell
        { model | authorized = False }

@jfisher84 commented May 1, 2019

Chrome Version 74.0.3729.131 (Official Build) (64-bit)

@Typositoire commented May 3, 2019

@vito In our case I'm always getting net::ERR_INCOMPLETE_CHUNKED_ENCODING in the console when this happens.

And it happens when trying to get [...]/events (eventSource)
[screenshot of the console error omitted]

@vito vito added this to the v5.2.0 milestone May 6, 2019
@pivotal-jamie-klassen pivotal-jamie-klassen self-assigned this May 8, 2019
pivotal-jamie-klassen added a commit that referenced this issue May 8, 2019
#3745

Signed-off-by: Jamie Klassen <cklassen@pivotal.io>
@vito vito closed this in #3833 May 8, 2019