4.0.0 skyhook token expired immediately after generation #2471

Closed
eedwards-sk opened this issue Aug 7, 2018 · 12 comments

Comments

@eedwards-sk

Bug Report

  • Concourse version: 4.0.0
  • Deployment type (BOSH/Docker/binary): Docker Compose (Web / Worker) Darwin (Worker)
  • Infrastructure/IaaS:
  • Browser (if applicable): Chrome
  • Did this used to work?: It does sometimes

Getting a 'token expired' issue when trying to execute a fly command that logs me in through skyhook. Results in a 400 error on the sky token page and errors in logs.

execution flow:

  1. attempted to load a pipeline
→ snap-dev concourse fly set-pipeline -c ci/concourse/pipeline-local.yaml -p bash-util-local --load-vars-from=ci/concourse/vars/local.yaml
could not find a valid token.
logging in to team 'main'

navigate to the following URL in your browser:

  http://concourse.service.local.consul:8080/sky/login?redirect_uri=http://127.0.0.1:57269/auth/callback
  2. navigated to the url and logged in, which redirected me to a 400 page (chrome error page) with this url:
http://concourse.service.local.consul:8080/sky/callback?code=lnv25sms7u5qbmbihlphpp6lk&state=eyJyZWRpcmVjdF91cmkiOiJodHRwOi8vMTI3LjAuMC4xOjU3MjY5L2F1dGgvY2FsbGJhY2siLCJlbnRyb3B5IjoiYzdmNDY0ZWE5ODBhOTc0N2VhMzUyYmY0NTBkZmNmYjRmNWUxODhmOTRjMzE1ZTQ5YWFmOWMzYTRhNDY5MDA1YiJ9
  3. checked web logs which contained numerous entries of:
{"timestamp":"1533658475.805241585","source":"atc","message":"atc.sky.userinfo.failed-to-validate-claims","log_level":2,"data":{"error":"square/go-jose/jwt: validation failed, token is expired (exp)","session":"4.6249"}}
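The `state` query parameter in the callback URL from the steps above looks like base64-encoded JSON. A minimal Python sketch for inspecting it while debugging is shown here; the helper name `decode_state` is made up for illustration, and the assumption that the value is URL-safe base64 is inferred from the string itself, not from the Concourse source.

```python
import base64
import json

# The `state` value copied from the callback URL in the report.
STATE = (
    "eyJyZWRpcmVjdF91cmkiOiJodHRwOi8vMTI3LjAuMC4xOjU3MjY5L2F1dGgvY2FsbGJhY2si"
    "LCJlbnRyb3B5IjoiYzdmNDY0ZWE5ODBhOTc0N2VhMzUyYmY0NTBkZmNmYjRmNWUxODhmOTRj"
    "MzE1ZTQ5YWFmOWMzYTRhNDY5MDA1YiJ9"
)


def decode_state(state: str) -> dict:
    """Decode a base64 (URL-safe alphabet) JSON blob, restoring any stripped padding."""
    padded = state + "=" * (-len(state) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


decoded = decode_state(STATE)
# The redirect_uri should match the one fly printed in the login URL.
print(decoded["redirect_uri"])
```

If the decoded `redirect_uri` matches the address fly is listening on, the redirect target itself is probably not the cause of the 400.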

This happened even though the token had just been generated. Possibly a time desync issue? I'm running Concourse in Docker Compose and was logging in from my local dev Mac.
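One way to start checking the time-desync theory is to convert the `timestamp` field of a log entry into a readable UTC time and compare it with the clock on the machine you are logging in from. The Python sketch below does only the conversion; the function name `log_time` is hypothetical, and it assumes the field holds seconds since the Unix epoch, which is consistent with the date of this report.

```python
import json
from datetime import datetime, timezone

# One of the log entries from the report above.
LOG_LINE = (
    '{"timestamp":"1533658475.805241585","source":"atc",'
    '"message":"atc.sky.userinfo.failed-to-validate-claims","log_level":2,'
    '"data":{"error":"square/go-jose/jwt: validation failed, token is expired (exp)",'
    '"session":"4.6249"}}'
)


def log_time(line: str) -> datetime:
    """Parse a JSON log line and return its timestamp as an aware UTC datetime."""
    entry = json.loads(line)
    # Assumption: the timestamp is fractional seconds since the Unix epoch.
    return datetime.fromtimestamp(float(entry["timestamp"]), tz=timezone.utc)


logged_at = log_time(LOG_LINE)
print(logged_at.isoformat())
```

For a freshly emitted log line, a large gap between this value and the wall clock on the client would point towards clock drift between the container and the host.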

@eedwards-sk
Author

I was unable to log in from the CLI either, and the logs were spammed with the above failed-to-validate-claims message constantly.

destroying and re-creating the concourse stack allowed me to log in again

@loganmzz

I also get a 400 error after login. I'm using the quickstart in Docker, with the external URL http://concourse.dev.localhost behind a local reverse proxy (Traefik).

When I use the external URL http://localhost:8080, login works from the web UI but not from fly. fly asks me to log in through the web UI, but opening the provided URL results in a 400 after submitting the form.

@vito
Member

vito commented Aug 22, 2018

@loganmzz That sounds like #2463

@eedwards-sk If you have multiple ATCs, this might actually be a case of #2425 which we'll be shipping a fix for soon. (If so would ya mind closing this as a dupe?)

If it's a time desync issue there's honestly nothing we can do. You'd have to fix up your system clocks to be consistent.
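To illustrate why clock disagreement can make a token look expired the moment it is issued, here is a small Python sketch of an `exp` check. It is not go-jose's implementation; the function `is_expired` and all numbers (60 second lifetime, 120 second skew, 90 second leeway) are hypothetical. It only applies when the clock that issues the token and the clock that validates it can differ.

```python
def is_expired(exp: float, now: float, leeway: float = 0.0) -> bool:
    """Return True when the validator's clock is past `exp` plus the allowed leeway."""
    return now > exp + leeway


# Hypothetical scenario: the issuer stamps the token using its own clock.
issuer_now = 1_533_658_475.0   # issuer's idea of "now" (Unix seconds)
token_ttl = 60.0               # assumed token lifetime in seconds
exp = issuer_now + token_ttl

# The validator's clock runs 120 seconds ahead of the issuer's.
validator_now = issuer_now + 120.0

expired_with_skew = is_expired(exp, validator_now)                 # rejected at once
expired_in_sync = is_expired(exp, issuer_now)                      # accepted
expired_with_leeway = is_expired(exp, validator_now, leeway=90.0)  # accepted

print(expired_with_skew, expired_in_sync, expired_with_leeway)
```

In this sketch a skew larger than the token lifetime causes immediate rejection, and only consistent clocks (or a leeway larger than the skew) avoid it.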

@eedwards-sk
Author

eedwards-sk commented Aug 22, 2018

@vito Single ATC. This is from a local docker/compose style stack with a single web/atc node, single linux worker, and single darwin worker.

Honestly I haven't tried reproing it lately since I gave up on the sky hook after this issue hit me the first time and I log in purely through the CLI now.

I get a 100% login failure repro in the web UI now too, every time I sleep my computer, wake it back up later, and then try to log back in to an expired session in the web... when it redirects me the redirect always fails and I have to go back to the base URL and log in again.

Time desync was a thought but in the original issue I had literally just launched a fresh stack in compose. (Edit: actually maybe it wasn't a fresh stack? it was a fresh login token... either way the logs were spammed with the message above and the stack had to be recreated to stop it)

@loganmzz

@eedwards-sk Have you tried setting the Web/ATC container hostname to concourse.service.local.consul?

@jduv

jduv commented Aug 24, 2018

I'm seeing this with multiple ATCs but not with a single ATC. I have an environment spun up that I'm happy to donate in order to help folks debug this problem. The problem is clearly repeatable. The following errors occurred after this sequence of steps:

  1. Spin up a single new concourse node in AWS using Terraform. These nodes are in docker (run command available upon request).
  2. Associate the concourse instance with a new GitHub application.
  3. Log in. Confirm login works. I won't show you the logs here because they contain sensitive information about our organization.
  4. Logout.
  5. Using terraform, increase the node count for web boxes to 2. Try to log in after the new node is awake.
  6. Observe the following behaviors in the logs:
{"timestamp":"1535081696.712544203","source":"atc","message":"atc.dex.event","log_level":2,"data":{"fields":{},"message":"Failed to get auth request: not found","session":"5"}}
{"timestamp":"1535081720.367409229","source":"atc","message":"atc.dex.event","log_level":2,"data":{"fields":{},"message":"Invalid 'state' parameter provided: not found","session":"5"}}
{"timestamp":"1535081735.249691486","source":"atc","message":"atc.dex.event","log_level":2,"data":{"fields":{},"message":"Failed to get auth request: not found","session":"5"}}
{"timestamp":"1535081745.121060610","source":"atc","message":"atc.sky.callback.failed-to-fetch-dex-token","log_level":2,"data":{"error":"oauth2: cannot fetch token: 400 Bad Request\nResponse: {\"error\":\"invalid_request\",\"error_description\":\"Invalid or expired code parameter.\"}","session":"4.310"}}

Multiple errors are displayed in the UI, anything from a 500 to a 400 depending on values. I even tried clearing the teams table, but to no avail. How does concourse store authentication data? In memory? Bouncing the box didn't help either. Perhaps I'm too much of an OAuth noob to begin to understand the root cause of this.

I've also confirmed connectivity from node A to node B and have confirmed via docker inspect that the correct IP addresses are set for Peer-IP.

@vito I'd love to help fix this if I knew where to start.

@vito
Member

vito commented Aug 24, 2018

@jduv We've already got a fix for that done, it's just making its way through our pipeline. It'll be in 4.1: #2425

@eedwards-sk
Author

eedwards-sk commented Aug 26, 2018

@loganmzz What's the best way to do that? I'm using the concourse provided docker image for web. I am already setting CONCOURSE_EXTERNAL_URL to that hostname, but I assume you're saying the docker image itself?

I'd rather not have to build a custom docker image for this.

Edit: also, this hostname resolves properly already... it's not a hostname DNS issue... I can immediately go back to the root URL after the login failure and it works.

Edit: Okay, I've found how to set the hostname of the docker container. We'll see if that helps :)

→ docker exec -it local-concourse-web sh
# hostname
concourse.service.local.consul

@eedwards-sk
Author

eedwards-sk commented Aug 26, 2018

Setting the hostname possibly made it worse?

{"timestamp":"1535312181.877935648","source":"atc","message":"atc.sky.token.failed-to-fetch-dex-token","log_level":2,"data":{"error":"Post http://concourse.service.localtest.consul:8181/sky/issuer/token: dial tcp 172.17.0.10:8181: connect: connection refused","session":"4.7"}}

concourse.service.localtest.consul should be my host IP, which is 192.168.20.100 not 172.17.0.10

Edit: Yeah, I had to revert forcing the hostname. I'm now just setting it to the docker container name (local-concourse-web).

@loganmzz

@eedwards-sk The container hostname can be set with -h|--hostname on docker run. The external URL must be resolvable both from your client (web browser or shell) and from inside the container.

I had the same issue in #2513. I couldn't log in as long as the external URL was not resolvable from inside the container.
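A quick way to compare what the external URL's hostname resolves to on the client and inside the container is to run the same small check in both places. The Python sketch below is one possible way to do that, assuming Python is available in both environments; `resolve_host` is a hypothetical helper, and the fallback URL is only a placeholder for when `CONCOURSE_EXTERNAL_URL` is not set.

```python
import os
import socket
from typing import List
from urllib.parse import urlparse


def resolve_host(url: str) -> List[str]:
    """Return the sorted IP addresses the URL's hostname resolves to, or [] on failure."""
    host = urlparse(url).hostname
    if host is None:
        return []
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # The name does not resolve from where this script is running.
        return []
    # Each entry's sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in infos})


# Placeholder default; set CONCOURSE_EXTERNAL_URL to the real external URL.
external_url = os.environ.get("CONCOURSE_EXTERNAL_URL", "http://127.0.0.1:8080")
print(external_url, "->", resolve_host(external_url))
```

If the two environments print different addresses, or the container prints an empty list, that mismatch is worth ruling out before looking at the tokens themselves.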

@eedwards-sk
Author

eedwards-sk commented Aug 30, 2018 via email

@eedwards-sk
Author

I'm going to close this since I haven't been able to repro on 4.2.1, although I continue to receive errors when logging back in after the session kicks me out (I have to go back to the home page and log in again from there).
