
Agent token invalid #4551
Closed · fpoussin opened this issue Oct 14, 2022 · 12 comments

fpoussin commented Oct 14, 2022

v0.9.10

Coder started showing these errors out of the blue.
The token seems to be correct, but coderd won't accept it.

2022-10-14 13:02:18.705 [WARN]  <./agent/agent.go:134>  (*agent).run    failed to dial  {"error": "GET http://coder.coder.svc.cluster.local/api/v2/workspaceagents/me/metadata: unexpected status code 401: Agent token is invalid.: Try logging in using 'coder login \u003curl\u003e'."}
~ k -n coder-workspaces exec -it coder-fpoussin-rb-5b5cf57855-4mv8t -c coder -- /bin/env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=coder-fpoussin-rb-5b5cf57855-4mv8t
TERM=xterm
CODER_AGENT_TOKEN=720e380e-cf12-4dfe-9210-864422081dc7
KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
KUBERNETES_SERVICE_HOST=172.20.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT=tcp://172.20.0.1:443
HOME=/home/coder
  containers:                                                 
  - command:                             
    - sh                                                                                                                      
    - -c                           
    - "#!/usr/bin/env sh\nset -eux\n# Sleep for a good long while before exiting.\n#                                          
      This is to allow folks to exec into a failed workspace and poke around to\n#
      troubleshoot.\nwaitonexit() {\n\techo \"=== Agent script exited with non-zero
      code. Sleeping 24h to preserve logs...\"\n\tsleep 86400\n}\ntrap waitonexit                                             
      EXIT\nBINARY_DIR=$(mktemp -d -t coder.XXXXXX)\nBINARY_NAME=coder\nBINARY_URL=http://coder.coder.svc.cluster.local/bin/coder-linux-amd64\ncd
      \"$BINARY_DIR\"\n# Attempt to download the coder agent.\n# This could fail for
      a number of reasons, many of which are likely transient.\n# So just keep trying!\nwhile
      :; do\n\t# Try a number of different download tools, as we don not know what
      we\n\t# will have available.\n\tstatus=\"\"\n\tif command -v curl >/dev/null
      2>&1; then\n\t\tcurl -fsSL --compressed \"${BINARY_URL}\" -o \"${BINARY_NAME}\"
      && break\n\t\tstatus=$?\n\telif command -v wget >/dev/null 2>&1; then\n\t\twget
      -q \"${BINARY_URL}\" -O \"${BINARY_NAME}\" && break\n\t\tstatus=$?\n\telif command
      -v busybox >/dev/null 2>&1; then\n\t\tbusybox wget -q \"${BINARY_URL}\" -O \"${BINARY_NAME}\"
      && break\n\t\tstatus=$?\n\telse\n\t\techo \"error: no download tool found, please
      install curl, wget or busybox wget\"\n\t\texit 127\n\tfi\n\techo \"error: failed                                                                                                                                                                       
      to download coder agent\"\n\techo \"       command returned: ${status}\"\n\techo
      \"Trying again in 30 seconds...\"\n\tsleep 30\ndone\n\nif ! chmod +x $BINARY_NAME;                                      
      then\n\techo \"Failed to make $BINARY_NAME executable\"\n\texit 1\nfi\n\nexport
      CODER_AGENT_AUTH=\"token\"\nexport CODER_AGENT_URL=\"http://coder.coder.svc.cluster.local/\"\nexec                      
      ./$BINARY_NAME agent\n"      
    env:                                        
    - name: CODER_AGENT_TOKEN     
      value: 720e380e-cf12-4dfe-9210-864422081dc7
....

The token is indeed missing from the database (workspace_agents table).
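For anyone wanting to verify the same thing, a query along these lines works (a sketch; it assumes direct psql access to coderd's PostgreSQL database, with CODER_PG_CONNECTION_URL holding its connection string):

# list agent IDs and the tokens coderd actually expects
psql "$CODER_PG_CONNECTION_URL" -c "SELECT id, auth_token FROM workspace_agents;"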

Sometimes it ends up working:

2022-10-14 13:18:06.285 [INFO]  <./agent/agent.go:125>  (*agent).run    connecting
2022-10-14 13:18:06.308 [WARN]  <./agent/agent.go:134>  (*agent).run    failed to dial  {"error": "GET http://coder.coder.svc.cluster.local/api/v2/workspaceagents/me/metadata: unexpected status code 401: Agent token is invalid.: Try logging in using 'coder login \u003curl\u003e'."}
2022-10-14 13:18:12.710 [INFO]  <./agent/agent.go:125>  (*agent).run    connecting
2022-10-14 13:18:12.713 [INFO]  <./agent/agent.go:137>  (*agent).run    fetched metadata

This seems related to the coder Terraform provider.
Deleting these workspaces after they fail to start sometimes leaves all created resources untouched, forcing me to delete them manually.
This generates no logs on the coder side.
I tried downgrading to 0.9.9; no change.

Debug log of agent resource creation:

2022-10-14 16:05:16.997 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:392>       (*API).followLogs.func1 subscribe buffered log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "stage": "Starting workspace"}
2022-10-14 16:05:16.997 [DEBUG] <./provisionerd/runner/runner.go:855>   (*Runner).runWorkspaceBuild     workspace provision job logged  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "level": "INFO", "output": "coder_agent.main: Creating...", "workspace_build_id": "4d1b88da-5b26-45c0-b4f4-2a26df2e8cab"}
2022-10-14 16:05:16.997 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:190>       (*API).provisionerJobLogs       subscribe encoding log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "request_id": "38b0c7a2-00b9-422f-92a6-f77ac9bd8881", "stage": "Starting workspace"}
2022-10-14 16:05:17.048 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:338>    (*provisionerdServer).UpdateJob UpdateJob starting      {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f"}
2022-10-14 16:05:17.051 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:376>    (*provisionerdServer).UpdateJob job log {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "stage": "Starting workspace", "output": "coder_agent.main: Creating..."}
2022-10-14 16:05:17.053 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:386>    (*provisionerdServer).UpdateJob inserted job logs       {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f"}
2022-10-14 16:05:17.054 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:396>    (*provisionerdServer).UpdateJob published job logs      {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f"}
2022-10-14 16:05:17.054 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:392>       (*API).followLogs.func1 subscribe buffered log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "stage": "Starting workspace"}
2022-10-14 16:05:17.054 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:190>       (*API).provisionerJobLogs       subscribe encoding log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "request_id": "38b0c7a2-00b9-422f-92a6-f77ac9bd8881", "stage": "Starting workspace"}
2022-10-14 16:05:17.054 [DEBUG] <./provisionerd/runner/runner.go:855>   (*Runner).runWorkspaceBuild     workspace provision job logged  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "level": "INFO", "output": "coder_agent.main: Creation complete after 0s [id=43cb5671-f54a-43dc-956d-472947e3dd71]", "workspace_build_id": "4d1b88da-5b26-45c0-b4f4-2a26df2e8cab"}

fpoussin (Author) commented Oct 17, 2022

It's indeed related to the templates.
Having more than 3 containers in the pod causes this.

bpmct (Member) commented Oct 17, 2022

Thanks for sharing thorough debug logs!

However, I haven't been able to reproduce this: I tried a template with 4 containers in the pod and it succeeded every time across 12 starts. As a baseline, would you be able to try this template and see if the problem still occurs?

bpmct added the s0 (Major regression, all-hands-on-deck to fix) label Oct 17, 2022
fpoussin (Author) commented Oct 17, 2022

I had 1 init container + 4 containers for testing.
I've moved one of these to another deployment/pod and it works now.
I could not reproduce the issue with your example.

fpoussin (Author) commented

This is the template I'm using:
https://gist.github.com/fpoussin/b9da6254edd95a53e2853081779d72d1
Moving the postgres container to the main deployment will trigger the problem.
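For illustration, the failing layout is roughly the following (a minimal sketch, not the exact gist contents: it's shown as a bare kubernetes_pod for brevity where the gist uses a deployment, and resource names/images are placeholders):

resource "kubernetes_pod" "main" {
  metadata {
    name      = "coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}"
    namespace = "coder-workspaces"
  }
  spec {
    # main dev container, runs the agent via its init script
    container {
      name    = "dev"
      image   = "codercom/enterprise-base:ubuntu"
      command = ["sh", "-c", coder_agent.main.init_script]
      env {
        name  = "CODER_AGENT_TOKEN"
        value = coder_agent.main.token
      }
    }
    # moving this sidecar into the same pod is what triggers the 401s
    container {
      name  = "postgres"
      image = "postgres:14"
    }
  }
}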

bpmct (Member) commented Oct 19, 2022

Hmm... that's super odd. I was somewhat able to reproduce this, but instead of the agent being stuck on connecting, it appears connected and then disconnects immediately. In fact, I can't get any variations to work.

Here's my modified kube-pods.tf. I'm glad you got things working, but we can definitely investigate because it seems like this config should work.

dklesev commented Nov 27, 2022

I seem to be experiencing the same issue, but with VMs (on Exoscale) and auth = "token" in the coder_agent resource.
I inject the token from coder_agent into the user-data like this:

...
--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
echo "CODER_AGENT_TOKEN=${coder_agent.dev.token}" >> /etc/environment
sudo -E -u ubuntu /bin/bash -c '${coder_agent.dev.init_script}'
--//--

The injected token matches what coder state pull <workspace_name> shows,
but it differs from the value in the auth_token column of the workspace_agents table.
If I replace the value in the table with the one from the state, the agent connects successfully.

dklesev commented Dec 4, 2022

Found my issue.
It's due to these lines combined with having this in my template:

resource "coder_agent_instance" "dev" {
  agent_id    = coder_agent.dev.id
  instance_id = exoscale_compute_instance.instance.id
}

Instance-ID authentication seems to be assumed (even though auth = "token" is set).

So just removing that block makes the agent use the correct token.
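In other words, with token auth the agent block alone is enough; something like this works (a minimal sketch of my setup, attribute values are illustrative):

resource "coder_agent" "dev" {
  arch = "amd64"
  os   = "linux"
  auth = "token"
}

# No coder_agent_instance block: with auth = "token" the agent authenticates
# using coder_agent.dev.token injected via user-data, so pairing the agent to
# exoscale_compute_instance.instance.id is unnecessary (and was what broke token auth here).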

r2r-dev commented Dec 7, 2022

I can observe the same issue in two cases:

  1. When running a single k8s pod with one container and one init container. The issue can be fixed by replacing the init container with a k8s job (see the sketch after this list).
  2. When there is at least one coder_app which executes a command:
resource "coder_app" "startup-logs" {
  agent_id     = coder_agent.main.id
  icon         = "/icon/widgets.svg"
  slug         = "startup-logs"
  display_name = "Startup log"
  command      = "less +F /tmp/coder-startup-script.log"
}
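
For the first case, the replacement is roughly this (a sketch; the job name, image, and command are placeholders rather than my actual template):

resource "kubernetes_job" "prepare" {
  metadata {
    name      = "prepare-${data.coder_workspace.me.name}"
    namespace = "coder-workspaces"
  }
  spec {
    template {
      metadata {}
      spec {
        container {
          name    = "prepare"
          image   = "busybox:1.35"
          # whatever the init container used to do goes here
          command = ["sh", "-c", "echo preparing workspace"]
        }
        restart_policy = "Never"
      }
    }
  }
  # blocks until the job finishes; the workspace pod should depends_on this job
  wait_for_completion = true
}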

Coder:
Coder v0.12.9+2b6c229

Providers:
coder/coder/0.6.5
hashicorp/kubernetes/2.16.1

r2r-dev commented Dec 8, 2022

A workaround, albeit hacky, is to manually report app health status, for example from the startup_script:

CODER_AGENT_HEADER="cookie: coder_session_token=${CODER_AGENT_TOKEN}"
METADATA_ENDPOINT="${CODER_AGENT_URL}/api/v2/workspaceagents/me/metadata"
HEALTH_ENDPOINT="${CODER_AGENT_URL}/api/v2/workspaceagents/me/app-health"

curl \
 --silent \
 --fail \
 -X GET \
 -H "${CODER_AGENT_HEADER}" \
 "${METADATA_ENDPOINT}" \
   | jq -c \
       '[.apps[] | select( .health != "disabled")] | reduce .[] as $i ({}; .[$i.id] = "healthy") | {"Healths": .}' \
   | curl \
      --silent \
      --fail \
      -X POST \
      -H "${CODER_AGENT_HEADER}" \
      -d @- \
      "${HEALTH_ENDPOINT}"

kylecarbs (Member) commented

Hmm, does this happen after provisioning completes? It's natural for this to happen during the provision job, because the resources haven't been created yet.

kylecarbs (Member) commented

@r2r-dev can you post your template? I've been unable to reproduce.

kylecarbs added a commit that referenced this issue Dec 14, 2022
I'm not sure why this issue is common, but it seems to be
based on: #4551.

This improves the error messages to be unique,
and also fixes a small edge-case bug a user ran into.
kylecarbs (Member) commented

@r2r-dev I found the issue you're having! It'll be fixed in a release coming out in a few hours :)

bpmct closed this as completed Feb 9, 2023
bpmct removed the s0 (Major regression, all-hands-on-deck to fix) label Feb 9, 2023