
Agent token invalid #4551
Closed · fpoussin opened this issue Oct 14, 2022 · 12 comments

fpoussin commented Oct 14, 2022

v0.9.10

Coder started showing these errors out of the blue.
The token seems to be correct, but coderd won't accept it.

2022-10-14 13:02:18.705 [WARN]  <./agent/agent.go:134>  (*agent).run    failed to dial  {"error": "GET http://coder.coder.svc.cluster.local/api/v2/workspaceagents/me/metadata: unexpected status code 401: Agent token is invalid.: Try logging in using 'coder login \u003curl\u003e'."}
~ k -n coder-workspaces exec -it coder-fpoussin-rb-5b5cf57855-4mv8t -c coder -- /bin/env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=coder-fpoussin-rb-5b5cf57855-4mv8t
TERM=xterm
CODER_AGENT_TOKEN=720e380e-cf12-4dfe-9210-864422081dc7
KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1
KUBERNETES_SERVICE_HOST=172.20.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT=tcp://172.20.0.1:443
HOME=/home/coder
  containers:                                                 
  - command:                             
    - sh                                                                                                                      
    - -c                           
    - "#!/usr/bin/env sh\nset -eux\n# Sleep for a good long while before exiting.\n#                                          
      This is to allow folks to exec into a failed workspace and poke around to\n#
      troubleshoot.\nwaitonexit() {\n\techo \"=== Agent script exited with non-zero
      code. Sleeping 24h to preserve logs...\"\n\tsleep 86400\n}\ntrap waitonexit                                             
      EXIT\nBINARY_DIR=$(mktemp -d -t coder.XXXXXX)\nBINARY_NAME=coder\nBINARY_URL=http://coder.coder.svc.cluster.local/bin/coder-linux-amd64\ncd
      \"$BINARY_DIR\"\n# Attempt to download the coder agent.\n# This could fail for
      a number of reasons, many of which are likely transient.\n# So just keep trying!\nwhile
      :; do\n\t# Try a number of different download tools, as we don not know what
      we\n\t# will have available.\n\tstatus=\"\"\n\tif command -v curl >/dev/null
      2>&1; then\n\t\tcurl -fsSL --compressed \"${BINARY_URL}\" -o \"${BINARY_NAME}\"
      && break\n\t\tstatus=$?\n\telif command -v wget >/dev/null 2>&1; then\n\t\twget
      -q \"${BINARY_URL}\" -O \"${BINARY_NAME}\" && break\n\t\tstatus=$?\n\telif command
      -v busybox >/dev/null 2>&1; then\n\t\tbusybox wget -q \"${BINARY_URL}\" -O \"${BINARY_NAME}\"
      && break\n\t\tstatus=$?\n\telse\n\t\techo \"error: no download tool found, please
      install curl, wget or busybox wget\"\n\t\texit 127\n\tfi\n\techo \"error: failed                                                                                                                                                                       
      to download coder agent\"\n\techo \"       command returned: ${status}\"\n\techo
      \"Trying again in 30 seconds...\"\n\tsleep 30\ndone\n\nif ! chmod +x $BINARY_NAME;                                      
      then\n\techo \"Failed to make $BINARY_NAME executable\"\n\texit 1\nfi\n\nexport
      CODER_AGENT_AUTH=\"token\"\nexport CODER_AGENT_URL=\"http://coder.coder.svc.cluster.local/\"\nexec                      
      ./$BINARY_NAME agent\n"      
    env:                                        
    - name: CODER_AGENT_TOKEN     
      value: 720e380e-cf12-4dfe-9210-864422081dc7
....

The token is indeed missing from the database (workspace_agents table).
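For anyone wanting to verify the same thing, a query along these lines works (a sketch; it assumes direct psql access to coderd's PostgreSQL database, with CODER_PG_CONNECTION_URL holding its connection string):

# list agent IDs and the tokens coderd actually expects
psql "$CODER_PG_CONNECTION_URL" -c "SELECT id, auth_token FROM workspace_agents;"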

Sometimes it ends up working:

2022-10-14 13:18:06.285 [INFO]  <./agent/agent.go:125>  (*agent).run    connecting
2022-10-14 13:18:06.308 [WARN]  <./agent/agent.go:134>  (*agent).run    failed to dial  {"error": "GET http://coder.coder.svc.cluster.local/api/v2/workspaceagents/me/metadata: unexpected status code 401: Agent token is invalid.: Try logging in using 'coder login \u003curl\u003e'."}
2022-10-14 13:18:12.710 [INFO]  <./agent/agent.go:125>  (*agent).run    connecting
2022-10-14 13:18:12.713 [INFO]  <./agent/agent.go:137>  (*agent).run    fetched metadata

This seems related to the coder Terraform provider.
Deleting these workspaces after they fail to start sometimes leaves all created resources untouched, forcing me to delete them manually.
This generates no logs on the coder side.
I tried downgrading to 0.9.9; no change.

Debug log of agent resource creation:

2022-10-14 16:05:16.997 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:392>       (*API).followLogs.func1 subscribe buffered log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "stage": "Starting workspace"}
2022-10-14 16:05:16.997 [DEBUG] <./provisionerd/runner/runner.go:855>   (*Runner).runWorkspaceBuild     workspace provision job logged  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "level": "INFO", "output": "coder_agent.main: Creating...", "workspace_build_id": "4d1b88da-5b26-45c0-b4f4-2a26df2e8cab"}
2022-10-14 16:05:16.997 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:190>       (*API).provisionerJobLogs       subscribe encoding log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "request_id": "38b0c7a2-00b9-422f-92a6-f77ac9bd8881", "stage": "Starting workspace"}
2022-10-14 16:05:17.048 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:338>    (*provisionerdServer).UpdateJob UpdateJob starting      {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f"}
2022-10-14 16:05:17.051 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:376>    (*provisionerdServer).UpdateJob job log {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "stage": "Starting workspace", "output": "coder_agent.main: Creating..."}
2022-10-14 16:05:17.053 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:386>    (*provisionerdServer).UpdateJob inserted job logs       {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f"}
2022-10-14 16:05:17.054 [DEBUG] (coderd.provisionerd-compassionate_kowalevski5) <./coderd/provisionerdaemons.go:396>    (*provisionerdServer).UpdateJob published job logs      {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f"}
2022-10-14 16:05:17.054 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:392>       (*API).followLogs.func1 subscribe buffered log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "stage": "Starting workspace"}
2022-10-14 16:05:17.054 [DEBUG] (coderd)        <./coderd/provisionerjobs.go:190>       (*API).provisionerJobLogs       subscribe encoding log  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "request_id": "38b0c7a2-00b9-422f-92a6-f77ac9bd8881", "stage": "Starting workspace"}
2022-10-14 16:05:17.054 [DEBUG] <./provisionerd/runner/runner.go:855>   (*Runner).runWorkspaceBuild     workspace provision job logged  {"job_id": "32454099-bd2d-4721-980f-e274cc97f20f", "level": "INFO", "output": "coder_agent.main: Creation complete after 0s [id=43cb5671-f54a-43dc-956d-472947e3dd71]", "workspace_build_id": "4d1b88da-5b26-45c0-b4f4-2a26df2e8cab"}

fpoussin (Author) commented Oct 17, 2022

It's indeed related to the templates.
Having more than 3 containers in the pod causes this.

bpmct (Member) commented Oct 17, 2022

Thanks for sharing thorough debug logs!

However, I haven't been able to reproduce this: I tried a template with 4 containers in the pod and it succeeded every time across 12 starts. As a baseline, would you be able to try this template and see if the problem still occurs?

bpmct added the s0 (Major regression, all-hands-on-deck to fix) label Oct 17, 2022
fpoussin (Author) commented Oct 17, 2022

I had 1 init container + 4 containers for testing.
I've moved one of these to another deployment/pod and it works now.
I could not reproduce the issue with your example.

fpoussin (Author) commented

This is the template I'm using:
https://gist.github.com/fpoussin/b9da6254edd95a53e2853081779d72d1
Moving the postgres container to the main deployment will trigger the problem.
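For illustration, the failing layout is roughly the following (a minimal sketch, not the exact gist contents: it's shown as a bare kubernetes_pod for brevity where the gist uses a deployment, and resource names/images are placeholders):

resource "kubernetes_pod" "main" {
  metadata {
    name      = "coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}"
    namespace = "coder-workspaces"
  }
  spec {
    # main dev container, runs the agent via its init script
    container {
      name    = "dev"
      image   = "codercom/enterprise-base:ubuntu"
      command = ["sh", "-c", coder_agent.main.init_script]
      env {
        name  = "CODER_AGENT_TOKEN"
        value = coder_agent.main.token
      }
    }
    # moving this sidecar into the same pod is what triggers the 401s
    container {
      name  = "postgres"
      image = "postgres:14"
    }
  }
}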

bpmct (Member) commented Oct 19, 2022

Hmm... that's super odd. I was somewhat able to reproduce this, but instead of the agent being stuck on connecting, it appears connected and then disconnects immediately. In fact, I can't get any variations to work.

Here's my modified kube-pods.tf. I'm glad you got things working, but we can definitely investigate because it seems like this config should work.

dklesev commented Nov 27, 2022

I seem to be experiencing the same issue, but with VMs (on Exoscale) and auth = "token" in the coder_agent resource.
I inject the token from coder_agent into the user-data like this:

...
--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
echo "CODER_AGENT_TOKEN=${coder_agent.dev.token}" >> /etc/environment
sudo -E -u ubuntu /bin/bash -c '${coder_agent.dev.init_script}'
--//--

The injected token matches what coder state pull <workspace_name> shows,
but it differs from the value in the auth_token column of the workspace_agents table.
If I replace the value in the table with the one from the state, the agent connects successfully.

dklesev commented Dec 4, 2022

Found my issue.
It's due to these lines combined with having this in my template:

resource "coder_agent_instance" "dev" {
  agent_id    = coder_agent.dev.id
  instance_id = exoscale_compute_instance.instance.id
}

Instance-ID authentication seems to be assumed (even though auth = "token" is set).

So just removing that block makes the agent use the correct token.
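In other words, with token auth the agent block alone is enough; something like this works (a minimal sketch of my setup, attribute values are illustrative):

resource "coder_agent" "dev" {
  arch = "amd64"
  os   = "linux"
  auth = "token"
}

# No coder_agent_instance block: with auth = "token" the agent authenticates
# using coder_agent.dev.token injected via user-data, so pairing the agent to
# exoscale_compute_instance.instance.id is unnecessary (and was what broke token auth here).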

r2r-dev commented Dec 7, 2022

I can observe the same issue in two cases:

  1. When running a single k8s pod with one container and one init container. The issue can be fixed by replacing the init container with a k8s job (see the sketch after this list).
  2. When there is at least one coder_app which executes a command:
resource "coder_app" "startup-logs" {
  agent_id     = coder_agent.main.id
  icon         = "/icon/widgets.svg"
  slug         = "startup-logs"
  display_name = "Startup log"
  command      = "less +F /tmp/coder-startup-script.log"
}
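
For the first case, the replacement is roughly this (a sketch; the job name, image, and command are placeholders rather than my actual template):

resource "kubernetes_job" "prepare" {
  metadata {
    name      = "prepare-${data.coder_workspace.me.name}"
    namespace = "coder-workspaces"
  }
  spec {
    template {
      metadata {}
      spec {
        container {
          name    = "prepare"
          image   = "busybox:1.35"
          # whatever the init container used to do goes here
          command = ["sh", "-c", "echo preparing workspace"]
        }
        restart_policy = "Never"
      }
    }
  }
  # blocks until the job finishes; the workspace pod should depends_on this job
  wait_for_completion = true
}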

Coder:
Coder v0.12.9+2b6c229

Providers:
coder/coder/0.6.5
hashicorp/kubernetes/2.16.1

r2r-dev commented Dec 8, 2022

A workaround, albeit hacky, is to manually report app health status, for example from the startup_script:

CODER_AGENT_HEADER="cookie: coder_session_token=${CODER_AGENT_TOKEN}"
METADATA_ENDPOINT="${CODER_AGENT_URL}/api/v2/workspaceagents/me/metadata"
HEALTH_ENDPOINT="${CODER_AGENT_URL}/api/v2/workspaceagents/me/app-health"

curl \
 --silent \
 --fail \
 -X GET \
 -H "${CODER_AGENT_HEADER}" \
 "${METADATA_ENDPOINT}" \
   | jq -c \
       '[.apps[] | select( .health != "disabled")] | reduce .[] as $i ({}; .[$i.id] = "healthy") | {"Healths": .}' \
   | curl \
      --silent \
      --fail \
      -X POST \
      -H "${CODER_AGENT_HEADER}" \
      -d @- \
      "${HEALTH_ENDPOINT}"

kylecarbs (Member) commented

Hmm, does this happen after provisioning completes? It's natural for this to happen during the provision job, because the resources haven't been created yet.

kylecarbs (Member) commented

@r2r-dev can you post your template? I've been unable to reproduce.

kylecarbs added a commit that referenced this issue Dec 14, 2022
I'm not sure why this issue is common, but it seems to be
based on: #4551.

This improves the error messages to be unique,
and also fixes a small edge-case bug a user ran into.
kylecarbs (Member) commented

@r2r-dev I found the issue you're having! It'll be fixed in a release coming out in a few hours :)

bpmct closed this as completed Feb 9, 2023
bpmct removed the s0 (Major regression, all-hands-on-deck to fix) label Feb 9, 2023