Can the Coder agent immediately realize it's disconnected? #5901

bpmct · 2023-01-28T18:22:52Z

With #5292, the experience is quite nice when I type sudo reboot or a pod gets rescheduled (e.g. on spot VMs). However, in the dashboard, the agent still thinks it's connected. Could this be instant and we display a message to the user saying the agent has disconnected? Eventually, this could work with #4680 or other messages in the terminal/IDEs to inform the user.


mafredri · 2023-01-28T19:36:51Z

This is something I could look at before tackling #4677 to add shutdown lifecycle states. It should be fairly minor:

- Better signal handling so the agent doesn’t get killed immediately
- If the above is not enough, add a final ping-of-death at the end of agent shutdown

This change allows the agent to handle common shutdown signals like interrupt, hangup and terminate and initiate a graceful shutdown. As long as terraform providers initiate graceful shutdowns via the aforementioned signals, things like SSH connections will be closed immediately on shutdown instead of being left hanging/timing out due to the agent being abruptly killed. Refs: #4677, #5901

bpmct · 2023-02-10T19:51:06Z

@mafredri is there another part to this? I noticed that the dashboard still reports the agent is online

Screen.Recording.2023-02-10.at.1.49.00.PM.mov

mafredri · 2023-02-10T21:34:24Z

@bpmct if the agent is getting killed abruptly, then there’s not much we can do except wait for the timeout to detect the agent is disconnected. Recent changes should improve conditions to avoid being killed abruptly, but it’s still possible.

Could you share how the agent is being run? That could help discern how it is being stopped. Logs from the agent might also help.

Ultimately, if the agent is receiving SIGINT or SIGTERM, then disconnection should work. You can try this via kill -INT $agent_pid.

bpmct · 2023-02-10T21:50:55Z

It's being started via user-data, same as our aws-linux example.

What is the timeout for the agent at this point? How long would it take for the server to realize an agent has lost its connection after the agent is killed abruptly?

mafredri · 2023-02-10T22:08:14Z

I can evaluate the relevant timeouts on Monday, don’t know off the top of my head (I believe they may also be configurable).

In the meantime, can you verify whether InstanceInitiatedShutdownBehavior is set to Stop or Terminate? I’m hoping it’s Terminate, in which case changing to Stop might fix the issue.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/terminating-instances.html#Using_ChangingInstanceInitiatedShutdownBehavior

mafredri · 2023-02-13T14:18:17Z

@bpmct I'm trying to reproduce this issue (not on AWS), but each time I try to kill the agent, gracefully or abruptly (kill -9), the disconnected status shows up immediately.

That is, it should be fixed in 2dbe00a.

The thing is, I looked at your recording and it's showing your agent is based on commit 2dbe00a, and I have no idea why it's not working for you 🤔.

mafredri · 2023-02-13T16:17:01Z

So after @bpmct helped me reproduce this issue (🙏🏻), it seems that we have an agent inactivity reporting issue.

As Ben observed, rebooting the server is an easy repro. This causes the agent to be considered disconnected only after like 20 minutes (Tailscale/WireGuard timeout I presume).

Another way to reproduce this is to stop networking:

However, simply killing the agent (either gracefully or immediately) does not reproduce this issue:

Why the difference? Since networking is still up in the last case, we can immediately detect that the other end went away. In the reboot case, the networking is shut down before the agent because it's not running as a systemd service (manually stopping the network mimics this case).

Typically an agent is considered disconnected after not talking to the server for 6 seconds, but for some reason, this is not happening here. We need to dig deeper into why this is and since we can't really reduce the Tailscale/WireGuard timeout, we need another path to discover that chatter has ceased.

Fixes #5901

bpmct · 2023-03-13T13:35:11Z

Nice!

Screen.Recording.2023-03-13.at.6.34.12.AM.mov

mafredri mentioned this issue Jan 30, 2023

feat(agent): Handle signals and shutdown gracefully #5914

Merged

mafredri mentioned this issue Feb 10, 2023

fix(api): Allow workspace agent coordinate to report disconnect #6152

Merged

mafredri mentioned this issue Feb 13, 2023

Research and document how to gracefully shut down agents using different terraform providers (update templates) #6174

Closed

7 tasks

bpmct mentioned this issue Feb 13, 2023

As an operator, how will I know if one of my users runs into an error? #6181

Closed

mafredri self-assigned this Mar 6, 2023

mafredri mentioned this issue Mar 9, 2023

fix(coderd): Detect agent disconnect via inactivity #6528

Merged

mafredri added a commit that referenced this issue Mar 9, 2023

fix(coderd): Detect agent disconnect via inactivity

d35f8ff

Fixes #5901

mafredri added a commit that referenced this issue Mar 9, 2023

fix(coderd): Detect agent disconnect via inactivity

50be1ca

Fixes #5901

mafredri added a commit that referenced this issue Mar 9, 2023

fix(coderd): Detect agent disconnect via inactivity

dc70eab

Fixes #5901

mafredri closed this as completed in #6528 Mar 13, 2023

mafredri added a commit that referenced this issue Mar 13, 2023

fix(coderd): Detect agent disconnect via inactivity (#6528)

179d9e0

Fixes #5901
