Can the Coder agent immediately realize it's disconnected? #5901

Closed
bpmct opened this issue Jan 28, 2023 · 8 comments · Fixed by #6528

@bpmct
Member

bpmct commented Jan 28, 2023

With #5292, the experience is quite nice when I type sudo reboot or a pod gets rescheduled (e.g. on spot VMs). However, in the dashboard, the agent still appears connected. Could the status update instantly, with a message telling the user that the agent has disconnected? Eventually, this could work with #4680 or other messages in the terminal/IDEs to inform the user.

@mafredri
Member

mafredri commented Jan 28, 2023

This is something I could look at before tackling #4677 to add shutdown lifecycle states. It should be fairly minor:

  1. Better signal handling so the agent doesn’t get killed immediately
  2. If the above is not enough, add a final ping-of-death at the end of agent shutdown

mafredri added a commit that referenced this issue Jan 30, 2023
This change allows the agent to handle common shutdown signals like
interrupt, hangup and terminate and initiate a graceful shutdown.

As long as terraform providers initiate graceful shutdowns via the
aforementioned signals, things like SSH connections will be closed
immediately on shutdown instead of being left hanging/timing out due to
the agent being abruptly killed.

Refs: #4677, #5901
@bpmct
Member Author

bpmct commented Feb 10, 2023

@mafredri is there another part to this? I noticed that the dashboard still reports the agent is online

Screen.Recording.2023-02-10.at.1.49.00.PM.mov

@mafredri
Member

@bpmct if the agent is getting killed abruptly, then there’s not much we can do except wait for the timeout to detect the agent is disconnected. Recent changes should improve conditions to avoid being killed abruptly, but it’s still possible.

Could you share how the agent is being run? That could help discern how it is being stopped. Logs from the agent might also help.

Ultimately, if the agent is receiving SIGINT or SIGTERM, then disconnection should work. You can try this via kill -INT $agent_pid.

@bpmct
Member Author

bpmct commented Feb 10, 2023

It's being started via user-data, same as our aws-linux example.

What is the timeout for the agent at this point? How long would it take for the server to realize an agent has lost its connection after being killed abruptly?

@mafredri
Member

mafredri commented Feb 10, 2023

I can evaluate the relevant timeouts on Monday, don’t know off the top of my head (I believe they may also be configurable).

In the meantime, can you verify whether InstanceInitiatedShutdownBehavior is set to Stop or Terminate? I’m hoping it’s Terminate, in which case changing to Stop might fix the issue.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/terminating-instances.html#Using_ChangingInstanceInitiatedShutdownBehavior

@mafredri
Member

@bpmct I'm trying to reproduce this issue (not on AWS), but each time I kill the agent, gracefully or abruptly (kill -9), the disconnected status shows up immediately.

That is, it should be fixed in 2dbe00a.

The thing is, I looked at your recording and it's showing your agent is based on commit 2dbe00a, and I have no idea why it's not working for you 🤔.

@mafredri
Member

So after @bpmct helped me reproduce this issue (🙏🏻), it seems that we have an agent inactivity reporting issue.

As Ben observed, rebooting the server is an easy repro. This causes the agent to be considered disconnected only after about 20 minutes (the Tailscale/WireGuard timeout, I presume).

Another way to reproduce this is to stop networking:

image

However, simply killing the agent (either gracefully or immediately) does not reproduce this issue:

image

Why the difference? In the last case, networking is still up, so we can immediately detect that the other end went away. In the reboot case, networking is shut down before the agent because the agent is not running as a systemd service (manually stopping the network mimics this case).

Typically an agent is considered disconnected after not talking to the server for 6 seconds, but for some reason, this is not happening here. We need to dig deeper into why this is. Since we can't really reduce the Tailscale/WireGuard timeout, we need another path to discover that chatter has ceased.

@bpmct
Member Author

bpmct commented Mar 13, 2023

Nice!

Screen.Recording.2023-03-13.at.6.34.12.AM.mov
