Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health-check for coder_apps #2662

Closed
sharkymark opened this issue Jun 26, 2022 · 15 comments
Closed

Add health-check for coder_apps #2662

sharkymark opened this issue Jun 26, 2022 · 15 comments
Assignees
Labels
api Area: API feature Something we don't have yet site Area: frontend dashboard
Milestone

Comments

@sharkymark
Copy link
Contributor

sharkymark commented Jun 26, 2022

edited by @bpmct

Problem statement

Throughout Coder's documentation and examples, the startup_script is used to install web IDEs onto the workspace, such as code-server, Jetbrains Projector, JupyterLab, etc. From there, users connect via links in the dashboard. In the template, this is defined via the coder_app resource.

When the workspace starts, it takes 15-60s for the IDEs to install before a user can get to the page. When they click it before the app loads, there's a 404 page:

Screen Shot 2022-08-01 at 5 00 55 PM

However, when you refresh 30 seconds later, it works!

Screen Shot 2022-08-01 at 5 02 13 PM

Definition of done

From the dashboard, the app cannot be opened until the health check passes, or the app is eventually deemed unhealthy.

Prior art

Health checks are implemented for generic apps in Coder Classic with support for exec and http based health checks, but with a hardcoded interval/timeout/unhealthy threshold. There is not a loading indicator in the Ui, but when an app is clicked, the tab loads for x seconds until the health check passes/fails.

Ideas

@bpmct: Unlike health checks in Coder Classic, I think health checks would benefit from a configurable unhealthy threshold since applications will often be installed during runtime, leading to longer-than-normal "wait times." GCP follows this pattern

  • Some apps may depend on a process starting (e.g code-server, http.server) so it can be considered unhealthy in 15 seconds
  • Some apps may depend on IntelliJ downloading and could take 60+ seconds

Add health check to coder_app resource

resource "coder_app" "code_server" {
   # ...
+  health_check {
+     # actual schema TBD
+     enabled = true
+     unhealthy_threshold = "60s"
+   }
}

Before the unhealthy threshold, a loading indicator could be present making it clear to users the app is still unhealthy/loading until the health check passes or times out. Perhaps the app is also unclickable

Screen Shot 2022-08-01 at 5 27 45 PM

After the threshold is exceeded (e.g 3 mins), the app can have a red/error indicator if the health check never passes:

Screen Shot 2022-08-01 at 5 30 55 PM.

@sharkymark sharkymark mentioned this issue Jul 22, 2022
20 tasks
@bpmct bpmct mentioned this issue Jul 30, 2022
25 tasks
@bpmct bpmct removed their assignment Aug 1, 2022
@kylecarbs
Copy link
Member

Kubernetes has an initial delay and a period that polls at an interval. It could be an interesting thing to add:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

I wonder if this should be enabled or disabled by default? I can see it both ways.

@spikecurtis
Copy link
Contributor

Kubernetes is a good API to peek at, but I think we can simplify considerably. It has two checks,

  1. "liveness" which it uses to decide whether the container needs to be restarted
  2. "readiness" which is uses to decide whether to route traffic to a service when there are multiple backing replicas

Since we're just deciding whether to allow the app to be clickable on the dashboard, we should only have one kind of check.

Also, I think we can drop the "initial delay." Presumably that is there to prevent Kubernetes from restarting a container that just takes a long time to start. For our use case, it doesn't matter if the app fails the first health probes --- we just continue to wait.

@mafredri
Copy link
Member

mafredri commented Aug 2, 2022

Also, I think we can drop the "initial delay." Presumably that is there to prevent Kubernetes from restarting a container that just takes a long time to start. For our use case, it doesn't matter if the app fails the first health probes --- we just continue to wait.

I think there could still be value in "initial delay" so that we can accurately inform the user that there's a problem. They may be wondering why the app stays grayed-out forever?

@bpmct
Copy link
Member

bpmct commented Aug 2, 2022

I think there could still be value in "initial delay" so that we can accurately inform the user that there's a problem. They may be wondering why the app stays grayed-out forever?

I think an unhealthy threshold helps with that a bit more directly versus an initial delay which just indicates when the checks first start. For example, after 100 seconds of checks, an app can be determined "unhealthy." I have a few mocks of this in the description

@bpmct
Copy link
Member

bpmct commented Aug 2, 2022

However, I'm seeing now the unhealthy threshold in GCP is simply a number of checks so I could also see a few options working together to accurately check if an app starts in time, or fails. 🤷🏼

@misskniss
Copy link

@Emyrk do you have context on the effort for this in V1 in terms of complexity? Sounds like this might need an RFC.

@Emyrk
Copy link
Member

Emyrk commented Aug 2, 2022

Do you mean for V1 generic applications? @misskniss

I haven't worked with this part of the code yet in V2, so I can't comment on v2 complexity.

@kylecarbs
Copy link
Member

@misskniss this issue isn't relevant to v1. What context is missing from the issue that would require an RFC? Ben filled this out yesterday.

@bpmct
Copy link
Member

bpmct commented Aug 2, 2022

I feel like we can confidently come to a solution with the information we have, so it's up to the engineering owner how to settle on a schema (RFC, comments in GitHub, etc) I think we mostly just need an owner and estimate to get things going.

The idea of an RFC came up because it won't be a direct port from v1, but I've seen us implement features in different ways, such as done in #2989 and #2179

@misskniss
Copy link

misskniss commented Aug 2, 2022

@kylecarbs V1 only came up because we were asking about who had experience with it previously is all.

@Emyrk you were not in the session we were discussing it in but people thought you may have good insight on things we liked and did not like in V1, not that we expected you to take ownership necessarily. Though it is up for grabs if you want it.

@sreya
Copy link
Collaborator

sreya commented Aug 3, 2022

I wrote the V1 implementation so I could maybe help out here. I would love to hear feedback on what the pros/cons were of the first version...I never got much at the time (could be a good or a bad thing!).

@bpmct
Copy link
Member

bpmct commented Aug 8, 2022

@sreya - I haven't heard any cons of the health checks in v1, it continues to work well for me. One thing I'm worried about (but lack the full context) is that v1 health checks may not work well for apps that may take up to 3 minutes to download and start. This will be common in Coder OSS.

Based on that, we were discussing a few dashboard designs/changes to the schema to support longer wait times, as opposed to keeping a request open. Would love your thoughts here.

@spikecurtis
Copy link
Contributor

Architecturally, I think we should go with the local agent actually performing the health checks, and then reporting changes in status up to Coder Server, which puts it in the DB. This matches Kubernetes architecture where the local kubelet does health checking against pods running on the node.

I thought I heard it mentioned on the call that Coder Server should do it, but I think this isn't great because:

  1. Creates a background load (CPU, network IO) on Coder Server that scales with the number of coder apps.
  2. In the case of multiple Coder Servers we need to coordinate which ones healthcheck which apps
  3. Precludes exec-style healthchecks, which might be a useful increment (K8s supports them)

cc @kylecarbs

@kylecarbs
Copy link
Member

Entirely agree with @spikecurtis!

@bpmct bpmct added this to the EE milestone Aug 22, 2022
@kylecarbs kylecarbs added feature Something we don't have yet site Area: frontend dashboard api Area: API and removed needs grooming labels Aug 22, 2022
@BrunoQuaresma
Copy link
Collaborator

Adding an extra opinion on this. Would be nice to have this endpoint returning some status like “not found”, “initializing”, “ready”

@kylecarbs kylecarbs changed the title Add health-check for coder_apps and items in coder_script before allowing users to click them in the workspace UI Add health-check for coder_apps Sep 20, 2022
@f0ssel f0ssel closed this as completed Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Area: API feature Something we don't have yet site Area: frontend dashboard
Projects
None yet
Development

No branches or pull requests

10 participants