Add health-check for `coder_apps` #2662

sharkymark · 2022-06-26T19:29:30Z

edited by @bpmct

Problem statement

Throughout Coder's documentation and examples, the `startup_script` is used to install web IDEs onto the workspace, such as code-server, JetBrains Projector, and JupyterLab. From there, users connect via links in the dashboard. In the template, this is defined via the `coder_app` resource.

When the workspace starts, it takes 15-60s for the IDEs to install before a user can get to the page. If they click the link before the app has loaded, they get a 404 page.

However, when you refresh 30 seconds later, it works!

Definition of done

From the dashboard, the app cannot be opened until the health check passes, or the app is eventually deemed unhealthy.

Prior art

Health checks are implemented for generic apps in Coder Classic, with support for exec- and HTTP-based health checks but with a hardcoded interval, timeout, and unhealthy threshold. There is no loading indicator in the UI, but when an app is clicked, the tab loads for x seconds until the health check passes or fails.

Ideas

@bpmct: Unlike health checks in Coder Classic, I think health checks would benefit from a configurable unhealthy threshold, since applications will often be installed at runtime, leading to longer-than-normal "wait times." GCP follows this pattern.

Some apps only depend on a process starting (e.g. code-server, http.server), so they can be considered unhealthy after 15 seconds.
Other apps may depend on IntelliJ downloading and could take 60+ seconds.

Add health check to coder_app resource

resource "coder_app" "code_server" {
  # ...
+  health_check {
+    # actual schema TBD
+    enabled             = true
+    unhealthy_threshold = "60s"
+  }
}

Before the unhealthy threshold is reached, a loading indicator could be shown, making it clear to users that the app is still loading until the health check passes or times out. Perhaps the app is also unclickable during this time.

After the threshold is exceeded (e.g. 3 minutes), the app can show a red/error indicator if the health check never passes.


kylecarbs · 2022-08-01T22:50:10Z

Kubernetes has an initial delay and a period that polls at an interval. It could be an interesting thing to add:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

I wonder if this should be enabled or disabled by default? I can see it both ways.

spikecurtis · 2022-08-01T23:10:31Z

Kubernetes is a good API to peek at, but I think we can simplify considerably. It has two checks:

"liveness", which it uses to decide whether the container needs to be restarted
"readiness", which it uses to decide whether to route traffic to a service when there are multiple backing replicas

Since we're just deciding whether to allow the app to be clickable on the dashboard, we should only have one kind of check.

Also, I think we can drop the "initial delay." Presumably that is there to prevent Kubernetes from restarting a container that just takes a long time to start. For our use case, it doesn't matter if the app fails the first health probes --- we just continue to wait.

mafredri · 2022-08-02T13:18:08Z

> Also, I think we can drop the "initial delay." Presumably that is there to prevent Kubernetes from restarting a container that just takes a long time to start. For our use case, it doesn't matter if the app fails the first health probes --- we just continue to wait.

I think there could still be value in "initial delay" so that we can accurately inform the user that there's a problem. They may be wondering why the app stays grayed-out forever?

bpmct · 2022-08-02T13:55:55Z

> I think there could still be value in "initial delay" so that we can accurately inform the user that there's a problem. They may be wondering why the app stays grayed-out forever?

I think an unhealthy threshold helps with that a bit more directly than an initial delay, which only indicates when the checks first start. For example, after 100 seconds of checks, an app can be determined "unhealthy." I have a few mocks of this in the description.

bpmct · 2022-08-02T13:59:37Z

However, I'm seeing now that the unhealthy threshold in GCP is simply a number of checks, so I could also see a few options working together to accurately check whether an app starts in time or fails. 🤷🏼

misskniss · 2022-08-02T15:29:16Z

@Emyrk do you have context on the effort for this in V1 in terms of complexity? Sounds like this might need an RFC.

Emyrk · 2022-08-02T15:39:09Z

Do you mean for V1 generic applications? @misskniss

I haven't worked with this part of the code yet in V2, so I can't comment on v2 complexity.

kylecarbs · 2022-08-02T15:42:11Z

@misskniss this issue isn't relevant to v1. What context is missing from the issue that would require an RFC? Ben filled this out yesterday.

bpmct · 2022-08-02T16:47:35Z

I feel like we can confidently come to a solution with the information we have, so it's up to the engineering owner how to settle on a schema (RFC, comments in GitHub, etc.). I think we mostly just need an owner and an estimate to get things going.

The idea of an RFC came up because it won't be a direct port from v1, but I've seen us implement features in different ways, as was done in #2989 and #2179.

misskniss · 2022-08-02T19:27:59Z

@kylecarbs V1 only came up because we were asking who had previous experience with it, that's all.

@Emyrk you were not in the session where we discussed it, but people thought you might have good insight into what we liked and did not like in V1, not that we expected you to take ownership necessarily. Though it is up for grabs if you want it.

sreya · 2022-08-03T00:44:38Z

I wrote the V1 implementation so I could maybe help out here. I would love to hear feedback on what the pros/cons were of the first version...I never got much at the time (could be a good or a bad thing!).

bpmct · 2022-08-08T14:59:20Z

@sreya - I haven't heard any cons of the health checks in v1; they continue to work well for me. One thing I'm worried about (but lack the full context on) is that v1 health checks may not work well for apps that may take up to 3 minutes to download and start. This will be common in Coder OSS.

Based on that, we were discussing a few dashboard designs/changes to the schema to support longer wait times, as opposed to keeping a request open. Would love your thoughts here.

spikecurtis · 2022-08-09T18:42:49Z

Architecturally, I think we should go with the local agent actually performing the health checks, and then reporting changes in status up to Coder Server, which puts it in the DB. This matches Kubernetes architecture where the local kubelet does health checking against pods running on the node.

I thought I heard it mentioned on the call that Coder Server should do it, but I think this isn't great because:

It creates a background load (CPU, network IO) on Coder Server that scales with the number of coder apps.
In the case of multiple Coder Servers, we need to coordinate which ones healthcheck which apps.
It precludes exec-style healthchecks, which might be a useful increment (K8s supports them).
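The "agent checks locally and reports only changes in status" idea could look roughly like the sketch below. `ReportFunc` is a hypothetical stand-in for whatever call the agent would make to Coder Server, and `Tracker` is an illustrative name; neither comes from the codebase.

```go
package main

import "fmt"

// ReportFunc is a hypothetical stand-in for the agent's call to Coder Server.
type ReportFunc func(app, status string)

// Tracker remembers the last status seen per app so that the agent reports
// transitions only, rather than every probe result.
// It is not safe for concurrent use; a real agent would need locking.
type Tracker struct {
	last   map[string]string
	report ReportFunc
}

// NewTracker builds a Tracker that forwards status changes to report.
func NewTracker(report ReportFunc) *Tracker {
	return &Tracker{last: make(map[string]string), report: report}
}

// Observe records a probe result and reports it only if it differs from the
// previous result for that app. It returns whether a report was sent.
func (t *Tracker) Observe(app, status string) bool {
	if prev, ok := t.last[app]; ok && prev == status {
		return false
	}
	t.last[app] = status
	t.report(app, status)
	return true
}

func main() {
	var sent []string
	t := NewTracker(func(app, status string) {
		sent = append(sent, app+"="+status)
	})
	for _, s := range []string{"initializing", "initializing", "healthy", "healthy"} {
		t.Observe("code-server", s)
	}
	fmt.Println(sent) // [code-server=initializing code-server=healthy]
}
```

Four probe results produce two reports here, which is the property that keeps load on Coder Server proportional to status changes rather than to probe frequency.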

cc @kylecarbs

kylecarbs · 2022-08-09T18:46:20Z

Entirely agree with @spikecurtis!

BrunoQuaresma · 2022-08-29T13:49:27Z

Adding an extra opinion on this. It would be nice to have this endpoint return a status such as “not found”, “initializing”, or “ready”.

sharkymark mentioned this issue Jul 22, 2022

July Roadmap 2022 #3042

Closed

20 tasks

bpmct mentioned this issue Jul 30, 2022

August Roadmap 2022 #3182

Closed

25 tasks

tjcran assigned bpmct Aug 1, 2022

bpmct removed their assignment Aug 1, 2022

misskniss added the needs grooming label Aug 2, 2022

sharkymark mentioned this issue Aug 12, 2022

Failed workspace actions seem to create extra buttons #3481

Closed

bpmct added this to the EE milestone Aug 22, 2022

kylecarbs added the feature (Something we don't have yet), site (Area: frontend dashboard), and api (Area: API) labels and removed the needs grooming label Aug 22, 2022

kylecarbs assigned f0ssel Aug 31, 2022

kylecarbs changed the title ~~Add health-check for coder_apps and items in coder_script before allowing users to click them in the workspace UI~~ Add health-check for coder_apps Sep 20, 2022

f0ssel closed this as completed Sep 26, 2022
