Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

registry-facade: Ensure that node-labeler always monitors the registr-facade container #15053

Merged
merged 2 commits into from Dec 6, 2022

Conversation

utam0k
Copy link
Contributor

@utam0k utam0k commented Nov 30, 2022

Description

When only the container of the registry-facade was broken, the ready-nobel of the node was not disappearing because when the registry-facade container restarts, other containers even in the same pod don't restart.
Kubernetes docs about that

This PR monitors the state of the registry-facade with the node-labeler's liveness probe. If the registry-facade container is not ready, the prestop hooks of node-labeler will remove the label.

Previously, I thought to address this by changing the ready-probe-labeler. However, we realized that it would be simpler to go with the Kubernetes mechanism.
#15021

Related Issue(s)

Fixes: #13915

How to test

Please check this video more detail
https://www.loom.com/share/ed6f97ecaa9a4c249309f847af522308

  • Node label, gitpod.io/registry-facade_ready_ns_default disappears when deleting a registry-facade container and turn on when registry-facade come back
  • Node label, gitpod.io/registry-facade_ready_ns_default disappears when deleting a registry-facade pod and turn on when registry-facade come back

Release Notes

Do not land workspaces on the node with broken registry-facade

Documentation

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh

@utam0k utam0k requested a review from a team November 30, 2022 05:52
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Nov 30, 2022
@utam0k
Copy link
Contributor Author

utam0k commented Nov 30, 2022

@aledbf What do you think about this approach?

Copy link
Contributor

@kylos101 kylos101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@utam0k may I ask, could you also consider adding a similar probe to node-labeler container in ws-daemon daemonset? You'll see here, that it would benefit from the same pattern.

Comment on lines +312 to +324
LivenessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/ready",
Port: intstr.IntOrString{IntVal: ReadinessPort},
},
},
InitialDelaySeconds: 5,
PeriodSeconds: 2,
TimeoutSeconds: 2,
SuccessThreshold: 1,
FailureThreshold: 3,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, nice work, @utam0k ! In other words, this liveness probe will only prevent scheduling of future workspaces (until the label is added back), it won't interrupt workspaces that are actively pulling images on start.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@utam0k why are node-labeler PeriodSeconds time 2, and registry-facade 10?

The side effect is we'll be more likely (when network conditions are poor) to remove the node label, but not necessarily restart the registry-facade container (until more time has passed).

I assume intentional?

I am going to include @sagor999 , so that he can check if this will have an unintended consequence, and whether that is acceptable or not.

For example, if when we remove the label, that may cause the autoscaler to provision a new node, because more Re: workspace pods are unscheduable (ultimately, probably a good thing).

Copy link
Contributor

@jenting jenting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/hold
waits for @sagor999 review

@sagor999
Copy link
Contributor

sagor999 commented Dec 6, 2022

/unhold

@roboquat roboquat merged commit 8cf3a1d into main Dec 6, 2022
@roboquat roboquat deleted the to/labeler branch December 6, 2022 01:53
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production deployed Change is completely running in production release-note size/S team: workspace Issue belongs to the Workspace team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

registry-facade restart loop
5 participants