registry-facade: Ensure that node-labeler always monitors the registry-facade container #15053
@aledbf What do you think about this approach?
LivenessProbe: &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/ready",
			Port: intstr.IntOrString{IntVal: ReadinessPort},
		},
	},
	InitialDelaySeconds: 5,
	PeriodSeconds:       2,
	TimeoutSeconds:      2,
	SuccessThreshold:    1,
	FailureThreshold:    3,
},
This makes sense to me, nice work, @utam0k! In other words, this liveness probe will only prevent scheduling of future workspaces (until the label is added back); it won't interrupt workspaces that are actively pulling images on start.
@utam0k why is PeriodSeconds 2 for node-labeler but 10 for registry-facade?
The side effect is that, when network conditions are poor, we'll be more likely to remove the node label without necessarily restarting the registry-facade container (until more time has passed).
I assume intentional?
I am going to include @sagor999 , so that he can check if this will have an unintended consequence, and whether that is acceptable or not.
For example, removing the label may cause the autoscaler to provision a new node, because more workspace pods become unschedulable (ultimately, probably a good thing).
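To make the timing gap concrete, here is a rough sketch that compares how long each probe has to keep failing before the kubelet acts. The node-labeler values (PeriodSeconds 2, FailureThreshold 3) come from the diff above. The registry-facade FailureThreshold of 3 is an assumption (only its PeriodSeconds of 10 is mentioned here), and the helper names are hypothetical. The estimate ignores TimeoutSeconds and probe jitter.

```go
package main

import (
	"fmt"
	"time"
)

// probeTiming holds the probe settings that determine how quickly the
// kubelet reacts to a failing probe. Hypothetical helper type, not part
// of the Kubernetes API.
type probeTiming struct {
	PeriodSeconds    int
	FailureThreshold int
}

// failureWindow approximates how long a probe must keep failing before the
// kubelet acts: FailureThreshold consecutive failures, PeriodSeconds apart.
// It deliberately ignores TimeoutSeconds and scheduling jitter.
func failureWindow(p probeTiming) time.Duration {
	return time.Duration(p.PeriodSeconds*p.FailureThreshold) * time.Second
}

func main() {
	// Values taken from the node-labeler liveness probe in this PR.
	nodeLabeler := probeTiming{PeriodSeconds: 2, FailureThreshold: 3}
	// PeriodSeconds is from the review comment; FailureThreshold is assumed.
	registryFacade := probeTiming{PeriodSeconds: 10, FailureThreshold: 3}

	fmt.Println("node-labeler reacts after about", failureWindow(nodeLabeler))
	fmt.Println("registry-facade reacts after about", failureWindow(registryFacade))
}
```

Under these assumptions the label would be removed well before the registry-facade container itself is restarted, which matches the side effect described above.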
LGTM
/hold
waiting for @sagor999's review
/unhold
Description
When only the registry-facade container was broken, the ready label on the node was not removed, because when the registry-facade container restarts, the other containers in the same pod do not restart.
Kubernetes docs about that
This PR monitors the state of the registry-facade container with node-labeler's liveness probe. If the registry-facade container is not ready, the probe fails, node-labeler is restarted, and its preStop hook removes the label.
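The mechanism can be illustrated with a small, self-contained model. This is only a sketch of the kubelet behaviour the PR relies on (consecutive liveness failures trigger the preStop hook and a restart), not the actual node-labeler code. The `labeler` type and its methods are hypothetical, and re-adding the label on the next successful probe is a simplifying assumption.

```go
package main

import "fmt"

// labeler is a hypothetical model of the node-labeler container. Its
// liveness probe targets the registry-facade /ready endpoint.
type labeler struct {
	failureThreshold int  // consecutive failures tolerated before a restart
	failures         int  // current run of consecutive failures
	labelPresent     bool // whether the node currently carries the ready label
	restarts         int  // how many times the container was restarted
}

// preStop models the preStop hook: it removes the node label just before
// the container is stopped.
func (l *labeler) preStop() { l.labelPresent = false }

// probe processes one liveness probe result.
func (l *labeler) probe(registryFacadeReady bool) {
	if registryFacadeReady {
		l.failures = 0
		// Simplifying assumption: once registry-facade is ready again,
		// the restarted node-labeler adds the label back.
		l.labelPresent = true
		return
	}
	l.failures++
	if l.failures >= l.failureThreshold {
		// The kubelet runs the preStop hook, then restarts the container.
		l.preStop()
		l.restarts++
		l.failures = 0
	}
}

func main() {
	l := &labeler{failureThreshold: 3, labelPresent: true}
	for i, ready := range []bool{true, false, false, false, true} {
		l.probe(ready)
		fmt.Printf("probe %d: ready=%v label=%v restarts=%d\n",
			i+1, ready, l.labelPresent, l.restarts)
	}
}
```

In this model the label survives one or two failed probes and is removed only once the failure threshold is reached, so no new workspaces are scheduled onto the node until registry-facade recovers.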
Previously, I planned to address this by changing the ready-probe-labeler. However, we realized that it would be simpler to rely on the Kubernetes mechanism.
#15021
Related Issue(s)
Fixes: #13915
How to test
Please check this video for more detail:
https://www.loom.com/share/ed6f97ecaa9a4c249309f847af522308
Release Notes
Documentation
Werft options:
If enabled this will build install/preview
Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh