New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Operator api server modularization #24228
Operator api server modularization #24228
Conversation
187b6e7
to
cb1206f
Compare
4128fa6
to
2c4547d
Compare
operator/cmd/root.go
Outdated
operatorCfg *operatorOption.OperatorConfig, | ||
) api.SharedConfig { | ||
return api.SharedConfig{ | ||
EnableK8s: operatorCfg.EnableK8s, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be preferable to this SharedConfig
to instead check for k8s enabled status via the k8s clientset: clientset.IsEnabled()
, rational being that it's better to have only a single source of truth.
Hmm, or is the problem that the k8s client gets configured and we want to use EnableK8s
to overriding? Alternative I suppose is to explicitly Disable()
the client based on EnableK8s
in an Invoke
. That would also make sure that the heartbeats aren't running. Though need to be careful that the Invoke
runs early enough before any calls to clientset.IsEnabled()
.
... almost feels like the cleanest solution would be to make enable-k8s
part of client.Config
, but that would also then introduce it to cilium-agent
and others. Still maybe not a bad idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the best solution is to move enable-k8s
as part of client.Config
.
I don't think it will be a problem to introduce it to other binaries. The original PR for this was mentioning that the other binaries were already working correctly. Specifically, I think the agent was behaving properly because of the clientset.IsEnabled()
checks.
Instead the operator had some code paths that where not guarded by that, so an additional flag was added (adding a new one was easier, I suppose).
2c4547d
to
c3a0f21
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for cli, sig-k8s, api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mostly. My comments are around commit squashing, especially to squash test commits with the implementation commits and any cmdref changes squashed into the commits that introduced them.
3a14638
to
96b531b
Compare
/test Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one non-blocking nit: I think we meant to squash "operator/api: Add api server tests" with the previous commit.
96b531b
to
920d9aa
Compare
You're right, I was missing that. Squashed 👍 |
/test Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/1668/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/1684/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/845/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/1727/ If it is a flake and a GitHub issue doesn't already exist to track it, comment |
Seems like a legit failure with the Operator |
920d9aa
to
53f14f0
Compare
Modularize operator api server metrics handler into its own independent cell. This will improve the testability of the handler. The old code is left untouched to complete the server refactoring in another commit. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Add HTTP response "Not Implemented" for the oeprator api server health handler. The code is generated with go-swagger running `make generate-operator-api`. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Modularize operator api server health handler into its own independent cell. This will improve the testability of the handler. The old code is left untouched to complete the server refactoring in another commit. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Avoid initialization of leader election context and cancel function at package import time to allow restarting the operator multiple times in controlplane testing. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Modularize the operator api server into its own independent cell. The refactoring aims to keep the same behavior of the previous implementation. Specifically, the refactored server: - adds a custom handler for /healthz as an alias to /v1/healthz - starts two HTTP servers, one listening on "127.0.0.1:0" and the other on "[::1]:0", in case the user is running ipv4 or ipv6 only Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Avoid specifying a port for the operator API server in controlplane testing. This can cause port clashing in the test environment, leading to test flakyness. Instead, let the kernel pick a random available port using ":0". Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Even if the operator is not leading, it should answer to the health check requests. In other words, the operator api server should always be up and running, whether the operator has been elected leader or not. In case the operator is elected leader, the health handler should actually check the server version and, if configured, the connection to the kvstore, while it should simply return a `200 OK` response code if the operator is not leading. To configure the health handler behavior after the leader election takes place, the health handler accepts a function to get the current leading status of the operator instance. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The CNPStatusNodesGC test verifies the garbage collection of the stale status nodes updates at the operator startup, when the related config option is set. The test is divided in two parts: the first one verifies that the stale updates are not cleaned up when the GC is disabled, while the second one verifies the opposite. The first issue with the current code is the incorrect call to DeleteObjects, that is lacking the references to the objects to delete. IOW, the DeleteObjects calls is a no-op, and the second part of the test depends on what the first part leaves in the object trackers. Fixing this issue revealed another one. The modularization of the operator api server impacted the timing of the operator startup sequence. This sequence includes the legacyCell, that contains the code responsible for the startup of all the operator subsystems not yet modularized, including the nodes status updates GC. This resulted in the one-shot GC running before the object trackers held the CNPs needed for the test, ultimately leading to a test failure. To fix this timing issue, we insert all the k8s objects needed for the test before restarting the agent and the operator. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The enable-k8s option has been introduced to avoid some operator code paths that assumed to run in a k8s environment. Instead of passing around the flag, we use it to configure the clientset, so that the check `clientset.IsEnabled()` will keep that too into account and avoid the execution of kubernetes-specific code. The operator api server cell and tests have been updated accordingly. Related: cilium#21344 Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
53f14f0
to
6930f26
Compare
/test |
Previous revision was incorrectly trying to check for external kvstore status from the operator replica not elected leader. Fixed it with this patch: --- a/operator/api/health.go
+++ b/operator/api/health.go
@@ -61,11 +61,9 @@ func (h *healthHandler) Handle(params operator.GetHealthzParams) middleware.Resp
return operator.NewGetHealthzNotImplemented()
}
- if !h.isOperatorLeading() {
- if err := h.checkStatus(); err != nil {
- h.log.WithError(err).Warn("Health check status")
- return operator.NewGetHealthzInternalServerError().WithPayload(err.Error())
- }
+ if err := h.checkStatus(); err != nil {
+ h.log.WithError(err).Warn("Health check status")
+ return operator.NewGetHealthzInternalServerError().WithPayload(err.Error())
}
return operator.NewGetHealthzOK().WithPayload("ok")
@@ -74,7 +72,12 @@ func (h *healthHandler) Handle(params operator.GetHealthzParams) middleware.Resp
// checkStatus verifies the connection status to the kvstore and the
// k8s apiserver and returns an error if any of them is unhealthy
func (h *healthHandler) checkStatus() error {
- if h.kvstoreEnabled() {
+ // We check if we are the leader here because only the leader has
+ // access to the kvstore client. Otherwise, the kvstore client check
+ // will block. It is safe for a non-leader to skip this check, as the
+ // it is the leader's responsibility to report the status of the
+ // kvstore client.
+ if h.isOperatorLeading() && h.kvstoreEnabled() {
client := kvstore.Client()
if client == nil {
return errors.New("kvstore client not configured") |
test-runtime is hitting #24342 |
/test-runtime |
/test-1.26-net-next |
Modularize the operator api server into its own independent cell. The refactoring aims to keep the same behavior of the previous implementation. The refactored server:
To answer the health checks, the operator api server should always be up and running, whether the operator has been elected leader or not.
Therefore, the health handler behavior should change based on the operator instance being elected leader or not, so it is refactored to accept a function that returns the current leading status of the operator instance.
Integration style unit tests have been added for both the handlers and the server.
Fixes: #24064