Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operator api server modularization #24228

Conversation

pippolo84
Copy link
Member

@pippolo84 pippolo84 commented Mar 7, 2023

Modularize the operator api server into its own independent cell. The refactoring aims to keep the same behavior of the previous implementation. The refactored server:

  • adds a custom handler for /healthz as an alias to /v1/healthz
  • in case of an empty server address option, it starts two HTTP servers, one listening on "127.0.0.1:0" and the other on "[::1]:0", in case the user is running ipv4 or ipv6 only, respectively

To answer the health checks, the operator api server should always be up and running, whether the operator has been elected leader or not.
Therefore, the health handler behavior should change based on the operator instance being elected leader or not, so it is refactored to accept a function that returns the current leading status of the operator instance.

Integration style unit tests have been added for both the handlers and the server.

Fixes: #24064

@pippolo84 pippolo84 added area/operator Impacts the cilium-operator component release-note/misc This PR makes changes that have no direct user impact. area/modularization labels Mar 7, 2023
@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch 5 times, most recently from 187b6e7 to cb1206f Compare March 10, 2023 16:48
@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch 4 times, most recently from 4128fa6 to 2c4547d Compare March 19, 2023 23:08
@pippolo84 pippolo84 marked this pull request as ready for review March 19, 2023 23:48
@pippolo84 pippolo84 requested review from a team as code owners March 19, 2023 23:48
operator/api/cell.go Outdated Show resolved Hide resolved
operator/api/server.go Outdated Show resolved Hide resolved
operator/cmd/root.go Outdated Show resolved Hide resolved
operatorCfg *operatorOption.OperatorConfig,
) api.SharedConfig {
return api.SharedConfig{
EnableK8s: operatorCfg.EnableK8s,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be preferable to this SharedConfig to instead check for k8s enabled status via the k8s clientset: clientset.IsEnabled(), rational being that it's better to have only a single source of truth.

Hmm, or is the problem that the k8s client gets configured and we want to use EnableK8s to overriding? Alternative I suppose is to explicitly Disable() the client based on EnableK8s in an Invoke. That would also make sure that the heartbeats aren't running. Though need to be careful that the Invoke runs early enough before any calls to clientset.IsEnabled().

... almost feels like the cleanest solution would be to make enable-k8s part of client.Config, but that would also then introduce it to cilium-agent and others. Still maybe not a bad idea?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the best solution is to move enable-k8s as part of client.Config.
I don't think it will be a problem to introduce it to other binaries. The original PR for this was mentioning that the other binaries were already working correctly. Specifically, I think the agent was behaving properly because of the clientset.IsEnabled() checks.
Instead the operator had some code paths that where not guarded by that, so an additional flag was added (adding a new one was easier, I suppose).

operator/api/health.go Outdated Show resolved Hide resolved
@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch from 2c4547d to c3a0f21 Compare March 21, 2023 11:42
@pippolo84 pippolo84 requested review from a team as code owners March 21, 2023 11:42
@pippolo84 pippolo84 requested a review from joamaki March 21, 2023 14:09
Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for cli, sig-k8s, api

@nathanjsweet nathanjsweet requested a review from a team March 24, 2023 17:25
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mostly. My comments are around commit squashing, especially to squash test commits with the implementation commits and any cmdref changes squashed into the commits that introduced them.

operator/api/metrics_test.go Outdated Show resolved Hide resolved
operator/api/health_test.go Outdated Show resolved Hide resolved
Documentation/cmdref/cilium-operator-alibabacloud_hive.md Outdated Show resolved Hide resolved
@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch from 3a14638 to 96b531b Compare April 4, 2023 07:59
@pippolo84
Copy link
Member Author

pippolo84 commented Apr 4, 2023

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one non-blocking nit: I think we meant to squash "operator/api: Add api server tests" with the previous commit.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch from 96b531b to 920d9aa Compare April 7, 2023 20:44
@pippolo84
Copy link
Member Author

LGTM, just one non-blocking nit: I think we meant to squash "operator/api: Add api server tests" with the previous commit.

You're right, I was missing that. Squashed 👍

@pippolo84
Copy link
Member Author

pippolo84 commented Apr 8, 2023

/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/1668/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/1684/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/845/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathConfig Etcd Check connectivity

Failure Output

FAIL: Cilium operator was not able to get into ready state

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/1727/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@christarazi
Copy link
Member

Seems like a legit failure with the Operator

@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch from 920d9aa to 53f14f0 Compare April 12, 2023 09:34
Modularize operator api server metrics handler into its own independent
cell.  This will improve the testability of the handler.

The old code is left untouched to complete the server refactoring in
another commit.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Add HTTP response "Not Implemented" for the oeprator api server health
handler.  The code is generated with go-swagger running `make
generate-operator-api`.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Modularize operator api server health handler into its own independent
cell. This will improve the testability of the handler.

The old code is left untouched to complete the server refactoring in
another commit.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Avoid initialization of leader election context and cancel function at
package import time to allow restarting the operator multiple times in
controlplane testing.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Modularize the operator api server into its own independent cell.  The
refactoring aims to keep the same behavior of the previous
implementation. Specifically, the refactored server:

- adds a custom handler for /healthz as an alias to /v1/healthz
- starts two HTTP servers, one listening on "127.0.0.1:0" and the other
  on "[::1]:0", in case the user is running ipv4 or ipv6 only

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Avoid specifying a port for the operator API server in controlplane
testing. This can cause port clashing in the test environment, leading
to test flakyness.
Instead, let the kernel pick a random available port using ":0".

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Even if the operator is not leading, it should answer to the health
check requests. In other words, the operator api server should always be
up and running, whether the operator has been elected leader or not.

In case the operator is elected leader, the health handler should
actually check the server version and, if configured, the connection to
the kvstore, while it should simply return a `200 OK` response code if
the operator is not leading.

To configure the health handler behavior after the leader election takes
place, the health handler accepts a function to get the current leading
status of the operator instance.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The CNPStatusNodesGC test verifies the garbage collection of the stale
status nodes updates at the operator startup, when the related config
option is set.

The test is divided in two parts: the first one verifies that the stale
updates are not cleaned up when the GC is disabled, while the second one
verifies the opposite.

The first issue with the current code is the incorrect call to
DeleteObjects, that is lacking the references to the objects to delete.
IOW, the DeleteObjects calls is a no-op, and the second part of the test
depends on what the first part leaves in the object trackers.

Fixing this issue revealed another one.

The modularization of the operator api server impacted the timing of the
operator startup sequence.  This sequence includes the legacyCell, that
contains the code responsible for the startup of all the operator
subsystems not yet modularized, including the nodes status updates GC.

This resulted in the one-shot GC running before the object trackers held
the CNPs needed for the test, ultimately leading to a test failure.

To fix this timing issue, we insert all the k8s objects needed for the
test before restarting the agent and the operator.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The enable-k8s option has been introduced to avoid some operator code
paths that assumed to run in a k8s environment.  Instead of passing
around the flag, we use it to configure the clientset, so that the check
`clientset.IsEnabled()` will keep that too into account and avoid the
execution of kubernetes-specific code.

The operator api server cell and tests have been updated accordingly.

Related: cilium#21344

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/operator-api-server-modularization branch from 53f14f0 to 6930f26 Compare April 12, 2023 09:46
@pippolo84
Copy link
Member Author

/test

@pippolo84
Copy link
Member Author

pippolo84 commented Apr 12, 2023

Seems like a legit failure with the Operator

Previous revision was incorrectly trying to check for external kvstore status from the operator replica not elected leader. Fixed it with this patch:

--- a/operator/api/health.go
+++ b/operator/api/health.go
@@ -61,11 +61,9 @@ func (h *healthHandler) Handle(params operator.GetHealthzParams) middleware.Resp
                return operator.NewGetHealthzNotImplemented()
        }
 
-       if !h.isOperatorLeading() {
-               if err := h.checkStatus(); err != nil {
-                       h.log.WithError(err).Warn("Health check status")
-                       return operator.NewGetHealthzInternalServerError().WithPayload(err.Error())
-               }
+       if err := h.checkStatus(); err != nil {
+               h.log.WithError(err).Warn("Health check status")
+               return operator.NewGetHealthzInternalServerError().WithPayload(err.Error())
        }
 
        return operator.NewGetHealthzOK().WithPayload("ok")
@@ -74,7 +72,12 @@ func (h *healthHandler) Handle(params operator.GetHealthzParams) middleware.Resp
 // checkStatus verifies the connection status to the kvstore and the
 // k8s apiserver and returns an error if any of them is unhealthy
 func (h *healthHandler) checkStatus() error {
-       if h.kvstoreEnabled() {
+       // We check if we are the leader here because only the leader has
+       // access to the kvstore client. Otherwise, the kvstore client check
+       // will block. It is safe for a non-leader to skip this check, as the
+       // it is the leader's responsibility to report the status of the
+       // kvstore client.
+       if h.isOperatorLeading() && h.kvstoreEnabled() {
                client := kvstore.Client()
                if client == nil {
                        return errors.New("kvstore client not configured")

@pippolo84
Copy link
Member Author

test-runtime is hitting #24342
Tracked the flake in the related issue, rerunning.

@pippolo84
Copy link
Member Author

/test-runtime

@pippolo84
Copy link
Member Author

/test-1.26-net-next

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 12, 2023
@dylandreimerink dylandreimerink merged commit d5005bf into cilium:master Apr 13, 2023
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/modularization area/operator Impacts the cilium-operator component ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Modularization: Operator API Server
5 participants