Remove Agent's dependency on proxy to access Antrea Service #6361
Conversation
@tnqn This is related to the idea you brought up in #6342 (comment). I considered 2 solutions:
I don't really have a strong preference between the 2 and I can go either way. One issue with the current approach is the transient error logs when the Agent is started before the Controller, but maybe it's pretty minor:
If we resolve every time, ...

Let me know if you have a preference between the 2.
endpointsInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
	FilterFunc: func(obj interface{}) bool {
		// The Endpoints resource for a Service has the same name as the Service.
		if service, ok := obj.(*corev1.Service); ok {
should it be Endpoints?
Thanks 😊 that explains why the tests are failing...
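For reference, a minimal sketch of what the corrected filter could look like, assuming the handler receives *corev1.Endpoints objects (corev1 = k8s.io/api/core/v1) and that namespace and serviceName are fields captured from the resolver:

FilterFunc: func(obj interface{}) bool {
	// The Endpoints resource for a Service has the same name as the Service.
	if endpoints, ok := obj.(*corev1.Endpoints); ok {
		return endpoints.Namespace == namespace && endpoints.Name == serviceName
	}
	return false
},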
		// We do not care about potential Status updates.
		if reflect.DeepEqual(newSvc.Spec, oldSvc.Spec) {
Maybe just check Generation, which would be more efficient.
Is Generation available for all core resources? I never know.
I thought setting/updating generation was automatic for all resources, getting that impression from Egress and AntreaNetworkPolicy CRs, but I just confirmed I was wrong: Service doesn't have a generation; the generation calculation for core resources is added case-by-case.
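So the DeepEqual check on the Spec has to stay. A minimal sketch of the resulting update handler, with queue and key as hypothetical names for the resolver's work queue and the Service key:

UpdateFunc: func(oldObj, newObj interface{}) {
	oldSvc, oldOk := oldObj.(*corev1.Service)
	newSvc, newOk := newObj.(*corev1.Service)
	if !oldOk || !newOk {
		return
	}
	// Service does not set metadata.generation, so compare Specs directly.
	// We do not care about potential Status updates.
	if reflect.DeepEqual(newSvc.Spec, oldSvc.Spec) {
		return
	}
	queue.Add(key)
},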
		}
		return false
	},
	// Any change to Endpoints will trigger a resync.
It seems only a Subsets change will trigger a resync.
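Presumably the code behind this comment does something like the following hedged sketch (again, queue and key are hypothetical names):

UpdateFunc: func(oldObj, newObj interface{}) {
	oldEndpoints, oldOk := oldObj.(*corev1.Endpoints)
	newEndpoints, newOk := newObj.(*corev1.Endpoints)
	if !oldOk || !newOk {
		return
	}
	// Only a change to Subsets can affect which endpoint we resolve to.
	if reflect.DeepEqual(newEndpoints.Subsets, oldEndpoints.Subsets) {
		return
	}
	queue.Add(key)
},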
if err != nil {
	return err
}
// The separate Load an Store calls are safe because there is a single
Suggested change:
- // The separate Load an Store calls are safe because there is a single
+ // The separate Load and Store calls are safe because there is a single
Force-pushed from 16a7c21 to d41062f
@tnqn I addressed comments and did some improvements. I also added a few unit tests for the new code. PTAL.
// the Endpoints resource is updated in a way that will cause this function to be called again.
if errors.IsServiceUnavailable(err) {
	klog.ErrorS(err, "Cannot resolve endpoint because Service is unavailable", "service", klog.KRef(r.namespace, r.serviceName))
	r.updateEndpointIfNeeded(nil)
@tnqn I wanted to bring this to your attention, as it means that GetAntreaClient can return (nil, non-nil error) when the Antrea Service is not available (e.g. the antrea-controller Pod is restarted). I believe that with the previous behavior, GetAntreaClient was guaranteed to always succeed after the first successful call (which doesn't mean that the Service could be accessed successfully). Let me know what you think. I think this is the right thing to do, and I don't think it should impact existing consumers of GetAntreaClient, but I can also remove the calls to r.updateEndpointIfNeeded(nil) if you prefer.
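To illustrate the consequence for callers, a hedged sketch of consumer-side handling (antreaClientProvider is a hypothetical variable holding the provider):

client, err := antreaClientProvider.GetAntreaClient()
if err != nil {
	// The Antrea Service may be temporarily unavailable (e.g. the
	// antrea-controller Pod was restarted), so this error can now occur
	// even after a previous successful call; treat it as transient.
	return fmt.Errorf("Antrea client is not ready: %w", err)
}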
The change makes sense to me.
With the latest code, the new logs look much better because we avoid unnecessary retries. Agent starting before Controller:
Controller restart (Pod deletion, new Pod scheduled on different Node):
func NewEndpointResolver(kubeClient kubernetes.Interface, namespace, serviceName string, servicePort int32) *EndpointResolver {
	key := namespace + "/" + serviceName
	controllerName := fmt.Sprintf("ServiceEndpointResolver-%s", key)
Is ServiceEndpointResolver:kube-system/antrea more readable than ServiceEndpointResolver-kube-system/antrea?
serviceInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
	// FilterFunc ignores all Service events which do not relate to the named Service.
	// It should be redudant given the filtering that we already do at the informer level.
s/redudant/redundant, but do we still need to keep the filter?
I feel like it's ok to keep it. I saw the same pattern in ConfigMapCAController, even though I agree it is not needed.
/test-all
// If Antrea client is not ready within 5s, we assume that the Antrea Controller is not
// available. We proceed with our watches, which are likely to fail. In turn, this will
// trigger the fallback mechanism.
// 5s should be more than enough if the Antrea Controller is running correctly.
ctx, cancel := context.WithTimeout(wait.ContextForChannel(stopCh), 5*time.Second)
@tnqn I tried a few things, but this was the simplest and most "correct" IMO. Given that ConfigMapCAController is third-party code, it's hard to come up with a better solution (my preferred solution would have been to wait until both controllers have "synced" and processed initial items). This works well in practice: if the antrea-controller is already running, the client should be ready within 1 or 2 seconds, so we don't hit the timeout; otherwise, the timeout is short enough that we don't wait too long before falling back to local policy files.
It looks good to me.
/test-all
I suppose this change is invisible to users and doesn't need to be in the release log?
LGTM once typos are fixed.
I don't feel very strongly either way, but if it were me I would mention it. I will add the ...
We add Endpoint resolution to the AntreaClientProvider, so that when running in-cluster, accessing the Antrea Service (i.e., accessing the Antrea Controller API) no longer depends on the ClusterIP functionality provided by the K8s proxy, whether it is kube-proxy or AntreaProxy. This gives us more flexibility during Agent initialization. For example, when kube-proxy is removed and ProxyAll is enabled for AntreaProxy, accessing the Antrea Service no longer requires any routes or OVS flows installed by the Antrea Agent.

To implement this functionality, we add a controller (EndpointResolver) to watch the Antrea Service and the corresponding Endpoints resource. For every relevant update, the Endpoint is resolved and the new URL is sent to the AntreaClientProvider. This is a similar model to the one we already use for CA bundle updates.

Note that when the Service stops being available, we clear the Endpoint URL and notify listeners. This means that GetAntreaClient() can now return an error even if a previous call was successful.

We also update the NetworkPolicyController in the Agent, so that we fall back to saved policies in case the Antrea client does not become ready within 5s.

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
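A hypothetical sketch of the notification model described in the commit message (the struct fields and listener wiring are illustrative, not the actual implementation; requires Go 1.19+ for atomic.Pointer, with sync/atomic and net/url imported):

type EndpointResolver struct {
	// endpointURL holds the currently resolved URL, or nil when the
	// Service has no available Endpoint.
	endpointURL atomic.Pointer[url.URL]
	// listeners are notified on every Endpoint URL change (e.g. the
	// AntreaClientProvider would register itself here).
	listeners []func()
}

func (r *EndpointResolver) updateEndpointIfNeeded(endpoint *url.URL) {
	// The separate Load and Store calls are safe because there is a
	// single writer: the resolver's event loop.
	current := r.endpointURL.Load()
	if current == endpoint || (current != nil && endpoint != nil && *current == *endpoint) {
		return // no change, skip the notifications
	}
	r.endpointURL.Store(endpoint)
	for _, listener := range r.listeners {
		listener()
	}
}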
Force-pushed from 2469992 to 00220b9
/test-all
Linux e2e tests are currently failing because of a missing script:
LGTM
/test-all
/test-all
/test-vm-e2e
Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced previously in #5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait.

We can also start leveraging this new wait group (flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanism provided by upstream K8s for controllers. When registering event handlers, we can request for the ADD handler to include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency on the proxy (kube-proxy or AntreaProxy) to access the Antrea Controller. Prior to #6361, there would have been a circular dependency in the case where kube-proxy was removed: flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NetworkPolicy controller has started its watchers, and that depends on antrea Service reachability which depends on flow-restore-wait being removed.

Fixes #6338

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
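For context, a sketch of the client-go mechanism this commit message refers to (requires a client-go version that supports ResourceEventHandlerDetailedFuncs; flowRestoreCompleteWait and installNodeFlows are hypothetical stand-ins for the Antrea code):

registration, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerDetailedFuncs{
	AddFunc: func(obj interface{}, isInInitialList bool) {
		// isInInitialList is true for objects delivered as part of the
		// informer's initial list, which lets us distinguish the initial
		// sync from Nodes added later.
		installNodeFlows(obj)
	},
})
if err != nil {
	return err
}
go func() {
	// HasSynced returns true once every object in the initial list has
	// been delivered to the handler above.
	cache.WaitForCacheSync(stopCh, registration.HasSynced)
	flowRestoreCompleteWait.Done()
}()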