Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Agent's dependency on proxy to access Antrea Service #6361

Conversation

antoninbas
Copy link
Contributor

@antoninbas antoninbas commented May 22, 2024

We add Endpoint resolution to the AntreaClientProvider, so that when
running in-cluster, accessing the Antrea Service (i.e., accessing the
Antrea Controller API) no longer depends on the ClusterIP functionality
provided by the K8s proxy, whether it is kube-proxy or AntreaProxy.

This gives us more flexibility during Agent initialization. For example,
when kube-proxy is removed and ProxyAll is enable for AntreaProxy,
accessing the Antrea Service no longer requires any routes or OVS flows
installed by the Antrea Agent.

To implement this functionality, we add a controller (EndpointResolver),
to watch the Antrea Service and the corresponding Endpoints
resource. For every relevant update, the Endpoint is resolved and the
new URL is sent to the AntreaClientProvider. This is a similar model as
the one we already use for CA bundle updates.

Note that when the Service stops being available, we clear the Endpoint
URL and notify listeners. This means that GetAntreaClient() can now
return an error even if a previous call was successful.

@antoninbas
Copy link
Contributor Author

@tnqn This is related to the idea you brought up in #6342 (comment).

I considered 2 solutions:

  1. Use a controller to maintain an up-to-date endpoint URL (implemented in this PR)
  2. Resolve the endpoint by calling proxy.ResolveEndpoint synchronously every time GetAntreaClient() is called

I don't really have a strong preference between the 2 and I can go either way.

One issue with the current approach is the transient error logs when the Agent is started before the Controller, but maybe it's pretty minor:

I0522 22:27:14.957139       1 endpoint_resolver.go:159] "Starting controller" name="ServiceEndpointResolver-kube-system/antrea"
E0522 22:27:15.064154       1 endpoint_resolver.go:187] "Failed to resolve Service Endpoint, requeuing" err="no endpoints available for service \"antrea\"" key="kube-system/antrea"
E0522 22:27:15.176660       1 endpoint_resolver.go:187] "Failed to resolve Service Endpoint, requeuing" err="no endpoints available for service \"antrea\"" key="kube-system/antrea"
E0522 22:27:15.386246       1 endpoint_resolver.go:187] "Failed to resolve Service Endpoint, requeuing" err="no endpoints available for service \"antrea\"" key="kube-system/antrea"
E0522 22:27:15.789781       1 endpoint_resolver.go:187] "Failed to resolve Service Endpoint, requeuing" err="no endpoints available for service \"antrea\"" key="kube-system/antrea"
E0522 22:27:16.597705       1 endpoint_resolver.go:187] "Failed to resolve Service Endpoint, requeuing" err="no endpoints available for service \"antrea\"" key="kube-system/antrea"
E0522 22:27:18.203027       1 endpoint_resolver.go:187] "Failed to resolve Service Endpoint, requeuing" err="no endpoints available for service \"antrea\"" key="kube-system/antrea"
I0522 22:27:21.409353       1 endpoint_resolver.go:204] "Selected Endpoint has changed for Antrea Service" url="https://172.18.0.3:10349"

If we resolve every time GetAntreaClient is called, we may incur a small extra cost, but it is not significant as we just lookup the Service and Endpoints resource from the informer cache.

Let me know if you have a preference between the 2.
The current solution is tested and works, but I would not mind switching to the other solution.

endpointsInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
FilterFunc: func(obj interface{}) bool {
// The Endpoints resource for a Service has the same name as the Service.
if service, ok := obj.(*corev1.Service); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be Endpoints?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 😊 that explains why the tests are failing...

Comment on lines 103 to 137
// We do not care about potential Status updates.
if reflect.DeepEqual(newSvc.Spec, oldSvc.Spec) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just check generation which is more efficient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Generation available for all core resources, I never know?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought setting/updating generation is automatic for all resources, getting that impression from Egress and AntreaNetworkPolicy CRs, but just confirmed I was wrong: Service doesn't have a generation, the generation calculation of core resources is added case-by-case.

}
return false
},
// Any change to Endpoints will trigger a resync.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems only Subsets change will trigger a resync.

if err != nil {
return err
}
// The separate Load an Store calls are safe because there is a single
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The separate Load an Store calls are safe because there is a single
// The separate Load and Store calls are safe because there is a single

pkg/agent/client/endpoint_resolver.go Show resolved Hide resolved
@antoninbas antoninbas force-pushed the remove-agent-dependency-on-proxy-to-access-controller branch from 16a7c21 to d41062f Compare May 23, 2024 22:10
@antoninbas antoninbas changed the title [WIP] Remove Agent's dependency on proxy to access Antrea Service Remove Agent's dependency on proxy to access Antrea Service May 23, 2024
@antoninbas antoninbas requested a review from tnqn May 23, 2024 22:11
@antoninbas
Copy link
Contributor Author

@tnqn I addressed comments and did some improvements. I also added a few unit tests for the new code. PTAL.

// the Endpoints resource is updated in a way that will cause this function to be called again.
if errors.IsServiceUnavailable(err) {
klog.ErrorS(err, "Cannot resolve endpoint because Service is unavailable", "service", klog.KRef(r.namespace, r.serviceName))
r.updateEndpointIfNeeded(nil)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnqn I wanted to bring this to your attention, as it means that GetAntreaClient can return (nil, non-nil error) when the Antrea Service is not available (e.g. the antrea-controller Pod is restarted). I believe that with the previous behavior, GetAntreaClient was guaranteed to always succeed after the first successful call (that doesn't mean that the Service could be accessed successfully).

Let me know what you think. I think this is the right thing to do, and I don't think that should impact existing consumers of GetAntreaClient, but I can also remove the calls to r.updateEndpointIfNeeded(nil) if you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me.

@antoninbas
Copy link
Contributor Author

With the latest code, the new logs look much better because we avoid retries when unnecessary.

Agent starting before Controller:

I0523 22:08:26.117956       1 endpoint_resolver.go:194] "Starting controller" name="ServiceEndpointResolver-kube-system/antrea"
E0523 22:08:26.224238       1 endpoint_resolver.go:240] "Cannot resolve endpoint because Service is unavailable" err="no endpoints available for service \"antrea\"" service="kube-system/antrea"
I0523 22:08:33.546174       1 endpoint_resolver.go:276] "Selected Endpoint has changed for Service, notifying listeners" service="kube-system/antrea" url="https://172.18.0.4:10349"

Controller restart (Pod deletion, new Pod scheduled on different Node):

I0523 22:09:30.868733       1 endpoint_resolver.go:194] "Starting controller" name="ServiceEndpointResolver-kube-system/antrea"
I0523 22:09:30.969815       1 endpoint_resolver.go:276] "Selected Endpoint has changed for Service, notifying listeners" service="kube-system/antrea" url="https://172.18.0.4:10349"
E0523 22:09:49.622553       1 endpoint_resolver.go:240] "Cannot resolve endpoint because Service is unavailable" err="no endpoints available for service \"antrea\"" service="kube-system/antrea"
I0523 22:09:49.625716       1 endpoint_resolver.go:278] "No more selected Endpoint for Service, notifying listeners" service="kube-system/antrea"
E0523 22:09:49.843244       1 endpoint_resolver.go:240] "Cannot resolve endpoint because Service is unavailable" err="no endpoints available for service \"antrea\"" service="kube-system/antrea"
I0523 22:10:00.721816       1 endpoint_resolver.go:276] "Selected Endpoint has changed for Service, notifying listeners" service="kube-system/antrea" url="https://172.18.0.3:10349"

// the Endpoints resource is updated in a way that will cause this function to be called again.
if errors.IsServiceUnavailable(err) {
klog.ErrorS(err, "Cannot resolve endpoint because Service is unavailable", "service", klog.KRef(r.namespace, r.serviceName))
r.updateEndpointIfNeeded(nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me.


func NewEndpointResolver(kubeClient kubernetes.Interface, namespace, serviceName string, servicePort int32) *EndpointResolver {
key := namespace + "/" + serviceName
controllerName := fmt.Sprintf("ServiceEndpointResolver-%s", key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ServiceEndpointResolver:kube-system/antrea more readable than ServiceEndpointResolver-kube-system/antrea?


serviceInformer.Informer().AddEventHandler(cache.FilteringResourceEventHandler{
// FilterFunc ignores all Service events which do not relate to the named Service.
// It should be redudant given the filtering that we already do at the informer level.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/redudant/redundant, but do we still need to keep the filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it's ok to keep it. I saw the same pattern in ConfigMapCAController, even though I agree it is not needed.

pkg/agent/client/endpoint_resolver.go Outdated Show resolved Hide resolved
@antoninbas
Copy link
Contributor Author

/test-all

Comment on lines +600 to +604
// If Antrea client is not ready within 5s, we assume that the Antrea Controller is not
// available. We proceed with our watches, which are likely to fail. In turn, this will
// trigger the fallback mechanism.
// 5s should be more than enough if the Antrea Controller is running correctly.
ctx, cancel := context.WithTimeout(wait.ContextForChannel(stopCh), 5*time.Second)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnqn I tried a few things, but this was the simplest and most "correct" IMO.
Given that ConfigMapCAController is third-party code, it's hard to come up with a better solution (my preferred solution would have been to wait until both controllers have "synced" and processed initial items).
This works well in practice: if the antrea-controller is already running, the client should be ready with 1 or 2 seconds so we don't hit the timeout; otherwise, the timeout is short enough that we don't wait too long before falling back to local policy files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.

@antoninbas
Copy link
Contributor Author

/test-all

Copy link
Contributor

@luolanzone luolanzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this change is invisible to users and no need to be in release log?

pkg/agent/client/client.go Outdated Show resolved Hide resolved
pkg/agent/client/client.go Outdated Show resolved Hide resolved
pkg/agent/client/endpoint_resolver.go Outdated Show resolved Hide resolved
tnqn
tnqn previously approved these changes May 27, 2024
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once typos are fixed.

Comment on lines +600 to +604
// If Antrea client is not ready within 5s, we assume that the Antrea Controller is not
// available. We proceed with our watches, which are likely to fail. In turn, this will
// trigger the fallback mechanism.
// 5s should be more than enough if the Antrea Controller is running correctly.
ctx, cancel := context.WithTimeout(wait.ContextForChannel(stopCh), 5*time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.

@antoninbas
Copy link
Contributor Author

antoninbas commented May 28, 2024

I suppose this change is invisible to users and no need to be in release log?

I don't feel very strongly either way, but if it were me I would mention it. I will add the action/release-note label, but feel free to remove it if you disagree.

@antoninbas antoninbas added the action/release-note Indicates a PR that should be included in release notes. label May 28, 2024
We add Endpoint resolution to the AntreaClientProvider, so that when
running in-cluster, accessing the Antrea Service (i.e., accessing the
Antrea Controller API) no longer depends on the ClusterIP functionality
provided by the K8s proxy, whether it is kube-proxy or AntreaProxy.

This gives us more flexibility during Agent initialization. For example,
when kube-proxy is removed and ProxyAll is enable for AntreaProxy,
accessing the Antrea Service no longer requires any routes or OVS flows
installed by the Antrea Agent.

To implement this functionality, we add a controller (EndpointResolver),
to watch the Antrea Service and the corresponding Endpoints
resource. For every relevant update, the Endpoint is resolved and the
new URL is sent to the AntreaClientProvider. This is a similar model as
the one we already use for CA bundle updates.

Note that when the Service stops being available, we clear the Endpoint
URL and notify listeners. This means that GetAntreaClient() can now
return an error even if a previous call was successful.

We also update the NetworkPolicyController in the Agent, so that we
fallback to saved policies in case the Antrea client does not become
ready within 5s.

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
@antoninbas antoninbas force-pushed the remove-agent-dependency-on-proxy-to-access-controller branch from 2469992 to 00220b9 Compare May 28, 2024 18:23
@antoninbas
Copy link
Contributor Author

/test-all
/test-vm-e2e
/test-windows-all
/test-flexible-ipam-e2e

@antoninbas
Copy link
Contributor Author

Linux e2e tests currently failing because of missing script:

./ci/jenkins/docker_login.sh: No such file or directory

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor Author

/test-all

1 similar comment
@antoninbas
Copy link
Contributor Author

/test-all

@antoninbas
Copy link
Contributor Author

/test-vm-e2e

@antoninbas antoninbas merged commit f20bdb7 into antrea-io:main May 31, 2024
58 of 63 checks passed
@antoninbas antoninbas deleted the remove-agent-dependency-on-proxy-to-access-controller branch May 31, 2024 04:22
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 31, 2024
Until a set of "essential" flows has been installed. At the moment, we
include NetworkPolicy flows (using podNetworkWait as the signal), Pod
forwarding flows (reconciled by the CNIServer), and Node routing flows
(installed by the NodeRouteController). This set can be extended in the
future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced
previously in antrea-io#5777. It simplifies unit testing, and we can achieve some
symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous
rounds. However, at the moment this is incomplete, as we don't wait for
all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile"
operation (like the CNIServer) to install flows for the initial Node
list, we instead rely on a different mechanims provided by upstream K8s
for controllers. When registering event handlers, we can request for the
ADD handler to include a boolean flag indicating whether the object is
part of the initial list retrieved by the informer. Using this
mechanism, we can reliably signal through flowRestoreCompleteWait when
this initial list of Nodes has been synced at least once.

This change is possible because of antrea-io#6361, which removed the dependency
on the proxy (kube-proxy or AntreaProxy) to access the Antrea
Controller. Prior to antrea-io#6361, there would have been a circular dependency
in the case where kube-proxy was removed: flow-restore-wait will not be
removed until the Pod network is "ready", which will not happen until
the NetworkPolicy controller has started its watchers, and that depends
on antrea Service reachability which depends on flow-restore-wait being
removed.

Fixes antrea-io#6338

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 31, 2024
Until a set of "essential" flows has been installed. At the moment, we
include NetworkPolicy flows (using podNetworkWait as the signal), Pod
forwarding flows (reconciled by the CNIServer), and Node routing flows
(installed by the NodeRouteController). This set can be extended in the
future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced
previously in antrea-io#5777. It simplifies unit testing, and we can achieve some
symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous
rounds. However, at the moment this is incomplete, as we don't wait for
all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile"
operation (like the CNIServer) to install flows for the initial Node
list, we instead rely on a different mechanims provided by upstream K8s
for controllers. When registering event handlers, we can request for the
ADD handler to include a boolean flag indicating whether the object is
part of the initial list retrieved by the informer. Using this
mechanism, we can reliably signal through flowRestoreCompleteWait when
this initial list of Nodes has been synced at least once.

This change is possible because of antrea-io#6361, which removed the dependency
on the proxy (kube-proxy or AntreaProxy) to access the Antrea
Controller. Prior to antrea-io#6361, there would have been a circular dependency
in the case where kube-proxy was removed: flow-restore-wait will not be
removed until the Pod network is "ready", which will not happen until
the NetworkPolicy controller has started its watchers, and that depends
on antrea Service reachability which depends on flow-restore-wait being
removed.

Fixes antrea-io#6338

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
antoninbas added a commit that referenced this pull request Jun 3, 2024
Until a set of "essential" flows has been installed. At the moment, we
include NetworkPolicy flows (using podNetworkWait as the signal), Pod
forwarding flows (reconciled by the CNIServer), and Node routing flows
(installed by the NodeRouteController). This set can be extended in the
future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced
previously in #5777. It simplifies unit testing, and we can achieve some
symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous
rounds. However, at the moment this is incomplete, as we don't wait for
all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile"
operation (like the CNIServer) to install flows for the initial Node
list, we instead rely on a different mechanims provided by upstream K8s
for controllers. When registering event handlers, we can request for the
ADD handler to include a boolean flag indicating whether the object is
part of the initial list retrieved by the informer. Using this
mechanism, we can reliably signal through flowRestoreCompleteWait when
this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency
on the proxy (kube-proxy or AntreaProxy) to access the Antrea
Controller. Prior to #6361, there would have been a circular dependency
in the case where kube-proxy was removed: flow-restore-wait will not be
removed until the Pod network is "ready", which will not happen until
the NetworkPolicy controller has started its watchers, and that depends
on antrea Service reachability which depends on flow-restore-wait being
removed.

Fixes #6338

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants