
Delay removal of flow-restore-wait #6342

Conversation

@antoninbas antoninbas commented May 18, 2024

Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced previously in #5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanism provided by upstream K8s for controllers. When registering event handlers, we can request that the ADD handler include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency on the proxy (kube-proxy or AntreaProxy) to access the Antrea Controller. Prior to #6361, there would have been a circular dependency in the case where kube-proxy was removed: flow-restore-wait would not be removed until the Pod network was "ready", which would not happen until the NetworkPolicy controller had started its watchers, and that depended on antrea Service reachability, which in turn depended on flow-restore-wait being removed.

Fixes #6338

@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 3d6baba to 7ea44b5 Compare May 18, 2024 02:15
tnqn previously approved these changes May 20, 2024

@tnqn tnqn left a comment

LGTM

@@ -80,6 +83,12 @@ type Controller struct {
// or not when IPsec is enabled with "cert" mode. The NodeRouteController must wait for the certificate
// to be configured before installing routes/flows to peer Nodes to prevent unencrypted traffic across Nodes.
ipsecCertificateManager ipseccertificate.Manager
// flowRestoreCompleteWait is to be decremented after installing flows for initial Nodes.
flowRestoreCompleteWait *utilwait.Group
// hasProcessedInitialList keeps track of whether the initial informer list as been

Suggested change
// hasProcessedInitialList keeps track of whether the initial informer list as been
// hasProcessedInitialList keeps track of whether the initial informer list has been

@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 7ea44b5 to 5b169d1 Compare May 20, 2024 21:42
@antoninbas

/test-flexible-ipam-e2e
/test-vm-e2e

@@ -844,6 +847,21 @@ func run(o *Options) error {
go mcStrechedNetworkPolicyController.Run(stopCh)
}

klog.InfoS("Waiting for flow restoration to complete")
Contributor Author

@tnqn I wasn't sure what would be the best place for this code. I put it here because I think by that point all the controllers that need to install flows have been instantiated, initialized and "started" (even though at the moment, flowRestoreCompleteWait only waits for the NetworkPolicyController + NodeRouteController + CNIServer).

Member

It makes sense to me. I tried to delay FlowRestoreComplete but was unsure whether moving ConnectUplinkToOVSBridge has any impact. cc @gran-vmv

Contributor

We can run flexible-ipam-e2e multiple times to check if this change can work.

@antoninbas

/test-flexible-ipam-e2e

tnqn previously approved these changes May 21, 2024

@tnqn tnqn left a comment

LGTM

@antoninbas

/test-flexible-ipam-e2e

1 similar comment
@antoninbas

/test-flexible-ipam-e2e

@antoninbas

I got the following errors for test-flexible-ipam-e2e:

    --- FAIL: TestAntreaIPAM/testAntreaIPAMStatefulSetDedicated (19.02s)
    --- FAIL: TestAntreaIPAM/testAntreaIPAMStatefulSetShared (19.03s)

However, I have also observed the same test failures for another PR unrelated to this one (#6321)

@antoninbas

/test-all

@antoninbas

E2e tests on a Kind cluster on Linux with non default values and E2e tests on a Kind cluster on Linux with all features enabled are failing consistently, with a lot of individual unit test failures. I am investigating. There may be some "timing" assumption I am breaking with this change. I'm trying to reproduce locally. Best guess would be something to do with ProxyAll...

@antoninbas


So the issue seems to be that until flow-restore-wait is removed, the Agent may not be able to reach the antrea Service when proxyAll is enabled (we remove kube-proxy in that case in the CI tests), but with this change, flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NP controller has started its watchers, and that depends on antrea Service reachability.

So there is a chicken and egg problem here.
@tnqn what's your take on this? I don't have a great solution, even though I could always add a timeout:

flowRestoreCompleteWait.WaitWithTimeout(10 * time.Second)
if err := agentInitializer.FlowRestoreComplete(); err != nil {
        return err
}

Because it blocks, it would still delay a few key operations, such as starting the apiserver, which answers liveness probes.

BTW, I thought that if the watch couldn't succeed within a reasonable amount of time, we would fall back to data stored in files. Do you know the timeout? I didn't see the fallback happen during that 30s time window:

I0522 00:17:47.808376       1 agent.go:789] "Checking if AntreaProxy is ready"
I0522 00:17:47.809998       1 agent.go:794] "AntreaProxy is ready"
I0522 00:17:47.810514       1 agent.go:850] "Waiting for flow restoration to complete"
I0522 00:17:47.812074       1 egress_controller.go:491] Starting AntreaAgentEgressController
I0522 00:17:47.813086       1 shared_informer.go:311] Waiting for caches to sync for AntreaAgentEgressController
I0522 00:17:47.813839       1 networkpolicy_controller.go:599] Waiting for Antrea client to be ready
I0522 00:17:47.814993       1 networkpolicy_controller.go:610] Antrea client is ready
I0522 00:17:47.814993       1 local_ip_detector_linux.go:59] Starting localIPDetector
I0522 00:17:47.815206       1 networkpolicy_controller.go:625] Waiting for all watchers to complete full sync
I0522 00:17:47.815206       1 ip_scheduler.go:219] "Starting Egress IP scheduler"
I0522 00:17:47.816925       1 networkpolicy_controller.go:959] Starting watch for AppliedToGroup
I0522 00:17:47.817019       1 local_ip_detector_linux.go:114] Listed existing IP address: map[10.244.0.1:{} 127.0.0.1:{} 172.18.0.2:{} ::1:{} fc00:f853:ccd:e793::2:{} fe80::42:acff:fe12:2:{}]
I0522 00:17:47.817587       1 networkpolicy_controller.go:959] Starting watch for AddressGroup
I0522 00:17:47.818145       1 networkpolicy_controller.go:959] Starting watch for NetworkPolicy
I0522 00:17:47.915617       1 shared_informer.go:318] Caches are synced for AntreaAgentEgressController
I0522 00:17:47.915906       1 route_linux.go:1176] "Restoring IP routes and rules for Egress"
I0522 00:17:47.917184       1 egress_controller.go:1212] Starting watch for EgressGroup
I0522 00:17:47.918618       1 egress_controller.go:368] "Received service CIDR update"
I0522 00:17:47.934073       1 stream.go:132] "Sent outbound message" dataLength=16
I0522 00:17:47.950976       1 stream.go:132] "Sent outbound message" dataLength=120
I0522 00:17:47.954251       1 stream.go:132] "Sent outbound message" dataLength=16
I0522 00:17:47.956163       1 stream.go:132] "Sent outbound message" dataLength=16
I0522 00:17:50.849714       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:17:55.848094       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:00.851341       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:05.849188       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:10.849493       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:15.850687       1 stream.go:132] "Sent outbound message" dataLength=8
E0522 00:18:17.408533       1 server.go:446] "Cannot process CmdAdd request for container because network is not ready" err="timeout waiting for group" container="0c2a27d6d23d266ff9c3de1de2719b5ab8cd133ea5d40a73a47cd646cff8b7d9" timeout="30s"
I0522 00:18:17.415218       1 proxier.go:1010] syncProxyRules took 109.011µs
I0522 00:18:17.415480       1 runner.go:220] antrea-agent-proxy: ran, next possible in 1s, periodic in 30s

@tnqn

tnqn commented May 22, 2024

So the issue seems to be that until flow-restore-wait is removed, the Agent may not be able to reach the antrea Service when proxyAll is enabled (we remove kube-proxy in that case in the CI tests), but with this change, flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NP controller has started its watchers, and that depends on antrea Service reachability.

So there is a chicken and egg problem here.

It reminds me that I encountered this problem as well. Since we have a separate antreaClientProvider, do you think we could break the chicken-egg problem by not relying on the proxy, resolving the service to endpoint IP directly via something like https://github.com/kubernetes/kubernetes/blob/11f37d757fc0b710245446c80a8c9578ce2c02f1/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L33?

@antoninbas

It reminds me that I encountered this problem as well. Since we have a separate antreaClientProvider, do you think we could break the chicken-egg problem by not relying on the proxy, resolving the service to endpoint IP directly via something like https://github.com/kubernetes/kubernetes/blob/11f37d757fc0b710245446c80a8c9578ce2c02f1/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L33?

Thanks for the idea @tnqn, I will try it.
It is indeed different from the kubernetes Service case, since in this case we do have access to the K8s API (user has to provide an override) and hence we have access to the list of Endpoints.
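
The resolver idea tnqn links boils down to: skip the Service's virtual IP (which requires working proxy flows) and connect straight to a ready Endpoint address. Below is a stdlib-only sketch of that selection step, with the Endpoints data passed in as plain slices; the real change (#6361) reads Endpoints from the K8s API via client-go, so all names here are illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"strconv"
)

// resolveEndpoint picks one ready Endpoint address at random and returns
// it as "host:port", bypassing Service VIP load balancing entirely. This
// mirrors the approach of the linked kube-aggregator resolver, not
// Antrea's exact implementation.
func resolveEndpoint(readyAddrs []string, port int) (string, error) {
	if len(readyAddrs) == 0 {
		return "", fmt.Errorf("no ready endpoints")
	}
	addr := readyAddrs[rand.Intn(len(readyAddrs))]
	// JoinHostPort also brackets IPv6 addresses correctly.
	return net.JoinHostPort(addr, strconv.Itoa(port)), nil
}

func main() {
	// Hypothetical Endpoint data for the antrea Service.
	url, err := resolveEndpoint([]string{"10.244.0.5"}, 443)
	if err != nil {
		panic(err)
	}
	fmt.Println(url) // prints 10.244.0.5:443
}
```

Because the connection goes to a concrete Pod IP, it works before flow-restore-wait is removed, which is what breaks the cycle.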

@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 5b169d1 to e98ae09 Compare May 31, 2024 04:36
@antoninbas antoninbas added the action/release-note Indicates a PR that should be included in release notes. label May 31, 2024
cmd/antrea-agent/agent.go (review thread resolved)
pkg/agent/controller/noderoute/node_route_controller.go (review thread resolved)
@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from e98ae09 to 665973c Compare May 31, 2024 18:12
@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 665973c to 313eb8f Compare May 31, 2024 18:14
@antoninbas

/test-all
/test-vm-e2e
/test-windows-all
/test-flexible-ipam-e2e

@gran-vmv

gran-vmv commented Jun 3, 2024

/test-flexible-ipam-e2e

1 similar comment
@gran-vmv

gran-vmv commented Jun 3, 2024

/test-flexible-ipam-e2e


@tnqn tnqn left a comment


LGTM

@antoninbas antoninbas merged commit 33f2683 into antrea-io:main Jun 3, 2024
56 of 60 checks passed
@antoninbas antoninbas deleted the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch June 3, 2024 17:46
Labels
action/release-note Indicates a PR that should be included in release notes.
Successfully merging this pull request may close these issues.

[Flaky e2e test] TestConnectivity/testOVSRestartSameNode
3 participants