Enable Pod network after realizing initial NetworkPolicies #5777

Merged · 1 commit · Dec 13, 2023

Conversation

tnqn (Member) commented Dec 7, 2023

Pod network should only be enabled after realizing initial NetworkPolicies; otherwise, traffic from/to Pods could bypass NetworkPolicy enforcement when antrea-agent restarts.

After commit f9fc979 ("Store NetworkPolicy in filesystem as fallback data source"), antrea-agent can realize either the latest NetworkPolicies received from antrea-controller or the ones read from the filesystem as a fallback. Therefore, waiting for NetworkPolicies to be realized should not add noticeable delay or make antrea-controller a single point of failure for the Pod network.

This commit adds a wait group implementation capable of waiting with a timeout, and uses it to wait for common initialization and NetworkPolicy realization before installing any flows for Pods. More preconditions can be added via the wait group if needed in the future.
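For reference, here is a minimal sketch of what such a timeout-capable wait group could look like. Only Group, Done, and WaitWithTimeout appear in this PR's discussion; NewGroup, Increment, and the internals below are illustrative assumptions, not the actual pkg/util/wait code.

package wait

import (
	"fmt"
	"sync"
	"time"
)

// Group wraps sync.WaitGroup and adds the ability to give up waiting
// after a timeout.
type Group struct {
	wg *sync.WaitGroup
}

func NewGroup() *Group {
	return &Group{wg: &sync.WaitGroup{}}
}

// Increment registers one more precondition to wait for (illustrative name).
func (g *Group) Increment() *Group {
	g.wg.Add(1)
	return g
}

// Done marks one precondition as satisfied.
func (g *Group) Done() {
	g.wg.Done()
}

// WaitWithTimeout blocks until all preconditions are satisfied, or returns
// an error once the timeout expires.
func (g *Group) WaitWithTimeout(timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		g.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("timed out waiting for the group")
	}
}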

tnqn added the action/backport (Indicates a PR that requires backports.) and action/release-note (Indicates a PR that should be included in release notes.) labels on Dec 7, 2023
tnqn force-pushed the ensure-netpol-realization branch 2 times, most recently from 8e706b5 to 2966360, on December 7, 2023 16:11
tnqn (Member, Author) commented Dec 8, 2023

/test-all
/test-flexible-ipam-e2e
/test-windows-all
/test-ipv6-all

tnqn marked this pull request as ready for review on December 8, 2023 03:14
antoninbas (Contributor) left a comment

LGTM, some small questions

I was initially wondering if we should just not start the CNI server until the wait is over (wait -> reconcile fully -> start CNI server), as I think it would make the code simpler. However, I also think that the current approach may be better from a UX and troubleshooting perspective (if there is a bug and we are waiting indefinitely, at least we can handle CNI ADDs and log error messages, instead of having the container runtime fail to connect to the server).

); err != nil {
klog.ErrorS(err, "Error when re-installing flows for Pod", "Pod", klog.KRef(namespace, name))
}
go func(containerID, pod, namespace string) {
Contributor:

before there was a guarantee that the CNI server would not be started until reconciliation was complete (i.e., until this function returned). I assume that since the only thing that happens asynchronously is installation of Pod flows, a race condition should not be possible?

Comment on lines +456 to +460
containerConfig, exists := pc.ifaceStore.GetContainerInterface(containerID)
if !exists {
klog.InfoS("The container interface had been deleted, skip installing flows for Pod", "Pod", klog.KRef(namespace, name), "containerID", containerID)
return
}
Contributor:

related to my comment above, I assume this will take care of the case where a CNI DEL comes in before we have a chance to do reconciliation for this Pod?

Contributor:

If Pod IP or ofPort is reused, there can be an issue too.

antoninbas (Contributor):

for ofport: wouldn't this require a situation where 1) the port has been released (deleted from OVSDB), yet 2) the interface configuration is still present in the ifaceStore? I don't know if that's possible. That would require a CNI DEL with a successful pc.ovsBridgeClient.DeletePort and a failed pc.ifaceStore.DeleteInterface. I don't think it's possible for pc.ifaceStore.DeleteInterface to fail.

for the IP address: I think it's theoretically possible. The host-ipam release happens before OVS port deletion and pc.ifaceStore.DeleteInterface. So we could have this:

  1. A CNI DEL comes in for the Pod
  2. IPAM deletion runs successfully, IP is released
  3. OVS port deletion fails, interface is not deleted from ifaceStore
  4. this logic runs, we call InstallPodFlows for a Pod that has been deleted already
  5. A CNI ADD comes in for a new Pod, and the same IP address is allocated (unlikely with host-ipam, with IPs being assigned sequentially, but maybe this is the only IP available). Can the flows installed in 4) disrupt networking for the new Pod?
  6. A new CNI DEL comes in for the old Pod (as per the CNI spec, given that the previous one did not succeed).

I guess this situation does not arise when reconciliation runs to completion before the CNI server starts. What do you think @tnqn ?

tnqn (Member Author):

related to my comment above, I assume this will take care of the case where a CNI DEL comes in before we have a chance to do reconciliation for this Pod?

Yes, the lock and the check are there to prevent a race with the CNI DEL handler.

If Pod IP or ofPort is reused, there can be an issue too.

ofport reuse is not a problem as @antoninbas explained.

The situation @antoninbas described for Pod IP reuse is possible; however, it is not specific to this change. Even when processing CNI events, a similar situation can happen:

  1. A CNI DEL comes in for the Pod
  2. IPAM deletion succeeds, released the IP
  3. OVS flow deletion fails
  4. A CNI ADD comes in for a new Pod and the same IP is reused. There is no problem at this point, as it should override the IP's L3 flow.
  5. A new CNI DEL comes in for the old Pod, removing some flows installed by 4. This will cause a problem because it could remove the IP's L3 flow.

To avoid it, perhaps we could use a Pod-specific cookie ID, so that removing the flows of Pod A cannot affect Pod B even if they share the same IP.
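For illustration only, a hypothetical way such a Pod-specific cookie could be derived (the hashing scheme and the ovs-ofctl invocation below are assumptions, not Antrea's actual cookie allocator):

package main

import "hash/fnv"

// podFlowCookie derives a stable per-Pod cookie from the Pod UID, so that
// deleting Pod A's flows by exact cookie match cannot remove flows
// installed later for Pod B, even if both Pods reference the same reused IP.
func podFlowCookie(podUID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(podUID))
	return h.Sum64()
}

// Deletion would then match the cookie with a full mask, e.g.:
//   ovs-ofctl del-flows br-int "cookie=0xdeadbeef/-1"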

Contributor:

If this situation can already occur before that change, then there is no need to address it in this PR.

To avoid it, perhaps we could use a Pod-specific cookie ID, so that removing the flows of Pod A cannot affect Pod B even if they share the same IP.

That sounds fine.

BTW, would it help to just release the IP address last in the delete sequence, or do we expose ourselves to other problems in this case (e.g., easier to run out of IPs when we have a lot of Pods on a Node, > 200, and a lot of churn)?

Contributor:

OK. Agreed, we don't need to resolve all existing issues in this PR.

tnqn (Member Author):

BTW, would it help to just release the IP address last in the delete sequence, or do we expose ourselves to other problems in this case (e.g., easier to run out of IPs when we have a lot of Pods on a Node, > 200, and a lot of churn)?

Great suggestion. I think it's better: OVS flows reference Pod IPs, and deleting the references before deleting the referenced resource makes more sense and avoids the above issue.

Even if removing the OVS resources succeeds and releasing the Pod IP fails, the IP and the ofport have been decoupled, so there is no problem reusing the ofport with another Pod IP. Eventually, the container runtime should keep calling CNI to release the IP, or the startup cleanup will do it if the container runtime misbehaves.

I could create another PR for this improvement.

tnqn (Member Author):

#5788 implements the suggestion.
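A hedged sketch of the reordered DEL sequence proposed above, with illustrative function names (see #5788 for the actual change):

package main

import "fmt"

// podDeleter groups the steps of a CNI DEL; each field is an illustrative
// stand-in for the corresponding Antrea operation.
type podDeleter struct {
	uninstallPodFlows func(containerID string) error
	deleteOVSPort     func(containerID string) error
	deleteIfaceConfig func(containerID string) error
	releaseIP         func(containerID string) error
}

// handleCNIDel removes everything that references the Pod IP first and
// releases the IP itself last, so a partial failure can no longer leave a
// released IP behind live flows.
func (d *podDeleter) handleCNIDel(containerID string) error {
	// 1. Delete the OVS flows referencing the Pod IP and ofport.
	if err := d.uninstallPodFlows(containerID); err != nil {
		return fmt.Errorf("uninstalling Pod flows: %w", err)
	}
	// 2. Delete the OVS port; the ofport and the IP are now decoupled, so
	//    reusing the ofport with another Pod IP is safe.
	if err := d.deleteOVSPort(containerID); err != nil {
		return fmt.Errorf("deleting OVS port: %w", err)
	}
	// 3. Remove the interface configuration from the ifaceStore.
	if err := d.deleteIfaceConfig(containerID); err != nil {
		return fmt.Errorf("deleting interface config: %w", err)
	}
	// 4. Release the Pod IP last. If this fails, the container runtime
	//    retries CNI DEL, or the startup cleanup reclaims the IP.
	return d.releaseIP(containerID)
}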

); err != nil {
klog.ErrorS(err, "Error when re-installing flows for Pod", "Pod", klog.KRef(namespace, name))
}
}(containerConfig.ContainerID, name, namespace)
Contributor:

Do you think having several goroutines instead of a single goroutine taking care of the Pods can help reduce initialization time if we have 100+ Pods on the Node? Or would that be insignificant? I am just wondering if this is the reason why you went this route.

tnqn (Member Author):

It's just for simplicity, given that goroutines are lightweight; otherwise it would need to create another slice and iterate over it.

for i := 0; i < tt.done; i++ {
	g.Done()
}
err := g.WaitWithTimeout(100 * time.Millisecond)
Contributor:

I know I always bring it up :/, but I think it's better for everything time-based in unit tests if we use a virtual clock from the get-go (with the addition of a newGroupWithClock function). I have updated too many unit tests which would fail from time to time on Windows CI runners.

tnqn (Member Author):

The unit tests in this file are deterministic, but only because I didn't write some corner cases. I will rewrite them with a virtual clock and add some cases.
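A sketch of what a deterministic, virtual-clock test could look like. It assumes Group gains a clock field (used by WaitWithTimeout via clock.After instead of time.After) and a newGroupWithClock constructor, as suggested above; those names and the use of k8s.io/utils/clock are assumptions.

package wait

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	clocktesting "k8s.io/utils/clock/testing"
)

func TestWaitWithTimeoutExpires(t *testing.T) {
	fakeClock := clocktesting.NewFakeClock(time.Now())
	g := newGroupWithClock(fakeClock) // assumed constructor
	g.Increment()                     // one precondition that is never satisfied

	errCh := make(chan error, 1)
	go func() {
		errCh <- g.WaitWithTimeout(100 * time.Millisecond)
	}()
	// Wait until WaitWithTimeout has registered its timer with the fake
	// clock, then advance virtual time past the timeout. No real sleeping
	// is involved, so the test cannot flake on slow CI runners.
	require.Eventually(t, fakeClock.HasWaiters, time.Second, 10*time.Millisecond)
	fakeClock.Step(200 * time.Millisecond)
	require.Error(t, <-errCh)
}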

@@ -413,7 +414,7 @@ func parsePrevResult(conf *types.NetworkConfig) error {
return nil
}

-func (pc *podConfigurator) reconcile(pods []corev1.Pod, containerAccess *containerAccessArbitrator) error {
+func (pc *podConfigurator) reconcile(pods []corev1.Pod, containerAccess *containerAccessArbitrator, podNetworkWait *wait.Group) error {
Contributor:

Could you remind me when we flush existing flows in OVS after agent restart? Is that before or after this func?

tnqn (Member Author):

They are asynchronous, likely after this func in practice:

antrea/pkg/agent/agent.go

Lines 568 to 580 in 4ba9451

// Delete stale flows from previous round. We need to wait long enough to ensure
// that all the flow which are still required have received an updated cookie (with
// the new round number), otherwise we would disrupt the dataplane. Unfortunately,
// the time required for convergence may be large and there is no simple way to
// determine when is a right time to perform the cleanup task.
// TODO: introduce a deterministic mechanism through which the different entities
// responsible for installing flows can notify the agent that this deletion
// operation can take place. A waitGroup can be created here and notified when
// full sync in agent networkpolicy controller is complete. This would signal NP
// flows have been synced once. Other mechanisms are still needed for node flows
// fullSync check.
time.Sleep(10 * time.Second)
klog.Info("Deleting stale flows from previous round if any")

We may synchronize them using another WaitGroup, but I feel keeping existing flows until new flows are installed doesn't work as expected anyway, because the agent has already removed all groups and meters upon initialization.
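For instance, a hedged sketch of what that synchronization could look like on the agent side (names are illustrative; this is not the agent's current code, which still uses the fixed sleep above):

package agent

import (
	"time"

	"k8s.io/klog/v2"

	"antrea.io/antrea/pkg/util/wait"
)

// deleteStaleFlowsWhenReady waits for the components that install initial
// flows to signal the group, with a timeout as a fallback so that a stuck
// component cannot block cleanup forever, then deletes flows left over
// from the previous round.
func deleteStaleFlowsWhenReady(flowRestoreWait *wait.Group, deleteStaleFlows func()) {
	if err := flowRestoreWait.WaitWithTimeout(10 * time.Second); err != nil {
		klog.ErrorS(err, "Timed out waiting for initial flow installation, deleting stale flows anyway")
	}
	klog.Info("Deleting stale flows from previous round if any")
	deleteStaleFlows()
}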

tnqn (Member Author) commented Dec 11, 2023

I was initially wondering if we should just not start the CNI server until the wait is over (wait -> reconcile fully -> start CNI server), as I think it would make the code simpler. However, I also think that the current approach may be better from a UX and troubleshooting perspective (if there is a bug and we are waiting indefinitely, at least we can handle CNI ADDs and log error messages, instead of having the container runtime fail to connect to the server).

There was a problem where, if we didn't handle CNI events (especially the DEL ones) before some initializations finished, the system would never reconcile due to a circular dependency. When iptables is very slow:

  1. The agent can't acquire the iptables lock, blocking the startup of the CNI server.
  2. The container runtime calls CNI DEL, spending a lot of time in portmap (which also needs the iptables lock), but always fails on the primary CNI deletion due to 1. Thus it keeps calling portmap and acquiring the iptables lock.

Although this case won't happen again with an improvement in portmap, we also made enhancements on the Antrea side: the agent always processes CNI DEL, as it has no dependency on these flow/network initializations, and waits for a while instead of failing immediately. #1497 has more details.

It's less risky to keep the original behavior.

antoninbas previously approved these changes Dec 11, 2023
pkg/util/wait/wait_test.go (resolved)
}
g.Done()
select {
case <-time.After(100 * time.Millisecond):
Contributor:

please increase this duration to 500ms to avoid flakes

tnqn (Member Author):

done

tnqn (Member Author) commented Dec 12, 2023

/test-all

tnqn merged commit 4134ee5 into antrea-io:main on Dec 13, 2023
46 of 53 checks passed
tnqn deleted the ensure-netpol-realization branch on December 13, 2023 01:50
tnqn added the kind/bug label (Categorizes issue or PR as related to a bug.) on Jan 23, 2024
antoninbas added a commit that referenced this pull request Jun 3, 2024
Until a set of "essential" flows has been installed. At the moment, we
include NetworkPolicy flows (using podNetworkWait as the signal), Pod
forwarding flows (reconciled by the CNIServer), and Node routing flows
(installed by the NodeRouteController). This set can be extended in the
future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced
previously in #5777. It simplifies unit testing, and we can achieve some
symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous
rounds. However, at the moment this is incomplete, as we don't wait for
all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile"
operation (like the CNIServer) to install flows for the initial Node
list, we instead rely on a different mechanism provided by upstream K8s
for controllers. When registering event handlers, we can request that the
ADD handler include a boolean flag indicating whether the object is
part of the initial list retrieved by the informer. Using this
mechanism, we can reliably signal through flowRestoreCompleteWait when
this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency
on the proxy (kube-proxy or AntreaProxy) to access the Antrea
Controller. Prior to #6361, there would have been a circular dependency
in the case where kube-proxy was removed: flow-restore-wait would not be
removed until the Pod network was "ready", which would not happen until
the NetworkPolicy controller had started its watchers, and that depends
on Antrea Service reachability, which depends on flow-restore-wait being
removed.

Fixes #6338

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Labels
action/backport: Indicates a PR that requires backports.
action/release-note: Indicates a PR that should be included in release notes.
kind/bug: Categorizes issue or PR as related to a bug.

3 participants