Enable Pod network after realizing initial NetworkPolicies #5777

Merged · 1 commit · Dec 13, 2023

Conversation

tnqn (Member) commented Dec 7, 2023

Pod network should only be enabled after realizing initial NetworkPolicies; otherwise, traffic from/to Pods could bypass NetworkPolicy enforcement when antrea-agent restarts.

After commit f9fc979 ("Store NetworkPolicy in filesystem as fallback data source"), antrea-agent can realize either the latest NetworkPolicies received from antrea-controller or the ones read from the filesystem as a fallback. Therefore, waiting for NetworkPolicies to be realized should not add noticeable delay or make antrea-controller a single point of failure for the Pod network.

This commit adds a wait group implementation capable of waiting with a timeout, and uses it to wait for common initialization and NetworkPolicy realization before installing any flows for Pods. More preconditions can be added via the wait group if needed in the future.
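For reference, here is a minimal sketch of what such a timeout-capable wait group could look like. Only Group, Done, and WaitWithTimeout appear in this PR's discussion; NewGroup, Increment, and the internals below are illustrative assumptions, not the actual pkg/util/wait code.

package wait

import (
	"fmt"
	"sync"
	"time"
)

// Group wraps sync.WaitGroup and adds the ability to give up waiting
// after a timeout.
type Group struct {
	wg *sync.WaitGroup
}

func NewGroup() *Group {
	return &Group{wg: &sync.WaitGroup{}}
}

// Increment registers one more precondition to wait for (illustrative name).
func (g *Group) Increment() *Group {
	g.wg.Add(1)
	return g
}

// Done marks one precondition as satisfied.
func (g *Group) Done() {
	g.wg.Done()
}

// WaitWithTimeout blocks until all preconditions are satisfied, or returns
// an error once the timeout expires.
func (g *Group) WaitWithTimeout(timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		g.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("timed out waiting for the group")
	}
}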

tnqn added the action/backport (Indicates a PR that requires backports.) and action/release-note (Indicates a PR that should be included in release notes.) labels on Dec 7, 2023
tnqn force-pushed the ensure-netpol-realization branch 2 times, most recently from 8e706b5 to 2966360, on December 7, 2023 16:11
tnqn (Member, Author) commented Dec 8, 2023

/test-all
/test-flexible-ipam-e2e
/test-windows-all
/test-ipv6-all

tnqn marked this pull request as ready for review on December 8, 2023 03:14
antoninbas (Contributor) left a comment

LGTM, some small questions

I was initially wondering if we should just not start the CNI server until the wait is over (wait -> reconcile fully -> start CNI server), as I think it would make the code simpler. However, I also think that the current approach may be better from a UX and troubleshooting perspective (if there is a bug and we are waiting indefinitely, at least we can handle CNI ADDs and log error messages, instead of having the container runtime fail to connect to the server).

); err != nil {
klog.ErrorS(err, "Error when re-installing flows for Pod", "Pod", klog.KRef(namespace, name))
}
go func(containerID, pod, namespace string) {
Contributor:

before there was a guarantee that the CNI server would not be started until reconciliation was complete (i.e., until this function returned). I assume that since the only thing that happens asynchronously is installation of Pod flows, a race condition should not be possible?

Comment on lines +456 to +460
containerConfig, exists := pc.ifaceStore.GetContainerInterface(containerID)
if !exists {
klog.InfoS("The container interface had been deleted, skip installing flows for Pod", "Pod", klog.KRef(namespace, name), "containerID", containerID)
return
}
Contributor:

related to my comment above, I assume this will take care of the case where a CNI DEL comes in before we have a chance to do reconciliation for this Pod?

Contributor:

If Pod IP or ofPort is reused, there can be an issue too.

antoninbas (Contributor):

for ofport: wouldn't this require a situation where 1) the port has been released (deleted from OVSDB), yet 2) the interface configuration is still present in the ifaceStore? I don't know if that's possible. That would require a CNI DEL with a successful pc.ovsBridgeClient.DeletePort and a failed pc.ifaceStore.DeleteInterface. I don't think it's possible for pc.ifaceStore.DeleteInterface to fail.

for the IP address: I think it's theoretically possible. The host-ipam release happens before OVS port deletion and pc.ifaceStore.DeleteInterface. So we could have this:

  1. A CNI DEL comes in for the Pod
  2. IPAM deletion runs successfully, IP is released
  3. OVS port deletion fails, interface is not deleted from ifaceStore
  4. this logic runs, we call InstallPodFlows for a Pod that has been deleted already
  5. A CNI ADD comes in for a new Pod, and the same IP address is allocated (unlikely with host-ipam, with IPs being assigned sequentially, but maybe this is the only IP available). Can the flows installed in 4) disrupt networking for the new Pod?
  6. A new CNI DEL comes in for the old Pod (as per the CNI spec, given that the previous one did not succeed).

I guess this situation does not arise when reconciliation runs to completion before the CNI server starts. What do you think @tnqn ?

tnqn (Member Author):

related to my comment above, I assume this will take care of the case where a CNI DEL comes in before we have a chance to do reconciliation for this Pod?

Yes, the lock and the check are there to prevent a race with the CNI DEL handler.

If Pod IP or ofPort is reused, there can be an issue too.

ofport reuse is not a problem as @antoninbas explained.

The situation @antoninbas described for Pod IP reuse is possible; however, it is not specific to this change. Even when processing CNI events, a similar situation can happen:

  1. A CNI DEL comes in for the Pod
  2. IPAM deletion succeeds, released the IP
  3. OVS flow deletion fails
  4. A CNI ADD comes in for a new Pod and the same IP is reused. There is no problem at this point, as it should override the IP's L3 flow.
  5. A new CNI DEL comes in for the old Pod, removing some flows installed by 4. This will cause a problem because it could remove the IP's L3 flow.

To avoid it, perhaps we could use a Pod-specific cookie ID, so that removing the flows of Pod A cannot affect Pod B even if they share the same IP.
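For illustration only, a hypothetical way such a Pod-specific cookie could be derived (the hashing scheme and the ovs-ofctl invocation below are assumptions, not Antrea's actual cookie allocator):

package main

import "hash/fnv"

// podFlowCookie derives a stable per-Pod cookie from the Pod UID, so that
// deleting Pod A's flows by exact cookie match cannot remove flows
// installed later for Pod B, even if both Pods reference the same reused IP.
func podFlowCookie(podUID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(podUID))
	return h.Sum64()
}

// Deletion would then match the cookie with a full mask, e.g.:
//   ovs-ofctl del-flows br-int "cookie=0xdeadbeef/-1"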

Contributor:

If this situation can already occur before that change, then there is no need to address it in this PR.

To avoid it, perhaps we could use a Pod-specific cookie ID, so that removing the flows of Pod A cannot affect Pod B even if they share the same IP.

That sounds fine.

BTW, would it help to just release the IP address last in the delete sequence, or do we expose ourselves to other problems in this case (e.g., easier to run out of IPs when we have a lot of Pods on a Node, > 200, and a lot of churn)?

Contributor:

OK. Agreed, we don't need to resolve all existing issues in this PR.

tnqn (Member Author):

BTW, would it help to just release the IP address last in the delete sequence, or do we expose ourselves to other problems in this case (e.g., easier to run out of IPs when we have a lot of Pods on a Node, > 200, and a lot of churn)?

Great suggestion. I think it's better: OVS flows reference Pod IPs, and deleting the references before deleting the referenced resource makes more sense and avoids the above issue.

Even if removing the OVS resources succeeds and releasing the Pod IP fails, the IP and the ofport have been decoupled, so there is no problem reusing the ofport with another Pod IP. Eventually, the container runtime should keep calling CNI to release the IP, or the startup cleanup will do it if the container runtime misbehaves.

I could create another PR for this improvement.

tnqn (Member Author):

#5788 implements the suggestion.
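A hedged sketch of the reordered DEL sequence proposed above, with illustrative function names (see #5788 for the actual change):

package main

import "fmt"

// podDeleter groups the steps of a CNI DEL; each field is an illustrative
// stand-in for the corresponding Antrea operation.
type podDeleter struct {
	uninstallPodFlows func(containerID string) error
	deleteOVSPort     func(containerID string) error
	deleteIfaceConfig func(containerID string) error
	releaseIP         func(containerID string) error
}

// handleCNIDel removes everything that references the Pod IP first and
// releases the IP itself last, so a partial failure can no longer leave a
// released IP behind live flows.
func (d *podDeleter) handleCNIDel(containerID string) error {
	// 1. Delete the OVS flows referencing the Pod IP and ofport.
	if err := d.uninstallPodFlows(containerID); err != nil {
		return fmt.Errorf("uninstalling Pod flows: %w", err)
	}
	// 2. Delete the OVS port; the ofport and the IP are now decoupled, so
	//    reusing the ofport with another Pod IP is safe.
	if err := d.deleteOVSPort(containerID); err != nil {
		return fmt.Errorf("deleting OVS port: %w", err)
	}
	// 3. Remove the interface configuration from the ifaceStore.
	if err := d.deleteIfaceConfig(containerID); err != nil {
		return fmt.Errorf("deleting interface config: %w", err)
	}
	// 4. Release the Pod IP last. If this fails, the container runtime
	//    retries CNI DEL, or the startup cleanup reclaims the IP.
	return d.releaseIP(containerID)
}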

); err != nil {
klog.ErrorS(err, "Error when re-installing flows for Pod", "Pod", klog.KRef(namespace, name))
}
}(containerConfig.ContainerID, name, namespace)
Contributor:

Do you think having several goroutines instead of a single goroutine taking care of the Pods can help reduce initialization time if we have 100+ Pods on the Node? Or would that be insignificant? I am just wondering if this is the reason why you went this route.

tnqn (Member Author):

It's just for simplicity, given that goroutines are lightweight; otherwise it would need to create another slice and iterate over it.

for i := 0; i < tt.done; i++ {
	g.Done()
}
err := g.WaitWithTimeout(100 * time.Millisecond)
Contributor:

I know I always bring it up :/, but I think it's better for everything time-based in unit tests if we use a virtual clock from the get-go (with the addition of a newGroupWithClock function). I have updated too many unit tests which would fail from time to time on Windows CI runners.

tnqn (Member Author):

The unit tests in this file are deterministic, but only because I didn't write some corner cases. I will rewrite them with a virtual clock and add some cases.
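A sketch of what a deterministic, virtual-clock test could look like. It assumes Group gains a clock field (used by WaitWithTimeout via clock.After instead of time.After) and a newGroupWithClock constructor, as suggested above; those names and the use of k8s.io/utils/clock are assumptions.

package wait

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	clocktesting "k8s.io/utils/clock/testing"
)

func TestWaitWithTimeoutExpires(t *testing.T) {
	fakeClock := clocktesting.NewFakeClock(time.Now())
	g := newGroupWithClock(fakeClock) // assumed constructor
	g.Increment()                     // one precondition that is never satisfied

	errCh := make(chan error, 1)
	go func() {
		errCh <- g.WaitWithTimeout(100 * time.Millisecond)
	}()
	// Wait until WaitWithTimeout has registered its timer with the fake
	// clock, then advance virtual time past the timeout. No real sleeping
	// is involved, so the test cannot flake on slow CI runners.
	require.Eventually(t, fakeClock.HasWaiters, time.Second, 10*time.Millisecond)
	fakeClock.Step(200 * time.Millisecond)
	require.Error(t, <-errCh)
}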

@@ -413,7 +414,7 @@ func parsePrevResult(conf *types.NetworkConfig) error {
return nil
}

-func (pc *podConfigurator) reconcile(pods []corev1.Pod, containerAccess *containerAccessArbitrator) error {
+func (pc *podConfigurator) reconcile(pods []corev1.Pod, containerAccess *containerAccessArbitrator, podNetworkWait *wait.Group) error {
Contributor:

Could you remind me when we flush existing flows in OVS after agent restart? Is that before or after this func?

tnqn (Member Author):

They are asynchronous, likely after this func in practice:

antrea/pkg/agent/agent.go

Lines 568 to 580 in 4ba9451

// Delete stale flows from previous round. We need to wait long enough to ensure
// that all the flow which are still required have received an updated cookie (with
// the new round number), otherwise we would disrupt the dataplane. Unfortunately,
// the time required for convergence may be large and there is no simple way to
// determine when is a right time to perform the cleanup task.
// TODO: introduce a deterministic mechanism through which the different entities
// responsible for installing flows can notify the agent that this deletion
// operation can take place. A waitGroup can be created here and notified when
// full sync in agent networkpolicy controller is complete. This would signal NP
// flows have been synced once. Other mechanisms are still needed for node flows
// fullSync check.
time.Sleep(10 * time.Second)
klog.Info("Deleting stale flows from previous round if any")

We may synchronize them using another WaitGroup, but I feel keeping existing flows until new flows are installed doesn't work as expected anyway, because the agent has already removed all groups and meters upon initialization.
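For instance, a hedged sketch of what that synchronization could look like on the agent side (names are illustrative; this is not the agent's current code, which still uses the fixed sleep above):

package agent

import (
	"time"

	"k8s.io/klog/v2"

	"antrea.io/antrea/pkg/util/wait"
)

// deleteStaleFlowsWhenReady waits for the components that install initial
// flows to signal the group, with a timeout as a fallback so that a stuck
// component cannot block cleanup forever, then deletes flows left over
// from the previous round.
func deleteStaleFlowsWhenReady(flowRestoreWait *wait.Group, deleteStaleFlows func()) {
	if err := flowRestoreWait.WaitWithTimeout(10 * time.Second); err != nil {
		klog.ErrorS(err, "Timed out waiting for initial flow installation, deleting stale flows anyway")
	}
	klog.Info("Deleting stale flows from previous round if any")
	deleteStaleFlows()
}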

tnqn (Member Author) commented Dec 11, 2023

I was initially wondering if we should just not start the CNI server until the wait is over (wait -> reconcile fully -> start CNI server), as I think it would make the code simpler. However, I also think that the current approach may be better from a UX and troubleshooting perspective (if there is a bug and we are waiting indefinitely, at least we can handle CNI ADDs and log error messages, instead of having the container runtime fail to connect to the server).

There was a problem where, if we didn't handle CNI events (especially the DEL ones) before some initializations finished, the system would never reconcile due to a circular dependency. When iptables is very slow:

  1. The agent can't acquire the iptables lock, blocking the startup of the CNI server.
  2. The container runtime calls CNI DEL, spending a lot of time in portmap (which also needs the iptables lock), but always fails on the primary CNI deletion due to 1. Thus it keeps calling portmap and acquiring the iptables lock.

Although this case won't happen again with an improvement in portmap, we also made enhancements on the Antrea side: the agent always processes CNI DEL, as it has no dependency on these flow/network initializations, and waits for a while instead of failing immediately. #1497 has more details.

It's less risky to keep the original behavior.

antoninbas previously approved these changes Dec 11, 2023
pkg/util/wait/wait_test.go (resolved)
}
g.Done()
select {
case <-time.After(100 * time.Millisecond):
Contributor:

please increase this duration to 500ms to avoid flakes

tnqn (Member Author):

done

tnqn (Member Author) commented Dec 12, 2023

/test-all

tnqn merged commit 4134ee5 into antrea-io:main on Dec 13, 2023
46 of 53 checks passed
tnqn deleted the ensure-netpol-realization branch on December 13, 2023 01:50
tnqn added the kind/bug label (Categorizes issue or PR as related to a bug.) on Jan 23, 2024
antoninbas added a commit that referenced this pull request Jun 3, 2024
Until a set of "essential" flows has been installed. At the moment, we
include NetworkPolicy flows (using podNetworkWait as the signal), Pod
forwarding flows (reconciled by the CNIServer), and Node routing flows
(installed by the NodeRouteController). This set can be extended in the
future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced
previously in #5777. It simplifies unit testing, and we can achieve some
symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous
rounds. However, at the moment this is incomplete, as we don't wait for
all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile"
operation (like the CNIServer) to install flows for the initial Node
list, we instead rely on a different mechanism provided by upstream K8s
for controllers. When registering event handlers, we can request that the
ADD handler include a boolean flag indicating whether the object is
part of the initial list retrieved by the informer. Using this
mechanism, we can reliably signal through flowRestoreCompleteWait when
this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency
on the proxy (kube-proxy or AntreaProxy) to access the Antrea
Controller. Prior to #6361, there would have been a circular dependency
in the case where kube-proxy was removed: flow-restore-wait would not be
removed until the Pod network was "ready", which would not happen until
the NetworkPolicy controller had started its watchers, and that depends
on Antrea Service reachability, which depends on flow-restore-wait being
removed.

Fixes #6338

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Labels
action/backport: Indicates a PR that requires backports.
action/release-note: Indicates a PR that should be included in release notes.
kind/bug: Categorizes issue or PR as related to a bug.

3 participants