
Delay removal of flow-restore-wait #6342

Conversation

@antoninbas antoninbas commented May 18, 2024

Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced previously in #5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait.

We can also start leveraging this new wait group
(flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanism provided by upstream K8s for controllers. When registering event handlers, we can request that the ADD handler include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency on the proxy (kube-proxy or AntreaProxy) to access the Antrea Controller. Prior to #6361, there would have been a circular dependency in the case where kube-proxy was removed: flow-restore-wait would not be removed until the Pod network was "ready", which would not happen until the NetworkPolicy controller had started its watchers, and that depended on antrea Service reachability, which in turn depended on flow-restore-wait being removed.

Fixes #6338

@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 3d6baba to 7ea44b5 Compare May 18, 2024 02:15
tnqn previously approved these changes May 20, 2024

@tnqn tnqn left a comment

LGTM

@@ -80,6 +83,12 @@ type Controller struct {
// or not when IPsec is enabled with "cert" mode. The NodeRouteController must wait for the certificate
// to be configured before installing routes/flows to peer Nodes to prevent unencrypted traffic across Nodes.
ipsecCertificateManager ipseccertificate.Manager
// flowRestoreCompleteWait is to be decremented after installing flows for initial Nodes.
flowRestoreCompleteWait *utilwait.Group
// hasProcessedInitialList keeps track of whether the initial informer list as been

Suggested change
// hasProcessedInitialList keeps track of whether the initial informer list as been
// hasProcessedInitialList keeps track of whether the initial informer list has been

@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 7ea44b5 to 5b169d1 Compare May 20, 2024 21:42
@antoninbas

/test-flexible-ipam-e2e
/test-vm-e2e

@@ -844,6 +847,21 @@ func run(o *Options) error {
go mcStrechedNetworkPolicyController.Run(stopCh)
}

klog.InfoS("Waiting for flow restoration to complete")
Contributor Author

@tnqn I wasn't sure what would be the best place for this code. I put it here because I think by that point all the controllers that need to install flows have been instantiated, initialized and "started" (even though at the moment, flowRestoreCompleteWait only waits for the NetworkPolicyController + NodeRouteController + CNIServer).

Member

It makes sense to me. I tried to delay FlowRestoreComplete but was unsure whether moving ConnectUplinkToOVSBridge has any impact. cc @gran-vmv

Contributor

We can run flexible-ipam-e2e multiple times to check if this change can work.

@antoninbas

/test-flexible-ipam-e2e

tnqn previously approved these changes May 21, 2024

@tnqn tnqn left a comment

LGTM

@antoninbas

/test-flexible-ipam-e2e

1 similar comment
@antoninbas

/test-flexible-ipam-e2e

@antoninbas

I got the following errors for test-flexible-ipam-e2e:

    --- FAIL: TestAntreaIPAM/testAntreaIPAMStatefulSetDedicated (19.02s)
    --- FAIL: TestAntreaIPAM/testAntreaIPAMStatefulSetShared (19.03s)

However, I have also observed the same test failures for another PR unrelated to this one (#6321)

@antoninbas

/test-all

@antoninbas

E2e tests on a Kind cluster on Linux with non default values and E2e tests on a Kind cluster on Linux with all features enabled are failing consistently, with a lot of individual unit test failures. I am investigating. There may be some "timing" assumption I am breaking with this change. I'm trying to reproduce locally. Best guess would be something to do with ProxyAll...

@antoninbas


So the issue seems to be that until flow-restore-wait is removed, the Agent may not be able to reach the antrea Service when proxyAll is enabled (we remove kube-proxy in that case in the CI tests), but with this change, flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NP controller has started its watchers, and that depends on antrea Service reachability.

So there is a chicken and egg problem here.
@tnqn what's your take on this? I don't have a great solution, even though I could always add a timeout:

flowRestoreCompleteWait.WaitWithTimeout(10 * time.Second)
if err := agentInitializer.FlowRestoreComplete(); err != nil {
        return err
}

Because it blocks, it would still delay a few key operations, such as starting the apiserver, which answers liveness probes.

BTW, I thought that if the watch couldn't succeed within a reasonable amount of time, we would fall back to data stored in files. Do you know the timeout? I didn't see the fallback happen during that 30s time window:

I0522 00:17:47.808376       1 agent.go:789] "Checking if AntreaProxy is ready"
I0522 00:17:47.809998       1 agent.go:794] "AntreaProxy is ready"
I0522 00:17:47.810514       1 agent.go:850] "Waiting for flow restoration to complete"
I0522 00:17:47.812074       1 egress_controller.go:491] Starting AntreaAgentEgressController
I0522 00:17:47.813086       1 shared_informer.go:311] Waiting for caches to sync for AntreaAgentEgressController
I0522 00:17:47.813839       1 networkpolicy_controller.go:599] Waiting for Antrea client to be ready
I0522 00:17:47.814993       1 networkpolicy_controller.go:610] Antrea client is ready
I0522 00:17:47.814993       1 local_ip_detector_linux.go:59] Starting localIPDetector
I0522 00:17:47.815206       1 networkpolicy_controller.go:625] Waiting for all watchers to complete full sync
I0522 00:17:47.815206       1 ip_scheduler.go:219] "Starting Egress IP scheduler"
I0522 00:17:47.816925       1 networkpolicy_controller.go:959] Starting watch for AppliedToGroup
I0522 00:17:47.817019       1 local_ip_detector_linux.go:114] Listed existing IP address: map[10.244.0.1:{} 127.0.0.1:{} 172.18.0.2:{} ::1:{} fc00:f853:ccd:e793::2:{} fe80::42:acff:fe12:2:{}]
I0522 00:17:47.817587       1 networkpolicy_controller.go:959] Starting watch for AddressGroup
I0522 00:17:47.818145       1 networkpolicy_controller.go:959] Starting watch for NetworkPolicy
I0522 00:17:47.915617       1 shared_informer.go:318] Caches are synced for AntreaAgentEgressController
I0522 00:17:47.915906       1 route_linux.go:1176] "Restoring IP routes and rules for Egress"
I0522 00:17:47.917184       1 egress_controller.go:1212] Starting watch for EgressGroup
I0522 00:17:47.918618       1 egress_controller.go:368] "Received service CIDR update"
I0522 00:17:47.934073       1 stream.go:132] "Sent outbound message" dataLength=16
I0522 00:17:47.950976       1 stream.go:132] "Sent outbound message" dataLength=120
I0522 00:17:47.954251       1 stream.go:132] "Sent outbound message" dataLength=16
I0522 00:17:47.956163       1 stream.go:132] "Sent outbound message" dataLength=16
I0522 00:17:50.849714       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:17:55.848094       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:00.851341       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:05.849188       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:10.849493       1 stream.go:132] "Sent outbound message" dataLength=8
I0522 00:18:15.850687       1 stream.go:132] "Sent outbound message" dataLength=8
E0522 00:18:17.408533       1 server.go:446] "Cannot process CmdAdd request for container because network is not ready" err="timeout waiting for group" container="0c2a27d6d23d266ff9c3de1de2719b5ab8cd133ea5d40a73a47cd646cff8b7d9" timeout="30s"
I0522 00:18:17.415218       1 proxier.go:1010] syncProxyRules took 109.011µs
I0522 00:18:17.415480       1 runner.go:220] antrea-agent-proxy: ran, next possible in 1s, periodic in 30s

@tnqn

tnqn commented May 22, 2024

So the issue seems to be that until flow-restore-wait is removed, the Agent may not be able to reach the antrea Service when proxyAll is enabled (we remove kube-proxy in that case in the CI tests), but with this change, flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NP controller has started its watchers, and that depends on antrea Service reachability.

So there is a chicken and egg problem here.

It reminds me that I encountered this problem as well. Since we have a separate antreaClientProvider, do you think we could break the chicken-egg problem by not relying on the proxy, resolving the service to endpoint IP directly via something like https://github.com/kubernetes/kubernetes/blob/11f37d757fc0b710245446c80a8c9578ce2c02f1/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L33?

@antoninbas

It reminds me that I encountered this problem as well. Since we have a separate antreaClientProvider, do you think we could break the chicken-egg problem by not relying on the proxy, resolving the service to endpoint IP directly via something like https://github.com/kubernetes/kubernetes/blob/11f37d757fc0b710245446c80a8c9578ce2c02f1/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L33?

Thanks for the idea @tnqn, I will try it.
It is indeed different from the kubernetes Service case, since in this case we do have access to the K8s API (user has to provide an override) and hence we have access to the list of Endpoints.
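
The resolver idea tnqn links boils down to: skip the Service's virtual IP (which requires working proxy flows) and connect straight to a ready Endpoint address. Below is a stdlib-only sketch of that selection step, with the Endpoints data passed in as plain slices; the real change (#6361) reads Endpoints from the K8s API via client-go, so all names here are illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"strconv"
)

// resolveEndpoint picks one ready Endpoint address at random and returns
// it as "host:port", bypassing Service VIP load balancing entirely. This
// mirrors the approach of the linked kube-aggregator resolver, not
// Antrea's exact implementation.
func resolveEndpoint(readyAddrs []string, port int) (string, error) {
	if len(readyAddrs) == 0 {
		return "", fmt.Errorf("no ready endpoints")
	}
	addr := readyAddrs[rand.Intn(len(readyAddrs))]
	// JoinHostPort also brackets IPv6 addresses correctly.
	return net.JoinHostPort(addr, strconv.Itoa(port)), nil
}

func main() {
	// Hypothetical Endpoint data for the antrea Service.
	url, err := resolveEndpoint([]string{"10.244.0.5"}, 443)
	if err != nil {
		panic(err)
	}
	fmt.Println(url) // prints 10.244.0.5:443
}
```

Because the connection goes to a concrete Pod IP, it works before flow-restore-wait is removed, which is what breaks the cycle.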

@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 5b169d1 to e98ae09 Compare May 31, 2024 04:36
@antoninbas antoninbas added the action/release-note Indicates a PR that should be included in release notes. label May 31, 2024
cmd/antrea-agent/agent.go (review thread resolved)
pkg/agent/controller/noderoute/node_route_controller.go (review thread resolved)
@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from e98ae09 to 665973c Compare May 31, 2024 18:12
@antoninbas antoninbas force-pushed the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch from 665973c to 313eb8f Compare May 31, 2024 18:14
@antoninbas

/test-all
/test-vm-e2e
/test-windows-all
/test-flexible-ipam-e2e

@gran-vmv

gran-vmv commented Jun 3, 2024

/test-flexible-ipam-e2e

1 similar comment
@gran-vmv

gran-vmv commented Jun 3, 2024

/test-flexible-ipam-e2e


@tnqn tnqn left a comment


LGTM

@antoninbas antoninbas merged commit 33f2683 into antrea-io:main Jun 3, 2024
56 of 60 checks passed
@antoninbas antoninbas deleted the delay-removal-of-flow-restore-wait-until-essential-flows-are-installed branch June 3, 2024 17:46
Labels
action/release-note Indicates a PR that should be included in release notes.
Successfully merging this pull request may close these issues.

[Flaky e2e test] TestConnectivity/testOVSRestartSameNode
3 participants