Store NetworkPolicy in filesystem as fallback data source #5739

tnqn · 2023-11-21T12:26:05Z

In the previous implementation, traffic from/to a Pod may bypass NetworkPolicies applied to the Pod in a time window when the agent restarts because realizing NetworkPolicies and enabling forwarding are asynchronous.

This patch stores NetworkPolicy data in files as it is received, and makes antrea-agent fall back to the files as its data source if it can't connect to antrea-controller on startup. This prevents a security regression: a NetworkPolicy that has been realized on a Node will continue to work even if antrea-controller is not available after antrea-agent restarts.

The benchmark results of the storage's operations are as below:

BenchmarkFileStoreAddNetworkPolicy-40              70383             16102 ns/op             520 B/op          9 allocs/op
BenchmarkFileStoreAddAppliedToGroup-40             45382             25880 ns/op            3019 B/op          9 allocs/op
BenchmarkFileStoreAddAddressGroup-40                7400            180000 ns/op           49538 B/op          9 allocs/op
BenchmarkFileStoreReplaceAll-40                       10         127088004 ns/op        17815943 B/op      33099 allocs/op

The disk usage when storing 1k NetworkPolicies, AddressGroups, and AppliedToGroups created by BenchmarkFileStoreReplaceAll is as below:

16M     /var/run/antrea-test/file-store/address-groups
4.0M    /var/run/antrea-test/file-store/applied-to-groups
4.0M    /var/run/antrea-test/file-store/network-policies

Note that the patch doesn't synchronize the NetworkPolicy initial sync with Pod flow installation, so a NetworkPolicy breach could still happen on agent restart. I plan to address that in a separate PR to keep each PR focused.

jianjuns · 2023-11-22T20:11:41Z

I assume this is for new connections, as OVS kernel should be able to handle existing connections with cached state?

Dyanngg · 2023-11-22T23:12:49Z

pkg/agent/controller/networkpolicy/filestore.go

+			klog.ErrorS(err2, "Failed to decode data from file, ignore it", "file", path)
+			return nil
+		}
+		// Note: we haven't stored a different version so far but version conversion should be performed when the used


Should we add a note or todo in the apis/controlplane package then? But I'm also wondering if we make sure that the appliedToGroup etc. structs are backward compatible (we kind of have to at this point since they're v1b2 already), maybe a version conversion is not strictly needed, except for the downgrade case where the agent restarts with a lower version.

I agree this could be overlooked, but I don't think it makes much difference where the note is added. To ensure backward compatibility, I added a unit test case, "compatible with v1beta2": if a PR changes the used version but doesn't convert the stored version to the used one, the test will fail.

There is no extra requirement on the conversion between the storage version and the used version compared with what we do for API versions. As long as antrea-controller can talk with an old agent using the N-1 API and a new agent using the N API, the new agent can also read N-1 version data from files and convert it to version N.

Whether it really needs to do a conversion will depend on how the API evolves; the PR just ensures the version information is stored so we know how to convert the data when required.

The cost of conversion wouldn't be a problem. In the worst case each agent just does the conversion once and only when the API version changes and the controller happens to be unavailable.
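To make the idea of "store the version so we can convert later" concrete, here is a small sketch. It is not how the PR serializes objects: JSON is swapped in for brevity (the test in this PR refers to protobuf-serialized objects), and the version names, the `envelope` type, and the `policyV1`/`policyV2` types are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical version names, for illustration only.
const (
	oldVersion  = "example/v1"
	usedVersion = "example/v2"
)

// envelope records which version the payload was written with.
type envelope struct {
	APIVersion string          `json:"apiVersion"`
	Data       json.RawMessage `json:"data"`
}

// policyV1 and policyV2 are made-up types: v2 renames Name to DisplayName.
type policyV1 struct {
	UID  string `json:"uid"`
	Name string `json:"name"`
}

type policyV2 struct {
	UID         string `json:"uid"`
	DisplayName string `json:"displayName"`
}

// encode wraps obj in an envelope tagged with the given version.
func encode(version string, obj interface{}) ([]byte, error) {
	data, err := json.Marshal(obj)
	if err != nil {
		return nil, err
	}
	return json.Marshal(envelope{APIVersion: version, Data: data})
}

// decode returns the object in the used version, converting only when
// the stored version differs from it.
func decode(raw []byte) (policyV2, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return policyV2{}, err
	}
	switch env.APIVersion {
	case usedVersion:
		var p policyV2
		err := json.Unmarshal(env.Data, &p)
		return p, err
	case oldVersion:
		var old policyV1
		if err := json.Unmarshal(env.Data, &old); err != nil {
			return policyV2{}, err
		}
		return policyV2{UID: old.UID, DisplayName: old.Name}, nil
	default:
		return policyV2{}, fmt.Errorf("unsupported stored version %q", env.APIVersion)
	}
}

func main() {
	// A file written by an older agent is still readable after an upgrade.
	raw, err := encode(oldVersion, policyV1{UID: "uid1", Name: "policy1"})
	if err != nil {
		panic(err)
	}
	p, err := decode(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```

Because the version travels with the data, the conversion branch runs only for files written by an older version, which matches the "at most once per agent, and only when the controller is unavailable" cost argument above.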

pkg/agent/controller/networkpolicy/filestore_test.go

tnqn · 2023-11-23T07:47:46Z

I assume this is for new connections, as OVS kernel should be able to handle existing connections with cached state?

Yes, without the PR, every time an agent restarts, before it connects to antrea-controller successfully, new connections would always be allowed regardless of what the policies define, even when the policy had been realized on this Node before.

antoninbas

LGTM, only minor comments

pkg/agent/controller/networkpolicy/cache.go

pkg/agent/controller/networkpolicy/filestore.go

pkg/agent/controller/networkpolicy/networkpolicy_controller.go

antoninbas · 2023-11-27T21:12:52Z

pkg/agent/controller/networkpolicy/networkpolicy_controller_test.go

+			initFileStore: func(networkPolicyStore, appliedToGroupStore, addressGroupStore *fileStore) {
+				// The bytes of v1beta2 objects serialized in protobuf.
+				// They are not supposed to be updated when bumping up the used version.
+				base64EncodedPolicy := "azhzAAovCh5jb250cm9scGxhbmUuYW50cmVhLmlvL3YxYmV0YTISDU5ldHdvcmtQb2xpY3kSdAoYCgR1aWQxEgAaACIAKgR1aWQxMgA4AEIAEh8KAkluEg8KDWFkZHJlc3NHcm91cDEaACgAOABKAFoAGg9hcHBsaWVkVG9Hcm91cDEyJgoQSzhzTmV0d29ya1BvbGljeRIDbnMxGgdwb2xpY3kxIgR1aWQxGgAiAA=="


I am not a big fan of hardcoding the test strings like this. Would it be possible to generate the encoding on the fly from an actual v1beta2 object?

This is deliberate: when the agent switches to the next version in the future, all v1beta2 references will likely be updated to the next version directly. The code meant to generate a v1beta2 object would then generate an object of the new version, which would make this test no different from the "same storage version" test.
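For readers who want to see what the frozen fixture contains, the sketch below decodes the base64 string quoted in the test above using only the standard library. The helper name `describe` is made up for this example; the check relies on the `k8s\x00` prefix that Kubernetes places in front of protobuf-serialized objects.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// fixture is the base64 string from the test above, copied verbatim.
const fixture = "azhzAAovCh5jb250cm9scGxhbmUuYW50cmVhLmlvL3YxYmV0YTISDU5ldHdvcmtQb2xpY3kSdAoYCgR1aWQxEgAaACIAKgR1aWQxMgA4AEIAEh8KAkluEg8KDWFkZHJlc3NHcm91cDEaACgAOABKAFoAGg9hcHBsaWVkVG9Hcm91cDEyJgoQSzhzTmV0d29ya1BvbGljeRIDbnMxGgdwb2xpY3kxIgR1aWQxGgAiAA=="

// describe decodes the fixture and reports whether it starts with the
// Kubernetes protobuf prefix and embeds the v1beta2 group/version string.
func describe(encoded string) (hasMagic, hasVersion bool, err error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return false, false, err
	}
	hasMagic = bytes.HasPrefix(raw, []byte("k8s\x00"))
	hasVersion = bytes.Contains(raw, []byte("controlplane.antrea.io/v1beta2"))
	return hasMagic, hasVersion, nil
}

func main() {
	hasMagic, hasVersion, err := describe(fixture)
	if err != nil {
		panic(err)
	}
	fmt.Println("protobuf prefix present:", hasMagic)
	fmt.Println("v1beta2 version string present:", hasVersion)
}
```

Since the version string is baked into the bytes, a mechanical rename of v1beta2 identifiers in the Go code cannot silently change what this fixture represents.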

antoninbas · 2023-11-27T21:15:30Z

pkg/agent/controller/networkpolicy/networkpolicy_controller_test.go

+				// Rule ID is a hash value, we don't care about its exact value.
+				actualRule.ID = ""
+				assert.Equal(t, tt.expectedRule, actualRule)
+			case <-time.After(time.Millisecond * 100):


If we run these unit tests on Windows, let's use a much larger timeout. After all, unless there is a bug somewhere, the test will complete fast regardless of the timeout value we use here. So let's use 1s or even 2s here to avoid any possibility of flakes.

antoninbas · 2023-11-27T21:15:46Z

pkg/agent/controller/networkpolicy/networkpolicy_controller_test.go

+		assert.Equal(t, v1beta2.NewGroupMemberSet(atgMember2), actualRule.TargetMembers)
+		assert.Equal(t, v1beta2.NewGroupMemberSet(agMember2), actualRule.FromAddresses)
+		assert.Equal(t, policy2.SourceRef, actualRule.SourceRef)
+	case <-time.After(time.Millisecond * 100):


same comment as above

antoninbas · 2023-11-27T21:16:55Z

test/e2e/networkpolicy_test.go

+		scale, err := data.clientset.AppsV1().Deployments(antreaNamespace).GetScale(context.TODO(), antreaDeployment, metav1.GetOptions{})
+		require.NoError(t, err)
+		scale.Spec.Replicas = replicas
+		_, err = data.clientset.AppsV1().Deployments(antreaNamespace).UpdateScale(context.TODO(), antreaDeployment, scale, metav1.UpdateOptions{})


not introduced by you, but antreaDeployment should really be called antreaControllerDeployment...

While trying to update it, I found there are also antreaDaemonSet and antreaWindowsDaemonSet. To avoid controversy in this PR, we can figure out proper names in a separate one.

test/e2e/networkpolicy_test.go

Signed-off-by: Quan Tian <qtian@vmware.com>

tnqn · 2023-11-28T16:26:55Z

/test-all

Dyanngg

LGTM

tnqn force-pushed the fallback-to-local-netpol branch 3 times, most recently from 168b9f5 to df20b41 Compare November 22, 2023 04:41

tnqn requested review from antoninbas, Dyanngg, GraysonWu and jianjuns November 22, 2023 04:47

tnqn marked this pull request as ready for review November 22, 2023 04:47

tnqn force-pushed the fallback-to-local-netpol branch 2 times, most recently from 454cd19 to 287e173 Compare November 22, 2023 06:00

tnqn added the action/backport, action/release-note, and area/network-policy labels Nov 22, 2023

Dyanngg reviewed Nov 22, 2023


tnqn force-pushed the fallback-to-local-netpol branch 2 times, most recently from d10f305 to 70d3f38 Compare November 23, 2023 08:24

antoninbas reviewed Nov 27, 2023


tnqn force-pushed the fallback-to-local-netpol branch from 70d3f38 to f981eac Compare November 28, 2023 16:25

antoninbas approved these changes Nov 28, 2023


Dyanngg approved these changes Nov 28, 2023


tnqn merged commit f9fc979 into antrea-io:main Nov 29, 2023
47 of 53 checks passed

tnqn deleted the fallback-to-local-netpol branch November 29, 2023 06:15

tnqn mentioned this pull request Nov 30, 2023

Block a Pod's IP packets until its NetworkPolicies are realized #5698

Closed

tnqn added the kind/bug label Jan 23, 2024
