Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store NetworkPolicy in filesystem as fallback data source #5739

Merged
merged 1 commit into from
Nov 29, 2023

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Nov 21, 2023

In the previous implementation, traffic from/to a Pod may bypass NetworkPolicies applied to the Pod in a time window when the agent restarts because realizing NetworkPolicies and enabling forwarding are asynchronous.

This patch stores NetworkPolicy data in files when they are received, and makes antre-agent fallback to use the files as data source if it can't connect to antrea-controller on startup. This prevents security regression: a NetworkPolicy that has been realized on a Node will continue to work even if antrea-controller is not available after antrea-agent restarts.

The benchmark results of the storage's operations are as below:

BenchmarkFileStoreAddNetworkPolicy-40              70383             16102 ns/op             520 B/op          9 allocs/op
BenchmarkFileStoreAddAppliedToGroup-40             45382             25880 ns/op            3019 B/op          9 allocs/op
BenchmarkFileStoreAddAddressGroup-40                7400            180000 ns/op           49538 B/op          9 allocs/op
BenchmarkFileStoreReplaceAll-40                       10         127088004 ns/op        17815943 B/op      33099 allocs/op

The disk usage when storing 1k NetworkPolicies, AddressGroups, and AppliedToGroups created by BenchmarkFileStoreReplaceAll is as below:

16M     /var/run/antrea-test/file-store/address-groups
4.0M    /var/run/antrea-test/file-store/applied-to-groups
4.0M    /var/run/antrea-test/file-store/network-policies

Note that the patch doesn't synchronize NetworkPolicy initial sync and Pod flow installation, so NetworkPolicy breach could still happen on agent restart. I plan to address it via a separate PR to make each PR more focused.

@tnqn tnqn force-pushed the fallback-to-local-netpol branch 3 times, most recently from 168b9f5 to df20b41 Compare November 22, 2023 04:41
@tnqn tnqn marked this pull request as ready for review November 22, 2023 04:47
@tnqn tnqn force-pushed the fallback-to-local-netpol branch 2 times, most recently from 454cd19 to 287e173 Compare November 22, 2023 06:00
@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/network-policy Issues or PRs related to network policies. labels Nov 22, 2023
@jianjuns
Copy link
Contributor

I assume this is for new connections, as OVS kernel should be able to handle existing connections with cached state?

klog.ErrorS(err2, "Failed to decode data from file, ignore it", "file", path)
return nil
}
// Note: we haven't stored a different version so far but version conversion should be performed when the used
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note or todo in the apis/controlplane package then? But I'm also wondering if we make sure that the appliedToGroup etc. structs are backward compatible (we kind of have to at this point since they're v1b2 already), maybe a version conversion is not strictly needed, except for the downgrade case where the agent restarts with a lower version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this could be overlooked but I feel no much difference where the note is added. To ensure backwards compatability, I added an unit test case, "compatible with v1beta2": If a PR changes the used version but doesn't take care of converting the stored version to the used one, the test would fail.

There is no extra requirement on the conversion between the storage version and the used version compared with what we do for API versions. As long as antrea-controller can talk with old agent using N-1 API and new agent using N API, the new agent can also get N-1 version data from files and convert them to N version.

Whether it really needs to do conversion will depend on how the API evolves, the PR just ensures the version information is stored so we know how to convert them when required.

The cost of conversion wouldn't be a problem. In the worst case each agent just does the conversion once and only when the API version changes and the controller happens to be unavailable.

pkg/agent/controller/networkpolicy/filestore_test.go Outdated Show resolved Hide resolved
@tnqn
Copy link
Member Author

tnqn commented Nov 23, 2023

I assume this is for new connections, as OVS kernel should be able to handle existing connections with cached state?

Yes, without the PR, every time an agent restarts, before it connects to antrea-controller successfully, new connections would always be allowed regardless of what the policies define, even when the policy had been realized on this Node before.

@tnqn tnqn force-pushed the fallback-to-local-netpol branch 2 times, most recently from d10f305 to 70d3f38 Compare November 23, 2023 08:24
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only minor comments

pkg/agent/controller/networkpolicy/cache.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/cache.go Outdated Show resolved Hide resolved
pkg/agent/controller/networkpolicy/filestore.go Outdated Show resolved Hide resolved
initFileStore: func(networkPolicyStore, appliedToGroupStore, addressGroupStore *fileStore) {
// The bytes of v1beta2 objects serialized in protobuf.
// They are not supposed to be updated when bumping up the used version.
base64EncodedPolicy := "azhzAAovCh5jb250cm9scGxhbmUuYW50cmVhLmlvL3YxYmV0YTISDU5ldHdvcmtQb2xpY3kSdAoYCgR1aWQxEgAaACIAKgR1aWQxMgA4AEIAEh8KAkluEg8KDWFkZHJlc3NHcm91cDEaACgAOABKAFoAGg9hcHBsaWVkVG9Hcm91cDEyJgoQSzhzTmV0d29ya1BvbGljeRIDbnMxGgdwb2xpY3kxIgR1aWQxGgAiAA=="
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of hardcoding the test strings like this. Would it be possible to generate the encoding on the fly from an actual v1beta2 object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's made so because when agent switches to next version in the future, all v1beta2 reference will likely be updated to the next version directly, then the code supposed to generate v1beta2 object will generate an object of new version, which makes the test not different from "same storage version" test.

// Rule ID is a hash value, we don't care about its exact value.
actualRule.ID = ""
assert.Equal(t, tt.expectedRule, actualRule)
case <-time.After(time.Millisecond * 100):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we run these unit tests on Windows, let's use a much larger timeout. After all, unless there is a bug somewhere, the test will complete fast regardless of the timeout value we use here. So let's use 1s or even 2s here to avoid any possibility of flakes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

assert.Equal(t, v1beta2.NewGroupMemberSet(atgMember2), actualRule.TargetMembers)
assert.Equal(t, v1beta2.NewGroupMemberSet(agMember2), actualRule.FromAddresses)
assert.Equal(t, policy2.SourceRef, actualRule.SourceRef)
case <-time.After(time.Millisecond * 100):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Comment on lines +768 to +771
scale, err := data.clientset.AppsV1().Deployments(antreaNamespace).GetScale(context.TODO(), antreaDeployment, metav1.GetOptions{})
require.NoError(t, err)
scale.Spec.Replicas = replicas
_, err = data.clientset.AppsV1().Deployments(antreaNamespace).UpdateScale(context.TODO(), antreaDeployment, scale, metav1.UpdateOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not introduced by you, but antreaDeployment should really be called antreaControllerDeployment...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I try to update it, I found there are also antreaDaemonSet, antreaWindowsDaemonSet. To avoid controversy in this PR, we can figure out proper names in a separate one.

test/e2e/networkpolicy_test.go Outdated Show resolved Hide resolved
In the previous implementation, traffic from/to a Pod may bypass
NetworkPolicies applied to the Pod in a time window when the agent
restarts because realizing NetworkPolicies and enabling forwarding are
asynchronous.

This patch stores NetworkPolicy data in files when they are received,
and makes antre-agent fallback to use the files as data source if it
can't connect to antrea-controller on startup. This prevents security
regression: a NetworkPolicy that has been realized on a Node will
continue to work even if antrea-controller is not available after
antrea-agent restarts.

The benchmark results of the storage's operations are as below:

BenchmarkFileStoreAddNetworkPolicy-40              70383             16102 ns/op             520 B/op          9 allocs/op
BenchmarkFileStoreAddAppliedToGroup-40             45382             25880 ns/op            3019 B/op          9 allocs/op
BenchmarkFileStoreAddAddressGroup-40                7400            180000 ns/op           49538 B/op          9 allocs/op
BenchmarkFileStoreReplaceAll-40                       10         127088004 ns/op        17815943 B/op      33099 allocs/op

The disk usage when storing 1k NetworkPolicies, AddressGroups, and
AppliedToGroups created by BenchmarkFileStoreReplaceAll is as below:

16M     /var/run/antrea-test/file-store/address-groups
4.0M    /var/run/antrea-test/file-store/applied-to-groups
4.0M    /var/run/antrea-test/file-store/network-policies

Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn
Copy link
Member Author

tnqn commented Nov 28, 2023

/test-all

Copy link
Contributor

@Dyanngg Dyanngg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/network-policy Issues or PRs related to network policies. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants