New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.13 Backports 2023-10-11 #28519
v1.13 Backports 2023-10-11 #28519
Conversation
[ upstream commit 0e5bc15 ] [ Backporter's notes: Removed references to kvstoremesh as it's not part of v1.13 ] This commit introduces a new GitHub actions workflow which validates the Cilium upgrade and downgrade paths when clustermesh is enabled, ensuring that one minor version skew is correctly supported. More specifically, the test initially deploys a mixed version cluster mesh, with one cluster running Cilium from the tip of the previous stable branch, and the other the tip of the current stable branch. At this point, the first cluster is upgraded to the tip of the current stable branch, enabling kvstoremesh at the same time. Finally, the cluster is downgraded again to the tip of the previous stable branch, disabling kvstoremesh. We do not enable the connectivity disruption checks in this workflow, because Cilium versions older than 1.14 are currently affected by known issues causing possible brief connectivity distruption of cross-cluster traffic when the agents get restarted. To reduce the total amount of time required by this workflow, only a limited subset of tests is enabled, while the full suite is run in the conformance-clustermesh workflow. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3de85a1 ] We modified the UpdateStatus function to ensure that the CNP object is deep-copied before passing it as an argument. This change was necessary because the UpdateStatus function was modifying the CNP object, specifically clearing the LastAppliedConfiguration key from the annotations map. By deep-copying the CNP object, we ensure that the original object remains unmodified which fixes the following race condition: ``` Write at 0x00c002a98510 by goroutine 119834: runtime.mapassign_faststr() /usr/local/go/src/runtime/map_faststr.go:203 +0x0 github.com/cilium/cilium/pkg/k8s.(*CNPStatusUpdateContext).updateViaAPIServer.func1() ./pkg/k8s/cnp.go:215 +0x53 runtime.deferreturn() /usr/local/go/src/runtime/panic.go:477 +0x30 github.com/cilium/cilium/pkg/k8s.(*CNPStatusUpdateContext).updateStatus() ./pkg/k8s/cnp.go:78 +0x2c7 github.com/cilium/cilium/pkg/k8s.(*CNPStatusUpdateContext).UpdateStatus() ./pkg/k8s/cnp.go:146 +0x786 github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addCiliumNetworkPolicyV2.func1() ./pkg/k8s/watchers/cilium_network_policy.go:352 +0x86 github.com/cilium/cilium/pkg/controller.(*controller).runController() ./pkg/controller/controller.go:251 +0x171 github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked.func1() ./pkg/controller/manager.go:111 +0xa4 Previous read at 0x00c002a98510 by goroutine 1205: runtime.mapiterinit() /usr/local/go/src/runtime/map.go:816 +0x0 github.com/cilium/cilium/pkg/comparator.MapStringEqualsIgnoreKeys() ./pkg/comparator/comparator.go:82 +0xb1 github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.objectMetaDeepEqual() ./pkg/k8s/apis/cilium.io/v2/cnp_types.go:65 +0xb0 github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.(*CiliumNetworkPolicy).DeepEqual() ./pkg/k8s/apis/cilium.io/v2/cnp_types.go:54 +0x177 github.com/cilium/cilium/pkg/k8s/types.(*SlimCNP).DeepEqual() ./pkg/k8s/types/zz_generated.deepequal.go:82 +0xbd github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).onUpsert() ./pkg/k8s/watchers/cilium_network_policy.go:238 +0x170 github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).ciliumNetworkPoliciesInit.func1() ./pkg/k8s/watchers/cilium_network_policy.go:175 +0xc64 Goroutine 119834 (running) created at: github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked() ./pkg/controller/manager.go:111 +0x757 github.com/cilium/cilium/pkg/controller.(*Manager).updateController() ./pkg/controller/manager.go:84 +0x44f github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController() ./pkg/controller/manager.go:52 +0xe6f github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addCiliumNetworkPolicyV2() ./pkg/k8s/watchers/cilium_network_policy.go:348 +0xc75 github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).onUpsert() ./pkg/k8s/watchers/cilium_network_policy.go:271 +0x744 github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).ciliumNetworkPoliciesInit.func1() ./pkg/k8s/watchers/cilium_network_policy.go:175 +0xc64 Goroutine 1205 (running) created at: github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).ciliumNetworkPoliciesInit() ./pkg/k8s/watchers/cilium_network_policy.go:91 +0x27c github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).enableK8sWatchers.func1() ./pkg/k8s/watchers/watcher.go:578 +0x59 sync.(*Once).doSlow() /usr/local/go/src/sync/once.go:74 +0xf0 sync.(*Once).Do() /usr/local/go/src/sync/once.go:65 +0x44 github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).enableK8sWatchers() ./pkg/k8s/watchers/watcher.go:578 +0xa24 github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).InitK8sSubsystem() ./pkg/k8s/watchers/watcher.go:508 +0x104 github.com/cilium/cilium/daemon/cmd.newDaemon() ./daemon/cmd/daemon.go:1001 +0x9070 github.com/cilium/cilium/daemon/cmd.newDaemonPromise.func1() ./daemon/cmd/daemon_main.go:1687 +0xa4 github.com/cilium/cilium/pkg/hive.Hook.Start() ./pkg/hive/lifecycle.go:34 +0x70 github.com/cilium/cilium/pkg/hive.(*Hook).Start() <autogenerated>:1 +0x1f github.com/cilium/cilium/pkg/hive.(*DefaultLifecycle).Start() ./pkg/hive/lifecycle.go:103 +0x3f1 github.com/cilium/cilium/pkg/hive.(*Hive).Start() ./pkg/hive/hive.go:291 +0x152 github.com/cilium/cilium/pkg/hive.(*Hive).Run() ./pkg/hive/hive.go:191 +0xc4 github.com/cilium/cilium/daemon/cmd.NewAgentCmd.func1() ./daemon/cmd/root.go:39 +0x264 github.com/spf13/cobra.(*Command).execute() ./vendor/github.com/spf13/cobra/command.go:944 +0xcb8 github.com/spf13/cobra.(*Command).ExecuteC() ./vendor/github.com/spf13/cobra/command.go:1068 +0x5c4 github.com/spf13/cobra.(*Command).Execute() ./vendor/github.com/spf13/cobra/command.go:992 +0x2e github.com/cilium/cilium/daemon/cmd.Execute() ./daemon/cmd/root.go:79 +0x2f main.main() ./daemon/main.go:14 +0xa9 ``` Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 45549e3 ] Commit def1770 ("ipsec: Clear node ID and SPI with output-mark") updates the output-mark of all our XFRM states. Without special logic to handle the upgrade, these new states can however conflict with the existing states. In that case, the agent will remove the existing state and add the new one. It works fine, but can lead to a very short time where packets are dropped because there isn't any matching XFRM state. It also causes an error log for the conflicting states. This commit adds the special upgrade logic to handle that situation. We have two cases to handle, depending on where we upgrade from. If upgrading from a version after commit 3e59b68 ("ipsec: Per-node XFRM states & policies for EKS & AKS"), then the existing XFRM states differ only by their output-marks. In that case, we can simply update the state in place atomically using the appropriate netlink call. If upgrading from a version that doesn't include commit 3e59b68, then the XFRM states may differ significantly. In that case, we already have some logic in place to remove the old conflicting states and add the new ones (the netlink API doesn't let us perform an atomic update in that case [1]). To handle the additional difference in output-marks, we simply need to update the existing logic to not assume the output-marks are the same. This is also the correct way to detect conflicting states anyway, given the kernel doesn't consider the output-marks when looking for conflicts. This change was tested with upgrades to v1.13.7 with this patch backported to it, on: - IPAM=ENI, upgrading from v1.13.0 (before commit 3e59b68). - IPAM=ENI, upgrading from v1.13.7. - IPAM=cluster-pool, upgrading from v1.13.0. In all three cases, I ran the full connectivity tests successfully and checked that the logs didn't have any errors or warnings after the upgrade. I also deployed the migrate-svc app before the upgrade to ensure the upgrade doesn't cause any transient packet drops [2]. 1 - c0d9b8c ("ipsec: Allow old and new XFRM OUT states to coexist for upgrade") 2 - Note that the second case isn't atomically updating the XFRM states so can in principle cause packet drops. The window for such drops is however so small (time between two netlink calls) that they haven't been reproduced yet (neither now nor when testing c0d9b8c). Fixes: def1770 ("ipsec: Clear node ID and SPI with output-mark") Reported-by: Hart Hoover <hart@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
b4ab602
to
c7a140b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My PR looks good. Thanks!
/ci-clustermesh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My PR looks good, thanks! I've started the CI to check if the new workflow completes correctly.
Edit: all green 🎉
/test-backport-1.13 |
/test-1.19-4.19 |
Once this PR is merged, a GitHub action will update the labels of these PRs: