Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.13 Backports 2023-10-11 #28519

Merged
merged 4 commits into from Oct 12, 2023
Merged

v1.13 Backports 2023-10-11 #28519

merged 4 commits into from Oct 12, 2023

Conversation

joamaki
Copy link
Contributor

@joamaki joamaki commented Oct 11, 2023

[ upstream commit 55efd61 ]

Backport the script to retrieve the previous Cilium version, originally
introduced in d51a932 ("Clean up tests-ipsec-upgrade workflow").

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki added kind/backports This PR provides functionality previously merged into master. backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. labels Oct 11, 2023
@joamaki joamaki marked this pull request as ready for review October 12, 2023 08:11
@joamaki joamaki requested review from a team as code owners October 12, 2023 08:11
giorio94 and others added 3 commits October 12, 2023 11:58
[ upstream commit 0e5bc15 ]

[ Backporter's notes: Removed references to kvstoremesh as it's
  not part of v1.13 ]

This commit introduces a new GitHub actions workflow which validates the
Cilium upgrade and downgrade paths when clustermesh is enabled, ensuring
that one minor version skew is correctly supported.

More specifically, the test initially deploys a mixed version cluster
mesh, with one cluster running Cilium from the tip of the previous stable
branch, and the other the tip of the current stable branch. At this point,
the first cluster is upgraded to the tip of the current stable branch,
enabling kvstoremesh at the same time. Finally, the cluster is downgraded
again to the tip of the previous stable branch, disabling kvstoremesh.

We do not enable the connectivity disruption checks in this workflow,
because Cilium versions older than 1.14 are currently affected by known
issues causing possible brief connectivity distruption of cross-cluster
traffic when the agents get restarted.

To reduce the total amount of time required by this workflow, only a
limited subset of tests is enabled, while the full suite is run in
the conformance-clustermesh workflow.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3de85a1 ]

We modified the UpdateStatus function to ensure that the CNP object is
deep-copied before passing it as an argument. This change was necessary
because the UpdateStatus function was modifying the CNP object, specifically
clearing the LastAppliedConfiguration key from the annotations map. By
deep-copying the CNP object, we ensure that the original object remains
unmodified which fixes the following race condition:

```
Write at 0x00c002a98510 by goroutine 119834:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/map_faststr.go:203 +0x0
  github.com/cilium/cilium/pkg/k8s.(*CNPStatusUpdateContext).updateViaAPIServer.func1()
      ./pkg/k8s/cnp.go:215 +0x53
  runtime.deferreturn()
      /usr/local/go/src/runtime/panic.go:477 +0x30
  github.com/cilium/cilium/pkg/k8s.(*CNPStatusUpdateContext).updateStatus()
      ./pkg/k8s/cnp.go:78 +0x2c7
  github.com/cilium/cilium/pkg/k8s.(*CNPStatusUpdateContext).UpdateStatus()
      ./pkg/k8s/cnp.go:146 +0x786
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addCiliumNetworkPolicyV2.func1()
      ./pkg/k8s/watchers/cilium_network_policy.go:352 +0x86
  github.com/cilium/cilium/pkg/controller.(*controller).runController()
      ./pkg/controller/controller.go:251 +0x171
  github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked.func1()
      ./pkg/controller/manager.go:111 +0xa4

Previous read at 0x00c002a98510 by goroutine 1205:
  runtime.mapiterinit()
      /usr/local/go/src/runtime/map.go:816 +0x0
  github.com/cilium/cilium/pkg/comparator.MapStringEqualsIgnoreKeys()
      ./pkg/comparator/comparator.go:82 +0xb1
  github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.objectMetaDeepEqual()
      ./pkg/k8s/apis/cilium.io/v2/cnp_types.go:65 +0xb0
  github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2.(*CiliumNetworkPolicy).DeepEqual()
      ./pkg/k8s/apis/cilium.io/v2/cnp_types.go:54 +0x177
  github.com/cilium/cilium/pkg/k8s/types.(*SlimCNP).DeepEqual()
      ./pkg/k8s/types/zz_generated.deepequal.go:82 +0xbd
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).onUpsert()
      ./pkg/k8s/watchers/cilium_network_policy.go:238 +0x170
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).ciliumNetworkPoliciesInit.func1()
      ./pkg/k8s/watchers/cilium_network_policy.go:175 +0xc64

Goroutine 119834 (running) created at:
  github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked()
      ./pkg/controller/manager.go:111 +0x757
  github.com/cilium/cilium/pkg/controller.(*Manager).updateController()
      ./pkg/controller/manager.go:84 +0x44f
  github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController()
      ./pkg/controller/manager.go:52 +0xe6f
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).addCiliumNetworkPolicyV2()
      ./pkg/k8s/watchers/cilium_network_policy.go:348 +0xc75
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).onUpsert()
      ./pkg/k8s/watchers/cilium_network_policy.go:271 +0x744
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).ciliumNetworkPoliciesInit.func1()
      ./pkg/k8s/watchers/cilium_network_policy.go:175 +0xc64

Goroutine 1205 (running) created at:
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).ciliumNetworkPoliciesInit()
      ./pkg/k8s/watchers/cilium_network_policy.go:91 +0x27c
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).enableK8sWatchers.func1()
      ./pkg/k8s/watchers/watcher.go:578 +0x59
  sync.(*Once).doSlow()
      /usr/local/go/src/sync/once.go:74 +0xf0
  sync.(*Once).Do()
      /usr/local/go/src/sync/once.go:65 +0x44
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).enableK8sWatchers()
      ./pkg/k8s/watchers/watcher.go:578 +0xa24
  github.com/cilium/cilium/pkg/k8s/watchers.(*K8sWatcher).InitK8sSubsystem()
      ./pkg/k8s/watchers/watcher.go:508 +0x104
  github.com/cilium/cilium/daemon/cmd.newDaemon()
      ./daemon/cmd/daemon.go:1001 +0x9070
  github.com/cilium/cilium/daemon/cmd.newDaemonPromise.func1()
      ./daemon/cmd/daemon_main.go:1687 +0xa4
  github.com/cilium/cilium/pkg/hive.Hook.Start()
      ./pkg/hive/lifecycle.go:34 +0x70
  github.com/cilium/cilium/pkg/hive.(*Hook).Start()
      <autogenerated>:1 +0x1f
  github.com/cilium/cilium/pkg/hive.(*DefaultLifecycle).Start()
      ./pkg/hive/lifecycle.go:103 +0x3f1
  github.com/cilium/cilium/pkg/hive.(*Hive).Start()
      ./pkg/hive/hive.go:291 +0x152
  github.com/cilium/cilium/pkg/hive.(*Hive).Run()
      ./pkg/hive/hive.go:191 +0xc4
  github.com/cilium/cilium/daemon/cmd.NewAgentCmd.func1()
      ./daemon/cmd/root.go:39 +0x264
  github.com/spf13/cobra.(*Command).execute()
      ./vendor/github.com/spf13/cobra/command.go:944 +0xcb8
  github.com/spf13/cobra.(*Command).ExecuteC()
      ./vendor/github.com/spf13/cobra/command.go:1068 +0x5c4
  github.com/spf13/cobra.(*Command).Execute()
      ./vendor/github.com/spf13/cobra/command.go:992 +0x2e
  github.com/cilium/cilium/daemon/cmd.Execute()
      ./daemon/cmd/root.go:79 +0x2f
  main.main()
      ./daemon/main.go:14 +0xa9
```

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 45549e3 ]

Commit def1770 ("ipsec: Clear node ID and SPI with output-mark")
updates the output-mark of all our XFRM states. Without special logic to
handle the upgrade, these new states can however conflict with the
existing states. In that case, the agent will remove the existing state
and add the new one. It works fine, but can lead to a very short time
where packets are dropped because there isn't any matching XFRM state.
It also causes an error log for the conflicting states.

This commit adds the special upgrade logic to handle that situation. We
have two cases to handle, depending on where we upgrade from.

If upgrading from a version after commit 3e59b68 ("ipsec: Per-node
XFRM states & policies for EKS & AKS"), then the existing XFRM states
differ only by their output-marks. In that case, we can simply update
the state in place atomically using the appropriate netlink call.

If upgrading from a version that doesn't include commit 3e59b68, then
the XFRM states may differ significantly. In that case, we already have
some logic in place to remove the old conflicting states and add the new
ones (the netlink API doesn't let us perform an atomic update in that
case [1]). To handle the additional difference in output-marks, we
simply need to update the existing logic to not assume the output-marks
are the same. This is also the correct way to detect conflicting states
anyway, given the kernel doesn't consider the output-marks when looking
for conflicts.

This change was tested with upgrades to v1.13.7 with this patch
backported to it, on:
- IPAM=ENI, upgrading from v1.13.0 (before commit 3e59b68).
- IPAM=ENI, upgrading from v1.13.7.
- IPAM=cluster-pool, upgrading from v1.13.0.

In all three cases, I ran the full connectivity tests successfully and
checked that the logs didn't have any errors or warnings after the
upgrade. I also deployed the migrate-svc app before the upgrade to
ensure the upgrade doesn't cause any transient packet drops [2].

1 - c0d9b8c ("ipsec: Allow old and new XFRM OUT states to coexist for upgrade")
2 - Note that the second case isn't atomically updating the XFRM states
    so can in principle cause packet drops. The window for such drops
    is however so small (time between two netlink calls) that they
    haven't been reproduced yet (neither now nor when testing
    c0d9b8c).
Fixes: def1770 ("ipsec: Clear node ID and SPI with output-mark")
Reported-by: Hart Hoover <hart@isovalent.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/v1.13-backport-2023-10-11 branch from b4ab602 to c7a140b Compare October 12, 2023 08:58
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR looks good. Thanks!

@giorio94
Copy link
Member

/ci-clustermesh

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR looks good, thanks! I've started the CI to check if the new workflow completes correctly.

Edit: all green 🎉

@joamaki
Copy link
Contributor Author

joamaki commented Oct 12, 2023

/test-backport-1.13

@joamaki
Copy link
Contributor Author

joamaki commented Oct 12, 2023

/test-1.19-4.19

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 12, 2023
@qmonnet qmonnet merged commit 02318da into v1.13 Oct 12, 2023
134 of 143 checks passed
@qmonnet qmonnet deleted the pr/v1.13-backport-2023-10-11 branch October 12, 2023 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants