Commit
docs: add upgrade note about deletion of stale entries in clustermesh
150de13 ("clustermesh: delete stale node/service entries on
reconnect/disconnect"), along with the follow-up commits targeting
ipcache entries and identities, modified the Cilium agents' behavior to
automatically clean up stale information after reconnecting to a given
remote kvstore. This was needed to fix the issue described in #24740.

The behavior differs depending on the version of the remote
clustermesh-apiserver, though. Newer versions support "sync canaries"
to convey that the synchronization from k8s to the kvstore has
completed, while older ones do not. When sync canaries are not supported,
the agents trigger the deletion of stale entries as soon as the
corresponding etcd list operation completes: this might lead to the
removal of valid entries if that information has not yet been
synchronized from k8s to the kvstore, causing a temporary connectivity
disruption (until the information is synchronized and propagated again to
the agents). This commit extends the upgrade notes to detail this
behavior and its implications.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 authored and joestringer committed Jun 15, 2023
1 parent 9b18e1c commit a6720f1
Showing 2 changed files with 10 additions and 0 deletions.
9 changes: 9 additions & 0 deletions Documentation/operations/upgrade.rst
@@ -333,6 +333,15 @@ Annotations:
Egress rules in CiliumNetworkPolicy CRD. The old attribute name is no longer supported,
please update your CiliumNetworkPolicy CRD accordingly. Also applicable values for this
attribute are changed to ``disabled``, ``required`` and ``test-always-fail``.
* Cilium agents now automatically clean up possible stale information about meshed
clusters after reconnecting to the corresponding remote kvstores (see :gh-issue:`24740`
for the rationale behind this change). This might lead to brief connectivity disruptions
towards remote pods and global services when v1.14 Cilium agents connect to older
versions of the *clustermesh-apiserver*, and the *clustermesh-apiserver* is restarted.
Please upgrade the *clustermesh-apiserver* in all clusters before the Cilium agents

joestringer (Member) commented on Jul 21, 2023:

@giorio94 How does a user upgrade clustermesh-apiserver in all clusters before the Cilium agents? Is that a matter of changing the existing Cilium 1.13 Helm values to pull in the 1.14 clustermesh-apiserver image, ensuring they are all rolled out and successfully meshing, and then proceeding with a regular Cilium upgrade?

giorio94 (Author, Member) replied on Jul 21, 2023:

Yes, it is enough to first bump the clustermesh-apiserver image version in all clusters (either manually or through Helm), and then upgrade Cilium to v1.14 in all clusters. It is still worth mentioning, though, that there might be brief connection drops affecting cross-cluster traffic which are unrelated to this specific aspect: #26462
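The two-step procedure described above can be sketched with Helm. This is a hypothetical example, not an endorsed upgrade path: the release name (`cilium`), namespace, target version, and the `clustermesh.apiserver.image.tag` value key are assumptions that depend on how Cilium was installed and on the chart version in use, so check the chart's values before running anything like this.

```shell
# Step 1 (assumed values): in every meshed cluster, bump only the
# clustermesh-apiserver image while keeping the existing 1.13 chart
# and values in place.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set clustermesh.apiserver.image.tag=v1.14.0

# Wait for the rollout to complete and confirm the clusters are still
# meshing correctly before moving on.
kubectl --namespace kube-system rollout status deployment/clustermesh-apiserver

# Step 2: only after all clusters run the new clustermesh-apiserver,
# perform the regular Cilium upgrade to v1.14 in each cluster.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --version 1.14.0 \
  --reuse-values
```

The point of the split is ordering: the v1.14 agents' stale-entry cleanup is only safe once the remote clustermesh-apiserver instances support sync canaries, so the apiserver image must be rolled out everywhere first.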

to prevent the possibility of connectivity disruptions. Note: this issue does not
affect setups using a persistent etcd cluster instead of the ephemeral one bundled
with the *clustermesh-apiserver*.

Removed Options
~~~~~~~~~~~~~~~
1 change: 1 addition & 0 deletions Documentation/spelling_wordlist.txt
@@ -665,6 +665,7 @@ kubernetes
kubespray
kvstore
kvstoremesh
kvstores
labelsContext
latencies
lbExternalClusterIP
