Fix missed deletion events when reconnecting to/disconnecting from remote clusters (nodes and services) #25499

Conversation

@giorio94 (Member) commented May 17, 2023

Currently, restarting the connection to remote clusters leaves a time window in which deletion events may be missed and never recovered once the new watchers are started. Similarly, the status is not properly cleaned up when removing a remote cluster from the clustermesh configuration.

This PR builds on top of #25388 and addresses this issue for nodes and services (identities and ipcache entries will be handled in a follow-up PR). When connecting to a remote cluster that supports synced canaries, the initial list operation for each prefix is postponed until that prefix is known to have been synchronized. This ensures that we observe an up-to-date state, which would not otherwise be guaranteed when the remote etcd instance is ephemeral.

Conversely, when synced canaries are not supported (e.g., because the remote cluster is running an older version of the clustermesh-apiserver), the list operation is performed immediately. In this case, non-stale keys may be temporarily removed upon reconnection, causing brief connectivity disruptions. Yet, this is no different from what already happens today if the agent is restarted while the remote kvstore has not yet been fully synchronized, and it is not an issue at all when the remote kvstore is backed by persistent storage, since it is already synchronized. Alternatively, we might disable the deletion of stale entries when sync canaries are not supported, at the cost of leaking those entries until the agent is restarted. This behavior (if agreed upon) will be further detailed in the release notes in a subsequent commit (it can be prevented by upgrading all clustermesh-apiservers before the agents), after addressing the same problem for the ipcache and identity entries.
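
For illustration, here is a minimal Go sketch of the postponed-list logic described above. The names (`startWatcher`, `waitForCanary`, the canary-signaling channel) are hypothetical and not the actual Cilium implementation; only the overall idea — wait for the canary before listing, fall back to an immediate start otherwise — is taken from this PR.

```go
package main

import (
	"context"
	"fmt"
)

// watchPrefix stands in for the real list-and-watch routine for a kvstore prefix.
type watchPrefix func(ctx context.Context, prefix string)

// waitForCanary stands in for a watch on the remote "synced" canary for the prefix;
// the returned channel is closed once the canary has been observed.
type waitForCanary func(ctx context.Context, prefix string) <-chan struct{}

// startWatcher postpones the initial list until the prefix is known to be fully
// synchronized, so that stale-entry deletion cannot remove live keys.
func startWatcher(ctx context.Context, prefix string, canariesSupported bool,
	wait waitForCanary, watch watchPrefix) {
	if canariesSupported {
		select {
		case <-wait(ctx, prefix): // remote cluster finished syncing this prefix
		case <-ctx.Done():
			return
		}
	}
	// Safe to list now (or canaries are unsupported and the legacy behavior applies).
	watch(ctx, prefix)
}

func main() {
	ready := make(chan struct{})
	close(ready) // pretend the canary is already present
	startWatcher(context.Background(), "cilium/state/nodes/v1", true,
		func(ctx context.Context, prefix string) <-chan struct{} { return ready },
		func(ctx context.Context, prefix string) { fmt.Println("listing and watching", prefix) },
	)
}
```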

Related: #25388
Related: #24740


@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. sig/kvstore Impacts the KVStore package interactions. labels May 17, 2023
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-nodes-services branch 2 times, most recently from 6b78207 to 78f71b4 on May 22, 2023 12:42
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-nodes-services branch 2 times, most recently from 11e39ee to e146c1e on May 25, 2023 09:23
@giorio94 giorio94 requested a review from marseel May 25, 2023 09:25
@giorio94 giorio94 marked this pull request as ready for review May 25, 2023 09:25
@giorio94 giorio94 requested review from a team as code owners May 25, 2023 09:25
@giorio94 (Member Author) commented May 25, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name: K8sDatapathServicesTest Checks N/S loadbalancing With host policy Tests NodePort

Failure Output: FAIL: Can not connect to service "http://[fd04::12]:31293" from outside cluster (1/10)

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/171/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@giorio94 (Member Author) commented:

/test

@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-nodes-services branch from 9bb03d8 to 0684e3b on May 26, 2023 08:26
// This ensures that the keys hosted under the given prefix have been
// successfully synchronized from the external source, even in case an
// ephemeral kvstore is used.
func NewWatchStoreManagerSync(backend WatchStoreBackend, clusterName string) WatchStoreManager {
A reviewer (Contributor) commented:

Is a separate type necessary, or could you add the synchronization canary aspect into the RestartableWatchStore?

@giorio94 (Member Author) replied:

I'm personally in favor of the separate type, which IMO provides better separation of concerns (it would transparently work with different implementations of the WatchStore interface as well) and makes it simpler to adopt alternative approaches than directly embedding the logic in the RestartableWatchStore. It also reduces the number of watchers compared to having each RestartableWatchStore wait for its own prefix.

@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-nodes-services branch from d84b7f4 to ef4d950 on May 30, 2023 11:46
@giorio94 giorio94 requested a review from squeed May 30, 2023 15:21
// Drain emits a deletion event for each known key. It shall be called only
// when no watch operation is in progress.
func (rws *restartableWatchStore) Drain() {
if rws.watching.Swap(true) {
rws.log.Panic("Cannot drain the watch store while still running")
A reviewer (Member) commented:

Small nit, should we panic? 🤔 Maybe yes...

@giorio94 (Member Author) replied:

The idea of being strict here was to easily catch possible misuses, since draining the store while actively watching a kvstore would lead to inconsistent behavior (both because of the meaning of the operation itself and because, for the sake of simplicity, the local state is not protected by a mutex).
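
For context, here is a minimal, hypothetical sketch of the guard being discussed (heavily simplified, not the actual restartableWatchStore code): both Watch and Drain flip the same atomic flag, so draining while a watch is still in progress panics immediately instead of silently corrupting the unprotected state.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type store struct {
	watching atomic.Bool
	state    map[string]struct{} // known keys; not mutex-protected by design
}

func (s *store) Watch() {
	if s.watching.Swap(true) {
		panic("Cannot start the watch store while already running")
	}
	defer s.watching.Store(false)
	// ... list and watch the kvstore prefix here, mutating s.state ...
}

// Drain emits a deletion event for every known key; the flag guards against
// running concurrently with Watch.
func (s *store) Drain() {
	if s.watching.Swap(true) {
		panic("Cannot drain the watch store while still running")
	}
	defer s.watching.Store(false)
	for key := range s.state {
		fmt.Println("emitting deletion event for", key)
		delete(s.state, key)
	}
}

func main() {
	s := &store{state: map[string]struct{}{"cilium/state/nodes/v1/node1": {}}}
	s.Drain() // fine: no watch operation in progress
}
```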

The Key interface represents the basic operations to enable inserting
key-value pairs into a kvstore and retrieving them from it, mainly
concerning marshalling and unmarshalling. Yet, unmarshalling was performed
based on the content of the value only, which made it impossible to
reconstruct the full object when the key is not part of the value itself
(e.g., for identity-label pairs). This commit extends the Unmarshal method
to additionally take the key (stripped of the store prefix) as a parameter,
and adapts the existing implementations to ignore it.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
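
As an illustration of the extended contract (the names and signatures below are assumptions, not the exact Cilium definitions), an existing implementation can keep unmarshalling from the value only and simply ignore the newly added key parameter:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Key is a reduced, illustrative version of the interface discussed above.
type Key interface {
	GetKeyName() string
	Unmarshal(key string, data []byte) error // the key parameter is the new addition
}

type service struct {
	Name string `json:"name"`
}

var _ Key = (*service)(nil)

func (s *service) GetKeyName() string { return s.Name }

// Existing implementations keep unmarshalling from the value only and ignore
// the key, preserving their previous behavior.
func (s *service) Unmarshal(_ string, data []byte) error {
	return json.Unmarshal(data, s)
}

func main() {
	var s service
	_ = s.Unmarshal("default/frontend", []byte(`{"name": "frontend"}`))
	fmt.Println(s.GetKeyName())
}
```
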
This commit introduces a new simple implementation of the LocalKey interface
used to store plain key-value pairs and retrieve them from the kvstore.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
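
A minimal sketch in the spirit of this commit (hypothetical shape, not the actual Cilium type): a plain key-value pair relies on the key parameter introduced by the previous commit to rebuild the full object.

```go
package main

import "fmt"

// kvPair is an illustrative plain key-value pair: its identity lives entirely
// in the key, so the key must be available at unmarshal time.
type kvPair struct {
	Key   string
	Value []byte
}

func (kv *kvPair) GetKeyName() string       { return kv.Key }
func (kv *kvPair) Marshal() ([]byte, error) { return kv.Value, nil }

// Unmarshal stores both the key (stripped of the store prefix) and the raw
// value, which is all a plain pair needs.
func (kv *kvPair) Unmarshal(key string, value []byte) error {
	kv.Key, kv.Value = key, value
	return nil
}

func main() {
	var pair kvPair
	_ = pair.Unmarshal("id/12345", []byte("k8s:app=frontend"))
	fmt.Printf("%s => %s\n", pair.GetKeyName(), pair.Value)
}
```
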
This commit introduces the WatchStore abstraction, which models the
operations required to synchronize key-value pairs from a kvstore, and
represents a simplified "read-only" version of the currently available
SharedStore abstraction.

The implementation provided in this commit supports restarting the watch
operation multiple times, automatically handling the emission of deletion
events for all stale entries. This allows the reconnection procedure to be
handled correctly when the kvstore client needs to be recreated (e.g., due
to a change in the parameters, or a temporary failure), without leaking the
entries that were removed during the connection outage. It additionally
supports a Drain operation, to trigger the emission of a deletion event for
all currently known entries (e.g., for cleanup purposes).

Notably, the RestartableWatchStore implementation conveys the
synchronization status through the `kvstore_initial_sync_completed`
metric, which is updated at every restart. It is labeled by `scope`
(matching the format adopted by the other kvstore metrics),
`source_cluster` (identifying the source of the information) and
`action` (either write or read; the WatchStore always performs read
operations). It is also possible to specify an extra metric which is
automatically updated with the number of synchronized entries.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
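
The following is a deliberately simplified, hypothetical sketch of the restart behavior described above (not the actual RestartableWatchStore code): on every (re)start, the freshly listed keys are diffed against the previously known state, and deletion events are emitted for entries that disappeared while the connection was down.

```go
package main

import "fmt"

type observer interface {
	OnUpdate(key string, value []byte)
	OnDelete(key string)
}

type watchStore struct {
	obs   observer
	state map[string]struct{} // keys known from the previous run
}

// restart replays a freshly listed snapshot, emitting updates for the current
// keys and deletions for the stale ones left over from the previous connection.
func (ws *watchStore) restart(listed map[string][]byte) {
	current := make(map[string]struct{}, len(listed))
	for key, value := range listed {
		current[key] = struct{}{}
		ws.obs.OnUpdate(key, value)
	}
	for key := range ws.state {
		if _, ok := current[key]; !ok {
			ws.obs.OnDelete(key) // removed during the connection outage
		}
	}
	ws.state = current
}

type printer struct{}

func (printer) OnUpdate(key string, _ []byte) { fmt.Println("upsert", key) }
func (printer) OnDelete(key string)           { fmt.Println("delete", key) }

func main() {
	ws := &watchStore{obs: printer{}, state: map[string]struct{}{}}
	ws.restart(map[string][]byte{"node1": nil, "node2": nil})
	// node2 was deleted remotely while reconnecting: its deletion is recovered.
	ws.restart(map[string][]byte{"node1": nil})
}
```
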
This commit introduces the WatchStoreManager abstraction, which allows
registering a set of functions to be asynchronously executed when the
corresponding kvstore prefixes are synchronized.

Two different implementations are provided. The first leverages the sync
canaries to detect that the entries under a given prefix have been correctly
synchronized from an external source. The second one, instead, is provided
for backward compatibility and immediately starts all registered functions.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
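
As a rough illustration (the interface shape and names below are assumptions, not the exact Cilium API), the manager registers one function per prefix; the backward-compatibility variant sketched here starts them all immediately, while the sync-canary variant would instead wait for the corresponding canary key before invoking each function.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type watchStoreManager interface {
	Register(prefix string, fn func(ctx context.Context))
	Run(ctx context.Context)
}

// immediateManager provides backward compatibility: it starts all registered
// functions as soon as Run is invoked, without waiting for any canary.
type immediateManager struct {
	fns map[string]func(ctx context.Context)
}

var _ watchStoreManager = (*immediateManager)(nil)

func (m *immediateManager) Register(prefix string, fn func(ctx context.Context)) {
	m.fns[prefix] = fn
}

func (m *immediateManager) Run(ctx context.Context) {
	var wg sync.WaitGroup
	for prefix, fn := range m.fns {
		wg.Add(1)
		go func(prefix string, fn func(ctx context.Context)) {
			defer wg.Done()
			fn(ctx) // a canary-aware manager would first wait for the prefix canary
		}(prefix, fn)
	}
	wg.Wait()
}

func main() {
	mgr := &immediateManager{fns: map[string]func(ctx context.Context){}}
	mgr.Register("cilium/state/nodes/v1", func(ctx context.Context) {
		fmt.Println("watching nodes")
	})
	mgr.Run(context.Background())
}
```
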
Currently, a new shared store synchronizing node and service information
is created for each connection to the kvstore backend of a remote
cluster, and the old one is dropped. Yet, this approach leads to missing
the deletion events for entries that are removed in the remote cluster
during the reconnection process (added entries would instead be detected
by the initial "list" operation).

This commit fixes this issue by using a fixed instance of the node and
service stores per remote cluster, which is reused upon reconnection,
while transparently handling the emission of the appropriate deletion
events for all keys no longer present. To prevent reading an incomplete
state when watching an ephemeral kvstore instance that has not yet been
completely synchronized, the watch operation is started only once the
sync canary for the given prefix is present, if support is enabled.

In case sync canary support is not enabled (e.g., because the remote
cluster is running an older version of the clustermesh-apiserver), there
is the possibility that non-stale keys are temporarily removed upon
reconnection, causing brief connectivity disruptions. Yet, this is no
different from what already happens today if the agent is restarted
while the remote kvstore has not yet been fully synchronized.
Additionally, this is not an issue when the remote kvstore is backed by
persistent storage, since it is already synchronized. Alternatively, we
might disable the deletion of stale entries if sync canaries are not
supported, at the cost of leaking those entries until the agent is
restarted. This behavior will be further detailed in the release notes
in a subsequent commit (it can be prevented by upgrading all
clustermesh-apiservers before the agents), after addressing the same
problem affecting the ipcache and identity entries.

Additionally, all keys are drained when disconnecting from a remote
cluster, to properly clean up the state without requiring an agent
restart. Conversely, they are not drained when the agent simply shuts
down, to avoid breaking existing connections across restarts.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
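
To summarize the lifecycle described above, here is a hypothetical, simplified sketch (illustrative names, not the actual Cilium types): the same store instances are reused across reconnections, and drained only when the remote cluster is removed from the mesh.

```go
package main

import (
	"context"
	"fmt"
)

type watchStore interface {
	Watch(ctx context.Context, prefix string)
	Drain()
}

// noopStore stands in for the real restartable watch store.
type noopStore struct{ prefix string }

func (s *noopStore) Watch(ctx context.Context, prefix string) { s.prefix = prefix }
func (s *noopStore) Drain()                                   { fmt.Println("draining", s.prefix) }

type remoteCluster struct {
	name         string
	nodeStore    watchStore // fixed instances, reused across reconnections
	serviceStore watchStore
}

// onReconnect restarts the watch on the existing stores: the fresh list is
// diffed against the previous state, so deletions missed during the outage
// are emitted instead of being lost with a brand-new store.
func (rc *remoteCluster) onReconnect(ctx context.Context) {
	rc.nodeStore.Watch(ctx, "cilium/state/nodes/v1")
	rc.serviceStore.Watch(ctx, "cilium/state/services/v1")
}

// onRemove drains all known keys, cleaning up the state without an agent
// restart. It is intentionally not invoked on plain shutdown, to avoid
// breaking existing connections across agent restarts.
func (rc *remoteCluster) onRemove() {
	rc.nodeStore.Drain()
	rc.serviceStore.Drain()
}

func main() {
	rc := &remoteCluster{name: "cluster-2", nodeStore: &noopStore{}, serviceStore: &noopStore{}}
	rc.onReconnect(context.Background())
	rc.onRemove()
}
```
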
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-nodes-services branch from ef4d950 to ff36250 on May 31, 2023 14:37
@giorio94 (Member Author) commented:

Rebased onto main to fix a conflict.

@giorio94 (Member Author) commented:

@squeed PTAL

@giorio94 (Member Author) commented:

/test

@giorio94 (Member Author) commented Jun 1, 2023

/ci-eks

@giorio94 (Member Author) commented Jun 1, 2023

Cilium Runtime (privileged) hit #22373. Manually rerunning.

@giorio94 (Member Author) commented Jun 1, 2023

All checks are green, and team reviews are in. Marking as ready-to-merge.

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 1, 2023
@julianwiedmann (Member) commented:

👋 as this is release-note/bug, do we need backports?

@giorio94 (Member Author) commented Jun 1, 2023

> 👋 as this is release-note/bug, do we need backports?

I personally think that backporting is quite risky, since this is a fairly large change which also depends on changes on the clustermesh-apiserver side, plus the follow-ups for identities and ipcache entries. I'd be more comfortable not marking it as needs-backport for the moment, and re-evaluating once all the pieces are in.
