Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend clustermesh status reporting with remote configuration and synchronization information #26788

Merged
merged 12 commits into from Jul 25, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Jul 12, 2023

This PR enriches the remote cluster status to additionally report information concerning the configuration advertised by the remote cluster and the synchronization status. In particular, a cluster is not reported as ready only after that the initial list of entries for each resource type has been retrieved. This is intended to provide additional information while troubleshooting possible issues (as also captured by sysdumps), as well as to allow the cilium clustermesh status command to better reflect the actual status.

Example (with the --all-clusters flag set, otherwise only not-ready clusters are expanded):

ClusterMesh:            2/3 clusters ready, 0 global-services
   hedgehog: ready, 2 nodes, 4 endpoints, 3 identities, 0 services, 0 failures (last: never)
   └  etcd: 1/1 connected, leases=0, lock lease-ID=7c02894a36550948, has-quorum=true: https://hedgehog.mesh.cilium.io:32380 - 3.5.4 (Leader)
   └  remote configuration: expected=false, retrieved=false
   └  synchronization status: nodes=true, endpoints=true, identities=true, services=true
   lobster: ready, 2 nodes, 4 endpoints, 3 identities, 0 services, 0 failures (last: never)
   └  etcd: 1/1 connected, leases=0, lock lease-ID=7c02894a1ee69fa3, has-quorum=true: https://lobster.mesh.cilium.io:32379 - 3.5.4 (Leader)
   └  remote configuration: expected=true, retrieved=true cluster-id=4, kvstoremesh=false, sync-canaries=true
   └  synchronization status: nodes=true, endpoints=true, identities=true, services=true
   ocelot: not-ready, 2 nodes, 4 endpoints, 3 identities, 0 services, 0 failures (last: never)
   └  etcd: 1/1 connected, leases=0, lock lease-ID=7c02894a22a8bc93, has-quorum=true: https://ocelot.mesh.cilium.io:32379 - 3.5.4 (Leader)
   └  remote configuration: expected=false, retrieved=true cluster-id=3, kvstoremesh=false, sync-canaries=true
   └  synchronization status: nodes=true, endpoints=true, identities=false, services=true

Please refer to the individual commit messages for additional information.
Tentatively marking for backport to v1.14, given that it improves the investigation of possible issues around clustermesh.

Fixes: #26532
Related: cilium/cilium-cli#1834

Extend clustermesh status reporting with remote configuration and synchronization information 

@giorio94 giorio94 added kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 12, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.0 Jul 12, 2023
@giorio94
Copy link
Member Author

/test

giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 13, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 13, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 13, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 marked this pull request as ready for review July 13, 2023 14:52
@giorio94 giorio94 requested review from a team as code owners July 13, 2023 14:52
@giorio94 giorio94 added the dont-merge/blocked Another PR must be merged before this one. label Jul 13, 2023
@giorio94
Copy link
Member Author

Marking as blocked to explicitly wait for the parallel reviews on cilium/cilium-cli#1834, so that modifications can be synchronized.

@giorio94
Copy link
Member Author

/test

giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 13, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

giorio94 commented Jul 13, 2023

/ci-aks

Hit known flake #22162

giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 14, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Enrich the remote cluster status to additionally include information
about the cluster configuration advertised by the remote cluster. In
particular, whether it is required to be present and has been retrieved,
along with the remote cluster ID and its capabilities. By default, this
information is shown only until the given cluster is not ready, or if
the `--all-clusters` flag is set, to simplify troubleshooting possible
issues and distinguish them from connectivity problems.

The information concerning the cluster configuration to be returned as
part of the status is stored remote cluster during the retrieval process,
initially setting whether it is required to be present, and then filling
the rest of the fields when it is actually retrieved. No cluster config
information is returned if the retrieval process has not started yet.

Additionally, this commit slightly modifies the condition used to
determine whether a given cluster is reported as ready, requiring that
the connection to the remote etcd has been established, and the cluster
configuration has either been retrieved, or determined that it is not
required to be present. Hence, no longer relying on the connection status
only, which was previously configured only after retrieving the cluster
configuration, possibly leading to a more inaccurate status reporting.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the synced variable is used in the RestartableWatchStore
to ensure that callbacks are executed only once upon the initial
synchronization. Given that a subsequent commit will start exposing
synchronization information (with a different meaning), let's get rid
of it and rework the callbacks execution to not depend on it.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's allow to retrieve whether the RestartableWatchStore is currently
synchronized or not. It is considered to be synchronized if the initial
list of entries has been retrieved from the kvstore, and new events are
being watched. When the watch operation gets interrupted, the status
transitions back to not synchronized.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's allow to retrieve whether the ipcache watcher is currently
synchronized, based on the status of the underlying watch store.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's allow to retrieve whether the remote identities cache is currently
synchronized or not. It is considered to be synchronized if the initial
list of entries has been retrieved from the kvstore, and new events are
being watched. When the watch operation gets interrupted, the status
transitions back to not synchronized.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Enrich the remote cluster status to additionally include information
concerning the synchronization status of each resource type. In
particular, a resource type is considered to be synchronized if the
initial list of entries has been completely received from the remote
cluster, and new events are currently being watched.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the remote cluster status is reported as ready once the
kvstore connection has been established. Let's extend this to also
wait for the initial synchronization to complete for each watched
resource type.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the remote cluster status includes a ready field, as well as
the information about the connection to the remote kvstore in textual
form. While this is enough for a human to distinguish whether it is
reported as not ready due to the kvstore connection not being ready
or not (e.g., the synchronization process being still in progress),
it is confusing when retrieved externally, for instance by the Cilium
CLI. Hence, let's add an explicit flag to known whether the connection
to the remote kvstore has been established or not.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Update the clustermesh troubleshooting page to reflect the output of the
extended `cilium status --all-clusters` command.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The output of the `cilium clustermesh status --wait` command has
slightly changed to fix an inaccuracy. Let's reflect it also here.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

Last force-push rebased onto main to make tests happy.

@giorio94
Copy link
Member Author

/test

@giorio94
Copy link
Member Author

Travis hit #26617. Rerunning

@giorio94 giorio94 removed the dont-merge/blocked Another PR must be merged before this one. label Jul 25, 2023
@giorio94
Copy link
Member Author

Removing the blocked label, given that reviews are almost all in, and cilium/cilium-cli#1834 also got an initial feedback.

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on behalf of api. Getting this in will help debuggability, so it's a nice-to-have for v1.14.0. Merging.

@joestringer joestringer merged commit 185df80 into cilium:main Jul 25, 2023
65 checks passed
@joestringer joestringer mentioned this pull request Jul 25, 2023
3 tasks
@joestringer joestringer added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.0 Jul 25, 2023
@giorio94 giorio94 deleted the mio/remote-cluster-status branch July 26, 2023 06:27
@aanm aanm added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.0 Jul 26, 2023
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 31, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jul 31, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Aug 1, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
michi-covalent pushed a commit to cilium/cilium-cli that referenced this pull request Aug 1, 2023
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
No open projects
1.14.0
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

Expose initial synchronization of ClusterMesh data in Cilium Status
7 participants