Extend clustermesh status reporting with remote configuration and synchronization information #26788
Force-pushed: b369f33 to a5e79b3
/test
Force-pushed: a5e79b3 to 9e6c933
Let's leverage the additional status flags introduced in cilium/cilium#26788 to provide more detailed information concerning why a node cannot connect to a remote cluster, and simplify troubleshooting. In particular, it might be due to an etcd connection problem, to the cluster config being required but not found, or to the synchronization process being still in progress. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
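The three causes listed in this commit message can be told apart with a simple ordered check. The following is a minimal sketch of that idea, not the actual cilium-cli code: the `RemoteStatus` type, its field names, and the returned strings are all assumptions made for illustration.

```go
package main

// RemoteStatus is a hypothetical, simplified view of the per-cluster status
// flags. The field names are assumptions, not the actual Cilium API model.
type RemoteStatus struct {
	Connected      bool // connection to the remote etcd has been established
	ConfigRequired bool // the cluster config is required to be present
	ConfigFound    bool // the cluster config has been retrieved
	Synced         bool // the initial synchronization has completed
}

// NotReadyReason returns a short explanation of why a remote cluster is not
// ready, or an empty string when none of the three causes applies.
// The checks are ordered so that the earliest blocking cause is reported.
func NotReadyReason(s RemoteStatus) string {
	switch {
	case !s.Connected:
		return "etcd connection not established"
	case s.ConfigRequired && !s.ConfigFound:
		return "cluster configuration required but not found"
	case !s.Synced:
		return "synchronization still in progress"
	default:
		return ""
	}
}
```

Checking the connection first mirrors the order in which the problems would surface: without a connection, neither the configuration nor the synchronization state can be meaningful.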
Force-pushed: 9e6c933 to 1418938
Marking as blocked to explicitly wait for the parallel reviews on cilium/cilium-cli#1834, so that modifications can be synchronized.
/test
/ci-aks (hit known flake #22162)
Enrich the remote cluster status to additionally include information about the cluster configuration advertised by the remote cluster: whether it is required to be present and has been retrieved, along with the remote cluster ID and its capabilities. By default, this information is shown only while the given cluster is not ready, or if the `--all-clusters` flag is set, to simplify troubleshooting and to distinguish configuration issues from connectivity problems. The cluster configuration information returned as part of the status is stored in the remote cluster during the retrieval process: first recording whether the configuration is required to be present, and then filling in the remaining fields once it is actually retrieved. No cluster config information is returned if the retrieval process has not started yet. Additionally, this commit slightly modifies the condition used to determine whether a given cluster is reported as ready: the connection to the remote etcd must have been established, and the cluster configuration must either have been retrieved or have been determined not to be required. Readiness therefore no longer relies on the connection status only, which was previously set only after retrieving the cluster configuration, possibly leading to less accurate status reporting. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
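The readiness condition described in this commit message can be sketched as a small predicate. This is an illustration under stated assumptions, not the code from the commit: the `ConfigStatus` type, its fields, and `readyAfterConfig` are hypothetical names, and a `nil` pointer stands in for "the retrieval process has not started yet".

```go
package main

// ConfigStatus is a hypothetical summary of the cluster configuration
// retrieval. A nil *ConfigStatus models "retrieval not started yet".
type ConfigStatus struct {
	Required  bool   // the cluster config is required to be present
	Retrieved bool   // the cluster config has actually been retrieved
	ClusterID uint32 // remote cluster ID, meaningful only once retrieved
}

// readyAfterConfig reports whether a cluster would count as ready under the
// condition described above: the etcd connection is established, and the
// config has either been retrieved or been determined not to be required.
func readyAfterConfig(connected bool, cfg *ConfigStatus) bool {
	if !connected || cfg == nil {
		// Either no connection, or nothing is known about the config yet.
		return false
	}
	return cfg.Retrieved || !cfg.Required
}
```

For example, `readyAfterConfig(true, &ConfigStatus{Required: true})` is false, because the configuration is required but has not been retrieved.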
Currently, the synced variable is used in the RestartableWatchStore to ensure that callbacks are executed only once upon the initial synchronization. Given that a subsequent commit will start exposing synchronization information (with a different meaning), let's get rid of it and rework the callback execution so that it does not depend on it. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's allow retrieving whether the RestartableWatchStore is currently synchronized or not. It is considered to be synchronized if the initial list of entries has been retrieved from the kvstore, and new events are being watched. When the watch operation gets interrupted, the status transitions back to not synchronized. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
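The state transitions described here (synchronized after the initial list, back to not synchronized when the watch is interrupted) can be sketched with an atomic flag. This is a simplified illustration, not the RestartableWatchStore implementation: the `syncTracker` type and its method names are hypothetical.

```go
package main

import "sync/atomic"

// syncTracker is a hypothetical helper that tracks whether a watcher is
// currently synchronized. The zero value is "not synchronized".
type syncTracker struct {
	synced atomic.Bool
}

// InitialListDone marks the tracker as synchronized: the initial list of
// entries has been retrieved and new events are being watched.
func (t *syncTracker) InitialListDone() { t.synced.Store(true) }

// WatchInterrupted transitions the tracker back to not synchronized.
func (t *syncTracker) WatchInterrupted() { t.synced.Store(false) }

// Synced reports the current state and is safe to call from a goroutine
// other than the one driving the watch loop.
func (t *syncTracker) Synced() bool { return t.synced.Load() }
```

An atomic flag is used in this sketch because the status is typically read by a different goroutine than the one running the watch loop.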
Let's allow retrieving whether the ipcache watcher is currently synchronized, based on the status of the underlying watch store. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's allow retrieving whether the remote identities cache is currently synchronized or not. It is considered to be synchronized if the initial list of entries has been retrieved from the kvstore, and new events are being watched. When the watch operation gets interrupted, the status transitions back to not synchronized. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Enrich the remote cluster status to additionally include information concerning the synchronization status of each resource type. In particular, a resource type is considered to be synchronized if the initial list of entries has been completely received from the remote cluster, and new events are currently being watched. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the remote cluster status is reported as ready once the kvstore connection has been established. Let's extend this to also wait for the initial synchronization to complete for each watched resource type. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
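Waiting for every watched resource type before reporting readiness amounts to a conjunction over per-resource flags. The sketch below assumes a hypothetical `readyWhenSynced` helper and a map keyed by resource name; the key names used in the tests are illustrative only.

```go
package main

// readyWhenSynced reports whether a remote cluster would count as ready once
// readiness also depends on synchronization: the kvstore connection must be
// established and every watched resource type must be synchronized.
// With no watched resource types, the result depends on the connection only.
func readyWhenSynced(connected bool, syncedByResource map[string]bool) bool {
	if !connected {
		return false
	}
	for _, synced := range syncedByResource {
		if !synced {
			return false
		}
	}
	return true
}
```

A single resource type that has not completed its initial list is enough to keep the whole cluster in the not-ready state.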
Currently, the remote cluster status includes a ready field, as well as information about the connection to the remote kvstore in textual form. While this is enough for a human to tell whether a cluster is reported as not ready because the kvstore connection is not established or for another reason (e.g., the synchronization process still being in progress), it is confusing when retrieved externally, for instance by the Cilium CLI. Hence, let's add an explicit flag to know whether the connection to the remote kvstore has been established. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
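An external consumer can rely on such a flag instead of parsing the textual status. The following sketch assumes a JSON payload with `ready` and `connected` boolean fields; the struct, the JSON key names, and the `classify` helper are assumptions for illustration and may not match the actual API model.

```go
package main

import "encoding/json"

// remoteCluster is a hypothetical subset of the status payload.
// The JSON key names are assumptions, not a confirmed schema.
type remoteCluster struct {
	Name      string `json:"name"`
	Ready     bool   `json:"ready"`
	Connected bool   `json:"connected"`
	Status    string `json:"status"` // free-form text meant for humans
}

// classify decodes one remote cluster entry and returns a coarse state
// derived from the boolean flags only, ignoring the textual status.
func classify(raw []byte) (string, error) {
	var rc remoteCluster
	if err := json.Unmarshal(raw, &rc); err != nil {
		return "", err
	}
	switch {
	case rc.Ready:
		return "ready", nil
	case !rc.Connected:
		return "not connected", nil
	default:
		return "connected, not ready", nil
	}
}
```

The "connected, not ready" case is the one that could not be identified reliably from the ready field alone.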
Update the clustermesh troubleshooting page to reflect the output of the extended `cilium status --all-clusters` command. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The output of the `cilium clustermesh status --wait` command has slightly changed to fix an inaccuracy. Let's reflect it also here. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Force-pushed: cc38dc3 to d487b80
Last force-push rebased onto main to make tests happy.
/test
Travis hit #26617. Rerunning.
Removing the blocked label, given that reviews are almost all in, and cilium/cilium-cli#1834 also got initial feedback.
Approving on behalf of api. Getting this in will help debuggability, so it's a nice-to-have for v1.14.0. Merging.
This PR enriches the remote cluster status to additionally report information concerning the configuration advertised by the remote cluster and the synchronization status. In particular, a cluster is now reported as ready only after the initial list of entries for each resource type has been retrieved. This is intended to provide additional information while troubleshooting possible issues (as also captured by sysdumps), as well as to allow the `cilium clustermesh status` command to better reflect the actual status.
Example (with the `--all-clusters` flag set, otherwise only not-ready clusters are expanded):
Please refer to the individual commit messages for additional information.
Tentatively marking for backport to v1.14, given that it improves the investigation of possible issues around clustermesh.
Fixes: #26532
Related: cilium/cilium-cli#1834