fe4dda7 ("services: prevent temporary connectivity loss on agent
restart") modified the handling of restored backends to prevent
possible temporary connectivity disruptions on agent restart when a
service is either associated with multiple endpointslices (e.g., it
has more than 100 backends, or is dual stack) or has backends
spanning multiple clusters (i.e., it is a global service).
At a high level, we now keep a list of restored backends, which continue
being merged with the ones we received an update for, until the bootstrap
phase completes. At that point, we trigger an update for each service
still associated with stale backends, so that they can be removed.
One drawback associated with this approach, though, is that when
clustermesh is enabled we currently wait for full synchronization from
all remote clusters before triggering the removal of stale backends,
regardless of whether the given service is global (i.e., it may also
include remote backends) or not. One specific example in which such
behavior is problematic relates to the clustermesh-apiserver.
Indeed, if it gets restarted at the same time as the agents
(e.g., during an upgrade), the associated service might end up
including both the address of the previous pod (which is now stale)
and that of the new one, which is correct. When kvstoremesh is
enabled, local agents connect to it through that service. In this
case, there's a circular dependency: the agent may pick the stale
backend, causing the connection to etcd to fail, which in turn
prevents the synchronization from starting and completing, and thus
from triggering the removal of the stale backend. Although this
dependency eventually
resolves as a different backend is picked by the service load-balancing
algorithm, unnecessary delay is introduced (the same could also happen
for remote agents connecting through a NodePort if KPR is enabled).
To remove this dependency, let's perform a two-pass cleanup of stale
backends: the first one as soon as we synchronize with Kubernetes,
targeting non-global services only; the second triggered by full
clustermesh synchronization, covering all remaining ones. Hence,
non-global services can be fully functional even before full
clustermesh synchronization completes. It is worth noting that this
fix applies to all non-global services, which can now converge
faster even in the case of large clustermeshes.
Co-authored-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>