daemon: make consecutive quorum errors threshold configurable #16885

Merged


ArthurChiao
Contributor

When consecutive probes find that the heartbeat (written by
cilium-operator) is missing from the kvstore, the clustermesh module in
cilium-agent reconnects to the kvstore. In large clusters/meshes, e.g.
clusters with thousands of nodes, the concurrent reconnects and the
list+watch operations that follow put significant pressure on the
kvstore, to the extent of crashing it.

The threshold is currently hardcoded to 2. This patch makes it
configurable, so users can choose between failing fast and being more
patient when encountering kvstore/operator/k8s-control-plane problems.

Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>

@ArthurChiao ArthurChiao requested a review from a team July 14, 2021 15:02
@ArthurChiao ArthurChiao requested review from a team as code owners July 14, 2021 15:02
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 14, 2021
@ArthurChiao ArthurChiao force-pushed the make_quorum_threshold_configurable branch from 250a569 to dd475d9 Compare July 15, 2021 02:31
daemon/cmd/daemon_main.go Outdated
@ArthurChiao ArthurChiao force-pushed the make_quorum_threshold_configurable branch 3 times, most recently from 265f9d6 to d324ad3 Compare July 19, 2021 14:25
@ArthurChiao ArthurChiao requested review from a team as code owners July 19, 2021 14:25
@ArthurChiao ArthurChiao force-pushed the make_quorum_threshold_configurable branch from d324ad3 to 24a77b7 Compare July 19, 2021 14:29
@ArthurChiao ArthurChiao force-pushed the make_quorum_threshold_configurable branch from 24a77b7 to 8979e92 Compare July 19, 2021 14:40
@joestringer joestringer added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jul 19, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 19, 2021
@borkmann borkmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 2, 2021
@vadorovsky vadorovsky merged commit 7cc210e into cilium:master Aug 2, 2021
@ArthurChiao ArthurChiao deleted the make_quorum_threshold_configurable branch August 4, 2021 01:56
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

6 participants