daemon: make consecutive quorum errors threshold configurable #16885

Merged
Documentation/cmdref/cilium-agent.md (1 addition, 0 deletions; generated file, not rendered by default)

Documentation/operations/upgrade.rst (7 additions, 0 deletions)
@@ -331,6 +331,13 @@ Deprecated Options
* ``native-routing-cidr``: This option has been deprecated in favor of
  ``ipv4-native-routing-cidr`` and will be removed in 1.12.

+New Options
+~~~~~~~~~~~

+* ``kvstore-max-consecutive-quorum-errors``: This option configures the max
+  acceptable kvstore consecutive quorum errors before the agent assumes
+  permanent failure.

.. _1.10_upgrade_notes:

1.10 Upgrade Notes
daemon/cmd/daemon_main.go (3 additions, 0 deletions)
@@ -506,6 +506,9 @@ func init() {
flags.MarkHidden(option.KVstoreLeaseTTL)
option.BindEnv(option.KVstoreLeaseTTL)

+flags.Int(option.KVstoreMaxConsecutiveQuorumErrorsName, defaults.KVstoreMaxConsecutiveQuorumErrors, "Max acceptable kvstore consecutive quorum errors before the agent assumes permanent failure")
+option.BindEnv(option.KVstoreMaxConsecutiveQuorumErrorsName)

flags.Duration(option.KVstorePeriodicSync, defaults.KVstorePeriodicSync, "Periodic KVstore synchronization interval")
option.BindEnv(option.KVstorePeriodicSync)
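The flag registration and `option.BindEnv` call above can be approximated with the Go standard library alone. This is a minimal sketch, not Cilium's code: the standard `flag` package stands in for pflag/viper, `envName` assumes that `option.BindEnv` derives a `CILIUM_`-prefixed, upper-case, underscore-separated variable name, and `resolveMaxQuorumErrors` is a hypothetical helper that applies viper's documented precedence (an explicitly set flag, then the environment, then the flag default).

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Option name and default value as they appear in this PR.
const (
	kvstoreMaxConsecutiveQuorumErrorsName = "kvstore-max-consecutive-quorum-errors"
	defaultMaxConsecutiveQuorumErrors     = 2
)

// envName derives an environment variable name from an option name.
// ASSUMPTION: hypothetical helper following a CILIUM_ + upper-case +
// underscore convention; it is not option.BindEnv itself.
func envName(option string) string {
	return "CILIUM_" + strings.ToUpper(strings.ReplaceAll(option, "-", "_"))
}

// resolveMaxQuorumErrors returns the effective threshold using viper-like
// precedence: an explicitly set flag wins, then the environment, then the default.
func resolveMaxQuorumErrors(args []string, lookupEnv func(string) (string, bool)) (int, error) {
	fs := flag.NewFlagSet("cilium-agent", flag.ContinueOnError)
	value := fs.Int(kvstoreMaxConsecutiveQuorumErrorsName, defaultMaxConsecutiveQuorumErrors,
		"Max acceptable kvstore consecutive quorum errors before the agent assumes permanent failure")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}

	// fs.Visit only walks flags that were actually set on the command line.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == kvstoreMaxConsecutiveQuorumErrorsName {
			explicit = true
		}
	})
	if explicit {
		return *value, nil
	}

	name := envName(kvstoreMaxConsecutiveQuorumErrorsName)
	if v, ok := lookupEnv(name); ok {
		n, err := strconv.Atoi(v)
		if err != nil {
			return 0, fmt.Errorf("invalid %s: %w", name, err)
		}
		return n, nil
	}
	return *value, nil // flag default
}

func main() {
	n, err := resolveMaxQuorumErrors(os.Args[1:], os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("max consecutive quorum errors:", n)
}
```

Under these assumptions, running the sketch with `CILIUM_KVSTORE_MAX_CONSECUTIVE_QUORUM_ERRORS=7` in the environment prints 7, while passing `--kvstore-max-consecutive-quorum-errors=5` prints 5 regardless of the environment.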

pkg/defaults/defaults.go (4 additions, 0 deletions)
@@ -324,6 +324,10 @@ const (
// KVstoreLeaseTTL is the time-to-live of the kvstore lease.
KVstoreLeaseTTL = 15 * time.Minute

+// KVstoreMaxConsecutiveQuorumErrors is the maximum number of acceptable
+// kvstore consecutive quorum errors before the agent assumes permanent failure
+KVstoreMaxConsecutiveQuorumErrors = 2

// KVstoreKeepAliveIntervalFactor is the factor to calculate the interval
// from KVstoreLeaseTTL in which KVstore lease is being renewed.
KVstoreKeepAliveIntervalFactor = 3
pkg/kvstore/etcd.go (1 addition, 5 deletions)
@@ -62,10 +62,6 @@ const (
EtcdRateLimitOption = "etcd.qps"

minRequiredVersionStr = ">=3.1.0"

-// consecutiveQuorumErrorsThreshold is the number of acceptable quorum
-// errors before the agent assumes permanent failure
-consecutiveQuorumErrorsThreshold = 2
)

var (
@@ -1156,7 +1152,7 @@ func (e *etcdClient) statusChecker() {
e.statusLock.Lock()

switch {
-case consecutiveQuorumErrors > consecutiveQuorumErrorsThreshold:
+case consecutiveQuorumErrors > option.Config.KVstoreMaxConsecutiveQuorumErrors:
e.latestErrorStatus = fmt.Errorf("quorum check failed %d times in a row: %s",
consecutiveQuorumErrors, quorumError)
e.latestStatusSnapshot = e.latestErrorStatus.Error()
pkg/option/config.go (9 additions, 0 deletions)
@@ -656,6 +656,10 @@ const (
// KVstoreLeaseTTL is the time-to-live for lease in kvstore.
KVstoreLeaseTTL = "kvstore-lease-ttl"

+// KVstoreMaxConsecutiveQuorumErrorsName is the maximum number of acceptable
+// kvstore consecutive quorum errors before the agent assumes permanent failure
+KVstoreMaxConsecutiveQuorumErrorsName = "kvstore-max-consecutive-quorum-errors"

// KVstorePeriodicSync is the time interval in which periodic
// synchronization with the kvstore occurs
KVstorePeriodicSync = "kvstore-periodic-sync"
@@ -1539,6 +1543,10 @@ type DaemonConfig struct {
// KVstoreLeaseTTL is the time-to-live for kvstore lease.
KVstoreLeaseTTL time.Duration

+// KVstoreMaxConsecutiveQuorumErrors is the maximum number of acceptable
+// kvstore consecutive quorum errors before the agent assumes permanent failure
+KVstoreMaxConsecutiveQuorumErrors int

// KVstorePeriodicSync is the time interval in which periodic
// synchronization with the kvstore occurs
KVstorePeriodicSync time.Duration
@@ -2440,6 +2448,7 @@ func (c *DaemonConfig) Populate() {
c.KVstoreKeepAliveInterval = c.KVstoreLeaseTTL / defaults.KVstoreKeepAliveIntervalFactor
c.KVstorePeriodicSync = viper.GetDuration(KVstorePeriodicSync)
c.KVstoreConnectivityTimeout = viper.GetDuration(KVstoreConnectivityTimeout)
+c.KVstoreMaxConsecutiveQuorumErrors = viper.GetInt(KVstoreMaxConsecutiveQuorumErrorsName)
c.IPAllocationTimeout = viper.GetDuration(IPAllocationTimeout)
c.LabelPrefixFile = viper.GetString(LabelPrefixFile)
c.Labels = viper.GetStringSlice(Labels)
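`Populate` reads the new field through viper's package-level `viper.GetInt`. If one wanted to exercise this step without global viper state, a minimal sketch (hypothetical, not part of this PR) could put the lookup behind a small interface. `*viper.Viper` has a `GetInt(key string) int` method and so would satisfy `intGetter`; `daemonConfig`, `populate`, and `mapSource` are illustrative names only.

```go
package main

import "fmt"

const kvstoreMaxConsecutiveQuorumErrorsName = "kvstore-max-consecutive-quorum-errors"

// intGetter is a hypothetical seam over the configuration source.
type intGetter interface {
	GetInt(key string) int
}

// daemonConfig keeps only the field added by this change.
type daemonConfig struct {
	KVstoreMaxConsecutiveQuorumErrors int
}

// populate mirrors the single assignment added to DaemonConfig.Populate.
func (c *daemonConfig) populate(src intGetter) {
	c.KVstoreMaxConsecutiveQuorumErrors = src.GetInt(kvstoreMaxConsecutiveQuorumErrorsName)
}

// mapSource is a map-backed test double; a missing key yields Go's zero value.
type mapSource map[string]int

func (m mapSource) GetInt(key string) int { return m[key] }

func main() {
	var c daemonConfig
	c.populate(mapSource{kvstoreMaxConsecutiveQuorumErrorsName: 5})
	fmt.Println(c.KVstoreMaxConsecutiveQuorumErrors)
}
```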