metrics: add a metric for max observed endpoint ifindex #27953
Commit 5ec7915 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
76f5baf
to
4cea910
Compare
The metric was tested on Linux 4.19 interactively using an
After bringing up the VM:
On the local machine
On the VM
4cea910
to
70e9070
Compare
Upon the first run of CI for this PR, an import cycle was detected. This commit fixes that import cycle, with details in the commit message:
/test
caf9516
to
be5ce5e
Compare
@asauber could we add the backport 1.14 label, given that the ifindex fix also went in there?
@derailed yes, this can be used to implement an alert to determine if the cluster is nearing the max ifindex of 65535 (if on an affected kernel version). The current suggested mitigation in that case is to roll or reboot the Node.
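To make the alerting idea concrete, here is a minimal Go sketch of the threshold check such an alert would encode. The names `ifindexHeadroom` and `shouldAlert`, and the 90% threshold, are hypothetical choices for illustration and are not part of this PR; only the 65535 limit comes from the discussion above.

```go
package main

import "fmt"

// maxIfindex is the largest value that fits in a 16-bit field (65535),
// the limit mentioned in the comment above.
const maxIfindex = 65535

// ifindexHeadroom returns how many interface indexes remain before the
// observed maximum reaches the limit. Hypothetical helper.
func ifindexHeadroom(observedMax uint32) uint32 {
	if observedMax >= maxIfindex {
		return 0
	}
	return maxIfindex - observedMax
}

// shouldAlert reports whether the observed maximum has reached the given
// fraction of the limit. The threshold is an assumed tuning knob.
func shouldAlert(observedMax uint32, threshold float64) bool {
	return float64(observedMax) >= threshold*maxIfindex
}

func main() {
	for _, v := range []uint32{1200, 59000, 65535} {
		fmt.Printf("max ifindex %d: headroom=%d alert=%t\n",
			v, ifindexHeadroom(v), shouldAlert(v, 0.9))
	}
}
```

In practice the same comparison would more likely live in an alerting rule evaluated against the exported metric; the Go form is only meant to show the arithmetic.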
Thanks for the PR!
A few comments below. One other thing is that I don't see any commit descriptions for the commits and all the context is instead in the PR. As a project, we are not opposed to a PR description, but we state in our contrib guide that PRs should contain commits with a description. Plus, this is beneficial when bisecting / git blame.
For ease of development, we test in CI that the Ginkgo suite found in /test can compile on multiple platforms, including darwin. This test suite includes a test of the Hive framework, which initializes an instance of the daemon Cell. This Cell requires the MetricsRegistry, and so these tests depend on the MetricsRegistry.

Some metrics may soon depend upon Linux feature probes, which makes the Linux probes package an indirect dependency of the Ginkgo tests. To support this build dependency, we allow the probes package to compile on darwin by adding stub values for the netlink constants used by the package, and a linux-specific declaration of the constants using the netlink package.

Signed-off-by: Andrew Sauber <2046750+asauber@users.noreply.github.com>
To avoid a circular import, move testutils/ipam.go into its own package. The circular import was observed when attempting to use datapath kernel feature probes in the metrics package. It was affecting only the probe package's tests.

import cycle:
    github.com/cilium/cilium/pkg/datapath/linux/probes [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
    github.com/cilium/cilium/pkg/testutils [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
    github.com/cilium/cilium/pkg/k8s/apis/cilium.io/v2 [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
    github.com/cilium/cilium/pkg/policy/api [github.com/cilium/cilium/pkg/datapath/linux/probes.test]
    github.com/cilium/cilium/pkg/metrics [github.com/cilium/cilium/pkg/datapath/linux/probes.test]

Moving this package avoids the import cycle.

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
be5ce5e
to
75ad939
Compare
Add a metric for the current maximum observed interface index

Adds a metric `endpoint_max_ifindex`, which is the current maximum interface index for all endpoints. The metric is updated during the periodic invocation of `syncHostIPs`.

This metric can be used to determine if the interface index for the next Pod will exceed Cilium's limit of max(uint16) on older kernels. On kernels which do not provide ifindex via the FIB, Cilium must store the ifindex in the CT map, and 16 bits are reserved per entry for this purpose. This presents a limit of 65535 lifetime interfaces, which can be approached quickly with sufficient pod churn.

Users who are subject to this limitation (typically a kernel version less than 5.10) are advised to reboot or roll the host in this case.

Its default-enabled status is dynamic. On kernels which provide ifindex via the FIB, the metric is disabled (since Cilium's ifindex limits are not subject to max(uint16) in this case).

Addresses: cilium#17259
Addresses: cilium#16260

Signed-off-by: Andrew Sauber <2046750+asauber@users.noreply.github.com>
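As an illustration of why a 16-bit field caps the usable ifindex at 65535, the following Go sketch shows what happens when a larger index is narrowed to `uint16`, alongside a simple running-maximum computation. This is not Cilium's CT map code or the metric's actual implementation; `storeIfindex` and `maxObserved` are hypothetical names used only for this example.

```go
package main

import "fmt"

// storeIfindex mimics keeping an interface index in a 16-bit field.
// It returns the stored value and whether the original index fit.
// Hypothetical helper; not taken from Cilium.
func storeIfindex(ifindex uint32) (uint16, bool) {
	stored := uint16(ifindex) // keeps only the low 16 bits
	return stored, uint32(stored) == ifindex
}

// maxObserved returns the largest index in the slice, or 0 if it is empty.
// Hypothetical helper sketching a "max observed ifindex" computation.
func maxObserved(ifindexes []uint32) uint32 {
	var highest uint32
	for _, idx := range ifindexes {
		if idx > highest {
			highest = idx
		}
	}
	return highest
}

func main() {
	observed := []uint32{12, 65535, 65536, 65537}
	for _, idx := range observed {
		stored, ok := storeIfindex(idx)
		fmt.Printf("ifindex=%d stored=%d fits=%t\n", idx, stored, ok)
	}
	fmt.Println("max observed:", maxObserved(observed))
}
```

Any index above 65535 no longer round-trips through the 16-bit field, which is the condition the metric is meant to help operators anticipate.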
75ad939
to
daae407
Compare
Looks great on docs! Thank you for adding the description as to the dynamic default status, awesome stuff.
Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
0168aab
to
d9ee880
Compare
/test
Adds a metric `endpoint_max_ifindex`, which is the current maximum
interface index for all endpoints. The metric is updated during the
periodic invocation of `syncHostIPs`.
This metric can be used to determine if the interface index for the next
Pod will exceed Cilium's limit of max(uint16) on older kernels. On
kernels which do not provide ifindex via the FIB, Cilium must store the
ifindex in the CT map, and 16 bits are reserved per entry for this
purpose. This presents a limit of 65535 lifetime interfaces, which can
be approached quickly with sufficient pod churn.
Users who are subject to this limitation (typically a kernel version
less than 5.10) are advised to reboot or roll the host in this case.
Its default-enabled status is dynamic. On kernels which provide ifindex
via the FIB, the metric is disabled (since Cilium's ifindex limits are
not subject to max(uint16) in this case).
It can be enabled with the following Helm values:
On kernels which do not provide ifindex via the FIB, the metric is enabled by default.
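The dynamic default described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the code from this PR: `metricEnabled` is a hypothetical name, the `explicit` pointer stands in for a user-supplied setting, and `fibProvidesIfindex` stands in for the result of a kernel feature probe.

```go
package main

import "fmt"

// metricEnabled decides whether the metric should be exported.
// An explicit user setting always wins; otherwise the metric defaults to
// enabled only on kernels that do not provide ifindex via the FIB.
// Hypothetical helper for illustration.
func metricEnabled(explicit *bool, fibProvidesIfindex bool) bool {
	if explicit != nil {
		return *explicit
	}
	return !fibProvidesIfindex
}

func main() {
	on := true
	// Older kernel (no ifindex via FIB), no explicit setting.
	fmt.Println("old kernel, default:", metricEnabled(nil, false))
	// Newer kernel (ifindex via FIB), no explicit setting.
	fmt.Println("new kernel, default:", metricEnabled(nil, true))
	// Newer kernel, metric explicitly enabled by the user.
	fmt.Println("new kernel, explicit:", metricEnabled(&on, true))
}
```

The three calls in `main` correspond to the three scenarios listed in the testing notes that follow.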
Dynamic default based on Kernel version
On Linux 4.19 without metric explicitly enabled
On Linux 6.4 without metric explicitly enabled
On Linux 6.4 with metric explicitly enabled
Longer-running example
Roll a pod or two
and the `ifindex` increases
Addresses: #17259
Addresses: #16260