ipam: add metrics to track per node capacity #24776

Merged

Conversation

tommyp1ckles
Contributor

@tommyp1ckles tommyp1ckles commented Apr 6, 2023

This PR adds metrics to help track IPAM node capacity with more granularity.
Specifically, the goal is to support alerting and dashboards that answer questions such as:

  • How many IPs are left on my node/cluster?
  • How many/what Nodes have capacity?
  • What is the capacity of any particular node?

This adds three new Operator metrics:

cilium_operator_ipam_available_ips
cilium_operator_ipam_used_ips
cilium_operator_ipam_needed_ips

All three are labelled by "target_node" (i.e. the name of the CiliumNode).

This PR also seeks to deprecate the cilium_operator_ipam_ips metric. Aside from being redundant, it is confusing: type=available does not track what you would expect it to track.

How does this differ from the existing IPAM metrics?

There's a bit of a blind spot in the current set of metrics:

  • cilium_operator_ipam_ips: this metric is confusing, and it should probably be separate metrics rather than a single one with classifying labels (CC: @chancez).
  • empty_interface_slots/interface_candidates: these two metrics can answer whether there is capacity left in the cluster, but they can't give any more detail, such as how much capacity is left or which nodes are full.
    • They may also be misleading. For example, if you have 3 nodes, each with 1 slot and 0 interface_candidates, then it's possible you have 3 possible ENIs x 10 IPs = 30 ... but if the nodes with the capacity have a different IP-per-interface limit, then it could also be 10 x 50 = 500.
operator/ipam/metrics: Add new, more accurate, per-node available/used/needed metrics to deprecate the existing ipam_ips metric.

@maintainer-s-little-helper

Commit b98e92d31f6aca94e40acbdf2d089a3d5e1759bc does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 6, 2023
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch from b98e92d to 0f6ff5f Compare April 6, 2023 04:56

@tommyp1ckles
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch from 0f6ff5f to 56ab89f Compare April 20, 2023 01:04
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 20, 2023
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch 2 times, most recently from 64d7c15 to 60c3c86 Compare April 20, 2023 04:36
@tommyp1ckles tommyp1ckles changed the title capacitipam: add new metrics and accounting for total node addr capac… ipam: add metrics to track per node capacity Apr 20, 2023
@tommyp1ckles tommyp1ckles added the kind/feature This introduces new functionality. label Apr 20, 2023
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch from add59a5 to 7339b89 Compare April 20, 2023 05:24
@tommyp1ckles tommyp1ckles marked this pull request as ready for review April 20, 2023 05:25
@tommyp1ckles tommyp1ckles requested review from a team as code owners April 20, 2023 05:25
@tommyp1ckles tommyp1ckles added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/ipam IP address management, including cloud IPAM labels Apr 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 20, 2023
Member

@gandro gandro left a comment

Awesome, thanks! I focused on the IPAM aspect, that change looks good to me. A few minor things that stood out to me.

pkg/alibabacloud/eni/node.go (resolved)
pkg/ipam/node_manager.go (outdated, resolved)
pkg/ipam/metrics/metrics.go (resolved)
Member

@christarazi christarazi left a comment

Nice work, clean code and easy to read commits!

I don't have any other comment besides a nit to follow and what Sebastian already pointed out. When reading some of the commits, I was initially confused by what "CNI pods" meant and then realized it meant "managed (by Cilium) pods". The latter is typically how we convey the relationship as you probably already know, but thought I'd call it out.

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch from 7339b89 to e7ed5e5 Compare April 27, 2023 05:07
@tommyp1ckles tommyp1ckles requested a review from a team as a code owner April 27, 2023 05:21
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch from 0ad83a1 to 97ac4c1 Compare April 27, 2023 18:47
@tommyp1ckles
Contributor Author

/test

Updated metrics docs with new ipam metrics.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Mention the addition of ipam_{available,used,needed} metrics.
As well, notify in upgrade guide about intended deprecation of ipam_ips metric.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Previously IsPrefixDelegationEnabled implementation on *ipam.Node would check if the receiver node pointer was nil. If so it defaulted to false.

This is a bit dangerous, as a method on a nil pointer of a concrete type will still be invoked, whereas calling a method on a nil interface will panic (because a nil interface has no method to look up).

This removes that code, and defers nil checking to the caller.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
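
The difference between a nil pointer and a nil interface described in the commit message above can be sketched as follows. The `prefixChecker` interface, the `node` type, and its field are hypothetical stand-ins for illustration only; they are not the actual `*ipam.Node` definition.

```go
package main

import "fmt"

// prefixChecker is a hypothetical interface, used only for this illustration.
type prefixChecker interface {
	IsPrefixDelegationEnabled() bool
}

// node stands in for a concrete type such as *ipam.Node; its field is made up.
type node struct {
	prefixDelegation bool
}

// IsPrefixDelegationEnabled tolerates a nil receiver. This is the kind of
// guard the commit removes in favour of nil checks at the call site.
func (n *node) IsPrefixDelegationEnabled() bool {
	if n == nil {
		return false
	}
	return n.prefixDelegation
}

// callPanics reports whether calling the method through c panics.
func callPanics(c prefixChecker) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c.IsPrefixDelegationEnabled()
	return false
}

func main() {
	var n *node
	// The method is still invoked on a nil pointer of a concrete type.
	fmt.Println(n.IsPrefixDelegationEnabled()) // false

	// A nil interface has no dynamic type, so the call panics.
	var c prefixChecker
	fmt.Println(callPanics(c)) // true

	// An interface holding a nil *node is itself non-nil; the guard runs.
	c = n
	fmt.Println(callPanics(c)) // false
}
```

The guard hides the nil only when the caller holds the concrete pointer type, so the same call can behave differently depending on how the value is stored.
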
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/better-eni-capacity-metrics branch from 97ac4c1 to 0013798 Compare April 27, 2023 18:57
@tommyp1ckles
Contributor Author

/test

@tommyp1ckles
Contributor Author

@zacharysarah no problem, thanks for the review!

@tommyp1ckles
Contributor Author

@gandro @christarazi Made some changes to fix failing tests and lint issues.

Contributor

@zacharysarah zacharysarah left a comment

@tommyp1ckles Thanks for the updates! ✨ LGTM

@tommyp1ckles tommyp1ckles requested review from joestringer and nebril and removed request for joestringer and nebril April 28, 2023 04:14
@tommyp1ckles tommyp1ckles added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 28, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 28, 2023
@tommyp1ckles tommyp1ckles added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 28, 2023
@joestringer joestringer merged commit cb8dbfb into cilium:main May 1, 2023
56 checks passed
tommyp1ckles added a commit to tommyp1ckles/cilium that referenced this pull request Aug 25, 2023
Affects:

* operator_ipam_available_ips
* operator_ipam_used_ips
* operator_ipam_needed_ips

These metrics have the label "target_name". Previously, when a Node was
deleted, the metric continued to be emitted by the Prometheus exporter,
leading to confusing sum() values across a cluster.

Fixes changes in cilium#24776

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
aditighag pushed a commit that referenced this pull request Aug 29, 2023
jibi pushed a commit that referenced this pull request Sep 5, 2023
[ upstream commit 5b7b3bb ]

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
youngnick pushed a commit that referenced this pull request Sep 7, 2023
ldelossa pushed a commit to ldelossa/cilium that referenced this pull request Sep 27, 2023
Labels
kind/feature This introduces new functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/ipam IP address management, including cloud IPAM

6 participants