Metrics associated with a deleted node should not be reported #28382

Merged
merged 1 commit into from Oct 12, 2023

Conversation

derailed
Contributor

@derailed derailed commented Oct 3, 2023

When a node is deleted from a cluster, metrics associated with that node are still exported to Prometheus.
Rather than requiring an agent restart, these metrics should be deleted dynamically when a node is removed from the cluster.

This PR ensures node_connectivity_status and node_connectivity_latency_seconds no longer report metrics for nodes that are no longer present in the cluster.
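
To illustrate the idea, here is a minimal, stdlib-only sketch of pruning per-node series once a node leaves the cluster. It is not the code from this PR: `nodeGauge`, `seriesKey`, `pruneStale`, and the label names are hypothetical stand-ins. A real agent would presumably call the delete operations that the Prometheus Go client offers on metric vectors (for example `DeleteLabelValues`) rather than manage a map by hand.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// seriesKey identifies one time series by its label values.
// Both label names are hypothetical; the PR does not list the real labels.
type seriesKey struct {
	targetNode string
	probeType  string
}

// nodeGauge is a stdlib-only stand-in for a labeled gauge vector.
type nodeGauge struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newNodeGauge() *nodeGauge {
	return &nodeGauge{series: make(map[seriesKey]float64)}
}

// Set records the latest value for one (node, probe) series.
func (g *nodeGauge) Set(targetNode, probeType string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.series[seriesKey{targetNode, probeType}] = v
}

// Nodes returns the sorted, de-duplicated node names that still have series.
func (g *nodeGauge) Nodes() []string {
	g.mu.Lock()
	defer g.mu.Unlock()
	seen := make(map[string]struct{})
	for k := range g.series {
		seen[k.targetNode] = struct{}{}
	}
	nodes := make([]string, 0, len(seen))
	for n := range seen {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)
	return nodes
}

// pruneStale deletes every series whose node is absent from live and
// returns how many series were removed.
func pruneStale(g *nodeGauge, live map[string]struct{}) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	removed := 0
	for k := range g.series {
		if _, ok := live[k.targetNode]; !ok {
			delete(g.series, k)
			removed++
		}
	}
	return removed
}

func main() {
	g := newNodeGauge()
	g.Set("node-a", "icmp", 1)
	g.Set("node-a", "http", 1)
	g.Set("node-b", "icmp", 1)

	// node-a has been deleted from the cluster; only node-b is still live.
	removed := pruneStale(g, map[string]struct{}{"node-b": {}})
	fmt.Println(removed, g.Nodes())
}
```

Without such pruning, every series for a deleted node stays in the exported set until the process restarts, which is the cardinality growth the release note refers to.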

Fixes: #issue-number

Cilium now properly deletes stale (deleted) nodes from the node_connectivity_status and node_connectivity_latency_seconds metrics, reducing metric cardinality.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 3, 2023
@derailed
Contributor Author

derailed commented Oct 3, 2023

/test

@derailed derailed marked this pull request as ready for review October 4, 2023 00:08
@derailed derailed requested review from a team as code owners October 4, 2023 00:08
…er reported.

When a node is deleted from a cluster, metrics associated with that node
are still being exported to prometheus. Short of restarting the agent,
we want to dynamically delete these metrics when a node is removed from the cluster.

This PR ensures node_connectivity_status and node_connectivity_latency
no longer report metrics for nodes that are no longer present on the
cluster.

Signed-off-by: Fernand Galiana <fernand.galiana@isovalent.com>
@derailed
Contributor Author

derailed commented Oct 4, 2023

/test

@christarazi christarazi added kind/bug This is a bug in the Cilium logic. kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. labels Oct 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 4, 2023
@christarazi
Member

christarazi commented Oct 4, 2023

Just a minor nit on the release note:

When a node is deleted from a cluster, we must ensure Prometheus metrics associated with the deleted node are no longer reported. Notably: node_connectivity_status and node_connectivity_latency_seconds.

Typically we frame release notes by describing the impact, so something like:

Cilium now properly deletes stale (deleted) nodes from the node_connectivity_status and node_connectivity_latency_seconds metrics, reducing metric cardinality.

@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/health Relates to the cilium-health component labels Oct 4, 2023
Member

@christarazi christarazi left a comment

AFAIU, there's already a distinction for metrics that can be deleted, so this PR just follows that pattern. IMO, we can merge this now to fix the ongoing problems with the node connectivity metrics, and then follow up with a refactor.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 11, 2023
@squeed
Contributor

squeed commented Oct 12, 2023

Checkpatch is complaining that the commit subject line is too long. Overridden.

@squeed squeed merged commit e9f97cd into cilium:main Oct 12, 2023
59 of 61 checks passed
@mikejennings

Did this make it into the newest release?

@derailed derailed added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch backport-pending/1.12 backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Oct 31, 2023
@derailed derailed added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Nov 3, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Nov 8, 2023
@christarazi christarazi added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Nov 8, 2023
@christarazi christarazi added the backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. label Nov 8, 2023
@github-actions github-actions bot added the backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. label Nov 8, 2023
Labels

  • area/daemon: Impacts operation of the Cilium daemon.
  • area/health: Relates to the cilium-health component
  • area/metrics: Impacts statistics / metrics gathering, eg via Prometheus.
  • backport-done/1.12: The backport for Cilium 1.12.x for this PR is done.
  • backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
  • backport-done/1.14: The backport for Cilium 1.14.x for this PR is done.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/enhancement: This would improve or streamline existing functionality.
  • ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
  • release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.

6 participants