Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

services: Fix metric from not publishing #26719

Merged

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented Jul 7, 2023

  • metrics, service: Rename metric for clarity
  • services: Fix metric from not publishing
Fix missing metric "cilium_services_events_total"

Fixes: #26511

@christarazi christarazi requested review from a team as code owners July 7, 2023 20:54
@christarazi christarazi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jul 7, 2023
@christarazi christarazi added area/metrics Impacts statistics / metrics gathering, eg via Prometheus. needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch area/loadbalancing Impacts load-balancing and Kubernetes service implementations needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 7, 2023
@christarazi christarazi requested a review from chancez July 7, 2023 20:54
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jul 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.0 Jul 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.5 Jul 7, 2023
@christarazi christarazi force-pushed the pr/christarazi/fix-svc-metric branch from dd8255b to 38e71ab Compare July 7, 2023 20:56
@christarazi christarazi changed the title pr/christarazi/fix svc metric services: Fix metric from not publishing Jul 7, 2023
@christarazi christarazi added the sig/agent Cilium agent related. label Jul 7, 2023
@chancez
Copy link
Contributor

chancez commented Jul 10, 2023

Odd. I'm reading the docs on WithLabelValues to figure out the cause of the behavior described in your commit message. It refers to https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#CounterVec.GetMetricWithLabelValues which says:

Keeping the Counter for later use is possible (and should be considered if performance is critical), but keep in mind that Reset, DeleteLabelValues and Delete can be used to delete the Counter from the CounterVec. In that case, the Counter will still exist, but it will not be exported anymore, even if a Counter with the same label values is created later.

Do we do anything like call Reset or Delete*?

@@ -33,12 +33,6 @@ import (

const anyPort = "*"

var (
updateMetric = metrics.ServicesCount.WithLabelValues("update")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is called before NewLegacyMetrics initializes ServicesCount, and calls NoOpCounterVec.WithLabelValues (returns NoOpGauge). That's the reason services_events_total is missing. Is my understanding correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you're right. I opened a small thread about this in #development and that's what Andre suspected as well. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. How can we prevent this from reoccurring?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure to be honest. We have https://github.com/cilium/customvet where we can write our own linting checkers. However, this is the only case that I've found in the code base. Moving forward there is an effort to move metrics into their modules (see #25651) which should help, but techincally not a guarantee.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seams quite reasonable to me. I guess I didn't question the structure, but now that I think about it a bit, they're likely defined this way (NoOp) because of the problem of having global metrics defined in one place rather than per module. Due to them being exposed as global vars, they can be accessed at any time without any guarantee that they've been initialized, so NoOp wrapper was created to prevent panics / crashes in case that happened. That's my guess.

cc @aanm (for adding NoOp) and @dylandreimerink (for whether we still need the idea of "NoOp" metrics going forward)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option is to get rid of these NoOp metrics potentially. The approach we're using is a bit odd from my perspective but I haven't really dug into why we do it this way

The NoOp metrics exist as a stopgap, currently a lot of the code base still uses global variables to get metrics which is subject to ordering.

they're likely defined this way (NoOp) because of the problem of having global metrics defined in one place rather than per module. Due to them being exposed as global vars, they can be accessed at any time without any guarantee that they've been initialized, so NoOp wrapper was created to prevent panics / crashes in case that happened.

Yes, that is indeed correct.

Ideally, all metrics are provided to components via parameters during construction, but this conversion takes some time. We can remove NoOps once all global metric variables are gone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an intermediate step I mentioned that wasn't addressed: You can have globals, without NoOp metrics, but not call registry.Register(yourMetric). This avoids ordering issues and nil globals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I've updated the second commit's description to describe the global variable vs initialization ordering as the root cause of the bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an intermediate step I mentioned that wasn't addressed: You can have globals, without NoOp metrics, but not call registry.Register(yourMetric). This avoids ordering issues and nil globals.

Yes, you could init the global var and then upon creating the registry determine if you want to add that metric to the registry or not.

I am not sure why NoOps were chosen over that approach, it is not something invented for the modular metrics, it was already the way things worked. The main thing that changed was init ordering.

I would argue that removing all global variables in favor of dependency injection ASAP is a better use of cycles then refactoring the NoOps away first.

Copy link
Member

@ysksuzuki ysksuzuki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me

The metric name is called "cilium_services_events_total" yet the
variable name is called ServicesCount. The typical code pattern is to
name the variable after a substring of the metric name for ease of grep.

This commit does so and is a non-functional change.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-svc-metric branch 2 times, most recently from 8b79650 to 4f66806 Compare July 20, 2023 22:26
When the metric variable is defined as a global variable (within the
`var` scope at the package level), then it will be instantiated as a
NoOp metric. Once the metrics package is initialized, then all the
metrics variables will transition from NoOp metrics to a real metric
type. This problem occurred because the global variables instantiation
happened before the metrics package initialization.

This commit fixes it by using the metrics variable after the metrics
package has been initialized. We can assume it's been initialized when
the code executed is production ("live") code.

Fixes: cilium#26511
Fixes: 978b27c ("Metrics: Add services metrics")
Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-svc-metric branch from 4f66806 to 722fcea Compare July 20, 2023 22:29
@christarazi
Copy link
Member Author

@chancez Is there anything that you think is blocking merging this PR?

@christarazi
Copy link
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 21, 2023
@dylandreimerink dylandreimerink merged commit b86dab8 into cilium:main Jul 21, 2023
65 checks passed
@christarazi christarazi deleted the pr/christarazi/fix-svc-metric branch July 21, 2023 20:50
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
8 tasks
@nbusseneau nbusseneau added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Jul 24, 2023
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
21 tasks
@nbusseneau nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 24, 2023
@aanm aanm moved this from Needs backport from main to Backport done to v1.14 in 1.14.0 Jul 26, 2023
@gentoo-root gentoo-root added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Jul 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport done to v1.13 in 1.13.5 Jul 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport done to v1.13 in 1.13.5 Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/loadbalancing Impacts load-balancing and Kubernetes service implementations area/metrics Impacts statistics / metrics gathering, eg via Prometheus. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related.
Projects
No open projects
1.13.5
Backport done to v1.13
1.14.0
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

Missing services_events_total metric
6 participants