Remove relevant metrics series on pod deletion (#23162). #23385
Conversation
force-pushed from 6697e71 to 521d22d
force-pushed from 521d22d to 6174ef8
force-pushed from 6174ef8 to 61dc3e9
Thanks for the PR. A few comments below.
@@ -64,6 +86,18 @@ func initMetrics(address string, enabled api.Map, grpcMetrics *grpc_prometheus.S
	errChan <- srv.ListenAndServe()
}()

go func() {
There is nothing ensuring this goroutine gets cleaned up; we should be waiting on it somewhere.
I changed the implementation to use a workqueue, and now the goroutine exits after queue.Shutdown() is called, but I'm not sure where to put that call. A similar goroutine for the metrics web server doesn't seem to have anything that stops it. Can you point me to where I should implement this "shutdown hook"?
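For context, here is a minimal sketch (assuming k8s.io/client-go/util/workqueue; not the PR's actual code) of the shutdown behavior described above: Get() keeps handing out queued items until the queue is drained, and reports shutdown == true once ShutDown() has been called, so the consumer goroutine exits on its own and can be waited on.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.New()

	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			item, shutdown := queue.Get()
			if shutdown {
				return // ShutDown() was called and the queue is drained
			}
			fmt.Println("processing", item) // hypothetical work
			queue.Done(item)
		}
	}()

	queue.Add("pod-a")
	queue.ShutDown() // already-queued items are still delivered first
	<-done           // this is the "waiting on it" the review asks for
}
```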
Good point, I hadn't noticed that. I guess we don't properly handle that either. It would require refactoring most of the code in this package to work on a struct with methods. We can probably fix this later; I think it's a bit unrelated to this PR.
@@ -40,7 +48,21 @@ func ProcessFlow(ctx context.Context, flow *pb.Flow) error {
	return nil
}

// ProcessPodDeletion removes the metric series associated with a deleted pod
func ProcessPodDeletion(pod *slim_corev1.Pod) error {
Similar to @christarazi's comment, I wonder if this could be a method on some object instead, and then the queue would be a field on a struct rather than a global.
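A hypothetical sketch of that shape: the queue becomes a struct field rather than a package-level global, and shutdown gets a clear owner. The type and constructor names are invented; the slim_corev1 alias comes from the diff above, though its import path here is an assumption.

```go
package metrics

import (
	slim_corev1 "github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// podDeletionHandler owns the queue, so its lifecycle (including
// shutdown) no longer lives in package-level state.
type podDeletionHandler struct {
	queue workqueue.Interface
}

func newPodDeletionHandler() *podDeletionHandler {
	return &podDeletionHandler{queue: workqueue.New()}
}

// ProcessPodDeletion enqueues the deleted pod for metric series cleanup.
func (h *podDeletionHandler) ProcessPodDeletion(pod *slim_corev1.Pod) error {
	h.queue.Add(pod)
	return nil
}

// Shutdown stops the consumer goroutine by shutting the queue down.
func (h *podDeletionHandler) Shutdown() {
	h.queue.ShutDown()
}
```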
force-pushed from 629e271 to 006ebf9
This looks great. I'd like another Hubble maintainer to review it as well, though.
Sorry. It looks like GitHub automatically removes assignments when I click on re-request review.
I updated the commit message.
> Sorry. It looks like GitHub automatically removes assignments when I click on re-request review.

OK, I didn't know that.

> I updated the commit message.

Great description, thanks a lot!
/test
Runtime test failure looks unrelated, I filed an issue for it.
/test-runtime
Reviews are in and the tests passed eventually! Deferring to TopHat for the rest of the operations. Thanks!
Looks like Travis CI failed to run. Do we know why? If not, we may need to rebase and rerun CI 😞
I don't :/. Scrolling back on Travis to the 26th of January (I couldn't find a way to filter based on PRs), I can't find any build for this PR. Not sure what happened.
I tried to close and reopen to trigger the Travis CI job, but that didn't seem to work.
Using pod source/destinationContext may result in high metrics cardinality. Additionally, this one component exposes metrics for all pods running on the same k8s node. Over time, with each pod created and removed, the number of metric data series will grow, which will eventually result in slower and bigger responses on the metrics endpoint and may even lead to OOM problems in the cilium-agent itself or in the Prometheus metrics collector. This change removes data series bound to certain pods when those pods are deleted. This is safe, because after a pod is deleted its metric values will never change. A 1-minute delay is introduced between pod deletion and metric data series removal, so scrapers have time to collect the last value of a given metric.

Fixes: cilium#23162

Signed-off-by: Marek Chodor <mchodor@google.com>
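To make the described mechanism concrete, here is a hedged sketch (not the PR's code; the metric name, label names, and the onPodDeleted helper are made up for illustration) using prometheus/client_golang, whose DeletePartialMatch drops every series matching a subset of labels:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var flowsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "hubble_flows_total"},
	[]string{"pod", "verdict"},
)

// onPodDeleted (hypothetical) delays removal so scrapers can still
// collect the final value, then drops all of the pod's series.
func onPodDeleted(podName string) {
	go func() {
		time.Sleep(time.Minute) // grace period for one last scrape
		flowsTotal.DeletePartialMatch(prometheus.Labels{"pod": podName})
	}()
}
```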
force-pushed from 6faa010 to 5ee8e22
/test
k8s-1.16-kernel-4.19 hit #23845, re-triggering:
/test-1.16-4.19
I still can't see any link to the details for the Travis build though 🤔
Here it is at last: https://github.com/cilium/cilium/pull/23385/checks?check_run_id=11488636578
Wait, that's a good one. The link to the Travis job on the current PR in fact redirects to the build from Paul's draft PR. Go figure 🙃
Same error
4.19 is very flaky. We're working on the flake affecting those CI jobs, but in the meantime I'd still like to get a green run because that's a lot of coverage to ignore 😬
Once more
Signed-off-by: Marek Chodor <mchodor@google.com>
Remove pod-attributed Hubble metrics after the pod gets deleted.
Fixes: #23162