Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove relevant metrics series on pod deletion (#23162). #23385

Merged
merged 1 commit into from Feb 22, 2023

Conversation

marqc
Copy link
Contributor

@marqc marqc commented Jan 26, 2023

Signed-off-by: Marek Chodor mchodor@google.com

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

Remove pod attributed hubble metrics after pod gets deleted.

Fixes: #23162

@marqc marqc requested review from a team as code owners January 26, 2023 13:30
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 26, 2023
@marqc marqc requested a review from gandro January 26, 2023 13:30
@marqc marqc marked this pull request as draft January 26, 2023 13:30
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 26, 2023
@marqc marqc force-pushed the stale_metric_series_removal branch from 6697e71 to 521d22d Compare February 2, 2023 15:42
@marqc marqc force-pushed the stale_metric_series_removal branch from 521d22d to 6174ef8 Compare February 14, 2023 15:08
@marqc marqc marked this pull request as ready for review February 14, 2023 15:09
@marqc marqc force-pushed the stale_metric_series_removal branch from 6174ef8 to 61dc3e9 Compare February 14, 2023 15:40
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. A few comments below.

pkg/hubble/metrics/api/api.go Outdated Show resolved Hide resolved
pkg/hubble/metrics/api/context.go Outdated Show resolved Hide resolved
pkg/hubble/metrics/metrics.go Outdated Show resolved Hide resolved
@@ -64,6 +86,18 @@ func initMetrics(address string, enabled api.Map, grpcMetrics *grpc_prometheus.S
errChan <- srv.ListenAndServe()
}()

go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing ensuring this go routine gets cleaned up, we should be waiting on it somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the implementation to workqueue and right now goroutine will exit after calling queue.Shutdown() but I'm not sure where to put it. A similar goroutine for webserver with metrics seems to not have anything that could stop it. Can you point me where to implement this "shutdown hook"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I didn't notice that. I guess we don't do properly handle that either. It would require refactoring most of the code in this package to work on a struct with methods. We can probably we can fix this later, I think it's a bit unrelated to this PR.

@@ -40,7 +48,21 @@ func ProcessFlow(ctx context.Context, flow *pb.Flow) error {
return nil
}

// initMetrics initialies the metrics system
func ProcessPodDeletion(pod *slim_corev1.Pod) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to @christarazi 's comment, I wonder if this could be a method on some object instead, and then the queue would be a field on a struct rather than a global.

@marqc marqc force-pushed the stale_metric_series_removal branch 5 times, most recently from 629e271 to 006ebf9 Compare February 15, 2023 14:47
Copy link
Contributor

@chancez chancez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, I think I'd like to have another hubble maintainer review it also though.

@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. area/daemon Impacts operation of the Cilium daemon. release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay labels Feb 15, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 15, 2023
@marqc marqc requested a review from qmonnet February 17, 2023 10:17
@marqc
Copy link
Contributor Author

marqc commented Feb 17, 2023

Please, do not remove review assignments. (I'm not re-adding Aditi because you already have reviews from members of sig-policy.)

Sorry. It looks like github automatically removes assignments when I click on re-request review.

This is not the case. The change in your PR is not trivial, so I would really appreciate some more detailed context and motivation in the commit description. It should be possible to understand the issue and how it's addressed just by looking at the Git logs, without having to look up for the GitHub issue. Nit: The issue fixed by the commit should be in a Fixes: tag at the end of the description, rather than in the commit object.

I updated the commit message.

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. It looks like github automatically removes assignments when I click on re-request review.

OK, I didn't know that

I updated the commit message.

Great description, thanks a lot!

@qmonnet
Copy link
Member

qmonnet commented Feb 17, 2023

/test

@qmonnet
Copy link
Member

qmonnet commented Feb 17, 2023

Runtime test failure looks unrelated, I filed an issue for it.
/test-runtime

@qmonnet
Copy link
Member

qmonnet commented Feb 17, 2023

stderr: fatal: unable to access 'https://github.com/cilium/cilium/': Failed to connect to github.com port 443: Connection refused (link) - We're not in luck.

/test-runtime

@qmonnet
Copy link
Member

qmonnet commented Feb 20, 2023

Reviews are in and the tests passed eventually! Deferring to TopHat for the rest of the operations. Thanks!

@qmonnet qmonnet added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 20, 2023
@pchaigno
Copy link
Member

Looks like Travis CI failed to run. Do we know why? If not, we may need to rebase and rerun CI 😞

@pchaigno pchaigno removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 20, 2023
@qmonnet
Copy link
Member

qmonnet commented Feb 20, 2023

Looks like Travis CI failed to run. Do we know why?

I don't :/. Scrolling back until 26th of January on Travis (couldn't find a way to filter based on PRs), I can't find any build for this PR. Not sure what happened.

@pchaigno pchaigno closed this Feb 20, 2023
@pchaigno pchaigno reopened this Feb 20, 2023
@pchaigno
Copy link
Member

I tried to close and reopen to trigger the Travis CI job but that didn't seem to work.
@marqc Could you rebase your branch to see if it helps? Sorry for the trouble 😕

Using pod source/destinationContext may result in big metrics
cardinality. Additionally this one component is exposing metrics for all
pods running on the same k8s node. Over time with each created and
removed pod number of metric data series will grow which will eventually
result in slower and bigger responses on metrics endpoint and may even
lead to OOM problems in cilium-agent itself or prometheus metrics
collector.

This change removes data series bound do certain pods when they are
deleted. This is safe, because after pod deletion its metric value will
never change.

There is a 1 minute delay introduced between pod deletion and metric
data series removal, so scrapers will have time to collect last value of
given metric.

Fixes: cilium#23162
Signed-off-by: Marek Chodor <mchodor@google.com>
@marqc marqc force-pushed the stale_metric_series_removal branch from 6faa010 to 5ee8e22 Compare February 20, 2023 20:02
@marqc
Copy link
Contributor Author

marqc commented Feb 20, 2023

I tried to close and reopen to trigger the Travis CI job but that didn't seem to work. @marqc Could you rebase your branch to see if it helps? Sorry for the trouble 😕

@pchaigno done

@pchaigno
Copy link
Member

/test

@qmonnet
Copy link
Member

qmonnet commented Feb 21, 2023

k8s-1.16-kernel-4.19 hit #23845, re-triggering:

/test-1.16-4.19

I still can't see any link to the details for the Travis build though 🤔

@pchaigno
Copy link
Member

pchaigno commented Feb 21, 2023

At this point, I suspect it's something weird with this branch, @marqc's fork, or @marqc's account itself 🤷 I've opened #23900 with these changes, just to run the Travis CI job.

EDIT: Hm, just when I did, the job started 🤔

@qmonnet
Copy link
Member

qmonnet commented Feb 21, 2023

Here it is at last https://github.com/cilium/cilium/pull/23385/checks?check_run_id=11488636578

Wait, that's a good one. The link to the Travis job on the current PR redirects in fact to the build from Paul's draft PR. Go figure 🙃

@qmonnet
Copy link
Member

qmonnet commented Feb 21, 2023

Same error
/test-1.16-4.19

@pchaigno
Copy link
Member

pchaigno commented Feb 21, 2023

4.19 is very flaky. We're working on the flake affecting those CI jobs, but in the meantime I'd still like to get a green run because that's a lot of coverage to ignore 😬
/test-1.16-4.19

@qmonnet
Copy link
Member

qmonnet commented Feb 22, 2023

Once more
/test-1.16-4.19

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. release-note/misc This PR makes changes that have no direct user impact. sig/hubble Impacts hubble server or relay
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CFP: Improvements in hubble metrics cardinality
8 participants