Remove relevant metrics series on pod deletion (#23162). #23385
Conversation
force-pushed from 6697e71 to 521d22d
force-pushed from 521d22d to 6174ef8
force-pushed from 6174ef8 to 61dc3e9
Thanks for the PR. A few comments below.
@@ -64,6 +86,18 @@ func initMetrics(address string, enabled api.Map, grpcMetrics *grpc_prometheus.S
	errChan <- srv.ListenAndServe()
}()

go func() {
There is nothing ensuring this goroutine gets cleaned up; we should be waiting on it somewhere.
I changed the implementation to use a workqueue, and now the goroutine exits after queue.Shutdown() is called, but I'm not sure where to put that call. A similar goroutine for the metrics web server doesn't seem to have anything that stops it. Can you point me to where I should implement this "shutdown hook"?
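For context, here is a minimal sketch (assuming k8s.io/client-go/util/workqueue; not the PR's actual code) of the shutdown behavior described above: Get() keeps handing out queued items until the queue is drained, and reports shutdown == true once ShutDown() has been called, so the consumer goroutine exits on its own and can be waited on.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.New()

	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			item, shutdown := queue.Get()
			if shutdown {
				return // ShutDown() was called and the queue is drained
			}
			fmt.Println("processing", item) // hypothetical work
			queue.Done(item)
		}
	}()

	queue.Add("pod-a")
	queue.ShutDown() // already-queued items are still delivered first
	<-done           // this is the "waiting on it" the review asks for
}
```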
Good point, I hadn't noticed that. I guess we don't properly handle that either. It would require refactoring most of the code in this package to work on a struct with methods. We can probably fix this later; I think it's a bit unrelated to this PR.
@@ -40,7 +48,21 @@ func ProcessFlow(ctx context.Context, flow *pb.Flow) error {
	return nil
}

// ProcessPodDeletion removes the metric series associated with a deleted pod
func ProcessPodDeletion(pod *slim_corev1.Pod) error {
Similar to @christarazi's comment, I wonder if this could be a method on some object instead, and then the queue would be a field on a struct rather than a global.
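A hypothetical sketch of that shape: the queue becomes a struct field rather than a package-level global, and shutdown gets a clear owner. The type and constructor names are invented; the slim_corev1 alias comes from the diff above, though its import path here is an assumption.

```go
package metrics

import (
	slim_corev1 "github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// podDeletionHandler owns the queue, so its lifecycle (including
// shutdown) no longer lives in package-level state.
type podDeletionHandler struct {
	queue workqueue.Interface
}

func newPodDeletionHandler() *podDeletionHandler {
	return &podDeletionHandler{queue: workqueue.New()}
}

// ProcessPodDeletion enqueues the deleted pod for metric series cleanup.
func (h *podDeletionHandler) ProcessPodDeletion(pod *slim_corev1.Pod) error {
	h.queue.Add(pod)
	return nil
}

// Shutdown stops the consumer goroutine by shutting the queue down.
func (h *podDeletionHandler) Shutdown() {
	h.queue.ShutDown()
}
```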
force-pushed from 629e271 to 006ebf9
This looks great. I'd like another Hubble maintainer to review it as well, though.
Sorry. It looks like GitHub automatically removes assignments when I click on re-request review.
I updated the commit message.
> Sorry. It looks like GitHub automatically removes assignments when I click on re-request review.

OK, I didn't know that.

> I updated the commit message.

Great description, thanks a lot!
/test
Runtime test failure looks unrelated, I filed an issue for it.
/test-runtime
Reviews are in and the tests passed eventually! Deferring to TopHat for the rest of the operations. Thanks!
Looks like Travis CI failed to run. Do we know why? If not, we may need to rebase and rerun CI 😞
I don't :/. Scrolling back on Travis to the 26th of January (I couldn't find a way to filter based on PRs), I can't find any build for this PR. Not sure what happened.
I tried to close and reopen to trigger the Travis CI job, but that didn't seem to work.
Using pod source/destinationContext may result in high metrics cardinality. Additionally, this one component exposes metrics for all pods running on the same k8s node. Over time, with each pod created and removed, the number of metric data series will grow, which will eventually result in slower and bigger responses on the metrics endpoint and may even lead to OOM problems in the cilium-agent itself or in the Prometheus metrics collector. This change removes data series bound to certain pods when those pods are deleted. This is safe, because after a pod is deleted its metric values will never change. A 1-minute delay is introduced between pod deletion and metric data series removal, so scrapers have time to collect the last value of a given metric.

Fixes: cilium#23162

Signed-off-by: Marek Chodor <mchodor@google.com>
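To make the described mechanism concrete, here is a hedged sketch (not the PR's code; the metric name, label names, and the onPodDeleted helper are made up for illustration) using prometheus/client_golang, whose DeletePartialMatch drops every series matching a subset of labels:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var flowsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "hubble_flows_total"},
	[]string{"pod", "verdict"},
)

// onPodDeleted (hypothetical) delays removal so scrapers can still
// collect the final value, then drops all of the pod's series.
func onPodDeleted(podName string) {
	go func() {
		time.Sleep(time.Minute) // grace period for one last scrape
		flowsTotal.DeletePartialMatch(prometheus.Labels{"pod": podName})
	}()
}
```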
force-pushed from 6faa010 to 5ee8e22
/test
k8s-1.16-kernel-4.19 hit #23845, re-triggering:
/test-1.16-4.19
I still can't see any link to the details for the Travis build though 🤔
Here it is at last: https://github.com/cilium/cilium/pull/23385/checks?check_run_id=11488636578
Wait, that's a good one. The link to the Travis job on the current PR in fact redirects to the build from Paul's draft PR. Go figure 🙃
Same error
4.19 is very flaky. We're working on the flake affecting those CI jobs, but in the meantime I'd still like to get a green run because that's a lot of coverage to ignore 😬
Once more
Signed-off-by: Marek Chodor <mchodor@google.com>
Remove pod-attributed Hubble metrics after the pod gets deleted.
Fixes: #23162