Remove high cardinality port-distribution metric from default install #13734
The logic seems straightforward to me 👍
I'd welcome feedback from someone on the Hubble side as well, to (1) make sure the Hubble folks are aware of this change and (2) see whether there are any other mitigations or alternatives to consider.
the change seems reasonable to me, but there are other places that refer to the port-distribution metric. probably makes sense to fix these as well:
- https://github.com/cilium/cilium/blob/master/examples/kubernetes/addons/prometheus/files/grafana-dashboards/hubble-dashboard.json sample grafana dashboard. we could keep it here i guess, but then the graph will be empty by default.
- https://github.com/cilium/cilium/blob/master/Documentation/gettingstarted/grafana.rst there is a screenshot for the dashboard as well.
- https://github.com/cilium/cilium/blob/master/install/kubernetes/Makefile: used to generate {experimental,quick}-install.yaml.
cilium/install/kubernetes/cilium/values.yaml
Line 448 in 91748c7
# --set metrics.enabled="{dns:query;ignoreAAAA,drop,tcp,flow,port-distribution,icmp,http}"
Thanks for pointing out the additional default references, @michi-covalent. Should I remove the references to those files as well in this PR?
Open to discussion: The port-distribution metric covers "Number of packets by destination port number" which results in a high cardinality metric with arguably minimal value from a default installation. We should consider removing the metric from the default metric GSG.
Signed-off-by: Jed Salazar <jed@isovalent.com>
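To make the cardinality concern concrete, here is a minimal sketch of a counter keyed by destination port. It is a toy model and not Hubble's implementation: the `(protocol, port)` label pair, the `count_series` helper, and the sample flows are all assumptions made for illustration. Each distinct label combination becomes its own time series in Prometheus, so the series count grows with the number of distinct ports observed.

```python
from collections import Counter


def count_series(flows):
    """Count packets per (protocol, destination port) label pair.

    Hypothetical helper: each distinct key stands for one time series,
    so len(packets) is the cardinality of the metric.
    """
    packets = Counter()
    for protocol, dst_port in flows:
        packets[(protocol, dst_port)] += 1
    return len(packets), packets


# Illustrative flows only: five packets across three distinct label pairs.
flows = [("TCP", 443), ("TCP", 443), ("UDP", 53), ("TCP", 8080), ("TCP", 443)]
n_series, packets = count_series(flows)
print(n_series, dict(packets))
```

With a small, fixed set of service ports the series count stays bounded; it is the number of distinct ports, not the packet volume, that drives memory use on the Prometheus side.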
4bd1f2b to 7740778
synced up with @jedsalazar offline.
- fixed install/kubernetes/Makefile.
- cleaned up all the other references to port-distribution so that it's disabled by default.
- added a note in grafana section that port-distribution is disabled by default.
thanks @jedsalazar for taking care of this 💯. one thing we should follow up on is https://cilium.slack.com/archives/CQRL1EPAA/p1603439414032600: it looks like reply packets are not marked as such, and that caused source ports to show up as destination ports.
@@ -164,7 +164,6 @@ data:
drop
tcp
flow
port-distribution
for experimental, it might make sense to keep it
@glibsm we can turn this back on, but i'd feel a bit safer if we can figure out the underlying issue first https://cilium.slack.com/archives/CQRL1EPAA/p1603439414032600
The underlying reason for this is that the [...] I'm not sure we need to disable the metric if we fix the bug. Do we believe that the port distribution metric would also cause high cardinality if it did not incorporate the ports of reply packets?
I'm leaving a blocking review here, as I now have a PR which fixes the underlying bug of the port distribution sometimes confusing source ports with destination ports: #13750
If port-distribution has high cardinality that causes problems even with the bug fix applied, then I think it makes sense to remove it from the defaults and I'll approve this PR. But from the original description, it seems that the main motivation for this change is to circumvent the bug we have.
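The effect of the bug on cardinality can be sketched with a toy model. This is not the code from #13750 or from Hubble; `observed_dst_ports`, the `mark_replies` flag, and the ephemeral port range are assumptions chosen for illustration. The model assumes that, once replies are recognised, only the request direction contributes a destination port.

```python
import random


def observed_dst_ports(connections, mark_replies):
    """Return the set of ports a destination-port counter would record.

    connections: iterable of (client_src_port, server_dst_port) pairs.
    mark_replies: when False, reply packets are counted like requests,
    so the client's ephemeral port is recorded as a destination port.
    """
    ports = set()
    for src_port, dst_port in connections:
        ports.add(dst_port)  # request packet: client -> server
        if not mark_replies:
            # reply packet: server -> client, destined to the ephemeral port
            ports.add(src_port)
    return ports


# Assumed workload: 1000 client connections to one service on port 443,
# each from a random ephemeral source port (seeded for repeatability).
rng = random.Random(0)
connections = [(rng.randint(32768, 60999), 443) for _ in range(1000)]

buggy = observed_dst_ports(connections, mark_replies=False)
fixed = observed_dst_ports(connections, mark_replies=True)
print(len(buggy), len(fixed))
```

In this model the fixed variant records only port 443, while the buggy variant records one extra port per distinct ephemeral source port. Whether the metric is still too noisy after the fix depends on how many distinct server ports a real cluster exposes, which this sketch does not attempt to estimate.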
Discussed offline. The motivation for the PR is not only the bug (which my PR addresses), but also to reduce the amount of noise in general. In that regard, lgtm!
Open to discussion: The port-distribution metric covers "Number of
packets by destination port number" which results in a high
cardinality metric which arguably provides minimal value in a
default installation. We've seen this metric cause performance and
OOM kill issues in Prometheus environments in the wild. We should
consider removing the metric from the default metric GSG.

Signed-off-by: Jed Salazar <jed@isovalent.com>