Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove high cardinality port-distribution metric from default install #13734

Merged

Conversation

jedsalazar
Copy link
Contributor

Open to discussion: The port-distribution metric covers
"Number of packets by destination port number" which
results in a high cardinality metric which arguably provides
minimal value in a default installation. We've seen this
metric cause performance and OOM kill issues in Prometheus
environments in the wild. We should consider removing the
metric from the default metric GSG.

Signed-off-by: Jed Salazar jed@isovalent.com

@jedsalazar jedsalazar requested a review from a team as a code owner October 23, 2020 16:18
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Oct 23, 2020
@michi-covalent michi-covalent added needs-backport/1.8 release-note/misc This PR makes changes that have no direct user impact. labels Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.0-rc3 Oct 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 23, 2020
@joestringer joestringer requested a review from a team October 23, 2020 17:02
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic seems straightforward to me 👍

I'd welcome someone from Hubble side to provide feedback as well to (1) make sure hubble folks are aware of this change and (2) whether there are any other mitigations or alternatives to consider

Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change seems reasonable to me, but there are other places that refer to the port-distribution metric. probably makes sense to fix these as well:

@jedsalazar
Copy link
Contributor Author

Thanks for pointing out the additional default references, @michi-covalent. Should I remove the references to those files as well in this PR?

Open to discussion: The port-distribution metric covers "Number of packets by destination port number" which results in a high cardinality metric with arguably minimal value from a default installation. We should consider removing the metric from the default metric GSG.

Signed-off-by: Jed Salazar jed@isovalent.com
@jedsalazar jedsalazar force-pushed the pr/remove-high-cardinality-metric-from-gsg branch from 4bd1f2b to 7740778 Compare October 23, 2020 18:31
@jedsalazar jedsalazar requested a review from a team as a code owner October 23, 2020 18:31
@jedsalazar jedsalazar requested a review from a team October 23, 2020 18:31
Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced up with @jedsalazar offline.

  • fixed install/kubernetes/Makefile.
  • cleaned up all the other references to port-distribution so that it's disabled by default.
  • added a note in grafana section that port-distribution is disabled by default.

thanks @jedsalazar for taking care of this 💯 . one thing we should follow up is https://cilium.slack.com/archives/CQRL1EPAA/p1603439414032600 it looks like reply packets are not marked as such, and that caused source ports to show up as destination ports.

@@ -164,7 +164,6 @@ data:
drop
tcp
flow
port-distribution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for experimental, it might make sense to keep it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glibsm we can turn this back on, but i'd feel a bit safer if we can figure out the underlying issue first https://cilium.slack.com/archives/CQRL1EPAA/p1603439414032600

@gandro
Copy link
Member

gandro commented Oct 26, 2020

The underlying reason for this is that the port-distribution implementation contains the following bug: #13744 (comment)

I'm not sure we need to disable the metric if we fix the bug. Do we believe that the port distribution metric would also cause high cardinality if it did not incooperate the ports of reply packets?

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving a blocking review here, as I have a PR now which fixes the underlying bug of the port distribution sometimes confusing source ports as destination ports: #13750

If port-distribution has high cardinally that causes problems even with the bug fix applied, then I think it makes sense to remove it from the defaults and I'll approve this PR. But from the original description, it seems that the main motivation for this change here is to circumvent the bug we have.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. The motivation for the PR is not only informed by the bug (which my PR addresses), but to reduce the amount of noise in general. In that regard, lgtm!

@aanm aanm merged commit 8909c7f into cilium:master Oct 26, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.5 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 Oct 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 28, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.5 Oct 28, 2020
@joestringer joestringer moved this from In progress to Done in 1.10.0 Feb 12, 2021
@joestringer joestringer moved this from In progress to Done in 1.10.0 Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact.
Projects
No open projects
1.8.5
Backport done to v1.8
1.9.0-rc3
Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

None yet

10 participants