fix(controller): Remove name label from some k8s client metrics #2851

Merged
merged 1 commit into from
Jul 7, 2023

Conversation

SuperQ
Contributor

@SuperQ SuperQ commented Jun 19, 2023

The name label in the controller_clientset_k8s_request_total metric
produces excessive cardinality for events and replicasets.
This can lead to hundreds of thousands of unique series over a couple of
weeks in a large deployment. Set the name to "N/A" for these client request
types.
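
A minimal sketch of the idea described above, not the actual diff: before a request is counted, the name label is replaced with "N/A" for the two noisy kinds. The identifiers `highCardinalityKinds` and `nameLabel` are hypothetical and chosen for illustration only; the real change lives in controller/metrics/client.go and may be structured differently.

```go
package main

import "fmt"

// highCardinalityKinds lists the resource kinds whose object names carry
// generated suffixes. The two kinds come from the PR description; the
// variable name is hypothetical.
var highCardinalityKinds = map[string]bool{
	"events":      true,
	"replicasets": true,
}

// nameLabel returns the value to record in the "name" label for a request
// against the given kind. Noisy kinds collapse to the constant "N/A";
// every other kind keeps its object name.
func nameLabel(kind, name string) string {
	if highCardinalityKinds[kind] {
		return "N/A"
	}
	return name
}

func main() {
	fmt.Println(nameLabel("events", "my-rollout.176a3b2c4d5e6f70")) // N/A
	fmt.Println(nameLabel("replicasets", "my-rollout-7d9f8c6b5"))   // N/A
	fmt.Println(nameLabel("rollouts", "my-rollout"))                // my-rollout
}
```

With this shape, requests for rollouts still carry the object name, while events and replicasets contribute a bounded number of series per namespace, status code, and verb.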

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@SuperQ SuperQ changed the title Remove name label from k8s client metrics fix: Remove name label from k8s client metrics Jun 19, 2023
@SuperQ SuperQ changed the title fix: Remove name label from k8s client metrics fix(controller): Remove name label from k8s client metrics Jun 19, 2023
@github-actions
Contributor

github-actions bot commented Jun 19, 2023

Go Published Test Results

1 989 tests   1 989 ✔️  2m 35s ⏱️
   118 suites         0 💤
       1 files           0

Results for commit e508d29.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Jun 19, 2023

E2E Tests Published Test Results

    4 files      4 suites   3h 19m 28s ⏱️
  96 tests   82 ✔️   5 💤   9
402 runs  364 ✔️ 20 💤 18

For more details on these failures, see this check.

Results for commit e508d29.

♻️ This comment has been updated with latest results.

@codecov

codecov bot commented Jun 22, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.01 ⚠️

Comparison is base (51c867d) 81.66% compared to head (61740e9) 81.66%.

❗ Current head 61740e9 differs from pull request most recent head e508d29. Consider uploading reports for the commit e508d29 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2851      +/-   ##
==========================================
- Coverage   81.66%   81.66%   -0.01%     
==========================================
  Files         133      133              
  Lines       20192    20187       -5     
==========================================
- Hits        16490    16485       -5     
  Misses       2849     2849              
  Partials      853      853              
Impacted Files Coverage Δ
controller/metrics/prommetrics.go 100.00% <ø> (ø)
controller/metrics/client.go 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zachaller
Collaborator

I wonder if, instead of dropping name from everything, we could just set name to something like "N/A" for the resources that have random strings in their names, or strip the random parts off if that is easy to do.

It's nice to have the name of the rollout object that is causing a request to be made, but I agree the random fields are an anti-pattern and should probably be cleaned up.
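
For illustration, the "strip the random parts off" alternative might look roughly like the sketch below. It assumes name shapes that this thread does not confirm: ReplicaSet names ending in a `-<hash>` suffix and Event names ending in a `.<hex>` suffix. The function `stableName` and both patterns are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumed name shapes (not confirmed in this thread):
//   replicasets: "<owner>-<generated hash>"
//   events:      "<object>.<hex suffix>"
var (
	replicaSetSuffix = regexp.MustCompile(`-[a-z0-9]{5,10}$`)
	eventSuffix      = regexp.MustCompile(`\.[0-9a-f]{8,}$`)
)

// stableName removes the trailing generated suffix for kinds that are
// assumed to have one, and returns other names unchanged.
func stableName(kind, name string) string {
	switch kind {
	case "replicasets":
		return replicaSetSuffix.ReplaceAllString(name, "")
	case "events":
		return eventSuffix.ReplaceAllString(name, "")
	default:
		return name
	}
}

func main() {
	fmt.Println(stableName("replicasets", "my-rollout-7d9f8c6b5"))   // my-rollout
	fmt.Println(stableName("events", "my-rollout.176a3b2c4d5e6f70")) // my-rollout
	// The heuristic cannot tell a hash from an ordinary word:
	fmt.Println(stableName("replicasets", "my-service")) // my
}
```

The last call shows the weakness of pattern-based stripping: a name that merely looks like it ends in a hash gets truncated, so a constant placeholder is the more predictable option.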

@SuperQ
Contributor Author

SuperQ commented Jun 30, 2023

@zachaller The problem is that name causes huge cardinality leaks, since every rollout generates a bunch of new series. We end up with 50,000 new series every week in one of our dev clusters. We currently drop the whole metric in order to reduce the load on our Prometheus server. If the argo-rollouts binary is up for 2 weeks, we end up with 100,000 metrics and it takes 3 seconds just to scrape the data.

@SuperQ
Contributor Author

SuperQ commented Jun 30, 2023

Let me look over some of the raw data to see if it's possible to filter out specific resources.

@SuperQ
Contributor Author

SuperQ commented Jun 30, 2023

Another source of cardinality problems is namespace. We generate ephemeral namespaces for PRs, so we end up with things like:

controller_clientset_k8s_request_total{kind="rollouts",name="some-service-deployment",namespace="test-service-12345",status_code="409",verb="Update"}

Over a week we get 2000 of these metrics.

@SuperQ
Contributor Author

SuperQ commented Jun 30, 2023

For kind="events" we generate 20k/week.
For kind="replicasets" we generate 10k/week.

If you really want to keep the name label just for kind="rollouts", it's not ideal, but not the worst cardinality problem.

@zachaller
Collaborator

One other thing: what version of Rollouts are you on? Before this, it would have been even worse. I do agree that the cardinality here can get bad; we have turned it off as well. If we remove name/namespace on the RO object, though, I think the whole stat becomes even less useful, because k8s itself has request metrics at the kind level.

@SuperQ
Contributor Author

SuperQ commented Jun 30, 2023

We are running 1.5.x.

The `name` label in the `controller_clientset_k8s_request_total` metric
produces excessive cardinality for `events` and `replicasets`.
This can lead to hundreds of thousands of unique series over a couple of
weeks in a large deployment. Set the name to "N/A" for these client request
types.

Signed-off-by: SuperQ <superq@gmail.com>
@sonarcloud

sonarcloud bot commented Jun 30, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No Coverage information
0.0% Duplication

@SuperQ SuperQ changed the title fix(controller): Remove name label from k8s client metrics fix(controller): Remove name label from some k8s client metrics Jun 30, 2023
@SuperQ
Contributor Author

SuperQ commented Jun 30, 2023

Ok, I've simplified this to just the top two sources of cardinality leaks.

@zachaller zachaller merged commit 65fbefe into argoproj:master Jul 7, 2023
25 checks passed