
operator/metrics: Adjust bucket sizes #21860

Merged
merged 3 commits into from
Nov 24, 2022

Conversation

AwesomePatrol
Contributor

@AwesomePatrol AwesomePatrol commented Oct 24, 2022

In CiliumEndpointSliceDensity, the histogram buckets configuration option was unset, so the defaults were used (they have values from 0.005 to 10, as seen here: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#pkg-variables).
This change makes the width of the buckets 10, as documented.

In CiliumEndpointSliceQueueDelay, the defaults were also used. After this change, the largest bucket is 1 hour, which corresponds to the values observed in large clusters while keeping granularity at small queue delays.

Before:

# HELP cilium_operator_number_of_ceps_per_ces The number of CEPs batched in a CES
# TYPE cilium_operator_number_of_ceps_per_ces histogram
cilium_operator_number_of_ceps_per_ces_bucket{le="0.005"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="0.01"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="0.025"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="0.05"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="0.1"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="0.25"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="0.5"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="1"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="2.5"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="5"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="10"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="+Inf"} 0
cilium_operator_number_of_ceps_per_ces_sum 0
cilium_operator_number_of_ceps_per_ces_count 0

(cilium_operator_ces_queueing_delay_seconds has the same default distribution)

After:

# HELP cilium_operator_number_of_ceps_per_ces The number of CEPs batched in a CES
# TYPE cilium_operator_number_of_ceps_per_ces histogram
cilium_operator_number_of_ceps_per_ces_bucket{le="1"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="10"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="25"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="50"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="100"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="200"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="500"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="1000"} 0
cilium_operator_number_of_ceps_per_ces_bucket{le="+Inf"} 0
cilium_operator_number_of_ceps_per_ces_sum 0
cilium_operator_number_of_ceps_per_ces_count 0
# HELP cilium_operator_ces_queueing_delay_seconds CiliumEndpointSlice queueing delay in seconds
# TYPE cilium_operator_ces_queueing_delay_seconds histogram
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.005"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.01"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.025"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.05"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.1"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.25"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="0.5"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="1"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="2.5"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="5"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="10"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="60"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="300"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="900"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="1800"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="3600"} 0
cilium_operator_ces_queueing_delay_seconds_bucket{le="+Inf"} 0
cilium_operator_ces_queueing_delay_seconds_sum 0
cilium_operator_ces_queueing_delay_seconds_count 0
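As a rough sanity check on the new delay distribution, the sketch below (hypothetical, not Cilium code; the bucket bounds are copied from the exposition output above and `firstBucket` is an invented helper) reports the first bucket that would count a given queueing delay, showing that sub-second delays keep fine granularity while multi-minute delays are no longer lumped into `+Inf`:

```go
package main

import "fmt"

// firstBucket returns the upper bound of the first histogram bucket that
// would count the given delay, or "+Inf" if it exceeds the largest bound.
func firstBucket(delaySeconds float64) string {
	bounds := []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5,
		1, 2.5, 5, 10, 60, 300, 900, 1800, 3600}
	for _, b := range bounds {
		if delaySeconds <= b {
			return fmt.Sprint(b)
		}
	}
	return "+Inf"
}

func main() {
	fmt.Println(firstBucket(0.03)) // small delays: fine-grained, prints "0.05"
	fmt.Println(firstBucket(45))   // minutes-scale delay, prints "60"
	fmt.Println(firstBucket(7200)) // beyond the 1h bucket, prints "+Inf"
}
```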
Adjust CES bucket sizes for metrics

@AwesomePatrol AwesomePatrol requested a review from a team as a code owner October 24, 2022 12:54
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 24, 2022
@dlapcevic
Contributor

/lgtm

@maintainer-s-little-helper

Commits 76b1ca82dd4101334da1d77729d7a3d2d92d3f58, ca6f28688778dc9c15f850e29b0309464da671f0 do not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 27, 2022
@AwesomePatrol AwesomePatrol changed the title operator: Fix bucket width for CEP histogram to the documented values operator/metrics: Adjust bucket sizes Oct 27, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 27, 2022
@nebril nebril added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Nov 2, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 2, 2022
@nebril
Member

nebril commented Nov 2, 2022

Thanks for the PR!

@nebril nebril added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Nov 2, 2022
@nebril
Member

nebril commented Nov 2, 2022

/test

Job 'Cilium-PR-K8s-1.23-kernel-5.4' failed:


Test Name

K8sDatapathConfig Iptables Skip conntrack for pod traffic

Failure Output

FAIL: Cannot flush conntrack table

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-5.4 so I can create one.

Job 'Cilium-PR-K8s-1.24-kernel-4.19' failed:


Test Name

K8sDatapathLRPTests Checks local redirect policy LRP connectivity

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-4.19 so I can create one.

operator/metrics/metrics.go (review comment, outdated and resolved)
Contributor

@chancez chancez left a comment


This LGTM, configurable buckets is a bigger topic, so we can defer that :)

@AwesomePatrol
Contributor Author

No changes aside from removing Change-Id from the commit messages, because some checks complained about it.

@AwesomePatrol
Contributor Author

I rebased the changes hoping that the CI pipeline will pass this time, but now it looks like it requires some approval and/or is failing. Could you please take a look? I doubt this small change is causing any of the tests to fail.

In CiliumEndpointSliceDensity histogram buckets configuration option was
unset, so defaults were used (they have values from 0.005 to 10 as seen
here:
https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#pkg-variables).
This change makes the width of the buckets 10 as documented.

BUG=254474623

Signed-off-by: Aleksander Mistewicz <amistewicz@google.com>
In CiliumEndpointSliceQueueDelay histogram buckets configuration option was
unset, so defaults were used
(https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#pkg-variables).
This change doubles the number of buckets and increases the end of the last
bucket to 1 hour as values larger than this can be observed in large clusters.

Signed-off-by: Aleksander Mistewicz <amistewicz@google.com>
It gives nicer-looking values than the computed version.

Signed-off-by: Aleksander Mistewicz <amistewicz@google.com>
Labels
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

6 participants