pkg/ipam: Update histogram buckets for trigger metrics #25600

hemanthmalla · 2023-05-22T20:01:43Z

Currently, trigger-related histogram metrics in pkg/ipam use the default Prometheus histogram buckets. Resync operations in cloud providers like Azure tend to take a long time, and the current buckets are inadequate to track changes in behavior. This commit extends the buckets to allow for measuring longer durations.

Currently, some metrics plateau at 10 seconds, which is the largest finite bucket in the Prometheus defaults.

This reuses the buckets that the Kubernetes API server defines for measuring request duration.

gandro

The change overall looks fine by me. But I don't know much about metric best practises. Is there any overlap with the ongoing work in #25256?

hemanthmalla · 2023-05-23T17:07:24Z

I don't think it's related, since we're only updating the histogram buckets. #25256 doesn't seem to touch bucket values.

hemanthmalla · 2023-05-23T17:07:51Z

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing With host policy Tests NodePort

Failure Output

FAIL: Can not connect to service "tftp://192.168.56.12:30528/hello" from outside cluster (1/10)

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/95/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

hemanthmalla · 2023-05-24T15:19:16Z

@gandro net-next and test-runtime seem to be flaky now? I don't see the corresponding Jenkins jobs for the failures. Are we cleaning up Jenkins jobs in less than a day now?

gandro · 2023-05-25T09:06:36Z

> @gandro net-next and test-runtime seem to be flaky now? I don't see the corresponding Jenkins jobs for the failures. Are we cleaning up Jenkins jobs in less than a day now?

We had a Jenkins outage last week and had to re-provision all Jenkins instances. During that process, we accidentally fell back to a Jenkins config that only retained the last 30 jobs. That has now been fixed to retain job logs for up to 15 days. If you rerun those pipelines, you should be able to access the failures now.

Currently, trigger-related histogram metrics in pkg/ipam use the default Prometheus histogram buckets. Resync operations in cloud providers like Azure tend to take a long time, and the current buckets are inadequate to track changes in behavior. This commit extends the buckets to allow for measuring longer durations.

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>

hemanthmalla · 2023-06-05T21:26:23Z

/test

[CMPT-1682] Backport cilium#25600 to 1.11

hemanthmalla requested a review from a team as a code owner May 22, 2023 20:01

hemanthmalla requested a review from gandro May 22, 2023 20:01

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 22, 2023

gandro approved these changes May 23, 2023

hemanthmalla added the release-note/misc This PR makes changes that have no direct user impact. label May 23, 2023

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 23, 2023

hemanthmalla force-pushed the hmalla/ipam_histogram_buckets branch from 12f34e3 to 3410629 Compare June 5, 2023 21:26

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 7, 2023

dylandreimerink merged commit 6df9d10 into cilium:main Jun 8, 2023
61 checks passed

antonipp mentioned this pull request Jul 4, 2023

[CMPT-1682] Backport https://github.com/cilium/cilium/pull/25600 to 1.11 DataDog/cilium#512

Merged

antonipp added a commit to DataDog/cilium that referenced this pull request Jul 4, 2023

Merge pull request #512 from DataDog/ai/backport-pr-25600

c50d793

[CMPT-1682] Backport cilium#25600 to 1.11

jaredledvina mentioned this pull request Jul 12, 2023

v1.12 Cherry-picks DataDog/cilium#514

Merged
