Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds metric for monitoring the service programming duration #32055

Merged
merged 2 commits into from
Apr 29, 2024

Conversation

ovidiutirla
Copy link
Contributor

@ovidiutirla ovidiutirla commented Apr 18, 2024

Adds a new metric (service_implementation_delay ) for monitoring the service programming duration. The metrics measures the execution time of the service event handler reflecting the time it took to program the service. From the time the service or pod was changed to the time the change was propagated.

The metric also take into consideration the duration to program the in-cluster network (Network programming latency SLIs) which is useful to monitor and understand network bottlenecks in large clusters.

The metric should not impact the current performance of the handler allowing us to have it enabled by default.

Adds `service_implementation_delay` metric accounting the duration in seconds to propagate the data plane programming of a service, its network and endpoints from the time the service or the service pod was changed excluding the event queue latency

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 18, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Apr 18, 2024
@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch from e197036 to 1ac4e82 Compare April 18, 2024 14:20
@ovidiutirla ovidiutirla marked this pull request as ready for review April 18, 2024 14:20
@ovidiutirla ovidiutirla requested review from a team as code owners April 18, 2024 14:20
Copy link
Contributor

@dlapcevic dlapcevic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks Ovidiu!

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/k8s/watchers/watcher.go Outdated Show resolved Hide resolved
@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch 2 times, most recently from 8311f7a to f5e7ff6 Compare April 18, 2024 17:35
Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ovidiutirla Thank you for this PR! I think we should perhaps revisit the naming here as service_programming is a little confusion since we are measure cilium k8s service event handling time.

Documentation/observability/metrics.rst Outdated Show resolved Hide resolved
pkg/k8s/watchers/watcher.go Outdated Show resolved Hide resolved
pkg/k8s/watchers/watcher.go Outdated Show resolved Hide resolved
@qmonnet qmonnet added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. labels Apr 22, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 22, 2024
@qmonnet
Copy link
Member

qmonnet commented Apr 22, 2024

/test

@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch from 0a23e49 to ebb9caf Compare April 22, 2024 10:36
@qmonnet
Copy link
Member

qmonnet commented Apr 22, 2024

/test

@ovidiutirla
Copy link
Contributor Author

Looks like all tests passed, I'll squash the commits and rebase on main.

@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch from fcf22cc to de3981d Compare April 22, 2024 14:50
Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! just one question

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch from de3981d to a829122 Compare April 23, 2024 08:56
@squeed
Copy link
Contributor

squeed commented Apr 23, 2024

The documentation says, "Duration in seconds to propagate the programming of a service [...] from the time the service was changed." The code as-written ignores queue latency. Is that desired? I assume not. I'd suggest setting the "start" timestamp in ServiceCache.emitEvent() instead. Or, update the documentation to indicate it is dataplane programming latency, not end-to-end latency.

Ideally we could determine the original enqueue time in the Resource queue itself (thoughts, @joamaki), but that's overkill for this PR.

@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch from eaf2edb to fa03a55 Compare April 25, 2024 08:08
@qmonnet
Copy link
Member

qmonnet commented Apr 25, 2024

/test

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the update to pkg/metrics/metrics.go could all go into the first commit, rather than having the description fixed in the second commit. But that's not blocking.

Looks good from my side, thanks!

@qmonnet qmonnet closed this Apr 26, 2024
@qmonnet qmonnet reopened this Apr 26, 2024
@qmonnet qmonnet enabled auto-merge April 26, 2024 08:57
@qmonnet qmonnet disabled auto-merge April 26, 2024 08:57
@qmonnet
Copy link
Member

qmonnet commented Apr 26, 2024

Sorry, there was a bit of fat-fingering on the phone UI on my end for the auto-merge. The close/reopen was to re-trigger the image build.
/test

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ovidiutirla Thank you for the updates!

pkg/k8s/watchers/watcher.go Outdated Show resolved Hide resolved
@aanm aanm requested a review from marseel April 29, 2024 08:20
@ovidiutirla ovidiutirla force-pushed the feature/network-programming-duration branch from fa03a55 to 5fadfa3 Compare April 29, 2024 09:57
@aanm
Copy link
Member

aanm commented Apr 29, 2024

/test

The metrics measures the execution time of the service event handler

Signed-off-by: Ovidiu Tirla <otirla@google.com>
Signed-off-by: Ovidiu Tirla <otirla@google.com>
@aanm aanm force-pushed the feature/network-programming-duration branch from 5fadfa3 to f7fc729 Compare April 29, 2024 10:15
@aanm
Copy link
Member

aanm commented Apr 29, 2024

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 29, 2024
@aanm aanm enabled auto-merge April 29, 2024 13:54
@aanm aanm added this pull request to the merge queue Apr 29, 2024
Merged via the queue into cilium:main with commit 31571be Apr 29, 2024
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants