Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace metricsmap-bpf-prom-sync with Prometheus Collector pattern #27370

Merged
merged 1 commit into from
Nov 2, 2023

Conversation

carnerito
Copy link
Contributor

@carnerito carnerito commented Aug 9, 2023

Previously, the 'metricsmap-bpf-prom-sync' controller was responsible for periodically collecting cilium_datapath drop
and forward metrics.
This PR introduces the metricsmapCollector, which implements the Prometheus Collector interface within the
metricsmap package. As a result, the aforementioned controller has been removed.
The metricsmapCollector is registered within the global Prometheus registry during the initialization of metricsmap.
The metrics.Register function has been modified to propagate errors from the registry.Register function instead of
simply overriding it with nil.
The metricsmapCollector comprises two metrics maps: forwardedMetricsMap and droppedMetricsMap.
These maps are populated within a callback function passed to the IterateWithCallback function.
This approach serves two primary purposes:

  1. Separation of Map Iteration and Metric Update: By separating the iteration over the BPF map and the updating of
    Prometheus metrics, the implementation ensures that no partial metrics are exposed in case of map iteration failure.
  2. Normalization of exposed Metrics: Unlike the statement in the bpf/lib/metrics.h comments, which suggests exposing
    only one reason label for forwarded metrics, the eBPF code exposes multiple reasons. Through testing, it was found
    that reasons 0 and 3 were exposed which triggered this error in prometheus client:
    It occurs "collected metric %s %s was collected before with the same name and label values" error in the lastest version prometheus/client_golang#242

Furthermore, cilium command has an indirect dependency on metricsmap package. To prevent metrics map initialization
from happening each time cilium command is executed, metricsmap is converted to hive Cell and injected in cilium daemon
Infrastructure module. For details see this comment: #27370 (comment).

Fixes: #27058

Signed-off-by: Boris Petrovic carnerito.b@gmail.com

@carnerito carnerito requested review from a team as code owners August 9, 2023 07:51
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 9, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Aug 9, 2023
@maintainer-s-little-helper
Copy link

Commit 971a40e9b4830fd094a9fb641dd97b0832be9a8e does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 9, 2023
@maintainer-s-little-helper
Copy link

Commit 971a40e9b4830fd094a9fb641dd97b0832be9a8e does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@dylandreimerink
Copy link
Member

Hi @carnerito, thank you for this contribution. Could you sign off on your commit? https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@carnerito
Copy link
Contributor Author

carnerito commented Aug 14, 2023

@dylandreimerink Can I squash all commits and force-push?

Update
Ignore this, I didn't know that I can amend specific commit not just latest one.

@carnerito carnerito requested review from a team as code owners August 14, 2023 19:13
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 14, 2023
Copy link
Contributor

@zacharysarah zacharysarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carnerito Thanks for the updates! Some minor stylistic changes are required, with strong encouragement to fix some of the inherited language touched by this PR.

Documentation/bpf/toolchain.rst Outdated Show resolved Hide resolved
Documentation/community/governance/commit_access.rst Outdated Show resolved Hide resolved
Documentation/community/governance/commit_access.rst Outdated Show resolved Hide resolved
Documentation/community/roadmap.rst Outdated Show resolved Hide resolved
Documentation/community/roadmap.rst Outdated Show resolved Hide resolved
Documentation/configuration/argocd-issues.rst Outdated Show resolved Hide resolved
Documentation/configuration/argocd-issues.rst Outdated Show resolved Hide resolved
Documentation/glossary.rst Outdated Show resolved Hide resolved
Documentation/installation/cni-chaining-generic-veth.rst Outdated Show resolved Hide resolved
pkg/maps/metricsmap/metricsmap.go Outdated Show resolved Hide resolved
@carnerito carnerito force-pushed the metricsmap-collector branch 2 times, most recently from 42c4cde to d1988ae Compare August 15, 2023 06:27
@dylandreimerink dylandreimerink requested review from zacharysarah and removed request for a team August 16, 2023 08:47
@aanm aanm removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Oct 30, 2023
christarazi pushed a commit to carnerito/cilium that referenced this pull request Oct 30, 2023
Previously, the 'metricsmap-bpf-prom-sync' controller was responsible for periodically collecting cilium_datapath drop
and forward metrics.
This commit introduces the metricsmapCollector, which implements the Prometheus Collector interface within the
metricsmap package. As a result, the aforementioned controller has been removed.
The metricsmapCollector is registered within the global Prometheus registry during the initialization of metricsmap.
The metrics.Register function has been modified to propagate errors from the registry.Register function instead of
simply overriding it with nil.
The metricsmapCollector comprises two metrics maps: forwardedMetricsMap and droppedMetricsMap.
These maps are populated within a callback function passed to the IterateWithCallback function.
This approach serves two primary purposes:
1. Separation of Map Iteration and Metric Update: By separating the iteration over the BPF map and the updating of
   Prometheus metrics, the implementation ensures that no partial metrics are exposed in case of map iteration failure.
2. Normalization of exposed Metrics: Unlike the statement in the bpf/lib/metrics.h comments, which suggests exposing
   only one reason label for forwarded metrics, the eBPF code exposes multiple reasons. Through testing, it was found
   that reasons 0 and 3 were exposed which triggered this error in prometheus client:
   prometheus/client_golang#242

Furthermore, cilium command has indirect dependency on metricsmap package. To prevent metrics map initialization
to happen each time cilium command is executed, metricsmap is converted to hive Cell and injected in cilium daemon
Infrastructure module. For details see this comment: cilium#27370 (comment).

Fixes: cilium#27058

Signed-off-by: Boris Petrovic <carnerito.b@gmail.com>
@christarazi
Copy link
Member

I've rebased the PR to re-run the tests.

@christarazi
Copy link
Member

/test

aanm pushed a commit to carnerito/cilium that referenced this pull request Oct 31, 2023
Previously, the 'metricsmap-bpf-prom-sync' controller was responsible for periodically collecting cilium_datapath drop
and forward metrics.
This commit introduces the metricsmapCollector, which implements the Prometheus Collector interface within the
metricsmap package. As a result, the aforementioned controller has been removed.
The metricsmapCollector is registered within the global Prometheus registry during the initialization of metricsmap.
The metrics.Register function has been modified to propagate errors from the registry.Register function instead of
simply overriding it with nil.
The metricsmapCollector comprises two metrics maps: forwardedMetricsMap and droppedMetricsMap.
These maps are populated within a callback function passed to the IterateWithCallback function.
This approach serves two primary purposes:
1. Separation of Map Iteration and Metric Update: By separating the iteration over the BPF map and the updating of
   Prometheus metrics, the implementation ensures that no partial metrics are exposed in case of map iteration failure.
2. Normalization of exposed Metrics: Unlike the statement in the bpf/lib/metrics.h comments, which suggests exposing
   only one reason label for forwarded metrics, the eBPF code exposes multiple reasons. Through testing, it was found
   that reasons 0 and 3 were exposed which triggered this error in prometheus client:
   prometheus/client_golang#242

Furthermore, cilium command has indirect dependency on metricsmap package. To prevent metrics map initialization
to happen each time cilium command is executed, metricsmap is converted to hive Cell and injected in cilium daemon
Infrastructure module. For details see this comment: cilium#27370 (comment).

Fixes: cilium#27058

Signed-off-by: Boris Petrovic <carnerito.b@gmail.com>
@aanm
Copy link
Member

aanm commented Oct 31, 2023

/test

@carnerito
Copy link
Contributor Author

@aanm Cilium build was failing because of unused import. It's fixed now, could you please rerun tests?

@aanm
Copy link
Member

aanm commented Oct 31, 2023

@carnerito Thanks, can you add the last commit as part of 8e741df? We try that each commit is in a "buildable state"

@carnerito
Copy link
Contributor Author

@aanm Just to be sure, do you want me to squash all commits and force-push?

@aanm
Copy link
Member

aanm commented Oct 31, 2023

@aanm Just to be sure, do you want me to squash all commits and force-push?

@carnerito I didn't look at all commits but now that you mentioned it, yes please squash all commits together and force-push. Thank you!

Previously, the 'metricsmap-bpf-prom-sync' controller was responsible for periodically collecting cilium_datapath drop
and forward metrics.
This commit introduces the metricsmapCollector, which implements the Prometheus Collector interface within the
metricsmap package. As a result, the aforementioned controller has been removed.
The metricsmapCollector is registered within the global Prometheus registry during the initialization of metricsmap.
The metrics.Register function has been modified to propagate errors from the registry.Register function instead of
simply overriding it with nil.
The metricsmapCollector comprises two metrics maps: forwardedMetricsMap and droppedMetricsMap.
These maps are populated within a callback function passed to the IterateWithCallback function.
This approach serves two primary purposes:
1. Separation of Map Iteration and Metric Update: By separating the iteration over the BPF map and the updating of
   Prometheus metrics, the implementation ensures that no partial metrics are exposed in case of map iteration failure.
2. Normalization of exposed Metrics: Unlike the statement in the bpf/lib/metrics.h comments, which suggests exposing
   only one reason label for forwarded metrics, the eBPF code exposes multiple reasons. Through testing, it was found
   that reasons 0 and 3 were exposed which triggered this error in prometheus client:
   prometheus/client_golang#242

Furthermore, cilium command has indirect dependency on metricsmap package. To prevent metrics map initialization
to happen each time cilium command is executed, metricsmap is converted to hive Cell and injected in cilium daemon
Infrastructure module. For details see this comment: cilium#27370 (comment).

Fixes: cilium#27058

Signed-off-by: Boris Petrovic <carnerito.b@gmail.com>
@carnerito
Copy link
Contributor Author

@aanm Done

@aanm
Copy link
Member

aanm commented Oct 31, 2023

/test

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carnerito Thank you for this PR and updates Boris!!

@joestringer
Copy link
Member

Required tests are passing. Seems to be mainly waiting on @cilium/sig-foundations review due to the cells.go change.

@joestringer joestringer removed their request for review November 1, 2023 23:04
@joestringer
Copy link
Member

Removing my review, I was only involved here as triager trying to keep PRs moving.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 2, 2023
@aanm aanm merged commit f8e9472 into cilium:main Nov 2, 2023
57 of 62 checks passed
@pchaigno
Copy link
Member

pchaigno commented Dec 5, 2023

Is it me or this pull request involuntary renamed a bunch of metrics? For example, what used to be called cilium_drop_count_total is now called cilium_datapath_drop_count_total. Isn't that a breaking change?

@carnerito
Copy link
Contributor Author

@pchaigno Yes, it's fixed in #29226

@pchaigno
Copy link
Member

pchaigno commented Dec 5, 2023

Ah, perfect! Thanks! The fix simply didn't reach my PR yet 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace Metricsmap-bpf-prom-sync with Prometheus Collector pattern
10 participants