Expand agent metric Policy Import Errors to count all policy changes #23349
Conversation
The goal of this PR makes sense to me.
I vaguely remember that we normally go with a 2-phase commit:
- Add new metrics and mark existing metrics as deprecated
- Remove deprecated metrics in the next release.
Another convention is to split the metric into 2 separate metrics (e.g. policy_change_success_total, policy_change_failure_total).
🤔
Hi @sayboras, I would prefer to make it a single metric instead of having two, one each for the success and failure totals. I think it's easier to track and maintain this way. I was looking at how existing total metrics are implemented, such as …
FYI, I think you have shared the best practice on when to use two metrics (e.g. to avoid filtering all the time); keen to hear your view here.
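The single-metric-with-label approach under discussion can be sketched with a toy counter. This is plain-stdlib Go for illustration only, not Cilium's actual Prometheus code; the type and metric names are assumptions standing in for a Prometheus `CounterVec`:

```go
package main

import "fmt"

// counterVec is a toy stand-in for a Prometheus CounterVec: one metric
// ("policy_change_total") with an "outcome" label, instead of two separate
// success/failure counters. Names are illustrative, not Cilium's real API.
type counterVec struct {
	name   string
	counts map[string]float64 // label value -> count
}

func (c *counterVec) inc(outcome string) { c.counts[outcome]++ }

func main() {
	m := &counterVec{name: "policy_change_total", counts: map[string]float64{}}

	// Simulate nine successful policy changes and one failed one.
	for i := 0; i < 9; i++ {
		m.inc("success")
	}
	m.inc("failure")

	// A single labeled metric lets a dashboard compute a success ratio
	// without joining two separate series.
	total := m.counts["success"] + m.counts["failure"]
	fmt.Printf("success ratio: %.0f%%\n", 100*m.counts["success"]/total)
}
```

The trade-off raised above still applies: queries that only care about failures must filter on the label, whereas two separate metrics avoid that filter at the cost of an extra series.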
Force-pushed from d0d8f75 to c279699
Rebase done.
Generally I think this is reasonable, but I feel like we could potentially include more information in the metric about failures in particular.
I see a few places where we could have a "reason" that a policy change failed, for example: parsing errors. Is this something we should consider adding as a label? How many failure causes do we have for policy updates?
Additionally, it seems we could potentially have the source of the change (add/update/delete). Would any of these dimensions be useful for us?
Lastly, I noticed you added a whole new location where these metrics are updated, in pkg/k8s/watchers/cilium_network_policy.go, which previously did not touch the import errors metric. Aren't we double counting by doing it there and inside the Handle method?
Hi @chancez, thanks for the review and the suggestions. I think your suggestions to add "reason" and "source" labels make sense. Both can be useful when debugging and provide better insight when something goes wrong with applying network policies. I don't see where metric generation is duplicated. The …
I'm not as familiar with this area of the project, so I was mostly just making sure, since you added new lines of code in areas that previously didn't have metrics. What consumes the REST API client where that was added, then? In general that area previously wasn't touching these metrics, so I'm trying to figure out how that impacts when and where metrics are updated.
OK, this should explain it.
Okay, yeah, I see that now. Thanks for the explanation. Overall I don't think we need to block on adding any other labels for the failure case, but I think it's worth considering. LGTM overall, though.
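The "reason" and "source" dimensions discussed above could be sketched as extra label keys on the same counter. Again a stdlib-only toy, with hypothetical label names (`outcome`, `source`, `reason`) rather than the metric's final schema:

```go
package main

import "fmt"

// policyChangeCounter is a hypothetical multi-label extension of a single
// policy_change_total metric, keyed on (outcome, source, reason) as
// suggested in review. Label names are illustrative only.
type policyChangeCounter struct {
	counts map[[3]string]float64 // [outcome, source, reason] -> count
}

func (c *policyChangeCounter) inc(outcome, source, reason string) {
	c.counts[[3]string{outcome, source, reason}]++
}

func main() {
	m := &policyChangeCounter{counts: map[[3]string]float64{}}

	// A successful Add has no failure reason; a failed Update records one.
	m.inc("success", "add", "")
	m.inc("failure", "update", "parse_error")

	// Render in a Prometheus-exposition-like form (order not guaranteed).
	for k, v := range m.counts {
		fmt.Printf("policy_change_total{outcome=%q,source=%q,reason=%q} %g\n",
			k[0], k[1], k[2], v)
	}
}
```

The cardinality cost is the usual caveat: each distinct (outcome, source, reason) combination becomes its own time series, so the reason label would need a small, bounded set of values.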
Can you help to rebase? The diff looks good to me; only one minor comment on using the new metrics in the dashboard and monitoring.
Force-pushed from ca51f33 to 7b979f0
I updated the code and resolved the comments. Please rerun the tests.
Looks OK to me, thanks
/test
Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed.
The failed tests are not related to this change, and most of them appear to be flakes. @qmonnet, can you please help me move this forward?
Change the agent metric name from policy_import_errors_total to policy_change_total. It now counts all policy changes (Add, Update, Delete) by outcome ("success" or "fail"). The metric can be used to show the percentage of successful/failed network policy changes.
Signed-off-by: Dorde Lapcevic <dordel@google.com>
Force-pushed from 7b979f0 to 6a079ca
/test
The IntegrationTests fail only for one case, which is a recent flake being fixed in 84e9641.
Yes, and the other failures are due to issues with VM provisioning. I'll re-trigger them.
/test-1.16-4.9
/test-1.25-4.19
/test-1.26-net-next
Hi @qmonnet, can we merge this?
/test-1.16-4.9
/test-1.25-4.19
/test-1.26-net-next
Apparently not; it looks like CI was still misbehaving the last time I triggered it. Let's give it another try. We're also missing some reviews, cc @aditighag @nathanjsweet
CI seems to be good now. The ci-verifier workflow requires a rebase, but I don't think that's related to this change, so we can skip the re-run unless there are other changes as part of the review.
Hi @aditighag @nathanjsweet, |
Changes look fine to me. I was wondering how we would add labels for the failure cases specifically in the future.
// Deprecated in Cilium 1.14, to be removed in 1.15.
Can you file an issue for this task?
We would add more labels (e.g. "reason") to the metric initialization, and then also add the reason to each …
Created an issue for removing the deprecated metric in Cilium 1.15: #23747
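The two-phase deprecation described earlier (keep the old metric marked deprecated for one release, remove it in the next) can be sketched as follows. This is a stdlib-only illustration under assumed names, not Cilium's actual metrics registration code:

```go
package main

import "fmt"

// During the transition release, both the deprecated and the replacement
// metric are registered and incremented together, so dashboards and alerts
// can migrate before the old series disappears. Names are illustrative.
const (
	// Deprecated in Cilium 1.14, to be removed in 1.15.
	oldMetric = "policy_import_errors_total"
	newMetric = "policy_change_total"
)

func main() {
	counts := map[string]float64{}

	// On a failed policy import, bump both metrics for one release cycle.
	counts[oldMetric]++
	counts[newMetric]++

	fmt.Println(counts[oldMetric], counts[newMetric])
}
```

Once consumers have moved to the new series, the next release simply drops the `oldMetric` registration and its increments.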
@sayboras can we merge this now? |
Reviews from the required teams are done and the CI jobs have all passed; marking this ready to merge.
@dlapcevic Thanks for your contribution! |
I would like to add, thanks for bearing with your reviewers, and for the pings to drive this to completion :) |