Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand agent metric Policy Import Errors to count all policy changes #23349

Merged
merged 1 commit into from Feb 15, 2023

Conversation

dlapcevic
Copy link
Contributor

Change agent metric name from policy_import_errors_total to policy_change_total. Now it counts all policy changes (Add, Update, Delete) based on outcome ("success" or "fail"). The metric can be used to show the percentage of success/failed network policy changes.

Signed-off-by: Dorde Lapcevic <dordel@google.com>

@dlapcevic dlapcevic requested review from a team as code owners January 25, 2023 15:35
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 25, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 25, 2023
@sayboras sayboras added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jan 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 26, 2023
@sayboras sayboras added area/metrics Impacts statistics / metrics gathering, eg via Prometheus. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 26, 2023
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this PR is making sense to me.

I vaguely remember that we normally go with a 2-phase commit:

  • Add new metrics and mark existing metrics as deprecated
  • Remove deprecated metrics in the next release.

Another convention is to split the metric into 2 separate metrics (e.g. policy_change_success_total, policy_change_failure_total).

🤔

@dlapcevic
Copy link
Contributor Author

Hi @sayboras,
Do you suggest that in this PR I add the new metric and only mark as deprecated but not remove the old one?
And then in the next PR remove the deprecated metric?

I would prefer to make it a single metric instead of having two, for success and failure total. I think it’s easier to track and maintain this way. I was looking at how existing metrics of totals are implemented, such as EndpointRegenerationTotal (https://github.com/cilium/cilium/blob/master/pkg/metrics/metrics.go#L694).

@aanm aanm added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jan 29, 2023
@sayboras
Copy link
Member

@chancez

FYI. I think you have shared the best practice on when to use 2 metrics (e.g. avoid using filter all time time), keen to hear your view here.

@dlapcevic
Copy link
Contributor Author

Rebase done.
I made changes to only add the new metric, and mark the old metric as deprecated in documentation and with comments.

Copy link
Contributor

@chancez chancez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I think this is reasonable, but I feel like we could potentially include more information in the metric about failures in particular.

I see a few places where we could have a "reason" that a policy change failed, for example: parsing errors. Is this something we should consider adding as a label? How many failure causes do we have for policy updates?

Additionally, it seems we could potentially have the source of the change (add/update/delete). Would any of these dimensions be useful for us?

Lastly, I noticed you added a whole new location to where these metrics are updated, in pkg/k8s/watchers/cilium_network_policy.go which previously did not touch the import errors metric. Aren't we double counting by doing it here and inside the Handle method?

@dlapcevic
Copy link
Contributor Author

Hi @chancez, thanks for the review and the suggestions.

I think that your suggestions to add "reason" and "source" labels make sense. Both can be useful when debugging, and provide better insight when something goes wrong with applying network policies.
I would prefer if we make it simple in the first commit, and add more labels as follow-up.

I don't see where metric generation is duplicated. The Handle method in cilium/daemon/cmd/policy.go is for the rest API client, and isn't in the NP and CNP watchers code path. All 3 (Handle, NP watcher, CNP watcher) rely on PolicyAdd(), which doesn't generate this metric.
If I missed something, please point it out.

@chancez
Copy link
Contributor

chancez commented Jan 31, 2023

I don't see where metric generation is duplicated. The Handle method in cilium/daemon/cmd/policy.go is for the rest API client, and isn't in the NP and CNP watchers code path. All 3 (Handle, NP watcher, CNP watcher) rely on PolicyAdd(), which doesn't generate this metric.

I'm not as familiar with this area of the project, so I was mostly just making sure, since you added new lines of code in new areas that previously didn't have metrics. What consumes the rest API client where that was added then? In general that area previously wasn't touching these metrics, so I'm trying to figure out how that impacts when/where metrics are updated.

@dlapcevic
Copy link
Contributor Author

Ok, this should explain.
I'm just moving the metric generation out to watcher handler funcs. They existed in methods addCiliumNetworkPolicyV2() and updateCiliumNetworkPolicyV2() for old metric, so I didn't add metric generation code there for the new metric, because they were actually calling each other, and it was messy -- one line of code for generating metrics for each error. But it already returns an error and can be handled in the caller func.

@chancez
Copy link
Contributor

chancez commented Jan 31, 2023

Okay yeah I see that now. Thanks for the explanation. Overall I don't think we need to block on adding any other labels for failure case, but i think it's worth considering. LGTM overall though.

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help to rebase ? The diff looks good to me, only one minor comment on using new metrics in dashboard and monitor.

Documentation/operations/upgrade.rst Show resolved Hide resolved
@dlapcevic
Copy link
Contributor Author

I updated the code and resolved the comments. Please rerun the tests.

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me, thanks

@qmonnet
Copy link
Member

qmonnet commented Feb 1, 2023

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Failed to reach 192.168.56.11:80 from testclient-host-lcwfz

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

@dlapcevic
Copy link
Contributor Author

The failed tests are not related to the change, and most of them seem to be particularly flaking.

@qmonnet, can you please help me move it forward?

Change agent metric name from policy_import_errors_total to policy_change_total. Now it counts all policy changes (Add, Update, Delete) based on outcome ("success" or "fail"). The metric can be used to show the percentage of success/failed network policy changes.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
@qmonnet
Copy link
Member

qmonnet commented Feb 6, 2023

/test

@dlapcevic
Copy link
Contributor Author

The IntegrationTests fails only for one case that is a recent flake and is being fixed here 84e9641

@qmonnet
Copy link
Member

qmonnet commented Feb 6, 2023

Yes, and the other failures are due to issues with VM provisioning. I'll re-trigger them.

@qmonnet
Copy link
Member

qmonnet commented Feb 6, 2023

/test-1.16-4.9

@qmonnet
Copy link
Member

qmonnet commented Feb 6, 2023

/test-1.25-4.19

@qmonnet
Copy link
Member

qmonnet commented Feb 6, 2023

/test-1.26-net-next

@dlapcevic
Copy link
Contributor Author

Hi @qmonnet, can we merge this?

@qmonnet
Copy link
Member

qmonnet commented Feb 10, 2023

/test-1.16-4.9

@qmonnet
Copy link
Member

qmonnet commented Feb 10, 2023

/test-1.25-4.19

@qmonnet
Copy link
Member

qmonnet commented Feb 10, 2023

/test-1.26-net-next

@qmonnet
Copy link
Member

qmonnet commented Feb 10, 2023

Hi @qmonnet, can we merge this?

Apparently not, looks like the CI was still not behaving last time I triggered. Let's give it another try. We're also missing some reviews, cc @aditighag @nathanjsweet

@sayboras
Copy link
Member

CI seems to be good now, ci-verifier workflow requires rebase, but I don't think it's related to this change, so we can skip re-run unless there are other changes as part of review.

@dlapcevic
Copy link
Contributor Author

Hi @aditighag @nathanjsweet,
Could you please take this review?

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine to me. I was wondering how do we add labels for the failure cases specifically in future?

// Deprecated in Cilium 1.14, to be removed in 1.15.

Can you file an issue for this task?

@dlapcevic
Copy link
Contributor Author

We would add more labels (e.g. “reason”) to the metric initialization, and then also add the reason to each WithLabelValues() call, in the same order that the labels are defined.
Example: APILimiterProcessingDuration - Init code, WithLabelValues code

Created an issue for removing the deprecated metric in Cilium 1.15. #23747

@dlapcevic
Copy link
Contributor Author

@sayboras can we merge this now?

@sayboras
Copy link
Member

Reviews from required teams are done, CI jobs are all passed, marking this ready to merge.

@sayboras sayboras added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 15, 2023
@nbusseneau nbusseneau merged commit 7ab3303 into cilium:master Feb 15, 2023
@nbusseneau
Copy link
Member

@dlapcevic Thanks for your contribution!

@qmonnet
Copy link
Member

qmonnet commented Feb 15, 2023

I would like to add, thanks for bearing with your reviewers, and for the pings to drive this to completion :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants