Expand agent metric Policy Import Errors to count all policy changes #23349
Conversation
The goal of this PR makes sense to me.
I vaguely remember that we normally go with a 2-phase commit:
- Add new metrics and mark existing metrics as deprecated
- Remove deprecated metrics in the next release.
Another convention is to split the metric into 2 separate metrics (e.g. policy_change_success_total, policy_change_failure_total).
🤔
Hi @sayboras, I would prefer to make it a single metric instead of having two, one each for the success and failure totals. I think it's easier to track and maintain this way. I was looking at how existing total metrics are implemented, such as …
FYI, I think you have shared the best practice on when to use two metrics (e.g. to avoid filtering all the time); keen to hear your view here.
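The single-metric-with-label approach under discussion can be sketched with a toy counter. This is plain-stdlib Go for illustration only, not Cilium's actual Prometheus code; the type and metric names are assumptions standing in for a Prometheus `CounterVec`:

```go
package main

import "fmt"

// counterVec is a toy stand-in for a Prometheus CounterVec: one metric
// ("policy_change_total") with an "outcome" label, instead of two separate
// success/failure counters. Names are illustrative, not Cilium's real API.
type counterVec struct {
	name   string
	counts map[string]float64 // label value -> count
}

func (c *counterVec) inc(outcome string) { c.counts[outcome]++ }

func main() {
	m := &counterVec{name: "policy_change_total", counts: map[string]float64{}}

	// Simulate nine successful policy changes and one failed one.
	for i := 0; i < 9; i++ {
		m.inc("success")
	}
	m.inc("failure")

	// A single labeled metric lets a dashboard compute a success ratio
	// without joining two separate series.
	total := m.counts["success"] + m.counts["failure"]
	fmt.Printf("success ratio: %.0f%%\n", 100*m.counts["success"]/total)
}
```

The trade-off raised above still applies: queries that only care about failures must filter on the label, whereas two separate metrics avoid that filter at the cost of an extra series.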
Force-pushed from d0d8f75 to c279699
Rebase done.
Generally I think this is reasonable, but I feel like we could potentially include more information in the metric about failures in particular.
I see a few places where we could have a "reason" that a policy change failed, for example: parsing errors. Is this something we should consider adding as a label? How many failure causes do we have for policy updates?
Additionally, it seems we could potentially have the source of the change (add/update/delete). Would any of these dimensions be useful for us?
Lastly, I noticed you added a whole new location where these metrics are updated, in pkg/k8s/watchers/cilium_network_policy.go, which previously did not touch the import errors metric. Aren't we double counting by doing it there and inside the Handle method?
Hi @chancez, thanks for the review and the suggestions. I think your suggestions to add "reason" and "source" labels make sense. Both can be useful when debugging and provide better insight when something goes wrong with applying network policies. I don't see where metric generation is duplicated. The …
I'm not as familiar with this area of the project, so I was mostly just making sure, since you added new lines of code in areas that previously didn't have metrics. What consumes the REST API client where that was added, then? In general that area previously wasn't touching these metrics, so I'm trying to figure out how that impacts when and where metrics are updated.
OK, this should explain it.
Okay, yeah, I see that now. Thanks for the explanation. Overall I don't think we need to block on adding any other labels for the failure case, but I think it's worth considering. LGTM overall, though.
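The "reason" and "source" dimensions discussed above could be sketched as extra label keys on the same counter. Again a stdlib-only toy, with hypothetical label names (`outcome`, `source`, `reason`) rather than the metric's final schema:

```go
package main

import "fmt"

// policyChangeCounter is a hypothetical multi-label extension of a single
// policy_change_total metric, keyed on (outcome, source, reason) as
// suggested in review. Label names are illustrative only.
type policyChangeCounter struct {
	counts map[[3]string]float64 // [outcome, source, reason] -> count
}

func (c *policyChangeCounter) inc(outcome, source, reason string) {
	c.counts[[3]string{outcome, source, reason}]++
}

func main() {
	m := &policyChangeCounter{counts: map[[3]string]float64{}}

	// A successful Add has no failure reason; a failed Update records one.
	m.inc("success", "add", "")
	m.inc("failure", "update", "parse_error")

	// Render in a Prometheus-exposition-like form (order not guaranteed).
	for k, v := range m.counts {
		fmt.Printf("policy_change_total{outcome=%q,source=%q,reason=%q} %g\n",
			k[0], k[1], k[2], v)
	}
}
```

The cardinality cost is the usual caveat: each distinct (outcome, source, reason) combination becomes its own time series, so the reason label would need a small, bounded set of values.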
Can you help to rebase? The diff looks good to me; only one minor comment on using the new metrics in the dashboard and monitoring.
Force-pushed from ca51f33 to 7b979f0
I updated the code and resolved the comments. Please rerun the tests.
Looks OK to me, thanks
/test
Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed.
The failed tests are not related to this change, and most of them appear to be flakes. @qmonnet, can you please help me move this forward?
Change the agent metric name from policy_import_errors_total to policy_change_total. It now counts all policy changes (Add, Update, Delete) by outcome ("success" or "fail"). The metric can be used to show the percentage of successful/failed network policy changes.
Signed-off-by: Dorde Lapcevic <dordel@google.com>
Force-pushed from 7b979f0 to 6a079ca
/test
The IntegrationTests fail only for one case, which is a recent flake being fixed in 84e9641.
Yes, and the other failures are due to issues with VM provisioning. I'll re-trigger them.
/test-1.16-4.9
/test-1.25-4.19
/test-1.26-net-next
Hi @qmonnet, can we merge this?
/test-1.16-4.9
/test-1.25-4.19
/test-1.26-net-next
Apparently not; it looks like CI was still misbehaving the last time I triggered it. Let's give it another try. We're also missing some reviews, cc @aditighag @nathanjsweet
CI seems to be good now. The ci-verifier workflow requires a rebase, but I don't think that's related to this change, so we can skip the re-run unless there are other changes as part of the review.
Hi @aditighag @nathanjsweet, |
Changes look fine to me. I was wondering how we would add labels for the failure cases specifically in the future.
// Deprecated in Cilium 1.14, to be removed in 1.15.
Can you file an issue for this task?
We would add more labels (e.g. "reason") to the metric initialization, and then also add the reason to each …
Created an issue for removing the deprecated metric in Cilium 1.15: #23747
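The two-phase deprecation described earlier (keep the old metric marked deprecated for one release, remove it in the next) can be sketched as follows. This is a stdlib-only illustration under assumed names, not Cilium's actual metrics registration code:

```go
package main

import "fmt"

// During the transition release, both the deprecated and the replacement
// metric are registered and incremented together, so dashboards and alerts
// can migrate before the old series disappears. Names are illustrative.
const (
	// Deprecated in Cilium 1.14, to be removed in 1.15.
	oldMetric = "policy_import_errors_total"
	newMetric = "policy_change_total"
)

func main() {
	counts := map[string]float64{}

	// On a failed policy import, bump both metrics for one release cycle.
	counts[oldMetric]++
	counts[newMetric]++

	fmt.Println(counts[oldMetric], counts[newMetric])
}
```

Once consumers have moved to the new series, the next release simply drops the `oldMetric` registration and its increments.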
@sayboras can we merge this now? |
Reviews from the required teams are done and the CI jobs have all passed; marking this ready to merge.
@dlapcevic Thanks for your contribution! |
I would like to add, thanks for bearing with your reviewers, and for the pings to drive this to completion :) |