New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test/helpers: Fail test on errors #16395
Conversation
92e52ac
to
bba2f99
Compare
3b55ff4
to
fa70dfe
Compare
This commit adds level=error to the known-bad error messages, such that we will fail any test that has error log messages. This is to prevent error log messages from sneaking in. Signed-off-by: Paul Chaignon <paul@cilium.io>
fa70dfe
to
19d1b0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, this feels like we're moving closer to properly addressing all potential such errors as part of the dev/testing effort rather than ignoring them. Just one note below.
// ...and their exceptions. | ||
lrpExists = "local-redirect service exists for frontend" // cf. https://github.com/cilium/cilium/issues/16400 | ||
opCannotBeFulfilled = "Operation cannot be fulfilled on leases.coordination.k8s.io" // cf. https://github.com/cilium/cilium/issues/16402 | ||
lockDeletedEp = "lock failed: endpoint is in the process of being removed" // cf. https://github.com/cilium/cilium/issues/16422 | ||
listenAndServeFailed = "ListenAndServe failed for service health server" // cf. https://github.com/cilium/cilium/pull/16477 | ||
globalDataSupport = "kernel doesn't support global data" // cf. https://github.com/cilium/cilium/issues/16418 | ||
removingInexistantID = "removing identity not added to the identity manager!" // cf. https://github.com/cilium/cilium/issues/16419 | ||
failedToListCRDs = "the server could not find the requested resource" // cf. https://github.com/cilium/cilium/issues/16425 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, we will basically ignore these messages but they could still occur in real clusters and may represent real logical inconsistencies (possibly even bugs) in the agent code. What do you think is the best way to approach investigating & driving these down to the absolute minimum?
Should we at least put a big disclaimer as a log above this section saying "There should ideally be no items below this list. If you're considering adding more, please reconsider whether there's a real bug we should be addressing instead" or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, we will basically ignore these messages but they could still occur in real clusters and may represent real logical inconsistencies (possibly even bugs) in the agent code.
My first impression is that most of these error logs should just be downgraded to warnings or lower. The exception so far are globalDataSupport
and maybe failedToListCRDs
. That would explain why they went unnoticed for so long (there's often no other impact).
What do you think is the best way to approach investigating & driving these down to the absolute minimum?
Some of these issues should be easy to address (e.g., change log level) and are currently marked as good-first-issues
. For some we may not be able to do anything about it (e.g., opCannotBeFulfilled
). Given this is not high priority and we're all struggling to find cycles, it will probably take a bit of time before this is reduced to the absolute minimum.
Should we at least put a big disclaimer as a log above this section saying "There should ideally be no items below this list. If you're considering adding more, please reconsider whether there's a real bug we should be addressing instead" or similar?
I was a bit hesitant to do that just now because some of these errors occur in very specific, racy conditions. That means we still see more errors that we need to add here in the week following the merge, despite my ~15 runs of the full CI in this pull request. For that time, I'd probably prefer if we accept new items in the list. We can always add the big warning next time we remove an item.
[ upstream commit 82d4422 ] [ Backport notes: had to resolve conflicts manually due to cilium#16395 previously introducing exceptions not having been backported to v1.10. The changes in this PR completely supersede cilium#16395 so there should be no need to backport it first. ] Occasionally the cilium-operator will run into a transient issue where it cannot get/update/release the leaselock with K8s that it uses to adjudicate its leader election. This error message is part and parcel of this failure and can be ignored. cf. cilium#16402 Signed-off-by: Nate Sweet <nathanjsweet@pm.me> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
[ upstream commit 82d4422 ] [ Backport notes: had to resolve conflicts manually due to #16395 previously introducing exceptions not having been backported to v1.10. The changes in this PR completely supersede #16395 so there should be no need to backport it first. ] Occasionally the cilium-operator will run into a transient issue where it cannot get/update/release the leaselock with K8s that it uses to adjudicate its leader election. This error message is part and parcel of this failure and can be ignored. cf. #16402 Signed-off-by: Nate Sweet <nathanjsweet@pm.me> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
This pull request adds
level=error
to the known-bad error messages, such that we will fail any test that has error log messages. This is to prevent error log messages from sneaking in.I ran the full test suite ~15 times. The last 5 runs had zero errors due to this new check.
Updates: #13359.