
test/helpers: Fail test on errors #16395

Merged
merged 1 commit into cilium:master on Jun 14, 2021
Conversation

pchaigno
Member

@pchaigno pchaigno commented Jun 2, 2021

This pull request adds level=error to the list of known-bad error messages, so that any test producing error-level log messages will fail. This is to prevent new error log messages from sneaking in.

I ran the full test suite ~15 times. The last 5 runs had zero errors due to this new check.

Updates: #13359.

@pchaigno pchaigno added area/CI-improvement Topic or proposal to improve the Continuous Integration workflow release-note/ci This PR makes changes to the CI. labels Jun 2, 2021
@pchaigno pchaigno force-pushed the ci-fail-on-errors branch 6 times, most recently from 92e52ac to bba2f99 on June 3, 2021 21:48
@pchaigno pchaigno force-pushed the ci-fail-on-errors branch 4 times, most recently from 3b55ff4 to fa70dfe on June 9, 2021 12:38
This commit adds level=error to the known-bad error messages, such that
we will fail any test that has error log messages. This is to prevent
error log messages from sneaking in.

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno marked this pull request as ready for review June 10, 2021 08:41
@pchaigno pchaigno requested a review from a team as a code owner June 10, 2021 08:41
@pchaigno pchaigno requested a review from tklauser June 10, 2021 08:41
@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 11, 2021
Member
@joestringer joestringer left a comment
Awesome, this feels like we're moving closer to properly addressing all such potential errors as part of the dev/testing effort rather than ignoring them. Just one note below.

Comment on lines +243 to +250
// ...and their exceptions.
lrpExists = "local-redirect service exists for frontend" // cf. https://github.com/cilium/cilium/issues/16400
opCannotBeFulfilled = "Operation cannot be fulfilled on leases.coordination.k8s.io" // cf. https://github.com/cilium/cilium/issues/16402
lockDeletedEp = "lock failed: endpoint is in the process of being removed" // cf. https://github.com/cilium/cilium/issues/16422
listenAndServeFailed = "ListenAndServe failed for service health server" // cf. https://github.com/cilium/cilium/pull/16477
globalDataSupport = "kernel doesn't support global data" // cf. https://github.com/cilium/cilium/issues/16418
removingInexistantID = "removing identity not added to the identity manager!" // cf. https://github.com/cilium/cilium/issues/16419
failedToListCRDs = "the server could not find the requested resource" // cf. https://github.com/cilium/cilium/issues/16425
Member

If I understand correctly, we will basically ignore these messages but they could still occur in real clusters and may represent real logical inconsistencies (possibly even bugs) in the agent code. What do you think is the best way to approach investigating & driving these down to the absolute minimum?

Should we at least put a big disclaimer as a log above this section saying "There should ideally be no items below this list. If you're considering adding more, please reconsider whether there's a real bug we should be addressing instead" or similar?

Member Author

> If I understand correctly, we will basically ignore these messages but they could still occur in real clusters and may represent real logical inconsistencies (possibly even bugs) in the agent code.

My first impression is that most of these error logs should just be downgraded to warnings or lower. The exceptions so far are globalDataSupport and maybe failedToListCRDs. That would explain why they went unnoticed for so long (there's often no other impact).

> What do you think is the best way to approach investigating & driving these down to the absolute minimum?

Some of these issues should be easy to address (e.g., changing the log level) and are currently marked as good-first-issues. For some, we may not be able to do anything (e.g., opCannotBeFulfilled). Given this is not high priority and we're all struggling to find cycles, it will probably take a while before the list is reduced to the absolute minimum.

> Should we at least put a big disclaimer as a log above this section saying "There should ideally be no items below this list. If you're considering adding more, please reconsider whether there's a real bug we should be addressing instead" or similar?

I was a bit hesitant to do that just now because some of these errors occur only in very specific, racy conditions. That means we may still see more errors that need to be added here in the weeks following the merge, despite my ~15 runs of the full CI on this pull request. During that time, I'd prefer that we accept new items in the list. We can always add the big warning the next time we remove an item.

@borkmann borkmann merged commit 11fb42a into cilium:master Jun 14, 2021
@pchaigno pchaigno deleted the ci-fail-on-errors branch June 14, 2021 13:33
nbusseneau pushed a commit to nbusseneau/cilium that referenced this pull request Dec 10, 2021
[ upstream commit 82d4422 ]

[ Backport notes: had to resolve conflicts manually due to cilium#16395
  previously introducing exceptions not having been backported to v1.10.
  The changes in this PR completely supersede cilium#16395 so there should be
  no need to backport it first. ]

Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. cilium#16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this pull request Dec 14, 2021 (same backport commit message as above)