
test: Fix pod cleanup after various tests #18448

Merged
merged 10 commits on Jan 27, 2022

Conversation

joestringer
Member

@joestringer joestringer commented Jan 11, 2022

See individual commits for more details.

According to the Cilium Ginkgo e2e test docs, all AfterEach statements are executed before all AfterAll statements.

Previously, some of this code assumed that the AfterAll inside a Context would run first (to delete pods), then the AfterEach outside the Context would run (to wait for the pods to terminate), and finally the AfterAll outside the Context would clean up the Cilium pods. Given the actual ordering above, the race in issue #18447 could occur: the next test to run could find pods that were deleted but had not fully terminated before Cilium was removed. Once Cilium is deleted before those pods, there is no longer any way to execute the CNI DEL operation for them, so they get stuck in the Terminating state.
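
For illustration, here's a minimal sketch of the hook layout in question. AfterAll comes from Cilium's test/ginkgo-ext wrapper (stock Ginkgo v1 has no AfterAll), and the suite/context names and comments are illustrative rather than the actual test code:

```go
package k8sTest

import (
	// Assumed dot-import: the ginkgo-ext wrapper in test/ginkgo-ext
	// provides Describe/Context/AfterEach plus the AfterAll extension.
	. "github.com/cilium/cilium/test/ginkgo-ext"
)

var _ = Describe("K8sLRPTests", func() {
	// Actually runs FIRST: every AfterEach runs before any AfterAll,
	// so this wait happens before the Context below has deleted the
	// test pods, and there is nothing to wait for yet.
	AfterEach(func() {
		// wait for test pods to terminate
	})

	// Actually runs LAST: removes Cilium. Any test pod still in
	// Terminating at this point can never complete its CNI DEL and
	// stays stuck in Terminating.
	AfterAll(func() {
		// delete Cilium
	})

	Context("LRP tests", func() {
		// Was assumed to run before the Describe-level AfterEach
		// above, but per the docs it runs after all AfterEach
		// statements, so the pods are deleted without being waited on.
		AfterAll(func() {
			// delete test pods
		})
	})
})
```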

Other parts of the code simply omitted the check that the pods had terminated correctly.

Fix these issues by moving the checks that the test pods are fully terminated into the same statements where the pods are deleted.
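
Concretely, the fix ends up looking roughly like this: a sketch under the same assumptions as above, where kubectl.Delete and ExpectAllPodsTerminated stand in for the suite's existing helpers:

```go
Context("LRP tests", func() {
	AfterAll(func() {
		// Delete the test pods and, in the same cleanup step, block
		// until they have fully terminated. The Describe-level
		// AfterAll then only removes Cilium once no pod still needs
		// a CNI DEL.
		kubectl.Delete(demoYAML)
		ExpectAllPodsTerminated(kubectl)
	})
})
```

Keeping the deletion and the termination check in the same scope means the documented AfterEach/AfterAll ordering is enough to make the cleanup safe.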

Fixes: 412e299 ("test: Move LRP tests to a separate suite")
Fixes: #18447
Fixes: #18566

Note: I initially marked this for backport to v1.10 based on the first LRP test refactor commit, and left it that way since it should make the testsuite a bit more robust in general. I don't know whether each of the changes will backport cleanly to v1.10. If not, we can drop those changes from the backport, or decide to skip backporting the PR to v1.10 altogether.

@joestringer joestringer requested a review from a team as a code owner January 11, 2022 23:04
@joestringer joestringer added needs-backport/1.10 release-note/ci This PR makes changes to the CI. labels Jan 11, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 11, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 11, 2022
@joestringer joestringer requested a review from brb January 11, 2022 23:05
@joestringer
Member Author

joestringer commented Jan 11, 2022

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name: K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output: FAIL: terminating containers are not deleted after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

@joestringer
Member Author

joestringer commented Jan 12, 2022

😂 of course this PR hits the exact same problem, but with leftover state from a different test:

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.19/251/testReport/junit/Suite-k8s-1/22/K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master/

16:20:27  K8sDatapathConfig
16:20:27  /home/jenkins/workspace/Cilium-PR-K8s-1.22-kernel-4.19/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:473
16:20:27    Iptables
16:20:27    /home/jenkins/workspace/Cilium-PR-K8s-1.22-kernel-4.19/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:473
16:20:27      Skip conntrack for pod traffic
16:20:27      /home/jenkins/workspace/Cilium-PR-K8s-1.22-kernel-4.19/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:527

Am I approaching this the wrong way? Should we be making sure all of the tests clean up after themselves properly or should we just make sure that the first test in each test group gets the environment back into a good state?

Member

@nbusseneau nbusseneau left a comment


Well spotted!

@nbusseneau
Member

Am I approaching this the wrong way? Should we be making sure all of the tests clean up after themselves properly or should we just make sure that the first test in each test group gets the environment back into a good state?

(Thanks GH for not refreshing the PR comments when I'm reviewing)

To be honest I kinda assumed that every test was already supposed to clean up after itself, unless a test suite is specifically designed with the environment kept around for several tests in a row (in which case the suite should clean up). So I was convinced your approach would be right...

@joestringer
Member Author

joestringer commented Jan 12, 2022

To be honest I kinda assumed that every test was already supposed to clean up after itself, unless a test suite is specifically designed with the environment kept around for several tests in a row (in which case the suite should clean up). So I was convinced your approach would be right...

I think this is basically a decision we have to make :-) Note that this approach can be right, but the same class of bug is present elsewhere in the testsuite. The failure that happened here is not the exact same test or conditions as the one I reported in #18447, since the previous test in the failure on this PR was in K8sDatapathConfig, not K8sLRPTests.

The likelihood of this failure depends on (1) removing and redeploying Cilium, (2) race conditions between deleting app pods and starting the next test, and (3) how many different test files there are.[1]

Maybe we just need to apply the same fix to K8sDatapathConfig and then we're done. I guess I'll just look over the other test files and look for a similar pattern.

[1] If a bug like this is caused by incorrect ordering between AfterEach() at the top level of the file and AfterAll() inside a Context, then it only triggers where the AfterEach() happens at the wrong time, so the likelihood depends on the specific files involved and on the probability of running a particular file after a test that doesn't clean up after itself correctly.

@joestringer
Member Author

I'm hoping that commit d4739ad will fix the above issue.

Member

@aditighag aditighag left a comment


Reviewed d4739ad. The change looks correct to me.

Thanks! 🙏

While at it, we should remove this duplicate code that deploys the yaml - https://github.com/cilium/cilium/blob/master/test/k8sT/Services.go#L1639-L1641. I can fix this in a separate PR if you don't want to do it in this PR.

The next commit will then shift the cleanup logic
out of a Describe level AfterEach and into the individual Contexts.

Can you confirm the commit? Might as well double check that too.

@joestringer
Member Author

joestringer commented Jan 27, 2022

While at it, we should remove this duplicate code that deploys the yaml - https://github.com/cilium/cilium/blob/master/test/k8sT/Services.go#L1639-L1641. I can fix this in a separate PR if you don't want to do it in this PR.

Please do, I'm hoping that this PR is now in a good state to merge & stabilize the tree.

Can you confirm the commit? Might as well double check that too.

The commit is 30a6d90. Unfortunately I can't reference it in the commit message, because GitHub will rewrite the commit SHAs when we merge the PR. Hence I just rearranged the commits so that this statement would be correct for the next commit in order. (The only reason for these gymnastics is to avoid introducing the failure into the tree and then removing it again, so I moved the fix earlier in the branch's commit ordering.)

@joestringer
Member Author

The CodeQL analysis failure seems to come from a newly introduced checker that triggers false positives on regexes in code this PR doesn't touch. Ignoring.

@joestringer
Member Author

/test

@joestringer
Member Author

Travis hit a pull rate-limit issue, ignoring (prior runs passed successfully, and this PR is focused on Ginkgo test code which isn't covered by Travis). Everything else passed, the PR has been reviewed, and this fixes a common CI flake on master. Merging.

Labels
backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. release-note/ci This PR makes changes to the CI.
8 participants