
test: Fix pod cleanup after various tests #18448

Merged
merged 10 commits on Jan 27, 2022

Conversation

joestringer
Member

@joestringer joestringer commented Jan 11, 2022

See individual commits for more details.

According to the Cilium Ginkgo e2e test docs, all AfterEach statements are executed before all AfterAll statements.

Previously, some of this code assumed that the AfterAll inside a Context would run first (to delete pods), then the AfterEach outside the Context would run (to wait for the pods to terminate), and finally the AfterAll outside the Context would clean up the Cilium pods. Given the actual ordering above, the race in issue #18447 could occur: the next test to run could find pods that were deleted but had not fully terminated before Cilium was removed. Once Cilium is deleted before those pods, there is no longer any way to execute the CNI DEL operation for them, so they get stuck in the Terminating state.
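
For illustration, here's a minimal sketch of the hook layout in question. AfterAll comes from Cilium's test/ginkgo-ext wrapper (stock Ginkgo v1 has no AfterAll), and the suite/context names and comments are illustrative rather than the actual test code:

```go
package k8sTest

import (
	// Assumed dot-import: the ginkgo-ext wrapper in test/ginkgo-ext
	// provides Describe/Context/AfterEach plus the AfterAll extension.
	. "github.com/cilium/cilium/test/ginkgo-ext"
)

var _ = Describe("K8sLRPTests", func() {
	// Actually runs FIRST: every AfterEach runs before any AfterAll,
	// so this wait happens before the Context below has deleted the
	// test pods, and there is nothing to wait for yet.
	AfterEach(func() {
		// wait for test pods to terminate
	})

	// Actually runs LAST: removes Cilium. Any test pod still in
	// Terminating at this point can never complete its CNI DEL and
	// stays stuck in Terminating.
	AfterAll(func() {
		// delete Cilium
	})

	Context("LRP tests", func() {
		// Was assumed to run before the Describe-level AfterEach
		// above, but per the docs it runs after all AfterEach
		// statements, so the pods are deleted without being waited on.
		AfterAll(func() {
			// delete test pods
		})
	})
})
```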

Other parts of the code simply omitted the check that the pods had terminated correctly.

Fix these issues by moving the checks that the test pods are fully terminated into the same statements where the pods are deleted.
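
Concretely, the fix ends up looking roughly like this: a sketch under the same assumptions as above, where kubectl.Delete and ExpectAllPodsTerminated stand in for the suite's existing helpers:

```go
Context("LRP tests", func() {
	AfterAll(func() {
		// Delete the test pods and, in the same cleanup step, block
		// until they have fully terminated. The Describe-level
		// AfterAll then only removes Cilium once no pod still needs
		// a CNI DEL.
		kubectl.Delete(demoYAML)
		ExpectAllPodsTerminated(kubectl)
	})
})
```

Keeping the deletion and the termination check in the same scope means the documented AfterEach/AfterAll ordering is enough to make the cleanup safe.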

Fixes: 412e299 ("test: Move LRP tests to a separate suite")
Fixes: #18447
Fixes: #18566

Note: I initially marked this for backport to v1.10 based on the first LRP test refactor commit, and left it that way since it should make the testsuite a bit more robust in general. I don't know whether each of the changes will backport cleanly to v1.10. If not, we can drop those changes from the backport, or decide to skip backporting the PR to v1.10 altogether.

@joestringer joestringer requested a review from a team as a code owner January 11, 2022 23:04
@joestringer joestringer added needs-backport/1.10 release-note/ci This PR makes changes to the CI. labels Jan 11, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 11, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 11, 2022
@joestringer joestringer requested a review from brb January 11, 2022 23:05
@joestringer
Member Author

joestringer commented Jan 11, 2022

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.19' failed and has not been observed before, so may be related to your PR:


Test Name: K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output: FAIL: terminating containers are not deleted after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.19 so I can create a new GitHub issue to track it.

@joestringer
Member Author

joestringer commented Jan 12, 2022

😂 of course this PR hits the exact same problem, but with leftover state from a different test:

https://jenkins.cilium.io/job/Cilium-PR-K8s-1.22-kernel-4.19/251/testReport/junit/Suite-k8s-1/22/K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master/

16:20:27  K8sDatapathConfig
16:20:27  /home/jenkins/workspace/Cilium-PR-K8s-1.22-kernel-4.19/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:473
16:20:27    Iptables
16:20:27    /home/jenkins/workspace/Cilium-PR-K8s-1.22-kernel-4.19/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:473
16:20:27      Skip conntrack for pod traffic
16:20:27      /home/jenkins/workspace/Cilium-PR-K8s-1.22-kernel-4.19/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:527

Am I approaching this the wrong way? Should we be making sure all of the tests clean up after themselves properly or should we just make sure that the first test in each test group gets the environment back into a good state?

Member

@nbusseneau nbusseneau left a comment


Well spotted!

@nbusseneau
Member

Am I approaching this the wrong way? Should we be making sure all of the tests clean up after themselves properly or should we just make sure that the first test in each test group gets the environment back into a good state?

(Thanks GH for not refreshing the PR comments when I'm reviewing)

To be honest I kinda assumed that every test was already supposed to clean up after itself, unless a test suite is specifically designed with the environment kept around for several tests in a row (in which case the suite should clean up). So I was convinced your approach would be right...

@joestringer
Member Author

joestringer commented Jan 12, 2022

To be honest I kinda assumed that every test was already supposed to clean up after itself, unless a test suite is specifically designed with the environment kept around for several tests in a row (in which case the suite should clean up). So I was convinced your approach would be right...

I think this is basically a decision we have to make :-) Note that this approach can be right, but the same class of bug is present elsewhere in the testsuite. The failure that happened here is not the exact same test or conditions as the one I reported in #18447, since the previous test in the failure on this PR was in K8sDatapathConfig, not K8sLRPTests.

The likelihood of this failure depends on (1) removing and redeploying Cilium, (2) race conditions between deleting app pods and starting the next test, and (3) how many different test files there are.[1]

Maybe we just need to apply the same fix to K8sDatapathConfig and then we're done. I guess I'll just look over the other test files and look for a similar pattern.

[1] If a bug like this is caused by incorrect ordering between AfterEach() at the top level of the file and AfterAll() inside a Context, then it only triggers where the AfterEach() happens at the wrong time, so the likelihood depends on the specific files involved and on the probability of running a particular file after a test that doesn't clean up after itself correctly.

@joestringer
Member Author

I'm hoping that commit d4739ad will fix the above issue.

Member

@aditighag aditighag left a comment


Reviewed d4739ad. The change looks correct to me.

Thanks! 🙏

While at it, we should remove this duplicate code that deploys the yaml - https://github.com/cilium/cilium/blob/master/test/k8sT/Services.go#L1639-L1641. I can fix this in a separate PR if you don't want to do it in this PR.

The next commit will then shift the cleanup logic
out of a Describe level AfterEach and into the individual Contexts.

Can you confirm the commit? Might as well double check that too.

@joestringer
Member Author

joestringer commented Jan 27, 2022

While at it, we should remove this duplicate code that deploys the yaml - https://github.com/cilium/cilium/blob/master/test/k8sT/Services.go#L1639-L1641. I can fix this in a separate PR if you don't want to do it in this PR.

Please do, I'm hoping that this PR is now in a good state to merge & stabilize the tree.

Can you confirm the commit? Might as well double check that too.

The commit is 30a6d90. Unfortunately I can't reference it in the commit message, because GitHub will rewrite the commit SHAs when we merge the PR. Hence I just rearranged the commits so that this statement would be correct for the next commit in order. (The only reason for these gymnastics is to avoid introducing the failure into the tree and then removing it again, so I moved the fix earlier in the branch's commit ordering.)

@joestringer
Member Author

The CodeQL analysis failure seems to come from a newly introduced checker that triggers false positives on regexes in code this PR doesn't touch. Ignoring.

@joestringer
Member Author

/test

@joestringer
Member Author

Travis hit a pull rate-limit issue, ignoring (prior runs passed successfully, and this PR is focused on Ginkgo test code which isn't covered by Travis). Everything else passed, the PR has been reviewed, and this fixes a common CI flake on master. Merging.

Labels
backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. release-note/ci This PR makes changes to the CI.
8 participants