
test: Avoid unnecessary restarts of unmanaged pods #14938

Merged: 1 commit into master on Feb 12, 2021

Conversation

pchaigno
Member

@pchaigno commented Feb 11, 2021

In tests, after Cilium installations, we often restart unmanaged pods to ensure they are managed by Cilium. In particular, on GKE, we used to restart unmanaged pods from both the kube-system and the cilium namespaces.

However, commit 48be458 ("test: GKE: Install Cilium in kube-system namespace") removed the cilium namespace and now installs Cilium in the kube-system namespace instead. One of the two calls to RestartUnmanagedPodsInNamespace() is therefore unnecessary. In addition, we already restart pods in the namespace of the log-gatherer pods for all CI environments (not just GKE). If that namespace is kube-system, then we don't need any call to RestartUnmanagedPodsInNamespace() for GKE.

I expect this will fix flake #14915. In that flake, some pods are not found while attempting to restart unmanaged pods. The flake started appearing in master when we merged commit 48be458. The theory is that, with two back-to-back calls to RestartUnmanagedPodsInNamespace() for the same namespace, the second call selects pods that the first call already restarted. Such pods may disappear between the time we select them and the time we actually execute kubectl delete, resulting in the error:

Error from server (NotFound): pods "kube-dns-66d6b7c877-dp4q2" not found

Fixes: #14899
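
For illustration, here is a minimal Go sketch of the select-then-delete pattern described above. It is not the actual RestartUnmanagedPodsInNamespace implementation from the Cilium test suite, and the unmanaged-pod predicate is a placeholder; it only shows the window in which a pod selected by one call can vanish before a later kubectl delete runs:

```go
// Hypothetical sketch of the select-then-delete pattern; not the actual
// Cilium test helper. The isUnmanaged predicate is a placeholder.
package helpers

import (
	"fmt"
	"os/exec"
	"strings"
)

// listPodNames lists all pod names in a namespace via kubectl.
func listPodNames(namespace string) ([]string, error) {
	out, err := exec.Command("kubectl", "-n", namespace, "get", "pods",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

// RestartUnmanagedPods deletes every pod the predicate selects so that it
// gets recreated under Cilium. Two back-to-back calls for the same
// namespace can select the same pods: the second call may then try to
// delete pods that are already terminating.
func RestartUnmanagedPods(namespace string, isUnmanaged func(string) bool) error {
	pods, err := listPodNames(namespace)
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if !isUnmanaged(pod) {
			continue
		}
		// Race window: the pod can disappear between the list above and
		// this delete, yielding:
		//   Error from server (NotFound): pods "..." not found
		out, err := exec.Command("kubectl", "-n", namespace,
			"delete", "pod", pod).CombinedOutput()
		if err != nil {
			return fmt.Errorf("deleting pod %s: %v: %s", pod, err, out)
		}
	}
	return nil
}
```

Passing --ignore-not-found to kubectl delete would make such deletions tolerant of the race; this PR instead removes the redundant second call.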

@pchaigno added the kind/bug/CI, area/CI, and release-note/ci labels Feb 11, 2021
@pchaigno requested a review from a team as a code owner February 11, 2021 15:27
@maintainer-s-little-helper bot added this to In progress in 1.10.0 Feb 11, 2021
@pchaigno
Member Author

test-me-please

@aditighag
Member

The fix looks reasonable to me, but it would be good to get a review from the ci-structure team.

@aditighag
Member

test-gke
Perhaps run test-gke couple of more times to confirm that it fixes the flake?

@pchaigno
Member Author

Perhaps run test-gke couple of more times to confirm that it fixes the flake?

I thought about it, but I think the change makes sense to have regardless of whether it fixes the flake. So we might as well merge quickly rather than wait to validate that it fixes the flake. We'll soon know anyway.

@nbusseneau
Member

I thought about it, but I think the change makes sense to have regardless of whether it fixes the flake.

I agree.

@aditighag
Member

Perhaps run test-gke couple of more times to confirm that it fixes the flake?

I thought about it, but I think the change makes sense to have regardless of whether it fixes the flake. So we might as well merge quickly rather than wait to validate that it fixes the flake. We'll soon know anyway.

Yup, makes sense. I hit this on my PR. If the re-run hits the flake again, I'll have to rebase onto master once your fix is merged.

@pchaigno
Member Author

pchaigno commented Feb 11, 2021

Ah! It appears the previous GKE run failed because it tried to parse a focus from your comment @aditighag 😆

    ginkgo --focus=K8s*run test-gke couple of more times to confirm that it fixes the flake? -v -- -cilium.provision=false -cilium.timeout=180m -cilium.kubeconfig=/home/jenkins/workspace/Cilium-PR-K8s-GKE@2/src/github.com/cilium/cilium/test/gke/gke-kubeconfig -cilium.passCLIEnvironment=true -cilium.image=147.75.55.179:37293/cilium/cilium -cilium.tag=cf9537359bc9eb267174fc8e94809489cd5d9783 -cilium.operator-image=147.75.55.179:37293/cilium/operator -cilium.operator-tag=cf9537359bc9eb267174fc8e94809489cd5d9783 -cilium.hubble-relay-image=147.75.55.179:37293/cilium/hubble-relay -cilium.hubble-relay-tag=cf9537359bc9eb267174fc8e94809489cd5d9783 -cilium.holdEnvironment=false -cilium.runQuarantined=false

https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4308/consoleFull
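
For illustration, a hypothetical sketch of how a greedy trigger regex can swallow a quoted sentence as a Ginkgo focus; the actual Jenkins trigger configuration may well differ:

```go
// Hypothetical illustration of an over-eager PR-comment trigger; this is
// not the actual Jenkins configuration.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// Assume the trigger matches "test-gke" anywhere in a comment and
	// treats the rest of the line as an optional focus string.
	trigger := regexp.MustCompile(`test-gke(.*)`)

	// A quote of an earlier comment contains the trigger phrase mid-sentence.
	comment := "Perhaps run test-gke couple of more times to confirm that it fixes the flake?"

	if m := trigger.FindStringSubmatch(comment); m != nil {
		// Everything after the trigger phrase leaks into the focus,
		// producing a nonsense --focus instead of an empty one.
		fmt.Printf("ginkgo --focus=K8s*%s\n", strings.TrimSpace(m[1]))
	}
}
```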

test-gke

@pchaigno
Member Author

pchaigno commented Feb 11, 2021

Some error while cloning the git repo... https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4310/

test-gke

@pchaigno
Member Author

pchaigno commented Feb 11, 2021

Previous failure at https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel/4690/testReport/junit/Suite-k8s-1/19/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_direct_routing_Tests_HostPort/.
It looks unrelated to the changes in this PR, but I can't find an existing issue for it. I might have to create a new one.

test-4.19

@pchaigno
Member Author

GKE failed again with #14915: Error from server (NotFound): pods "kube-dns-66d6b7c877-zttfq" not found (cf. https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4311/testReport/junit/Suite-k8s-1/16/K8sKafkaPolicyTest_Kafka_Policy_Tests_KafkaPolicies/). So the current fix isn't enough. I think we should still merge once 4.19 is green. I removed the Fixes: tag from the original post so as not to close #14915.

@pchaigno
Member Author

This is ready to merge. The failure is known flake #14915, and reviews are in.

@pchaigno added the ready-to-merge label Feb 12, 2021
@borkmann merged commit fd918a0 into master Feb 12, 2021
1.10.0 automation moved this from In progress to Done Feb 12, 2021
@borkmann deleted the pr/pchaigno/fix-gke-restart-unmanaged-pods branch February 12, 2021 09:08
@joestringer moved this from In progress to Done in 1.10.0 Feb 12, 2021
@maintainer-s-little-helper bot added this to Needs backport from master in 1.9.5 Feb 18, 2021
@maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.5 Feb 18, 2021
@maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.5 Feb 22, 2021