
test: Avoid unnecessary restarts of unmanaged pods #14938

Merged: 1 commit into master on Feb 12, 2021

Conversation

pchaigno
Member

@pchaigno commented Feb 11, 2021

In tests, after Cilium installations, we often restart unmanaged pods to ensure they are managed by Cilium. In particular, on GKE, we used to restart unmanaged pods from both the kube-system and the cilium namespaces.

However, commit 48be458 ("test: GKE: Install Cilium in kube-system namespace") removed the cilium namespace and now installs Cilium in the kube-system namespace instead. One of the two calls to RestartUnmanagedPodsInNamespace() is therefore unnecessary. In addition, we already restart pods in the namespace of the log-gatherer pods for all CI environments (not just GKE). If that namespace is kube-system, then we don't need any call to RestartUnmanagedPodsInNamespace() for GKE.

I expect this will fix flake #14915. In that flake, some pods are not found while attempting to restart unmanaged pods. The flake started appearing in master when we merged commit 48be458. The theory is that, with two back-to-back calls to RestartUnmanagedPodsInNamespace() for the same namespace, the second call selects pods that the first call already restarted. Such pods may disappear between the time we select them and the time we actually execute kubectl delete, resulting in the error:

Error from server (NotFound): pods "kube-dns-66d6b7c877-dp4q2" not found

Fixes: #14899
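
For illustration, here is a minimal Go sketch of the select-then-delete pattern described above. It is not the actual RestartUnmanagedPodsInNamespace implementation from the Cilium test suite, and the unmanaged-pod predicate is a placeholder; it only shows the window in which a pod selected by one call can vanish before a later kubectl delete runs:

```go
// Hypothetical sketch of the select-then-delete pattern; not the actual
// Cilium test helper. The isUnmanaged predicate is a placeholder.
package helpers

import (
	"fmt"
	"os/exec"
	"strings"
)

// listPodNames lists all pod names in a namespace via kubectl.
func listPodNames(namespace string) ([]string, error) {
	out, err := exec.Command("kubectl", "-n", namespace, "get", "pods",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

// RestartUnmanagedPods deletes every pod the predicate selects so that it
// gets recreated under Cilium. Two back-to-back calls for the same
// namespace can select the same pods: the second call may then try to
// delete pods that are already terminating.
func RestartUnmanagedPods(namespace string, isUnmanaged func(string) bool) error {
	pods, err := listPodNames(namespace)
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if !isUnmanaged(pod) {
			continue
		}
		// Race window: the pod can disappear between the list above and
		// this delete, yielding:
		//   Error from server (NotFound): pods "..." not found
		out, err := exec.Command("kubectl", "-n", namespace,
			"delete", "pod", pod).CombinedOutput()
		if err != nil {
			return fmt.Errorf("deleting pod %s: %v: %s", pod, err, out)
		}
	}
	return nil
}
```

Passing --ignore-not-found to kubectl delete would make such deletions tolerant of the race; this PR instead removes the redundant second call.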

@pchaigno added the kind/bug/CI, area/CI, and release-note/ci labels Feb 11, 2021
@pchaigno requested a review from a team as a code owner February 11, 2021 15:27
@maintainer-s-little-helper bot added this to In progress in 1.10.0 Feb 11, 2021
@pchaigno
Member Author

test-me-please

@aditighag
Member

The fix looks reasonable to me, but it would be good to get a review from the ci-structure team.

@aditighag
Member

test-gke
Perhaps run test-gke couple of more times to confirm that it fixes the flake?

@pchaigno
Member Author

Perhaps run test-gke couple of more times to confirm that it fixes the flake?

I thought about it, but I think the change makes sense to have regardless of whether it fixes the flake. So we might as well merge quickly rather than wait to validate that it fixes the flake. We'll soon know anyway.

@nbusseneau
Member

I thought about it, but I think the change makes sense to have regardless of whether it fixes the flake.

I agree.

@aditighag
Member

Perhaps run test-gke couple of more times to confirm that it fixes the flake?

I thought about it, but I think the change makes sense to have regardless of whether it fixes the flake. So we might as well merge quickly rather than wait to validate that it fixes the flake. We'll soon know anyway.

Yup, makes sense. I hit this on my PR. If the re-run hits the flake again, I'll have to rebase onto master once your fix is merged.

@pchaigno
Member Author

pchaigno commented Feb 11, 2021

Ah! It appears the previous GKE run failed because it tried to parse a focus from your comment @aditighag 😆

    ginkgo --focus=K8s*run test-gke couple of more times to confirm that it fixes the flake? -v -- -cilium.provision=false -cilium.timeout=180m -cilium.kubeconfig=/home/jenkins/workspace/Cilium-PR-K8s-GKE@2/src/github.com/cilium/cilium/test/gke/gke-kubeconfig -cilium.passCLIEnvironment=true -cilium.image=147.75.55.179:37293/cilium/cilium -cilium.tag=cf9537359bc9eb267174fc8e94809489cd5d9783 -cilium.operator-image=147.75.55.179:37293/cilium/operator -cilium.operator-tag=cf9537359bc9eb267174fc8e94809489cd5d9783 -cilium.hubble-relay-image=147.75.55.179:37293/cilium/hubble-relay -cilium.hubble-relay-tag=cf9537359bc9eb267174fc8e94809489cd5d9783 -cilium.holdEnvironment=false -cilium.runQuarantined=false

https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4308/consoleFull
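
For illustration, a hypothetical sketch of how a greedy trigger regex can swallow a quoted sentence as a Ginkgo focus; the actual Jenkins trigger configuration may well differ:

```go
// Hypothetical illustration of an over-eager PR-comment trigger; this is
// not the actual Jenkins configuration.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// Assume the trigger matches "test-gke" anywhere in a comment and
	// treats the rest of the line as an optional focus string.
	trigger := regexp.MustCompile(`test-gke(.*)`)

	// A quote of an earlier comment contains the trigger phrase mid-sentence.
	comment := "Perhaps run test-gke couple of more times to confirm that it fixes the flake?"

	if m := trigger.FindStringSubmatch(comment); m != nil {
		// Everything after the trigger phrase leaks into the focus,
		// producing a nonsense --focus instead of an empty one.
		fmt.Printf("ginkgo --focus=K8s*%s\n", strings.TrimSpace(m[1]))
	}
}
```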

test-gke

@pchaigno
Member Author

pchaigno commented Feb 11, 2021

Some error while cloning the git repo... https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4310/

test-gke

@pchaigno
Member Author

pchaigno commented Feb 11, 2021

Previous failure at https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel/4690/testReport/junit/Suite-k8s-1/19/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_direct_routing_Tests_HostPort/.
It looks unrelated to the changes in this PR, but I can't find an existing issue for it. I might have to create a new one.

test-4.19

@pchaigno
Member Author

GKE failed again with #14915: Error from server (NotFound): pods "kube-dns-66d6b7c877-zttfq" not found (cf. https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4311/testReport/junit/Suite-k8s-1/16/K8sKafkaPolicyTest_Kafka_Policy_Tests_KafkaPolicies/). So the current fix isn't enough. I think we should still merge once 4.19 is green. I removed the Fixes: tag from the original post so as not to close #14915.

@pchaigno
Member Author

This is ready to merge. The failure is known flake #14915, and reviews are in.

@pchaigno added the ready-to-merge label Feb 12, 2021
@borkmann merged commit fd918a0 into master Feb 12, 2021
1.10.0 automation moved this from In progress to Done Feb 12, 2021
@borkmann deleted the pr/pchaigno/fix-gke-restart-unmanaged-pods branch February 12, 2021 09:08
@joestringer moved this from In progress to Done in 1.10.0 Feb 12, 2021
@maintainer-s-little-helper bot added this to Needs backport from master in 1.9.5 Feb 18, 2021
@maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.5 Feb 18, 2021
@maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.5 Feb 22, 2021