test: Avoid unnecessary restarts of unmanaged pods #14938
Conversation
In tests, after Cilium installations, we often restart unmanaged pods to ensure they are managed by Cilium. In particular, on GKE, we used to restart unmanaged pods from both the kube-system and the cilium namespaces. However, commit 48be458 ("test: GKE: Install Cilium in kube-system namespace") removed the cilium namespace to install Cilium in the kube-system namespace. One of the two calls to RestartUnmanagedPodsInNamespace is therefore unnecessary. In addition, we already restart pods in the namespace of the log-gatherer pods for all CI environments (vs. just GKE). If that last namespace is the kube-system namespace, then we don't need any call to RestartUnmanagedPodsInNamespace for GKE.

I expect this will fix flake #14915. In that flake, some pods are not found while attempting to restart unmanaged pods. The flake started appearing in master when we merged commit 48be458. The theory is that the two quick calls to RestartUnmanagedPodsInNamespace for the same namespace lead us, in the second call, to select pods that have already been restarted by the first call. Such pods may disappear between the time we select them and the time we actually execute 'kubectl delete', resulting in the error:

Error from server (NotFound): pods "kube-dns-66d6b7c877-dp4q2" not found

Fixes: 48be458 ("test: GKE: Install Cilium in kube-system namespace")
Fixes: #14915

Signed-off-by: Paul Chaignon <paul@cilium.io>
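To illustrate the race described above, here is a minimal, hypothetical Go sketch. It is not the actual cilium test helper (the real RestartUnmanagedPodsInNamespace lives in the test framework and only selects pods that are not yet managed by Cilium); the point is simply that "select pods" and "delete pods" are separate steps, so a second back-to-back call for the same namespace can select pods the first call already deleted.

```go
// Hypothetical sketch of restarting unmanaged pods as a select-then-delete
// sequence. Pod names and namespaces are illustrative only.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// restartUnmanagedPods deletes pods in ns so they are recreated under Cilium
// management. Listing and deleting are two separate kubectl invocations,
// which leaves a window during which a selected pod can disappear.
func restartUnmanagedPods(ns string) error {
	// Step 1: select candidate pods (simplified: list all pod names;
	// the real helper filters for unmanaged pods).
	out, err := exec.Command("kubectl", "-n", ns, "get", "pods",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		return fmt.Errorf("listing pods in %s: %w", ns, err)
	}

	// Step 2: delete them. A pod already deleted by a previous call to this
	// function may be gone by now, and kubectl then reports:
	//   Error from server (NotFound): pods "..." not found
	for _, pod := range strings.Fields(string(out)) {
		if err := exec.Command("kubectl", "-n", ns, "delete", "pod", pod).Run(); err != nil {
			return fmt.Errorf("deleting pod %s/%s: %w", ns, pod, err)
		}
	}
	return nil
}

func main() {
	// Calling this twice in quick succession for the same namespace is the
	// redundant pattern this PR removes for GKE.
	_ = restartUnmanagedPods("kube-system")
	_ = restartUnmanagedPods("kube-system")
}
```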
test-me-please
The fix looks reasonable to me, but it would be good to get a review from the ci-structure team.
test-gke
I thought about it, but I think the change makes sense regardless of whether it fixes the flake, so we might as well merge quickly rather than wait to validate that it does. We'll soon know anyway.
I agree.
Yup, makes sense. I hit this on my PR. If the re-run hits the flake again, I'll have to rebase to master once your fix is merged.
Ah! It appears the previous GKE run failed because it tried to parse a focus from your comment @aditighag 😆
https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4308/consoleFull
test-gke
Some error while cloning the git repo: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/4310/
test-gke
Previous failure at https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel/4690/testReport/junit/Suite-k8s-1/19/K8sServicesTest_Checks_service_across_nodes_Tests_NodePort_BPF_Tests_with_direct_routing_Tests_HostPort/.
test-4.19
GKE failed again with #14915.
This is ready to merge. The failure is the known flake #14915, and reviews are in.
In tests, after Cilium installations, we often restart unmanaged pods to ensure they are managed by Cilium. In particular, on GKE, we used to restart unmanaged pods from both the kube-system and the cilium namespaces. However, commit 48be458 ("test: GKE: Install Cilium in kube-system namespace") removed the cilium namespace to install Cilium in the kube-system namespace. One of the two calls to RestartUnmanagedPodsInNamespace() is therefore unnecessary. In addition, we already restart pods in the namespace of the log-gatherer pods for all CI environments (vs. just GKE). If that last namespace is the kube-system namespace, then we don't need any call to RestartUnmanagedPodsInNamespace() for GKE.

I expect this will fix flake #14915. In that flake, some pods are not found while attempting to restart unmanaged pods. The flake started appearing in master when we merged commit 48be458. The theory is that the two quick calls to RestartUnmanagedPodsInNamespace() for the same namespace lead us, in the second call, to select pods that have already been restarted by the first call. Such pods may disappear between the time we select them and the time we actually execute kubectl delete, resulting in the error:

Error from server (NotFound): pods "kube-dns-66d6b7c877-dp4q2" not found

Fixes: #14899