CI: ConformanceGKE: Timeout on cilium-test deletion #22368

Closed
pchaigno opened this issue Nov 25, 2022 · 11 comments
Labels
area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!), sig/agent (Cilium agent related.)

Comments

pchaigno (Member) commented Nov 25, 2022

ConformanceGKE runs are failing on master with a 1h15min timeout on:

⌛ Waiting for cilium-test namespace to be terminated...
Error: The operation was canceled.

I don't know the root cause, but it can probably be worked around by deleting the pods in the namespace instead of the namespace itself (the sysdump shows that the namespace's pods are already gone anyway) and by using a different test namespace for each run of the connectivity tests.
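
A rough sketch of the first part of that workaround, assuming the cleanup runs as a shell step in the workflow (cilium-test is cilium-cli's default test namespace; the exact wiring depends on how the workflow is structured):

# Delete the test workloads directly instead of waiting for the namespace
# object itself to finish terminating.
kubectl delete pods --all -n cilium-test

# Optionally still request namespace deletion, but without blocking on it.
kubectl delete namespace cilium-test --wait=false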

pchaigno added the area/CI and ci/flake labels Nov 25, 2022
aanm added the sig/agent label Nov 25, 2022
aanm (Member) commented Nov 25, 2022

tklauser added a commit to cilium/cilium-cli that referenced this issue Nov 28, 2022
This should help to work around an issue we're seeing in cilium/cilium
CI where deleting the cilium-test namespace times out, see
cilium/cilium#22368

Suggested-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
tklauser (Member) commented:

cilium/cilium-cli#1237 should fix/work around this in cilium-cli. Once this is merged and a new cilium-cli version is released, bumping cilium_cli_version in the GKE workflows should do the trick.

aanm (Member) commented Nov 30, 2022

Fixed by #22441

aanm closed this as completed Nov 30, 2022
pchaigno (Member, Author) commented Dec 5, 2022

Seems cilium/cilium-cli#1237 wasn't enough to fix it. It's still happening. E.g.:

pchaigno reopened this Dec 5, 2022
tklauser (Member) commented Dec 9, 2022

Looking at a recent failure on master (sysdump), we see that all pods in the cilium-test namespace got deleted:

% yq '.items[].metadata.namespace | select(. == "cilium-test")' k8s-pods-20221208-033953.yaml
%

Looking at the cilium-test namespace in k8s-namespaces-20221208-033953.yaml we see:

% yq '.items[] | select(.metadata.name == "cilium-test") | .status' k8s-namespaces-20221208-033953.yaml
conditions:
  - lastTransitionTime: "2022-12-08T02:50:56Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2022-12-08T02:51:02Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2022-12-08T02:51:02Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2022-12-08T02:51:19Z"
    message: All content successfully removed
    reason: ContentRemoved
    status: "False"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-12-08T02:51:02Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
phase: Terminating

The first condition looks suspicious. The namespace controller has to discover every API group before it can confirm the namespace is empty, so a group whose discovery keeps failing can leave the namespace stuck in Terminating:

% yq '.items[] | select(.metadata.name == "cilium-test") | .status.conditions[0]' k8s-namespaces-20221208-033953.yaml
lastTransitionTime: "2022-12-08T02:50:56Z"
message: 'Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request'
reason: DiscoveryFailed
status: "True"
type: NamespaceDeletionDiscoveryFailure

We seem to have hit this before, e.g. in cilium/cilium-cli#255 (comment).

Googling for "Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request" suggests this could be caused by a dangling Kubernetes APIService, see e.g. https://stackoverflow.com/questions/68305654/unable-to-delete-kubernetes-namespace-removing-finalizers-fails. Unfortunately, we're not capturing APIServices as part of the sysdump, so there's currently no way to verify that theory.
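
With access to the cluster at the time of the failure, a quick way to check that theory would be along these lines (sketch; v1beta1.metrics.k8s.io is the name the metrics-server APIService usually registers under):

# List aggregated API services; a dangling one shows Available=False.
kubectl get apiservices

# Inspect the metrics APIService in detail.
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# Probe the aggregated API directly; this fails if the kube-apiserver
# cannot reach the backing metrics-server.
kubectl get --raw /apis/metrics.k8s.io/v1beta1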

pchaigno (Member, Author) commented Dec 9, 2022

This looks oddly similar to the issues you get on EKS when you try to run with tunnel mode enabled (cc @bmcustodio).
In that case, the kube-apiserver can't reach the metrics-server pod because that pod runs in the overlay. I've checked, and it looks like this flake always happens after testing Cilium on GKE with tunnel mode.
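
A quick way to sanity-check that on an affected cluster (sketch; the label selector assumes GKE's managed metrics-server deployment in kube-system):

# Show the metrics-server pod and its IP; a pod (overlay) IP here means the
# kube-apiserver has to traverse the tunnel to reach it when it performs
# metrics.k8s.io discovery during namespace deletion.
kubectl -n kube-system get pods -l k8s-app=metrics-server -o wide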

tklauser added a commit that referenced this issue Dec 15, 2022
Signed-off-by: Tobias Klauser <tobias@cilium.io>
squeed (Contributor) commented Feb 21, 2023

pchaigno (Member, Author) commented:

@tklauser Are you still looking into this?

tklauser (Member) commented:

> @tklauser Are you still looking into this?

Currently lacking cycles and ideas on how to proceed, so I'm not actively looking into this. I'm going to unassign myself for now.

tklauser removed their assignment Mar 29, 2023
pchaigno (Member, Author) commented:

For whoever looks into this next, this is probably a good way to mitigate (without fixing):

> using a different test namespace for each run of the connectivity tests.

We would also need to not block on the namespace deletion attempt.

cc @brlbil
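
A sketch of how that mitigation could look in the workflow (assumptions: the --test-namespace flag from cilium-cli's connectivity test options and a per-run suffix such as the GitHub Actions run ID; dropping --wait from the uninstall step is what stops the job from blocking on namespace teardown):

# Give each run its own test namespace, so a namespace stuck in Terminating
# from a previous run cannot block the next one.
cilium connectivity test --test-namespace "cilium-test-${GITHUB_RUN_ID}"

# Uninstall without --wait so a hung namespace deletion does not consume
# the job's timeout.
cilium uninstall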

brlbil added a commit that referenced this issue Apr 5, 2023
This commit mitigates a workflow flake on GKE with tunnel installation
until issue #22368 is fixed. For the tunnel test, a dedicated test
namespace is added and the --wait option is removed from uninstall.

Signed-off-by: Birol Bilgin <birol@cilium.io>
squeed pushed a commit that referenced this issue Apr 5, 2023
pchaigno (Member, Author) commented:

#24755 fixed this.

michi-covalent pushed a commit to michi-covalent/cilium that referenced this issue May 30, 2023