
CI: Installation and Conformance Test (IPv4) #29315

Closed
learnitall opened this issue Nov 22, 2023 · 1 comment
Assignees: mhofstetter
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments

@learnitall (Contributor)

CI failure

Lots of sub-tests failed in this workflow.

Run URL

https://github.com/cilium/cilium/actions/runs/6950658033

Zip Files

kind-logs.zip

(The sysdump was too big to upload; please see the run URL to download it.)

Log Output

2023-11-21T23:28:00.0809178Z ------------------------------
2023-11-21T23:28:00.0810163Z • [FAILED] [240.346 seconds]
2023-11-21T23:28:00.0811630Z [sig-node] Probing container [It] should *not* be restarted with a /healthz http liveness probe [NodeConformance] [Conformance]
2023-11-21T23:28:00.0813230Z test/e2e/common/node/container_probe.go:214
2023-11-21T23:28:00.0813740Z 
2023-11-21T23:28:00.0813983Z   Timeline >>
2023-11-21T23:28:00.0814804Z   STEP: Creating a kubernetes client @ 11/21/23 23:23:59.671
2023-11-21T23:28:00.0815983Z   Nov 21 23:23:59.671: INFO: >>> kubeConfig: /home/runner/work/cilium/cilium/_artifacts/kubeconfig.conf
2023-11-21T23:28:00.0817524Z   STEP: Building a namespace api object, basename container-probe @ 11/21/23 23:23:59.672
2023-11-21T23:28:00.0819052Z   STEP: Waiting for a default service account to be provisioned in namespace @ 11/21/23 23:23:59.752
2023-11-21T23:28:00.0820479Z   STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 11/21/23 23:23:59.756
2023-11-21T23:28:00.0821985Z   STEP: Creating pod test-webserver-cfef5e44-a482-44a2-a2ab-1c62f4c90c47 in namespace container-probe-1676 @ 11/21/23 23:23:59.76
2023-11-21T23:28:00.0823160Z   Nov 21 23:27:59.802: INFO: Failed inside E2E framework:
2023-11-21T23:28:00.0824532Z       k8s.io/kubernetes/test/e2e/framework/pod.WaitForPodCondition({0x7faa284f3498, 0xc004fd2ea0}, {0x77d9a80?, 0xc005606340?}, {0xc0049b2570, 0x14}, {0xc005b26680, 0x33}, {0x6fd1346, 0x15}, ...)
2023-11-21T23:28:00.0825752Z       	test/e2e/framework/pod/wait.go:227 +0x25f
2023-11-21T23:28:00.0826798Z       k8s.io/kubernetes/test/e2e/common/node.runLivenessTest({0x7faa284f3498, 0xc004fd2ea0}, 0xc0003b9c20, 0xc004047680, 0x0, 0x37e11d6000, {0x6faaf7b, 0xe})
2023-11-21T23:28:00.0828052Z       	test/e2e/common/node/container_probe.go:1730 +0x2f0
2023-11-21T23:28:00.0829077Z       k8s.io/kubernetes/test/e2e/common/node.RunLivenessTest({0x7faa284f3498, 0xc004fd2ea0}, 0x6faaf7b?, 0xc004047680, 0x50?, 0x3ecbda7?)
2023-11-21T23:28:00.0830024Z       	test/e2e/common/node/container_probe.go:1707 +0xce
2023-11-21T23:28:00.0831210Z       k8s.io/kubernetes/test/e2e/common/node.glob..func2.9({0x7faa284f3498, 0xc004fd2ea0})
2023-11-21T23:28:00.0831932Z       	test/e2e/common/node/container_probe.go:222 +0x12c
2023-11-21T23:28:00.0832801Z   [FAILED] in [It] - test/e2e/common/node/container_probe.go:1730 @ 11/21/23 23:27:59.802
2023-11-21T23:28:00.0834306Z   Nov 21 23:27:59.802: INFO: Waiting up to 7m0s for all (but 0) nodes to be ready
2023-11-21T23:28:27.4707356Z • [FAILED] [302.331 seconds]
2023-11-21T23:28:27.4708302Z [sig-apps] Deployment [It] should validate Deployment Status endpoints [Conformance]
2023-11-21T23:28:27.6752759Z   [FAILED] error waiting for deployment "test-deployment-kqwpn" status to match expectation: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:1, UpdatedReplicas:1, ReadyReplicas:0, AvailableReplicas:0, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:time.Date(2023, time.November, 21, 23, 23, 25, 0, time.Local), LastTransitionTime:time.Date(2023, time.November, 21, 23, 23, 25, 0, time.Local), Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:time.Date(2023, time.November, 21, 23, 23, 25, 0, time.Local), LastTransitionTime:time.Date(2023, time.November, 21, 23, 23, 25, 0, time.Local), Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"test-deployment-kqwpn-5d576bd769\" is progressing."}}, CollisionCount:(*int32)(nil)}
2023-11-21T23:28:29.7196963Z   [FAILED] in [It] - test/e2e/framework/pod/resource.go:369 @ 11/21/23 23:28:29.516
2023-11-21T23:28:29.7198145Z   Nov 21 23:28:29.516: INFO: Waiting up to 7m0s for all (but 0) nodes to be ready
2023-11-21T23:28:30.5586570Z           msg: "Timed out after 300.000s.\nExpected Pod to be in <v1.PodPhase>: \"Running\"\nGot instead:\n    <*v1.Pod | 0xc002290d80>: \n        metadata:\n          creationTimestamp: \"2023-11-21T23:23:30Z\"\n          generateName: verify-service-up-exec-pod-\n          managedFields:\n          - apiVersion: v1\n            fieldsType: FieldsV1\n            fieldsV1:\n              f:metadata:\n                f:generateName: {}\n              f:spec:\n                f:containers:\n                  k:{\"name\":\"agnhost-container\"}:\n                    .: {}\n                    f:args: {}\n                    f:image: {}\n                    f:imagePullPolicy: {}\n                    f:name: {}\n                    f:resources: {}\n                    f:securityContext: {}\n                    f:terminationMessagePath: {}\n                    f:terminationMessagePolicy: {}\n                f:dnsPolicy: {}\n                f:enableServiceLinks: {}\n                f:restartPolicy: {}\n                f:schedulerName: {}\n                f:securityContext: {}\n                f:terminationGracePeriodSeconds: {}\n            manager: e2e.test\n            operation: Update\n            time: \"2023-11-21T23:23:30Z\"\n          - apiVersion: v1\n            fieldsType: FieldsV1\n            fieldsV1:\n              f:status:\n                f:conditions:\n                  k:{\"type\":\"ContainersReady\"}:\n                    .: {}\n                    f:lastProbeTime: {}\n                    f:lastTransitionTime: {}\n                    f:message: {}\n                    f:reason: {}\n                    f:status: {}\n                    f:type: {}\n                  k:{\"type\":\"Initialized\"}:\n                    .: {}\n                    f:lastProbeTime: {}\n                    f:lastTransitionTime: {}\n                    f:status: {}\n                    f:type: {}\n                  k:{\"type\":\"Ready\"}:\n                    .: {}\n                    f:lastProbeTime: {}\n                    f:lastTransitionTime: {}\n                    f:message: {}\n                    f:reason: {}\n                    f:status: {}\n                    f:type: {}\n                f:containerStatuses: {}\n                f:hostIP: {}\n                f:startTime: {}\n            manager: kubelet\n            operation: Update\n            subresource: status\n            time: \"2023-11-21T23:23:30Z\"\n          name: verify-service-up-exec-pod-n6td7\n          namespace: services-6757\n          resourceVersion: \"12219\"\n          uid: 7cdd0509-27e6-447f-b507-0695cfca0ac2\n        spec:\n          containers:\n          - args:\n            - pause\n            image: registry.k8s.io/e2e-test-images/agnhost:2.45\n            imagePullPolicy: IfNotPresent\n            name: agnhost-container\n            resources: {}\n            securityContext: {}\n            terminationMessagePath: /dev/termination-log\n            terminationMessagePolicy: File\n            volumeMounts:\n            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount\n              name: kube-api-access-bgdmn\n              readOnly: true\n          dnsPolicy: ClusterFirst\n          enableServiceLinks: true\n          nodeName: cilium-testing-worker2\n          preemptionPolicy: PreemptLowerPriority\n          priority: 0\n          restartPolicy: Always\n          schedulerName: default-scheduler\n          securityContext: {}\n          serviceAccount: default\n          
serviceAccountName: default\n          terminationGracePeriodSeconds: 0\n          tolerations:\n          - effect: NoExecute\n            key: node.kubernetes.io/not-ready\n            operator: Exists\n            tolerationSeconds: 300\n          - effect: NoExecute\n            key: node.kubernetes.io/unreachable\n            operator: Exists\n            tolerationSeconds: 300\n          volumes:\n          - name: kube-api-access-bgdmn\n            projected:\n              defaultMode: 420\n              sources:\n              - serviceAccountToken:\n                  expirationSeconds: 3607\n                  path: token\n              - configMap:\n                  items:\n                  - key: ca.crt\n                    path: ca.crt\n                  name: kube-root-ca.crt\n              - downwardAPI:\n                  items:\n                  - fieldRef:\n                      apiVersion: v1\n                      fieldPath: metadata.namespace\n                    path: namespace\n        status:\n          conditions:\n          - lastProbeTime: null\n            lastTransitionTime: \"2023-11-21T23:23:30Z\"\n            status: \"True\"\n            type: Initialized\n          - lastProbeTime: null\n            lastTransitionTime: \"2023-11-21T23:23:30Z\"\n            message: 'containers with unready status: [agnhost-container]'\n            reason: ContainersNotReady\n            status: \"False\"\n            type: Ready\n          - lastProbeTime: null\n            lastTransitionTime: \"2023-11-21T23:23:30Z\"\n            message: 'containers with unready status: [agnhost-container]'\n            reason: ContainersNotReady\n            status: \"False\"\n            type: ContainersReady\n          - lastProbeTime: null\n            lastTransitionTime: \"2023-11-21T23:23:30Z\"\n            status: \"True\"\n            type: PodScheduled\n          containerStatuses:\n          - image: registry.k8s.io/e2e-test-images/agnhost:2.45\n            imageID: \"\"\n            lastState: {}\n            name: agnhost-container\n            ready: false\n            restartCount: 0\n            started: false\n            state:\n              waiting:\n                reason: ContainerCreating\n          hostIP: 172.18.0.3\n          phase: Pending\n          qosClass: BestEffort\n          startTime: \"2023-11-21T23:23:30Z\"",
2023-11-21T23:28:30.5607902Z           fullStackTrace: "k8s.io/kubernetes/test/e2e/framework/pod.WaitTimeoutForPodRunningInNamespace({0x7f0f2c162c58, 0xc0014e0990}, {0x77d9a80?, 0xc0030d8340?}, {0xc00007e780, 0x20}, {0xc004dfe0a3, 0xd}, 0x0?)\n\ttest/e2e/framework/pod/wait.go:459 +0x1a4\nk8s.io/kubernetes/test/e2e/framework/pod.WaitForPodNameRunningInNamespace(...)\n\ttest/e2e/framework/pod/wait.go:443\nk8s.io/kubernetes/test/e2e/framework/pod.CreateExecPodOrFail({0x7f0f2c162c58, 0xc0014e0990}, {0x77d9a80?, 0xc0030d8340}, {0xc001696180, 0xd}, {0x6ffc5f7, 0x1b}, 0x0)\n\ttest/e2e/framework/pod/resource.go:368 +0x2a6\nk8s.io/kubernetes/test/e2e/network.verifyServeHostnameServiceUp({0x7f0f2c162c58, 0xc0014e0990}, {0x77d9a80, 0xc0030d8340}, {0xc001696180, 0xd}, {0xc0017fc060?, 0x3, 0x3}, {0xc001749260, ...}, ...)\n\ttest/e2e/network/service.go:335 +0xf5\nk8s.io/kubernetes/test/e2e/network.glob..func27.8({0x7f0f2c162c58, 0xc0014e0990})\n\ttest/e2e/network/service.go:1156 +0x9b0",
2023-11-21T23:29:11.8893330Z   Nov 21 23:29:11.706: INFO: At 2023-11-21 23:27:20 +0000 UTC - event for netserver-1: {kubelet cilium-testing-worker2} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "78e3597366566df22a8efc6e045ce87103fcd89d0aecffa297d595020896777a": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests 
@learnitall added the area/CI and ci/flake labels on Nov 22, 2023
@mhofstetter (Member)

Logs of the corresponding sysdumps contain many putEndpointIdTooManyRequests / "Unable to create endpoint" messages. The CNI is unable to create endpoints due to a deadlock on the agent side.

1 occurrence. Sample stack trace:
sync.runtime_SemacquireRWMutexR(0x0?, 0x56?, 0x1?)
        /usr/local/go/src/runtime/sema.go:82 +0x25
sync.(*RWMutex).RLock(...)
        /usr/local/go/src/sync/rwmutex.go:71
github.com/cilium/cilium/pkg/endpointmanager.(*endpointManager).GetEndpoints(0xc000b4c140)
        ./pkg/endpointmanager/manager.go:589 +0x45
github.com/cilium/cilium/pkg/auth.(*authMapGarbageCollector).cleanupEndpoints(0xc001128a20, {0x4?, 0x392f2f4?})
        ./pkg/auth/authmap_gc.go:427 +0x97
github.com/cilium/cilium/pkg/auth.(*authMapGarbageCollector).EndpointDeleted(0xc001128a20, 0xc000cba000?, {0x58?, 0x52?})
        ./pkg/auth/authmap_gc.go:416 +0x26
github.com/cilium/cilium/pkg/endpointmanager.(*endpointManager).removeEndpoint(0xc000b4c140, 0x0?, {0x0?, 0x0?})
        ./pkg/endpointmanager/manager.go:419 +0xe3
github.com/cilium/cilium/pkg/endpointmanager.(*endpointManager).RemoveEndpoint(0xc0004fa2a0?, 0xc002284e40?, {0xe0?, 0xf2?})
        ./pkg/endpointmanager/manager.go:429 +0x1e
github.com/cilium/cilium/daemon/cmd.(*Daemon).deleteEndpointQuiet(...)
        ./daemon/cmd/endpoint.go:707
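
For illustration only: the hang is consistent with a re-entrant sync.RWMutex acquisition, where the goroutine already inside removeEndpoint blocks when the deleted-endpoint subscriber calls back into GetEndpoints, which tries to take a read lock on the same mutex. Below is a minimal, self-contained sketch of that pattern, assuming removeEndpoint notifies subscribers while still holding the manager's write lock; all names are hypothetical and this is not Cilium code.

// Minimal sketch (not Cilium code): a goroutine that holds an RWMutex write
// lock and then re-enters the same structure through a subscriber callback
// taking a read lock parks in RLock, just like the trace above.
package main

import (
	"fmt"
	"sync"
)

type endpointManager struct {
	mu        sync.RWMutex
	endpoints map[int]string
	// Subscribers are notified synchronously on endpoint deletion,
	// comparable to authMapGarbageCollector.EndpointDeleted in the trace.
	subscribers []func()
}

// getEndpoints takes the read lock, like endpointManager.GetEndpoints.
func (m *endpointManager) getEndpoints() []string {
	m.mu.RLock() // blocks forever if this goroutine already holds the write lock
	defer m.mu.RUnlock()
	out := make([]string, 0, len(m.endpoints))
	for _, ep := range m.endpoints {
		out = append(out, ep)
	}
	return out
}

// removeEndpoint notifies subscribers while still holding the write lock
// (the assumption behind this sketch), so any subscriber that calls back
// into getEndpoints self-deadlocks.
func (m *endpointManager) removeEndpoint(id int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.endpoints, id)
	for _, notify := range m.subscribers {
		notify()
	}
}

func main() {
	m := &endpointManager{endpoints: map[int]string{1: "ep-1"}}
	m.subscribers = append(m.subscribers, func() {
		fmt.Println("remaining endpoints:", m.getEndpoints())
	})
	// Running this hangs; the Go runtime eventually reports
	// "fatal error: all goroutines are asleep - deadlock!".
	m.removeEndpoint(1)
}

With the agent stuck like this, subsequent endpoint create/delete requests queue up behind the held lock, which would be consistent with the CNI receiving the 429 putEndpointIdTooManyRequests responses seen in the log output above.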

This is related to issue #29078 and has been fixed by PR #29082.

-> Let's close this one.

@mhofstetter self-assigned this on Dec 7, 2023