CI: Conformance AKS - Hubble Relay CrashLoopBackOff #30905

Closed
giorio94 opened this issue Feb 22, 2024 · 1 comment · Fixed by #31504
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

giorio94 (Member) commented Feb 22, 2024

CI failure

Hit on #30808
Link: https://github.com/cilium/cilium/actions/runs/8005553206/job/21865336218
Sysdump: cilium-sysdump-final-1.27-eastus2-3-true.zip

NAMESPACE     NAME                                  READY   STATUS             RESTARTS      AGE    IP           NODE                                NOMINATED NODE   READINESS GATES
kube-system   cilium-fb6kk                          1/1     Running            0             10m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   cilium-jc2gz                          1/1     Running            0             10m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   cilium-node-init-mrp94                1/1     Running            0             10m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   cilium-node-init-xxn6n                1/1     Running            0             10m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   cilium-operator-688c886c98-pvwjj      1/1     Running            0             10m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   cloud-node-manager-fkm8h              1/1     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   cloud-node-manager-kmtzk              1/1     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   coredns-789789675-g886l               1/1     Running            0             9m7s   10.0.1.48    aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   coredns-789789675-qkh75               1/1     Running            0             12m    10.0.0.155   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   coredns-autoscaler-649b947bbd-7vm64   1/1     Running            0             12m    10.0.0.73    aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   csi-azuredisk-node-4x9zs              3/3     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   csi-azuredisk-node-qddjr              3/3     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   csi-azurefile-node-5qmm8              3/3     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   csi-azurefile-node-d247d              3/3     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   hubble-relay-5955499965-rn6rt         0/1     CrashLoopBackOff   6 (43s ago)   10m    10.0.0.10    aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   konnectivity-agent-6745784f5c-2fqbd   1/1     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   konnectivity-agent-6745784f5c-6bvvv   1/1     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   kube-proxy-kjh9t                      1/1     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   kube-proxy-wt59z                      1/1     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   metrics-server-5955767688-q4x2h       2/2     Running            0             6m8s   10.0.1.22    aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   metrics-server-5955767688-rsnjm       2/2     Running            0             6m8s   10.0.1.161   aks-nodepool1-40756659-vmss000001   <none>           <none>
2024-02-22T14:05:35.832568732Z level=info msg="Starting gRPC health server..." addr=":4222" subsys=hubble-relay
2024-02-22T14:05:35.833805450Z level=info msg="Starting gRPC server..." options="{peerTarget:hubble-peer.kube-system.svc.cluster.local:443 dialTimeout:5000000000 retryTimeout:30000000000 listenAddress::4245 healthListenAddress::4222 metricsListenAddress: log:0xc0002a7ea0 serverTLSConfig:<nil> insecureServer:true clientTLSConfig:0xc0005c7020 clusterName:cilium-cilium-8005553206-1-3 insecureClient:false observerOptions:[0x1f72d80 0x1f72e60] grpcMetrics:<nil> grpcUnaryInterceptors:[] grpcStreamInterceptors:[]}" subsys=hubble-relay
2024-02-22T14:05:40.834757497Z level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local:443"
2024-02-22T14:06:15.835975013Z level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local:443"
2024-02-22T14:06:35.994127079Z level=info msg="Stopping server..." subsys=hubble-relay
2024-02-22T14:06:35.994231981Z level=info msg="Server stopped" subsys=hubble-relay
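
For context on the error above: the options line reports dialTimeout:5000000000 (5s). A minimal sketch of that pattern, assuming a plain gRPC client rather than Relay's actual peer client (which uses TLS and its own retry logic), produces the same "context deadline exceeded" when the peer service cannot be resolved or reached within the deadline:

```go
// Hedged sketch, not Hubble Relay's actual code: block on a dial to the
// hubble-peer service with a 5s deadline, the dialTimeout shown above.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The real client uses TLS (clientTLSConfig is non-nil above); insecure
	// credentials are used here only to keep the sketch self-contained.
	_, err := grpc.DialContext(ctx, "hubble-peer.kube-system.svc.cluster.local:443",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(), // block until connected or the context expires
	)
	if err != nil {
		// If the name never resolves or the endpoint is unreachable, this
		// logs "context deadline exceeded", matching the warnings above.
		log.Printf("failed to create peer client: %v", err)
	}
}
```
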
giorio94 added the area/CI and ci/flake labels on Feb 22, 2024
bimmlerd (Member) commented Mar 19, 2024

We stared at the sysdump for a bit and there's one thing which stands out:

The cilium agent cilium-jc2gz (on node vmss000000) at some point logs the following in logs-cilium-jc2gz-cilium-agent-20240222-140721.log:

2024-02-22T13:58:03.366816511Z level=debug msg="Allocated random IP" ip=10.0.0.10 owner=kube-system/hubble-relay-5955499965-rn6rt pool=default subsys=ipam

That is, the hubble-relay pod gets 10.0.0.10 as its pod IP (also visible in the pod listing above). Now, by itself that's reasonably innocuous, but the funny thing is that 10.0.0.10 is also the kube-dns service ClusterIP:

- metadata:
    [...]
    name: kube-dns
    namespace: kube-system
    resourceVersion: "397"
    uid: 299872cf-b929-4ca3-aed4-f5413fdd08c7
  spec:
    clusterIP: 10.0.0.10
    clusterIPs:
    - 10.0.0.10
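
For illustration, a minimal sketch of the collision; the pod CIDR below is an assumption inferred from the 10.0.0.x addresses on vmss000000, and this is not Cilium's IPAM code:

```go
// Hedged sketch: the address handed to hubble-relay is a valid pod IP on this
// node and at the same time the kube-dns ClusterIP from the service spec above.
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	podCIDR := netip.MustParsePrefix("10.0.0.0/24") // assumed pod CIDR for vmss000000
	relayIP := netip.MustParseAddr("10.0.0.10")     // IP allocated to hubble-relay by the agent
	kubeDNS := netip.MustParseAddr("10.0.0.10")     // ClusterIP of kube-dns from the spec above

	fmt.Println(podCIDR.Contains(relayIP)) // true: a legitimate pod IP on this node
	fmt.Println(relayIP == kubeDNS)        // true: it shadows the kube-dns ClusterIP
}
```
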

We found this by looking at the Hubble flows; one enlightening one is the following, which is simultaneously addressed to the pod hubble-relay and to the service kube-dns :D

{
  "flow": {
    "time": "2024-02-22T14:07:20.238201757Z",
    "uuid": "224850fa-4040-4f5a-abf2-a14f1209856d",
    "verdict": "FORWARDED",
    "ethernet": {
      "source": "2a:d7:bd:fa:dd:e7",
      "destination": "5a:72:03:b3:0b:94"
    },
    "IP": {
      "source": "10.0.1.22",
      "destination": "10.0.0.10",
      "ipVersion": "IPv4"
    },
    "l4": {
      "UDP": {
        "source_port": 56367,
        "destination_port": 53
      }
    },
    "source": {
      "ID": 1416,
      "identity": 6335,
      "namespace": "kube-system",
      "labels": [
        "k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile",
        "k8s:io.cilium.k8s.namespace.labels.control-plane=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system",
        "k8s:io.cilium.k8s.policy.cluster=cilium-cilium-8005553206-1-3",
        "k8s:io.cilium.k8s.policy.serviceaccount=metrics-server",
        "k8s:io.kubernetes.pod.namespace=kube-system",
        "k8s:k8s-app=metrics-server",
        "k8s:kubernetes.azure.com/managedby=aks"
      ],
      "pod_name": "metrics-server-5955767688-q4x2h",
      "workloads": [
        {
          "name": "metrics-server",
          "kind": "Deployment"
        }
      ]
    },
    "destination": {
      "identity": 52431,
      "namespace": "kube-system",
      "labels": [
        "k8s:app.kubernetes.io/name=hubble-relay",
        "k8s:app.kubernetes.io/part-of=cilium",
        "k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile",
        "k8s:io.cilium.k8s.namespace.labels.control-plane=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system",
        "k8s:io.cilium.k8s.policy.cluster=cilium-cilium-8005553206-1-3",
        "k8s:io.cilium.k8s.policy.serviceaccount=hubble-relay",
        "k8s:io.kubernetes.pod.namespace=kube-system",
        "k8s:k8s-app=hubble-relay"
      ],
      "pod_name": "hubble-relay-5955499965-rn6rt"
    },
    "Type": "L3_L4",
    "node_name": "cilium-cilium-8005553206-1-3/aks-nodepool1-40756659-vmss000001",
    "event_type": {
      "type": 4,
      "sub_type": 5
    },
    "destination_service": {
      "name": "kube-dns",
      "namespace": "kube-system"
    },
    "trace_observation_point": "FROM_ENDPOINT",
    "Summary": "UDP"
  },
  "node_name": "cilium-cilium-8005553206-1-3/aks-nodepool1-40756659-vmss000001",
  "time": "2024-02-22T14:07:20.238201757Z"
}
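
For completeness, a hedged diagnostic sketch (plain client-go against a kubeconfig; not part of Cilium or this CI job) that flags exactly this condition, a pod whose IP coincides with some service's ClusterIP:

```go
// Hedged diagnostic sketch (not Cilium code): list all services and pods and
// report any pod whose IP equals a service ClusterIP. Assumes a kubeconfig at
// the default location.
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()

	// Index every ClusterIP so pod IPs can be checked against them.
	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	clusterIPs := map[string]string{}
	for _, s := range svcs.Items {
		for _, ip := range s.Spec.ClusterIPs {
			if ip != "" && ip != "None" {
				clusterIPs[ip] = s.Namespace + "/" + s.Name
			}
		}
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		if svc, ok := clusterIPs[p.Status.PodIP]; ok {
			fmt.Printf("collision: pod %s/%s has IP %s, which is also the ClusterIP of %s\n",
				p.Namespace, p.Name, p.Status.PodIP, svc)
		}
	}
}
```
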
