CI: Conformance AKS - Hubble Relay CrashLoopBackOff #30905

Closed
giorio94 opened this issue Feb 22, 2024 · 1 comment · Fixed by #31504
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

giorio94 (Member) commented Feb 22, 2024

CI failure

Hit on #30808
Link: https://github.com/cilium/cilium/actions/runs/8005553206/job/21865336218
Sysdump: cilium-sysdump-final-1.27-eastus2-3-true.zip

NAMESPACE     NAME                                  READY   STATUS             RESTARTS      AGE    IP           NODE                                NOMINATED NODE   READINESS GATES
kube-system   cilium-fb6kk                          1/1     Running            0             10m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   cilium-jc2gz                          1/1     Running            0             10m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   cilium-node-init-mrp94                1/1     Running            0             10m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   cilium-node-init-xxn6n                1/1     Running            0             10m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   cilium-operator-688c886c98-pvwjj      1/1     Running            0             10m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   cloud-node-manager-fkm8h              1/1     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   cloud-node-manager-kmtzk              1/1     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   coredns-789789675-g886l               1/1     Running            0             9m7s   10.0.1.48    aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   coredns-789789675-qkh75               1/1     Running            0             12m    10.0.0.155   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   coredns-autoscaler-649b947bbd-7vm64   1/1     Running            0             12m    10.0.0.73    aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   csi-azuredisk-node-4x9zs              3/3     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   csi-azuredisk-node-qddjr              3/3     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   csi-azurefile-node-5qmm8              3/3     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   csi-azurefile-node-d247d              3/3     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   hubble-relay-5955499965-rn6rt         0/1     CrashLoopBackOff   6 (43s ago)   10m    10.0.0.10    aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   konnectivity-agent-6745784f5c-2fqbd   1/1     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   konnectivity-agent-6745784f5c-6bvvv   1/1     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   kube-proxy-kjh9t                      1/1     Running            0             12m    10.224.0.5   aks-nodepool1-40756659-vmss000000   <none>           <none>
kube-system   kube-proxy-wt59z                      1/1     Running            0             12m    10.224.0.4   aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   metrics-server-5955767688-q4x2h       2/2     Running            0             6m8s   10.0.1.22    aks-nodepool1-40756659-vmss000001   <none>           <none>
kube-system   metrics-server-5955767688-rsnjm       2/2     Running            0             6m8s   10.0.1.161   aks-nodepool1-40756659-vmss000001   <none>           <none>
2024-02-22T14:05:35.832568732Z level=info msg="Starting gRPC health server..." addr=":4222" subsys=hubble-relay
2024-02-22T14:05:35.833805450Z level=info msg="Starting gRPC server..." options="{peerTarget:hubble-peer.kube-system.svc.cluster.local:443 dialTimeout:5000000000 retryTimeout:30000000000 listenAddress::4245 healthListenAddress::4222 metricsListenAddress: log:0xc0002a7ea0 serverTLSConfig:<nil> insecureServer:true clientTLSConfig:0xc0005c7020 clusterName:cilium-cilium-8005553206-1-3 insecureClient:false observerOptions:[0x1f72d80 0x1f72e60] grpcMetrics:<nil> grpcUnaryInterceptors:[] grpcStreamInterceptors:[]}" subsys=hubble-relay
2024-02-22T14:05:40.834757497Z level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local:443"
2024-02-22T14:06:15.835975013Z level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local:443"
2024-02-22T14:06:35.994127079Z level=info msg="Stopping server..." subsys=hubble-relay
2024-02-22T14:06:35.994231981Z level=info msg="Server stopped" subsys=hubble-relay
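
For context on the error above: the options line reports dialTimeout:5000000000 (5s). A minimal sketch of that pattern, assuming a plain gRPC client rather than Relay's actual peer client (which uses TLS and its own retry logic), produces the same "context deadline exceeded" when the peer service cannot be resolved or reached within the deadline:

```go
// Hedged sketch, not Hubble Relay's actual code: block on a dial to the
// hubble-peer service with a 5s deadline, the dialTimeout shown above.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The real client uses TLS (clientTLSConfig is non-nil above); insecure
	// credentials are used here only to keep the sketch self-contained.
	_, err := grpc.DialContext(ctx, "hubble-peer.kube-system.svc.cluster.local:443",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(), // block until connected or the context expires
	)
	if err != nil {
		// If the name never resolves or the endpoint is unreachable, this
		// logs "context deadline exceeded", matching the warnings above.
		log.Printf("failed to create peer client: %v", err)
	}
}
```
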
giorio94 added the area/CI and ci/flake labels on Feb 22, 2024
bimmlerd (Member) commented Mar 19, 2024

We stared at the sysdump for a bit and there's one thing which stands out:

The cilium agent cilium-jc2gz (on node vmss000000) at some point logs the following in logs-cilium-jc2gz-cilium-agent-20240222-140721.log:

2024-02-22T13:58:03.366816511Z level=debug msg="Allocated random IP" ip=10.0.0.10 owner=kube-system/hubble-relay-5955499965-rn6rt pool=default subsys=ipam

That is, the hubble-relay pod gets 10.0.0.10 as its pod IP (also visible in the pod listing above). Now, by itself that's reasonably innocuous, but the funny thing is that 10.0.0.10 is also the kube-dns service ClusterIP:

- metadata:
    [...]
    name: kube-dns
    namespace: kube-system
    resourceVersion: "397"
    uid: 299872cf-b929-4ca3-aed4-f5413fdd08c7
  spec:
    clusterIP: 10.0.0.10
    clusterIPs:
    - 10.0.0.10
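
For illustration, a minimal sketch of the collision; the pod CIDR below is an assumption inferred from the 10.0.0.x addresses on vmss000000, and this is not Cilium's IPAM code:

```go
// Hedged sketch: the address handed to hubble-relay is a valid pod IP on this
// node and at the same time the kube-dns ClusterIP from the service spec above.
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	podCIDR := netip.MustParsePrefix("10.0.0.0/24") // assumed pod CIDR for vmss000000
	relayIP := netip.MustParseAddr("10.0.0.10")     // IP allocated to hubble-relay by the agent
	kubeDNS := netip.MustParseAddr("10.0.0.10")     // ClusterIP of kube-dns from the spec above

	fmt.Println(podCIDR.Contains(relayIP)) // true: a legitimate pod IP on this node
	fmt.Println(relayIP == kubeDNS)        // true: it shadows the kube-dns ClusterIP
}
```
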

We found this by looking at the Hubble flows; one enlightening one is the following, which is simultaneously addressed to the pod hubble-relay and to the service kube-dns :D

{
  "flow": {
    "time": "2024-02-22T14:07:20.238201757Z",
    "uuid": "224850fa-4040-4f5a-abf2-a14f1209856d",
    "verdict": "FORWARDED",
    "ethernet": {
      "source": "2a:d7:bd:fa:dd:e7",
      "destination": "5a:72:03:b3:0b:94"
    },
    "IP": {
      "source": "10.0.1.22",
      "destination": "10.0.0.10",
      "ipVersion": "IPv4"
    },
    "l4": {
      "UDP": {
        "source_port": 56367,
        "destination_port": 53
      }
    },
    "source": {
      "ID": 1416,
      "identity": 6335,
      "namespace": "kube-system",
      "labels": [
        "k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile",
        "k8s:io.cilium.k8s.namespace.labels.control-plane=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system",
        "k8s:io.cilium.k8s.policy.cluster=cilium-cilium-8005553206-1-3",
        "k8s:io.cilium.k8s.policy.serviceaccount=metrics-server",
        "k8s:io.kubernetes.pod.namespace=kube-system",
        "k8s:k8s-app=metrics-server",
        "k8s:kubernetes.azure.com/managedby=aks"
      ],
      "pod_name": "metrics-server-5955767688-q4x2h",
      "workloads": [
        {
          "name": "metrics-server",
          "kind": "Deployment"
        }
      ]
    },
    "destination": {
      "identity": 52431,
      "namespace": "kube-system",
      "labels": [
        "k8s:app.kubernetes.io/name=hubble-relay",
        "k8s:app.kubernetes.io/part-of=cilium",
        "k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile",
        "k8s:io.cilium.k8s.namespace.labels.control-plane=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true",
        "k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system",
        "k8s:io.cilium.k8s.policy.cluster=cilium-cilium-8005553206-1-3",
        "k8s:io.cilium.k8s.policy.serviceaccount=hubble-relay",
        "k8s:io.kubernetes.pod.namespace=kube-system",
        "k8s:k8s-app=hubble-relay"
      ],
      "pod_name": "hubble-relay-5955499965-rn6rt"
    },
    "Type": "L3_L4",
    "node_name": "cilium-cilium-8005553206-1-3/aks-nodepool1-40756659-vmss000001",
    "event_type": {
      "type": 4,
      "sub_type": 5
    },
    "destination_service": {
      "name": "kube-dns",
      "namespace": "kube-system"
    },
    "trace_observation_point": "FROM_ENDPOINT",
    "Summary": "UDP"
  },
  "node_name": "cilium-cilium-8005553206-1-3/aks-nodepool1-40756659-vmss000001",
  "time": "2024-02-22T14:07:20.238201757Z"
}
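
For completeness, a hedged diagnostic sketch (plain client-go against a kubeconfig; not part of Cilium or this CI job) that flags exactly this condition, a pod whose IP coincides with some service's ClusterIP:

```go
// Hedged diagnostic sketch (not Cilium code): list all services and pods and
// report any pod whose IP equals a service ClusterIP. Assumes a kubeconfig at
// the default location.
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()

	// Index every ClusterIP so pod IPs can be checked against them.
	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	clusterIPs := map[string]string{}
	for _, s := range svcs.Items {
		for _, ip := range s.Spec.ClusterIPs {
			if ip != "" && ip != "None" {
				clusterIPs[ip] = s.Namespace + "/" + s.Name
			}
		}
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		if svc, ok := clusterIPs[p.Status.PodIP]; ok {
			fmt.Printf("collision: pod %s/%s has IP %s, which is also the ClusterIP of %s\n",
				p.Namespace, p.Name, p.Status.PodIP, svc)
		}
	}
}
```
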
