
Cilium 1.14.0 AKS BYOCNI - Connections from pods to IMDS randomly blocked #27536

Closed
eegseth opened this issue Aug 16, 2023 · 4 comments
Labels: kind/bug, kind/community-report, needs/triage

eegseth commented Aug 16, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After both a fresh installation of Cilium 1.14.0 and an upgrade from 1.13.2, some pods in some deployments are blocked when trying to connect to the IMDS IP (169.254.169.254:80).

Some deployments with several replicas run fine, while for other deployments only one replica is able to connect to the IMDS and the remaining replicas fail to connect.

Pods are evenly distributed across several nodes, and there is no apparent pattern in which nodes the working and the failing pods land on. A node can host both pods that connect and pods that fail to reach the IMDS at the same time (within the same deployment and across deployments).
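
For reference, a throwaway pod along the following lines is enough to check IMDS reachability from an arbitrary namespace (a sketch only; the pod name, namespace, and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: imds-test            # illustrative name
  namespace: default         # pick any namespace covered by the policy
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl # any image that ships curl works
      command: ["curl"]
      # Azure IMDS requires the "Metadata: true" header; the request goes to
      # 169.254.169.254:80, i.e. the destination the policy is meant to allow.
      args:
        - "-sS"
        - "--max-time"
        - "5"
        - "-H"
        - "Metadata: true"
        - "http://169.254.169.254/metadata/instance?api-version=2021-02-01"

The pod exits after the request; kubectl logs on it shows either the instance metadata JSON or the curl timeout error.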

We have a clusterwide policy in place which allows connections to the IMDS:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-cluster-np
spec:
  endpointSelector:
    matchExpressions:
    - key: k8s:io.kubernetes.pod.namespace
      operator: NotIn
      values: # No network policies in these namespaces for now, to avoid breaking AKS traffic
      - kube-system
      - kube-public
      - kube-node-lease
      - gatekeeper-system
  ingress:
    - fromEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
  egress:
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
          rules:
            dns:
              - matchPattern: "*"
    - toEntities:
        - host
      toPorts:
        - ports:
            - port: "443"
    - toEntities:
        - remote-node
      toPorts:
        - ports:
            - port: "443"
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: default
      toPorts:
        - ports:
            - port: "443"
    - toCIDRSet:
        - cidr: 10.0.0.1/32
      toPorts:
        - ports:
            - port: "443"
    - toCIDRSet:
        - cidr: 127.0.0.1/32
      toPorts:
        - ports:
            - port: "2579"
    - toCIDRSet:
        - cidr: 169.254.169.254/32
      toPorts:
        - ports:
          - port: "80"
    - toEndpoints:
        - matchLabels:
            'reserved:kube-apiserver': ''
      toPorts:
        - ports:
            - port: "443"

We only use allow-policies, no deny-policies.

The following shows a deployment where 2 of 3 replicas are running OK and the third is failing (the failing replica is on the same node as one of the running ones):
[screenshot of pod/deployment status]

The following is the output of hubble observe that shows the relevant traffic (different deployment, same issue):
[screenshot of hubble observe output]

We have tried restarting all Cilium DaemonSets/Deployments to no avail, and we have reproduced the issue in different AKS clusters.

Cilium Version

1.14.0 - AKS BYO CNI

Kernel Version

5.15.0-1042-azure

(AKS image version AKSUbuntu-2204gen2containerd-202307.27.0)

Kubernetes Version

1.27.3

Sysdump

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

eegseth commented Aug 22, 2023

We tried installing Cilium 1.14.1 in one of our clusters, and the issue seems to be resolved. We suspect it might have something to do with the following bug that was fixed in 1.14.1: #27327

It would be nice if someone with a bit more insight could have a look and see whether that PR could be related to this issue.

joestringer (Member) commented:

#27327 typically triggers ~10m after startup and starts to cause connectivity impact for traffic that is allowed by CIDR or ToFQDNs policy, so the policy you've pasted could be affected. It's mitigated by touching/updating that policy such that the agents pick it up again.
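
As a minimal sketch of that mitigation (assuming the allow-cluster-np policy from the original report), you can save the merge patch below to a file and apply it with kubectl patch ciliumclusterwidenetworkpolicy allow-cluster-np --type merge --patch-file <file>; the annotation key and the description text are arbitrary placeholders, and any edit to the policy object works:

# JSON merge patch (in YAML form) that makes a harmless change to the existing
# policy so the agents re-process it and recompute the CIDR/ToFQDNs mappings.
# Both values below are placeholders.
metadata:
  annotations:
    policy-resync: "2023-08-22"
spec:
  description: "re-applied to refresh CIDR policy on the agents"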

It's a good sign if v1.14.1 is no longer exhibiting the problem. I'll close this for now, but if you do observe this behaviour or another issue in future, feel free to comment so we can reopen this one, or file a new issue with the details.


opaetzel commented Sep 8, 2023

We are experiencing the same issue with Cilium 1.14.1, though we are using chaining mode with AWS ENI. The clusterwide policy looks as follows:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-infrastructure-defaults
spec:
  egress:
  - toPorts:
    - ports:
      - port: "6831"
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s:k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: '*'
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: linkerd
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: linkerd-jaeger
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: opentelemetry
  - toCIDR:
    - 169.254.169.254/32
  - toFQDNs:
    - matchName: collector.linkerd-jaeger.svc.cluster.local
  endpointSelector:
    matchLabels:
      k8s:io.cilium.k8s.namespace.labels.test/cilium-enabled: "true"

The CiliumEndpoint looks as follows:

apiVersion: cilium.io/v2
kind: CiliumEndpoint
metadata:
  labels:
    app: dsp-ops-backend
    linkerd.io/control-plane-ns: linkerd
    linkerd.io/proxy-deployment: redacted
    linkerd.io/workload-ns: redacted
    pod-template-hash: 677798ddb5
    release: redacted
    service: dsp-ops-backend
  name: redacted
  namespace: redacted
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: redacted
    uid: fb35cae7-fb8f-40b0-92fd-3ca703c771e4
  resourceVersion: "1151215601"
  uid: f47ab939-f7bb-4d11-acd4-1dce7529a0a6
status:
  encryption: {}
  external-identifiers:
    container-id: fefc5739620ba0ba947f26bc29bf9b103882281f37177285fca03b7cb8f3f10d
    k8s-namespace: redacted
    k8s-pod-name: redacted
    pod-name: redacted
  id: 78
  identity:
    id: 4463
    labels:
    - k8s:app=dsp-ops-backend
    - k8s:io.cilium.k8s.namespace.labels.test/cilium-enabled=true
    - k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=redacted
    - k8s:io.cilium.k8s.policy.cluster=default
    - k8s:io.cilium.k8s.policy.serviceaccount=redacted
    - k8s:io.kubernetes.pod.namespace=redacted
    - k8s:linkerd.io/control-plane-ns=linkerd
    - k8s:linkerd.io/proxy-deployment=redacted
    - k8s:linkerd.io/workload-ns=redacted
    - k8s:release=redacted
    - k8s:service=dsp-ops-backend
  named-ports:
  - name: http
    port: 8000
    protocol: TCP
  - name: linkerd-admin
    port: 4191
    protocol: TCP
  - name: linkerd-proxy
    port: 4143
    protocol: TCP
  networking:
    addressing:
    - ipv4: 172.20.133.186
      ipv6: fe80::f4e5:aff:feea:c210
    node: 172.20.132.39
  policy:
    egress:
      enforcing: false
      state: <status disabled>
    ingress:
      enforcing: false
      state: <status disabled>
  state: ready

The traffic going to 169.254.169.254 is marked as "AUDIT" by Cilium, but as far as I can see, it should go through.

Should I open a new issue or will we use this one?

joestringer (Member) commented:

@opaetzel I'd suggest opening a fresh issue. At a glance that might be more of a problem where the IP's scope (in-cluster vs external) is being considered in a way that excludes it from CIDR policy.
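
As a purely diagnostic sketch (not a confirmed fix; the policy name is illustrative and the selector is copied from the policy above), one way to test that theory is to temporarily add an entity-based egress allow next to the toCIDR rule. If IMDS traffic passes with this in place, the destination is most likely being classified by entity/identity rather than matching the CIDR:

# Diagnostic only: temporarily also allow egress to the host and world
# entities on port 80 for the same endpoints the policy above selects.
# Remove it again once the classification question is answered.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: imds-entity-diagnostic   # illustrative name
spec:
  endpointSelector:
    matchLabels:
      k8s:io.cilium.k8s.namespace.labels.test/cilium-enabled: "true"
  egress:
    - toEntities:
        - host
        - world
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP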
