
Proxy redirect issue when running Cilium on top of Calico (CNI-Chaining) #12454

Closed
brandshaide opened this issue Jul 8, 2020 · 22 comments
Labels
  • area/proxy: Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • sig/agent: Cilium agent related.
  • stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@brandshaide (Contributor)

Bug report

General Information

pod-to-external-fqdn-allow-google-cnp is failing when running Cilium on top of Calico using CNI chaining

  • Cilium version: 1.8.1 & 1.7.6
  • tested with minikube 1.8.2 and RKE

How to reproduce the issue

  1. minikube start --network-plugin=cni --memory=4096
  2. deploy Calico (tested with v3.13.4)
  3. follow the instructions for CNI chaining as described in our documentation
  4. Finally deploy the connectivity-tests (a command sketch follows below)
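
For reference, here is a minimal command sketch of the steps above. The Calico manifest URL, the connectivity-check manifest path, and the chart version are assumptions based on the versions mentioned; the Helm values mirror the generic-veth chaining guide.

minikube start --network-plugin=cni --memory=4096
# Deploy Calico (assumed manifest location for v3.13)
kubectl apply -f https://docs.projectcalico.org/v3.13/manifests/calico.yaml
# CNI chaining: create the cni-configuration ConfigMap per the docs, then install Cilium in chaining mode
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.8.1 --namespace kube-system \
  --set cni.chainingMode=generic-veth \
  --set cni.customConf=true \
  --set cni.configMap=cni-configuration \
  --set tunnel=disabled
# Deploy the connectivity checks (assumed path in the Cilium repo)
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/1.8.1/examples/kubernetes/connectivity-check/connectivity-check.yaml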

Expected behaviour

All connectivity-check pods are up and running:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s

Actual behaviour

pod-to-external-fqdn-allow-google-cnp is failing and falling into a CrashLoopBackOff

kubectl get po --all-namespaces
NAMESPACE     NAME                                                     READY   STATUS             RESTARTS   AGE
default       echo-a-58dd59998d-rssbn                                  1/1     Running            0          159m
default       echo-b-865969889d-jtl65                                  1/1     Running            0          159m
default       echo-b-host-659c674bb6-vb6lm                             1/1     Running            0          159m
default       host-to-b-multi-node-clusterip-6fb94d9df6-nsstx          0/1     Pending            0          159m
default       host-to-b-multi-node-headless-7c4ff79cd-zjwsn            0/1     Pending            0          159m
default       pod-to-a-5c8dcf69f7-kf8q7                                1/1     Running            0          159m
default       pod-to-a-allowed-cnp-75684d58cc-4jdlv                    1/1     Running            0          159m
default       pod-to-a-external-1111-669ccfb85f-phltt                  1/1     Running            0          159m
default       pod-to-a-l3-denied-cnp-7b8bfcb66c-b2kdv                  1/1     Running            0          159m
default       pod-to-b-intra-node-74997967f8-4pcnr                     1/1     Running            0          159m
default       pod-to-b-intra-node-nodeport-775f967f47-mcwlq            1/1     Running            0          159m
default       pod-to-b-multi-node-clusterip-587678cbc4-b7njc           0/1     Pending            0          159m
default       pod-to-b-multi-node-headless-574d9f5894-vzb6q            0/1     Pending            0          159m
default       pod-to-b-multi-node-nodeport-7944d9f9fc-7t46p            0/1     Pending            0          159m
default       pod-to-external-fqdn-allow-google-cnp-6dd57bc859-vx4qk   0/1     CrashLoopBackOff   9          20m
kube-system   calico-kube-controllers-889867b65-dflz4                  1/1     Running            0          163m
kube-system   calico-node-8m4fz                                        1/1     Running            0          163m
kube-system   cilium-mfz8d                                             1/1     Running            0          143m
kube-system   cilium-operator-6cc5dff878-gw268                         1/1     Running            0          161m
kube-system   coredns-6955765f44-jjt9v                                 1/1     Running            0          160m
kube-system   coredns-6955765f44-nzgjj                                 1/1     Running            0          161m
kube-system   etcd-m01                                                 1/1     Running            0          175m
kube-system   hubble-relay-6555c76d6-vm4bq                             1/1     Running            0          148m
kube-system   hubble-ui-5ddff94674-6w6vk                               1/1     Running            0          148m
kube-system   kube-apiserver-m01                                       1/1     Running            0          175m
kube-system   kube-controller-manager-m01                              1/1     Running            0          175m
kube-system   kube-proxy-plrpt                                         1/1     Running            0          175m
kube-system   kube-scheduler-m01                                       1/1     Running            0          175m
kube-system   storage-provisioner                                      1/1     Running            0          175m

After deleting the (correct) CNP, which is:

---
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "pod-to-external-fqdn-allow-google-cnp"
spec:
  endpointSelector:
    matchLabels:
      name: pod-to-external-fqdn-allow-google-cnp
  egress:
  - toEndpoints:
    - matchLabels:
       "k8s:io.kubernetes.pod.namespace": kube-system
       "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: openshift-dns
        k8s:dns.operator.openshift.io/daemonset-dns: default
    toPorts:
    - ports:
      - port: "5353"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
  - toFQDNs:
    - matchPattern: "*.google.com"
---

traffic is routed again, i.e. cURLing google.com now succeeds.
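
A quick sketch of that check (the pod name is taken from the listing above but is illustrative, and assumes curl is available in the image; "cnp" is the short name for CiliumNetworkPolicy):

kubectl delete cnp pod-to-external-fqdn-allow-google-cnp
kubectl exec pod-to-external-fqdn-allow-google-cnp-6dd57bc859-vx4qk -- curl -sI https://www.google.com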

redirects are OK:

IPAM:                   IPv4: 1/255 allocated from 10.0.0.0/24,
Masquerading:           Disabled
Controller Status:      66/66 healthy
Proxy Status:           OK, ip 10.0.0.6, 3 redirects active on ports 10000-20000
  Protocol            Redirect               Proxy Port
  cilium-dns-egress   1138:egress:TCP:53     43693
  cilium-dns-egress   1138:egress:UDP:53     43693
  cilium-dns-egress   1138:egress:UDP:5353   43693

No ERROR and/or WARN logs were identified at the agent.
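
For reference, the proxy status excerpt above can be collected from the agent pod, and the log check done the same way (a sketch using the cilium pod name from the listing above; the grep pattern is an assumption about the log format):

kubectl -n kube-system exec cilium-mfz8d -- cilium status --verbose
kubectl -n kube-system logs cilium-mfz8d | grep -E 'level=(warning|error)'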

@pchaigno identified the issue as being related to a conflict on packet marks between Cilium and Calico that prevents proxy redirects from working properly.

@pchaigno pchaigno added area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. requires-doc-change Requires updates to the documentation as part of the development effort. labels Jul 8, 2020
@brandshaide (Contributor, Author) commented Jul 16, 2020

Additionally, applying L7 rules seems to have no effect at all.

Steps to reproduce:
  1. Follow the L7 example (the Star Wars demo) from the Cilium documentation
  2. kubectl exec tiefighter -- curl -s -XPUT deathstar.default.svc.cluster.local/v1/exhaust-port

Expected result
The L7 rule is applied; access is denied in this example.

Actual result
The cURL hangs.

The community has also reproduced this using the AWS CNI.
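
For context, the L7 policy in that example looks roughly like the following (reproduced from memory of the Star Wars demo, so treat it as a sketch rather than the exact manifest). With it applied, the PUT to /v1/exhaust-port above should be rejected with "Access denied" instead of hanging:

kubectl apply -f - <<EOF
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/v1/request-landing"
EOF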

@jrfastab (Contributor) commented Jul 20, 2020

What I think is going on here: Calico uses mark fields that conflict with Cilium's and will change mark values as packets are sent to the stack. For a Cilium pod-to-pod redirect in the non-chaining case we use a BPF redirect and skip the stack when we know the pod is local. In the chaining case, however, we use the stack's routing table, so we let the packet go to the stack. In the non-L7 cases we already avoid relying on the mark value in the Calico chaining case and force ingress on the veth to do an extra lookup, because of the mark mangling done by Calico.

However, in the L7 case we redirect to Envoy via the stack and still rely on the mark value to hit Cilium's NOTRACK rules in iptables. In a similar test I observed that in this chaining case the NOTRACK rules are not being hit and the packets sent to Envoy are dropped by the stack. The reason we miss on the NOTRACK side is that a Calico rule (in front of the Cilium rules) is also being hit and mangles the mark value using Calico's logic, conflicting with Cilium's usage.

To fix this we likely need to put the NOTRACK rule in front of the Calico rules. I think this is OK because the Cilium forward to Envoy is an internal implementation detail of Cilium and should avoid any routing/policy logic in use by Calico. I also expect that the TPROXY improvements in a future release will resolve this as well, because we won't need the iptables logic in the Envoy redirect path.
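
One way to see the ordering problem described above is to dump the raw and mangle tables on the node and check whether the Calico chains are evaluated before Cilium's rules (a diagnostic sketch only; the exact chain names may differ between versions):

iptables -t raw -S PREROUTING                  # Calico's cali-PREROUTING jump typically sits in front of Cilium's raw-table rules
iptables -t raw -S | grep -E 'CILIUM|cali-'    # compare the NOTRACK rules and their mark matches on each side
iptables -t mangle -S | grep -E 'MARK|TPROXY'  # see which marks are (re)set before the proxy redirect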

@joestringer (Member)

See this PR for the referred TPROXY changes mentioned above: #11279

@aditighag (Member)

> To fix this we likely need to put the NOTRACK rule in front of the Calico rules. […]

> See this PR for the referred TPROXY changes mentioned above: #11279

@joestringer Can we close this issue since the TPROXY change was merged in 1.9?

@joestringer (Member)

@aditighag I didn't specifically validate that #11279 fixed this issue. I think that additional development work will be needed to fully resolve this.

@brandshaide (Contributor, Author)

> @joestringer Can we close this issue since the TPROXY change was merged in 1.9?

@aditighag @joestringer I'll have a look at whether, and to what extent, this fix solves or mitigates the issue.

@sergeyshevch (Contributor)

@brandshaide @joestringer We have started to use Cilium with AWS CNI chaining. Could we mention more about this limitation in the docs? Or should it currently work?

@joestringer (Member)

@sergeyshevch sure, do you have a suggestion where to document this?

We currently have this documented at the top of pages like this: https://docs.cilium.io/en/stable/gettingstarted/cni-chaining-aws-cni/.

@sergeyshevch (Contributor)

@joestringer I guess it's not totally clear what isn't working. Will all L7 rules not work with CNI chaining, or only some cases? Maybe it would also be good to mention all related issues?

@kkourt (Contributor) commented Oct 25, 2021

Note that there is also this section in the docs https://docs.cilium.io/en/stable/gettingstarted/cni-chaining-calico/#calico that enumerates specific issues.

@joestringer (Member)

@sergeyshevch yes, basically all features listed in those sections are untested and are likely to be broken. Someone would need to pick up the work to implement the integrations necessary for it to work. Currently the docs link to the primary known issue which is likely to affect all CNI chaining setups, even though the specific details may vary from plugin to plugin.

@joestringer joestringer added pinned These issues are not marked stale by our issue bot. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Feb 22, 2022
@zhanghe9702 (Contributor)

> @aditighag @joestringer I'll have a look at whether, and to what extent, this fix solves or mitigates the issue.

I have tested this in a CNI-chaining (Calico) kind k8s cluster (cgroupv2 only) with Cilium v1.12.0-rc1, and the same problem appears to exist :)

helm install cilium install/kubernetes/cilium --devel=true \
  --namespace=kube-system \
  --set cni.chainingMode=generic-veth \
  --set cni.customConf=true \
  --set cni.configMap=cni-configuration \
  --set tunnel=disabled \
  --set enableIPv4Masquerade=false \
  --set enableIdentityMark=false \
  --set bpf.tproxy=true

@zhanghe9702 (Contributor)

📋 Test Report
❌ 3/11 tests failed (7/104 actions), 0 tests skipped, 0 scenarios skipped:
Test [echo-ingress-l7]:
  ❌ echo-ingress-l7/pod-to-pod/curl-2: cilium-test/client2-547996d7d8-fqr48 (10.244.162.132) -> cilium-test/echo-other-node-6cd597cddc-mjhfl (10.244.110.131:8080)
  ❌ echo-ingress-l7/pod-to-pod/curl-3: cilium-test/client2-547996d7d8-fqr48 (10.244.162.132) -> cilium-test/echo-same-node-6f4976ddbd-6pfrg (10.244.162.133:8080)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-pod/curl-2: cilium-test/client2-547996d7d8-fqr48 (10.244.162.132) -> cilium-test/echo-other-node-6cd597cddc-mjhfl (10.244.110.131:8080)
  ❌ client-egress-l7/pod-to-pod/curl-3: cilium-test/client2-547996d7d8-fqr48 (10.244.162.132) -> cilium-test/echo-same-node-6f4976ddbd-6pfrg (10.244.162.133:8080)
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-547996d7d8-fqr48 (10.244.162.132) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7df6cfbf7b-sdd8t (10.244.162.131) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-547996d7d8-fqr48 (10.244.162.132) -> one-one-one-one-http (one.one.one.one:80)
Connectivity test failed: 3 tests failed

@stevehipwell

Is anyone planning on picking up this work? Chaining Cilium to the AWS VPC CNI is a compelling proposition for us, but losing layer 7 policy support makes it a tougher sell.

@yurrriq (Contributor) commented May 1, 2023

We are chaining Cilium 1.12 to the AWS VPC CNI and L7 policies (seem to) work fine.

Edit: My assessment is anecdotal... I haven't (yet) run cilium connectivity test as in #20720
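
For anyone wanting to verify rather than rely on anecdote, the cilium CLI ships a connectivity suite that exercises the L7 and FQDN paths discussed in this issue (a sketch; the --test filter usage is an assumption):

cilium connectivity test
cilium connectivity test --test client-egress-l7   # restrict to one of the scenarios that failed above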

@joestringer joestringer added help-wanted Please volunteer for this by adding yourself as an assignee! and removed pinned These issues are not marked stale by our issue bot. labels May 2, 2023
@youngnick youngnick added the sig/agent Cilium agent related. label May 8, 2023
@github-actions bot commented Jul 8, 2023

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jul 8, 2023
@github-actions bot

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 23, 2023
@pchaigno pchaigno added pinned These issues are not marked stale by our issue bot. and removed help-wanted Please volunteer for this by adding yourself as an assignee! stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. requires-doc-change Requires updates to the documentation as part of the development effort. labels Jul 24, 2023
@pchaigno pchaigno reopened this Jul 24, 2023
@bog-dance commented Aug 11, 2023

We are chaining Cilium 1.14 to the AWS VPC CNI and L7 policies work fine.
Rule example:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      "k8s:io.kubernetes.pod.namespace": ns-a
  ingress:
  - fromEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": ns-b
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/index.html"
        - method: "PUT"
          path: "/50x.html"
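
A quick way to exercise a rule like that (the client pod name and the web service in ns-a are hypothetical; only GET /index.html and PUT /50x.html should be allowed):

kubectl -n ns-b exec client -- curl -s -o /dev/null -w "%{http_code}\n" http://web.ns-a/index.html   # expected: 200
kubectl -n ns-b exec client -- curl -s -o /dev/null -w "%{http_code}\n" http://web.ns-a/admin        # expected: 403 from the L7 proxy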

@Alex-Waring (Contributor)

I'm using CNI chaining on AWS and L7 policies work until a security group is attached, at which point traffic is black-holed somewhere after the egress proxy.

@AleksandrAksenov

> We are chaining Cilium 1.14 to the AWS VPC CNI and L7 policies work fine. Rule example: […]

How did you do it?
I use Cilium 1.14.1 and aws-vpc-cni, and when I check with Hubble I only see "world" and no external DNS names.
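
In case it helps with debugging, DNS and L7 visibility can be checked from the Hubble CLI (a sketch; assumes Hubble Relay is reachable and that an L7 DNS rule or visibility annotation is in place so flows are parsed at L7):

hubble observe --protocol dns --namespace default
hubble observe --type l7 --namespace default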

@github-actions bot

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Apr 30, 2024
@github-actions bot

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 14, 2024