Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request time out from container to other container using hostIP:hostPort on same host with portmap CNI chained #9784

Closed
splushii opened this issue Dec 18, 2019 · 7 comments · Fixed by #10928
Assignees
Labels
area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. priority/high This is considered vital to an upcoming release.
Projects
Milestone

Comments

@splushii
Copy link

Bug report

General Information

  • Cilium version (run cilium version)
Client: 1.6.4 8048d320a 2019-11-27T17:00:12+01:00 go version go1.12.13 linux/amd64
Daemon: 1.6.4 8048d320a 2019-11-27T17:00:12+01:00 go version go1.12.13 linux/amd64
  • Kernel version (run uname -a)
Linux tender-drake 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, Mesos, ...)
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Link to relevant artifacts (policies, deployments scripts, ...)
  • Upload a system dump (run curl -sLO releases.cilium.io/tools/cluster-diagnosis.zip && python cluster-diagnosis.zip sysdump and then attach the generated zip file)

( The sysdump was too big to attach. Please let me know if it's needed and I'll send or make it available somehow. )

How to reproduce the issue

  1. Install Cilium 1.6.4 using Helm chart from https://github.com/cilium/cilium/archive/v1.6.4.tar.gz with the following values:
---
global:
  tag: v1.6.4
  cluster:
    name: "k8slab"
    id: "1"
  cni:
    install: true
    chainingMode: portmap
  ipv4:
    enabled: true
  ipv6:
    enabled: false
  prometheus:
    enabled: false
  1. Deploy container using hostPort
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
          - containerPort: 80
            hostPort: 35000
  1. Deploy some debug container connecting to the hostPort service
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ubuntu
spec:
  selector:
    matchLabels:
      app: ubuntu
  template:
    metadata:
      labels:
        app: ubuntu
    spec:
      containers:
      - name: ubuntu
        image: ubuntu
        command:
          - '/bin/bash'
          - '-c'
          - '--'
        args:
          - 'apt-get update -y; apt-get install -y netcat curl; while true; do sleep 30; done;'
  1. Select a hostPort pod and debug pod on the same host (e.g. nginx-2cb8j and ubuntu-vf44z in the example below)
$ kubectl get po -owide
NAME           READY   STATUS    RESTARTS   AGE   IP             NODE            NOMINATED NODE   READINESS GATES
nginx-2cb8j    1/1     Running   0          24m   172.16.3.16    tender-drake    <none>           <none>
nginx-7b4wr    1/1     Running   0          24m   172.16.7.154   mutual-dodo     <none>           <none>
nginx-drtk8    1/1     Running   0          24m   172.16.8.63    full-evil       <none>           <none>
nginx-dzk7n    1/1     Running   0          24m   172.16.4.59    right-ferret    <none>           <none>
nginx-l89zr    1/1     Running   0          24m   172.16.6.17    loving-gopher   <none>           <none>
nginx-l9ntn    1/1     Running   0          24m   172.16.5.128   strong-caiman   <none>           <none>
ubuntu-bb44g   1/1     Running   0          20m   172.16.5.210   strong-caiman   <none>           <none>
ubuntu-hhpv7   1/1     Running   0          20m   172.16.7.223   mutual-dodo     <none>           <none>
ubuntu-n2p99   1/1     Running   0          20m   172.16.4.245   right-ferret    <none>           <none>
ubuntu-rlv5c   1/1     Running   0          20m   172.16.6.156   loving-gopher   <none>           <none>
ubuntu-tjr88   1/1     Running   0          20m   172.16.8.70    full-evil       <none>           <none>
ubuntu-vf44z   1/1     Running   0          20m   172.16.3.47    tender-drake    <none>           <none>
  1. Try to contact hostPort pod via its hostIP
    Using the above example:
$ kubectl get pod nginx-2cb8j -ojsonpath='{.status.hostIP}'
10.24.214.230
$ kubectl exec -ti ubuntu-vf44z bash
root@ubuntu-vf44z:/# curl 10.24.214.230:35000 -v
* Rebuilt URL to: 10.24.214.230:35000/
*   Trying 10.24.214.230...
* TCP_NODELAY set
* connect to 10.24.214.230 port 35000 failed: Connection timed out
* Failed to connect to 10.24.214.230 port 35000: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 10.24.214.230 port 35000: Connection timed out

The cilium monitor output is (where id 568 is the debug pod ubuntu-vf44z):

cilium monitor --related-to 568
Listening for events on 32 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor
-> host from flow 0xab4466b0 identity 111636->1 state new ifindex cilium_net: 172.16.3.47:49290 -> 10.24.214.230:35000 tcp SYN
-> endpoint 568 flow 0x85c0bb7d identity 111706->111636 state new ifindex lxc2faa014a7904: 172.16.3.16:80 -> 172.16.3.47:49290 tcp SYN, ACK
-> host from flow 0x65945f77 identity 111636->1 state established ifindex cilium_net: 172.16.3.47:49290 -> 10.24.214.230:35000 tcp SYN
-> host from flow 0x405d06ef identity 111636->1 state established ifindex cilium_net: 172.16.3.47:49290 -> 10.24.214.230:35000 tcp SYN
-> host from flow 0xc974addb identity 111636->1 state established ifindex cilium_net: 172.16.3.47:49290 -> 10.24.214.230:35000 tcp SYN
-> host from flow 0x73876df9 identity 111636->1 state established ifindex cilium_net: 172.16.3.47:49290 -> 10.24.214.230:35000 tcp SYN

Contacting the hostPort pod using its ClusterIP works:

root@ubuntu-vf44z:/# curl 172.16.3.16:80 -I
HTTP/1.1 200 OK
...

and contacting a pod running on another host using the hostIP and hostPort works:

$ kubectl get pod nginx-7b4wr -ojsonpath='{.status.hostIP}'
10.24.214.232
root@ubuntu-vf44z:/# curl 10.24.214.232:35000 -I   
HTTP/1.1 200 OK
...

It seems the issue only arises when the source and destination pod is on the same host and the hostIP:hostPort is used.

@brb brb added kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Dec 18, 2019
@croissong
Copy link

croissong commented Dec 21, 2019

Same issue here. With cilium v1.6.3 the same test case is working.

@TimEvens
Copy link

TimEvens commented Jan 3, 2020

We are having the same problem, even with "Cilium 1.6.90 ca7f68526 2019-11-19T08:00:31-08:00 go version go1.13.4 linux/amd64".

Use-case: Consul Agent - PODs access consul via the hostIP that the POD is running on, eg. http://${HOST_IP}:8500/v1/...

portmap commit attempted to fix the problem for hair-pinning, but it looks to me that they only fixed it for the host itself and PODs that are using hostNetworking. PODs that use cluster networking are not correctly being masqueraded when a local pod to the host tries to access the hostIP:hostPort (e.g., DNAT).

In the case of consul-agent using port 8500, we are able to work around the issue by adding the below:

iptables -t nat -I CNI-HOSTPORT-DNAT -p tcp -m tcp --dport 8500 -j KUBE-MARK-MASQ

If portmap is configured with defaults, we can also use:

iptables -t nat -I CNI-HOSTPORT-DNAT -p tcp -m tcp --dport 8500 -j CNI-HOSTPORT-SETMARK

It seems to me that this is an issue with portmap, but maybe there is something Cilium can do...?? :)

@lkysow
Copy link

lkysow commented Jan 16, 2020

I've reproduced this on Digital Ocean using the Consul helm chart. Every pod in other nodes can access the hostIP:hostPort except for the pods on that node itself.

@aanm aanm added area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. kind/bug This is a bug in the Cilium logic. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Jan 17, 2020
@aanm aanm changed the title Request time out from container to other container hostPort using hostIP on same host Request time out from container to other container using hostIP:hostPort on same host with portmap CNI chained Jan 17, 2020
@aanm aanm added this to the 1.8 milestone Jan 20, 2020
@jdef
Copy link

jdef commented Mar 2, 2020

the workaround in the description above doesn't seem to be compatible with a CiliumNetworkPolicy that enforces ingress using fromEntities: [ host ]

EDIT

.. meaning that I need to remove such policy in order for the workaround to be effective

@joestringer
Copy link
Member

One additional snippet of information to add to @jdef's observations above is that the source identity in that case was 2 (world).

@tgraf tgraf added the priority/high This is considered vital to an upcoming release. label Mar 12, 2020
@tgraf
Copy link
Member

tgraf commented Mar 20, 2020

The commit which likely broke this is 4ae2ead. The commit is needed to make externalTrafficPolicy=local work.

Traffic pattern looks like this:

-> stack flow 0x9a6ba3e1 identity 34607->1 state new ifindex 0 orig-ip 0.0.0.0: 192.168.131.170:60406 -> 192.168.52.196:35000 tcp SYN
-> endpoint 3150 flow 0x9a6ba3e1 identity 34607->2352 state new ifindex 0 orig-ip 192.168.131.170: 192.168.131.170:60406 -> 192.168.144.37:80 tcp SYN

The forward-path looks correct. The identity is correctly preserved.

-> stack flow 0x6a800030 identity 2352->34607 state reply ifindex 0 orig-ip 0.0.0.0: 192.168.144.37:80 -> 192.168.131.170:60406 tcp SYN, ACK
-> endpoint 3549 flow 0x6a800030 identity 2->34607 state reply ifindex 0 orig-ip 192.168.52.196: 192.168.52.196:35000 -> 192.168.131.170:60406 tcp SYN, ACK

The reply path loses the identity because the traffic seems to gets masqueraded.

@tgraf tgraf assigned tgraf and unassigned borkmann Mar 20, 2020
tgraf added a commit that referenced this issue Apr 9, 2020
When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
tgraf added a commit that referenced this issue Apr 10, 2020
The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmaqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf
Copy link
Member

tgraf commented Apr 10, 2020

Prior to commit 4ae2ead, Cilium implemented a relatively aggressive masquerading behavior of traffic from the local host. This simplified integration with components such as portmap and others which implement their own DNAT rules but often lack SNAT to force reply traffic back through the same path.

Starting with commit 4ae2ead, this was relaxed in order to allow for proper behavior of externalTrafficPolicy=local. The commit removed the following masquerade rule:

-                       // Masquerade all traffic from the host into the
-                       // local Cilium cluster range if the source is not
-                       // in the cluster range and DNAT has been
-                       // applied.  These conditions are met by traffic
-                       // redirected via hostports from non-cluster sources.
-                       // The SNAT to the cluster address is needed so that
-                       // the return traffic from a host proxy (when used) is
-                       // routed back via the cilium_host device also
-                       // when the source address is originally
-                       // outside of the cluster range.
-                       //
-                       // The following conditions must be met:
-                       // * Must be targeted to an IP that IS local
-                       // * May not originate from any IP inside of the cluster range
-                       // * Must have DNAT applied (k8s hostport, etc.)
-
-                       if err := runProg("iptables", append(
-                               m.waitArgs,
-                               "-t", "nat",
-                               "-A", ciliumPostNatChain,
-                               "!", "-s", node.GetIPv4ClusterRange().String(),
-                               "-o", localDeliveryInterface,
-                               "-m", "conntrack", "--ctstate", "DNAT",
-                               "-m", "comment", "--comment", "cilium hostport cluster masquerade",
-                               "-j", "SNAT", "--to-source", node.GetHostMasqueradeIPv4().String()), false); err != nil {
-                               return err
-                       }

With the removal of this masquerade rule, it is no longer guaranteed for reply traffic to be routed symmetrically on the reply path. It was expected that portmap would ensure for this, given that it depends on seeing the reply traffic. It seems that portmap assumes that the routing path is always symmetric.

The following PRs will fix this issues:

aanm pushed a commit that referenced this issue Apr 14, 2020
When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
tgraf added a commit that referenced this issue Apr 17, 2020
Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
tgraf added a commit that referenced this issue Apr 17, 2020
The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
jrfastab pushed a commit that referenced this issue Apr 23, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab pushed a commit that referenced this issue Apr 23, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab pushed a commit that referenced this issue Apr 28, 2020
[ upstream commit f25d8b9 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab pushed a commit that referenced this issue Apr 28, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab pushed a commit that referenced this issue Apr 28, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
joestringer pushed a commit that referenced this issue Apr 29, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
joestringer pushed a commit that referenced this issue Apr 29, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab added a commit that referenced this issue Apr 29, 2020
[ upstream commit f25d8b9 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
jrfastab added a commit that referenced this issue Apr 29, 2020
[ upstream commit f25d8b9 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
joestringer pushed a commit that referenced this issue May 7, 2020
[ upstream commit f25d8b9 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
qmonnet pushed a commit that referenced this issue May 12, 2020
[ upstream commit f25d8b9 ]
[ based on v1.7 backport 60b4210 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 12, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 12, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 13, 2020
[ upstream commit f25d8b9 ]
[ based on v1.7 backport 60b4210 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 13, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 13, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@borkmann borkmann moved this from 1.8 fixes/CI needed to Done in 1.9 kube-proxy removal & general dp optimization May 19, 2020
qmonnet pushed a commit that referenced this issue May 26, 2020
[ upstream commit f25d8b9 ]
[ based on v1.7 backport 60b4210 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 26, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
qmonnet pushed a commit that referenced this issue May 26, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
joestringer pushed a commit that referenced this issue May 26, 2020
[ upstream commit f25d8b9 ]
[ based on v1.7 backport 60b4210 ]

When Cilium is used in chaining mode with portmap, the hostPort is
translated using iptables DNAT as inserted by the portmap plugin.  When
this happens all within a node, we can preserve the source identity for
the reply traffic for correct visibility. The traffic will be allowed
anyway based on the connection tracking state.

To work with clang-7 and avoid the pattern where the ctx is read into
a register and then incremented then finally a value assigned to it,

  r1 = %[ctx]
  r1 += 8
  ...
  *(u32)(r1 +=8) = %[mark]

We wrote the code block in asm which is not the same as master branch
which was able to use C code due to use of clang-11. We attempted to
update the branch to clang-10 but that created a separate set of issue
that was causing more code churn than we wanted.

Updates: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
joestringer pushed a commit that referenced this issue May 26, 2020
[ upstream commit 5f50d82 ]

Packets to a host IP are currently redirected via
cilium_host/cilium_net. The reason for this is mostly historic. For
other packets where routing by the kernel routing tables is desired,
packets are already passed on via TC_ACT_OK to the stack directly.

The two cases where this redirection is needed are:
* For proxy redirection due to a kernel limitation on passing the
  routing tables multiple times. This case is left untouched.
* For the HOST_REDIRECT_TO_INGRESS case, e.g. flannel integration. This
  case is left untouched. The IPv4 and IPv6 case is brought in line to
  not accidently lose this logic later on.

A side effect of this is that the skb gets scrubbed including the
skb->mark. The presence of the identity in the skb->mark is being relied
on in a follow-up fix however.

Therfore, pass packets via the stack using TC_ACT_OK. This is faster,
simpler, and allows for the identity to be carried in the mark.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
joestringer pushed a commit that referenced this issue May 26, 2020
[ upstream commit 1fb15cf ]

The traffic is sent to the stack and hairpin'ed back into a local pod
after a component on the stack has applied a DNAT rule, the traffic must
be SNATed to ensure the reverse NAT can take place. This can happen if
portmap or kiam is being used and redirection happens to a local
destination.

The masquerade filter must be limited as not all DNAT traffic may be
affected. NodePort traffic from a non-local source must remain
unmasqueraded in order for trafficPolicy=local to continue working.

Also, when EnableEndpointRoutes is enabled, traffic always traverses the
stack and must not be masqueraded either.

Fixes: #9784

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. priority/high This is considered vital to an upcoming release.
Projects
No open projects
1.7.4
Awaiting triage
Development

Successfully merging a pull request may close this issue.

10 participants