v1.8 backports 2021-02-12 #14953

Merged
merged 22 commits into v1.8 on Feb 15, 2021

Conversation


@pchaigno pchaigno commented Feb 12, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 14611 14114 14793 14893 14913 14920 14634 14675 14847 14756 14924 14338; do contrib/backporting/set-labels.py $pr done 1.8; done

jrfastab and others added 17 commits February 12, 2021 10:19
[ upstream commit 32e41f6 ]

If a pod in host networking and/or the node itself sends traffic to a pod
with encryption+vxlan enabled and encryptNode disabled, we may see dropped
packets. The flow is the following:

First, a packet is sent from host networking with srcIP=nodeIP dstIP=podIP.
Because a source IP has already been chosen for the packet, the route's src
hint will not be used. The routing table then routes the packet to
cilium_host using the route matching the podIP's subnet,

   podIPSubnet via $IP dev cilium_host src $SrcIP mtu 1410

Then we drop into BPF space with srcIP=nodeIP,dstIP=podIP as above. Here
tohost=true, and we do an ipcache lookup. The ipcache lookup will have a
hit for the podIP, with both a key and a tunnelIP. For example, something
like this:

  10.128.5.129/32     1169 3 10.0.0.4

Using the above key identifier ('3' in the example), the bpf_host program
will mark the packet for encryption. This passes encryption parameters
through the ctx, where the ingress program attached to cilium_net will in
turn use those parameters to set skb->mark before passing the packet up to
the stack for encryption. At this point we have an skb ingress'ing the
stack with skb->mark=0xe00,srcIP=nodeIP,dstIP=podIP.

Here is where the trouble starts. This will miss the encryption policy
rules because, unless encryptNode is enabled, we do not have a rule
matching srcIP=nodeIP. So the unencrypted packet is sent back to
cilium_host using a route in the encryption routing table. Finally,
bpf_host sends the skb to the overlay, cilium_vxlan.

The packet is then received by the podIP's node and sent to cilium_vxlan
because it's a vxlan packet. The vxlan header is popped off and the
inner (unencrypted) packet is sent to the pod; remember, srcIP=nodeIP
still.

Assuming the above was a SYN packet the pod will respond with a SYN/ACK
with srcIP=podIP,dstIP=nodeIP. This will be handled by the bpf_lxc
program.

First, we will do an ipcache lookup and get an ipcache entry, but this
entry will not have a tunnelIP. It will have a key though. Because no
tunnelIP is available, a tunnel map lookup will be done. This will fail
because the dstIP is a nodeIP and not in a tunnel subnet. (The tunnel
map key is computed by masking the dstIP with the subnet.)
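
A hedged way to see why that lookup misses (command assumed to be available
via the Cilium agent CLI): the tunnel map is keyed by remote pod CIDRs, so a
plain node IP has no entry.

```
# keys are remote pod CIDRs mapped to tunnel endpoints; node IPs do not appear here
cilium bpf tunnel list
```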

Because the program did not find a valid tunnel, the packet is sent to the
stack. The stack will run through iptables, but because the packet flow
is asymmetric (SYN over cilium_vxlan, SYN/ACK over the nic) the MASQUERADE
rules will be skipped. The end result is that we try to send a packet with
srcIP=podIP.

Here a couple of different outcomes are possible. If the network infra is
strict, the packet may be dropped at the sending node, which refuses to
send a packet with an unknown srcIP. If the receiving node has rp_filter=1
on the receiving interface, the packet will be dropped with a martian IP
message in dmesg. Or, if rp_filter=0 and the network somehow knows how to
route the srcIP, it might work. For example, some of my test systems
managed to work without any issues.
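
As a quick check on the receiving side, the reverse-path filter setting and
martian logging can be inspected with standard tools (a minimal sketch; eth0
is only a placeholder for the receiving interface):

```
# rp_filter=1 (strict) on the receiving interface drops the asymmetric reply
sysctl net.ipv4.conf.eth0.rp_filter
# log martian packets so the drop shows up in dmesg
sysctl -w net.ipv4.conf.eth0.log_martians=1
dmesg | grep -i martian
```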

In order to fix this we can ensure reply packets are symmetric to the
request. To do this we inject tunnel info into the nodeIP's ipcache
entry. Once this is done, the replying pod will do the following.

It sends a packet with srcIP=podIP,dstIP=nodeIP, which is picked up by
the bpf_lxc program. The ipcache lookup now finds a complete entry with
the tunnel info. Instead of sending that packet to the stack, it is
encapsulated in the vxlan header and sent to (and received on) the
original requesting node.

A quick observation: some of the passes through the xfrm stack are
useless here. We know the miss is going to happen and can signal this
to the BPF layer by clearing the encryption key in the ipcache. We
will do this as a follow-up patch.
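
To observe the effect of the fix, the node IP's ipcache entry can be dumped
from the agent (a hedged sketch; the command name and output columns may
differ between Cilium versions, and <nodeIP> is a placeholder):

```
# after the fix, the node IP entry should carry a tunnel endpoint, just like
# the pod entry shown earlier (identity, key, tunnelIP)
cilium map get cilium_ipcache | grep <nodeIP>
```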

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 16e8f2f ]

Fix #14100

Identity-relevant labels are a label prefix list composed of two parts:

1. base part:
   1.1. Read from a user-specified (--label-prefix-file) json file if this
        file is provided. Default: `--label-prefix-file=""`.
   1.2. If `--label-prefix-file=""`, read from a default hardcoded list
        (`func defaultLabelPrefixCfg()`).
2. additional part: read from user input (--labels), default `--labels=""`.

When `--label-prefix-file=""` (the default) but `--labels=<custom-list>` is
provided, if `reserved:host` (or `reserved:.*`) is not included in the above
`<custom-list>`, the `cilium_host` endpoint will lose its `reserved:host`
label.

When rolling back to the default configuration, that is, setting `--labels=""`
and restarting the agent, the cilium agent will raise errors like the following:

```
level=warning msg="Regeneration of endpoint failed" .. error="Exposing new BPF failed: invalid LXC MAC: invalid MAC address "
level=error msg="endpoint regeneration failed" ..  error="Exposing new BPF failed: invalid LXC MAC: invalid MAC address "
```

And subsequently, all pods' traffic on this node will be interrupted.

This is because the agent relies on this label to distinguish the `cilium_host`
endpoint from normal endpoints, and the former has no `lxcMAC`.
We should never exclude reserved labels from the default label list.
Adding reserved labels to the default label list solves the problem.
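
For illustration, this is the kind of custom label configuration that used to
trigger the problem when the reserved labels were not listed explicitly (a
hedged sketch; the flag syntax and the prefixes are examples only):

```
# before the fix, omitting reserved:.* here stripped reserved:host from cilium_host
cilium-agent --labels="k8s:io.kubernetes.pod.namespace k8s:app"
# the safe variant had to list the reserved labels explicitly
cilium-agent --labels="reserved:.* k8s:io.kubernetes.pod.namespace k8s:app"
```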

Appendix:

Sample custom label file (--label-prefix-file) to override the default base
label list:

```
{
    "version": 1,
    "valid-prefixes": [
        {
            "source": "k8s",
            "prefix": "io.kubernetes.pod.namespace"
        }, {
            "source": "k8s",
            "prefix": ":io.cilium.k8s.namespace.labels"
        }, {
            "source": "k8s",
            "prefix": "app.kubernetes.io"
        }, {
            "source": "k8s",
            "prefix": "k8s!:io.kubernetes"
        }, {
            "source": "k8s",
            "prefix": "!kubernetes.io"
        }, {
            "source": "k8s",
            "prefix": "!.*beta.kubernetes.io"
        }, {
            "source": "k8s",
            "prefix": "!k8s.io"
        }, {
            "source": "k8s",
            "prefix": "!pod-template-generation"
        }, {
            "source": "k8s",
            "prefix": "!pod-template-hash"
        }, {
            "source": "k8s",
            "prefix": "!controller-revision-hash"
        }, {
            "source": "k8s",
            "prefix": "!annotation.*"
        }, {
            "source": "k8s",
            "prefix": "!etcd_node"
        }
    ]
}
```

Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit d869108 ]

[SG]etInternalIPv4Router - This is the CiliumInternalIP of the node. It
is only relevant within the Cilium-managed network (this means within
the node for direct routing mode and on the overlay for tunnel mode).
To avoid confusion with the InternalIP of a k8s node, we renamed
`[SG]etInternalIPv4` to `[SG]etInternalIPv4Router`.

[SG]etIPv4 - This returns one of the node's IPv4 addresses based on the
following priority:
- NodeInternalIP
- NodeExternalIP
- other IP address type.
These are reachable on the network. Since this can be an External or
Internal IP address, we renamed `[SG]etExternalIPv4` to `[SG]etIPv4`.

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit b60f8b7 ]

Previously, accessing a NodePort using the ExternalIP of the local K8s node
would result in a connection timeout. This is because the ExternalIP of the
local K8s node is not added to the ipcache and hence no service resolution
happens on egress.

This commit resolves this issue by adding the ExternalIP of the local k8s
node to the ipcache. We introduce two functions, `SetExternalIPv[46]` and
`GetExternalIPv[46]`, to store/fetch the External IPv4/IPv6 address of a k8s
node. If a node doesn't have any External IP address, then we store/return
`nil`.
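
A hedged way to exercise the fixed path from a node or hostns pod (node name,
node port, and timeout are placeholders/examples):

```
# resolve the local node's ExternalIP, then reach a NodePort service through it
EXTERNAL_IP=$(kubectl get node <node-name> \
  -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}')
curl --connect-timeout 5 "http://${EXTERNAL_IP}:<node-port>/"
```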

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
… cluster

[ upstream commit 265d677 ]

This test checks connectivity to a NodePort service from inside a cluster
by sending a request to ExternalIP:NodePort.

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 463e0dc ]

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2ceb386 ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 55caedc ]

Strictly speaking this is a general truth for all services, but we don't
have a dedicated section explaining services, and users have begun
reporting this specifically in relation to AWS / ENI mode. Put this
somewhere in the docs; we can always move it somewhere more generic when
we have a better location for these links to live.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 83cb610 ]

Setting this option makes Cilium run in a mode that allows it to
reduce its process' RSS even if the system is not under memory
pressure.

By default, i.e. without 'madvdontneed' being set, Cilium will only
release its process' RSS when the system is under memory pressure.
This may be confusing for users since they will assume Cilium needs
the memory reported in RSS, which is not true.
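
For reference, this corresponds to the Go runtime's madvise behaviour; the
sketch below shows the runtime knob itself and is an assumption about how the
option maps onto it, not the exact mechanism used by the agent:

```
# MADV_DONTNEED returns freed pages to the kernel immediately, lowering reported RSS
GODEBUG=madvdontneed=1 cilium-agent
```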

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 317a671 ]

Kube-proxy installs the following rule in the iptables FORWARD chain:

    -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP

That rule causes iptables to drop packets whenever we have a path
asymmetry, with, for example, a SYN packet going through iptables and a
SYN+ACK packet skipping iptables via BPF's redirect(). In that example,
the next ACK packet would be dropped by that rule because the connection
is invalid (the SYN+ACK is missing) from iptables' point of view.

To avoid such drops, in the IPv4 case, Cilium installs rules in the
FORWARD chain (via its own CILIUM_FORWARD chain) to bypass the
subsequent -j DROP rule from kube-proxy. For example, with per-endpoint
routes enabled:

    -A FORWARD -m comment --comment "cilium-feeder: CILIUM_FORWARD" -j CILIUM_FORWARD
    -A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
    -A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
    -A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
    -A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
    -A CILIUM_FORWARD -o lxc+ -m comment --comment "cilium: any->cluster on lxc+ forward accept" -j ACCEPT
    -A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept (nodeport)" -j ACCEPT

The same rules are however missing from the IPv6 tables. This commit
fixes this inconsistency.
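
A hedged way to check that the IPv6 side now carries the same bypass rules
(chain names as shown above):

```
# the CILIUM_FORWARD accept rules should now appear in the IPv6 filter table
# alongside kube-proxy's KUBE-FORWARD INVALID drop
ip6tables-save -t filter | grep -E 'CILIUM_FORWARD|KUBE-FORWARD'
```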

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ed68d2c ]

This commit fixes a broken packet path on pod --> remote pod via VIP
when running with encapsulation and per-endpoint routes, with kube-proxy
handling DNATing. The following details the broken packet path:

- SYN leaves lxc device for stack             with pod1IP -> VIP
- SYN goes through cilium_host and cilium_net with pod1IP -> pod2IP
- SYN is sent to second node through tunnel   with pod1IP -> pod2IP
- SYN+ACK arrives from tunnel on cilium_vxlan with pod2IP -> pod1IP
- SYN+ACK is processed by ipvX_local_delivery
  and redirected in BPF directly to the lxc
  device
- SYN+ACK arrives on lxc device               with pod2IP -> pod1IP
- RST is sent because pod2IP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of
the stack (including conntrack) on its way from cilium_vxlan to the lxc
device. By bypassing conntrack, we also bypass the reverse DNAT.

The bug doesn't occur when per-endpoint routes are disabled because, in
that case, the SYN is SNATed to the cilium_host IP before leaving
through the tunnel. The SYN+ACK packet therefore takes a different path
and goes through conntrack.
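
A hedged diagnostic for this situation (the VIP is a placeholder): the SYN
creates a DNAT conntrack entry on its way through the stack, but the
redirected SYN+ACK never traverses conntrack, so the entry never sees the
reply.

```
# list conntrack entries whose original destination is the service VIP
conntrack -L --orig-dst <VIP>
```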

Fixes: #14646
Co-authored-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ae96b92 ]

Signed-off-by: Yosh de Vos <yosh@elzorro.nl>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c8052a1 ]

When IPSec or the host firewall is enabled in tunneling mode, traffic
from a pod to a remote node is sent through the stack.

This introduces an asymmetry in kube-proxy's DNAT and breaks connections
from a node (or hostns pod) to remote pods via
externalTrafficPolicy=Local services. The following details the broken
packet path node1 --VIP--> pod@node2 (VIP is node2IP):

- SYN leaves node1 via native device with  node1IP -> VIP
- SYN is DNATed on node2 to                node1IP -> podIP
- SYN is delivered to lxc device with      node1IP -> podIP
- SYN+ACK is sent from lxc device with     podIP   -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via tunnel with podIP   -> node1IP
- RST is sent because podIP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of
the stack (including conntrack) on its way from cilium_vxlan to the lxc
device. By bypassing conntrack, we also bypass the reverse DNAT.

The bug only occurs for externalTrafficPolicy=Local services because,
without that setting, the SYN packet is SNATed when arriving on
node2. The SYN+ACK is therefore first sent via the stack to node2's IP,
where it is both reverse SNATed and reverse DNATed.
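
For reference, the service setting that takes this code path (a hedged
sketch; the service name is a placeholder):

```
# externalTrafficPolicy=Local preserves the client source IP, so no SNAT happens on node2
kubectl patch svc <service-name> -p '{"spec":{"externalTrafficPolicy":"Local"}}'
```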

Fixes: #14674
Fixes: #12542
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c4356bd ]

For IPSec and the host firewall, in order to apply kube-proxy's reverse
DNAT we want to send traffic through the tunnel via the stack instead
of doing a redirect call. Such traffic therefore goes from the lxc
device to cilium_host where it is redirected to the tunneling device.

We however SNAT that traffic on its way to cilium_host. We don't need to
apply SNAT there for the pod CIDR as any traffic from the pod CIDR will
either leave directly through the native device (masquerading happening
there) or go through cilium_host and be redirected to the tunnel.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2303803 ]

This test validates a potential regression on the path node1 ->
pod@node2 via service with externalTrafficPolicy=Local.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1db771f ]

This test validates a potential regression on the path node1 ->
pod@node2 via service with externalTrafficPolicy=Local when kube-proxy,
tunneling and the host firewall are enabled.

This commit also removes the explicit disabling of the host firewall in
externalTrafficPolicy=Local tests.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 773b1d2 ]

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno requested a review from a team as a code owner February 12, 2021 11:22
@maintainer-s-little-helper maintainer-s-little-helper bot added the backport/1.8 and kind/backports labels Feb 12, 2021
@pchaigno pchaigno added the backport/1.8 and kind/backports labels Feb 12, 2021
aditighag and others added 5 commits February 12, 2021 12:23
[ upstream commit 3ce38c9 ]

This is a preparatory commit that introduces ENI-related
information for cilium_host (aka the router). Specifically,
the information includes the primary ENI interface that the
secondary cilium_host IP is associated with, the primary ENI MAC
address, and the VPC CIDRs.
Also, refactor the current IPAM internal IP allocation methods
so that the ENI information can be retrieved for cilium_host (aka the router).

This information can be useful for setting up encryption-related
state such as ip rules and the interface to attach
BPF programs to.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f34371c ]

Previously, we weren't adding policy-based routing rules
for the cilium_host secondary IP address. These rules ensure
that traffic from the source node's cilium_host interface destined to a
remote node in the cluster VPC is routed through the primary
ENI interface for cilium_host. From there on, the traffic is routed to the
AWS infrastructure router, which in turn routes it to the destination node.

The fix involves setting up ip rules pointing to the per-ENI routing table.

Verified that the correct rule is added, and encrypted traffic is
reaching the destination node.

root@ip-192-168-38-245:/home/cilium# ip addr show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether da:61:19:32:0b:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.148.108/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::d861:19ff:fe32:b37/64 scope link
       valid_lft forever preferred_lft forever
root@ip-192-168-38-245:/home/cilium# ip rule show | grep 192.168.148.108
111:	from 192.168.148.108 to 192.168.0.0/16 lookup 11

Tcpdump on the remote node shows encrypted packets getting received -

tcpdump -eni eth1 dst 192.168.142.218
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
17:32:52.216989 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 130: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x64), length 96
17:32:52.217694 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 122: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x65), length 88
17:32:52.217896 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 206: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x66), length 172

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
…r ENI mode

[ upstream commit 2c5e192 ]

Problem: Auto-detection of encryption.interface can lead to the bpf_network
program that does decryption being attached to the wrong network interface.
As a result, packets were not decrypted on the destination remote node.

If users don't specify the cilium `encryption.interface`
configuration, the interface is auto-detected based on the main routing
table. This breaks in ENI IPAM mode, where cilium_host is allocated an IP
address from the pool of secondary IP addresses of a primary ENI interface.
The IP allocation is influenced by the specified `first-interface-index` configuration.
Hence, retrieve the correct primary interface for cilium_host so that the
bpf_network program can be attached to it.
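
As a workaround, or to make the choice explicit, the interface can be pinned
instead of relying on auto-detection (a hedged sketch using the Helm value
named above; eth1 is a placeholder for the primary ENI interface):

```
helm upgrade cilium cilium/cilium --namespace kube-system \
  --reuse-values --set encryption.interface=eth1
```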

The fix is verified based on the xfrm stats:

Packet stats on source node :

src 192.168.117.26 dst 192.168.142.218
	proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
	replay-window 0 seq 0x00000000 flag  (0x00000000)
	mark 0x3e00/0xff00 output-mark 0xe00
	aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
	anti-replay context: seq 0x0, oseq 0x9, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
	lifetime config:
	  limit: soft (INF)(bytes), hard (INF)(bytes)
	  limit: soft (INF)(packets), hard (INF)(packets)
	  expire add: soft 0(sec), hard 0(sec)
	  expire use: soft 0(sec), hard 0(sec)
	lifetime current:
	  540(bytes), 9(packets)
	  add 2021-02-05 18:12:08 use 2021-02-05 18:13:01
	stats:
	  replay-window 0 replay 0 failed 0

Packet stats on dest node :

ip -s xfrm s
src 192.168.142.218 dst 192.168.117.26
	proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
	replay-window 0 seq 0x00000000 flag  (0x00000000)
	mark 0x3e00/0xff00 output-mark 0xe00
	aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
	anti-replay context: seq 0x0, oseq 0x6, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
	lifetime config:
	  limit: soft (INF)(bytes), hard (INF)(bytes)
	  limit: soft (INF)(packets), hard (INF)(packets)
	  expire add: soft 0(sec), hard 0(sec)
	  expire use: soft 0(sec), hard 0(sec)
	lifetime current:
	  360(bytes), 6(packets)
	  add 2021-02-05 18:12:08 use 2021-02-05 23:19:07
	stats:
	  replay-window 0 replay 0 failed 0

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1e80ec0 ]

This commit updates the list of default label filters and clarifies the
default behavior, as well as how it changes when a first inclusive label
is added.

The example is also completed based on the new unit test added in the
previous commit.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3b0f6e8 ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno (Member, Author)

test-backport-1.8

@aditighag aditighag (Member) left a comment

LGTM for my commits. I confirmed that the EnableIPv4Masquerade flag was changed from Masquerade (v1.8). Thank you!

@aditighag aditighag removed their assignment Feb 12, 2021
@joestringer joestringer (Member) left a comment

Docs changes LGTM. I also glanced at the non-core contributor commits, looks straightforward.

@aanm aanm (Member) left a comment

Looks good for my commit.

@aanm aanm removed their assignment Feb 12, 2021
@brb brb (Member) left a comment

LGTM.

@pchaigno (Member, Author)

Cilium-Ginkgo-Test-k8s can be ignored as all the (equivalent) K8s-1.X-kernel-4.9 jobs are passing.

@pchaigno pchaigno added the ready-to-merge label Feb 15, 2021
@aanm aanm merged commit 57d4775 into v1.8 Feb 15, 2021
@aanm aanm deleted the pr/v1.8-backport-2021-02-12 branch February 15, 2021 11:49
Labels: kind/backports, ready-to-merge
9 participants