
v1.9 backports 2021-02-12 #14961

Merged
merged 22 commits into v1.9 from pr/v1.9-backport-2021-02-12 on Feb 15, 2021

Conversation


@pchaigno pchaigno commented Feb 12, 2021

Once this PR is merged, you can update the PR labels via:

$ for pr in 14611 14793 14913 14920 14634 14675 14847 14756 14886 14924 14338 14954; do contrib/backporting/set-labels.py $pr done 1.9; done

jrfastab and others added 21 commits February 12, 2021 15:32
[ upstream commit 32e41f6 ]

If a pod in host networking and/or the node itself sends traffic to a pod
with encryption+vxlan enabled and encryptNode disabled, we may see dropped
packets. The flow is the following:

First, a packet is sent from host networking with srcIP=nodeIP dstIP=podIP.
Next, a source IP is chosen for the packet, so the route src will not
be used. The routing table then routes the packet to cilium_host using the
route matching the podIP's subnet:

   podIPSubnet via $IP dev cilium_host src $SrcIP mtu 1410

Then we drop into BPF space with srcIP=nodeIP,dstIP=podIP as above. Here
tohost=true, and we do an ipcache lookup. The ipcache lookup will have a
hit for the podIP and will have both a key and tunnelIP. For example,
something like this:

  10.128.5.129/32     1169 3 10.0.0.4

Using the above key identifier ('3' in the example), the bpf_host program
will mark the packet for encryption. This will pass encryption parameters
through the ctx, where the ingress program attached to cilium_net will in
turn use those parameters to set skb->mark before passing the packet up to
the stack for encryption. At this point we have an skb ingress'ing the stack
with skb->mark=0xe00,srcIP=nodeIP,dstIP=podIP.

Here is where the trouble starts. This will miss the encryption policy rules
because, unless encryptNode is enabled, we do not have a rule to match
srcIP=nodeIP. So the unencrypted packet is sent back to cilium_host using a
route in the encryption routing table. Then finally bpf_host will send
the skb to the overlay, cilium_vxlan.

The packet is then received by the podIP node and sent to cilium_vxlan
because it's a vxlan packet. The vxlan header is popped off and the
inner (unencrypted) packet is sent to the pod; remember, srcIP=nodeIP
still.

Assuming the above was a SYN packet the pod will respond with a SYN/ACK
with srcIP=podIP,dstIP=nodeIP. This will be handled by the bpf_lxc
program.

First, we will do an ipcache lookup and get an ipcache entry, but this
entry will not have a tunnelIP. It will have a key though. Because
no tunnelIP is available a tunnel map lookup will be done. This will
fail because the dstIP is a nodeIP and not in a tunnel subnet. (The
tunnel map key is done by masking the dstIP with the subnet.)

Because the program did not find a valid tunnel the packet is sent to the
stack. The stack will run through iptables, but because the packet flow
is asymmetric (SYN over cilium_vxlan, SYN/ACK over nic) the MASQUERADE
rules will be skipped. The end result is that we try to send a packet with
srcIP=podIP.

Here a couple of different outcomes are possible. If the network infra
is strict, the packet may be dropped at the sending node, which refuses to
send a packet with an unknown srcIP. If the receiving node has rp_filter=1
on the receiving interface, the packet will be dropped with a martian-IP
message in dmesg. Or, if rp_filter=0 and the network somehow knows how to
route the srcIP, it might work. For example, some of my test systems
managed to work without any issues.
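
As a point of reference, the reverse-path filter setting can be inspected
with sysctl; a minimal check, assuming eth0 is the receiving interface:

    # 1 = strict reverse-path filtering: asymmetric replies like the one
    # described above are dropped on the receiving interface.
    sysctl net.ipv4.conf.eth0.rp_filter
    # The stricter of the 'all' and per-interface values applies.
    sysctl net.ipv4.conf.all.rp_filter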

In order to fix this we can ensure reply packets are symmetric to
the request. To do this we inject tunnel info into the nodeIP ipcache
entry. Once this is done, the replying pod will do the following:

The pod sends a packet with srcIP=podIP,dstIP=nodeIP, which is picked up
by the bpf_lxc program. The ipcache lookup now finds a complete entry with
the tunnel info. Instead of sending that packet to the stack, it will be
encapsulated in the vxlan header and sent to (and received on) the original
requesting node.

A quick observation, some of the passes through the xfrm stack are
useless here. We know the miss is going to happen and can signal this
to the bpf layer by clearing the encryption key in the ipcache. We
will do this as a follow up patch.
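
For reference, one way to inspect the resulting ipcache entry from inside
the Cilium agent pod; the node IP below is illustrative, and the exact
output format may vary between releases:

    # Before this fix the nodeIP entry has no tunnel endpoint; afterwards it
    # should report one, similar to the podIP entries above.
    cilium bpf ipcache get 10.0.0.4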

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit d869108 ]

[SG]etInternalIPv4Router - This is the CiliumInternalIP of the node. It
is only relevant within the Cilium-managed network (this means within
the node for direct routing mode and on the overlay for tunnel mode).
In order to avoid confusion with the InternalIP of a k8s node, we renamed
`[SG]etInternalIPv4` to `[SG]etInternalIPv4Router`.

[SG]etIPv4 - This returns one of the IPv4 node addresses based on the
following priority:
- NodeInternalIP
- NodeExternalIP
- other IP address type
These are reachable on the network. Since the result can be an External or
an Internal IP address, we renamed `[SG]etExternalIPv4` to `[SG]etIPv4`.

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit b60f8b7 ]

Previously, accessing a NodePort using the ExternalIP of the local K8s node
would result in a connection timeout. This is because the ExternalIP of the
local K8s node is not added to the ipcache and hence no service resolution
happens on egress.

This commit resolves this issue by adding the ExternalIP of the local k8s
node to the ipcache. We introduce two functions, `SetExternalIPv[46]` and
`GetExternalIPv[46]`, to store/fetch the External IPv4/IPv6 address of a k8s
node. If a node doesn't have any External IP address, then we store/return
`nil`.

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
… cluster

[ upstream commit 265d677 ]

This test checks connectivity to a NodePort service from inside a cluster
by sending a request to ExternalIP:NodePort.
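
A rough sketch of what such a check can look like from a pod inside the
cluster; the names and jsonpath queries are illustrative, not the exact
helpers used by the test suite:

    EXTERNAL_IP=$(kubectl get node <node-name> \
      -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}')
    NODE_PORT=$(kubectl get svc <service-name> \
      -o jsonpath='{.spec.ports[0].nodePort}')
    curl --connect-timeout 5 "http://${EXTERNAL_IP}:${NODE_PORT}/"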

Signed-off-by: Anish Shah <anishshah@google.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2ceb386 ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 55caedc ]

Strictly speaking this is a general truth for all services, but we don't
have a dedicated section explaining services, and users have begun
reporting issues specifically in relation to AWS / ENI mode. Put this
somewhere in the docs; we can always move it somewhere more generic when we
have a better location for these links to live.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 83cb610 ]

Setting this option makes Cilium run in a mode that allows it to reduce
its process' RSS even if the system is not under memory pressure.

By default, i.e. without 'madvdontneed' being set, Cilium will only
release its process' RSS when the system is under memory pressure. This
may be confusing for users since they will assume Cilium needs the
memory reported in RSS, which is not true.
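
For context, this maps to the Go runtime's madvise behavior; a minimal
sketch of running the agent with the setting (how the option is wired into
Cilium's packaging is not shown in this commit message):

    # Ask the Go runtime to return freed memory with MADV_DONTNEED instead
    # of MADV_FREE, so the process RSS shrinks without waiting for memory
    # pressure.
    GODEBUG=madvdontneed=1 cilium-agent <usual flags>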

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 317a671 ]

Kube-proxy installs the following rule in the iptables FORWARD chain:

    -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP

That rule causes iptables to drop packets whenever we have a path
asymmetry with for example a SYN packet going through iptables and a
SYN+ACK packet skipping iptables via BPF's redirect(). In that example,
the next ACK packet would be dropped by that rule because the connection
is invalid (SYN+ACK is missing) from iptables' point of view.

To avoid such drops, in the IPv4 case, Cilium installs rules in the
FORWARD chain (via its own CILIUM_FORWARD chain) to bypass the
subsequent -j DROP rule from kube-proxy. For example, with per-endpoint
routes enabled:

    -A FORWARD -m comment --comment "cilium-feeder: CILIUM_FORWARD" -j CILIUM_FORWARD
    -A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
    -A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
    -A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
    -A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
    -A CILIUM_FORWARD -o lxc+ -m comment --comment "cilium: any->cluster on lxc+ forward accept" -j ACCEPT
    -A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept (nodeport)" -j ACCEPT

The same rules are however missing from the IPv6 tables. This commit
fixes this inconsistency.
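
A quick way to compare both tables on a node after this change (output will
vary with the configuration):

    # IPv4 rules (already present before this fix):
    iptables-save  | grep CILIUM_FORWARD
    # IPv6 rules (added by this fix):
    ip6tables-save | grep CILIUM_FORWARD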

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ed68d2c ]

This commit fixes a broken packet path on pod --> remote pod via VIP
when running with encapsulation and per-endpoint routes, with kube-proxy
handling DNATing. The following details the broken packet path:

- SYN leaves lxc device for stack             with pod1IP -> VIP
- SYN goes through cilium_host and cilium_net with pod1IP -> pod2IP
- SYN is sent to second node through tunnel   with pod1IP -> pod2IP
- SYN+ACK arrives from tunnel on cilium_vxlan with pod2IP -> pod1IP
- SYN+ACK is processed by ipvX_local_delivery
  and redirected in BPF directly to the lxc
  device
- SYN+ACK arrives on lxc device               with pod2IP -> pod1IP
- RST is sent because pod2IP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of
the stack (including conntrack) on its way from cilium_vxlan to the lxc
device. By bypassing conntrack, we also bypass the reverse DNAT.

The bug doesn't occur when per-endpoint routes are disabled because, in
that case, the SYN is SNATed to the cilium_host IP before leaving
through the tunnel. The SYN+ACK packet therefore takes a different path
and goes through conntrack.

Fixes: #14646
Co-authored-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit ae96b92 ]

Signed-off-by: Yosh de Vos <yosh@elzorro.nl>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c8052a1 ]

When IPSec or the host firewall are enabled in tunneling mode, traffic
from a pod to a remote node is sent through the stack.

This introduces an asymmetry in kube-proxy DNAT and breaks connections
from a node (or hostns pod) to remote pods via
externalTrafficPolicy=Local services. The following details the broken
packet path node1 --VIP--> pod@node2 (VIP is node2IP):

- SYN leaves node1 via native device with  node1IP -> VIP
- SYN is DNATed on node2 to                node1IP -> podIP
- SYN is delivered to lxc device with      node1IP -> podIP
- SYN+ACK is sent from lxc device with     podIP   -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via tunnel with podIP   -> node1IP
- RST is sent because podIP doesn't match VIP

Because we use redirect() in BPF, the SYN+ACK packet bypasses most of
the stack (including conntrack) on its way from cilium_vxlan to the lxc
device. By bypassing conntrack, we also bypass the reverse DNAT.

The bug only occurs for externalTrafficPolicy=Local services because,
without that annotation, the SYN packet is SNATed when arriving on
node2. The SYN+ACK is therefore first sent via the stack to node2's IP
where it is both reverse SNATed and reverse DNATed.

Fixes: #14674
Fixes: #12542
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit c4356bd ]

For IPSec and the host firewall, in order to apply kube-proxy's reverse
DNAT we want to send traffic through the tunnel via the stack instead
of doing a redirect call. Such traffic therefore goes from the lxc
device to cilium_host where it is redirected to the tunneling device.

We however SNAT that traffic on its way to cilium_host. We don't need to
apply SNAT there for the pod CIDR as any traffic from the pod CIDR will
either leave directly through the native device (masquerading happening
there) or go through cilium_host and be redirected to the tunnel.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 2303803 ]

This test validates a potential regression on the path node1 ->
pod@node2 via service with externalTrafficPolicy=Local.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1db771f ]

This test validates a potential regression on the path node1 ->
pod@node2 via service with externalTrafficPolicy=Local when kube-proxy,
tunneling and the host firewall are enabled.

This commit also removes the explicit disabling of the host firewall in
externalTrafficPolicy=Local tests.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 5440f9f ]

This sets the `GOPS_CONFIG_DIR` environment variable in all images which
use gops and are based on a scratch image.

This fixes an issue with crun where due to missing environment
variables, cilium-operator and hubble-relay would error out when
starting the gops agent with the following error:

```
Error: failed to start gops agent: unable to get current user home
directory: os/user lookup failed; $HOME is empty
```

This only seems to occur on `crun`, which sets `HOME` to an empty
variable in scratch images. In non-scratch images, it sets HOME
according to `/etc/passwd`, so this currently only affects the
cilium-operator and hubble-relay images. Most other container runtimes
seem to set `HOME` to `/` in scratch images, which works for gops.

Fixes: #14301

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 773b1d2 ]

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3ce38c9 ]

This is a preparatory commit that introduces ENI-related
information for cilium_host (aka router). Specifically,
the information includes the primary ENI interface that the
secondary cilium_host IP is associated with, the primary ENI MAC address,
and the VPC CIDRs.
It also refactors the current IPAM internal IP allocation methods
so that the ENI information can be retrieved for cilium_host (aka router).

This information can be useful for setting up encryption-related state
such as ip rules and the interface to attach BPF programs to.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit f34371c ]

Previously, we weren't adding policy-based routing rules
for the cilium_host secondary IP address. These rules ensure
that traffic from the source node's cilium_host interface destined to a
remote node in the cluster VPC is routed through the primary
ENI interface for cilium_host. From there on, the traffic is routed to the
AWS infrastructure router, which in turn routes it to the destination node.

The fix involves setting up ip rules that point to the per-ENI routing table.
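
For illustration, the manually-typed equivalent of such a rule, using the
addresses and table number from the verification output below (Cilium
installs this programmatically):

    # Traffic from the cilium_host IP towards the VPC CIDR should use the
    # routing table of the primary ENI (table 11 in this example).
    ip rule add from 192.168.148.108 to 192.168.0.0/16 table 11 priority 111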

Verified that the correct rule is added, and encrypted traffic is
reaching the destination node.

root@ip-192-168-38-245:/home/cilium# ip addr show cilium_host
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether da:61:19:32:0b:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.148.108/32 scope link cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::d861:19ff:fe32:b37/64 scope link
       valid_lft forever preferred_lft forever
root@ip-192-168-38-245:/home/cilium# ip rule show | grep 192.168.148.108
111:	from 192.168.148.108 to 192.168.0.0/16 lookup 11

Tcpdump on the remote node shows encrypted packets being received:

tcpdump -eni eth1 dst 192.168.142.218
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
17:32:52.216989 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 130: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x64), length 96
17:32:52.217694 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 122: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x65), length 88
17:32:52.217896 02:93:9c:00:a2:eb > 02:72:ef:86:3e:17, ethertype IPv4 (0x0800), length 206: 192.168.117.26 > 192.168.142.218: ESP(spi=0x00000003,seq=0x66), length 172

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
…r ENI mode

[ upstream commit 2c5e192 ]

Problem: auto-detection of encryption.interface leads to the bpf_network
program, which performs decryption, being attached to the wrong network
interface. As a result, packets were not decrypted on the destination remote
node.

If users don't specify the cilium `encryption.interface`
configuration, the interface is auto-detected based on the main routing
table. This breaks in ENI IPAM mode, where cilium_host is allocated an IP
address from the pool of secondary IP addresses of a primary ENI interface.
The IP allocation is influenced by the specified `first-interface-index`
configuration. Hence, retrieve the correct primary interface for cilium_host
so that the bpf_network program can be attached to it.
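
As an aside, users who want to avoid relying on auto-detection can also pin
the interface explicitly; a hedged example, assuming a Helm-managed install
and that eth0 is the primary ENI interface:

    helm upgrade cilium cilium/cilium --namespace kube-system \
      --reuse-values \
      --set encryption.interface=eth0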

The fix is verified based on the xfrm stats:

Packet stats on the source node:

src 192.168.117.26 dst 192.168.142.218
	proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
	replay-window 0 seq 0x00000000 flag  (0x00000000)
	mark 0x3e00/0xff00 output-mark 0xe00
	aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
	anti-replay context: seq 0x0, oseq 0x9, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
	lifetime config:
	  limit: soft (INF)(bytes), hard (INF)(bytes)
	  limit: soft (INF)(packets), hard (INF)(packets)
	  expire add: soft 0(sec), hard 0(sec)
	  expire use: soft 0(sec), hard 0(sec)
	lifetime current:
	  540(bytes), 9(packets)
	  add 2021-02-05 18:12:08 use 2021-02-05 18:13:01
	stats:
	  replay-window 0 replay 0 failed 0

Packet stats on the dest node:

ip -s xfrm s
src 192.168.142.218 dst 192.168.117.26
	proto esp spi 0x00000003(3) reqid 1(0x00000001) mode tunnel
	replay-window 0 seq 0x00000000 flag  (0x00000000)
	mark 0x3e00/0xff00 output-mark 0xe00
	aead rfc4106(gcm(aes)) 0xea5799388f441ba52b8c60c4f11c084bcdf5373e (160 bits) 128
	anti-replay context: seq 0x0, oseq 0x6, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
	lifetime config:
	  limit: soft (INF)(bytes), hard (INF)(bytes)
	  limit: soft (INF)(packets), hard (INF)(packets)
	  expire add: soft 0(sec), hard 0(sec)
	  expire use: soft 0(sec), hard 0(sec)
	lifetime current:
	  360(bytes), 6(packets)
	  add 2021-02-05 18:12:08 use 2021-02-05 23:19:07
	stats:
	  replay-window 0 replay 0 failed 0

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 1e80ec0 ]

This commit updates the list of default label filters and clarifies the
default behavior, as well as how it changes when a first inclusive label
is added.

The example is also completed based on the new unit test added in the
previous commit.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3b0f6e8 ]

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno requested a review from a team as a code owner February 12, 2021 15:03
@pchaigno pchaigno added the backport/1.9 and kind/backports labels Feb 12, 2021
@aditighag aditighag (Member) left a comment

LGTM for my commits. I confirmed that the EnableIPv4Masquerade flag was changed from Masquerade (v1.9). Thank you!

@aditighag aditighag removed their assignment Feb 12, 2021
@joestringer (Member) commented

BPF checkpatch seems safe to ignore.

@joestringer (Member) commented

Smoketest failure should be addressed by adding a backport for #14954.

@aanm aanm (Member) left a comment

Looks good for my commit.

@aanm aanm removed their assignment Feb 12, 2021
@brb brb (Member) left a comment

LGTM.

[ upstream commit 3c630d8 ]

This commit pins promtool and also avoids using -u in the go get command.
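
For illustration, the general shape of that change (module path and version
are placeholders, not the ones from the actual commit):

    # Before: -u upgrades to the latest release on every run, which is not
    # reproducible.
    go get -u example.com/some/tool

    # After: requesting an explicit version keeps CI builds reproducible.
    go get example.com/some/tool@v1.2.3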

Closes #14950

Signed-off-by: Tam Mach <sayboras@yahoo.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
@gandro gandro (Member) left a comment

LG for my commit

@pchaigno pchaigno (Member, Author) commented Feb 15, 2021

Reviews are in and all required CI jobs are green. K8s-1.20-kernel-4.9 was triggered by mistake but isn't required for backports. Cilium-Ginkgo-Test-k8s and Cilium-PR-Ginkgo-Tests-K8s failed to provision but are now replaced by the K8s-1.X-kernel-4.9 jobs anyway.

Cilium-Ginkgo-Tests failed with what looks like a flake. That job seems to have been triggered by mistake (wrong trigger phrase config); it doesn't matter, however, since it's a duplicate of K8s-1.19-kernel-4.9.

Marking as ready to merge.

@pchaigno pchaigno added the ready-to-merge label Feb 15, 2021
@joestringer joestringer merged commit 6b96fa9 into v1.9 Feb 15, 2021
@joestringer joestringer deleted the pr/v1.9-backport-2021-02-12 branch February 15, 2021 23:41
@aanm aanm mentioned this pull request Feb 16, 2021