Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.8 backport 2020-06-16 #12103

Merged
merged 44 commits into from Jun 16, 2020
Merged

v1.8 backport 2020-06-16 #12103

merged 44 commits into from Jun 16, 2020

Conversation

borkmann
Copy link
Member

@borkmann borkmann commented Jun 16, 2020

v1.8 backports 2020-06-16 (no conflicts, had to take in #11929 to get #11981):

Once this PR is merged, you can update the PR labels via:

$ for pr in 12064 12001 11978 12066 12065 11981 12054 12083 12069 12050 12081 12092 12068 12082 12045 12095 12096 12074 11929; do contrib/backporting/set-labels.py $pr done 1.8; done

ap4y and others added 30 commits June 16, 2020 15:38
[ upstream commit b330436 ]

Restart command in documentation pipes complete output of the awk into
xargs which make creates invalid kubectl command. This commit updates
xargs to use per line mode and also makes it omit empty input to not
execute empty "kubectl delete pod" on the last line.

Signed-off-by: Arthur Evstifeev <aevstifeev@gitlab.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 4764e6d ]

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit f3b474a ]

Since recently, allocations results returned without CIDRs became fatal
for CNI (CNI cmdAdd() would fail due to interfaceAdd() failing, due to
RoutingInfo.parse() erroring out on `len(cidrs) == 0`).

That (now mandatory) `IPv4CIDRs` is used when masquerading is enabled,
to source-route traffic from the pods to its attached VPC CIDRs through
a dedicated local table (that I assume is exempt from masquerade).

But on Azure we don't collect and dont't propagate subnets CIDRs, and
that bit of information is missing from ciliumnodes azure crd/ipam.

Adding masquerade support to Azure CNI (therefore, collecting and bubbling
down per-address CIDRs) would be a nice feature to have in the longer term.
But we might be late in the feature freeze to implement that on time to
unbreak Azure CNI before 1.8 release; and there's no reason to insist on
having CIDRs when masquerade is disabled anyway.

Signed-off-by: Benjamin Pineau <benjamin.pineau@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit da09664 ]

Add info to make people aware that they don't need CNI chaning anymore.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 831c5ee ]

Clarify that this is only needed when kube-proxy replacement is disabled.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 5637e0e ]

PR #11991 added additional BPF maps to be influenced by the
bpf-map-dynamic-size-ratio flag. Adjust the upgrade guide to document
these changes, as well as the newly added neighMax option.

Also correct the ratio given in the architecture section, so it
corresponds to 0.25%, not 2.5%.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 325b06a ]

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 74c6d25 ]

This reverts commit a4d86b2.

Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit bdb6247 ]

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 17b597d ]

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit b2762cc ]

Such device is denoted with "(DR)", e.g.:

    KubeProxyReplacement:   Strict   [eth0 (DR), eth1]   [NodePort
    (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs,
    HostReachableServices (TCP, UDP), SessionAffinity]

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit fe4e456 ]

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit f1f4e94 ]

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 1d190eb ]

v1.5 is going out of style with v1.8, drop the upgrade notes. Users can
refer to the v1.7 upgrade notes if they need to upgrade from v1.5.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit c216dfb ]

Update the example output of the commands to the way they look wiht
Cilium 1.8.0-rc3.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit bed67f6 ]

Just for DRY.

Also, make the log messages in initKubeProxyReplacementOptions()
more consistent and less ambiguous.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 0e3f827 ]

Move all the checks which do not depend on
option.Config.{DirectRoutingDevice,Devices} above the device auto
detection.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 11650b3 ]

We no longer consider that the first device among option.Config.Devices
is going to be used for direct routing.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 5c2c8b1 ]

The second helper function is supposed to call the device detection, and
the third to finish the initialization.

It's required, because the host-fw also requires the auto detection, and
extending the former helper to consider host-fw is a bit brittle, as any
premature return would prevent from the auto-detection for the host-fw.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit e4bd54c ]

Enable the device auto-detection when BPF NodePort is disabled, but
host-fw is enabled.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit d876cdf ]

Previously, when falling back to iptables-based masquerading, if a user
had specified --egress-masquerade-interface, the agent considered it
as part of BPF masquerading options. This made it obviously to fail.

Fix: b7b962b ("daemon: Fallback to iptables if BPF masq cannot be enabled")
Reported-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit f5b7027 ]

To avoid breaking existing guides, fallback to iptables-based
masquerading if --egress-masquerade-interface is set.

In v1.9, we plan to add a support for the flag which will make
the transition from BPF to iptables -based masquerading transparent
when the flag is set.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 4530c0e ]

These are shell sessions for highlighting purposes, not bash scripts.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 6dfa5f1 ]

These instructions will *not* import on bird 1.x. Make this more clear.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 26a4b29 ]

The previous instructions hinted at a successful export, but only proved
that bird parsed the config correctly. Validate the peering & export via
the bird commandline.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 00ef71b ]

This pulls in cilium/ebpf#81 which fixes a crash when trying to
initialize BPF per ring buffers for offline CPUs:

  level=fatal msg="Cannot initialise BPF perf ring buffer sockets" error="failed to create perf ring for CPU 2: can't create perf event: can't create perf event: no such device" startTime="2020-06-15 12:15:09.153912253 +0000 UTC m=+129.850487215" subsys=monitor-agent

Reported-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit d1f2b48 ]

This updates the script to allow the user to specify which namespace
Cilium is deployed in.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 348686d ]

We have --enable-{ipv4,ipv6}={true,false} setting. The --disable-ipv4 is
legacy and hidden for some time already. Schedule for removal.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit d03996c ]

This fixes a bug where during a Cilium upgrade test full connectivity is lost.
The major reason why it is happening is basically the test for ...

  if strings.Contains(bpfFilter.Name, "bpf_netdev.o") [...]

... which is buggy due to multi-dev support since we now end up with something
like bpf_netdev_eno1.o instead of bpf_netdev.o . This means that tc BPF progs
never get removed from the devices, yet we're tearing down everything else and
once we remove all files from the BPF fs then in case of tail call maps, the
kernel will purge all BPF programs inside tail call maps due to circular dependency
avoidance. However, this means that those still-attached programs go into a
drop-all mode due to missed tail calls.

Fix this in four ways: i) Match on "bpf_netdev" prefix instead so we catch old
but also new names, ii) Reorder BPF fs cleanup with tc BPF prog removal, meaning,
we first need to remove all progs from the devices and only then clean up BPF fs,
iii) When calling with --bpf-state only, then also remove progs from devices, and
iv) Last but not least, add also other tc prog names. Step ii) to iv) is necessary
since there can still be a small race window where we missed tail calls right
before the device BPF prog removal.

'K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master' test
case is then fixed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 83b43e9 ]

In XDP, we can only xmit via XDP_TX, so a multi-device setup would
otherwise be broken. Tell the user to specify the device manually
in case there were multiple devices detected.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
borkmann and others added 14 commits June 16, 2020 15:46
[ upstream commit 90b8062 ]

It is otherwise just confusing to users to figure out whether they need
to react to this or not. Only really warn if we cannot auto-derive it
from the k8s Node IP since this case would be unexpected.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 71e994c ]

Add few notes on how to deal with nodes that have multiple devices which
are auto-detected. Also make it clear as a limitation.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 8b73d6c ]

Add a small paragraph for helping users to verify whether enabled or not.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 087345c ]

Make users aware that they might need to tweak their map sizes and
link to the guide which has more details on it.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit dfa8bb8 ]

Commit b2c9b07 ("bpf, sock: reduce xlation for externalTrafficPolicy=Local
to host_id") skips service translation for remote externalTrafficPolicy=Local
services (aka not being HOST_ID ipcache scope).

However, later on, we relaxed bpf_lxc to compile legacy service xlation back in
for services with externalIPs via 6426c86 ("bpf, external ip: fix service
xlation for containers"), which would then go and try to xlate such services
locally.

Issue is that here we don't differentiate between the different service types
so we would go and attempt to xlate. Given we don't populate service maps with
remote backends that traffic gets blackholed on bpf_lxc.

Instead, only xlate externalIP services at that layer and skip the rest, so we
let externalTrafficPolicy=Local dsts pass through and be xlated at the remote
LB. We /already/ have this kind of behavior when we have netns cookies since
the bpf_lxc xlation is compiled out entirely there, but not for older kernels,
thus, fix it to make both consistent.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 64dddf7 ]

On Linux 4.19 (and any kernel with 4096 BPF instructions limit), when
running with kube-proxy-replacement=strict, IPv4, IPv6, fragment support
(and in debug mode), Cilium fails to load bpf_overlay and bpf_host with:

  Prog section '2/19' rejected: Argument list too long (7)
   - Type:         3
   - Attach Type:  0
   - Instructions: 4203 (107 over limit)
   - License:      GPL
  Error filling program arrays!

2/19 is section CILIUM_CALL_NODEPORT_REVNAT. We can reduce the number of
instructions in that section by breaking the IPv4 and IPv6 cases with
tail calls. Instead of introducing a new tail call on each path, we can
turn CILIUM_CALL_NODEPORT_REVNAT into CILIUM_CALL_IPV{4,6}_NODEPORT_REVNAT.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit c862a71 ]

Given v4/v6 has been split off into different tail calls, it enables
us to significantly bump the collision retries on older kernels.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 874bb0e ]

... as otherwise it won't work on multi-dev node. Revert back to auto-detection
after XDP tests have run.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit bc5e161 ]

Re-enable it such that we can run CI with kube-proxy-free BPF code alongside
kube-proxy.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit cb65fca ]

Prior tests have been run with the latest Cilium image. Using the old one
for the initial cleanup is broken since it will not clean-up everything.

For example the 1.7 cleanup method will try to remove tc attached filters
with the name bpf_netdev.o, but for 1.8/latest they are names bpf_netdev_<dev>.o
which will then break connectivity after flushing all files from BPF fs as
tail call maps are purged which gets us into drop-all due to missed tail calls.

Run new (and old) to make sure there are no quirks from old version left behind
that the new one didn't catch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 0fa8ac7 ]

... and instead rely on kube-proxy's strict replacement.

Also disable tests that are not working on 4.19 due to missing kernel
features on NodePort side.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 67281df ]

This adds a note about setting the `--node-ip` correctly in kube-proxy
free deployments with kubeadm. Without `--node-ip` set correctly,
Cilium's multi-device detection will not pick up all needed interfaces.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 0c4037c ]

Use parsed-literal, so SCM_WEB gets expanded properly in the
kubectl apply command.

Also add a missing AGE column value in the `kubectl get pods` output and
remove all trailing whitespaces.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[ upstream commit 015ffbd ]

Remove ginkgo test source dependency on pkg/iptables, as that only
compiles on Linux. This allows running ginkgo from OSX against GKE,
for example.

Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann requested review from a team as code owners June 16, 2020 13:53
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.8 kind/backports This PR provides functionality previously merged into master. labels Jun 16, 2020
@borkmann
Copy link
Member Author

test-backport-1.8

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for my changes

Copy link
Contributor

@errordeveloper errordeveloper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#11978 LGTM 👍

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for my changes

@borkmann
Copy link
Member Author

retest-4.19

@borkmann borkmann merged commit 153cb98 into v1.8 Jun 16, 2020
@borkmann borkmann deleted the pr/v1.8-backport-2020-06-16 branch June 16, 2020 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet