Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.9 backports 2021-05-07 #16048

Merged
merged 23 commits into from
May 10, 2021
Merged

v1.9 backports 2021-05-07 #16048

merged 23 commits into from
May 10, 2021

Conversation

brb
Copy link
Member

@brb brb commented May 7, 2021

v1.9 backports 2021-05-07

NOT BACKPORTED:

Once this PR is merged, you can update the PR labels via:

$ for pr in 14968 14816 15743 15934 15964 15959 15988 15985 15915 15783 15867 15882; do contrib/backporting/set-labels.py $pr done 1.9; done

pchaigno and others added 23 commits May 7, 2021 15:51
[ upstream commit 63996d6 ]

- Update the information on SIGs.
- Split the Slack channels in three categories and add a few new
  suggestions.
- Add a section on the Weekly Community Meeting.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 42eb1ed ]

Could be used to decide if the arping refresh interval needs to be
adjusted in order to limit spams.

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit fb08c6c ]

I often try to start test-only builds with e.g.:

    test-only --kernel_version=4.19 --focus="..."

That fails because our tests expect "419". We can extend the Python
script used to parse argument to recognize that and update
kernel_version to the expected format.

Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit d3dc63d ]

A Pod could potentially emit non-IPv4/v6 traffic and in that case we must
not set the aggregate since we also don't clear it in edt_sched_departure()
again. Hence, move the edt_set_aggregate() to the two supported protos.

Fixes: #15960
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit c233140 ]

Consider a socket which has SO_MAX_PACING_RATE of 4Gbit/s and the
socket being part of a Pod. This is currently broken given skb->tstamps
are cleared on BPF redirect as well as netns traversal even though
fq in hostns manages the socket's pacing. Rates would result being
unpredictable:

  root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec
   87380  16384  16384    40.04     655.52

  root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec
   87380  16384  16384    40.07    1274.70

  root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec
   87380  16384  16384    40.07    1519.32

  root@apoc:~/go/src/github.com/cilium/cilium# netperf -H 10.217.1.19 -t TCP_STREAM -l40 -s2
  MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.217.1.19 () port 0 AF_INET : demo
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec
   87380  16384  16384    40.06     849.96

We are working on a kernel side solution to retain skb->tstamps which
would fix this issue and result in a stable 4Gbit/s rate for this
example. Once that is merged we can reenable BBR from BWM side for
those kernels (and fallback to cubic for those that do not have it).

Related: #15324
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 9fc9583 ]

The fragment tracking test was disabled on GKE because it is
incompatible with endpoint routes [1]. Until that incompatiblity is
fixed, we can disable endpoint routes when running on GKE.

1 - #15958
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit e97c7c3 ]

Cilium's DNS proxy listens on a port allocated by the kernel. That port
can however conflict with the port used by echo-c-host-container from
the connectivity checks, leading to CI flakes when run regularly.

A kernel comment on the function allocating the port [1] gives us a way out:

    * if snum is zero it means select any available local port.
    * We try to allocate an odd port (and leave even ports for connect())

So using an even-numbered port for echo-c-host-container in the
connectivity checks should strongly reduce the likelihood of this flake
happening again. Other hostns pods already use even-numbered ports.

1 - https://elixir.bootlin.com/linux/v5.12.1/source/net/ipv4/inet_connection_sock.c#L354
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 785c2e7 ]

As discussed during community meeting 2021-05-03.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit ea381e7 ]

AWS-CNI seems to require more fields than the ones hard coded in Cilium
image. This patch adds the missing fields.

Error messages that might show up in pod describe are similar as:
```
network: invalid character '{' after top-level value
```
or
```
\n{\n    \"code\": 100,\n    \"msg\": \"add cmd: failed to assign an IP address to container\"\n}": invalid character '{' after top-level value
```

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit a994147 ]

This commit introduces a dedicated lock for managing node neighbor
entries.

Previously, the neighbor subsystem was sharing the same lock with the
node manager. This resulted in the lock starvation when the periodic
neighbor refresh was taking longer than expected and it was queuing up
the refresh requests.

To avoid the starvation which can slow down node updates, we introduce
the dedicated lock. Also, we make sure that insertNeighbor() won't block
NodeUpdate() in the case of the scenario above by running it in a
separate goroutine (same applies to deleteNeighbor() and NodeDelete()).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 7c45d17 ]

There is no default case in the select, so the select will block until
one of its cases is fulfilled.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 0d72941 ]

- Lock insertNeighbor() before accessing shared structures.
- Defer neighbor BPF map retirement so that it's moved from the contend
  path.
- Make getSrcAndNextHopIPv4() a function to indicate that it's not using
  any shared structures.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit a82137b ]

This commit changes how insertNeighbor() method takes the lock. Instead
of holding the lock for the entire execution duration, we release the
lock while arpinging. This allows parallel arpings which should make the
periodic ARP refresher faster.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 38d0049 ]

It can happen that the insertNeighbor(refresh=true) gets called before
insertNeighbor(refresh=false), and only the latter can increment the
neigh refcounting.

Remove the optimization for the sake of refcounting correctness.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit e779e96 ]

Instead of deriving the value from
net.ipv4.neigh.default.base_reachable_time_ms which is 30s by default
use a user specified value for the ARP periodic refresher.

Running ARP refresher every 15-45s is too often, as in most clusters
node hw addrs don't change at all.

Users who have highly dynamic environments will be able to schedule the
refresher in higher frequency by using the flag.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 0ff85f2 ]

Most of arping failures which we observed are transient, so these false
positives might mislead users.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit fbadf99 ]

When using subnet mode for encryption IPs subnets are not bound to
any specific node. So rather than use xfrm stack to write src/dst
IPs in the IPSec path we use bpf_host.o attached to cilium_host
qdisc to complete the header. This is necessary because we want
to avoid having an entry per pod in the xfrm tables and instead
have at max an entry per node.

Currently, we use the srcIP of the first interface we find in
the egress interface list. This works on some more forgiving
networks, but EKS uses source based IP routing, but does not
have a source IP in the source routing tables for all network
facing interfaces. What happens then is we rewrite the SIP of
the IPSec header to the IPAddr assigned to the network interface.
This is passed to the stack for routing. The routing table
doesn't have a source entry for the SIP so it uses the default
route. The default may or may not point to the first interface.
If it doesn't then the network appears to do a firewall rule
where the NIC/SIP lookup is unexpected and the network drops
the packet either by firewall or just missing routing rules.
It appears common for the first interface "eth0" to match the
default route so often get working by depending on this
ordering, buts its very fragile and easy to break.

Fix by using the SIP of cilium_host which always has an
entry in the source routing table.

Note: Above only applies to non-fib case (kernels <4.17) with
fib lookup we do a lookup in BPF datapath and write MACs
and IPs correctly without stack intervention.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit d56f565 ]

We can not do a redirect to the egress interface if the fib lookup
is not supported. If we do try to do the redirect there is no guarantee
the MAC addresses are correct because we didn't rewrite them from the
BPF datapath. This may work, if the network is very fogiving, but more
likely the NIC will drop the packet because of MAC addresses being
incorrect.

Also without a correct ENCRYPT_IFINDEX the FIB lookup will fail. So
lets guard fib lookups with both encrypt ifindex and fib support this
way we will pass the packets to the stack instead of dropping them
when we don't have the correct set of kernel features or config.

EKS creates a unique problem here. EKS does source routing and has
multiple egress interfaces. The correct egress interface is unclear
at init time (there are multiple interfaces) and can change at runtime.
Rather than try to pick an interface (fragile!) let the stack route
the packet and skip FIB lookup and redirect from BPF datapath.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit 69e885c ]

We moved the subnet detection lower into the init path so that it handled
reinitialization, but unfortunately this broke setting IP_POOLS define
in header writing.

Do the simplest fix and just always enable IP_POOLS for ENI case.

Fixes: a42d442 ("cilium: auto-discovery pod subnets for ENI IPAM")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit e34115a ]

In order to push auto-discovery of the subnet into the nodeSubent fields we
need to do the discovery early enough in the setup that the subnets exist.
At the moment we do the auto-discovery too late and the values are not
pushed into the nodeConfiguration so we miss updating ipsec route table
correctly.

Rather than overwrite the config options this patch pushes the code into
the nodeConfigurationChanged() and updates the node configuration directly.
This has two advantages: we avoid stomping on user config and then
we also catch any subnet updates.

Fixes: a42d442 ("cilium: auto-discovery pod subnets for ENI IPAM")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit adb5ccf ]

The dmac may not be set in the case of !HAVE_FIB_SUPPORT so we need
to ensure the stack doesn't mark this packet as OTHERHOST and drop
it.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit c3d971a ]

When nodes are connected via L3 GW, the periodic ARP refresher pings the
same IP (GW). To avoid that, introduce neighLastPingByNextHop which
stores a timestamp of the last successful arping for a given next hop.
If delta between the last and the next arping is less than
option.Config.ARPPingRefreshPeriod, then the arping is skipped until the
next time the refresher is scheduled.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
[ upstream commit a753b38 ]

The impact of the wrong removal was very small - unnecessary neigh
insert via netlink and dangling unused value. This could have happened
under very rare circumstances when a next hop to a node had changed.

Fixes: 0483ba0 ("node-neigh: Do not inc neighbor refcount for the same node")
Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb added kind/backports This PR provides functionality previously merged into master. backport/1.9 labels May 7, 2021
@brb brb requested a review from a team as a code owner May 7, 2021 14:03
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for my changes.

@brb
Copy link
Member Author

brb commented May 7, 2021

test-backport-1.9

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for my commits

@jrajahalme
Copy link
Member

@pchaigno Verifier test hit by #16050, not sure if this can be considered as unrelated flake for this PR or not?

@pchaigno
Copy link
Member

pchaigno commented May 7, 2021

@pchaigno Verifier test hit by #16050, not sure if this can be considered as unrelated flake for this PR or not?

We can ignore if it's the same error message (it is here). It already fails on v1.9.

@brb brb added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 10, 2021
@ti-mo ti-mo merged commit 7dc246b into v1.9 May 10, 2021
@ti-mo ti-mo deleted the pr/backport-1.9-2021-05-07 branch May 10, 2021 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants