Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: more scalability improvements #11694

Merged
merged 12 commits into from May 30, 2020
Merged

bpf: more scalability improvements #11694

merged 12 commits into from May 30, 2020

Conversation

borkmann
Copy link
Member

@borkmann borkmann commented May 26, 2020

Needs release-blocker flag given the jiffies conversion which falls under the agent clock source probe flag.

See commit msgs.

@maintainer-s-little-helper
Copy link

Please set the appropriate release note label.

@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 May 26, 2020
@borkmann borkmann requested a review from brb May 26, 2020 13:07
@borkmann borkmann added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. labels May 26, 2020
@coveralls
Copy link

coveralls commented May 26, 2020

Coverage Status

Coverage decreased (-0.003%) to 36.905% when pulling 0e38c41 on pr/dp-improvements into 05163ed on master.

@borkmann borkmann force-pushed the pr/dp-improvements branch 6 times, most recently from 23ca008 to 999ee8d Compare May 28, 2020 13:30
... which uses jiffies if possible and otherwise falls back to ktime. See
also commit cd7921f ("bpf: implement time source switch and use it in
CT") for more information wrt ktime vs jiffies.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For more info see commit cd7921f ("bpf: implement time source switch
and use it in CT") for more information wrt ktime vs jiffies. Given the CT
maps have been switched, we also should do it for NAT to allow for potential
created time correlation of entries.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If there is no entry in the affinity map, then it's pointless to fetch
bpf_mono_now() and prep the match key. Only go for it when really needed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The only thing that lb4_update_affinity_by_netns() is doing for existing
entries is updating the last_used timestamp, nothing else. So lets not
hit the BPF map update spinlock from all CPUs. Instead use a READ_ONCE()/
WRITE_ONCE() scheme and update the timestamp at the time we do the lookup
where we have the val already.

Only call the lb4_update_affinity_by_netns() update when the
backend_from_affinity is false which indicates that a backend was just
newly selected.

The lb4_update_affinity_by_netns() is addressed separately to ease review.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Using the kernel's fib lookup is too expensive and not strictly needed
for the case of having a hairpinned load-balancer as in XDP. Given we
already have a faster NODEPORT_NEIGH{4,6} cache that we populate for
inbound connections, just look up the backend mac from there. Avoiding
the fib lookup in XDP bumps performance by ~1.5Mpps in my environment.
We can only make this assumption for the hairpin case where the LB is
exposed to a single public dev. We can extend nodeport_lb_hairpin()
later on also for the tc case. The assumption on why this is okay is
because backend pods have src IP validation in bpf_lx via checking on
is_valid_lxc_src_ip{,v4}() which prevents from spoofing (src IP/MAC)
in general.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Do the same conversion as done in the DSR case also for SNAT NodePort
which similarly helps to improve the performance by avoiding the fib
lookup in the kernel. Improving by similar numbers as with DSR case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann force-pushed the pr/dp-improvements branch 3 times, most recently from 94a1b6a to 5fb6e96 Compare May 29, 2020 00:00
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mostly, some small nits and would be nice to have this covered by unit tests as for the other maps size options.

Also, I think the cmdref for cilium-agent probably needs to be re-generated as part of commit 5fb6e96a447e106644b6ed472b18bb47586bf2ca ("bpf: make neigh{4,6} maps size configurable from agent and helm ").

@@ -2661,6 +2670,15 @@ func (c *DaemonConfig) calculateDynamicBPFMapSizes(totalMemory uint64, dynamicSi
} else {
log.Debugf("option %s set by user to %v", NATMapEntriesGlobalName, c.NATMapEntriesGlobal)
}
if !viper.IsSet(NeighMapEntriesGlobalName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you maybe also update the unit tests to cover the newly introduced Neigh map size here:

func TestCheckMapSizeLimits(t *testing.T) {

and here:

func TestCheckIPv4NativeRoutingCIDR(t *testing.T) {

?

Copy link
Member Author

@borkmann borkmann May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, adding as follow-up as discussed. Also that we should fix the dynamic sizing calc even before this patchset since it doesn't take neigh BPF map into account.

@borkmann borkmann force-pushed the pr/dp-improvements branch 2 times, most recently from d1b7127 to 531ee33 Compare May 29, 2020 21:26
Given we reuse neigh BPF tables for client and backends, it doesn't make
sense to push the HW addresss down only once upon k8s node events like
nodeUpdate() and nodeDelete() where we ARP probe and update the fib. Since
this is an LRU cache they can get evicted. Therefore update once we did
a successful fib lookup so we don't need to perform another one next time
and have the guarantee that these continue to be cached once evicted.
From control plane, we need to retire old entries however, so deletes are
pushed from there.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Enforce retiring of neigh cache entries either on update or deletion of
new nodes. This will enforce the datapath to re-cache the neigh entry
from the fib. This is needed in order to not leave stale entries around.
Only do this when we have acceleration enabled right now since this is
equivalent to what the datapath does as well. We might later extend this
for tc as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Small enough that it shouldn't matter much, but add a note here wrt
backend selection. Slighly clean up some of the other comments.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We only ever need to track neigh entries of clients in case of SNAT
when the reply comes back to us or in the case where its local to the
node. For DSR and Hyrbid (DSR for TCP) we don't need to track them
here since the node only ever sees to original requests which improves
scalability for new client mappings as tracking can be avoided altogether.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann marked this pull request as ready for review May 29, 2020 22:21
@borkmann borkmann requested review from a team May 29, 2020 22:21
@borkmann borkmann requested review from a team as code owners May 29, 2020 22:21
@borkmann borkmann requested review from a team and joestringer May 29, 2020 22:21
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 May 29, 2020
@borkmann
Copy link
Member Author

test-me-please

The BPF neigh{4,6} maps are decoupled from SNAT NodePort mode these
days and are used for i) request IP/Mac mappings for remote backends,
ii) for local backends and iii) backend to Mac mappings. Make its size
configurable in order to allow for more flexibility. Default to the
same as NAT map size.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann
Copy link
Member Author

test-me-please

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue on storage of the mono higher resolution counter and one minor initialization suggestion below.

bpf/lib/lb.h Show resolved Hide resolved
Comment on lines 319 to 322
struct bpf_fib_lookup fib_params = {
.family = AF_INET6,
.ifindex = DIRECT_ROUTING_DEV_IFINDEX,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need to initialize this until the fib_lookup() fallback case below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least the ifindex is used also in the non-fib case. Given prior case was unconditional struct bpf_fib_lookup fib_params = {}; I thought I'd reuse it for the neigh map case. I can look into reworking it a bit as follow-up.

bpf/lib/lb.h Show resolved Hide resolved
@joestringer joestringer merged commit b7be1c0 into master May 30, 2020
1.8.0 automation moved this from In progress to Merged May 30, 2020
@joestringer joestringer deleted the pr/dp-improvements branch May 30, 2020 23:36
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 3, 2020
@joestringer joestringer moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.8.0
  
Merged
1.8.0
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

4 participants