bpf: more scalability improvements #11694
Conversation
Please set the appropriate release note label.
Force-pushed from 23ca008 to 999ee8d.
... which uses jiffies if possible and otherwise falls back to ktime. See also commit cd7921f ("bpf: implement time source switch and use it in CT") for more information wrt ktime vs jiffies. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
See commit cd7921f ("bpf: implement time source switch and use it in CT") for more information wrt ktime vs jiffies. Given the CT maps have been switched, we should also do it for NAT to allow for potential correlation of entry creation times. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
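The two commit messages above hinge on keeping CT and NAT timestamps in the same units, whichever time source the agent probes. Below is a minimal user-space sketch of that idea; the constants and names (`KERNEL_HZ`, `timeout_in_ticks`) are illustrative assumptions, not Cilium's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the time source switch: when the agent has
 * probed CONFIG_HZ, timestamps are kept in jiffies (coarse, cheap to
 * read); otherwise they fall back to ktime nanoseconds. */
#define KERNEL_HZ    250UL          /* assumed probed CONFIG_HZ value */
#define NSEC_PER_SEC 1000000000ULL

enum clock_source { CLOCK_SRC_JIFFIES, CLOCK_SRC_KTIME };

/* Convert a timeout in seconds into the units of the active time
 * source, so CT and NAT entries carry comparable timestamps. */
static uint64_t timeout_in_ticks(enum clock_source src, uint64_t seconds)
{
	return src == CLOCK_SRC_JIFFIES ? seconds * KERNEL_HZ
					: seconds * NSEC_PER_SEC;
}
```

The key point is that both map families must agree on the unit, otherwise entry-age comparisons between CT and NAT become meaningless.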
If there is no entry in the affinity map, then it's pointless to fetch bpf_mono_now() and prep the match key. Only go for it when really needed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The only thing that lb4_update_affinity_by_netns() is doing for existing entries is updating the last_used timestamp, nothing else. So let's not hit the BPF map update spinlock from all CPUs. Instead, use a READ_ONCE()/WRITE_ONCE() scheme and update the timestamp at the time we do the lookup, where we already have the val. Only call the lb4_update_affinity_by_netns() update when backend_from_affinity is false, which indicates that a backend was just newly selected. The lb4_update_affinity_by_netns() change is addressed separately to ease review. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 999ee8d to d4afb6d.
Using the kernel's fib lookup is too expensive and not strictly needed for the case of having a hairpinned load-balancer as in XDP. Given we already have a faster NODEPORT_NEIGH{4,6} cache that we populate for inbound connections, just look up the backend mac from there. Avoiding the fib lookup in XDP bumps performance by ~1.5Mpps in my environment. We can only make this assumption for the hairpin case where the LB is exposed to a single public dev. We can extend nodeport_lb_hairpin() later on also for the tc case. The reason this is okay is that backend pods have src IP validation in bpf_lxc via is_valid_lxc_src_ip{,v4}(), which prevents spoofing (src IP/MAC) in general. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
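The fast path described above can be sketched as a small cache consulted before any fib lookup. This is a user-space illustration only; the real NODEPORT_NEIGH{4,6} maps are BPF hash/LRU maps and their layout differs:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative neigh cache: try the cached backend MAC first and only
 * fall back to the expensive fib lookup on a miss. Slot count and
 * entry layout are assumptions for the sketch. */
#define NEIGH_SLOTS 256

struct neigh_entry {
	uint32_t ip;       /* backend IPv4, network order */
	uint8_t  mac[6];   /* cached destination MAC      */
	uint8_t  valid;
};

static struct neigh_entry neigh_cache[NEIGH_SLOTS];

/* Returns 0 and fills mac on a hit; -1 means the caller must fall
 * back to the fib lookup. */
static int neigh_lookup(uint32_t ip, uint8_t mac[6])
{
	struct neigh_entry *e = &neigh_cache[ip % NEIGH_SLOTS];

	if (e->valid && e->ip == ip) {
		memcpy(mac, e->mac, 6);
		return 0;
	}
	return -1;
}

/* Populate the cache, e.g. when an inbound connection is seen. */
static void neigh_update(uint32_t ip, const uint8_t mac[6])
{
	struct neigh_entry *e = &neigh_cache[ip % NEIGH_SLOTS];

	e->ip = ip;
	memcpy(e->mac, mac, 6);
	e->valid = 1;
}
```

The design choice is the same as in the commit message: a cheap exact-match cache is safe here because source validation in the backend path prevents spoofed entries from being useful to an attacker.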
Do the same conversion as done in the DSR case also for SNAT NodePort, which similarly helps to improve performance by avoiding the fib lookup in the kernel. This improves throughput by similar numbers as in the DSR case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 94a1b6a to 5fb6e96.
LGTM mostly; some small nits, and it would be nice to have this covered by unit tests as for the other map size options.
Also, I think the cmdref for cilium-agent probably needs to be re-generated as part of commit 5fb6e96a447e106644b6ed472b18bb47586bf2ca ("bpf: make neigh{4,6} maps size configurable from agent and helm").
```diff
@@ -2661,6 +2670,15 @@ func (c *DaemonConfig) calculateDynamicBPFMapSizes(totalMemory uint64, dynamicSi
 	} else {
 		log.Debugf("option %s set by user to %v", NATMapEntriesGlobalName, c.NATMapEntriesGlobal)
 	}
+	if !viper.IsSet(NeighMapEntriesGlobalName) {
```
Could you maybe also update the unit tests to cover the newly introduced Neigh map size here:
cilium/pkg/option/config_test.go, line 265 in af9fd59 (`func TestCheckMapSizeLimits(t *testing.T) {`)
and here:
cilium/pkg/option/config_test.go, line 472 in af9fd59 (`func TestCheckIPv4NativeRoutingCIDR(t *testing.T) {`)?
Agree, adding as follow-up as discussed. Also that we should fix the dynamic sizing calc even before this patchset since it doesn't take neigh BPF map into account.
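The dynamic sizing fix mentioned above amounts to giving each BPF map a memory budget and clamping the resulting entry count. A hedged sketch of what such a calculation might look like follows; the ratio, entry size, and function names are illustrative assumptions, not Cilium's actual formula, and the follow-up's point is that the neigh map's footprint must be part of this budget too:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical dynamic sizing: take a fixed share of total memory,
 * divide by the per-entry cost, and clamp to sane limits. */
static uint64_t dynamic_map_entries(uint64_t total_memory, double ratio,
				    uint64_t entry_size,
				    uint64_t min_entries, uint64_t max_entries)
{
	uint64_t entries = (uint64_t)(total_memory * ratio) / entry_size;

	return clamp_u64(entries, min_entries, max_entries);
}
```

With a sizing scheme like this, omitting one map (such as the neigh map) from the budget means the remaining maps silently overcommit the memory the operator allotted.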
Force-pushed from d1b7127 to 531ee33.
Given we reuse neigh BPF tables for clients and backends, it doesn't make sense to push the HW address down only once upon k8s node events like nodeUpdate() and nodeDelete(), where we ARP probe and update the fib. Since this is an LRU cache, entries can get evicted. Therefore, update the map after each successful fib lookup so we don't need to perform another one next time, and so entries get re-populated once evicted. From the control plane, we however need to retire old entries, so deletes are pushed from there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Enforce retiring of neigh cache entries on node update or deletion. This forces the datapath to re-cache the neigh entry from the fib, which is needed in order to not leave stale entries around. Only do this when acceleration is enabled for now, since this matches what the datapath does as well. We might later extend this for tc, too. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Small enough that it shouldn't matter much, but add a note here wrt backend selection. Slightly clean up some of the other comments. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We only ever need to track neigh entries of clients in the case of SNAT, when the reply comes back to us, or in the case where the client is local to the node. For DSR and Hybrid (DSR for TCP) we don't need to track them here, since the node only ever sees the original requests; this improves scalability for new client mappings as tracking can be avoided altogether. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
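The mode check described in this commit message can be sketched as a small predicate. The enum and function names below are illustrative, not Cilium's identifiers; the logic mirrors the commit's reasoning that only paths which see reply traffic need client neigh tracking:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical LB forwarding modes for the sketch. */
enum lb_mode { LB_MODE_SNAT, LB_MODE_DSR, LB_MODE_HYBRID };

/* SNAT sees replies, so it must track client neigh entries. DSR
 * replies bypass the node entirely. Hybrid uses DSR for TCP and
 * SNAT for everything else. */
static bool needs_client_neigh_tracking(enum lb_mode mode, bool is_tcp)
{
	switch (mode) {
	case LB_MODE_SNAT:
		return true;
	case LB_MODE_DSR:
		return false;
	case LB_MODE_HYBRID:
		return !is_tcp;
	}
	return true; /* conservative default */
}
```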
Force-pushed from 531ee33 to 0995c19.
test-me-please
The BPF neigh{4,6} maps are decoupled from SNAT NodePort mode these days and are used for i) request IP/MAC mappings for remote backends, ii) the same for local backends, and iii) backend-to-MAC mappings. Make their size configurable in order to allow for more flexibility. Default to the same as the NAT map size. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Force-pushed from 0995c19 to 0e38c41.
test-me-please
One issue on storage of the mono higher resolution counter and one minor initialization suggestion below.
bpf/lib/nodeport.h (Outdated)

```c
struct bpf_fib_lookup fib_params = {
	.family = AF_INET6,
	.ifindex = DIRECT_ROUTING_DEV_IFINDEX,
};
```
We probably don't need to initialize this until the fib_lookup() fallback case below.
At least the ifindex is also used in the non-fib case. Given the prior code unconditionally did `struct bpf_fib_lookup fib_params = {};`, I thought I'd reuse it for the neigh map case. I can look into reworking it a bit as a follow-up.
Needs the release-blocker flag given the jiffies conversion, which falls under the agent clock source probe flag. See commit messages.