
bpf: optimized memmove for XDP + DSR #11676

Merged
merged 2 commits into from May 25, 2020
Conversation

borkmann (Member)

See commit msgs.

Add an implementation for small sizes and throw a build bug for unsupported
ones. This is used in XDP's DSR implementation, see ctx_adjust_room(). There,
we also know a priori that dst <= src always holds, so __bpf_memmove_fwd()
is used directly.
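The idea can be sketched in plain C as follows. This is a hypothetical illustration, not the PR's actual code: the macro and helper names mirror the commit message, but the compile-time size check and the byte loop stand in for the real builtin, which emits fixed 8/4/2/1-byte loads and stores for each supported size.

```c
#include <stddef.h>
#include <stdint.h>

/* Forward-only copy: safe for overlapping regions whenever dst <= src,
 * which ctx_adjust_room() guarantees a priori. */
static inline void __bpf_memmove_fwd(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t i;

	for (i = 0; i < len; i++)
		d[i] = s[i];
}

/* Reject unsupported sizes at compile time, in the spirit of the
 * "build bug" the commit message mentions (the bound here is assumed
 * for illustration). */
#define bpf_memmove_fwd(dst, src, len)					\
	do {								\
		_Static_assert((len) > 0 && (len) <= 64,		\
			       "unsupported memmove size");		\
		__bpf_memmove_fwd((dst), (src), (len));			\
	} while (0)
```

Because `len` must be a compile-time constant, an unsupported size fails the build rather than silently falling back to a slow generic path.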

Example code generation for DSR with offset used in IPv4:

  __section("test")
  int bpf_xdp_test(struct __ctx_buff *ctx)
  {
       ctx_adjust_room(ctx, 8, BPF_ADJ_ROOM_NET, 0);
       barrier_data(ctx);
       return 0;
  }

Before:

  # llvm-objdump --disassemble --section=test bpf_xdp.o

  bpf_xdp.o:	file format ELF64-BPF

  Disassembly of section test:

  0000000000000000 bpf_xdp_test:
       0:	bf 16 00 00 00 00 00 00	r6 = r1
       1:	18 02 00 00 f8 ff ff ff 00 00 00 00 00 00 00 00	r2 = 4294967288 ll
       3:	85 00 00 00 2c 00 00 00	call 44
       4:	67 00 00 00 20 00 00 00	r0 <<= 32
       5:	77 00 00 00 20 00 00 00	r0 >>= 32
       6:	55 00 49 00 00 00 00 00	if r0 != 0 goto +73 <LBB5_3>
       7:	61 62 04 00 00 00 00 00	r2 = *(u32 *)(r6 + 4)
       8:	61 61 00 00 00 00 00 00	r1 = *(u32 *)(r6 + 0)
       9:	bf 13 00 00 00 00 00 00	r3 = r1
      10:	07 03 00 00 2a 00 00 00	r3 += 42
      11:	2d 23 44 00 00 00 00 00	if r3 > r2 goto +68 <LBB5_3>
      12:	71 12 0f 00 00 00 00 00	r2 = *(u8 *)(r1 + 15)
      13:	73 21 07 00 00 00 00 00	*(u8 *)(r1 + 7) = r2
      14:	71 12 0e 00 00 00 00 00	r2 = *(u8 *)(r1 + 14)
      15:	73 21 06 00 00 00 00 00	*(u8 *)(r1 + 6) = r2
      16:	71 12 0d 00 00 00 00 00	r2 = *(u8 *)(r1 + 13)
      17:	73 21 05 00 00 00 00 00	*(u8 *)(r1 + 5) = r2
      18:	71 12 0c 00 00 00 00 00	r2 = *(u8 *)(r1 + 12)
      19:	73 21 04 00 00 00 00 00	*(u8 *)(r1 + 4) = r2
      20:	71 12 0b 00 00 00 00 00	r2 = *(u8 *)(r1 + 11)
      21:	73 21 03 00 00 00 00 00	*(u8 *)(r1 + 3) = r2
      22:	71 12 0a 00 00 00 00 00	r2 = *(u8 *)(r1 + 10)
      23:	73 21 02 00 00 00 00 00	*(u8 *)(r1 + 2) = r2
      24:	71 12 09 00 00 00 00 00	r2 = *(u8 *)(r1 + 9)
      25:	73 21 01 00 00 00 00 00	*(u8 *)(r1 + 1) = r2
      26:	71 12 08 00 00 00 00 00	r2 = *(u8 *)(r1 + 8)
      27:	73 21 00 00 00 00 00 00	*(u8 *)(r1 + 0) = r2
      28:	71 12 16 00 00 00 00 00	r2 = *(u8 *)(r1 + 22)
      29:	73 21 0e 00 00 00 00 00	*(u8 *)(r1 + 14) = r2
      30:	71 12 17 00 00 00 00 00	r2 = *(u8 *)(r1 + 23)
      31:	73 21 0f 00 00 00 00 00	*(u8 *)(r1 + 15) = r2
      32:	71 12 14 00 00 00 00 00	r2 = *(u8 *)(r1 + 20)
      33:	73 21 0c 00 00 00 00 00	*(u8 *)(r1 + 12) = r2
      34:	71 12 15 00 00 00 00 00	r2 = *(u8 *)(r1 + 21)
      35:	73 21 0d 00 00 00 00 00	*(u8 *)(r1 + 13) = r2
      36:	71 12 12 00 00 00 00 00	r2 = *(u8 *)(r1 + 18)
      37:	73 21 0a 00 00 00 00 00	*(u8 *)(r1 + 10) = r2
      38:	71 12 13 00 00 00 00 00	r2 = *(u8 *)(r1 + 19)
      39:	73 21 0b 00 00 00 00 00	*(u8 *)(r1 + 11) = r2
      40:	71 12 10 00 00 00 00 00	r2 = *(u8 *)(r1 + 16)
      41:	73 21 08 00 00 00 00 00	*(u8 *)(r1 + 8) = r2
      42:	71 12 11 00 00 00 00 00	r2 = *(u8 *)(r1 + 17)
      43:	73 21 09 00 00 00 00 00	*(u8 *)(r1 + 9) = r2
      44:	71 12 1e 00 00 00 00 00	r2 = *(u8 *)(r1 + 30)
      45:	73 21 16 00 00 00 00 00	*(u8 *)(r1 + 22) = r2
      46:	71 12 1f 00 00 00 00 00	r2 = *(u8 *)(r1 + 31)
      47:	73 21 17 00 00 00 00 00	*(u8 *)(r1 + 23) = r2
      48:	71 12 1c 00 00 00 00 00	r2 = *(u8 *)(r1 + 28)
      49:	73 21 14 00 00 00 00 00	*(u8 *)(r1 + 20) = r2
      50:	71 12 1d 00 00 00 00 00	r2 = *(u8 *)(r1 + 29)
      51:	73 21 15 00 00 00 00 00	*(u8 *)(r1 + 21) = r2
      52:	71 12 1a 00 00 00 00 00	r2 = *(u8 *)(r1 + 26)
      53:	73 21 12 00 00 00 00 00	*(u8 *)(r1 + 18) = r2
      54:	71 12 1b 00 00 00 00 00	r2 = *(u8 *)(r1 + 27)
      55:	73 21 13 00 00 00 00 00	*(u8 *)(r1 + 19) = r2
      56:	71 12 18 00 00 00 00 00	r2 = *(u8 *)(r1 + 24)
      57:	73 21 10 00 00 00 00 00	*(u8 *)(r1 + 16) = r2
      58:	71 12 19 00 00 00 00 00	r2 = *(u8 *)(r1 + 25)
      59:	73 21 11 00 00 00 00 00	*(u8 *)(r1 + 17) = r2
      60:	71 12 26 00 00 00 00 00	r2 = *(u8 *)(r1 + 38)
      61:	73 21 1e 00 00 00 00 00	*(u8 *)(r1 + 30) = r2
      62:	71 12 27 00 00 00 00 00	r2 = *(u8 *)(r1 + 39)
      63:	73 21 1f 00 00 00 00 00	*(u8 *)(r1 + 31) = r2
      64:	71 12 24 00 00 00 00 00	r2 = *(u8 *)(r1 + 36)
      65:	73 21 1c 00 00 00 00 00	*(u8 *)(r1 + 28) = r2
      66:	71 12 25 00 00 00 00 00	r2 = *(u8 *)(r1 + 37)
      67:	73 21 1d 00 00 00 00 00	*(u8 *)(r1 + 29) = r2
      68:	71 12 22 00 00 00 00 00	r2 = *(u8 *)(r1 + 34)
      69:	73 21 1a 00 00 00 00 00	*(u8 *)(r1 + 26) = r2
      70:	71 12 23 00 00 00 00 00	r2 = *(u8 *)(r1 + 35)
      71:	73 21 1b 00 00 00 00 00	*(u8 *)(r1 + 27) = r2
      72:	71 12 20 00 00 00 00 00	r2 = *(u8 *)(r1 + 32)
      73:	73 21 18 00 00 00 00 00	*(u8 *)(r1 + 24) = r2
      74:	71 12 21 00 00 00 00 00	r2 = *(u8 *)(r1 + 33)
      75:	73 21 19 00 00 00 00 00	*(u8 *)(r1 + 25) = r2
      76:	71 12 28 00 00 00 00 00	r2 = *(u8 *)(r1 + 40)
      77:	73 21 20 00 00 00 00 00	*(u8 *)(r1 + 32) = r2
      78:	71 12 29 00 00 00 00 00	r2 = *(u8 *)(r1 + 41)
      79:	73 21 21 00 00 00 00 00	*(u8 *)(r1 + 33) = r2

  0000000000000280 LBB5_3:
      80:	b7 00 00 00 00 00 00 00	r0 = 0
      81:	95 00 00 00 00 00 00 00	exit

After:

  # llvm-objdump --disassemble --section=test bpf_xdp.o

  bpf_xdp.o:	file format ELF64-BPF

  Disassembly of section test:

  0000000000000000 bpf_xdp_test:
       0:	bf 16 00 00 00 00 00 00	r6 = r1
       1:	18 02 00 00 f8 ff ff ff 00 00 00 00 00 00 00 00	r2 = 4294967288 ll
       3:	85 00 00 00 2c 00 00 00	call 44
       4:	67 00 00 00 20 00 00 00	r0 <<= 32
       5:	77 00 00 00 20 00 00 00	r0 >>= 32
       6:	55 00 0f 00 00 00 00 00	if r0 != 0 goto +15 <LBB5_3>
       7:	61 62 04 00 00 00 00 00	r2 = *(u32 *)(r6 + 4)
       8:	61 61 00 00 00 00 00 00	r1 = *(u32 *)(r6 + 0)
       9:	bf 13 00 00 00 00 00 00	r3 = r1
      10:	07 03 00 00 2a 00 00 00	r3 += 42
      11:	2d 23 0a 00 00 00 00 00	if r3 > r2 goto +10 <LBB5_3>
      12:	69 12 08 00 00 00 00 00	r2 = *(u16 *)(r1 + 8)
      13:	6b 21 00 00 00 00 00 00	*(u16 *)(r1 + 0) = r2
      14:	79 12 0a 00 00 00 00 00	r2 = *(u64 *)(r1 + 10)
      15:	79 13 12 00 00 00 00 00	r3 = *(u64 *)(r1 + 18)
      16:	7b 31 0a 00 00 00 00 00	*(u64 *)(r1 + 10) = r3
      17:	7b 21 02 00 00 00 00 00	*(u64 *)(r1 + 2) = r2
      18:	79 12 1a 00 00 00 00 00	r2 = *(u64 *)(r1 + 26)
      19:	7b 21 12 00 00 00 00 00	*(u64 *)(r1 + 18) = r2
      20:	79 12 22 00 00 00 00 00	r2 = *(u64 *)(r1 + 34)
      21:	7b 21 1a 00 00 00 00 00	*(u64 *)(r1 + 26) = r2

  00000000000000b0 LBB5_3:
      22:	b7 00 00 00 00 00 00 00	r0 = 0
      23:	95 00 00 00 00 00 00 00	exit

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Extend the builtin test suite and add __bpf_memmove() tests alongside the
existing __bpf_mem{set,cpy,cmp}() ones. The memmove is split into four
subtests: 1) the same (non-overlapping) memcpy test, just with memmove,
2) overlapping with dst < src, 3) overlapping with dst == src,
4) overlapping with dst > src. Also improve barrier_data() usage and
only use it where it makes sense.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
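The four overlap cases can be sketched as below. This is a hypothetical stand-in, not the actual test suite: the helper name `check_move` is invented, and the standard `memmove` plays the role of `__bpf_memmove()`. The check captures the source bytes first, performs the move, and verifies the destination matches.

```c
#include <string.h>

/* Returns 1 if moving len bytes from buf+src_off to buf+dst_off left the
 * destination equal to the original source contents, 0 otherwise. */
static int check_move(char *buf, int dst_off, int src_off, int len)
{
	char expect[64];

	memcpy(expect, buf + src_off, len);         /* snapshot the source */
	memmove(buf + dst_off, buf + src_off, len); /* operation under test */
	return memcmp(buf + dst_off, expect, len) == 0;
}
```

Driving it with `(dst_off, src_off)` pairs of non-overlapping, dst < src, dst == src, and dst > src exercises all four subtests from the commit message.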
@borkmann borkmann added pending-review sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. labels May 25, 2020
@borkmann borkmann requested review from brb and a team May 25, 2020 12:48
@borkmann borkmann requested a review from a team as a code owner May 25, 2020 12:48
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 May 25, 2020
@borkmann (Member Author)

test-me-please

@borkmann borkmann requested a review from pchaigno May 25, 2020 13:08
@coveralls

Coverage Status

Coverage increased (+0.03%) to 36.9% when pulling ac303a5 on pr/optimized-memmove into 7fb10af on master.

@@ -12,11 +12,24 @@
# define lock_xadd(P, V) ((void) __sync_fetch_and_add((P), (V)))
#endif

/* Unfortunately verifier forces aligned stack access while other memory
@jrfastab (Contributor) May 25, 2020
If ptr leaks are OK and the return type is not a pointer we could probably allow unaligned stack access, any idea if that would help performance? Or maybe being clever the ptr leaks could be avoided as well by checking the slot type.

@borkmann (Member Author) May 25, 2020

You mean for the memcpy case (not this PR)? The BPF stack requires alignment; if we tried to get rid of that on the kernel side, it might come at the cost of higher complexity. If we only kept spilled pointers aligned, it could work, agreed. Though we won't be able to get rid of __align_stack_8 for older kernels, so likely no change either way. Performance-wise it should be the same if LLVM had optimised code generation vs our builtin replacements here. Here, we don't have to be generic and can optimise a bit better wrt our code.

@jrfastab (Contributor)

Right, memcpy. Otherwise makes sense to me.

@borkmann (Member Author)

retest-net-next

@borkmann borkmann merged commit 52bb8f3 into master May 25, 2020
1.8.0 automation moved this from In progress to Merged May 25, 2020
@borkmann borkmann deleted the pr/optimized-memmove branch May 25, 2020 19:55