bpf: optimized memmove for XDP + DSR #11676
Conversation
Add an implementation for small sizes and throw a build-bug for unsupported ones. This is used in XDP's DSR implementation, see ctx_adjust_room(). There, we also know a priori that dst <= src always holds, so __bpf_memmove_fwd() is used directly.

Example code generation for DSR with the offset used in IPv4:

```c
__section("test")
int bpf_xdp_test(struct __ctx_buff *ctx)
{
	ctx_adjust_room(ctx, 8, BPF_ADJ_ROOM_NET, 0);
	barrier_data(ctx);
	return 0;
}
```

Before:

```
# llvm-objdump --disassemble --section=test bpf_xdp.o

bpf_xdp.o:	file format ELF64-BPF

Disassembly of section test:
0000000000000000 bpf_xdp_test:
       0:	bf 16 00 00 00 00 00 00	r6 = r1
       1:	18 02 00 00 f8 ff ff ff 00 00 00 00 00 00 00 00	r2 = 4294967288 ll
       3:	85 00 00 00 2c 00 00 00	call 44
       4:	67 00 00 00 20 00 00 00	r0 <<= 32
       5:	77 00 00 00 20 00 00 00	r0 >>= 32
       6:	55 00 49 00 00 00 00 00	if r0 != 0 goto +73 <LBB5_3>
       7:	61 62 04 00 00 00 00 00	r2 = *(u32 *)(r6 + 4)
       8:	61 61 00 00 00 00 00 00	r1 = *(u32 *)(r6 + 0)
       9:	bf 13 00 00 00 00 00 00	r3 = r1
      10:	07 03 00 00 2a 00 00 00	r3 += 42
      11:	2d 23 44 00 00 00 00 00	if r3 > r2 goto +68 <LBB5_3>
      12:	71 12 0f 00 00 00 00 00	r2 = *(u8 *)(r1 + 15)
      13:	73 21 07 00 00 00 00 00	*(u8 *)(r1 + 7) = r2
      14:	71 12 0e 00 00 00 00 00	r2 = *(u8 *)(r1 + 14)
      15:	73 21 06 00 00 00 00 00	*(u8 *)(r1 + 6) = r2
      16:	71 12 0d 00 00 00 00 00	r2 = *(u8 *)(r1 + 13)
      17:	73 21 05 00 00 00 00 00	*(u8 *)(r1 + 5) = r2
      18:	71 12 0c 00 00 00 00 00	r2 = *(u8 *)(r1 + 12)
      19:	73 21 04 00 00 00 00 00	*(u8 *)(r1 + 4) = r2
      20:	71 12 0b 00 00 00 00 00	r2 = *(u8 *)(r1 + 11)
      21:	73 21 03 00 00 00 00 00	*(u8 *)(r1 + 3) = r2
      22:	71 12 0a 00 00 00 00 00	r2 = *(u8 *)(r1 + 10)
      23:	73 21 02 00 00 00 00 00	*(u8 *)(r1 + 2) = r2
      24:	71 12 09 00 00 00 00 00	r2 = *(u8 *)(r1 + 9)
      25:	73 21 01 00 00 00 00 00	*(u8 *)(r1 + 1) = r2
      26:	71 12 08 00 00 00 00 00	r2 = *(u8 *)(r1 + 8)
      27:	73 21 00 00 00 00 00 00	*(u8 *)(r1 + 0) = r2
      28:	71 12 16 00 00 00 00 00	r2 = *(u8 *)(r1 + 22)
      29:	73 21 0e 00 00 00 00 00	*(u8 *)(r1 + 14) = r2
      30:	71 12 17 00 00 00 00 00	r2 = *(u8 *)(r1 + 23)
      31:	73 21 0f 00 00 00 00 00	*(u8 *)(r1 + 15) = r2
      32:	71 12 14 00 00 00 00 00	r2 = *(u8 *)(r1 + 20)
      33:	73 21 0c 00 00 00 00 00	*(u8 *)(r1 + 12) = r2
      34:	71 12 15 00 00 00 00 00	r2 = *(u8 *)(r1 + 21)
      35:	73 21 0d 00 00 00 00 00	*(u8 *)(r1 + 13) = r2
      36:	71 12 12 00 00 00 00 00	r2 = *(u8 *)(r1 + 18)
      37:	73 21 0a 00 00 00 00 00	*(u8 *)(r1 + 10) = r2
      38:	71 12 13 00 00 00 00 00	r2 = *(u8 *)(r1 + 19)
      39:	73 21 0b 00 00 00 00 00	*(u8 *)(r1 + 11) = r2
      40:	71 12 10 00 00 00 00 00	r2 = *(u8 *)(r1 + 16)
      41:	73 21 08 00 00 00 00 00	*(u8 *)(r1 + 8) = r2
      42:	71 12 11 00 00 00 00 00	r2 = *(u8 *)(r1 + 17)
      43:	73 21 09 00 00 00 00 00	*(u8 *)(r1 + 9) = r2
      44:	71 12 1e 00 00 00 00 00	r2 = *(u8 *)(r1 + 30)
      45:	73 21 16 00 00 00 00 00	*(u8 *)(r1 + 22) = r2
      46:	71 12 1f 00 00 00 00 00	r2 = *(u8 *)(r1 + 31)
      47:	73 21 17 00 00 00 00 00	*(u8 *)(r1 + 23) = r2
      48:	71 12 1c 00 00 00 00 00	r2 = *(u8 *)(r1 + 28)
      49:	73 21 14 00 00 00 00 00	*(u8 *)(r1 + 20) = r2
      50:	71 12 1d 00 00 00 00 00	r2 = *(u8 *)(r1 + 29)
      51:	73 21 15 00 00 00 00 00	*(u8 *)(r1 + 21) = r2
      52:	71 12 1a 00 00 00 00 00	r2 = *(u8 *)(r1 + 26)
      53:	73 21 12 00 00 00 00 00	*(u8 *)(r1 + 18) = r2
      54:	71 12 1b 00 00 00 00 00	r2 = *(u8 *)(r1 + 27)
      55:	73 21 13 00 00 00 00 00	*(u8 *)(r1 + 19) = r2
      56:	71 12 18 00 00 00 00 00	r2 = *(u8 *)(r1 + 24)
      57:	73 21 10 00 00 00 00 00	*(u8 *)(r1 + 16) = r2
      58:	71 12 19 00 00 00 00 00	r2 = *(u8 *)(r1 + 25)
      59:	73 21 11 00 00 00 00 00	*(u8 *)(r1 + 17) = r2
      60:	71 12 26 00 00 00 00 00	r2 = *(u8 *)(r1 + 38)
      61:	73 21 1e 00 00 00 00 00	*(u8 *)(r1 + 30) = r2
      62:	71 12 27 00 00 00 00 00	r2 = *(u8 *)(r1 + 39)
      63:	73 21 1f 00 00 00 00 00	*(u8 *)(r1 + 31) = r2
      64:	71 12 24 00 00 00 00 00	r2 = *(u8 *)(r1 + 36)
      65:	73 21 1c 00 00 00 00 00	*(u8 *)(r1 + 28) = r2
      66:	71 12 25 00 00 00 00 00	r2 = *(u8 *)(r1 + 37)
      67:	73 21 1d 00 00 00 00 00	*(u8 *)(r1 + 29) = r2
      68:	71 12 22 00 00 00 00 00	r2 = *(u8 *)(r1 + 34)
      69:	73 21 1a 00 00 00 00 00	*(u8 *)(r1 + 26) = r2
      70:	71 12 23 00 00 00 00 00	r2 = *(u8 *)(r1 + 35)
      71:	73 21 1b 00 00 00 00 00	*(u8 *)(r1 + 27) = r2
      72:	71 12 20 00 00 00 00 00	r2 = *(u8 *)(r1 + 32)
      73:	73 21 18 00 00 00 00 00	*(u8 *)(r1 + 24) = r2
      74:	71 12 21 00 00 00 00 00	r2 = *(u8 *)(r1 + 33)
      75:	73 21 19 00 00 00 00 00	*(u8 *)(r1 + 25) = r2
      76:	71 12 28 00 00 00 00 00	r2 = *(u8 *)(r1 + 40)
      77:	73 21 20 00 00 00 00 00	*(u8 *)(r1 + 32) = r2
      78:	71 12 29 00 00 00 00 00	r2 = *(u8 *)(r1 + 41)
      79:	73 21 21 00 00 00 00 00	*(u8 *)(r1 + 33) = r2

0000000000000280 LBB5_3:
      80:	b7 00 00 00 00 00 00 00	r0 = 0
      81:	95 00 00 00 00 00 00 00	exit
```

After:

```
# llvm-objdump --disassemble --section=test bpf_xdp.o

bpf_xdp.o:	file format ELF64-BPF

Disassembly of section test:
0000000000000000 bpf_xdp_test:
       0:	bf 16 00 00 00 00 00 00	r6 = r1
       1:	18 02 00 00 f8 ff ff ff 00 00 00 00 00 00 00 00	r2 = 4294967288 ll
       3:	85 00 00 00 2c 00 00 00	call 44
       4:	67 00 00 00 20 00 00 00	r0 <<= 32
       5:	77 00 00 00 20 00 00 00	r0 >>= 32
       6:	55 00 0f 00 00 00 00 00	if r0 != 0 goto +15 <LBB5_3>
       7:	61 62 04 00 00 00 00 00	r2 = *(u32 *)(r6 + 4)
       8:	61 61 00 00 00 00 00 00	r1 = *(u32 *)(r6 + 0)
       9:	bf 13 00 00 00 00 00 00	r3 = r1
      10:	07 03 00 00 2a 00 00 00	r3 += 42
      11:	2d 23 0a 00 00 00 00 00	if r3 > r2 goto +10 <LBB5_3>
      12:	69 12 08 00 00 00 00 00	r2 = *(u16 *)(r1 + 8)
      13:	6b 21 00 00 00 00 00 00	*(u16 *)(r1 + 0) = r2
      14:	79 12 0a 00 00 00 00 00	r2 = *(u64 *)(r1 + 10)
      15:	79 13 12 00 00 00 00 00	r3 = *(u64 *)(r1 + 18)
      16:	7b 31 0a 00 00 00 00 00	*(u64 *)(r1 + 10) = r3
      17:	7b 21 02 00 00 00 00 00	*(u64 *)(r1 + 2) = r2
      18:	79 12 1a 00 00 00 00 00	r2 = *(u64 *)(r1 + 26)
      19:	7b 21 12 00 00 00 00 00	*(u64 *)(r1 + 18) = r2
      20:	79 12 22 00 00 00 00 00	r2 = *(u64 *)(r1 + 34)
      21:	7b 21 1a 00 00 00 00 00	*(u64 *)(r1 + 26) = r2

00000000000000b0 LBB5_3:
      22:	b7 00 00 00 00 00 00 00	r0 = 0
      23:	95 00 00 00 00 00 00 00	exit
```

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Extend the builtin test suite and add __bpf_memmove() tests alongside the existing __bpf_mem{set,cpy,cmp}() ones. The memmove test is split into four subtests: 1) the same (non-overlapping) test as for memcpy, just with memmove, 2) overlapping with dst < src, 3) overlapping with dst == src, 4) overlapping with dst > src. Also improve barrier_data() usage and only use it where it makes sense. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
test-me-please
@@ -12,11 +12,24 @@
 # define lock_xadd(P, V) ((void) __sync_fetch_and_add((P), (V)))
 #endif

/* Unfortunately verifier forces aligned stack access while other memory |
If pointer leaks are OK and the return type is not a pointer, we could probably allow unaligned stack access. Any idea whether that would help performance? Or, being clever, the pointer leaks could perhaps be avoided as well by checking the slot type.
You mean for the memcpy case (not this PR)? The BPF stack requires alignment; I think if we tried to get rid of that from the kernel side, it might come at the cost of higher complexity. If we only kept spilled pointers aligned, it could work, agreed. Though we won't be able to get rid of __align_stack_8 for older kernels, so likely no change either way. Performance-wise it should be the same if LLVM had optimised code generation versus our builtin replacements here; with the builtins we don't have to be generic and can optimise a bit better for our code.
Right, memcpy. Otherwise makes sense to me.
retest-net-next
See commit msgs.