
bpf: optimized memmove for XDP + DSR #11676

Merged
merged 2 commits into from May 25, 2020
Conversation

borkmann (Member)

See commit msgs.

Add an implementation for small sizes and throw a build bug for unsupported
ones. This is used in XDP's DSR implementation, see ctx_adjust_room(). There,
we also know a priori that dst <= src always holds, so __bpf_memmove_fwd()
is used directly.
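The idea can be sketched in plain C as follows. This is a hypothetical illustration, not the PR's actual code: the macro and helper names mirror the commit message, but the compile-time size check and the byte loop stand in for the real builtin, which emits fixed 8/4/2/1-byte loads and stores for each supported size.

```c
#include <stddef.h>
#include <stdint.h>

/* Forward-only copy: safe for overlapping regions whenever dst <= src,
 * which ctx_adjust_room() guarantees a priori. */
static inline void __bpf_memmove_fwd(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t i;

	for (i = 0; i < len; i++)
		d[i] = s[i];
}

/* Reject unsupported sizes at compile time, in the spirit of the
 * "build bug" the commit message mentions (the bound here is assumed
 * for illustration). */
#define bpf_memmove_fwd(dst, src, len)					\
	do {								\
		_Static_assert((len) > 0 && (len) <= 64,		\
			       "unsupported memmove size");		\
		__bpf_memmove_fwd((dst), (src), (len));			\
	} while (0)
```

Because `len` must be a compile-time constant, an unsupported size fails the build rather than silently falling back to a slow generic path.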

Example code generation for DSR with offset used in IPv4:

  __section("test")
  int bpf_xdp_test(struct __ctx_buff *ctx)
  {
       ctx_adjust_room(ctx, 8, BPF_ADJ_ROOM_NET, 0);
       barrier_data(ctx);
       return 0;
  }

Before:

  # llvm-objdump --disassemble --section=test bpf_xdp.o

  bpf_xdp.o:	file format ELF64-BPF

  Disassembly of section test:

  0000000000000000 bpf_xdp_test:
       0:	bf 16 00 00 00 00 00 00	r6 = r1
       1:	18 02 00 00 f8 ff ff ff 00 00 00 00 00 00 00 00	r2 = 4294967288 ll
       3:	85 00 00 00 2c 00 00 00	call 44
       4:	67 00 00 00 20 00 00 00	r0 <<= 32
       5:	77 00 00 00 20 00 00 00	r0 >>= 32
       6:	55 00 49 00 00 00 00 00	if r0 != 0 goto +73 <LBB5_3>
       7:	61 62 04 00 00 00 00 00	r2 = *(u32 *)(r6 + 4)
       8:	61 61 00 00 00 00 00 00	r1 = *(u32 *)(r6 + 0)
       9:	bf 13 00 00 00 00 00 00	r3 = r1
      10:	07 03 00 00 2a 00 00 00	r3 += 42
      11:	2d 23 44 00 00 00 00 00	if r3 > r2 goto +68 <LBB5_3>
      12:	71 12 0f 00 00 00 00 00	r2 = *(u8 *)(r1 + 15)
      13:	73 21 07 00 00 00 00 00	*(u8 *)(r1 + 7) = r2
      14:	71 12 0e 00 00 00 00 00	r2 = *(u8 *)(r1 + 14)
      15:	73 21 06 00 00 00 00 00	*(u8 *)(r1 + 6) = r2
      16:	71 12 0d 00 00 00 00 00	r2 = *(u8 *)(r1 + 13)
      17:	73 21 05 00 00 00 00 00	*(u8 *)(r1 + 5) = r2
      18:	71 12 0c 00 00 00 00 00	r2 = *(u8 *)(r1 + 12)
      19:	73 21 04 00 00 00 00 00	*(u8 *)(r1 + 4) = r2
      20:	71 12 0b 00 00 00 00 00	r2 = *(u8 *)(r1 + 11)
      21:	73 21 03 00 00 00 00 00	*(u8 *)(r1 + 3) = r2
      22:	71 12 0a 00 00 00 00 00	r2 = *(u8 *)(r1 + 10)
      23:	73 21 02 00 00 00 00 00	*(u8 *)(r1 + 2) = r2
      24:	71 12 09 00 00 00 00 00	r2 = *(u8 *)(r1 + 9)
      25:	73 21 01 00 00 00 00 00	*(u8 *)(r1 + 1) = r2
      26:	71 12 08 00 00 00 00 00	r2 = *(u8 *)(r1 + 8)
      27:	73 21 00 00 00 00 00 00	*(u8 *)(r1 + 0) = r2
      28:	71 12 16 00 00 00 00 00	r2 = *(u8 *)(r1 + 22)
      29:	73 21 0e 00 00 00 00 00	*(u8 *)(r1 + 14) = r2
      30:	71 12 17 00 00 00 00 00	r2 = *(u8 *)(r1 + 23)
      31:	73 21 0f 00 00 00 00 00	*(u8 *)(r1 + 15) = r2
      32:	71 12 14 00 00 00 00 00	r2 = *(u8 *)(r1 + 20)
      33:	73 21 0c 00 00 00 00 00	*(u8 *)(r1 + 12) = r2
      34:	71 12 15 00 00 00 00 00	r2 = *(u8 *)(r1 + 21)
      35:	73 21 0d 00 00 00 00 00	*(u8 *)(r1 + 13) = r2
      36:	71 12 12 00 00 00 00 00	r2 = *(u8 *)(r1 + 18)
      37:	73 21 0a 00 00 00 00 00	*(u8 *)(r1 + 10) = r2
      38:	71 12 13 00 00 00 00 00	r2 = *(u8 *)(r1 + 19)
      39:	73 21 0b 00 00 00 00 00	*(u8 *)(r1 + 11) = r2
      40:	71 12 10 00 00 00 00 00	r2 = *(u8 *)(r1 + 16)
      41:	73 21 08 00 00 00 00 00	*(u8 *)(r1 + 8) = r2
      42:	71 12 11 00 00 00 00 00	r2 = *(u8 *)(r1 + 17)
      43:	73 21 09 00 00 00 00 00	*(u8 *)(r1 + 9) = r2
      44:	71 12 1e 00 00 00 00 00	r2 = *(u8 *)(r1 + 30)
      45:	73 21 16 00 00 00 00 00	*(u8 *)(r1 + 22) = r2
      46:	71 12 1f 00 00 00 00 00	r2 = *(u8 *)(r1 + 31)
      47:	73 21 17 00 00 00 00 00	*(u8 *)(r1 + 23) = r2
      48:	71 12 1c 00 00 00 00 00	r2 = *(u8 *)(r1 + 28)
      49:	73 21 14 00 00 00 00 00	*(u8 *)(r1 + 20) = r2
      50:	71 12 1d 00 00 00 00 00	r2 = *(u8 *)(r1 + 29)
      51:	73 21 15 00 00 00 00 00	*(u8 *)(r1 + 21) = r2
      52:	71 12 1a 00 00 00 00 00	r2 = *(u8 *)(r1 + 26)
      53:	73 21 12 00 00 00 00 00	*(u8 *)(r1 + 18) = r2
      54:	71 12 1b 00 00 00 00 00	r2 = *(u8 *)(r1 + 27)
      55:	73 21 13 00 00 00 00 00	*(u8 *)(r1 + 19) = r2
      56:	71 12 18 00 00 00 00 00	r2 = *(u8 *)(r1 + 24)
      57:	73 21 10 00 00 00 00 00	*(u8 *)(r1 + 16) = r2
      58:	71 12 19 00 00 00 00 00	r2 = *(u8 *)(r1 + 25)
      59:	73 21 11 00 00 00 00 00	*(u8 *)(r1 + 17) = r2
      60:	71 12 26 00 00 00 00 00	r2 = *(u8 *)(r1 + 38)
      61:	73 21 1e 00 00 00 00 00	*(u8 *)(r1 + 30) = r2
      62:	71 12 27 00 00 00 00 00	r2 = *(u8 *)(r1 + 39)
      63:	73 21 1f 00 00 00 00 00	*(u8 *)(r1 + 31) = r2
      64:	71 12 24 00 00 00 00 00	r2 = *(u8 *)(r1 + 36)
      65:	73 21 1c 00 00 00 00 00	*(u8 *)(r1 + 28) = r2
      66:	71 12 25 00 00 00 00 00	r2 = *(u8 *)(r1 + 37)
      67:	73 21 1d 00 00 00 00 00	*(u8 *)(r1 + 29) = r2
      68:	71 12 22 00 00 00 00 00	r2 = *(u8 *)(r1 + 34)
      69:	73 21 1a 00 00 00 00 00	*(u8 *)(r1 + 26) = r2
      70:	71 12 23 00 00 00 00 00	r2 = *(u8 *)(r1 + 35)
      71:	73 21 1b 00 00 00 00 00	*(u8 *)(r1 + 27) = r2
      72:	71 12 20 00 00 00 00 00	r2 = *(u8 *)(r1 + 32)
      73:	73 21 18 00 00 00 00 00	*(u8 *)(r1 + 24) = r2
      74:	71 12 21 00 00 00 00 00	r2 = *(u8 *)(r1 + 33)
      75:	73 21 19 00 00 00 00 00	*(u8 *)(r1 + 25) = r2
      76:	71 12 28 00 00 00 00 00	r2 = *(u8 *)(r1 + 40)
      77:	73 21 20 00 00 00 00 00	*(u8 *)(r1 + 32) = r2
      78:	71 12 29 00 00 00 00 00	r2 = *(u8 *)(r1 + 41)
      79:	73 21 21 00 00 00 00 00	*(u8 *)(r1 + 33) = r2

  0000000000000280 LBB5_3:
      80:	b7 00 00 00 00 00 00 00	r0 = 0
      81:	95 00 00 00 00 00 00 00	exit

After:

  # llvm-objdump --disassemble --section=test bpf_xdp.o

  bpf_xdp.o:	file format ELF64-BPF

  Disassembly of section test:

  0000000000000000 bpf_xdp_test:
       0:	bf 16 00 00 00 00 00 00	r6 = r1
       1:	18 02 00 00 f8 ff ff ff 00 00 00 00 00 00 00 00	r2 = 4294967288 ll
       3:	85 00 00 00 2c 00 00 00	call 44
       4:	67 00 00 00 20 00 00 00	r0 <<= 32
       5:	77 00 00 00 20 00 00 00	r0 >>= 32
       6:	55 00 0f 00 00 00 00 00	if r0 != 0 goto +15 <LBB5_3>
       7:	61 62 04 00 00 00 00 00	r2 = *(u32 *)(r6 + 4)
       8:	61 61 00 00 00 00 00 00	r1 = *(u32 *)(r6 + 0)
       9:	bf 13 00 00 00 00 00 00	r3 = r1
      10:	07 03 00 00 2a 00 00 00	r3 += 42
      11:	2d 23 0a 00 00 00 00 00	if r3 > r2 goto +10 <LBB5_3>
      12:	69 12 08 00 00 00 00 00	r2 = *(u16 *)(r1 + 8)
      13:	6b 21 00 00 00 00 00 00	*(u16 *)(r1 + 0) = r2
      14:	79 12 0a 00 00 00 00 00	r2 = *(u64 *)(r1 + 10)
      15:	79 13 12 00 00 00 00 00	r3 = *(u64 *)(r1 + 18)
      16:	7b 31 0a 00 00 00 00 00	*(u64 *)(r1 + 10) = r3
      17:	7b 21 02 00 00 00 00 00	*(u64 *)(r1 + 2) = r2
      18:	79 12 1a 00 00 00 00 00	r2 = *(u64 *)(r1 + 26)
      19:	7b 21 12 00 00 00 00 00	*(u64 *)(r1 + 18) = r2
      20:	79 12 22 00 00 00 00 00	r2 = *(u64 *)(r1 + 34)
      21:	7b 21 1a 00 00 00 00 00	*(u64 *)(r1 + 26) = r2

  00000000000000b0 LBB5_3:
      22:	b7 00 00 00 00 00 00 00	r0 = 0
      23:	95 00 00 00 00 00 00 00	exit

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Extend the builtin test suite and add __bpf_memmove() tests alongside the
existing __bpf_mem{set,cpy,cmp}() ones. The memmove is split into four
subtests: 1) the same (non-overlapping) memcpy test, just with memmove,
2) overlapping with dst < src, 3) overlapping with dst == src,
4) overlapping with dst > src. Also improve barrier_data() usage and
only use it where it makes sense.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
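The four overlap cases can be sketched as below. This is a hypothetical stand-in, not the actual test suite: the helper name `check_move` is invented, and the standard `memmove` plays the role of `__bpf_memmove()`. The check captures the source bytes first, performs the move, and verifies the destination matches.

```c
#include <string.h>

/* Returns 1 if moving len bytes from buf+src_off to buf+dst_off left the
 * destination equal to the original source contents, 0 otherwise. */
static int check_move(char *buf, int dst_off, int src_off, int len)
{
	char expect[64];

	memcpy(expect, buf + src_off, len);         /* snapshot the source */
	memmove(buf + dst_off, buf + src_off, len); /* operation under test */
	return memcmp(buf + dst_off, expect, len) == 0;
}
```

Driving it with `(dst_off, src_off)` pairs of non-overlapping, dst < src, dst == src, and dst > src exercises all four subtests from the commit message.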
@borkmann borkmann added pending-review sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. labels May 25, 2020
@borkmann borkmann requested review from brb and a team May 25, 2020 12:48
@borkmann borkmann requested a review from a team as a code owner May 25, 2020 12:48
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 May 25, 2020
@borkmann (Member Author)

test-me-please

@borkmann borkmann requested a review from pchaigno May 25, 2020 13:08
@coveralls

Coverage Status

Coverage increased (+0.03%) to 36.9% when pulling ac303a5 on pr/optimized-memmove into 7fb10af on master.

@@ -12,11 +12,24 @@
# define lock_xadd(P, V) ((void) __sync_fetch_and_add((P), (V)))
#endif

/* Unfortunately verifier forces aligned stack access while other memory
@jrfastab (Contributor) May 25, 2020
If ptr leaks are OK and the return type is not a pointer we could probably allow unaligned stack access, any idea if that would help performance? Or maybe being clever the ptr leaks could be avoided as well by checking the slot type.

@borkmann (Member Author) May 25, 2020

You mean for the memcpy case (not this PR)? The BPF stack requires alignment; if we tried to get rid of that on the kernel side, it might come at the cost of higher complexity. If we only kept spilled pointers aligned, it could work, agreed. Though we won't be able to get rid of __align_stack_8 for older kernels, so likely no change either way. Performance-wise it should be the same if LLVM had optimised code generation vs our builtin replacements here. Here, we don't have to be generic and can optimise a bit better wrt our code.

@jrfastab (Contributor)

Right, memcpy. Otherwise makes sense to me.

@borkmann (Member Author)

retest-net-next

@borkmann borkmann merged commit 52bb8f3 into master May 25, 2020
1.8.0 automation moved this from In progress to Merged May 25, 2020
@borkmann borkmann deleted the pr/optimized-memmove branch May 25, 2020 19:55