
Increment operator causes unnecessary data races by always calling bpf_map_update_elem #3175

Closed
netoptimizer opened this issue May 16, 2024 · 4 comments
Labels
bug Something isn't working

Comments


netoptimizer commented May 16, 2024

What reproduces the bug? Provide code if possible.

The simple increment operator @++ causes unnecessary data races by always calling bpf_map_update_elem.

This simple bpftrace oneliner: sudo bpftrace -e 'kprobe:cgroup_rstat_flush_locked {@flush_calls++}'

Produces the following eBPF bytecode:

int64 kprobe_cgroup_rstat_flush_locked_1(int8 * ctx):
   0: (b7) r1 = 0
   1: (7b) *(u64 *)(r10 -16) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -16
   4: (18) r1 = map[id:1104351]
   6: (85) call __htab_map_lookup_elem#231648
   7: (15) if r0 == 0x0 goto pc+1
   8: (07) r0 += 56
   9: (b7) r1 = 1
  10: (15) if r0 == 0x0 goto pc+2
  11: (79) r1 = *(u64 *)(r0 +0)
  12: (07) r1 += 1
  13: (7b) *(u64 *)(r10 -8) = r1
  14: (bf) r2 = r10
  15: (07) r2 += -16
  16: (bf) r3 = r10
  17: (07) r3 += -8
  18: (18) r1 = map[id:1104351]
  20: (b7) r4 = 0
  21: (85) call htab_map_update_elem#242720
  22: (b7) r0 = 0
  23: (95) exit

Looking closely at the eBPF bytecode instructions, we observe that BPF-helper map_update_elem is always invoked.

The code in line 11 gets a pointer to the map value (in r0) and increments the value, which is great: we are basically done and could exit, but instead the code continues and calls htab_map_update_elem.

  • This creates an unnecessary data race, which I was hit by in production

I looked at the code generated for @=count(), which does the right thing and skips the call to map_update_elem.
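To make the race concrete, here is a minimal BPF C sketch (not bpftrace's actual generated code; the map, section, and function names are made up for this example) of the same lookup/modify/update_elem pattern. Two CPUs running the probe concurrently can both load the same old counter value, both add 1, and both write their stale copy back, so one increment is lost; an in-place update through the looked-up pointer would avoid the extra helper call entirely.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, __u64);
    __type(value, __u64);
} flush_calls SEC(".maps");

SEC("kprobe/cgroup_rstat_flush_locked")
int count_flush(void *ctx)
{
    __u64 key = 0, init = 1;
    __u64 *val = bpf_map_lookup_elem(&flush_calls, &key);

    if (!val) {
        /* First hit: create the element with value 1. */
        bpf_map_update_elem(&flush_calls, &key, &init, BPF_NOEXIST);
        return 0;
    }

    /* Racy pattern, equivalent to the bytecode above: load the value,
     * add 1 in a register, then write the (possibly stale) copy back. */
    __u64 tmp = *val + 1;
    bpf_map_update_elem(&flush_calls, &key, &tmp, BPF_ANY);

    /* In-place alternative: we already hold a pointer into the map,
     * so storing through it would be enough:
     *   *val += 1;
     */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";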

bpftrace --info output

$ sudo bpftrace --info
System
OS: Linux 6.6.30-cloudflare-2024.5.4 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010
Arch: x86_64

Build
version: v0.20.0-149-g742f
LLVM: 18.1.5
unsafe probe: no
bfd: yes
liblldb (DWARF support): yes
libsystemd (systemd notify support): no

Kernel helpers
probe_read: yes
probe_read_str: yes
probe_read_user: yes
probe_read_user_str: yes
probe_read_kernel: yes
probe_read_kernel_str: yes
get_current_cgroup_id: yes
send_signal: yes
override_return: yes
get_boot_ns: yes
dpath: yes
skboutput: yes
get_tai_ns: yes
get_func_ip: yes
jiffies64: yes
for_each_map_elem: yes

Kernel features
Instruction limit: 1000000
Loop support: yes
btf: yes
module btf: yes
map batch: yes
uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes

Map types
hash: yes
percpu hash: yes
array: yes
percpu array: yes
stack_trace: yes
perf_event_array: yes
ringbuf: yes

Probe types
kprobe: yes
tracepoint: yes
perf_event: yes
kfunc: yes
kprobe_multi: no
uprobe_multi: yes
raw_tp_special: yes
iter: yes

netoptimizer added the bug label on May 16, 2024

netoptimizer commented May 16, 2024

Poke @jordalgo as it looks related to the improvement in PR #2795.
And @viktormalik

@jordalgo
Contributor

@netoptimizer Good catch. Let me take a look.

jordalgo pushed a commit to jordalgo/bpftrace that referenced this issue May 17, 2024
This will save a call to bpf_map_update_elem

Issue: bpftrace#3175
netoptimizer added a commit to xdp-project/xdp-project that referenced this issue May 17, 2024
…omic

Fix comment about increment operator being atomic.

It is both slow and contains data races as described here:
 - bpftrace/bpftrace#3175

Maybe this will soon get fixed via:
 - bpftrace/bpftrace#3179
 - bpftrace#3179

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

jordalgo commented Jun 7, 2024

Just wanted to follow up a bit after some internal discussion.

As per the docs, ++ uses a hash map (shared across CPUs), whereas count uses a per-cpu hash map, which allows for safe direct pointer updating. If we're updating the pointer directly in ++ AND calling map_update_elem, that's a bug (as you pointed out, @netoptimizer). We should decide on one or the other.
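For illustration (reusing the headers from the sketch in the issue description; the names are made up and do not match bpftrace's real map layout), this is roughly why the per-cpu map used by count() makes direct pointer updates safe: each CPU gets its own copy of the value, so an increment through the looked-up pointer cannot race with other CPUs, and userspace sums the per-cpu copies when it reads the map.

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1);
    __type(key, __u64);
    __type(value, __u64);
} flush_calls_percpu SEC(".maps");

SEC("kprobe/cgroup_rstat_flush_locked")
int count_flush_percpu(void *ctx)
{
    __u64 key = 0, init = 0;
    __u64 *val = bpf_map_lookup_elem(&flush_calls_percpu, &key);

    if (!val) {
        /* Create this CPU's element, then look the pointer up again. */
        bpf_map_update_elem(&flush_calls_percpu, &key, &init, BPF_NOEXIST);
        val = bpf_map_lookup_elem(&flush_calls_percpu, &key);
        if (!val)
            return 0;
    }

    *val += 1;   /* this CPU's private copy; no cross-CPU race */
    return 0;
}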

I'm also thinking that maybe we should have a config setting for atomic/safer writes. This would also cover per-cpu map updating, as there can still be races if a bpf prog thread is preempted by another (via NMI) that updates the same map. Most of the time the overhead of accounting for this edge case is probably not worth it, but it might be worth adding as an option for some users.
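A sketch of what such an "atomic/safer writes" mode could look like on the shared hash map (illustrative only, reusing the flush_calls map from the earlier sketch, not an actual bpftrace implementation): once the value pointer is in hand, __sync_fetch_and_add compiles to a BPF atomic add, which avoids both the second map_update_elem call and the read-modify-write race.

SEC("kprobe/cgroup_rstat_flush_locked")
int count_flush_atomic(void *ctx)
{
    __u64 key = 0, init = 0;
    __u64 *val = bpf_map_lookup_elem(&flush_calls, &key);

    if (!val) {
        /* Create the element, then fetch a pointer to it. */
        bpf_map_update_elem(&flush_calls, &key, &init, BPF_NOEXIST);
        val = bpf_map_lookup_elem(&flush_calls, &key);
        if (!val)
            return 0;
    }

    __sync_fetch_and_add(val, 1);   /* atomic in-place increment */
    return 0;
}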

@jordalgo
Contributor

Dug into this a little more, and AFAICT we're NOT double updating. E.g. if I comment out this line, the value doesn't get updated at all for a map increment (@++):
https://github.com/bpftrace/bpftrace/blob/master/src/ast/passes/codegen_llvm.cpp#L3931

The LLVM IR also shows that we're loading the map pointer value into a register and adding that to another pointer instead of updating directly. Maybe there's some additional optimization happening where that indirect updating is being removed, but I'm having trouble reproducing it.

Going to close this task for now, but feel free to re-open if you think I missed something. That being said, I am going to look into an atomic increment for ++ instead of a map_update_elem call.
