
Increment operator causes unnecessary data races by always calling bpf_map_update_elem #3175

Closed
netoptimizer opened this issue May 16, 2024 · 4 comments
Labels
bug Something isn't working

Comments


netoptimizer commented May 16, 2024

What reproduces the bug? Provide code if possible.

The simple increment operator @++ causes unnecessary data races by always calling bpf_map_update_elem.

This simple bpftrace oneliner: sudo bpftrace -e 'kprobe:cgroup_rstat_flush_locked {@flush_calls++}'

Produces the following eBPF bytecode:

int64 kprobe_cgroup_rstat_flush_locked_1(int8 * ctx):
   0: (b7) r1 = 0
   1: (7b) *(u64 *)(r10 -16) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -16
   4: (18) r1 = map[id:1104351]
   6: (85) call __htab_map_lookup_elem#231648
   7: (15) if r0 == 0x0 goto pc+1
   8: (07) r0 += 56
   9: (b7) r1 = 1
  10: (15) if r0 == 0x0 goto pc+2
  11: (79) r1 = *(u64 *)(r0 +0)
  12: (07) r1 += 1
  13: (7b) *(u64 *)(r10 -8) = r1
  14: (bf) r2 = r10
  15: (07) r2 += -16
  16: (bf) r3 = r10
  17: (07) r3 += -8
  18: (18) r1 = map[id:1104351]
  20: (b7) r4 = 0
  21: (85) call htab_map_update_elem#242720
  22: (b7) r0 = 0
  23: (95) exit

Looking closely at the eBPF bytecode instructions, we observe that BPF-helper map_update_elem is always invoked.

The code in line 11 gets a pointer to the map value (in r0) and increments the value, which is great: we are basically done and could exit, but instead the code continues and calls htab_map_update_elem.

  • This creates an unnecessary data race, which I was hit by in production

I looked at the code generated for @=count(), which does the right thing and skips the call to map_update_elem.
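To make the race concrete, here is a minimal BPF C sketch (not bpftrace's actual generated code; the map, section, and function names are made up for this example) of the same lookup/modify/update_elem pattern. Two CPUs running the probe concurrently can both load the same old counter value, both add 1, and both write their stale copy back, so one increment is lost; an in-place update through the looked-up pointer would avoid the extra helper call entirely.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, __u64);
    __type(value, __u64);
} flush_calls SEC(".maps");

SEC("kprobe/cgroup_rstat_flush_locked")
int count_flush(void *ctx)
{
    __u64 key = 0, init = 1;
    __u64 *val = bpf_map_lookup_elem(&flush_calls, &key);

    if (!val) {
        /* First hit: create the element with value 1. */
        bpf_map_update_elem(&flush_calls, &key, &init, BPF_NOEXIST);
        return 0;
    }

    /* Racy pattern, equivalent to the bytecode above: load the value,
     * add 1 in a register, then write the (possibly stale) copy back. */
    __u64 tmp = *val + 1;
    bpf_map_update_elem(&flush_calls, &key, &tmp, BPF_ANY);

    /* In-place alternative: we already hold a pointer into the map,
     * so storing through it would be enough:
     *   *val += 1;
     */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";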

bpftrace --info output

$ sudo bpftrace --info
System
OS: Linux 6.6.30-cloudflare-2024.5.4 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010
Arch: x86_64

Build
version: v0.20.0-149-g742f
LLVM: 18.1.5
unsafe probe: no
bfd: yes
liblldb (DWARF support): yes
libsystemd (systemd notify support): no

Kernel helpers
probe_read: yes
probe_read_str: yes
probe_read_user: yes
probe_read_user_str: yes
probe_read_kernel: yes
probe_read_kernel_str: yes
get_current_cgroup_id: yes
send_signal: yes
override_return: yes
get_boot_ns: yes
dpath: yes
skboutput: yes
get_tai_ns: yes
get_func_ip: yes
jiffies64: yes
for_each_map_elem: yes

Kernel features
Instruction limit: 1000000
Loop support: yes
btf: yes
module btf: yes
map batch: yes
uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes

Map types
hash: yes
percpu hash: yes
array: yes
percpu array: yes
stack_trace: yes
perf_event_array: yes
ringbuf: yes

Probe types
kprobe: yes
tracepoint: yes
perf_event: yes
kfunc: yes
kprobe_multi: no
uprobe_multi: yes
raw_tp_special: yes
iter: yes

netoptimizer added the bug label on May 16, 2024

netoptimizer commented May 16, 2024

Poke @jordalgo as it looks related to the improvement in PR #2795.
And @viktormalik

@jordalgo
Contributor

@netoptimizer Good catch. Let me take a look.

jordalgo pushed a commit to jordalgo/bpftrace that referenced this issue May 17, 2024
This will save a call to bpf_map_update_elem

Issue: bpftrace#3175
netoptimizer added a commit to xdp-project/xdp-project that referenced this issue May 17, 2024
…omic

Fix comment about increment operator being atomic.

It is both slow and contains data races as described here:
 - bpftrace/bpftrace#3175

Maybe this will soon get fixed via:
 - bpftrace/bpftrace#3179
 - bpftrace#3179

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

jordalgo commented Jun 7, 2024

Just wanted to follow up a bit after some internal discussion.

As per the docs, ++ uses a hash map (shared across CPUs), whereas count uses a per-cpu hash map, which allows for safe direct pointer updating. If we're updating the pointer directly in ++ AND calling map_update_elem, that's a bug (as you pointed out, @netoptimizer). We should decide on one or the other.
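For illustration (reusing the headers from the sketch in the issue description; the names are made up and do not match bpftrace's real map layout), this is roughly why the per-cpu map used by count() makes direct pointer updates safe: each CPU gets its own copy of the value, so an increment through the looked-up pointer cannot race with other CPUs, and userspace sums the per-cpu copies when it reads the map.

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1);
    __type(key, __u64);
    __type(value, __u64);
} flush_calls_percpu SEC(".maps");

SEC("kprobe/cgroup_rstat_flush_locked")
int count_flush_percpu(void *ctx)
{
    __u64 key = 0, init = 0;
    __u64 *val = bpf_map_lookup_elem(&flush_calls_percpu, &key);

    if (!val) {
        /* Create this CPU's element, then look the pointer up again. */
        bpf_map_update_elem(&flush_calls_percpu, &key, &init, BPF_NOEXIST);
        val = bpf_map_lookup_elem(&flush_calls_percpu, &key);
        if (!val)
            return 0;
    }

    *val += 1;   /* this CPU's private copy; no cross-CPU race */
    return 0;
}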

I'm also thinking that maybe we should have a config setting for atomic/safer writes. This would also cover per-cpu map updating, as there can still be races if a bpf prog thread is preempted by another (via NMI) that updates the same map. Most of the time the overhead of accounting for this edge case is probably not worth it, but it might be worth adding as an option for some users.
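A sketch of what such an "atomic/safer writes" mode could look like on the shared hash map (illustrative only, reusing the flush_calls map from the earlier sketch, not an actual bpftrace implementation): once the value pointer is in hand, __sync_fetch_and_add compiles to a BPF atomic add, which avoids both the second map_update_elem call and the read-modify-write race.

SEC("kprobe/cgroup_rstat_flush_locked")
int count_flush_atomic(void *ctx)
{
    __u64 key = 0, init = 0;
    __u64 *val = bpf_map_lookup_elem(&flush_calls, &key);

    if (!val) {
        /* Create the element, then fetch a pointer to it. */
        bpf_map_update_elem(&flush_calls, &key, &init, BPF_NOEXIST);
        val = bpf_map_lookup_elem(&flush_calls, &key);
        if (!val)
            return 0;
    }

    __sync_fetch_and_add(val, 1);   /* atomic in-place increment */
    return 0;
}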

@jordalgo
Contributor

Dug into this a little more, and AFAICT we're NOT double updating. E.g. if I comment out this line, the value doesn't get updated at all for a map increment (@++):
https://github.com/bpftrace/bpftrace/blob/master/src/ast/passes/codegen_llvm.cpp#L3931

The LLVM IR also shows that we're loading the map pointer value into a register and adding that to another pointer instead of updating directly. Maybe there's some additional optimization happening where that indirect updating is being removed, but I'm having trouble reproducing it.

Going to close this task for now, but feel free to re-open if you think I missed something. That being said, I am going to look into an atomic increment for ++ instead of a map_update_elem call.
