[RFC] Probe expansion in codegen #3005

viktormalik · 2024-02-14T12:47:55Z

While working on #2334, I realized that we'll need to significantly change the way we do probe expansion, so I'm opening an RFC to see what other people's opinions are before I start implementing it.

Consider a simple wildcarded probe: kfunc:vfs_* { ... }.

At the moment, we generate one LLVM function, LLVM generates one BPF program from it, then we perform the expansion (74 probes), load the BPF program 74 times (each time with a different BTF id of the probe), and attach each instance to a different probe.

The problem is that if we delegate probe loading to libbpf, it will need to discover the probes from the ELF object and therefore we'll need to generate 74 copies of the same LLVM function (unless we somehow force LLVM to create multiple symbol table entries for the same BPF function). This will heavily enlarge the codegen output and the size of the ELF file.

My stance is that this is still worth it, as moving to libbpf will have several advantages:

- less code (e.g. no need for custom relocations),
- possibility to use libbpf's attachment in the future (removing the dependency on BCC),
- a "standard" ELF produced by bpftrace, possibly usable by loaders other than libbpf,
- access to other libbpf features.

The codegen/ELF size itself is a hidden technical detail and it'll probably only cause trouble for debugging. Also, for some probes (e.g. kprobe), we'll be ok with just a single LLVM function but for others (like k(ret)func), we'll always need to do the expansion.

ajor · 2024-02-14T18:27:05Z

> 1. we generate one LLVM function
> 2. LLVM generates one BPF program from it
> 3. we perform the expansion (74 probes)
> 4. load the BPF program 74 times (each time with a different BTF id of the probe)
> 5. attach each instance to a different probe

What do you mean by "perform the expansion" in step 3? I'm not familiar with this code - are we copying the bytecode currently?

My concern with unnecessary probe expansion is the performance impact if we want to attach to 100k+ probes (e.g. fentry:*, fexit:* {}). Generating multiple symbols for the same function seems like something that should be possible to me.

viktormalik · 2024-02-16T13:50:57Z

> > 1. we generate one LLVM function
> > 2. LLVM generates one BPF program from it
> > 3. we perform the expansion (74 probes)
> > 4. load the BPF program 74 times (each time with a different BTF id of the probe)
> > 5. attach each instance to a different probe
>
> What do you mean by "perform the expansion" in step 3? I'm not familiar with this code - are we copying the bytecode currently?

We're not copying it directly but we create one Probe (from types.h) object per expanded probe (see bpftrace::add_probe), then call bpf_prog_load for each, and the copy is done in the kernel upon loading.
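A rough model of that flow, with a stub standing in for the real loader. The names `stub_prog_load`, `load_expanded` and `struct loaded_prog`, as well as the BTF ids used with them, are hypothetical. The sketch only illustrates that the same instruction buffer is handed to the loader once per match, each time with a different attach target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one loaded program instance. */
struct loaded_prog {
    uint32_t attach_btf_id; /* target this instance is bound to */
    uint64_t insns[4];      /* private copy of the shared bytecode */
    size_t insn_cnt;
};

/* Stub loader (NOT bpf_prog_load): copies the shared bytecode and
 * records the attach target. Returns 0 on success, -1 on bad input. */
int stub_prog_load(const uint64_t *insns, size_t insn_cnt,
                   uint32_t attach_btf_id, struct loaded_prog *out)
{
    if (insn_cnt > sizeof(out->insns) / sizeof(out->insns[0]))
        return -1;
    memcpy(out->insns, insns, insn_cnt * sizeof(insns[0]));
    out->insn_cnt = insn_cnt;
    out->attach_btf_id = attach_btf_id;
    return 0;
}

/* Load the same bytecode once per expanded match.
 * Returns the number of instances that were loaded. */
size_t load_expanded(const uint64_t *insns, size_t insn_cnt,
                     const uint32_t *btf_ids, size_t n_matches,
                     struct loaded_prog *out)
{
    size_t loaded = 0;
    for (size_t i = 0; i < n_matches; i++) {
        if (stub_prog_load(insns, insn_cnt, btf_ids[i], &out[loaded]) == 0)
            loaded++;
    }
    return loaded;
}
```

With 74 matches, `load_expanded` would produce 74 instances that differ only in `attach_btf_id`.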

> My concern with unnecessary probe expansion is the performance impact if we want to attach to 100k+ probes (e.g. fentry:*, fexit:* {}).

In reality, attaching to such a large number of fentry probes is already terribly slow (it's caused by the kernel, not bpftrace):

# time src/bpftrace -e 'kfunc:vfs_* { @[func] = count() } i:ms:1 { exit() }'
Attaching 75 probes...
[...]
real	0m20.172s
user	0m0.644s
sys	0m0.982s

# time src/bpftrace -e 'kfunc:cpu* { @[func] = count() } i:ms:1 { exit() }'
Attaching 375 probes...
[...]
real	1m38.768s
user	0m2.197s
sys	0m2.964s
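As a back-of-the-envelope check on these two runs, the wall-clock time can be divided by the probe count. This is a sketch only: the helper names are made up for illustration, and the extrapolation assumes the per-probe cost stays constant, which two data points suggest but do not prove.

```c
#include <stddef.h>

/* Average wall-clock seconds per probe for one measured run. */
double seconds_per_probe(double real_seconds, size_t n_probes)
{
    return real_seconds / (double)n_probes;
}

/* Linear extrapolation to a larger probe count.
 * Assumption: the cost per probe does not change with the count. */
double extrapolate_seconds(double per_probe, size_t n_probes)
{
    return per_probe * (double)n_probes;
}
```

Here `seconds_per_probe(20.172, 75)` and `seconds_per_probe(98.768, 375)` both come out between 0.26 and 0.27 seconds. Under the linear assumption, 100,000 probes would need more than 26,000 seconds, i.e. over seven hours for a single run.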

Also remember that bpftrace has a limit of 512 probes (which can be lifted by setting an environment variable).

All in all, attaching to a huge number of kfuncs is not practical and it's not their main use-case in the first place. The only other probe types which could use such a large number of attach points are kprobes and uprobes, and here we could use kprobe_multi and uprobe_multi link types and generate just a single LLVM function.

> Generating multiple symbols for the same function seems like something that should be possible to me.

I agree but we'd still rely on libbpf to do the program collection by iterating the symbol table. If that ever changes (IMHO it's very unlikely), we'd have to adapt. Also, we can always add this if we find that there are performance issues with the full expansion approach.

jordalgo · 2024-02-16T16:18:13Z

Trying to digest this a bit. It seems (as per @viktormalik's point) that perhaps the only real concern here is around expansion of kfunc/kretfunc, as we can use the "multi" variants for the kprobes/uprobes. If attaching to this many kfuncs is an anti-pattern of sorts, I'm fine with doing the unoptimized version (copies of the same LLVM function in the ELF file) if that's easier and (perhaps?) more future-proof than messing around with the symbol table. We can also issue warnings to the user about both the size of the ELF file and the number of attached kfuncs (perhaps encouraging the use of kprobes in that situation). All that said, I don't feel strongly.

viktormalik · 2024-02-19T06:46:09Z

> Trying to digest this a bit. It seems (as per @viktormalik's point) that perhaps the only real concern here is around expansion of kfunc/kretfunc, as we can use the "multi" variants for the kprobes/uprobes.

The only problem is that the "multi" variants are rather new and therefore won't be supported on older kernels. Still, the 512 probe limit would apply on those kernels, so we shouldn't get an ELF with thousands of copies of a BPF function.

> If attaching to this many kfuncs is an anti-pattern of sorts, I'm fine with doing the unoptimized version (copies of the same LLVM function in the ELF file) if that's easier and (perhaps?) more future-proof than messing around with the symbol table.

I agree, unless the compiler has a "standard" way to do that. I haven't found any yet.

> We can also issue warnings to the user about both the size of the ELF file and the number of attached kfuncs (perhaps encouraging the use of kprobes in that situation). All that said, I don't feel strongly.

ajor · 2024-02-19T15:11:51Z

Attaching to a huge number of fentry probes isn't currently possible due to the kernel's performance, as you said. It is something that users want to do and should be able to do though, so we need to keep it in mind for whenever a kernel fix comes along. This is something that @tyroguru is interested in.

It's the probe detach which is slow rather than the attach, if that makes any difference (try with this script: fentry:vfs_* { } BEGIN { print("begin"); } END { print("end") }).

Duplicate Symbols

I can create duplicate symbols for functions with Clang, so there must be an interface for doing this in libLLVM:

void bar()  {}
asm("asdf:");
asm(".globl asdf");
void foo() {}

0000000000000000 T bar
0000000000000007 T asdf
0000000000000007 T foo

Maybe MCContext::getOrCreateSymbol? https://llvm.org/doxygen/classllvm_1_1MCContext.html#ac11eef690074972378846024abbe8722

libbpf

It looks like retsnoop is doing something special for mass attaching to fentries, but I suppose it has the requirement of being compiled ahead of time: https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L510-L528

Pinging @anakryiko for any input on using libbpf.

anakryiko · 2024-02-19T19:25:32Z

> It looks like retsnoop is doing something special for mass attaching to fentries, but I suppose it has the requirement of being compiled ahead of time: https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L510-L528
>
> Pinging @anakryiko for any input on using libbpf.

There is nothing that retsnoop or libbpf can do to speed up attachment/detachment of fentry/fexit BPF programs, unfortunately. The kernel doesn't support single-shot multi-attachment for them (there were discussions but it never got implemented). The piece you linked is just preparing a few different copies of programs, depending on the number of arguments. This is done to let libbpf perform relocations and all other adjustments, so that retsnoop can just grab raw BPF instructions and clone them for each btf_id (see clone_prog(), https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L977). So fentry/fexit mode is supported by retsnoop, but it's slow, with its own limitations, and definitely not the preferred mode. It does have advantages in some situations (fentry pollutes LBR entries much less compared to kprobes).

It's very different for kprobe/kretprobe. Retsnoop by default will use multi-kprobes and will be able to attach to thousands of functions almost instantaneously, with just one program for entry and one for exit.

viktormalik · 2024-02-20T08:22:11Z

> There is nothing that retsnoop or libbpf can do to speed up attachment/detachment of fentry/fexit BPF programs, unfortunately. The kernel doesn't support single-shot multi-attachment for them (there were discussions but it never got implemented). The piece you linked is just preparing a few different copies of programs, depending on the number of arguments. This is done to let libbpf perform relocations and all other adjustments, so that retsnoop can just grab raw BPF instructions and clone them for each btf_id (see clone_prog(), https://github.com/anakryiko/retsnoop/blob/2d730d468719ed35d0f3bc2dbc958bd90f31342e/src/mass_attacher.c#L977). So fentry/fexit mode is supported by retsnoop, but it's slow, with its own limitations, and definitely not the preferred mode. It does have advantages in some situations (fentry pollutes LBR entries much less compared to kprobes).
>
> It's very different for kprobe/kretprobe. Retsnoop by default will use multi-kprobes and will be able to attach to thousands of functions almost instantaneously, with just one program for entry and one for exit.

Thanks for the insights @anakryiko. The clone_prog part looks like what we're doing in bpftrace for every probe type now - call bpf_prog_load for each attachment target. I'd like to get rid of this approach since it prevents us from using struct bpf_object to manipulate BPF programs (and all the features that come with it). The idea was to do the cloning on the level of LLVM, but in the case of fentry/fexit programs (or kprobes when kprobe-multi is not available), it may lead to very large ELF objects, unless we're able to do the cloning efficiently (see below).

> Duplicate Symbols
>
> I can create duplicate symbols for functions with Clang, so there must be an interface for doing this in libLLVM:
> [...]
> Maybe MCContext::getOrCreateSymbol? https://llvm.org/doxygen/classllvm_1_1MCContext.html#ac11eef690074972378846024abbe8722

There's also symbol aliasing in LLVM which sounds like what we need. I'll have a look into it.
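At the C level, the same idea can be sketched with the `alias` function attribute supported by GCC and Clang. This is only an illustration of what aliasing produces in the symbol table, not the codegen bpftrace would use. It assumes an ELF target such as Linux, and the function names are made up.

```c
/* One function body shared by several symbols (ELF targets only). */
int probe_body(int x)
{
    return x + 1;
}

/* Each alias adds a symbol table entry that points at probe_body's
 * code; no second copy of the instructions is emitted. */
int probe_vfs_read(int x) __attribute__((alias("probe_body")));
int probe_vfs_write(int x) __attribute__((alias("probe_body")));
```

Running nm on the resulting object should list `probe_body`, `probe_vfs_read` and `probe_vfs_write` at the same address, similar to the asdf/foo pair in the earlier experiment.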

The previous probe expansion approach tried to minimize the number of LLVM functions generated by emitting a single function for all probe matches in most cases. While this was efficient, it came with a couple of drawbacks:

- In some cases it is necessary to generate a separate LLVM function for each match (e.g. when the 'probe' builtin is used). This leads to having two very similar loops for iterating matches in BPFtrace::add_probe and in CodegenLLVM::visit(Probe), which is quite confusing and hard to maintain.
- libbpf needs one BPF program (i.e. one LLVM function) per probe, so if we want to delegate program loading (and possibly attachment) to libbpf (which we do), we cannot use this approach. See [1] for more details.

This refactors probe expansion by moving most of it into codegen. Overall, we now distinguish three types of probe expansion:

- Full expansion - A separate LLVM function is generated for each match. This is used for probe types with a small number of matches (e.g. "hardware"), when the 'probe' builtin is used, or for USDT probes (because they may access args).
- Alias expansion - Generates one LLVM function for all matches and one alias pointing to that function for each match. This is an efficient way of creating an ELF with multiple BPF programs sharing the same code which libbpf is able to parse. Used for expansion of most probe types.
- Multi expansion - Used for k(u)probes when k(u)probe_multi is available. Generates one LLVM function and one BPF program for all matches and attaches the expanded functions via bpf_link_create_opts.

This allows us to drop a lot of duplicated code. The expansion for "full" and "alias" is done in CodegenLLVM::visit(Probe), the expansion for "multi" is done in BPFtrace::add_probe.

One particular area where this refactoring caused problems is unit tests in tests/bpftrace.cpp. Previously, it was sufficient to generate a simple ast::Probe and pass it to BPFtrace::add_probe since that was where most of the expansion was done. Now that the expansion was moved to codegen, we need to run the full parser -> field analyser -> clang parser -> semantic analyser -> codegen sequence.

Another problem comes with USDT probes, as it is not possible to easily mock USDTHelper, which is a fully static class. Since we need to override AttachPoint::usdt::num_locations from tests, we allow that via a new internal env variable BPFTRACE_TEST_USDT_NUM_LOCATIONS.

[1] bpftrace#3005

viktormalik · 2024-05-06T08:29:52Z

> There's also symbol aliasing in LLVM which sounds like what we need. I'll have a look into it.

I did some investigation and experiments here and found that using symbol aliases will indeed produce multiple symbols with the same address, and libbpf will correctly discover them as separate BPF programs (and do a copy of the instructions for each). The problem is that libbpf relocations will not work because libbpf doesn't account for multiple programs sharing the same instructions in the ELF file.

This should be possible to fix on the libbpf side, but it's a bigger change, so I'd suggest going with full expansion (i.e. one LLVM function per wildcard match) for the first version of #2334.
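One way to picture the relocation problem is the toy model below. It is not taken from libbpf. `struct prog`, `find_prog_by_off` and `apply_reloc` are hypothetical names, and the lookup strategy is an assumption chosen only to show how a relocation recorded against a section offset can end up reaching just one of several programs that share the same instructions.

```c
#include <stddef.h>

/* Simplified view of a program discovered from one function symbol. */
struct prog {
    const char *name; /* symbol name */
    size_t sec_off;   /* first instruction, as an offset into the section */
    size_t insn_cnt;  /* number of instructions */
    int relocated;    /* 1 once a relocation has been applied */
};

/* Return the first program whose instruction range covers sec_off,
 * or NULL if none does. Aliases share a range, so only one can win. */
struct prog *find_prog_by_off(struct prog *progs, size_t n, size_t sec_off)
{
    for (size_t i = 0; i < n; i++) {
        if (sec_off >= progs[i].sec_off &&
            sec_off < progs[i].sec_off + progs[i].insn_cnt)
            return &progs[i];
    }
    return NULL;
}

/* Apply one relocation recorded against a section offset.
 * Returns 0 on success, -1 if no program covers the offset. */
int apply_reloc(struct prog *progs, size_t n, size_t sec_off)
{
    struct prog *p = find_prog_by_off(progs, n, sec_off);
    if (p == NULL)
        return -1;
    p->relocated = 1;
    return 0;
}
```

With three aliased programs that all start at offset 0, a single relocation marks only the first one, while the other two keep their unpatched copies of the instructions. Full expansion avoids this because every program gets its own instruction range.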

The previous probe expansion approach tried to minimize the number of LLVM functions generated by emitting a single function for all probe matches in most cases. While this was efficient, it came with a couple of drawbacks:

- In some cases it is necessary to generate a separate LLVM function for each match (e.g. when the 'probe' builtin is used). This leads to having two very similar loops for iterating matches in BPFtrace::add_probe and in CodegenLLVM::visit(Probe), which is quite confusing and hard to maintain.
- libbpf needs one BPF program (i.e. one LLVM function) per probe, so if we want to delegate program loading (and possibly attachment) to libbpf (which we do), we cannot use this approach. See [1] for more details.

This refactors probe expansion by moving most of it into codegen. Overall, we now distinguish two types of probe expansion:

- Full expansion - A separate LLVM function is generated for each match. This is used for most expansions now.
- Multi expansion - Used for k(u)probes when k(u)probe_multi is available. Generates one LLVM function and one BPF program for all matches and attaches the expanded functions via bpf_link_create_opts.

This allows us to drop a lot of duplicated code. The expansion for "full" is done in CodegenLLVM::visit(Probe), the expansion for "multi" is done in BPFtrace::add_probe.

A drawback of this approach is that we generate substantially larger ELF objects for expansions of probe types which do not support multi-probes (e.g. kfuncs and tracepoints), as we generate duplicate LLVM functions. This is something we can live with for now since multi-attachment is not the main use-case for these probe types (e.g. attaching to many kfuncs is very slow) and there's usually an alternative to use multi-kprobes.

One particular area where this refactoring caused problems is unit tests in tests/bpftrace.cpp. Previously, it was sufficient to generate a simple ast::Probe and pass it to BPFtrace::add_probe since that was where most of the expansion was done. Now that the expansion was moved to codegen, we need to run the full parser -> field analyser -> clang parser -> semantic analyser -> codegen sequence. With this change, some tests had to be dropped, especially the tests with a single wildcard for uprobe/USDT target. The reason is that the semantic analyser expands these wildcards by searching all paths on the system, which is something that cannot be mocked and therefore should not be run in unit tests (e.g. it prevents running the unit tests as non-root).

Another problem comes with USDT probes, as USDTHelper was originally fully static, which prevented mocking it. Fortunately, we only need to mock the find() method, which doesn't have to be static, so this refactors USDTHelper and some of its users and introduces MockUSDTHelper, which mocks the find method for unit tests.

[1] #3005
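The choice between the two expansion types described above can be sketched as a small decision function. The enum and function names are hypothetical and do not come from the bpftrace sources. The rule encoded here is just the one stated in the text: multi expansion for kprobes and uprobes when the multi link type is available, full expansion otherwise.

```c
#include <stdbool.h>

/* Illustrative subset of probe types; not bpftrace's real enum. */
enum probe_type {
    PROBE_KPROBE,
    PROBE_UPROBE,
    PROBE_KFUNC,
    PROBE_TRACEPOINT,
    PROBE_USDT
};

enum expansion {
    EXPANSION_FULL,  /* one LLVM function per match */
    EXPANSION_MULTI  /* one LLVM function, attached to all matches */
};

/* Pick the expansion type for a wildcarded probe.
 * multi_available: whether the kernel supports the matching
 * kprobe_multi/uprobe_multi link type. */
enum expansion choose_expansion(enum probe_type type, bool multi_available)
{
    bool multi_capable = (type == PROBE_KPROBE || type == PROBE_UPROBE);

    if (multi_capable && multi_available)
        return EXPANSION_MULTI;
    return EXPANSION_FULL;
}
```

For example, `choose_expansion(PROBE_KFUNC, true)` yields `EXPANSION_FULL`, which is the case where the ELF object grows with the number of matches.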

viktormalik added the RFC Request for comment label Feb 14, 2024

viktormalik mentioned this issue May 6, 2024: libbpf as a main part of program loading #2334 (Closed)

viktormalik mentioned this issue May 7, 2024: Move probe expansion into codegen #3155 (Merged)
