Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

features: add HaveProgramHelper API #375

Merged
merged 2 commits into from
May 23, 2022
Merged

Conversation

rgo3
Copy link
Contributor

@rgo3 rgo3 commented Aug 13, 2021

HaveProgHelper(pt ebpf.ProgramType, helper asm.BuiltinFunc) error
allows to probe the available BPF helpers to a given BPF program
type. Probe results are cached and run at most once.

Signed-off-by: Robin Gögge r.goegge@gmail.com

@rgo3
Copy link
Contributor Author

rgo3 commented Aug 13, 2021

Compared to bpftool I just check for EACCES and EINVAL for now instead of checking the verifier log. Not sure if that suffices and if there is an (future) edge case that could break this and result in a false negative/positive. cc @ti-mo @qmonnet

@qmonnet
Copy link
Member

qmonnet commented Aug 13, 2021

Did you compare the results with bpftool's (minus the ones for the few program types that aren't working properly on bpftool)? Do you get the same thing?

@rgo3
Copy link
Contributor Author

rgo3 commented Aug 13, 2021

Initially I just checked a few helpers for equality randomly, but now did the full comparison and I actually come up short for like 5 helper calls regarding the bpf_get_current_task_btf helper. So I guess just checking for EINVAL isn't enough :(...

@qmonnet
Copy link
Member

qmonnet commented Aug 13, 2021

So there's just the bpf_get_current_task_btf() that is not detected as available with your PR, is this correct? I suppose your kernel is >= 5.11? (Asking because it would be nice to be sure where this EINVAL comes from)

@rgo3
Copy link
Contributor Author

rgo3 commented Aug 13, 2021

Yes, just bpf_get_current_task_btf(). I think the EINVAL comes from the verifier here. I modified bpftool so it outputs the verifier log and the error message for bpf_get_current_task_btf() is: invalid return type 11 of func bpf_get_current_task_btf#158

@qmonnet
Copy link
Member

qmonnet commented Aug 13, 2021

A few lines above then, because (again) there's no BTF. invalid return type is not the same as unknown return type. The unknown return type case would mean the helper would never pass the verifier. Anyway :).

Yeah, in that case it's not sure there's a way other than parsing the log. We could start making cases for helpers too, but this will be harder to maintain I'm afraid.

@rgo3
Copy link
Contributor Author

rgo3 commented Aug 13, 2021

Oh yeah I selected the wrong line 🙈.
Timo and I talked about this yesterday, how likely are the verifier messages to change aka how stable is it to use them as an API?

@qmonnet
Copy link
Member

qmonnet commented Aug 13, 2021

Tricky question. But then the error codes from the verifier are hardly a stable API either 🤷.

@lmb
Copy link
Collaborator

lmb commented Aug 23, 2021

Drive by observation: at least for error codes developers tend to try to preserve them. The same can't be said for the verifier log, and I think we shouldn't rely on it.

@rgo3
Copy link
Contributor Author

rgo3 commented Oct 19, 2021

Hey! 👋 Apologies for letting this PR go stale a bit, but I would like to pick it up again so that I eventually can make use of it in cilium.

To sum up the current status: Just relying on error codes can result in false negatives, e.g. bpf_get_current_task_btf(). We could rely on verifier log output, but I agree with Lorenz that verifier output is more likely subject to change (But I am new, so that is just a feeling to be honest).

To get this PR merged I can think of two possible solutions:

  • If we want to rely on error codes, we would have to make sure EINVAL is only ever returned when the helper isn't available. We can do this by adding exception cases to various helper probes, similar to what we have with the other probe APIs that were merged. As there are so many helpers, the maintenance burden of these exception cases might not actually be better than relying on the verifier output, because we have to cover more code paths of the verifier now.

  • Go with verifier log, accepting the risk that the messages might change, but this API then only codes against a few specific code paths in the verifier reducing complexity of the probes.

I hope @ti-mo and @lmb can give their opinions on the two solutions I proposed or maybe come up with an even better option.

@lmb
Copy link
Collaborator

lmb commented Oct 20, 2021

If at all possible we should avoid having to maintain "quirks" for individual helpers. Some ideas:

Parse kernel BTF and look for the appropriate function signatures.

  • Needs extra code for kernels without BTF (but something like btfhub could help)
  • Can't tell you whether a helper is supported for a specific program type

Parse /proc/kallsyms and look for function / variable names.

  • Dead simple, works on old kernels
  • Breaks due to function renaming
$ grep map_lookup_elem /proc/kallsyms 
0000000000000000 t map_lookup_elem
0000000000000000 T bpf_map_lookup_elem
0000000000000000 t __htab_map_lookup_elem
0000000000000000 t htab_percpu_map_lookup_elem
0000000000000000 t htab_lru_map_lookup_elem_sys
0000000000000000 t htab_map_lookup_elem
0000000000000000 t htab_of_map_lookup_elem
0000000000000000 t htab_lru_percpu_map_lookup_elem
0000000000000000 t htab_lru_map_lookup_elem
0000000000000000 T bpf_fd_htab_map_lookup_elem
0000000000000000 t fd_array_map_lookup_elem
0000000000000000 t array_map_lookup_elem
0000000000000000 t array_of_map_lookup_elem
0000000000000000 t percpu_array_map_lookup_elem
0000000000000000 T bpf_fd_array_map_lookup_elem
0000000000000000 t queue_stack_map_lookup_elem
0000000000000000 t ringbuf_map_lookup_elem
0000000000000000 t dev_map_lookup_elem
0000000000000000 T __dev_map_lookup_elem
0000000000000000 t cpu_map_lookup_elem
0000000000000000 T __cpu_map_lookup_elem
0000000000000000 t stack_map_lookup_elem
0000000000000000 t bpf_struct_ops_map_lookup_elem
0000000000000000 t xsk_map_lookup_elem
0000000000000000 t xsk_map_lookup_elem_sys_only
0000000000000000 D bpf_map_lookup_elem_proto

Maye a combination of "stupid" BPF program probes and either one of these approaches works?

@ti-mo
Copy link
Collaborator

ti-mo commented Oct 25, 2021

@rgo3 Welcome back!

If at all possible we should avoid having to maintain "quirks" for individual helpers.

This should indeed be the general approach, but there will be exceptions. :) (as there were with the other probe types)

Parse kernel BTF and look for the appropriate function signatures.

* Needs extra code for kernels without BTF (but something like btfhub could help)

Requiring an external service/db just to get probes to work doesn't sound very appealing IMO.

Some helpers (like bpf_snprintf_btf, to name one) rely on BTF, so absence of BTF support in the kernel is already a signal that this helper won't be available. This is case-by-case, though.

Parse /proc/kallsyms and look for function / variable names.

Expensive, but generally effective, have resorted to this before. Good last-resort solution for older kernels.

Maye a combination of "stupid" BPF program probes and either one of these approaches works?

Naturally, but this is what eventually causes death by a 1000 papercuts, as the fallback detection strategy required will vary depending on the helper. This can only be decided when actually implementing these, so let's discuss on actual code.

@lmb
Copy link
Collaborator

lmb commented Oct 25, 2021

Naturally, but this is what eventually causes death by a 1000 papercuts

The answer to that is to not provide arbitrary helper feature tests, my preferred solution :P

@rgo3
Copy link
Contributor Author

rgo3 commented Oct 27, 2021

Is it an option to move forward with this PR keeping it the way it is currently, basically not trying to create a catch-em-all solution right from the start? We already cover lots of possible progtype-helper combinations the way it is, which would suffice for its first use-case in cilium. If people have future use-cases with combinations we don't probe accurately at the moment, we can defer adding fallback strategies or extra probes until then?

As a side node, I played a bit with /proc/kallsyms and the results differ from bpftools output. E.g.:

$ grep bpf_get_current_task_btf /proc/kallsyms
0000000000000000 T bpf_get_current_task_btf
0000000000000000 r bpf_get_current_task_btf_proto

bpftool shows this helper to be available to multiple prog types on my kernel while /proc/kallsyms doesn't.

@ti-mo
Copy link
Collaborator

ti-mo commented Nov 2, 2021

Is it an option to move forward with this PR keeping it the way it is currently, basically not trying to create a catch-em-all solution right from the start?

I'm okay with the initial implementation not being perfect, as long as we have a way of documenting what the errata are. Preferably in tests, with a GH issue to track the follow-up work. That way, if anyone has the need for a particular broken probe, they can grab the issue and go ahead with the implementation.

@lmb
Copy link
Collaborator

lmb commented Feb 18, 2022

What's going to happen to this?

@rgo3
Copy link
Contributor Author

rgo3 commented Feb 18, 2022

I'd still like to get some version of this in, considering this would be the most useful when trying to integrate the features API into cilium (which was the original goal of my GSoC project).

I got stuck looking for ways to create tests that nicely show what the errata are because of all the possible combinations between prog types and available helpers on respective kernel versions. I could add tests per prog type, start with the prog types I'd need to use this in cilium and add more over time? Like that I wouldn't have to test the full matrix of possibilities for now but only a smaller subset. If yes I'd try to navigate along this list of versions and available helpers per prog type to incrementally add tests?

If that's not a clean enough solution to move forward with this PR, I might need to rethink how we could add helper probes to this API.

@rgo3 rgo3 marked this pull request as ready for review May 12, 2022 10:19
@rgo3
Copy link
Contributor Author

rgo3 commented May 12, 2022

Alright, I think it's finally ready to graduate from its draft status.
There's still a test-case that I added that's failing, but that's related to the kernel config in CI:

--- FAIL: TestHaveProgHelper (0.04s)
    ...
    --- FAIL: TestHaveProgHelper/CGroupSockAddr/FnGetCgroupClassid (0.00s)
    ...

Locally this test works for me, as my kernel has this helper type compiled in.

Additionally there are two more test-cases I couldn't add because the lib doesn't support those particular builtins yet (just a matter of updating the BuiltinFunc enum I think). I've mentioned those in comments above the test-cases.

I've also built a small tool to compare the features API to bpftools feature probes. Currently just living on a branch of my fork of the repo. Running that tool yields following output:

❯ go run -exec sudo cmd/bpftoolcmp/main.go
Running bpftool -j feature ...

Comparing available program types ...

Comparing available helper functions ...
    False negative: API got different result than bpftool for: RawTracepoint/FnGetFuncIp
    False negative: API got different result than bpftool for: Syscall/FnSkStorageGet
    False negative: API got different result than bpftool for: Syscall/FnSkStorageDelete
    False negative: API got different result than bpftool for: Syscall/FnDPath
    False negative: API got different result than bpftool for: Syscall/FnGetFuncIp
    False negative: API got different result than bpftool for: TracePoint/FnGetFuncIp
    False negative: API got different result than bpftool for: PerfEvent/FnGetFuncIp
    False negative: API got different result than bpftool for: RawTracepointWritable/FnGetFuncIp

I haven't fully investigated why those false negatives exist, but I'd create an issue to track and investigate if we choose to accept that these differ from bpftool currently.
This output excludes false negatives for program types LSM , Extension, StructOps and Tracing because we can't probe those prog types correctly.

cc @florianl

@lmb lmb requested a review from ti-mo May 12, 2022 15:25
@lmb
Copy link
Collaborator

lmb commented May 12, 2022

Could bpftoolcmp just be part of the testsuite? We could skip tests if the binary doesn't exist.

@rgo3
Copy link
Contributor Author

rgo3 commented May 12, 2022

So instead of having bpftoolcmp add a test like e.g. TestBPFToolCompare? And skip on missing binary and known exceptions?

Copy link
Contributor

@florianl florianl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this forward! 👍
The PR as is works fine here.

For possible future work wrt bpftoolcmp I would also vote for having this in a test case like TestBPFToolCompare. Rather than calling an external tool that exec.Cmd() another external tool (bpftool).

@rgo3
Copy link
Contributor Author

rgo3 commented May 13, 2022

Sorry, had to repush as I found some typos...

As for the FnGetFuncIp helper probes: I think these are actually false positives on bpftools/libbpfs end. The verifier has a check that only allows Kprobe and Tracing progs to call it. This API returns an error as ENOTSUPP is returned from the verifier. libbpf misses it because checking the verifier log output doesn't match in this case. cc @qmonnet

@ti-mo ti-mo mentioned this pull request May 16, 2022
1 task
Copy link
Collaborator

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it.

Could bpftoolcmp just be part of the testsuite? We could skip tests if the binary doesn't exist.

This can be done in a follow-up. 👍

features/prog.go Outdated Show resolved Hide resolved
features/prog_test.go Outdated Show resolved Hide resolved
features/prog_test.go Outdated Show resolved Hide resolved
features/prog_test.go Outdated Show resolved Hide resolved
features/prog_test.go Outdated Show resolved Hide resolved
features/prog_test.go Outdated Show resolved Hide resolved
@lmb
Copy link
Collaborator

lmb commented May 18, 2022

Bit of a late bike shed: can you call the function HaveProgramHelper?

@rgo3
Copy link
Contributor Author

rgo3 commented May 18, 2022

I can, sort of diverges from HaveProgType then though. Do we care about that?

@lmb
Copy link
Collaborator

lmb commented May 19, 2022

We can fix that up by introducing HaveProgramType and depreciating the shortened version.

rgo3 added 2 commits May 19, 2022 11:00
`HaveProgramHelper(pt ebpf.ProgramType, helper asm.BuiltinFunc) error`
allows to probe the available BPF helpers to a given BPF program
type. Probe results are cached and run at most once.

Signed-off-by: Robin Gögge <r.goegge@gmail.com>
In order to be in line with other naming schemes throughout
the library this commit deprecates HaveProgType() in favor
of HaveProgamType().

Signed-off-by: Robin Gögge <r.goegge@gmail.com>
@rgo3 rgo3 changed the title features: add HaveProgHelper API features: add HaveProgramHelper API May 19, 2022
@ti-mo ti-mo merged commit 951bb28 into cilium:master May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants