[FEATURE/BUG] Verifier issues with the current bpf probe #940

Andreagit97 · 2023-03-06T14:34:12Z

Motivation

Here we go again! Recently we faced different verifier issues in the current bpf probe :(
These are some examples:

I can see 2 main causes here:

We continuously add new features to the current probe without taking care of the increasing verification path complexity.
New compiler versions (like clang15) seem to optimize out our funny workarounds causing the probe rejection on some kernels.

IMHO we cannot withstand too many other changes before reaching the complexity limit... this is the reason why we crafted the modern bpf probe from scratch! BTW since almost all users are still strictly related to the current probe I think we need to find a temporary solution to this issue. One idea that could allow us to decrease the complexity a little bit could be to split the bpf_val_to_ring helper into dedicated helper functions. I've already started this work some time ago(#906) to address this kind of issue, but I think it could be time to extend this approach to all the codebase!

This refactor shouldn't be too huge and it should give us further time before the final 💥

WDYT? @FedeDP @incertum @Molter73 @hbrueckner @leogr

Feature

Split the bpf_val_to_ring helper into dedicated helper functions like in the modern BPF probe.

The text was updated successfully, but these errors were encountered:

Molter73 · 2023-03-06T14:53:00Z

Honestly, looking at #906 it doesn't look like a solution, it looks like code duplication hell.

The proposed solution of going into a code re-write spree in the eBPF probe seems like something that will a take A LOT of resources and time, ultimately making the code even more complex, harder to maintain AND potentially adding yet more places where something could go wrong. If the problem is that changes to clang and the driver cause issues with the verifier, then IMO we need to test the probe on a larger spectrum of kernels and compile them with as many compiler versions as possible, fixing issues as they are caught.

But that's just my opinion, obviously.

FedeDP · 2023-03-06T15:07:08Z

One idea that could allow us to decrease the complexity a little bit could be to split the bpf_val_to_ring helper into dedicated helper functions

I fully agree with the proposed solution: bpf_val_to_ring is a super complex piece of code inlined inside each of our fillers, often multiple times.
This would help us reduce code complexity (in the end i assume an API similar to the modern-bpf one will be reached) and having much less need of splitting code up in multiple tail-called fillers because a single bpf program gets too long.
Performances should also improve a bit.

The proposed solution of going into a code re-write spree in the eBPF probe seems like something that will a take A LOT of resources and time,

@Molter73 it will, but we are in a situation where each small change could lead to breaks. Moreover, we would have similar APIs between the old and the modern probe, leading to a cleaner code.

If the problem is that changes to clang and the driver cause issues with the verifier, then IMO we need to test the probe on a larger spectrum of kernels and compile them with as many compiler versions as possible, fixing issues as they are caught.

This is super hard! First of all, each clang/linux kernel patch release could kill us; moreover, maintaining such a test matrix is a huge PITA; this is a 3-dimensional matrix: { clang version, kernel version, architecture }.
If we were able to put this in place, it would be a huge win; i think this is actually much more work (and can be done in parallel of the bpf probe refactor).

Of course, #906 seems to contain code duplication; it is just because it is a first pass. Resulting API, as said earlier, should be much similar to the modern bpf one.

Andreagit97 · 2023-03-06T15:13:59Z

If the problem is that changes to clang and the driver cause issues with the verifier, then IMO we need to test the probe on a larger spectrum of kernels and compile them with as many compiler versions as possible, fixing issues as they are caught.

The problem is that the last time we fixed it with @FedeDP we loose one day finding a "reliable patch" (#858) :/ So the issue is not the clang version but that we are hitting the verifier complexity limit on many kernels... We put on a bunch of plasters but the dam is starting to leak :(

If we are able to enrich our testing matrix I would be super happy, but I think this is an orthogonal topic :/

BTW this is just a proposal if we are not on board, we can find something else maybe

incertum · 2023-03-06T23:23:12Z

Would love to see a balance. I get it that we have to find a better solution here, at the same time reciting @Molter73 it will drain resources from other tasks that in the grander scheme of things need to be prioritized more as well.

Currently some community feature requests are not even in any queue, because of ongoing refactors and limited engineering resources. Therefore, could we find a way to do it more slowly while not sacrificing prioritization of features that help making Falco more useful for end users in terms of "catching bad stuff"? Again, just trying to see if we can find a balance. For instance our refactor budget for Falco 0.35 has been already exhausted from my perspective, it would make sense to check on outstanding features we should add instead and then for the next libs and Falco release 0.36 yes we can mix in again refactors?

Same disclaimer as Mauro, it's obviously also just my opinion and observation.

Andreagit97 · 2023-03-07T08:47:30Z

Again, just trying to see if we can find a balance. For instance our refactor budget for Falco 0.35 has been already exhausted from my perspective, it would make sense to check on outstanding features we should add instead and then for the next libs and Falco release 0.36 yes we can mix in again refactors?

That's a good question to which I have no answer... I've seen these verifier issues in the past weeks :/ maybe if we don't touch the probe anymore we will have something stable for the release, but I'm not sure about that, IMHO we still have the socketcall management(#811) to merge for the next release. The only concern that I have is that if we find something bad near the release it would much more difficult to do this refactor because we don't have time to test it :/

FedeDP · 2023-03-07T08:50:14Z

Note moreover, that i think this is an effort that can be done in background, like 1 PR per week until we are over, perhaps porting away a single type from bpf_val_to_ring each iteration.
I will be glad to join and help you @Andreagit97 !

incertum · 2023-03-07T20:49:07Z

@Andreagit97 and @FedeDP would be on board with the slow and steady 1 PR / week, sequenced by highest impact (if possible) plus not sacrificing on features we still have in the queue for next release and other minor improvements that aim to make the tool more user friendly and address current top pain points amongst adopters.

Andreagit97 · 2023-03-08T09:26:49Z

let's say if we don't find further breaks we can postpone this fix. For this release I expect other 2 changes in the probe:

new(bpf, modern_bpf): introduce socketcall handling #811
[FEATURE] Catch data also when syscalls like wrte fail #942
new fillers to fix inconsistencies [Feature] implement new event pairs for a restricted set of syscalls #911
(am I missing something?)

They shouldn't be disruptive, let's see if the verifier doesn't complain we can postpone the fix to the next release, but I'm still convinced that at some point in time, we will need it :(

incertum · 2023-03-15T06:55:01Z

Hey @Andreagit97 got back to working on the localhost VM driver sanity tests, please checkout #982 - kernel compatibility indeed used to be better for some time.

While running tests today noticed the following issue with clang-15 on a 5.4 kernel and a bunch of the ubuntu kernels such as 4.19, 4.16 etc

math between map_value pointer and register with unbounded min value is not allowed
libscap: bpf_load_program() event=raw_tracepoint/filler/sys_unshare_e: Operation not permitted (1)

Can check over next days, but wanted to generally share that from my perspective verifier always complained when we forgot to initialize variables to 0 and or confusion with > or < aka it was always "math related" -> perhaps we should perform a simple audit in this respect?

Andreagit97 · 2023-03-15T08:31:18Z

Hey @incertum that's exactly why I opened this issue. The pain point is that these verifier logs could be misleading... Last time I got the same error math between map_value pointer and register with unbounded min value is not allowed in sys_empty and as you can notice there is not too much code :/

FILLER(sys_empty, true)
{
	return PPM_SUCCESS;
}

also this time in sys_unshare_e

FILLER(sys_unshare_e, true)
{
	unsigned long val;
	u32 flags;
	int res;

	val = bpf_syscall_get_argument(data, 0);
	flags = clone_flags_to_scap(val);
	res = bpf_val_to_ring(data, flags);
	return res;
}

If there is an issue, it is in the bpf_val_to_ring, that's why we want to split it! because even if we fix the val_to_ring for this filler we will probably break another filler (maybe on another kernel with another clang) since this function is shared between all our bpf progs :/

Andreagit97 · 2023-03-17T12:01:38Z

Seen the worrying graphs posted by @incertum here I would start to address it, step by step. WDYT @falcosecurity/libs-maintainers ?

FedeDP · 2023-03-17T12:03:10Z

I agree! I have a local branch with some changes, to split up bpf_ring_val integers related functions and use them in various fillers.
Good thing is, having tests help us in spot any issue we might introduce.
I can push the branch upstream so that we can work together?

Andreagit97 · 2023-03-17T12:07:43Z

I agree! I have a local branch with some changes, to split up bpf_ring_val integers related functions and use them in various fillers.
Good thing is, having tests help us in spot any issue we might introduce.
I can push the branch upstream so that we can work together?

Supercool, yep let's first see what other maintainers think about it :)

incertum · 2023-03-20T07:05:46Z

Re the graphs please also keep in mind that often the probe didn't compile, which sometimes is purely a container and GLIBC issue ... will try exporting 2 graphs in the future, one for "did it compile" vs "did it run".

But regardless I think @Andreagit97 is right with the hunch that we are slightly regressing especially since clang-15 ...

Andreagit97 · 2023-03-20T10:23:54Z

To track it I will assign /milestone next-driver, by the way, we still have to decide how to proceed so we are free to change the milestone in a second step

gnosek · 2023-03-20T12:28:29Z

Let me approach this from a slightly different angle: do we need all the complexity in bpf_val_to_ring? IIRC it's (mostly) a big switch over the different parameter types. What if (hits bong yet again) we replaced all the calls with more explicit calls to type-specific functions (i.e. bpf_u64_to_ring, bpf_user_str_to_ring etc.)?

We then need all the fillers to keep the type safety (schema param type == actual written param type) instead of putting it all in bpf_val_to_ring. I think it's actually a good thing, because right now there's no validation. Instead we coerce the written type to the requested type, whether it makes sense or not and the only check we do is the number of parameters stored vs schema. With type-specific helpers we'd have two approaches:

(in debug builds?) when calling e.g. bpf_u64_to_ring, assert that the next param really is an u64
(in release builds?) yolo and skip the validation entirely; we're not worse off in any case, now the (wrong) type cast will happen in userspace instead.

With (2) we still know the event is well-formed (even if e.g. what we expected to be an u64 is 1000 bytes long), but now we can detect (based on the param length) that the driver sent us the wrong type . This could give us a tiny perf boost and reduced in-kernel complexity (there's no explicit event schema in the kernel any more) with minimal downsides.

note 1: none of this invalidates the above discussion, keep on keeping on!
note 2: this applies equally to kmod (and udig) and val_to_ring, not sure about modern_bpf

EDIT: I see #906 is a step in that direction, yay 🎉

FedeDP · 2023-03-20T12:30:53Z

Yep! This is precisely the road we are pursuing :)

Andreagit97 · 2023-03-20T12:39:04Z

yep, that's the reason why we introduced the driver framework test. The modern probe doesn't have any type of checking or assertions at runtime, we entirely rely on the test case for each syscall! And it seems pretty reasonable to detect this in tests instead of at runtime 😂 One thing I never understood about our drivers is why we make number/type assertions at runtime, in production losing precious clock cycles 😆

incertum · 2023-03-23T21:20:16Z

@Andreagit97 I am open-minded to explore what striking a balance between checks at runtime vs only for tests could look like ... in general still unclear what type of optimizations move the needle, so far most noticeable thing is dropping uninteresting syscalls (obviously), it's likely going to be challenging to really measure subtle differences.

FedeDP · 2023-04-17T11:29:28Z

#1048 was a first good step; as highlighted by @incertum , we are now in a much better situation.
We are still investigating some issues on linux 4.16; therefore i'd leave this one open!

incertum · 2023-04-17T14:39:09Z

Yes let's see what else can bee 🐝 done 😉

FedeDP · 2023-04-27T09:18:44Z

/milestone next-driver

linajiang · 2023-11-18T06:20:38Z

how to diagnoise the problem, i alse got the verifier issues

FedeDP · 2023-11-20T08:19:21Z

@linajiang can you share the verifier log?

poiana · 2024-02-18T09:49:03Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Andreagit97 · 2024-02-26T10:13:45Z

since it seems we reached a good stability I think we can close it for now!

Andreagit97 added the kind/feature New feature or request label Mar 6, 2023

Andreagit97 mentioned this issue Mar 14, 2023

[WIP] Collect correct arguments when setproctitle is called #976

Open

Andreagit97 mentioned this issue Mar 15, 2023

[Tracking] localhost test/vm test suite - occasionally capture results tables #982

Closed

Andreagit97 self-assigned this Mar 20, 2023

Andreagit97 added this to the next-driver milestone Mar 20, 2023

Andreagit97 assigned FedeDP Mar 20, 2023

FedeDP mentioned this issue Apr 5, 2023

[Discussion] Falco libs 32bit app support (ia32 emulation)? falcosecurity/falco#2472

Closed

FedeDP mentioned this issue Apr 13, 2023

cleanup(driver/bpf): split bpf_val_to_ring into many int-pushing helpers #1048

Merged

poiana modified the milestones: 5.0.0+driver, next-driver Apr 27, 2023

Andreagit97 modified the milestones: next-driver, libs-backlog, driver-backlog Jun 7, 2023

Andreagit97 modified the milestones: driver-backlog, TBD Sep 4, 2023

poiana added the lifecycle/stale label Feb 18, 2024

Andreagit97 closed this as completed Feb 26, 2024

Andreagit97 modified the milestones: TBD, 0.15.0 Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE/BUG] Verifier issues with the current bpf probe #940

[FEATURE/BUG] Verifier issues with the current bpf probe #940

Andreagit97 commented Mar 6, 2023 •

edited

Loading

Molter73 commented Mar 6, 2023

FedeDP commented Mar 6, 2023

Andreagit97 commented Mar 6, 2023 •

edited

Loading

incertum commented Mar 6, 2023

Andreagit97 commented Mar 7, 2023

FedeDP commented Mar 7, 2023

incertum commented Mar 7, 2023

Andreagit97 commented Mar 8, 2023 •

edited

Loading

incertum commented Mar 15, 2023 •

edited

Loading

Andreagit97 commented Mar 15, 2023

Andreagit97 commented Mar 17, 2023

FedeDP commented Mar 17, 2023

Andreagit97 commented Mar 17, 2023 •

edited

Loading

incertum commented Mar 20, 2023

Andreagit97 commented Mar 20, 2023

gnosek commented Mar 20, 2023 •

edited

Loading

FedeDP commented Mar 20, 2023

Andreagit97 commented Mar 20, 2023

incertum commented Mar 23, 2023

FedeDP commented Apr 17, 2023

incertum commented Apr 17, 2023

FedeDP commented Apr 27, 2023

linajiang commented Nov 18, 2023

FedeDP commented Nov 20, 2023

poiana commented Feb 18, 2024

Andreagit97 commented Feb 26, 2024

[FEATURE/BUG] Verifier issues with the current bpf probe #940

[FEATURE/BUG] Verifier issues with the current bpf probe #940

Comments

Andreagit97 commented Mar 6, 2023 • edited Loading

Molter73 commented Mar 6, 2023

FedeDP commented Mar 6, 2023

Andreagit97 commented Mar 6, 2023 • edited Loading

incertum commented Mar 6, 2023

Andreagit97 commented Mar 7, 2023

FedeDP commented Mar 7, 2023

incertum commented Mar 7, 2023

Andreagit97 commented Mar 8, 2023 • edited Loading

incertum commented Mar 15, 2023 • edited Loading

Andreagit97 commented Mar 15, 2023

Andreagit97 commented Mar 17, 2023

FedeDP commented Mar 17, 2023

Andreagit97 commented Mar 17, 2023 • edited Loading

incertum commented Mar 20, 2023

Andreagit97 commented Mar 20, 2023

gnosek commented Mar 20, 2023 • edited Loading

FedeDP commented Mar 20, 2023

Andreagit97 commented Mar 20, 2023

incertum commented Mar 23, 2023

FedeDP commented Apr 17, 2023

incertum commented Apr 17, 2023

FedeDP commented Apr 27, 2023

linajiang commented Nov 18, 2023

FedeDP commented Nov 20, 2023

poiana commented Feb 18, 2024

Andreagit97 commented Feb 26, 2024

Andreagit97 commented Mar 6, 2023 •

edited

Loading

Andreagit97 commented Mar 6, 2023 •

edited

Loading

Andreagit97 commented Mar 8, 2023 •

edited

Loading

incertum commented Mar 15, 2023 •

edited

Loading

Andreagit97 commented Mar 17, 2023 •

edited

Loading

gnosek commented Mar 20, 2023 •

edited

Loading