[FEATURE/BUG] Verifier issues with the current bpf probe #940
Comments
Honestly, looking at #906 it doesn't look like a solution, it looks like code duplication hell. The proposed solution of going on a code rewrite spree in the eBPF probe seems like something that will take A LOT of resources and time, ultimately making the code even more complex, harder to maintain AND potentially adding yet more places where something could go wrong. If the problem is that changes to clang and the driver cause issues with the verifier, then IMO we need to test the probe on a larger spectrum of kernels and compile it with as many compiler versions as possible, fixing issues as they are caught. But that's just my opinion, obviously. |
I fully agree with the proposed solution:
@Molter73 it will, but we are in a situation where each small change could lead to breakage. Moreover, we would have similar APIs between the old and the modern probe, leading to cleaner code.
This is super hard! First of all, each clang/Linux kernel patch release could kill us; moreover, maintaining such a test matrix is a huge PITA; this is a 3-dimensional matrix: { clang version, kernel version, architecture }. Of course, #906 seems to contain code duplication; that is just because it is a first pass. The resulting API, as said earlier, should be very similar to the modern bpf one. |
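To make the size of that 3-dimensional matrix concrete, here is a minimal userspace sketch in C. The listed versions and architectures are sample values picked for illustration only (they are an assumption, not the project's actual CI matrix), and `matrix_size` / `print_matrix` are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* Sample values for illustration only; a real matrix would be far larger. */
static const char *const clang_versions[] = { "clang-12", "clang-14", "clang-15" };
static const char *const kernel_versions[] = { "4.16", "4.19", "5.4" };
static const char *const architectures[] = { "x86_64", "aarch64" };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Number of { clang version, kernel version, architecture } combinations. */
static size_t matrix_size(void)
{
	return COUNT(clang_versions) * COUNT(kernel_versions) * COUNT(architectures);
}

/* Print one line per combination that would need a build-and-load test. */
static void print_matrix(void)
{
	for (size_t c = 0; c < COUNT(clang_versions); c++)
		for (size_t k = 0; k < COUNT(kernel_versions); k++)
			for (size_t a = 0; a < COUNT(architectures); a++)
				printf("%s / %s / %s\n", clang_versions[c],
				       kernel_versions[k], architectures[a]);
}
```

Every new entry on any axis multiplies the total, which is why each clang or kernel patch release makes such a matrix more expensive to keep green.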
The problem is that the last time we fixed it with @FedeDP we lost a day finding a "reliable patch" (#858) :/ So the issue is not the clang version but that we are hitting the verifier complexity limit on many kernels... We put on a bunch of plasters but the dam is starting to leak :( If we are able to enrich our testing matrix I would be super happy, but I think this is an orthogonal topic :/ BTW this is just a proposal; if we are not on board, we can maybe find something else |
Would love to see a balance. I get it that we have to find a better solution here; at the same time, echoing @Molter73, it will drain resources from other tasks that in the grander scheme of things need to be prioritized more as well. Currently some community feature requests are not even in any queue, because of ongoing refactors and limited engineering resources. Therefore, could we find a way to do it more slowly while not sacrificing prioritization of features that help make Falco more useful for end users in terms of "catching bad stuff"? Again, just trying to see if we can find a balance. For instance, our refactor budget for Falco 0.35 has already been exhausted from my perspective; it would make sense to check on outstanding features we should add instead, and then for the next libs and Falco release 0.36 we can mix in refactors again? Same disclaimer as Mauro, it's obviously also just my opinion and observation. |
That's a good question to which I have no answer... I've seen these verifier issues in the past weeks :/ Maybe if we don't touch the probe anymore we will have something stable for the release, but I'm not sure about that; IMHO we still have the socketcall management (#811) to merge for the next release. My only concern is that if we find something bad near the release it would be much more difficult to do this refactor because we won't have time to test it :/ |
Moreover, note that I think this is an effort that can be done in the background, like 1 PR per week until we are done, perhaps porting away a single type from |
@Andreagit97 and @FedeDP would be on board with the slow and steady 1 PR / week, sequenced by highest impact (if possible) plus not sacrificing on features we still have in the queue for next release and other minor improvements that aim to make the tool more user friendly and address current top pain points amongst adopters. |
Let's say that if we don't find further breaks we can postpone this fix. For this release I expect 2 other changes in the probe:
They shouldn't be disruptive; if the verifier doesn't complain we can postpone the fix to the next release, but I'm still convinced that at some point in time we will need it :( |
Hey @Andreagit97 got back to working on the While running tests today I noticed the following issue with clang-15 on a 5.4 kernel and a bunch of the Ubuntu kernels such as 4.19, 4.16 etc.
Can check over the next days, but wanted to generally share that from my perspective the verifier always complained when we forgot to initialize variables to 0 and/or confusion with |
Hey @incertum that's exactly why I opened this issue. The pain point is that these verifier logs could be misleading... Last time I got the same error

FILLER(sys_empty, true)
{
	return PPM_SUCCESS;
}

also this time in

FILLER(sys_unshare_e, true)
{
	unsigned long val;
	u32 flags;
	int res;

	val = bpf_syscall_get_argument(data, 0);
	flags = clone_flags_to_scap(val);
	res = bpf_val_to_ring(data, flags);
	return res;
}

If there is an issue, it is in the |
I agree! I have a local branch with some changes, to split up |
Supercool, yep let's first see what other maintainers think about it :) |
Re the graphs please also keep in mind that often the probe didn't compile, which sometimes is purely a container and GLIBC issue ... will try exporting 2 graphs in the future, one for "did it compile" vs "did it run". But regardless I think @Andreagit97 is right with the hunch that we are slightly regressing especially since clang-15 ... |
To track it I will assign |
Let me approach this from a slightly different angle: do we need all the complexity in bpf_val_to_ring? IIRC it's (mostly) a big We then need all the fillers to keep the type safety (schema param type == actual written param type) instead of putting it all in bpf_val_to_ring. I think it's actually a good thing, because right now there's no validation. Instead we coerce the written type to the requested type, whether it makes sense or not and the only check we do is the number of parameters stored vs schema. With type-specific helpers we'd have two approaches:
With (2) we still know the event is well-formed (even if e.g. what we expected to be a u64 is 1000 bytes long), but now we can detect (based on the param length) that the driver sent us the wrong type. This could give us a tiny perf boost and reduced in-kernel complexity (there's no explicit event schema in the kernel any more) with minimal downsides. Note 1: none of this invalidates the above discussion, keep on keeping on! EDIT: I see #906 is a step in that direction, yay 🎉 |
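The length-based detection described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the event layout (a 16-bit length prefix followed by the payload), the `param_view` struct and the `write_u64`, `write_u32`, `read_param` and `looks_like_u64` names are all hypothetical and are not the actual Falco event format or API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a 16-bit length prefix followed by the payload bytes. */
struct param_view {
	uint16_t len;
	const uint8_t *payload;
};

/* Producer side: type-specific helpers, no schema lookup needed. */
static size_t write_u64(uint8_t *buf, uint64_t v)
{
	uint16_t len = sizeof(v);

	memcpy(buf, &len, sizeof(len));
	memcpy(buf + sizeof(len), &v, sizeof(v));
	return sizeof(len) + sizeof(v);
}

static size_t write_u32(uint8_t *buf, uint32_t v)
{
	uint16_t len = sizeof(v);

	memcpy(buf, &len, sizeof(len));
	memcpy(buf + sizeof(len), &v, sizeof(v));
	return sizeof(len) + sizeof(v);
}

/* Consumer side: parse the length prefix and point at the payload. */
static struct param_view read_param(const uint8_t *buf)
{
	struct param_view p;

	memcpy(&p.len, buf, sizeof(p.len));
	p.payload = buf + sizeof(p.len);
	return p;
}

/* Returns 1 when the length matches what a u64 parameter must have. */
static int looks_like_u64(const struct param_view *p)
{
	return p->len == sizeof(uint64_t);
}
```

In this sketch the producer never consults a schema; the consumer still sees a well-formed parameter and can flag a mismatch (for instance a 4-byte value where a u64 was expected) purely from the stored length.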
Yep! This is precisely the road we are pursuing :) |
yep, that's the reason why we introduced the driver framework test. The modern probe doesn't have any type of checking or assertions at runtime, we entirely rely on the test case for each syscall! And it seems pretty reasonable to detect this in tests instead of at runtime 😂 One thing I never understood about our drivers is why we make number/type assertions at runtime, in production losing precious clock cycles 😆 |
@Andreagit97 I am open-minded to explore what striking a balance between checks at runtime vs only for tests could look like ... in general still unclear what type of optimizations move the needle, so far most noticeable thing is dropping uninteresting syscalls (obviously), it's likely going to be challenging to really measure subtle differences. |
Yes let's see what else can bee 🐝 done 😉 |
/milestone next-driver |
How can I diagnose the problem? I also got the verifier issues |
@linajiang can you share the verifier log? |
Issues go stale after 90d of inactivity. Mark the issue as fresh with Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with Provide feedback via https://github.com/falcosecurity/community. /lifecycle stale |
since it seems we reached a good stability I think we can close it for now! |
Motivation
Here we go again! Recently we faced different verifier issues in the current bpf probe :(
These are some examples:
I can see 2 main causes here:
IMHO we cannot withstand many more changes before reaching the complexity limit... this is the reason why we crafted the modern bpf probe from scratch! BTW since almost all users still depend on the current probe, I think we need to find a temporary solution to this issue. One idea that could allow us to decrease the complexity a little bit could be to split the bpf_val_to_ring helper into dedicated helper functions. I've already started this work some time ago (#906) to address this kind of issue, but I think it could be time to extend this approach to the whole codebase! This refactor shouldn't be too huge and it should buy us further time before the final 💥
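As a rough illustration of the idea, the following userspace C sketch contrasts a single generic helper that branches on the parameter type with dedicated per-type helpers. It is not the real probe code: `ring_buf`, `param_type`, `val_to_ring_generic`, `push_u32` and `push_u64` are hypothetical stand-ins, and the real bpf_val_to_ring handles many more types and runs under verifier constraints that this sketch does not model.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, userspace-only stand-ins; not the real driver types. */
enum param_type { PT_U32, PT_U64 };

struct ring_buf {
	uint8_t data[64];
	size_t off;
};

/* Dedicated style: each helper has one short, straight-line path. */
static int push_u32(struct ring_buf *rb, uint32_t v)
{
	if (rb->off + sizeof(v) > sizeof(rb->data))
		return -1;
	memcpy(rb->data + rb->off, &v, sizeof(v));
	rb->off += sizeof(v);
	return 0;
}

static int push_u64(struct ring_buf *rb, uint64_t v)
{
	if (rb->off + sizeof(v) > sizeof(rb->data))
		return -1;
	memcpy(rb->data + rb->off, &v, sizeof(v));
	rb->off += sizeof(v);
	return 0;
}

/* Generic style: every caller goes through the whole type switch,
 * even when it statically knows which type it is writing. */
static int val_to_ring_generic(struct ring_buf *rb, uint64_t val,
			       enum param_type type)
{
	switch (type) {
	case PT_U32:
		return push_u32(rb, (uint32_t)val);
	case PT_U64:
		return push_u64(rb, val);
	}
	return -1;
}
```

A filler that knows it is storing a 32-bit flag would call `push_u32` directly, so the code analyzed for that filler no longer includes the branches for every other type.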
WDYT? @FedeDP @incertum @Molter73 @hbrueckner @leogr
Feature
Split the bpf_val_to_ring helper into dedicated helper functions like in the modern BPF probe.