-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new(proposals): driver kernel testing framework #1131
new(proposals): driver kernel testing framework #1131
Conversation
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
/milestone 0.12.0 |
|
||
*Architectures* | ||
|
||
Place higher priority on testing for `x86_64` compared to `aarch64`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhm aarch64
is every day more common so not sure we really want to have a preferred architecture...moreover we are also trying to support s390x architecture, the only issue is that we don't have runners :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, can tweak.
Perhaps another good discussion point: s390x
is supported by libs but not officially by Falco. Maybe I can say that because of this initially we prioritize x86 and aarch64 tests, subject to change in future iterations of the test framework ...
@hbrueckner would you have ideas on how we can test s390x
specifically for libs better? And ideally cut down the durations of s390x libs builds as well, they take a fairly long time atm. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recommendation is:
- Falco officially supported architectures should be prioritized by their adoption level (ie.
x86_64
is P0,aarch64
is P1) - just best effort (low priority) for any other arch supported by libs (ie.
s390x
is Nice-To-Have).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it, this way we clearly articulate priorities.
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
e06e6b3
to
78992e3
Compare
Co-authored-by: Leonardo Grasso <me@leonardograsso.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
|
||
## Outlook | ||
|
||
The following possibilities serve as an outlook for future enhancements. These potential improvements are anticipated after the release of Falco 0.36. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a possible future improvement, we could think about having the kernels grid dynamic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK, I'll work this in with the next cleanup commit as we are collecting feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxgio92 I have worked your suggestions in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you :)
|
||
Select the most appropriate compiler version and build container for the CI-integrated tests. Apart from the compiler version, the GLIBC version in the build container can also have an impact on the ability to compile the driver for a given kernel. | ||
|
||
> The expanded CI tests may necessitate the use of approximately 30 low-resource virtual machines (VMs) that run continuously 24/7. These VMs would be distributed across multiple third-party cloud providers. To adequately cover the condensed kernel test grid, it is estimated that up to 70 test runs would be required for each testing cycle. These tests can be launched using GitHub workflows leveraging SSH remote commands. The test results are then retrieved through this method as well. Initially, it would be logical to support these tests on demand only to avoid simultaneous runs that may try to access the same VM at the same time. In addition to the test VMs, it may be necessary to expand the CI workflows in terms of builder containers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider also to spin up VMs with KVM on hosted GitHub runners, or with something like actuated (on microVMs).
Love this proposal, @incertum, thank you for this great work! |
Love this! Thank you so much for pulling this together! |
Co-authored-by: Massimiliano Giovagnoli <me@maxgio.it> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Thank you @TheFoxAtWork! |
@falcosecurity/libs-maintainers I have moved this PR out of WIP, it is now ready for final review. Thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments; LGTM, thank you @incertum ! Great work!
A minor question: are we talking about adding nightly tests? I think we do not want to run these tests on every PR/master change.
Perhaps weekly tests can be enough too.
Thank you again btw!
|
||
First, let's clarify a few definitions and provide further context. | ||
|
||
- `kernel versions`: In the context of the testing framework, kernel versions refer to changes in the major and minor version of the kernel (e.g., 5.15 or 6.4). These version changes are specifically relevant for testing the Falco drivers, with a particular emphasis on testing with Long-Term Support (LTS) releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd use kernel releases
here; that's what we use elsewhere, and what test-infra dbg, driverkit and kernel-crawler use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to give it a spin, WDYT?
<details> | ||
<summary>Build kernel drivers</summary> | ||
<ul> | ||
<li>Latest Archlinux kernel to spot possible incompatibilities with the latest kernel tree changes</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also nightly-test the build against latest mainline kernel now: https://github.com/falcosecurity/libs/blob/master/.github/workflows/latest-kernel.yml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) adjusted and slick btw!
<li>linux-3.10</li> | ||
<li>linux-4.18</li> | ||
<li>linux-5.19</li> | ||
<li>linux-6.2</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these are on x86_64.
I am looking into adding same tests for arm64 too.
Co-authored-by: Federico Di Pierro <nierro92@gmail.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
where I mentioned it to be "on demand" initially I added "(such as nightly tests)" |
Thanks for the clarification about the |
What needs to happen for us to move forward with this? |
@TheFoxAtWork as a first step as suggested we created a new dedicated CNCF Service Desk ticket https://cncfservicedesk.atlassian.net/servicedesk/customer/portal/1/CNCFSD-1822. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@incertum Thank you so much for this amazing proposal. 🙏
LGTM, just left a minor suggestion.
I'll give it a second look early next week then I will give my final approval.
Co-authored-by: Leonardo Grasso <me@leonardograsso.com> Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
LGTM label has been added. Git tree hash: d98ddee4f8accbc223b8c4c0b7eaef6d92b873e9
|
|
||
### Test Category 2 | ||
|
||
Verifying that the kernel driver can load, run, and capture events without errors. This is determined through [scap-open](https://github.com/falcosecurity/libs/tree/master/userspace/libscap/examples/01-open) and unit tests conducted in virtual machine (VM) environments. In essence, when we mention that the "driver loads and runs", it implies that the scap-open counter for captured events during a test run is positive and that the [drivers_test](https://github.com/falcosecurity/libs/tree/master/test/drivers) unit tests pass. The latter tests not only load the driver live but also simulate syscall events and verify that the expected information is extracted from the kernel tracepoint and retrieved by the libscap driver type-specific engine in userspace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run is positive and that the drivers_test unit tests
Just one minor note about this. We know that on older kernel versions ~<5.0
driver_tests could fail due to conflicting events captured. Let me explain better, is possible that while we search for a close
event spawned by our test we face another clone
event generated by the system causing a test failure (as far as i saw this is very frequent when we use tracepoints
instead of raw_traceoints
in the old probe, so in kernels between 4.14
and 4.17
) ... so probably we won't be able to rely on the drivers_test on all machines as a source of truth. We will try to fix driver_tests to avoid these failures but in the meanwhile, I would consider scap-open as the only source of truth and use the drivers_tests as a best-effort check. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Andreagit97 I agree, driver_tests
should be best effort at first and we can stabilize it subsequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
I left just one comment on the driver_tests
part. Please unhold if you are fine with this :)
Thank you for the amazing job!
/hold
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Andreagit97, incertum, leogr The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/unhold |
What type of PR is this?
/kind design
Any specific area of the project related to this PR?
/area proposals
Does this PR require a change in the driver versions?
What this PR does / why we need it:
Formalize a driver kernel testing framework.
@falcosecurity/libs-maintainers
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: