[PROPOSAL] Formalize container engine testing framework similar to the new kernel version testing framework #1298

Open
incertum opened this issue Aug 15, 2023 · 5 comments
Labels
kind/feature New feature or request lifecycle/stale

@incertum
Contributor

Motivation

The implementation of the formal kernel version testing framework (https://github.com/falcosecurity/libs/blob/master/proposals/20230530-driver-kernel-testing-framework.md) has had a highly positive impact on the overall progress and stability of The Falco Project.

I am proposing a similar testing framework for container engines, with a specific focus on maintaining the expected functionality of each engine for a particular container runtime.

This testing would be crucial not only for identifying regressions but also for demonstrating how reliable the container engine is, since there is no expectation that it be flawless. At the same time, we need to understand the scenarios and conditions under which we may fail to retrieve container information. That understanding will help establish a form of Service Level Objective (SLO) for adopters. For instance, we might provide weaker guarantees for edge-case race conditions than for a container that runs for 30 days without its information ever becoming available. The latter is an example of an opportunity to improve the engine's robustness. Since perfection is unattainable, a data-driven approach will also help set escalation thresholds for reported container engine issues.
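
To make the SLO idea concrete, here is a minimal sketch of how enrichment results could be turned into per-bucket success rates and compared against targets. Everything in it is an assumption for illustration: the `ContainerObservation` record, the 5-second cutoff for "short-lived" containers, and both target values are hypothetical and are not part of any existing Falco API or agreed policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ContainerObservation:
    """One container seen during a test run (hypothetical record type)."""
    container_id: str
    lifetime_seconds: float  # how long the container was observed running
    enriched: bool           # True if container info was ever resolved


# Hypothetical values, chosen only to illustrate looser vs. stricter guarantees.
SHORT_LIVED_SECONDS = 5.0   # below this, a race with the engine is plausible
SLO_SHORT_LIVED = 0.90      # looser target for edge-case races
SLO_LONG_LIVED = 0.999      # stricter target for long-running containers


def enrichment_rate(observations: List[ContainerObservation]) -> Optional[float]:
    """Fraction of containers whose info was resolved, or None if no data."""
    if not observations:
        return None
    return sum(1 for o in observations if o.enriched) / len(observations)


def evaluate_slo(observations: List[ContainerObservation]) -> Dict[str, dict]:
    """Split observations by lifetime and compare each bucket to its target."""
    short = [o for o in observations if o.lifetime_seconds < SHORT_LIVED_SECONDS]
    long_ = [o for o in observations if o.lifetime_seconds >= SHORT_LIVED_SECONDS]
    report = {}
    for name, group, target in (
        ("short_lived", short, SLO_SHORT_LIVED),
        ("long_lived", long_, SLO_LONG_LIVED),
    ):
        rate = enrichment_rate(group)
        report[name] = {
            "count": len(group),
            "rate": rate,
            "target": target,
            # An empty bucket cannot violate its target.
            "met": rate is None or rate >= target,
        }
    return report
```

A bucket that misses its target would then be a candidate for escalation, while a miss inside the short-lived bucket that still stays above its looser target would not.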

Feature

Set up a testbed to evaluate the following:

  • Test accurate and reliable container information enrichment for two scenarios: (1) the container was already running before the agent launched, and (2) the container launches after the agent has started.
  • The above should include verifying each supported field's accuracy, similar to the existing test/drivers unit tests.
  • Assess each officially supported container engine, prioritizing certain container runtimes as P1 (e.g., containerd, cri-o, docker) and labeling the others "best effort".
  • Perform semi-realistic tests on a Kubernetes server hosting multiple pods. These tests aim to observe continuous enrichment of container information over an extended period (e.g., several hours), covering stable pods as well as pods coming up and going down. Apply upper limits as per https://kubernetes.io/docs/setup/best-practices/cluster-large/. However, reaching 110 pods per node with multiple containers per pod is unlikely; a more realistic expectation is a maximum of around 100-150 containers per node.

Note: Parallel testing may be applicable to certain runtimes, while for others, individual assessments are required.
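
The engine-by-scenario coverage described above could be enumerated roughly as in the sketch below. It is only an illustration: the `EngineCase` type, the helper names, and the scenario labels are hypothetical, the best-effort engines listed are placeholders rather than an agreed list, and the field names in the usage example are just a small illustrative subset.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List

# P1 runtimes as named in the proposal.
P1_ENGINES = ("containerd", "cri-o", "docker")
# Placeholder entries (assumption); the real "best effort" list is undecided.
BEST_EFFORT_ENGINES = ("podman", "lxc")
# The two scenarios from the first bullet (labels are hypothetical).
SCENARIOS = ("container_before_agent", "container_after_agent")


@dataclass(frozen=True)
class EngineCase:
    engine: str
    scenario: str
    priority: str  # "P1" or "best-effort"


def build_matrix() -> List[EngineCase]:
    """Cross every engine with every scenario and tag its priority."""
    cases = []
    for engine, scenario in product(P1_ENGINES + BEST_EFFORT_ENGINES, SCENARIOS):
        priority = "P1" if engine in P1_ENGINES else "best-effort"
        cases.append(EngineCase(engine, scenario, priority))
    return cases


def is_blocking(case: EngineCase) -> bool:
    """Assumed policy: only P1 failures should fail the test run."""
    return case.priority == "P1"


def field_mismatches(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the names of expected fields that are missing or differ."""
    return sorted(k for k, v in expected.items() if actual.get(k) != v)
```

For example, `field_mismatches({"container.id": "abc", "container.name": "web"}, {"container.id": "abc"})` reports `container.name` as missing, which is the per-field check the second bullet asks for. Whether the cases run in parallel or one at a time would be decided per runtime, as the note above says.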

CC @falcosecurity/core-maintainers

@incertum
Contributor Author

incertum commented Dec 1, 2023

@jasondellaluce, @Andreagit97, and others: it may be time for better container engine testing. We keep breaking it; see the latest oversight (that one was on me): #1535

@leogr
Member

leogr commented Dec 5, 2023

Hey @incertum

This is really interesting. Have we already collected a list of regressions we have encountered? 🤔 It would be useful to understand which aspects to focus on more.

@poiana
Contributor

poiana commented Mar 4, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@incertum
Contributor Author

incertum commented Mar 5, 2024

/remove-lifecycle stale

Some new e2e test efforts are a WIP @therealbobo

@poiana
Contributor

poiana commented Jun 3, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale
