From cc1e3f990f8a66fb819e292e8a412564e01d8c37 Mon Sep 17 00:00:00 2001
From: Andrea Terzolo
Date: Wed, 5 Nov 2025 10:29:24 +0100
Subject: [PATCH 1/2] first proposal draft

Signed-off-by: Andrea Terzolo
---
 tetragon/CFP-4191-per-workload-policy.md | 143 +++++++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100644 tetragon/CFP-4191-per-workload-policy.md

diff --git a/tetragon/CFP-4191-per-workload-policy.md b/tetragon/CFP-4191-per-workload-policy.md
new file mode 100644
index 0000000..ac91b45
--- /dev/null
+++ b/tetragon/CFP-4191-per-workload-policy.md
@@ -0,0 +1,143 @@
+# CFP-4191: Tetragon per-workload policies
+
+**SIG:** SIG-TETRAGON
+
+**Begin Design Discussion:** 2025-11-05
+
+**Tetragon Release:** X.XX
+
+**Authors:** Andrea Terzolo, Kornilios Kourtis
+
+**Status:** Draft
+
+## Summary
+
+Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values. This design proposes a new model to decouple the enforcement logic from per‑workload values, enabling significant reductions in eBPF programs and map memory usage.
+
+## Motivation
+
+The current approach of one `TracingPolicy` per workload leads to:
+
+- Program scaling limits: At least one eBPF program per policy. In clusters with hundreds of workloads, this may add latency when the hook is the hot path.
+- Map memory scaling: Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.
+- Operational friction: Logic is identical but values differ (for example, allowed binaries), yet the model duplicates programs and maps.
+
+## Goals
+
+- Decouple shared enforcement logic from per‑workload values to avoid linear growth in programs and map memory as clusters scale.
+- Retain TracingPolicy expressiveness (selectors, filters, actions).
+
+## Non-Goals
+
+- TBD
+
+## Proposal
+
+### Overview
+
+This design is accompanied by a [proof of concept](https://github.com/cilium/tetragon/issues/4191). It introduces two concepts: templates and bindings.
+
+A template is a `TracingPolicy` that declares variables populated at runtime rather than hardcoded at load time. Selectors reference these variables by name.
+
+```yaml
+apiVersion: cilium.io/v1alpha1
+kind: TracingPolicy
+metadata:
+  name: "block-process-template"
+spec:
+  variables:
+  - name: "targetExecPaths"
+    type: "linux_binprm" # this could be used for extra validation but it's probably not strictly necessary
+  kprobes:
+  - call: "security_bprm_creds_for_exec"
+    syscall: false
+    args:
+    - index: 0
+      type: "linux_binprm"
+    selectors:
+    - matchArgs:
+      - index: 0
+        operator: "Equal"
+        valuesFromVariable: "targetExecPaths"
+```
+
+Deploying a template alone has no runtime effect because it lacks concrete values for comparisons.
+
+A binding is a new resource (for example, `TracingPolicyBinding`) that supplies concrete values for a template’s variables and targets specific workloads through a selector. The selector is not the same as a podSelector; it is intended to be mutually exclusive across bindings.
+
+```yaml
+apiVersion: cilium.io/v1alpha1
+kind: TracingPolicyBinding
+metadata:
+  name: "block-process-template-values-1"
+spec:
+  policyTemplateRef:
+    name: "block-process-template"
+  exclusiveSelector: # it should be mutually exclusive across bindings
+    matchLabels:
+      app: "my-app-1"
+  bindings:
+  - name: "targetExecPaths"
+    values:
+    - "/usr/bin/true"
+    - "/usr/bin/ls"
+```
+
+Policy logic becomes active only when a `TracingPolicyBinding` is present. The binding populates eBPF maps with the specified values for cgroups that match its selector.
+
+### Details
+
+- When a template is deployed, Tetragon loads the same eBPF programs and maps as today. Additionally, it creates a `BPF_MAP_TYPE_HASH` map (`cg_to_policy_map`) to map cgroup IDs to binding `policy_id`s. Initially, the map is empty, so the template has no effect.
+- When a binding is deployed:
+  - It receives a new `policy_id`.
+  - For each cgroup matching its selector, Tetragon inserts a (`cgroup_id` → `policy_id`) entry into `cg_to_policy_map`. If a cgroup already has a binding for that template, the new binding is rejected.
+  - This mapping activates the template logic for the targeted cgroups.
+  - Binding values are stored in `BPF_MAP_TYPE_HASH_OF_MAPS` instances (`pol_str_maps_*`). This implementation is very specific to string/charbuf/filename types and the eq/neq operators, but the concept can be extended to other types/operators; more on this later.
+  - These maps are keyed by `policy_id` (looked up via `cg_to_policy_map`). The value is a hash set of strings sourced from the binding, using the same 11‑bucket sizing technique as the existing `string_maps_*`.
+
+### Results
+
+This design enables:
+
+1. Single eBPF program: One shared program can serve many bindings (for example, 512–1024+), as they reference the same template. This drastically reduces the number of programs loaded into the kernel.
+2. Low memory overhead: Per‑binding cost is small: entries in `cg_to_policy_map` plus `pol_str_maps_*` (typically a few KB per binding for modest value lists).
+
+## Impacts / Key Questions
+
+### Impact: new `TracingPolicyBinding` CRD
+
+Adding a `TracingPolicyBinding` CRD introduces a new concept. Users must understand the relationship between templates and bindings and how to manage them effectively.
+
+### Key Question: Is the new CRD necessary?
+
+Could existing `TracingPolicy` mechanisms be extended to carry only values and reference a shared logic template, or is a dedicated CRD essential for clarity and usability?
+
+### Impact: selectors in bindings are intended to be mutually exclusive
+
+In this model, a `cgroup_id` can be associated with at most one binding (`policy_id`) per template. A new binding for the same cgroup is rejected. Binding the same cgroup to multiple policies simultaneously is not allowed.
+
+### Key Question: Is this an acceptable limitation?
+
+- Does exclusivity limit important use cases?
+- If overlapping bindings are useful, how should precedence or merging be defined?
+- How can we enforce idempotency/ordering between multiple Tetragon agents?
+
+### Impact: partial support for variable types and operators
+
+The proposed binding logic is currently limited to:
+
+- matchArgs / matchData filters
+- String / charbuf / filename types
+- eq / neq operators
+
+### Key Question: Is full support required for the first version?
+
+Extending the binding logic to other types/operators would require different eBPF maps/approaches.
+If the design of the API is flexible enough to allow future extensions without breaking changes, could we start with a limited set and expand later based on user needs?
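+
+As a rough, hypothetical illustration (not part of the current proof of concept), supporting a non-string type would mostly mean adding another specialized map alongside `pol_str_maps_*`, while bindings would keep the same shape (a variable name plus a list of values). For example, a set of numeric values could be backed by something like:
+
+```c
+// Hypothetical sketch only: map names and sizes are placeholders.
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+/* inner map: a set of u32 values supplied by one binding */
+struct binding_u32_set {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 256);
+    __type(key, __u32);
+    __type(value, __u8);
+};
+
+/* policy_id -> per-binding u32 set, analogous to the string-valued maps */
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+    __uint(max_entries, 1024);
+    __type(key, __u32);
+    __array(values, struct binding_u32_set);
+} pol_u32_maps SEC(".maps");
+```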
+
+### Impact: multiple bindings per template
+
+Currently, only one binding is supported for each template.
+
+### Key Question: Do we need multiple bindings per template?
+
+The design should be extensible to support multiple bindings without API changes. I'm not sure multi-binding support would really be needed in practice; for this reason, I would avoid complicating the code too much until we have a real use case for it.

From 4991b833a1fddbbe51a1063be0074da5e764d7ec Mon Sep 17 00:00:00 2001
From: Andrea Terzolo
Date: Fri, 14 Nov 2025 09:27:05 +0100
Subject: [PATCH 2/2] refactor: split the document into 2 problems (P1/P2)

Signed-off-by: Andrea Terzolo
Co-authored-by: Kornilios Kourtis
---
 tetragon/CFP-4191-per-workload-policy.md | 237 +++++++++++++++++------
 1 file changed, 178 insertions(+), 59 deletions(-)

diff --git a/tetragon/CFP-4191-per-workload-policy.md b/tetragon/CFP-4191-per-workload-policy.md
index ac91b45..772d188 100644
--- a/tetragon/CFP-4191-per-workload-policy.md
+++ b/tetragon/CFP-4191-per-workload-policy.md
@@ -12,36 +12,157 @@

 ## Summary

-Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values. This design proposes a new model to decouple the enforcement logic from per‑workload values, enabling significant reductions in eBPF programs and map memory usage.
+Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values.

 ## Motivation

-The current approach of one `TracingPolicy` per workload leads to:
+The current approach of one `TracingPolicy` per workload leads to two main problems:

-- Program scaling limits: At least one eBPF program per policy. In clusters with hundreds of workloads, this may add latency when the hook is the hot path.
-- Map memory scaling: Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.
-- Operational friction: Logic is identical but values differ (for example, allowed binaries), yet the model duplicates programs and maps.
+- P1 (Scaling): The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).
+
+- P2 (UX / Composition): Crafting many near-identical `TracingPolicy` resources with only small differences (e.g. filter values) is operationally cumbersome and error-prone.
+
+In this document we propose to tackle scaling (P1) and UX (P2) separately.

 ## Goals

-- Decouple shared enforcement logic from per‑workload values to avoid linear growth in programs and map memory as clusters scale.
-- Retain TracingPolicy expressiveness (selectors, filters, actions).
+- Reduce the number of attached eBPF programs and the map footprint when a common enforcement logic is shared across many workloads.
+- Expose a friendly API to reuse the common parts of policies, so that users do not have to rewrite the same logic multiple times.
+- Preserve existing expressiveness (selectors, filters, actions).
+- Minimize API churn.

 ## Non-Goals

 - TBD

-## Proposal
+## P1: Scaling the BPF Implementation
+
+The current Tetragon implementation creates at least one BPF program and several maps per `TracingPolicy` resource.
+This has some limitations:
+
+- Many eBPF programs attached to the same hook may add latency when the hook is on the hot path.
+- If eBPF programs rely on the eBPF trampoline, they are subject to the [`BPF_MAX_TRAMP_LINKS`](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138) limit (38 on x86), so in some cases only a limited number of programs can be attached to the same hook.
+- Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.
+
+### Option 1: Share eBPF Programs & Maps across different TracingPolicies
+
+For simplicity, this design introduces two concepts: templates and bindings.
+
+- A template is a way to define the enforcement logic we want to share across policies. It loads the eBPF programs plus some maps that will be populated later by bindings.
+- A binding is a way to provide specific values for the enforcement logic.
+
+When a template is deployed, Tetragon creates a `BPF_MAP_TYPE_HASH` map (`cg_to_policy_map`) to map cgroup IDs to binding IDs. Initially, the map is empty, so the template has no effect.
+
+When a binding is deployed, for each cgroup matching the binding, Tetragon inserts a (`cgroup_id` → `binding_id`) entry into `cg_to_policy_map`.
+
+If a cgroup already has a binding for that template, there could be different options to handle the conflict:
+
+1. Rollback of both bindings involved in the conflict. This ensures that the system remains in a consistent state and avoids idempotency/ordering issues between multiple Tetragon agents.
+2. Take the binding with the highest priority. Each binding should have a predefined priority value.
+
+Once the cgroup is associated with the binding, the id is used to look up the binding values in specialized maps.
+Depending on the type of values and operators used in the template, different specialized maps are created to store the binding values.
+For example, for string eq/neq filters, Tetragon could create a `BPF_MAP_TYPE_HASH_OF_MAPS` map where the key is the `binding_id` and the value is a hash set of strings sourced from the binding. Each binding will create an entry in this map. This map could use the same 11‑bucket sizing technique as the existing `string_maps_*` maps.
+
+This [proof of concept](https://github.com/cilium/tetragon/issues/4191) shows a preliminary implementation of this design.
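+
+To make this more concrete, the maps and the lookup flow could look roughly like the following sketch (BTF-style map definitions; names, sizes, and key encodings are illustrative placeholders, not the exact PoC code):
+
+```c
+// Illustrative sketch only: names, sizes, and key layouts are placeholders.
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+#define VALUE_KEY_SIZE 128 /* fixed-size padded key; the PoC uses several bucket sizes */
+
+/* cgroup id -> binding id, one entry per cgroup targeted by a binding */
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 32768);
+    __type(key, __u64);   /* cgroup id */
+    __type(value, __u32); /* binding id */
+} cg_to_policy_map SEC(".maps");
+
+/* inner map: a hash set holding the string values supplied by one binding */
+struct binding_string_set {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 1024);
+    __uint(key_size, VALUE_KEY_SIZE);
+    __uint(value_size, 1);
+};
+
+/* binding id -> per-binding string set */
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+    __uint(max_entries, 1024);
+    __type(key, __u32);
+    __array(values, struct binding_string_set);
+} pol_str_maps SEC(".maps");
+
+/* Rough lookup flow inside the single shared hook program. */
+static __always_inline bool binding_value_matches(const char *key)
+{
+    __u64 cgid = bpf_get_current_cgroup_id();
+    __u32 *binding_id = bpf_map_lookup_elem(&cg_to_policy_map, &cgid);
+    if (!binding_id)
+        return false; /* no binding for this cgroup: the template stays inactive */
+
+    void *values = bpf_map_lookup_elem(&pol_str_maps, binding_id);
+    if (!values)
+        return false;
+
+    /* membership test against the values supplied by the binding */
+    return bpf_map_lookup_elem(values, key) != NULL;
+}
+
+SEC("kprobe/security_bprm_creds_for_exec")
+int shared_template_hook(struct pt_regs *ctx)
+{
+    char path[VALUE_KEY_SIZE] = {0}; /* filled from the binprm in the real program */
+
+    if (!binding_value_matches(path))
+        return 0; /* selector does not match for this cgroup's binding */
+    /* matchActions (for example Override) would run here */
+    return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
+```
+
+With this layout, adding or removing a binding never loads new programs; only `cg_to_policy_map` and the per-binding inner maps change.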
+
+**Pros**:
+
+- Small code changes to the existing Tetragon eBPF programs.
+- Possibility to change the binding values at runtime by updating the specialized maps, without reloading the eBPF programs.
+- ...
+
+**Cons**:
+
+- Different types of values and operators require different specialized maps, increasing implementation complexity.
+- ...
+
+### Option 2: Tail Call policy chains
+
+TBD
+
+**Pros**:
+
+- TBD
+
+**Cons**:

-### Overview
+- TBD

-This design is accompanied by a [proof of concept](https://github.com/cilium/tetragon/issues/4191). It introduces two concepts: templates and bindings.
+## P2: Avoid Repetition in Policy Definitions

-A template is a `TracingPolicy` that declares variables populated at runtime rather than hardcoded at load time. Selectors reference these variables by name.
+The current Tetragon `TracingPolicy` API requires users to repeat the same enforcement logic for each workload, changing only the values (for example, allowed binaries). The following is an example of two policies that share the same logic but have different values:

+The current Tetragon `TracingPolicy` API requires users to repeat the same enforcement logic for each workload, changing only the values (for example, allowed binaries). The following is an example of two policies that share the same logic but have different values: ```yaml apiVersion: cilium.io/v1alpha1 kind: TracingPolicy +metadata: + name: "policy-1" +spec: + podSelector: + matchLabels: + app: "my-deployment-1" + kprobes: + - call: "security_bprm_creds_for_exec" + syscall: false + args: + - index: 0 + type: "linux_binprm" + selectors: + - matchArgs: + - index: 0 + operator: "NotEqual" + values: + - "/usr/bin/sleep" + - "/usr/bin/cat" + - "/usr/bin/my-server-1" + matchActions: + - action: Override + argError: -1 + options: + - name: disable-kprobe-multi + value: "1" +--- +apiVersion: cilium.io/v1alpha1 +kind: TracingPolicy +metadata: + name: "policy-2" +spec: + podSelector: + matchLabels: + app: "my-deployment-2" + kprobes: + - call: "security_bprm_creds_for_exec" + syscall: false + args: + - index: 0 + type: "linux_binprm" + selectors: + - matchArgs: + - index: 0 + operator: "NotEqual" + values: + - "/usr/bin/ls" + - "/usr/bin/my-server-2" + matchActions: + - action: Override + argError: -1 + options: + - name: disable-kprobe-multi + value: "1" +``` + +### Option 1: Template + Binding + +We could introduce two new CRDs: `TracingPolicyTemplate` and `TracingPolicyBinding`. + +#### `TracingPolicyTemplate` + +A `TracingPolicyTemplate` specifies variables which can be populated at runtime, rather than being hardcoded at load time. Selectors within the policy reference these variables by name. + +```yaml +apiVersion: cilium.io/v1alpha1 +kind: TracingPolicyTemplate metadata: name: "block-process-template" spec: @@ -61,21 +182,31 @@ spec: valuesFromVariable: "targetExecPaths" ``` -Deploying a template alone has no runtime effect because it lacks concrete values for comparisons. +#### `TracingPolicyBinding` -A binding is a new resource (for example, `TracingPolicyBinding`) that supplies concrete values for a template’s variables and targets specific workloads through a selector. The selector is not the same as a podSelector; it is intended to be mutually exclusive across bindings. +A `TracingPolicyBinding` associates a `TracingPolicyTemplate` with concrete values for its variables and a workload selector. ```yaml apiVersion: cilium.io/v1alpha1 kind: TracingPolicyBinding metadata: + #