From cc1e3f990f8a66fb819e292e8a412564e01d8c37 Mon Sep 17 00:00:00 2001
From: Andrea Terzolo
Date: Wed, 5 Nov 2025 10:29:24 +0100
Subject: [PATCH 1/2] first proposal draft

Signed-off-by: Andrea Terzolo
---
 tetragon/CFP-4191-per-workload-policy.md | 143 +++++++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100644 tetragon/CFP-4191-per-workload-policy.md

diff --git a/tetragon/CFP-4191-per-workload-policy.md b/tetragon/CFP-4191-per-workload-policy.md
new file mode 100644
index 0000000..ac91b45
--- /dev/null
+++ b/tetragon/CFP-4191-per-workload-policy.md
@@ -0,0 +1,143 @@
+# CFP-4191: Tetragon per-workload policies
+
+**SIG:** SIG-TETRAGON
+
+**Begin Design Discussion:** 2025-11-05
+
+**Tetragon Release:** X.XX
+
+**Authors:** Andrea Terzolo, Kornilios Kourtis
+
+**Status:** Draft
+
+## Summary
+
+Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values. This design proposes a new model to decouple the enforcement logic from per‑workload values, enabling significant reductions in eBPF programs and map memory usage.
+
+## Motivation
+
+The current approach of one `TracingPolicy` per workload leads to:
+
+- Program scaling limits: At least one eBPF program per policy. In clusters with hundreds of workloads, this may add latency when the hook is the hot path.
+- Map memory scaling: Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.
+- Operational friction: Logic is identical but values differ (for example, allowed binaries), yet the model duplicates programs and maps.
+
+## Goals
+
+- Decouple shared enforcement logic from per‑workload values to avoid linear growth in programs and map memory as clusters scale.
+- Retain TracingPolicy expressiveness (selectors, filters, actions).
+
+## Non-Goals
+
+- TBD
+
+## Proposal
+
+### Overview
+
+This design is accompanied by a [proof of concept](https://github.com/cilium/tetragon/issues/4191). It introduces two concepts: templates and bindings.
+
+A template is a `TracingPolicy` that declares variables populated at runtime rather than hardcoded at load time. Selectors reference these variables by name.
+
+```yaml
+apiVersion: cilium.io/v1alpha1
+kind: TracingPolicy
+metadata:
+  name: "block-process-template"
+spec:
+  variables:
+  - name: "targetExecPaths"
+    type: "linux_binprm" # this could be used for extra validation but it's probably not strictly necessary
+  kprobes:
+  - call: "security_bprm_creds_for_exec"
+    syscall: false
+    args:
+    - index: 0
+      type: "linux_binprm"
+    selectors:
+    - matchArgs:
+      - index: 0
+        operator: "Equal"
+        valuesFromVariable: "targetExecPaths"
+```
+
+Deploying a template alone has no runtime effect because it lacks concrete values for comparisons.
+
+A binding is a new resource (for example, `TracingPolicyBinding`) that supplies concrete values for a template’s variables and targets specific workloads through a selector. The selector is not the same as a podSelector; it is intended to be mutually exclusive across bindings.
+
+```yaml
+apiVersion: cilium.io/v1alpha1
+kind: TracingPolicyBinding
+metadata:
+  name: "block-process-template-values-1"
+spec:
+  policyTemplateRef:
+    name: "block-process-template"
+  exclusiveSelector: # it should be mutually exclusive across bindings
+    matchLabels:
+      app: "my-app-1"
+  bindings:
+  - name: "targetExecPaths"
+    values:
+    - "/usr/bin/true"
+    - "/usr/bin/ls"
+```
+
+Policy logic becomes active only when a `TracingPolicyBinding` is present. The binding populates eBPF maps with the specified values for cgroups that match its selector.
+
+### Details
+
+- When a template is deployed, Tetragon loads the same eBPF programs and maps as today. Additionally, it creates a `BPF_MAP_TYPE_HASH` map (`cg_to_policy_map`) to map cgroup IDs to binding `policy_id`s. Initially, the map is empty, so the template has no effect.
+- When a binding is deployed:
+  - It receives a new `policy_id`.
+  - For each cgroup matching its selector, Tetragon inserts a (`cgroup_id` → `policy_id`) entry into `cg_to_policy_map`. If a cgroup already has a binding for that template, the new binding is rejected.
+  - This mapping activates the template logic for the targeted cgroups.
+  - Binding values are stored in `BPF_MAP_TYPE_HASH_OF_MAPS` instances (`pol_str_maps_*`). This implementation is very specific to string/charbuf/filename types and the eq/neq operators, but the concept can be extended to other types/operators; more on this later.
+  - These maps are keyed by `policy_id` (looked up via `cg_to_policy_map`). The value is a hash set of strings sourced from the binding, using the same 11‑bucket sizing technique as the existing `string_maps_*`.
+
+### Results
+
+This design enables:
+
+1. Single eBPF program: One shared program can serve many bindings (for example, 512–1024+), as they reference the same template. This drastically reduces the number of programs loaded into the kernel.
+2. Low memory overhead: Per‑binding cost is small: entries in `cg_to_policy_map` plus `pol_str_maps_*` (typically a few KB per binding for modest value lists).
+
+## Impacts / Key Questions
+
+### Impact: new `TracingPolicyBinding` CRD
+
+Adding a `TracingPolicyBinding` CRD introduces a new concept. Users must understand the relationship between templates and bindings and how to manage them effectively.
+
+### Key Question: Is the new CRD necessary?
+
+Could existing `TracingPolicy` mechanisms be extended to carry only values and reference a shared logic template, or is a dedicated CRD essential for clarity and usability?
+
+### Impact: selectors in bindings are intended to be mutually exclusive
+
+In this model, a `cgroup_id` can be associated with at most one binding (`policy_id`) per template. A new binding for the same cgroup is rejected. Binding the same cgroup to multiple policies simultaneously is not allowed.
+
+### Key Question: Is this an acceptable limitation?
+
+- Does exclusivity limit important use cases?
+- If overlapping bindings are useful, how should precedence or merging be defined?
+- How can we enforce idempotency/ordering between multiple Tetragon agents?
+
+### Impact: partial support for variable types and operators
+
+The proposed binding logic is currently limited to:
+
+- matchArgs / matchData filters
+- String / charbuf / filename types
+- eq / neq operators
+
+### Key Question: Is full support required for the first version?
+
+Extending the binding logic to other types/operators would require different eBPF maps/approaches.
+If the design of the API is flexible enough to allow future extensions without breaking changes, could we start with a limited set and expand later based on user needs?
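+
+As a rough, hypothetical illustration (not part of the current proof of concept), supporting a non-string type would mostly mean adding another specialized map alongside `pol_str_maps_*`, while bindings would keep the same shape (a variable name plus a list of values). For example, a set of numeric values could be backed by something like:
+
+```c
+// Hypothetical sketch only: map names and sizes are placeholders.
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+/* inner map: a set of u32 values supplied by one binding */
+struct binding_u32_set {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 256);
+    __type(key, __u32);
+    __type(value, __u8);
+};
+
+/* policy_id -> per-binding u32 set, analogous to the string-valued maps */
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+    __uint(max_entries, 1024);
+    __type(key, __u32);
+    __array(values, struct binding_u32_set);
+} pol_u32_maps SEC(".maps");
+```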
+
+### Impact: multiple bindings per template
+
+Currently, only one binding is supported for each template.
+
+### Key Question: Do we need multiple bindings per template?
+
+The design should be extensible to support multiple bindings without API changes. I'm not sure multi-binding support would really be needed in practice; for this reason, I would avoid complicating the code too much until we have a real use case for it.

From 4991b833a1fddbbe51a1063be0074da5e764d7ec Mon Sep 17 00:00:00 2001
From: Andrea Terzolo
Date: Fri, 14 Nov 2025 09:27:05 +0100
Subject: [PATCH 2/2] refactor: split the document into 2 problems (P1/P2)

Signed-off-by: Andrea Terzolo
Co-authored-by: Kornilios Kourtis
---
 tetragon/CFP-4191-per-workload-policy.md | 237 +++++++++++++++++------
 1 file changed, 178 insertions(+), 59 deletions(-)

diff --git a/tetragon/CFP-4191-per-workload-policy.md b/tetragon/CFP-4191-per-workload-policy.md
index ac91b45..772d188 100644
--- a/tetragon/CFP-4191-per-workload-policy.md
+++ b/tetragon/CFP-4191-per-workload-policy.md
@@ -12,36 +12,157 @@

 ## Summary

-Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values. This design proposes a new model to decouple the enforcement logic from per‑workload values, enabling significant reductions in eBPF programs and map memory usage.
+Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values.

 ## Motivation

-The current approach of one `TracingPolicy` per workload leads to:
+The current approach of one `TracingPolicy` per workload leads to two main problems:

-- Program scaling limits: At least one eBPF program per policy. In clusters with hundreds of workloads, this may add latency when the hook is the hot path.
-- Map memory scaling: Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.
-- Operational friction: Logic is identical but values differ (for example, allowed binaries), yet the model duplicates programs and maps.
+- P1 (Scaling): The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).
+
+- P2 (UX / Composition): Crafting many near-identical `TracingPolicy` resources with only small differences (e.g. filter values) is operationally cumbersome and error-prone.
+
+In this document we propose to tackle scaling (P1) and UX (P2) separately.

 ## Goals

-- Decouple shared enforcement logic from per‑workload values to avoid linear growth in programs and map memory as clusters scale.
-- Retain TracingPolicy expressiveness (selectors, filters, actions).
+- Reduce the number of attached eBPF programs and the map footprint when a common enforcement logic is shared across many workloads.
+- Expose a friendly API to reuse the common parts of policies, so that users do not have to rewrite the same logic multiple times.
+- Preserve existing expressiveness (selectors, filters, actions).
+- Minimize API churn.

 ## Non-Goals

 - TBD

-## Proposal
+## P1: Scaling the BPF Implementation
+
+The current Tetragon implementation creates at least one BPF program and several maps per `TracingPolicy` resource.
+This has some limitations:
+
+- Many eBPF programs attached to the same hook may add latency when the hook is on the hot path.
+- If eBPF programs rely on the eBPF trampoline, they are subject to the [`BPF_MAX_TRAMP_LINKS`](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138) limit (38 on x86), so in some cases only a limited number of programs can be attached to the same hook.
+- Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.
+
+### Option 1: Share eBPF Programs & Maps across different TracingPolicies
+
+For simplicity, this design introduces two concepts: templates and bindings.
+
+- A template is a way to define the enforcement logic we want to share across policies. It loads the eBPF programs plus some maps that will be populated later by bindings.
+- A binding is a way to provide specific values for the enforcement logic.
+
+When a template is deployed, Tetragon creates a `BPF_MAP_TYPE_HASH` map (`cg_to_policy_map`) to map cgroup IDs to binding IDs. Initially, the map is empty, so the template has no effect.
+
+When a binding is deployed, for each cgroup matching the binding, Tetragon inserts a (`cgroup_id` → `binding_id`) entry into `cg_to_policy_map`.
+
+If a cgroup already has a binding for that template, there could be different options to handle the conflict:
+
+1. Rollback of both bindings involved in the conflict. This ensures that the system remains in a consistent state and avoids idempotency/ordering issues between multiple Tetragon agents.
+2. Take the binding with the highest priority. Each binding should have a predefined priority value.
+
+Once the cgroup is associated with the binding, the id is used to look up the binding values in specialized maps.
+Depending on the type of values and operators used in the template, different specialized maps are created to store the binding values.
+For example, for string eq/neq filters, Tetragon could create a `BPF_MAP_TYPE_HASH_OF_MAPS` map where the key is the `binding_id` and the value is a hash set of strings sourced from the binding. Each binding will create an entry in this map. This map could use the same 11‑bucket sizing technique as the existing `string_maps_*` maps.
+
+This [proof of concept](https://github.com/cilium/tetragon/issues/4191) shows a preliminary implementation of this design.
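+
+To make this more concrete, the maps and the lookup flow could look roughly like the following sketch (BTF-style map definitions; names, sizes, and key encodings are illustrative placeholders, not the exact PoC code):
+
+```c
+// Illustrative sketch only: names, sizes, and key layouts are placeholders.
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+#define VALUE_KEY_SIZE 128 /* fixed-size padded key; the PoC uses several bucket sizes */
+
+/* cgroup id -> binding id, one entry per cgroup targeted by a binding */
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 32768);
+    __type(key, __u64);   /* cgroup id */
+    __type(value, __u32); /* binding id */
+} cg_to_policy_map SEC(".maps");
+
+/* inner map: a hash set holding the string values supplied by one binding */
+struct binding_string_set {
+    __uint(type, BPF_MAP_TYPE_HASH);
+    __uint(max_entries, 1024);
+    __uint(key_size, VALUE_KEY_SIZE);
+    __uint(value_size, 1);
+};
+
+/* binding id -> per-binding string set */
+struct {
+    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+    __uint(max_entries, 1024);
+    __type(key, __u32);
+    __array(values, struct binding_string_set);
+} pol_str_maps SEC(".maps");
+
+/* Rough lookup flow inside the single shared hook program. */
+static __always_inline bool binding_value_matches(const char *key)
+{
+    __u64 cgid = bpf_get_current_cgroup_id();
+    __u32 *binding_id = bpf_map_lookup_elem(&cg_to_policy_map, &cgid);
+    if (!binding_id)
+        return false; /* no binding for this cgroup: the template stays inactive */
+
+    void *values = bpf_map_lookup_elem(&pol_str_maps, binding_id);
+    if (!values)
+        return false;
+
+    /* membership test against the values supplied by the binding */
+    return bpf_map_lookup_elem(values, key) != NULL;
+}
+
+SEC("kprobe/security_bprm_creds_for_exec")
+int shared_template_hook(struct pt_regs *ctx)
+{
+    char path[VALUE_KEY_SIZE] = {0}; /* filled from the binprm in the real program */
+
+    if (!binding_value_matches(path))
+        return 0; /* selector does not match for this cgroup's binding */
+    /* matchActions (for example Override) would run here */
+    return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
+```
+
+With this layout, adding or removing a binding never loads new programs; only `cg_to_policy_map` and the per-binding inner maps change.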
+
+**Pros**:
+
+- Small code changes to the existing Tetragon eBPF programs.
+- Possibility to change the binding values at runtime by updating the specialized maps, without reloading the eBPF programs.
+- ...
+
+**Cons**:
+
+- Different types of values and operators require different specialized maps, increasing implementation complexity.
+- ...
+
+### Option 2: Tail Call policy chains
+
+TBD
+
+**Pros**:
+
+- TBD
+
+**Cons**:

-### Overview
+- TBD

-This design is accompanied by a [proof of concept](https://github.com/cilium/tetragon/issues/4191). It introduces two concepts: templates and bindings.
+## P2: Avoid Repetition in Policy Definitions

-A template is a `TracingPolicy` that declares variables populated at runtime rather than hardcoded at load time. Selectors reference these variables by name.
+The current Tetragon `TracingPolicy` API requires users to repeat the same enforcement logic for each workload, changing only the values (for example, allowed binaries). The following is an example of two policies that share the same logic but have different values:

+The current Tetragon `TracingPolicy` API requires users to repeat the same enforcement logic for each workload, changing only the values (for example, allowed binaries). The following is an example of two policies that share the same logic but have different values: ```yaml apiVersion: cilium.io/v1alpha1 kind: TracingPolicy +metadata: + name: "policy-1" +spec: + podSelector: + matchLabels: + app: "my-deployment-1" + kprobes: + - call: "security_bprm_creds_for_exec" + syscall: false + args: + - index: 0 + type: "linux_binprm" + selectors: + - matchArgs: + - index: 0 + operator: "NotEqual" + values: + - "/usr/bin/sleep" + - "/usr/bin/cat" + - "/usr/bin/my-server-1" + matchActions: + - action: Override + argError: -1 + options: + - name: disable-kprobe-multi + value: "1" +--- +apiVersion: cilium.io/v1alpha1 +kind: TracingPolicy +metadata: + name: "policy-2" +spec: + podSelector: + matchLabels: + app: "my-deployment-2" + kprobes: + - call: "security_bprm_creds_for_exec" + syscall: false + args: + - index: 0 + type: "linux_binprm" + selectors: + - matchArgs: + - index: 0 + operator: "NotEqual" + values: + - "/usr/bin/ls" + - "/usr/bin/my-server-2" + matchActions: + - action: Override + argError: -1 + options: + - name: disable-kprobe-multi + value: "1" +``` + +### Option 1: Template + Binding + +We could introduce two new CRDs: `TracingPolicyTemplate` and `TracingPolicyBinding`. + +#### `TracingPolicyTemplate` + +A `TracingPolicyTemplate` specifies variables which can be populated at runtime, rather than being hardcoded at load time. Selectors within the policy reference these variables by name. + +```yaml +apiVersion: cilium.io/v1alpha1 +kind: TracingPolicyTemplate metadata: name: "block-process-template" spec: @@ -61,21 +182,31 @@ spec: valuesFromVariable: "targetExecPaths" ``` -Deploying a template alone has no runtime effect because it lacks concrete values for comparisons. +#### `TracingPolicyBinding` -A binding is a new resource (for example, `TracingPolicyBinding`) that supplies concrete values for a template’s variables and targets specific workloads through a selector. The selector is not the same as a podSelector; it is intended to be mutually exclusive across bindings. +A `TracingPolicyBinding` associates a `TracingPolicyTemplate` with concrete values for its variables and a workload selector. ```yaml apiVersion: cilium.io/v1alpha1 kind: TracingPolicyBinding metadata: + #