# CFP-4191: Tetragon per-workload policies

Based on reading the first proposal draft and considering the discussion we already had at cilium/tetragon#4191, there are two different sub-problems.

- P1: The BPF implementation to support multiple policies for different workloads scales poorly (e.g., program scaling limits and map memory scaling)
- P2: The UI/UX of writing policies for different workloads is cumbersome (operational friction)

My suggestion would be to treat them as separate problems when discussing solutions.

One way to think about Tetragon policies that I find useful is:

`(workload, operation, object) -> action`

Workloads are defined with selectors such as podSelector, matchBinary, matchNamespace, matchCredentials.
Operations are defined with things like kprobes.call; objects are defined with things like matchArgs/matchBinaries; and actions with things like Post, NoPost, Override, NotifyEnforcer.

**Scaling the BPF implementation (P1)**

The current Tetragon implementation creates (and attaches) one BPF program per TracingPolicy resource.
This has limitations. For one, only a limited number of programs can be attached to each hook. Moreover, there is a lot of redundant computation (e.g., if we have 100 policies, we will check 100 times whether the cgroup of the workload is part of the corresponding policy; this can be optimized if we control the dispatch path so we don't have to check every time), as well as duplicated information when the policies are similar (e.g., only the values in matchArgs/matchBinaries differ).

Potential solutions:

- Avoid redundant computations and reduce memory footprint by sharing maps across different TracingPolicy resources
- Tail-call only into the programs of the policies matching a given workload
- ???

**The UI/UX of writing policies for different workloads is cumbersome (P2)**

The challenge here, I think, is that policies might have common sections of workloads, operations, or objects, and we want to avoid having the user re-write them.

Potential solutions:

- Split what is now a single TracingPolicy into a parent/template and multiple child policies. The parent/template contains the common parts, while the children contain the differing parts (typically per workload).
- Extend the current selectors (e.g., allow PodSelectors not only at the top of the policy, but also under selectors)
- ???

**Approach**

Given all the above, I would see this CFP as a way to specify solutions to the above problems (as well as how they interact), their tradeoffs, and how we (i.e., the Tetragon OSS project) can move forward to address them. An alternative approach would be to focus on the specific solution that you propose in cilium/tetragon#4191. Either approach is fine by me, but we should define (and agree on) the goals/scope.


I agree, we can split this into two different sub-problems. In more detail, I would address P1 first, since the solution to this problem could give us valuable hints on how to design the UX.

> Given all the above, I would see this CFP as a way to specify solutions to the above problems (as well as how they interact), their tradeoffs, and how we (i.e., the Tetragon OSS project) can move forward to address them.

I would pick this approach

> An alternative approach would be to focus on the specific solution that you propose in cilium/tetragon#4191. Either approach is fine by me, but we should define (and agree on) the goals/scope.

I would not focus on the solution proposed in cilium/tetragon#4191 but would explore alternatives like the "tail-call" approach you suggested, to understand the advantages it can provide.

If you agree, as a next step, I would compare the tail call solution with the proposed one, evaluating their pros and cons, and see if any other ideas come up from the discussion. WDYT?


Cool!

Maybe as a first step, we can structure the CFP in a way that reflects the above? That is, separate the two problems and have sections for proposed solutions?

CC: @tpapagian who can provide more details on the tail call solution.


First, I will provide some background on what we have already done (in a different context). It seems similar to this but also has some differences.

Assuming a policy similar to the following

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example-policy"
spec:
  podSelector:
    matchLabels:
      app: "workload-a"
  kprobes:
  - call: "some_call"
    syscall: true
    args: [...]
    selectors: [...]
```

we wanted to have a large number of such policies, each for a separate workload (defined under podSelector), that all hook the same function (same kprobes.call) but have different selectors. As you already mentioned, the problem is that this results in multiple programs, each attached to that function, and each program has its own maps, etc. The main issue was that if we have 100 such policies, we always need to run all 100 programs, and this will affect the performance of the user's workload.

The approach that we took was to have a mapping from cgroup_ids to policy_ids. This was introduced in cilium/tetragon#3122 (a map named policy_filter_cgroup_maps). We also need a map of type BPF_MAP_TYPE_PROG_ARRAY that maps each policy_id to the actual eBPF program.

Now we need a single eBPF program attached to the hook that we care about. This program looks up the policy ids (from policy_filter_cgroup_maps) and uses the bpf_tail_call helper to jump to the first one. When the first program is done, it again calls bpf_tail_call to jump to the second, etc.

When a new policy is loaded, we just need to load a new program (with its maps) corresponding to the selectors of this tracing policy and update the BPF_MAP_TYPE_PROG_ARRAY to add a new mapping. But we don't need to attach it to any hooks.
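
To make the flow concrete, here is a minimal sketch of what the single attached dispatcher could look like. The map names, layouts, and sizes below are illustrative assumptions for this discussion, not the actual Tetragon code:

```c
// Illustrative sketch of the tail-call dispatch described above. Map names,
// layouts, and sizes are assumptions for this discussion, not Tetragon's code.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_POLICIES_PER_CGROUP 16

struct policy_list {
	__u32 count;
	__u32 ids[MAX_POLICIES_PER_CGROUP];
};

/* cgroup id -> list of policy ids that apply to that workload */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__type(key, __u64);
	__type(value, struct policy_list);
} cgroup_policies SEC(".maps");

/* policy id -> per-policy eBPF program (filled when a policy is loaded) */
struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 4096);
	__type(key, __u32);
	__type(value, __u32);
} policy_progs SEC(".maps");

SEC("kprobe/security_bprm_creds_for_exec")
int dispatcher(struct pt_regs *ctx)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	struct policy_list *pl = bpf_map_lookup_elem(&cgroup_policies, &cgid);

	if (!pl || pl->count == 0)
		return 0; /* no policy applies to this workload: do nothing */

	/* Jump to the first matching per-policy program; each per-policy
	 * program tail-calls the next id in the list when it is done. */
	bpf_tail_call(ctx, &policy_progs, pl->ids[0]);
	return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
```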

On the one hand, this makes the CPU overhead constant with respect to the number of tracing policies (only the programs that need to run are called). We also managed to do that without the need to change the UI/UX. It was also relatively easy to implement. On the other hand, this will not use fewer maps (i.e., less memory) compared to what we have now.

I tried to keep this at a high level to make it easier to follow. If you feel that anything is not clear, just let me know and I can provide more details.


(Sorry for deleting the previous messages; they weren't thought through well.)

@tpapagian thanks for your reply and I think it's a great idea. It's also great to see that Tetragon already has part of the implementation.

I have a question about bpf_tail_call. If I understand correctly, Linux has a maximum bpf_tail_call depth of 33, from the MAX_TAIL_CALL_CNT macro.

I think Tetragon already uses many bpf_tail_calls. For example, TAIL_CALL_ARGS and TAIL_CALL_FILTER would be called many times depending on the policy. Do you think it would fit within the maximum of 33 if we have multiple policy_ids?
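
For reference, in recent kernels the limit in question is defined as:

```c
/* include/linux/bpf.h: maximum depth of chained bpf_tail_call invocations */
#define MAX_TAIL_CALL_CNT 33
```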


Thank you, @tpapagian, for the detailed explanation.
Today, in the code, I see that the policy_filter_cgroup_maps map is instantiated once in the policy filter state, so it is shared by all eBPF programs. Is this map also global in your approach?

Let's say we are in a specific hook, security_bprm_creds_for_exec, and we are in cgroup1. When we look up the policy IDs in policy_filter_cgroup_maps, how can we tell which policies are associated with the particular hook we are calling from (security_bprm_creds_for_exec in this example)? I imagine that a cgroup could be associated with different policies using different hooks.

> On the one hand, this makes the CPU overhead constant with respect to the number of tracing policies (only the programs that need to run are called). We also managed to do that without the need to change the UI/UX. It was also relatively easy to implement.

This sounds really promising! What I really like is that all selectors are supported out of the box without the need to create new maps of maps to handle values from different policies.

> On the other hand, this will not use fewer maps (i.e., less memory) compared to what we have now.

Do you think that this limitation can be overcome in some way, or do you believe it is not feasible with this approach? Having 2–3 MB for each policy still seems a little too much when you have many workloads in the cluster. Putting it another way, do you already have any plans for how to mitigate it, and how we might help?


**SIG:** SIG-TETRAGON

**Begin Design Discussion:** 2025-11-05

**Tetragon Release:** X.XX

**Authors:** Andrea Terzolo <andrea.terzolo@suse.com>, Kornilios Kourtis <kkourt@cisco.com>

**Status:** Draft

## Summary

Today, Tetragon does not allow sharing common enforcement logic across many Kubernetes workloads. For each workload, users need to create a separate `TracingPolicy` with the same enforcement logic but different values.

## Motivation

The current approach of one `TracingPolicy` per workload leads to 2 main problems:

- P1 (Scaling): The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).

- P2 (UX / Composition): Crafting many near-identical `TracingPolicy` resources with only small differences (e.g. filter values) is operationally cumbersome and error prone.

In this document we propose to tackle scaling (P1) and UX (P2) separately.

## Goals

- Reduce the number of attached eBPF programs and map footprint when a common enforcement logic is shared across many workloads.
- Expose a friendly API to reuse the common parts of policies, so that users do not have to rewrite the same logic multiple times.
- Preserve existing expressiveness (selectors, filters, actions).
- Minimize API churn.

## Non-Goals

- TBD

## P1: Scaling the BPF Implementation

The current Tetragon implementation creates at least one BPF program and several maps per `TracingPolicy` resource.
This has some limitations:

- Many eBPF programs attached to the same hook may add latency when the hook is on a hot path.
- If eBPF programs rely on the eBPF trampoline, they are subject to the [`BPF_MAX_TRAMP_LINKS`](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138) limit (38 on x86), so in some cases only a limited number of programs can be attached to the same hook.
- Tens of maps per policy and several large maps (for example, policy filters, socktrack_map, override_tasks) lead to multi‑MB memlock per policy.

### Option 1: Share eBPF Programs & Maps across different TracingPolicies

This design introduces two concepts for simplicity: templates and bindings.

- A template defines the enforcement logic we want to share across policies. Deploying a template loads the shared eBPF programs plus some maps that will be populated later by bindings.
- A binding provides the specific values for that enforcement logic.

When a template is deployed, Tetragon creates a `BPF_MAP_TYPE_HASH` map (`cg_to_policy_map`) to match cgroup IDs with binding `policy_id`s. Initially, the map is empty, so the template has no effect.

When a binding is deployed, for each cgroup matching the binding, Tetragon inserts a (`cgroup_id` → `binding_id`) entry into `cg_to_policy_map`.

If a cgroup already has a binding for that template, there could be different options to handle the conflict:

1. Rollback of both bindings involved in the conflict. This ensures that the system remains in a consistent state and avoids idempotency/ordering issues between multiple Tetragon agents.
2. Take the binding with the highest priority. Each binding should have a predefined priority value.

Once the cgroup is associated with the binding, the binding id is used to look up the binding values in specialized maps.
Depending on the type of values and operators used in the template, different specialized maps are created to store the binding values.
For example, for string eq/neq filters, Tetragon could create a `BPF_MAP_TYPE_HASH_OF_MAPS` map where the key is the `binding_id` and the value is the hash set of strings sourced from the binding. Each binding will create an entry in this map. This map could use the same 11‑bucket sizing technique as existing `string_maps_*` maps.
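
As an illustration, the following is a minimal sketch of how this two-level lookup could look on the eBPF side. Except for `cg_to_policy_map`, the map names, key layouts, and sizes are assumptions made for this document, not the proof-of-concept implementation:

```c
// Illustrative sketch of the template/binding lookup described above.
// Except for cg_to_policy_map, names, key layouts, and sizes are assumptions.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct path_key {
	char path[256];
};

/* cgroup id -> binding id, populated when a binding is deployed */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__type(key, __u64);
	__type(value, __u32);
} cg_to_policy_map SEC(".maps");

/* Inner map template: the set of string values carried by one binding */
struct string_set {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, struct path_key);
	__type(value, __u8);
};

/* Outer map: binding id -> that binding's value set */
struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, 4096);
	__type(key, __u32);
	__array(values, struct string_set);
} string_values_by_binding SEC(".maps");

/* Called from the shared template program for a string eq/neq filter. */
static __always_inline bool binding_value_matches(__u64 cgid, struct path_key *key)
{
	__u32 *binding_id = bpf_map_lookup_elem(&cg_to_policy_map, &cgid);
	if (!binding_id)
		return false; /* no binding for this cgroup: template has no effect */

	void *values = bpf_map_lookup_elem(&string_values_by_binding, binding_id);
	if (!values)
		return false;

	/* Membership test against the values supplied by this workload's binding */
	return bpf_map_lookup_elem(values, key) != NULL;
}
```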

This [proof of concept](https://github.com/cilium/tetragon/issues/4191) shows a preliminary implementation of this design.

**Pros**:

- small code changes to the existing Tetragon eBPF programs
- possibility to change the binding values at runtime by updating the specialized maps without reloading the eBPF programs.
- ...

**Cons**:

- different types of values and operators require different specialized maps, increasing implementation complexity.
- ...

### Option 2: Tail Call policy chains

TBD

**Pros**:

- TBD

**Cons**:

- TBD

## P2: Avoid Repetition in Policy Definitions

The current Tetragon `TracingPolicy` API requires users to repeat the same enforcement logic for each workload, changing only the values (for example, allowed binaries). The following is an example of two policies that share the same logic but have different values:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-2"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/ls"
        - "/usr/bin/my-server-2"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
```

### Option 1: Template + Binding

We could introduce two new CRDs: `TracingPolicyTemplate` and `TracingPolicyBinding`.

#### `TracingPolicyTemplate`

A `TracingPolicyTemplate` specifies variables which can be populated at runtime, rather than being hardcoded at load time. Selectors within the policy reference these variables by name.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyTemplate
metadata:
  name: "block-process-template"
spec:
  variables:
  - name: "targetExecPaths"
    type: "linux_binprm" # this could be used for extra validation but it's probably not strictly necessary
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        valuesFromVariable: "targetExecPaths"
```

#### `TracingPolicyBinding`

A `TracingPolicyBinding` associates a `TracingPolicyTemplate` with concrete values for its variables and a workload selector.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyBinding
metadata:
  # <template>-<bindingName>
  name: "block-process-template-values-1"
spec:
  templateRef:
    name: "block-process-template"
  overrideAllowed: false
  overridePriority: 0
  # Same selectors used today in TracingPolicy with no constraints
  podSelector:
    matchLabels:
      app: "my-app-1"
  containerSelector:
    matchExpressions:
    - key: name
      operator: In
      values:
      - main
  bindings:
  - name: "targetExecPaths"
    values:
    - "/usr/bin/true"
    - "/usr/bin/ls"
```

If at least one of the bindings involved in a conflict has `overrideAllowed` set to `false`, Tetragon should roll back all bindings involved and log an error. When `overrideAllowed` is set to `false`, `overridePriority` is ignored.

If `overrideAllowed` is set to `true`, Tetragon should use `overridePriority` to determine which binding takes precedence.

**Pros**:

- freedom to use arbitrary selectors for each binding.
- no requirements for the user to add labels or other identifiers to workloads to ensure exclusivity.

**Cons**:

- users need to be aware of potential overlapping bindings for the same template and handle conflicts accordingly.

### Option 2: Template + Binding (mutually exclusive selectors)

Similar to Option 1, but we require that workload selectors in bindings for the same template are mutually exclusive. This guarantees to users that there will be no overlapping bindings for the same template, but at the cost of flexibility.

The template resource is identical to Option 1, while the binding resource changes slightly:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyBinding
metadata:
  # <template>-<bindingName>
  name: "block-process-template-values-1"
spec:
  templateRef:
    name: "block-process-template"
  bindings:
  - name: "targetExecPaths"
    values:
    - "/usr/bin/true"
    - "/usr/bin/ls"
```

No selectors are specified in the binding. Instead, users must ensure that pods are labeled in the correct way in order to match the intended binding. To match the above binding, users must label their pods with `tetragon-block-process-template=values-1` (`tetragon-<template>=<bindingName>`).

**Pros**:

- mutually exclusive selectors by design, no risk of accidental overlapping bindings.

**Cons**:

- requires users to label workloads appropriately to ensure exclusivity.
- no overrides are allowed; this could be a limitation for use cases in which users want to override a global binding with a more specific one.
- does not allow specifying different values for different containers within the same pod.