CFP-4191: Tetragon per-workload policies #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open: Andreagit97 wants to merge 2 commits into cilium:main from Andreagit97:tetragon-workload-policies

# CFP-4191: Tetragon per-workload policies

**SIG:** SIG-TETRAGON

**Begin Design Discussion:** 2025-11-05

**Tetragon Release:** X.XX

**Authors:** Andrea Terzolo <andrea.terzolo@suse.com>, Kornilios Kourtis <kkourt@cisco.com>

**Status:** Draft

## Summary

Today, Tetragon offers no way to share common enforcement logic across many Kubernetes workloads: for each workload, users must create a separate `TracingPolicy` that repeats the same enforcement logic with different values.

## Motivation

The current approach of one `TracingPolicy` per workload leads to two main problems:

- P1 (Scaling): The current eBPF implementation (one program plus many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).

- P2 (UX / Composition): Crafting many near-identical `TracingPolicy` resources that differ only in small details (e.g. filter values) is operationally cumbersome and error prone.

In this document we propose to tackle scaling (P1) and UX (P2) separately.

## Goals

- Reduce the number of attached eBPF programs and the map footprint when common enforcement logic is shared across many workloads.
- Expose a friendly API to reuse the common parts of policies so that users do not have to rewrite the same logic multiple times.
- Preserve existing expressiveness (selectors, filters, actions).
- Minimize API churn.

## Non-Goals

- TBD

## P1: Scaling the BPF Implementation

The current Tetragon implementation creates at least one BPF program and several maps per `TracingPolicy` resource.
This has some limitations:

- Many eBPF programs attached to the same hook add latency when the hook is on a hot path.
- eBPF programs that rely on the eBPF trampoline are subject to the [`BPF_MAX_TRAMP_LINKS`](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138) limit (38 on x86), so in some cases only a limited number of programs can be attached to the same hook.
- Tens of maps per policy, several of them large (for example, policy filters, `socktrack_map`, `override_tasks`), lead to multi-MB memlock usage per policy.

### Option 1: Share eBPF Programs & Maps across different TracingPolicies

For simplicity, this design introduces two concepts: templates and bindings.

- A template defines the enforcement logic we want to share across policies. Deploying it loads the eBPF programs plus some maps that will be populated later by bindings.
- A binding provides the specific values for the enforcement logic.

When a template is deployed, Tetragon creates a `BPF_MAP_TYPE_HASH` map (`cg_to_policy_map`) to match cgroup IDs with binding `policy_id`s. Initially the map is empty, so the template has no effect.

When a binding is deployed, for each cgroup matching the binding, Tetragon inserts a (`cgroup_id` → `binding_id`) entry into `cg_to_policy_map`.

If a cgroup already has a binding for that template, there are different options to handle the conflict:

1. Roll back both bindings involved in the conflict. This ensures that the system remains in a consistent state and avoids idempotency/ordering issues between multiple Tetragon agents.
2. Take the binding with the highest priority. Each binding would carry a predefined priority value.

Once the cgroup is associated with the binding, the id is used to look up the binding values in specialized maps.
Depending on the type of values and operators used in the template, different specialized maps are created to store the binding values.
For example, for string eq/neq filters, Tetragon could create a `BPF_MAP_TYPE_HASH_OF_MAPS` map where the key is the `binding_id` and the value is a hash set of strings sourced from the binding; each binding creates an entry in this map. This map could use the same 11-bucket sizing technique as the existing `string_maps_*` maps.

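To make the shared-map layout more concrete, the following is a minimal C sketch using libbpf-style map definitions. It is only an illustration of the idea: `cg_to_policy_map` is named in this proposal, while `binding_string_values`, `string_set`, the key sizes, and the `binding_matches()` helper are hypothetical and not part of the actual Tetragon sources.

```c
// Sketch of the per-template shared maps for Option 1 (illustrative names
// and sizes; not the actual Tetragon implementation).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_BINDINGS 1024

/* cgroup id -> binding id, populated by the agent when a binding is deployed. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__type(key, __u64);   /* cgroup id */
	__type(value, __u32); /* binding id */
} cg_to_policy_map SEC(".maps");

/* Inner map: a hash set of fixed-size (padded) string keys, one per binding. */
struct string_set {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 256);
	__uint(key_size, 64);
	__uint(value_size, sizeof(__u8));
};

/* Outer map for string eq/neq filters: binding id -> that binding's string
 * set. Other value types/operators would get their own specialized maps. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, MAX_BINDINGS);
	__uint(key_size, sizeof(__u32));
	__array(values, struct string_set);
} binding_string_values SEC(".maps");

/* Called from the single shared program loaded by the template.
 * str_key is a 64-byte padded string extracted from the hook argument. */
static __always_inline int binding_matches(__u64 cgid, const void *str_key)
{
	__u32 *binding_id = bpf_map_lookup_elem(&cg_to_policy_map, &cgid);
	if (!binding_id)
		return 0; /* no binding for this cgroup: the template is a no-op */

	void *values = bpf_map_lookup_elem(&binding_string_values, binding_id);
	if (!values)
		return 0;

	/* A hit in the binding's string set means the Equal filter matches. */
	return bpf_map_lookup_elem(values, str_key) != 0;
}
```

With a layout along these lines, updating a binding only touches `cg_to_policy_map` and the binding's inner map, so values can change at runtime without reloading or reattaching the shared program.
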
This [proof of concept](https://github.com/cilium/tetragon/issues/4191) shows a preliminary implementation of this design.

**Pros**:

- small code changes to the existing Tetragon eBPF programs.
- binding values can be changed at runtime by updating the specialized maps, without reloading the eBPF programs.
- ...

**Cons**:

- different types of values and operators require different specialized maps, increasing implementation complexity.
- ...

### Option 2: Tail Call policy chains

TBD

**Pros**:

- TBD

**Cons**:

- TBD

## P2: Avoid Repetition in Policy Definitions

The current Tetragon `TracingPolicy` API requires users to repeat the same enforcement logic for each workload, changing only the values (for example, the allowed binaries). The following example shows two policies that share the same logic but have different values:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-2"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/ls"
        - "/usr/bin/my-server-2"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
```

### Option 1: Template + Binding

We could introduce two new CRDs: `TracingPolicyTemplate` and `TracingPolicyBinding`.

#### `TracingPolicyTemplate`

A `TracingPolicyTemplate` specifies variables which can be populated at runtime, rather than being hardcoded at load time. Selectors within the policy reference these variables by name.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyTemplate
metadata:
  name: "block-process-template"
spec:
  variables:
  - name: "targetExecPaths"
    type: "linux_binprm" # this could be used for extra validation but it's probably not strictly necessary
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        valuesFromVariable: "targetExecPaths"
```

#### `TracingPolicyBinding`

A `TracingPolicyBinding` associates a `TracingPolicyTemplate` with concrete values for its variables and a workload selector.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyBinding
metadata:
  # <template>-<bindingName>
  name: "block-process-template-values-1"
spec:
  templateRef:
    name: "block-process-template"
  overrideAllowed: false
  overridePriority: 0
  # Same selectors used today in TracingPolicy with no constraints
  podSelector:
    matchLabels:
      app: "my-app-1"
  containerSelector:
    matchExpressions:
    - key: name
      operator: In
      values:
      - main
  bindings:
  - name: "targetExecPaths"
    values:
    - "/usr/bin/true"
    - "/usr/bin/ls"
```

If at least one of the bindings involved in a conflict has `overrideAllowed` set to `false`, Tetragon should roll back all bindings involved and log an error. When `overrideAllowed` is `false`, `overridePriority` is ignored.

If `overrideAllowed` is set to `true` on all of them, Tetragon should use `overridePriority` to determine which binding takes precedence.

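As a compact illustration of this precedence rule, here is a sketch in C; the `binding` struct and `resolve_conflict()` helper are hypothetical, and the real agent-side logic would live in Tetragon's Go code rather than in C.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical view of a binding's conflict-related fields. */
struct binding {
	const char *name;
	bool override_allowed;  /* spec.overrideAllowed */
	int override_priority;  /* spec.overridePriority */
};

/* Returns the binding that wins for a cgroup matched by both `current` and
 * `incoming`, or NULL if all involved bindings must be rolled back. */
static const struct binding *
resolve_conflict(const struct binding *current, const struct binding *incoming)
{
	if (!current->override_allowed || !incoming->override_allowed)
		return NULL; /* roll back and log an error; priority is ignored */
	return incoming->override_priority > current->override_priority
	       ? incoming : current;
}
```
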
**Pros**:

- freedom to use arbitrary selectors for each binding.
- no requirement for users to add labels or other identifiers to workloads to ensure exclusivity.

**Cons**:

- users need to be aware of potentially overlapping bindings for the same template and handle conflicts accordingly.

### Option 2: Template + Binding (mutually exclusive selectors)

Similar to Option 1, but workload selectors in bindings for the same template are required to be mutually exclusive. This guarantees that there are no overlapping bindings for the same template, at the cost of flexibility.

The template resource is identical to Option 1, while the binding resource changes slightly:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicyBinding
metadata:
  # <template>-<bindingName>
  name: "block-process-template-values-1"
spec:
  templateRef:
    name: "block-process-template"
  bindings:
  - name: "targetExecPaths"
    values:
    - "/usr/bin/true"
    - "/usr/bin/ls"
```

No selectors are specified in the binding. Instead, users must label their pods appropriately so that they match the intended binding. To match the above binding, users must label their pods with `tetragon-block-process-template=values-1` (`tetragon-<template>=<bindingName>`).

**Pros**:

- mutually exclusive selectors by design; no risk of accidental overlapping bindings.

**Cons**:

- requires users to label workloads appropriately to ensure exclusivity.
- no override allowed; this could be a limitation for use cases where users want to override a global binding with a more specific one.
- does not allow specifying different values for different containers within the same pod.

Based on reading the first proposal draft and considering the discussion we already had at cilium/tetragon#4191, there are two different sub-problems.
My suggestion would be to treat them as separate problems when discussing solutions.

One way to think about Tetragon policies that, I think, is useful is as follows:
Workloads are defined with selectors such as `podSelector`, `matchBinary`, `matchNamespace`, `matchCredentials`. Operations are defined with things like `kprobes.call`, objects are defined with things like `matchArgs`/`matchBinaries`, while actions with things like `Post`, `NoPost`, `Override`, `NotifyEnforcer`.

**Scaling the BPF implementation (P1)**

The current Tetragon implementation creates (and attaches) one BPF program per TracingPolicy resource.
This has limitations. For one, a limited number of programs can be attached to every hook. Moreover, there is a lot of redundant computation (e.g., if we have 100 policies, we will check 100 times if the cgroup of the workload is part of the corresponding policy; this can be optimized so we control the path and don't have to check every time), as well as duplicated information in case the policies are similar (e.g., only the values in `matchArgs`/`matchBinaries` differ).

Potential solutions:

**The UI/UX of writing policies for different workloads is cumbersome (P2)**

The challenge here, I think, is one where policies might have common sections of Workloads, operations, or objects, and we want to avoid having the user re-write them.

Potential solutions:
- (`PodSelectors` not only at the top of the policy, but also under `selectors`)

**Approach**

Given all the above, I would see this CFP as a way to specify solutions to the above problems (as well as how they interact), their tradeoffs, and how we (i.e., the Tetragon OSS project) can move forward to address them. An alternative approach would be to focus on the specific solution that you propose in cilium/tetragon#4191. Either approach is fine by me, but we should define (and agree on) the goals/scope.
I agree, we can split this into two different sub-problems. In more detail, I would address P1 first, since the solution to this problem could give us precious hints on how to design the UX.
I would pick this approach
I would not focus on the proposed solution cilium/tetragon#4191 but would seek to explore alternatives like the "tail-call" approach you suggested, to understand the advantages it can provide.
If you agree, as a next step, I would compare the tail call solution with the proposed one, evaluating their pros and cons, and see if any other ideas come up from the discussion. WDYT?
Cool!
Maybe as a first step, we can structure the CFP in a way that reflects the above? That is, separate the two problems and have sections for proposed solutions?
CC: @tpapagian who can provide more details on the tail call solution.
First I will provide some background on what we have done already (for a different context). It seems similar to that, but it also has some differences.

Assuming a policy similar to the following, we wanted to have a large number of such policies, each of them for a separate workload (defined under `podSelector`) that all hook the same function (same `kprobes.call`) but have different selectors. As you already mentioned, the problem is that this results in multiple programs, each of them attached to that function, and each program has its own maps, etc. The main issue was that if we have 100 such policies, we always need to run all 100 programs, and this will affect the performance of the user's workload.

The approach that we took was to have a mapping from `cgroup_id`s to `policy_id`s. This was introduced in cilium/tetragon#3122 (map named `policy_filter_cgroup_maps`). We also need a map from `policy_id` to `program` (type `BPF_MAP_TYPE_PROG_ARRAY`) that holds the mapping from `policy_id` to the actual eBPF programs.

Now we need a single eBPF program that is attached to the hook that we care about. This will get the policy ids (from `policy_filter_cgroup_maps`) and call the `bpf_tail_call` helper to jump to the first one. When the first program is done, it will again call `bpf_tail_call` to the second, etc.

When a new policy is loaded, we just need to load a new program, with its maps, that corresponds to the selectors of this tracing policy, and update the `BPF_MAP_TYPE_PROG_ARRAY` to add a new mapping. But we don't need to attach it to any hooks.

On the one hand, this makes the CPU overhead essentially constant with respect to the number of loaded tracing policies (only the programs that need to run will be called). We also managed to do that without the need to change the UI/UX. It was also relatively easy to implement. On the other hand, this will not use fewer maps (i.e., less memory) compared to what we have now.

I tried to keep this at a high level to make it easier to follow. If you feel that anything is not clear, just let me know and I will provide more details.
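For readers less familiar with the mechanism, the following is a minimal C sketch of such a tail-call dispatcher. It is illustrative only: the map and program names (`cgroup_policies`, `policy_progs`, `dispatcher`), the sizes, and the simplification of mapping each cgroup to a single policy id are assumptions, not the actual Tetragon implementation (where `policy_filter_cgroup_maps` can associate a cgroup with multiple policy ids).

```c
// Sketch of a tail-call dispatcher: a single program is attached to the
// hook; per-policy programs are reached only via bpf_tail_call().
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* policy id -> per-policy program; updated when a policy is (un)loaded. */
struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 512);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} policy_progs SEC(".maps");

/* cgroup id -> policy id (simplified to one policy per cgroup). */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__type(key, __u64);
	__type(value, __u32);
} cgroup_policies SEC(".maps");

SEC("kprobe/security_bprm_creds_for_exec")
int dispatcher(void *ctx)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u32 *policy_id = bpf_map_lookup_elem(&cgroup_policies, &cgid);

	if (policy_id)
		/* Jump to the matching per-policy program; on success this
		 * does not return. Each per-policy program would chain to the
		 * next applicable policy the same way. */
		bpf_tail_call(ctx, &policy_progs, *policy_id);

	/* No policy for this cgroup (or the tail call failed): nothing to do. */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

The key property is that only the per-policy programs matching the current cgroup ever run, and loading a new policy only updates the `BPF_MAP_TYPE_PROG_ARRAY` instead of attaching another program to the hook.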
(Sorry for deleting the previous messages; they weren't thought through well.)

@tpapagian thanks for your reply, I think it's a great idea. It's also great to see that Tetragon already has part of the implementation.

I have a question about `bpf_tail_call`. If I understand it correctly, Linux has a maximum tail-call depth of 33, from the `MAX_TAIL_CALL_CNT` macro.

I think Tetragon already uses many `bpf_tail_call`s. For example, `TAIL_CALL_ARGS` and `TAIL_CALL_FILTER` would be called many times depending on the policy. Do you think it would fit within the maximum of 33 if we have multiple `policy_id`s?
bpf_tail_call. For example,TAIL_CALL_ARGSandTAIL_CALL_FILTERwould be called many times depending on the policy. Do you think it would fit the 33 maximum if we have multiplepolicy_ids?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tpapagian for the detailed explanation.

Today, in the code, I see that the `policy_filter_cgroup_maps` map is instantiated once in the policy filter state, so it is shared by all eBPF programs. Is this map also global in your approach?

Let's say we are in a specific hook, `security_bprm_creds_for_exec`, and we are in `cgroup1`. When we look up the policy IDs in `policy_filter_cgroup_maps`, how can we understand which policies are associated with the particular hook we are calling from (`security_bprm_creds_for_exec` in this example)? I imagine that a cgroup could be associated with different policies using different hooks.

This sounds really promising! What I really like is that all selectors are supported out of the box, without the need to create new maps of maps to handle values from different policies.

Do you think this limitation can be overcome in some way, or do you believe it is not feasible with this approach? Having 2–3 MB for each policy still seems a bit too much when you have many workloads in the cluster. Put another way, do you already have any plans for how to mitigate it, and how might we help?