
Conversation

@Andreagit97

Summary

Today, with Tetragon, it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload, users need to create a separate TracingPolicy with the same enforcement logic but different values. This design proposes a new model that decouples the enforcement logic from the per-workload values, enabling significant reductions in the number of eBPF programs and in map memory usage.

...

[continued in CFP]

Signed-off-by: Andrea Terzolo <andreaterzolo3@gmail.com>

@kkourt kkourt left a comment


Thanks!

@@ -0,0 +1,143 @@
# CFP-4191: Tetragon per-workload policies

Based on reading the first proposal draft and considering the discussion we already had at cilium/tetragon#4191, there are two different sub-problems.

  • P1: The BPF implementation to support multiple policies for different workloads scales poorly (e.g., program scaling limits and map memory scaling)
  • P2: The UI/UX of writing policies for different workloads is cumbersome (operational friction)

My suggestion would be to treat them as separate problems when discussing solutions.

One way to think about Tetragon policies that I find useful is as a mapping:

(workload, operation, object) -> action

Workloads are defined with selectors such as podSelector, matchBinary, matchNamespace, and matchCredentials.
Operations are defined with things like kprobes.call, objects with things like matchArgs/matchBinaries, and actions with things like Post, NoPost, Override, and NotifyEnforcer.

Scaling the BPF implementation (P1)

The current Tetragon implementation creates (and attaches) one BPF program per TracingPolicy resource.
This has limitations. For one, only a limited number of programs can be attached to each hook. Moreover, there is a lot of redundant computation (e.g., if we have 100 policies, we check 100 times whether the cgroup of the workload is part of the corresponding policy; this can be optimized if we control the dispatch path so we don't have to check every time), as well as duplicated information when the policies are similar (e.g., only the values in matchArgs/matchBinaries differ).

Potential solutions:

  • Avoid redundant computations and reduce memory footprint by sharing maps across different TracingPolicy resources
  • Tail-call only into the programs of the policies matching a given workload
  • ???

The UI/UX of writing policies for different workloads is cumbersome (P2)

The challenge here, I think, is that policies may share common sections of workloads, operations, or objects, and we want to avoid having the user re-write them.

Potential solutions:

  • Split what is now a single TracingPolicy into a parent/template and multiple child policies. The parent/template contains the common parts, while the children contain the parts that differ (typically per workload).
  • Extend the current selectors (e.g., allow for PodSelectors not only at the top of the policy, but also under selectors)
  • ???

Approach

Given all the above, I would see this CFP as a way to specify solutions to the above problems (as well as how they interact), their tradeoffs, and how we (i.e., the Tetragon OSS project) can move forward to address them. An alternative approach would be to focus on the specific solution that you propose in cilium/tetragon#4191. Either approach is fine by me, but we should define (and agree on) the goals/scope.

Author


I agree, we can split this into 2 different sub-problems. In more detail, I would address P1 first, since its solution could give us valuable hints on how to design the UX.

Given all the above, I would see this CFP as a way to specify solutions to the above problems (as well as how they interact), their tradeoffs, and how we (i.e., the Tetragon OSS project) can move forward to address them.

I would pick this approach.

An alternative approach would be to focus on the specific solution that you propose in cilium/tetragon#4191. Either approach is fine by me, but we should define (and agree on) the goals/scope.

I would not focus on the solution proposed in cilium/tetragon#4191, but would rather explore alternatives like the "tail-call" approach you suggested, to understand the advantages they can provide.

If you agree, as a next step, I would compare the tail call solution with the proposed one, evaluating their pros and cons, and see if any other ideas come up from the discussion. WDYT?


Cool!

Maybe as a first step, we can structure the CFP in a way that reflects the above? That is, separate the two problems and have sections for proposed solutions?

CC: @tpapagian who can provide more details on the tail call solution.

Member


First, I will provide some background on what we have already done (in a different context). It seems similar to this proposal, but it also has some differences.

Assuming a policy similar to the following

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example-policy"
spec:
  podSelector:
    matchLabels:
      app: "workload-a"
  kprobes:
  - call: "some_call"
    syscall: true
    args: [...]
    selectors: [...]

we wanted to have a large number of such policies, each of them for a separate workload (defined under podSelector), all hooking at the same function (same kprobes.call) but with different selectors. As you already mentioned, the problem is that this results in multiple programs, each of them attached to that function and each with its own maps, etc. The main issue was that with 100 such policies, we always need to run all 100 programs, and this affects the performance of the user's workload.

The approach that we took was to have a mapping from cgroup_ids to policy_ids. This was introduced in cilium/tetragon#3122 (a map named policy_filter_cgroup_maps). We also need a map from policy_id to program (of type BPF_MAP_TYPE_PROG_ARRAY) that holds the mapping from each policy_id to the actual eBPF program.

Now we need a single eBPF program attached to the hook that we care about. This program gets the policy ids (from policy_filter_cgroup_maps) and uses the bpf_tail_call helper to jump into the first one. When the first program is done, it again calls bpf_tail_call to jump to the second, and so on.
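
To make this concrete, here is a minimal BPF-side sketch of the dispatcher idea. Only policy_filter_cgroup_maps (mentioned above), BPF_MAP_TYPE_PROG_ARRAY, and the bpf_tail_call/bpf_get_current_cgroup_id helpers are real; the map names, the flattened policy-id list, the sizes, and the kprobe target (taken from the example policy above) are assumptions for illustration, not Tetragon's actual implementation:

// Illustrative sketch only: cgroup_policy_ids, struct policy_ids, policy_progs
// and MAX_POLICIES are hypothetical names/layouts, not Tetragon's real maps.
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

#define MAX_POLICIES 256

// Hypothetical flattened view of "cgroup id -> list of matching policy ids"
// (conceptually what policy_filter_cgroup_maps provides).
struct policy_ids {
    __u32 count;
    __u32 ids[MAX_POLICIES];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 32768);
    __type(key, __u64);               // cgroup id
    __type(value, struct policy_ids); // policies selecting this cgroup
} cgroup_policy_ids SEC(".maps");

// policy id -> per-policy program, populated from user space.
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, MAX_POLICIES);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} policy_progs SEC(".maps");

// The only program actually attached to the hook.
SEC("kprobe/some_call")
int policy_dispatcher(struct pt_regs *ctx)
{
    __u64 cgid = bpf_get_current_cgroup_id();
    struct policy_ids *pids = bpf_map_lookup_elem(&cgroup_policy_ids, &cgid);

    if (!pids || pids->count == 0)
        return 0; // no policy selected this workload

    // Jump into the first matching per-policy program; on success
    // bpf_tail_call does not return. Each per-policy program would
    // tail-call the next policy id in the same way when it is done.
    bpf_tail_call(ctx, &policy_progs, pids->ids[0]);
    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";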

When a new policy is loaded, we just need to load a new program (with its maps) that corresponds to the selectors of that tracing policy and update the BPF_MAP_TYPE_PROG_ARRAY to add the new mapping. We don't need to attach it to any hook.
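
On the user-space side, registering a new policy then boils down to loading the per-policy object and inserting its program fd into the prog array; no new attachment is needed. A rough libbpf sketch (register_policy, the "policy_prog" program name, and the fd/path parameters are hypothetical; the libbpf calls themselves are real):

#include <bpf/bpf.h>
#include <bpf/libbpf.h>

// Hypothetical helper: load the per-policy program and register it under
// its policy id. No attach call is needed; the dispatcher already attached
// to the hook will tail-call into it via the prog array.
static int register_policy(int policy_progs_fd, __u32 policy_id,
                           const char *obj_path)
{
    struct bpf_object *obj = bpf_object__open_file(obj_path, NULL);
    if (!obj)
        return -1;

    if (bpf_object__load(obj)) {
        bpf_object__close(obj);
        return -1;
    }

    // "policy_prog" is an assumed program name inside the policy object.
    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "policy_prog");
    if (!prog) {
        bpf_object__close(obj);
        return -1;
    }

    int prog_fd = bpf_program__fd(prog);
    return bpf_map_update_elem(policy_progs_fd, &policy_id, &prog_fd, BPF_ANY);
}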

On the one hand, this makes the CPU overhead constant with respect to the total number of tracing policies (only the programs that need to run are called). We also managed to do that without the need to change the UI/UX. It was also relatively easy to implement. On the other hand, this will not use fewer maps (i.e., less memory) compared to what we have now.

I tried to keep it at a high level to make it easier to follow. If you feel that anything is not clear, just let me know and I will provide more details.


(Sorry for deleting the previous messages; they weren't thought through well.)

@tpapagian thanks for your reply; I think it's a great idea. It's also great to see that Tetragon already has part of the implementation in place.

I have a question about bpf_tail_call. If I understand it correctly, Linux has a maximum bpf_tail_call depth of 33, coming from the MAX_TAIL_CALL_CNT macro.

I think Tetragon already uses many bpf_tail_call invocations. For example, TAIL_CALL_ARGS and TAIL_CALL_FILTER can be called several times depending on the policy. Do you think we would stay within the maximum of 33 if we have multiple policy_ids?
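
Just to make the concern concrete, a purely illustrative back-of-the-envelope check (the per-policy tail-call count is an assumption, not a number measured from Tetragon):

#define MAX_TAIL_CALL_CNT      33  // kernel limit on chained tail calls
#define TAIL_CALLS_PER_POLICY   4  // assumed: e.g. filter + args stages per policy

// If matching policies are chained back to back, roughly
// MAX_TAIL_CALL_CNT / TAIL_CALLS_PER_POLICY ~= 8 policies fit in a single
// chain before the tail-call budget is exhausted.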

Author


Thank you, @tpapagian, for the detailed explanation.
Today, in the code, I see that the policy_filter_cgroup_maps map is instantiated once in the policy filter state, so it is shared by all eBPF programs. Is this map also global in your approach?

Let's say we are in a specific hook, security_bprm_creds_for_exec, and we are running in cgroup1. When we look up the policy IDs in policy_filter_cgroup_maps, how can we tell which policies are associated with the particular hook we are called from (security_bprm_creds_for_exec in this example)? I imagine that a cgroup could be associated with different policies that use different hooks.

On the one hand, this makes the CPU overhead constant with respect to the total number of tracing policies (only the programs that need to run are called). We also managed to do that without the need to change the UI/UX. It was also relatively easy to implement.

This sounds really promising! What I really like is that all selectors are supported out of the box, without the need to create new maps of maps to handle values from different policies.

On the other hand, this will not use fewer maps (i.e., less memory) compared to what we have now.

Do you think this limitation can be overcome in some way, or is it not feasible with this approach? Having 2-3 MB for each policy still seems a bit too much when you have many workloads in the cluster. Put another way, do you already have any plans for how to mitigate it, and how could we help?

Signed-off-by: Andrea Terzolo <andreaterzolo3@gmail.com>
Co-authored-by: Kornilios Kourtis <kornilios@isovalent.com>
@Andreagit97
Author

Maybe as a first step, we can structure the CFP in a way that reflects the above? That is, separate the two problems and have sections for proposed solutions?

Sorry for the delay. Done in the last commit.
