Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datapath: add tail call hooks for custom metrics, bytecounter example #13191

Merged
merged 7 commits into from
Mar 25, 2021

Conversation

qmonnet
Copy link
Member

@qmonnet qmonnet commented Sep 16, 2020

Custom Metrics

Byte counters! Let's count how much data is received and sent for each endpoint. Related PR: #13173

But more than byte counters, this PR introduces hooks in the datapath so that users can tail call into custom programs, and collect all kind of metrics they want - Provided they implement it themselves in eBPF. Here is an overview:

  • First commit introduces the tail call hooks in the datapath (IPv4 only), plus management for the per-endpoint prog_array maps used to reference the custom programs. Those maps have two entries, one for ingress and one for egress.

  • Second commit makes these tail calls optional, and opt-in, through the introduction of a dedicated Cilium option.

  • Third commit adds a sample application: a simple byte counter which can be attached to keep track of the amount of data received or sent for a given endpoint.

  • Fourth commit adds the tail call hooks to the IPv6 datapath, on the same model as for IPv4.

To use it, once Cilium is deployed:

  • Compile the custom program
    make -C bpf/custom
  • Load it
    bpftool prog load bpf/custom/bpf_custom.o /sys/fs/bpf/tc/globals/bytecounter type classifier \
            pinmaps /sys/fs/bpf/tc/globals/bytecounter_maps
  • Reference it from the prog_array map
    EP=<endpoint_id>
    bpftool map update pinned /sys/fs/bpf/tc/globals/cilium_calls_custom_$(printf '%05d' ${EP}) \
            key 0 0 0 0 value pinned /sys/fs/bpf/tc/globals/bytecounter              # or key `1 0 0 0` for egress
  • For now, dump counters' value manually
    bpftool map dump pinned /sys/fs/bpf/tc/globals/bytecounter_maps/bytecount_map

PR Status

Here are some points of interest:

  • I am still trying to finalise the location for the ingress hook. In the current version, the tail call added to tail_ipv4_policy() is not always executed, as handle_policy() (which calls it) is skipped in ipv4_local_delivery() if USE_BPF_PROG_FOR_INGRESS_POLICY is defined i.e. if enable-endpoint-routes is set to true. This is the case for GKE, for example. So my current plan is to add the tail call to the different branches in ipv4_local_delivery(), possibly by replacing returns with goto in that function to write the tail call just once, and also possibly with preprocessor guards to make sure it is only called from bpf_lxc.

  • The custom program needs to return with the same value the original program would have used, to preserve the datapath logics. This means that we must pass the desired value from bpf_lxc to the custom program we hook into. I'm using cb[CB_CT_STATE] for that because it did not seem to be used at that location, but I might be wrong. Feedback on this approach (or an alternatives) would be welcome.

  • I also pass the current direction (ingress or egress) and the source identity to the custom program. Given that direction needs 1 bit, identity needs 24, and existing return values are low and needs less than 7 bits, I make it all fit on the same cb cell.

  • Once we have the counters in an eBPF map, we would need to extract and process it. It could be a Cilium metric, but I'm not sure how to implement custom, user-defined metrics. It could be just from the command line, but similarly this requires to give a way to users to specify what kind of metrics they want. At the moment, the values from the counters can be retrieved from bpftool.

@qmonnet qmonnet requested review from a team September 16, 2020 15:00
@qmonnet qmonnet requested review from a team as code owners September 16, 2020 15:00
@qmonnet qmonnet added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Sep 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 16, 2020
@qmonnet qmonnet marked this pull request as draft September 16, 2020 15:01
@qmonnet qmonnet added release-note/major This PR introduces major new functionality to Cilium. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Sep 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Sep 16, 2020
@qmonnet qmonnet removed the release-note/major This PR introduces major new functionality to Cilium. label Sep 16, 2020
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the comments and explanations! That makes it very easy to review :-)

A few nits below, I haven't commented on the position of hooks within bpf_lxc since we discussed that during #sig-datapath meeting already.

My main remaining concern is around the API we expose. I'm assuming we won't be able to change what we expose to custom program after the fact because it may break custom programs in the wild. I've added a couple comments below in that regard.

bpf/lib/maps.h Outdated Show resolved Hide resolved
pkg/datapath/linux/config/config.go Outdated Show resolved Hide resolved
bpf/custom/bpf_custom.c Outdated Show resolved Hide resolved
bpf/custom/bytecount.h Outdated Show resolved Hide resolved
bpf/custom/bpf_custom.c Show resolved Hide resolved
@qmonnet qmonnet marked this pull request as ready for review September 28, 2020 17:29
bpf/bpf_lxc.c Outdated Show resolved Hide resolved
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the agent side changes and they seem pretty small to me

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty clean to me, nice work.

The main questions that come to mind reviewing this PR are:

  • Custom metrics on packets that are subject to L7 forwarding seems a little inconsistent
  • Seems like there are two approaches for handling ingress/egress hinted in the PR, either having two programs (one ingress / one egress) or passing the direction bit in the CB. Is this just left over from a previous implementation or did you intend to support more than one custom program?
  • Some helpers could be called by a custom program and influence the handling of the packet, such as bpf_redirect(). What kind of guard rails do we want for cases like this? (One simple answer can be just leave it up to the implementer not to mess up; or if we're more concerned we could put compile-time warnings in place to discourage such behaviour)
  • Is IPv6 support intended? How will we handle this?

I've added comments on each of these topics below so we can have threaded discussions.

Beyond that, there's just a scattering of minor nits, none of which are overly consequential.

pkg/datapath/linux/config/config.go Show resolved Hide resolved
bpf/lib/maps.h Outdated Show resolved Hide resolved
bpf/custom/bpf_custom.c Outdated Show resolved Hide resolved
bpf/bpf_lxc.c Outdated Show resolved Hide resolved
bpf/bpf_lxc.c Outdated Show resolved Hide resolved
bpf/bpf_lxc.c Outdated Show resolved Hide resolved
bpf/custom/bpf_custom.c Outdated Show resolved Hide resolved
bpf/bpf_lxc.c Outdated Show resolved Hide resolved
bpf/bpf_lxc.c Show resolved Hide resolved
@qmonnet qmonnet added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Oct 12, 2020
@aanm aanm added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Oct 13, 2020
@aanm aanm removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Nov 10, 2020
@stale

This comment has been minimized.

@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Dec 11, 2020
@pchaigno pchaigno marked this pull request as draft December 11, 2020 09:04
@stale stale bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Dec 11, 2020
@stale

This comment has been minimized.

@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 10, 2021
@qmonnet qmonnet force-pushed the pr/qmonnet/custom_metrics branch 2 times, most recently from 83916fd to 4e124dd Compare March 24, 2021 21:13
Add a new CI test to check that custom programs loaded and run via the
datapath tail call hooks work as expected. In particular, we use the
byte counter example provided in a previous commit.

The test is the following: We deploy Cilium and some pods, we load the
byte counter program and attach it to a hook (ingress or egress) for a
given endpoint, we send some traffic (ping requests and replies), and
then check the values in the eBPF map where the count is stored. Try
again with per-endpoint routes, to check the remaining ingress hook.

A new manifest is added, to make sure that we have two simple
application endpoints, located on different nodes. This is to check that
the egress hook is working (there is no egress hook for socket-based
load-balancing yet).

This manifest also pulls an image which is used to compile the byte
counter program before we can load it. This image should be updated in
the future to embed the sample byte counter and all related headers, so
we wouldn't have to mount a volume against the source repository on the
node. This would permit to remove the byte counter from Cilium's
repository, and to keep it in that dedicated image instead.

Skip the test on 4.9 kernels because bpftool map updates do not work on
such kernels. Skip on GKE because we do not have Cilium's source to
compile the custom program from. Both issues should be addressed in the
future, by moving the program sources to a dedicated image which should
also embed its own loader. But actually, just run on net-next, since the
coverage for other kernel versions would be the same.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet
Copy link
Member Author

qmonnet commented Mar 24, 2021

Rebased, following merge conflicts
test-me-please

@qmonnet
Copy link
Member Author

qmonnet commented Mar 25, 2021

test-gke

@pchaigno
Copy link
Member

pchaigno commented Mar 25, 2021

All required CI jobs are green. A review from @cilium/kubernetes is missing but I think we can skip it as it was only requested for install/kubernetes/ and several person familiar with the agent and Helm reviewed. Marking as ready to merge.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 25, 2021
@joestringer joestringer merged commit 37f6192 into master Mar 25, 2021
1.10.0 automation moved this from Must have to Done Mar 25, 2021
@joestringer joestringer deleted the pr/qmonnet/custom_metrics branch March 25, 2021 16:42
@joestringer joestringer moved this from Done to Merged Features in 1.10.0 Mar 29, 2021
qmonnet added a commit to qmonnet/cilium that referenced this pull request Jun 28, 2021
The AfterAll() and AfterEach() blocks in the test file for custom calls
run everytime, even if the Context block for the actual tests is
skipped. In that case, running the final blocks results in an attempt to
remove deployments that have never been set up in the first place. This
may lead to the blocks failing when the tests were in fact skipped, and
may produce test artifacts even though Jenkins does not considered the
test failed.

Let's reorganise those blocks, to make sure they are called only when
necessary. Note that we do need to keep both DeleteCilium() and
DeleteAll(), even if they are now in the same block, as calling only
DeleteAll() would not remove the Cilium ConfigMap.

Fixes: 37f6192 ("test: add CI test for tail calls hooks for custom programs")
Fixes: cilium#13191
Fixes: cilium#16633

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
aanm pushed a commit that referenced this pull request Jun 28, 2021
The AfterAll() and AfterEach() blocks in the test file for custom calls
run everytime, even if the Context block for the actual tests is
skipped. In that case, running the final blocks results in an attempt to
remove deployments that have never been set up in the first place. This
may lead to the blocks failing when the tests were in fact skipped, and
may produce test artifacts even though Jenkins does not considered the
test failed.

Let's reorganise those blocks, to make sure they are called only when
necessary. Note that we do need to keep both DeleteCilium() and
DeleteAll(), even if they are now in the same block, as calling only
DeleteAll() would not remove the Cilium ConfigMap.

Fixes: 37f6192 ("test: add CI test for tail calls hooks for custom programs")
Fixes: #13191
Fixes: #16633

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
tklauser pushed a commit to tklauser/cilium that referenced this pull request Jul 1, 2021
[ upstream commit 9d4e99d ]

The AfterAll() and AfterEach() blocks in the test file for custom calls
run everytime, even if the Context block for the actual tests is
skipped. In that case, running the final blocks results in an attempt to
remove deployments that have never been set up in the first place. This
may lead to the blocks failing when the tests were in fact skipped, and
may produce test artifacts even though Jenkins does not considered the
test failed.

Let's reorganise those blocks, to make sure they are called only when
necessary. Note that we do need to keep both DeleteCilium() and
DeleteAll(), even if they are now in the same block, as calling only
DeleteAll() would not remove the Cilium ConfigMap.

Fixes: 37f6192 ("test: add CI test for tail calls hooks for custom programs")
Fixes: cilium#13191
Fixes: cilium#16633

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
aanm pushed a commit that referenced this pull request Jul 2, 2021
[ upstream commit 9d4e99d ]

The AfterAll() and AfterEach() blocks in the test file for custom calls
run everytime, even if the Context block for the actual tests is
skipped. In that case, running the final blocks results in an attempt to
remove deployments that have never been set up in the first place. This
may lead to the blocks failing when the tests were in fact skipped, and
may produce test artifacts even though Jenkins does not considered the
test failed.

Let's reorganise those blocks, to make sure they are called only when
necessary. Note that we do need to keep both DeleteCilium() and
DeleteAll(), even if they are now in the same block, as calling only
DeleteAll() would not remove the Cilium ConfigMap.

Fixes: 37f6192 ("test: add CI test for tail calls hooks for custom programs")
Fixes: #13191
Fixes: #16633

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@Thearas
Copy link
Contributor

Thearas commented Jun 10, 2022

Hi @qmonnet, great job! I'm just looking for a traffic byte-counter, and I have a few question:

  1. The byte-counter is implemented using BPF_MAP_TYPE_HASH, would it be more efficient to implement it using BPF_MAP_TYPE_PERCPU_HASH? Just like cilium_metrics.
  2. How to determine if the __ctx_buff is on ingress path or on egress path?

Looking forward to your reply, much appreciated!

@qmonnet
Copy link
Member Author

qmonnet commented Jun 13, 2022

Hi, thanks for using the feature!

  • The byte-counter is implemented using BPF_MAP_TYPE_HASH, would it be more efficient to implement it using BPF_MAP_TYPE_PERCPU_HASH? Just like cilium_metrics.

I'm not sure what gain you expect here. The updates on the counter value are atomic operations, so it shouldn't be an issue if multiple CPUs work with the same map. This being said, the point of these hooks is to have custom programs, so of course you're free to experiment :)

  • How to determine if the __ctx_buff is on ingress path or on egress path?

Please have a look at the related README.rst, and in particular at step 3 in the instructions for the byte-counter program. There are several hooks where you can attach the program, some on ingress, some on egress. The current program does not make the distinction between ingress and egress. If you want it, you can edit the program to e.g. use another map for egress (distinct from the one for ingress), and attach this new program to the hooks on egress.

@Thearas
Copy link
Contributor

Thearas commented Jun 14, 2022

Thanks!
One more question, how to get byte-counter program to be attached automatically when the cilium_calls_custom_* map is created?

@qmonnet
Copy link
Member Author

qmonnet commented Jun 21, 2022

There's nothing of the sort implemented in the agent currently, so I suppose you'd have to create some watcher to check for the presence of the maps and attach the programs when necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.10.0
Merged Features
Development

Successfully merging this pull request may close these issues.

None yet