hubble: Add recorder API #15680

gandro · 2021-04-13T20:41:52Z

This commit adds a new API and implementation for what we call
the Hubble Recorder API. It is intended to be used for low-level
packet capture on the XDP datapath parts when Cilium is running in
LB-mode. Therefore, it only supports 5-tuple filters instead of the more
expressive Hubble flow metadata queries of the Hubble Observer API.

To start a recording, the client has to send a StartRecording message
via the Record method. To stop it, a StopRecording message must be
sent. This means that the recording itself is bound to a client context
and therefore allows the server to stop a recording if the client has
disconnected. The stop message is explicit such that the client can wait
for the final status report.

This API has been designed so that it can be extended in the future to
support other kinds of sinks.

See commit messages for details.

ti-mo

Nice! I have some concerns around the ktime_get_ns offset calculation logic, feel free to ping offline if you need some more input.

ti-mo · 2021-04-15T10:49:59Z

pkg/hubble/recorder/sink/dispatch.go

+	return d.startWallTime.Add(elapsedSinceStart)
+}
+
+func getTimeNow() (bootTime int64, wallTime time.Time, err error) {


I've implemented something similar here: https://github.com/ti-mo/conntracct/blob/master/pkg/boottime/boottime.go. I opted to expose runtime.nanotime() in my package so I wouldn't have to make a slow syscall, and vDSO is used as a source for both timestamps. Feel free to take a look, it's well-documented.

A few traps I've fallen into while implementing this:

There can be an arbitrary amount of time between the execution of time.Now() and unix.ClockGettime(), especially under scheduler pressure during program startup. This can be slightly improved by running these 2 calls under runtime.LockOSThread(), which at least takes the Go scheduler out of the equation. Ideally, these results need to be sampled and averaged to eliminate (OS) scheduler jitter. (and it still won't be accurate 😅)

Hardware timer pauses (during machine/laptop suspend, VM migrations or other virtualization-related artifacts) invalidate this offset. CLOCK_MONOTONIC does not advance during a hardware pause, so it's possible for 2 events that occurred right before and right after a (let's say, 1hr) pause window to have timestamps that are 1ns apart. This happens much more often than you'd expect. (the cloud is someone else's computer after all) This will result in events seemingly occurring earlier than they actually have, causing events to fall outside of a pcap window. (being a few seconds off can already be an issue if a user runs e.g. a curl right after starting a pcap) The solution I came up with is refreshing the offset every 5/10 seconds or so, since userspace has no way of knowing when these pauses occur.

Hope this helps!

Thanks a ton for the in-depth reply here. I haven't considered the scheduler issue, I might add a LockOSThread and do a few repeats to reduce the error, but yeah, anything we do here will always be best-effort I'm afraid.

The reason I picked CLOCK_BOOTTIME is that this is what the datapath is using, see:

cilium/bpf/lib/pcap.h, lines 78 to 83 in 2ff996c:

	/* For later pcap file generation, we export boot time to the RB
	 * such that user space can later reconstruct a real time of day
	 * timestamp in-place.
	 */
	cilium_capture(ctx, CAPTURE_INGRESS, rule_id,
		       bpf_ktime_cache_set(boot_ns), cap_len);

bpf_ktime_get_boot_ns is CLOCK_BOOTTIME, see https://lkml.org/lkml/2020/4/20/1443

If CLOCK_BOOTTIME is not what we want, then we need to fix that in the datapath first. But I actually believe we do want CLOCK_BOOTTIME. My reading is that CLOCK_BOOTTIME is not affected by your concerns regarding suspension (while runtime.nanotime(), which is CLOCK_MONOTONIC, is affected).

> CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
> Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2).
> https://linux.die.net/man/2/clock_gettime

Unless I'm misinterpreting something here I'll need to stick with CLOCK_BOOTTIME which runtime unfortunately does not expose (see also golang/go#24595). But I like your linker trick to access it anyway :D

> bpf_ktime_get_boot_ns is CLOCK_BOOTTIME, see https://lkml.org/lkml/2020/4/20/1443

Oh, wasn't aware ktime_get_boot_ns made it in; my implementation predates that helper, and so relies only on monotonic being available. 100% correct on the _MONOTONIC vs. _BOOTTIME, didn't catch that.

For posterity, the kernel commit that introduced it (71d19214776e) only landed in Linux 5.8, so this lb-only datapath code is not backwards-compatible. bpf_ktime_cache_set(boot_ns) does indeed call ktime_get_boot_ns(), so the 5.8 (or backport kernel) requirement should probably be documented in #15633.

@borkmann for future (and backwards) compat, userspace is going to need to know what the BPF event's clock source is (mono/boottime/timeofday, mentioning all three since we might want to target pre-5.8 at some point). Maybe we should already extend struct pcap_timeoff and make sure userspace can make that distinction? (thinking about upgrade safety)

Right, 02e55a7 will check for availability and bail out if --enable-recorder=true is used and the kernel doesn't have the boot time source. Potentially, if users don't care too much about the specific clock, we could also use monotonic and not translate anything on the Hubble side (apart from the timeval conversion). I would probably wait until such a need comes up. After my blocker PRs are done, I'm planning to document all stand-alone LB features in detail as a getting started guide for Cilium, so I'll definitely include such kernel requirements there (right now only the agent reports it in its error message, but a GSG of similar scope and depth as https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/ would be super useful for the LB-only mode).

I have updated the mechanism to fetch the diff multiple times (to be more robust) and take the minimum. I also simplified the clock offset calculation a bit.

pkg/hubble/recorder/sink/dispatch.go

This commit is intended to make Hubble more usable when Cilium is running in LB-only mode. Identity lookup in the Hubble parser might fail for various reasons, for example, when running in the LB-datapath mode. Since the user cannot do anything about this (and the absent data is still detectable via Hubble API), stop emitting a warning in the logs and use a debug statement instead. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit adds the protobuf definition for the new Hubble Recorder API. It is intended to be used for low-level packet capture on the XDP datapath parts when Cilium is running in LB-mode. Therefore, it only supports 5-tuple filters instead of the more expressive Hubble flow metadata queries of the Hubble Observer API. To start a recording, the client has to send a `StartRecording` message via the `Record` method. To stop it, a `StopRecording` message must be sent. This means that the recording itself is bound to a client context and therefore allows the server to stop a recording if the client has disconnected. The stop message is explicit such that the client can wait for the final status report. This API has been designed to be possibly extended in the future to support other kinds of sinks. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

Also remove the unnecessary invocation of `apk add`. The builder image already contains all necessary tools. The image version will be updated in a subsequent PR, as currently the infrastructure to push new versions of this image is not present in our GitHub Actions setup. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro · 2021-04-15T13:48:50Z

test-me-please

Edit: Fixed linter

gandro · 2021-04-15T14:47:03Z

test-me-please

ti-mo · 2021-04-15T15:32:11Z

pkg/hubble/recorder/sink/dispatch.go

+			bootTimeOffset = offset
+		}
+	}
+	runtime.UnlockOSThread()


Nit: this can be removed.

Good catch, thanks! Originally I had more code before the return statement so I wanted to do an early unlock, but that's not needed anymore in the current version.

Will remove once CI is green (to avoid re-triggering it)

jrfastab · 2021-04-15T18:31:49Z

api/v1/recorder/recorder.proto

+message Filter {
+    // source_cidr. Must not be empty.
+    // Set to 0.0.0.0/0 to match any IPv4 source address (::/0 for IPv6).
+    string source_cidr = 1;


I'm asking because this looks like it becomes API. How would we extend this in the future for ARP or non-{TCP/UDP} traffic? I assume we'd zero these fields and set the protocol? Why not default them to zero if not set?

So the current API is intentionally very restrictive in that it requires the user to explicitly pick between IPv4 or IPv6. This is why I don't default to zero, because it would not be clear if the user means IPv4 or IPv6 "zero". If the user wants any IPv4 and any IPv6 traffic captured, they need to submit two filters to make that explicit. We can always relax that in the future and make the API more ergonomic, but I'd prefer for the initial version to be a 1:1 mapping to the datapath filters to make it more transparent how expensive certain filters are.

You do bring up a good point in that I have not considered what this API would look like for non-IP traffic, because it does mean that an empty value here could either mean "match any IP traffic" or "match only non-IP traffic". Maybe we should wrap the L3 and L4 fields separately, such that a filter for a specific layer can explicitly be absent with well-defined semantics. I'll play around to see how complicated that would be.

@gandro @jrfastab Hmm, right now at the point where we add the cilium_capture_{in,out}() we will never see non-IP traffic. But I guess there could be an L3 filter of some sort, so maybe the API could be extended with a proto_l3 field, not sure yet. Thoughts?

Yes, thinking about it some more, I think the current API should be future-proof enough. We can relax it similarly to how the normal Hubble FlowFilter works, i.e. where an absent field just means "don't care about this, check if any other field is applicable". This should give us enough flexibility to support adding new fields to e.g. filter ARP traffic.

Agree, sounds reasonable to me.

pkg/hubble/recorder/pcap/pcap.go

jrfastab

couple small nits but overall LGTM

This commit adds the API implementation and required plumbing for the Hubble Recorder API. It consists of three main components:

- `pkg/hubble/recorder/pcap` is a minimalistic library to write pcap files into an `io.Writer`.
- `pkg/hubble/recorder/sink` contains the recorder sinks. Whenever a new datapath recorder has been set up, the corresponding captures are pushed into the monitor perf event ring buffer. The `sink.Dispatch` type attaches to the monitor to receive and decode these kinds of events and dispatches incoming packets to registered pcap writers.
- `pkg/hubble/recorder` contains the API implementation. It is responsible for starting and stopping recordings on behalf of the client and for allocating `ruleIDs` for the datapath filters. When a new recording is started, it calls into `pkg/recorder` to install the filters and into `pkg/hubble/recorder/sink` to set up the corresponding file sinks. Its options are defined in a separate `recorderoption` package.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit adds the required command-line arguments and setup code in the Cilium Agent to serve the Hubble Recorder API. It is only served on the Hubble Unix domain socket and needs to be explicitly enabled. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro · 2021-04-16T07:44:02Z

Addressed the nit by Timo: https://github.com/cilium/cilium/compare/512492f0c71d6ae64c892022e16a2fb239131eee..0989fee5bc8d168b33877552d1045a93c5095fd3

CI was all green, except https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/208/ (NodePort request failed with Exitcode: 42) which looks very much like an unrelated fluke. f0d79f5d_K8sServicesTest_Checks_service_across_nodes_with_L7_policy_Tests_NodePort_with_L7_Policy.zip

I'm marking this ready to merge.

borkmann

One small nit, but can be resolved after merge. Great work @gandro !

borkmann · 2021-04-16T08:21:08Z

Documentation/cmdref/cilium-agent.md

@@ -85,6 +85,7 @@ cilium-agent [flags]
      --enable-host-port                                     Enable k8s hostPort mapping feature (requires enabling enable-node-port) (default true)
      --enable-host-reachable-services                       Enable reachability of services for host applications
      --enable-hubble                                        Enable hubble server
+      --enable-hubble-recorder-api                           Enable the Hubble recorder API


One small nit, but can be done as follow-up. I would love to avoid adding yet another flag here. Given we already need --enable-recorder=true and if users also have --enable-hubble=true, then this could also automatically imply that the hubble recorder api will be enabled. So I would just remove this additional flag, given it's more hassle for users to configure, wdyt?

I think we can default the flag to true if the recorder is enabled, but I'd like to have a flag to turn off the API, since it's quite privileged.

@gandro Ok, that works as well. In that sense the use case for --enable-recorder=true and --enable-hubble-recorder-api=false would be if someone would hook up their own agent to configure recorder objects for the datapath (e.g. using same API as the CLI) and then process the traffic from the perf RB instead of letting the agent do it. I presume the agent doesn't hook up anything to consume the perf RB w/o active users, right?

Correct. Maybe to expand on that a bit:
I mainly added the enable-hubble-recorder-api flag because we might also want to add an enable-hubble-observer-api=false flag to disable the "main" Hubble API. Right now, users of the Recorder API also have to enable normal Hubble, which potentially adds some non-negligible overhead to the agent's memory and CPU consumption.

So just from the Hubble side alone it seems reasonable to me to maybe be able to turn off specific parts of Hubble if a user really cares about performance.
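
The defaulting behaviour proposed in this thread (API on whenever the recorder and Hubble are enabled, unless explicitly switched off) could be sketched as below. This is only an illustration using the standard library `flag` package, not how the agent actually wires its options; `recorderAPIEnabled` is a hypothetical helper, and only the three flag names are taken from the discussion.

```go
package main

import (
	"flag"
	"io"
)

// recorderAPIEnabled parses a hypothetical agent command line and decides
// whether the Hubble Recorder API should be served.
func recorderAPIEnabled(args []string) (bool, error) {
	fs := flag.NewFlagSet("agent", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	recorder := fs.Bool("enable-recorder", false, "enable the datapath recorder")
	hubble := fs.Bool("enable-hubble", false, "enable the Hubble server")
	api := fs.Bool("enable-hubble-recorder-api", false, "enable the Hubble Recorder API")
	if err := fs.Parse(args); err != nil {
		return false, err
	}

	// Visit only walks flags that were set on the command line, which lets us
	// tell "left at default" apart from "explicitly set to false".
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "enable-hubble-recorder-api" {
			explicit = true
		}
	})

	// Without both the recorder and Hubble there is nothing to serve,
	// regardless of what the API flag says.
	if !*recorder || !*hubble {
		return false, nil
	}
	if explicit {
		return *api, nil
	}
	// Not set explicitly: default to enabled.
	return true, nil
}
```

With this logic, `--enable-recorder=true --enable-hubble=true` serves the API, and adding `--enable-hubble-recorder-api=false` turns it off again for users who only want the datapath recorder.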

gandro added the release-note/minor, sig/hubble, and feature/lb-only labels Apr 13, 2021

gandro requested a review from a team April 13, 2021 20:41

gandro requested review from a team as code owners April 13, 2021 20:41

gandro requested review from a team, jibi and rolinh and removed request for a team April 13, 2021 20:41

maintainer-s-little-helper bot assigned jibi and rolinh Apr 13, 2021

maintainer-s-little-helper bot added this to In progress in 1.10.0 Apr 13, 2021

gandro marked this pull request as draft April 13, 2021 20:42

gandro unassigned jibi and rolinh Apr 13, 2021

This was referenced Apr 13, 2021

hubble: Add Recorder API #15385

Closed

cmd: Add record subcommand cilium/hubble#530

Merged

gandro removed request for jibi and rolinh April 14, 2021 08:06

borkmann force-pushed the pr/recorder-mask branch 3 times, most recently from 017f2bb to 773d846 Compare April 14, 2021 08:54

borkmann mentioned this pull request Apr 14, 2021

cilium: pcap recorder agent management #15633

Merged

borkmann force-pushed the pr/recorder-mask branch 2 times, most recently from 0cd0232 to ceb5b86 Compare April 14, 2021 13:25

ti-mo reviewed Apr 15, 2021

gandro added 3 commits April 15, 2021 15:40

gandro force-pushed the pr/gandro/hubble-recorder-api branch from 4ded4cd to 39455da Compare April 15, 2021 13:41

gandro removed the dont-merge/blocked label Apr 15, 2021

gandro requested review from tklauser and removed request for a team and jrfastab April 15, 2021 13:46

maintainer-s-little-helper bot assigned tklauser Apr 15, 2021

gandro force-pushed the pr/gandro/hubble-recorder-api branch from 39455da to 512492f Compare April 15, 2021 14:46

ti-mo approved these changes Apr 15, 2021

jrfastab reviewed Apr 15, 2021

jrfastab approved these changes Apr 15, 2021

gandro added 2 commits April 16, 2021 09:41

gandro force-pushed the pr/gandro/hubble-recorder-api branch from 512492f to 0989fee Compare April 16, 2021 07:41

gandro added the ready-to-merge label Apr 16, 2021

borkmann approved these changes Apr 16, 2021

borkmann merged commit a3ad0a1 into cilium:master Apr 16, 2021

1.10.0 automation moved this from In progress to Done Apr 16, 2021

borkmann mentioned this pull request Apr 16, 2021

pcap recorder / lb follow-ups #15712

Open

12 tasks

This was referenced Apr 19, 2021

Revert "cilium, recorder: rebuild upon wildcard mask change" #15766

Closed

daemon: Make Hubble Recorder API opt-out #15781

Merged

This was referenced Apr 28, 2021

Prepare for release v1.10.0-rc1 #15896

Closed

Prepare for release v1.10.0-rc1 #15897

Merged
