-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hubble: Add recorder API #15680
hubble: Add recorder API #15680
Conversation
017f2bb
to
773d846
Compare
0cd0232
to
ceb5b86
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I have some concerns around the ktime_get_ns
offset calculation logic, feel free to ping offline if you need some more input.
pkg/hubble/recorder/sink/dispatch.go
Outdated
return d.startWallTime.Add(elapsedSinceStart) | ||
} | ||
|
||
func getTimeNow() (bootTime int64, wallTime time.Time, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented something similar here: https://github.com/ti-mo/conntracct/blob/master/pkg/boottime/boottime.go. I opted to expose runtime.nanotime()
in my package so I wouldn't have to make a slow syscall, and vDSO is used as a source for both timestamps. Feel free to take a look, it's well-documented.
A few traps I've fallen into while implementing this:
- There can be an arbitrary amount of time between the execution of
time.Now()
andunix.ClockGettime()
, especially under scheduler pressure during program startup. This can be slightly improved by running these 2 calls underruntime.LockOSThread()
, which at least takes the Go scheduler out of the equation. Ideally, these results need to be sampled and averaged to eliminate (OS) scheduler jitter. (and it still won't be accurate 😅) - Hardware timer pauses (during machine/laptop suspend, VM migrations or other virtualization-related artifacts) invalidate this offset.
CLOCK_BOOTTIME
does not advance during a hardware pause, so it's possible for 2 events that occurred right before and right after a (let's say, 1hr) pause window to have timestamps that are 1ns apart. This happens much more often than you'd expect. (the cloud is someone else's computer after all) This will result in events seemingly occurring earlier than they actually have, causing events to fall outside of a pcap window. (being a few seconds off can already be an issue if a user runs e.g. acurl
right after starting a pcap) The solution I came up with is refreshing the offset every 5/10 seconds or so, since userspace has no way of knowing when these pauses occur.
Hope this helps!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton for the in-depth reply here. I haven't considered the scheduler issue, I might add a LockOSThread
and do a few repeats to reduce the error, but yeah, anything we do here will always be best-effort I'm afraid.
So the reason I picked CLOCK_BOOTTIME
, is because this is what the datapath is using, see:
Lines 78 to 83 in 2ff996c
/* For later pcap file generation, we export boot time to the RB | |
* such that user space can later reconstruct a real time of day | |
* timestamp in-place. | |
*/ | |
cilium_capture(ctx, CAPTURE_INGRESS, rule_id, | |
bpf_ktime_cache_set(boot_ns), cap_len); |
bpf_ktime_get_boot_ns
is CLOCK_BOOTTIME, see https://lkml.org/lkml/2020/4/20/1443
If CLOCK_BOOTTIME
is not what we want, then we need to fix that in the datapath first. But, I actually believe we do want CLOCK_BOOTTIME
. My reading is that CLOCK_BOOTTIME
is not affected by your concerns regarding suspension (while runtime.nanotime()
, which is CLOCK_MONOTONIC
, is affected)
CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2).
https://linux.die.net/man/2/clock_gettime
Unless I'm misinterpreting something here I'll need to stick with CLOCK_BOOTTIME
which runtime
unfortunately does not expose (see also golang/go#24595). But I like your linker trick to access it anyway :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bpf_ktime_get_boot_ns is CLOCK_BOOTTIME, see https://lkml.org/lkml/2020/4/20/1443
Oh, wasn't aware ktime_get_boot_ns
made it in, my implementation predates that helper, and so relies on monotonic being available only. 100% correct on the _MONOTONIC
vs. _BOOTTIME
, didn't catch that.
For posterity, the kernel commit that introduced it (71d19214776e) only landed in Linux 5.8, so this lb-only datapath code is not backwards-compatible. bpf_ktime_cache_set(boot_ns)
does indeed call ktime_get_boot_ns()
, so the 5.8 (or backport kernel) req should probably be documented in #15633.
@borkmann for future (and backwards) compat, userspace is going to need to know what the BPF event's clock source is (mono/boottime/timeofday, mentioning all three since we might want to target pre-5.8 at some point). Maybe we should already extend struct pcap_timeoff
and make sure userspace can make that distinction? (thinking about upgrade safety)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the 02e55a7 will check for availability and bail out if --enable-recorder=true
is used and kernel doesn't have the boot time source. Potentially if users don't care too much about the specific clock, we could also use monotonic and don't translate anything from hubble side (apart from timeval conversion). I would probably wait till such need comes up. After my blocker PRs are done, I'm planning to document all stand-alone LB features as a getting started guide for Cilium in detail, so I'll definitely include such kernel requirements there (right now only the agent will tell it in its error, but a GSG of similar scope and depth as https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/ would be super useful for the LB-only mode).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the mechanism to fetch the diff multiple times (to be more robust) and take the minimum. I also simplified the clock offset calculation a bit.
This commit is intended to make Hubble more usable when Cilium is running in LB-only mode. Identity lookup in the Hubble parser might fail for various reasons, for example, when running in the LB-datapath mode. Since the user cannot do anything about this (and the absent data is still detectable via Hubble API), stop emitting a warning in the logs and use a debug statement instead. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds the protobuf definition for the new Hubble Recorder API. It is intended to be used for low-level packet capture on the XDP datapath parts when Cilium is running in LB-mode. Therefore, it only supports 5-tuple filters instead of the more expressive Hubble flow metadata queries of the Hubble Observer API. To start a recording, the client has to send a `StartRecording` message via the `Record` method. To stop it, a `StopRecording` message must be sent. This means that the recording itself is bound to a client context and therefore allows the server to stop a recording if the client has disconnected. The stop message is explicit such that the client can wait for the final status report. This API has been designed to be possibly extended in the future to support other kinds of sinks. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Also remove the unneccesary invocation of `apk add`. The builder image already contains all necessary tools. The image version will be updated in a subsequent PR, as currently the infrastructure to push new versions of this image is not present in our GitHub Actions setup. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
4ded4cd
to
39455da
Compare
test-me-please Edit: Fixed linter |
39455da
to
512492f
Compare
test-me-please |
pkg/hubble/recorder/sink/dispatch.go
Outdated
bootTimeOffset = offset | ||
} | ||
} | ||
runtime.UnlockOSThread() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks! Originally I had more code before the return statement so I wanted to do an early unlock, but that's not needed anymore in the current version.
Will remove once CI is green (to avoid re-triggering it)
message Filter { | ||
// source_cidr. Must not be empty. | ||
// Set to 0.0.0.0/0 to match any IPv4 source address (::/0 for IPv6). | ||
string source_cidr = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking because this looks like it becomes API. How would we extend this in the future for ARP or non-{TCP/UDP} traffic. I assume zero these fields and set the protocol? Why not default them to zero if not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the current API is intentionally very restrictive in that it requires the user to explicitly pick between IPv4 or IPv6. This is why I don't default to zero, because it would not be clear if the user means IPv4 or IPv6 "zero". If the user wants any IPv4 and any IPv6 traffic captured, they need to submit two filters to make that explicit. We can always relax that in the future and make the API more ergonomic, but I'd prefer for the initial version to be a 1:1 mapping to the datapath filters to make it more transparent how expensive certain filters are.
You do bring up a good point in that I have not considered how this API would look like for non-IP traffic, because it does mean that an empty value here could either mean "match any ip traffic" or "match only non-ip traffic". Maybe we should wrap the L3 and L4 fields separately, such that a filter for a specific layer can explicitly be absent with well-defined semantics. I'll play around to see how complicated that would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thinking about it some more, I think the current API should be future proof enough. We can relax it similarly to how the normal Hubble FlowFilter
works, i.e. where a absent field just means "don't care about this, check if any other field is applicable". This should give us enough flexibility to support adding new fields to e.g. filter ARP traffic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple small nits but overall LGTM
This commit adds the API implementation and required plumbing for the Hubble Recorder API. It contains of three main components: - `pkg/hubble/recorder/pcap` is a minimalistic library to write pcap files into a `io.Writer`. - `pkg/hubble/recorder/sink` contains the recorder sinks. Whenever a new datapath recorder has been set up, the corresponding captures are pushed into the monitor perf event ring buffer. The `sink.Dispatch` type attaches to the monitor to receive and decode these kind of events and dispatches incoming packets to registered pcap writers. - `pkg/hubble/recorder` contains the API implementation. It is responsible to start and stop recordings on behalf of the client. It is responsible to allocate `ruleIDs` for the datapath filters. When a new recording is started, it calls into `pkg/recorder` to install the filters and into `pkg/hubble/recorder/sink` to set up the corresponding file sinks. Its options are defined in a separate `recorderoption` package. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds the required command-line arguments and setup code in the Cilium Agent to serve the Hubble Recorder API. It is only served on the Hubble Unix domain socket and needs to be explicitly enabled. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
512492f
to
0989fee
Compare
Addressed the nit by Timo: https://github.com/cilium/cilium/compare/512492f0c71d6ae64c892022e16a2fb239131eee..0989fee5bc8d168b33877552d1045a93c5095fd3 CI was all green, except https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/208/ (NodePort request failed with Exitcode: 42) which looks very much like an unrelated fluke. f0d79f5d_K8sServicesTest_Checks_service_across_nodes_with_L7_policy_Tests_NodePort_with_L7_Policy.zip I'm marking this ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small nit, but can be resolved after merge. Great work @gandro !
@@ -85,6 +85,7 @@ cilium-agent [flags] | |||
--enable-host-port Enable k8s hostPort mapping feature (requires enabling enable-node-port) (default true) | |||
--enable-host-reachable-services Enable reachability of services for host applications | |||
--enable-hubble Enable hubble server | |||
--enable-hubble-recorder-api Enable the Hubble recorder API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small nit, but can be done as follow-up. I would love to avoid adding yet another flag here. Given we already need --enable-recorder=true
and if users also have --enable-hubble=true
, then this could also automatically imply that the hubble recorder api will be enabled. So I would just remove this additional flag, given it's more hassle for users to configure, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can default the flag to true if the recorder is enabled, but I'd like to have a flag to turn off the API, since it's quite privileged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gandro Ok, that works as well. In that sense the use case for --enable-recorder=true
and --enable-hubble-recorder-api=false
would be if someone would hook up their own agent to configure recorder objects for the datapath (e.g. using same API as the CLI) and then process the traffic from the perf RB instead of letting the agent do it. I presume the agent doesn't hook up anything to consume the perf RB w/o active users, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Maybe to extend on that a bit:
I mainly added the enable-hubble-recorder-api
because we might also want to add a enable-hubble-observer-api=false
flag to disable the "main" Hubble API. Right now, users of the Recorder API also have to enable normal Hubble which potentially has some non-negligible overhead to the agent's memory and cpu consumption.
So just from the Hubble side alone it seems reasonable to me to maybe be able to turn off specific parts of Hubble if a user really cares about performance.
This commit adds a new API and implementation for what we call
the Hubble Recorder API. It is intended to be used for low-level
packet capture on the XDP datapath parts when Cilium is running in
LB-mode. Therefore, it only supports 5-tuple filters instead of the more
expressive Hubble flow metadata queries of the Hubble Observer API.
To start a recording, the client has to send a
StartRecording
messagevia the
Record
method. To stop it, aStopRecording
message must besent. This means that the recording itself is bound to a client context
and therefore allows the server to stop a recording if the client has
disconnected. The stop message is explicit such that the client can wait
for the final status report.
This API has been designed to be possibly extended in the future to
support other kinds of sinks.
See commit messages for details.