Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble: Add recorder API #15680

Merged
merged 5 commits into from
Apr 16, 2021
Merged

Conversation

gandro
Copy link
Member

@gandro gandro commented Apr 13, 2021

This commit adds a new API and implementation for what we call
the Hubble Recorder API. It is intended to be used for low-level
packet capture on the XDP datapath parts when Cilium is running in
LB-mode. Therefore, it only supports 5-tuple filters instead of the more
expressive Hubble flow metadata queries of the Hubble Observer API.

To start a recording, the client has to send a StartRecording message
via the Record method. To stop it, a StopRecording message must be
sent. This means that the recording itself is bound to a client context
and therefore allows the server to stop a recording if the client has
disconnected. The stop message is explicit such that the client can wait
for the final status report.

This API has been designed to be possibly extended in the future to
support other kinds of sinks.

See commit messages for details.

@gandro gandro added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay feature/lb-only Impacts cilium running in lb-only datapath mode labels Apr 13, 2021
@gandro gandro requested a review from a team April 13, 2021 20:41
@gandro gandro requested review from a team as code owners April 13, 2021 20:41
@gandro gandro requested review from a team, jibi and rolinh and removed request for a team April 13, 2021 20:41
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Apr 13, 2021
@gandro gandro marked this pull request as draft April 13, 2021 20:42
@gandro gandro unassigned jibi and rolinh Apr 13, 2021
@gandro gandro removed request for jibi and rolinh April 14, 2021 08:06
@borkmann borkmann force-pushed the pr/recorder-mask branch 3 times, most recently from 017f2bb to 773d846 Compare April 14, 2021 08:54
@borkmann borkmann force-pushed the pr/recorder-mask branch 2 times, most recently from 0cd0232 to ceb5b86 Compare April 14, 2021 13:25
Copy link
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I have some concerns around the ktime_get_ns offset calculation logic, feel free to ping offline if you need some more input.

pkg/hubble/recorder/sink/dispatch.go Outdated Show resolved Hide resolved
return d.startWallTime.Add(elapsedSinceStart)
}

func getTimeNow() (bootTime int64, wallTime time.Time, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented something similar here: https://github.com/ti-mo/conntracct/blob/master/pkg/boottime/boottime.go. I opted to expose runtime.nanotime() in my package so I wouldn't have to make a slow syscall, and vDSO is used as a source for both timestamps. Feel free to take a look, it's well-documented.

A few traps I've fallen into while implementing this:

  • There can be an arbitrary amount of time between the execution of time.Now() and unix.ClockGettime(), especially under scheduler pressure during program startup. This can be slightly improved by running these 2 calls under runtime.LockOSThread(), which at least takes the Go scheduler out of the equation. Ideally, these results need to be sampled and averaged to eliminate (OS) scheduler jitter. (and it still won't be accurate 😅)
  • Hardware timer pauses (during machine/laptop suspend, VM migrations or other virtualization-related artifacts) invalidate this offset. CLOCK_BOOTTIME does not advance during a hardware pause, so it's possible for 2 events that occurred right before and right after a (let's say, 1hr) pause window to have timestamps that are 1ns apart. This happens much more often than you'd expect. (the cloud is someone else's computer after all) This will result in events seemingly occurring earlier than they actually have, causing events to fall outside of a pcap window. (being a few seconds off can already be an issue if a user runs e.g. a curl right after starting a pcap) The solution I came up with is refreshing the offset every 5/10 seconds or so, since userspace has no way of knowing when these pauses occur.

Hope this helps!

Copy link
Member Author

@gandro gandro Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for the in-depth reply here. I haven't considered the scheduler issue, I might add a LockOSThread and do a few repeats to reduce the error, but yeah, anything we do here will always be best-effort I'm afraid.

So the reason I picked CLOCK_BOOTTIME, is because this is what the datapath is using, see:

cilium/bpf/lib/pcap.h

Lines 78 to 83 in 2ff996c

/* For later pcap file generation, we export boot time to the RB
* such that user space can later reconstruct a real time of day
* timestamp in-place.
*/
cilium_capture(ctx, CAPTURE_INGRESS, rule_id,
bpf_ktime_cache_set(boot_ns), cap_len);

bpf_ktime_get_boot_ns is CLOCK_BOOTTIME, see https://lkml.org/lkml/2020/4/20/1443

If CLOCK_BOOTTIME is not what we want, then we need to fix that in the datapath first. But, I actually believe we do want CLOCK_BOOTTIME. My reading is that CLOCK_BOOTTIME is not affected by your concerns regarding suspension (while runtime.nanotime(), which is CLOCK_MONOTONIC, is affected)

CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2).
https://linux.die.net/man/2/clock_gettime

Unless I'm misinterpreting something here I'll need to stick with CLOCK_BOOTTIME which runtime unfortunately does not expose (see also golang/go#24595). But I like your linker trick to access it anyway :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bpf_ktime_get_boot_ns is CLOCK_BOOTTIME, see https://lkml.org/lkml/2020/4/20/1443

Oh, wasn't aware ktime_get_boot_ns made it in, my implementation predates that helper, and so relies on monotonic being available only. 100% correct on the _MONOTONIC vs. _BOOTTIME, didn't catch that.

For posterity, the kernel commit that introduced it (71d19214776e) only landed in Linux 5.8, so this lb-only datapath code is not backwards-compatible. bpf_ktime_cache_set(boot_ns) does indeed call ktime_get_boot_ns(), so the 5.8 (or backport kernel) req should probably be documented in #15633.

@borkmann for future (and backwards) compat, userspace is going to need to know what the BPF event's clock source is (mono/boottime/timeofday, mentioning all three since we might want to target pre-5.8 at some point). Maybe we should already extend struct pcap_timeoff and make sure userspace can make that distinction? (thinking about upgrade safety)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the 02e55a7 will check for availability and bail out if --enable-recorder=true is used and kernel doesn't have the boot time source. Potentially if users don't care too much about the specific clock, we could also use monotonic and don't translate anything from hubble side (apart from timeval conversion). I would probably wait till such need comes up. After my blocker PRs are done, I'm planning to document all stand-alone LB features as a getting started guide for Cilium in detail, so I'll definitely include such kernel requirements there (right now only the agent will tell it in its error, but a GSG of similar scope and depth as https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/ would be super useful for the LB-only mode).

Copy link
Member Author

@gandro gandro Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the mechanism to fetch the diff multiple times (to be more robust) and take the minimum. I also simplified the clock offset calculation a bit.

pkg/hubble/recorder/sink/dispatch.go Outdated Show resolved Hide resolved
This commit is intended to make Hubble more usable when Cilium is
running in LB-only mode.

Identity lookup in the Hubble parser might fail for various reasons, for
example, when running in the LB-datapath mode. Since the user cannot do
anything about this (and the absent data is still detectable via Hubble
API), stop emitting a warning in the logs and use a debug statement
instead.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds the protobuf definition for the new Hubble Recorder
API. It is intended to be used for low-level packet capture on the XDP
datapath parts when Cilium is running in LB-mode. Therefore, it only
supports 5-tuple filters instead of the more expressive Hubble flow
metadata queries of the Hubble Observer API.

To start a recording, the client has to send a `StartRecording` message
via the `Record` method. To stop it, a `StopRecording` message must be
sent. This means that the recording itself is bound to a client context
and therefore allows the server to stop a recording if the client has
disconnected. The stop message is explicit such that the client can wait
for the final status report.

This API has been designed to be possibly extended in the future to
support other kinds of sinks.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Also remove the unneccesary invocation of `apk add`. The builder image
already contains all necessary tools.

The image version will be updated in a subsequent PR, as currently the
infrastructure to push new versions of this image is not present in our
GitHub Actions setup.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/hubble-recorder-api branch from 4ded4cd to 39455da Compare April 15, 2021 13:41
@gandro gandro removed the dont-merge/blocked Another PR must be merged before this one. label Apr 15, 2021
@gandro gandro requested review from tklauser and removed request for a team and jrfastab April 15, 2021 13:46
@gandro
Copy link
Member Author

gandro commented Apr 15, 2021

test-me-please

Edit: Fixed linter

@gandro gandro force-pushed the pr/gandro/hubble-recorder-api branch from 39455da to 512492f Compare April 15, 2021 14:46
@gandro
Copy link
Member Author

gandro commented Apr 15, 2021

test-me-please

bootTimeOffset = offset
}
}
runtime.UnlockOSThread()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this can be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! Originally I had more code before the return statement so I wanted to do an early unlock, but that's not needed anymore in the current version.

Will remove once CI is green (to avoid re-triggering it)

message Filter {
// source_cidr. Must not be empty.
// Set to 0.0.0.0/0 to match any IPv4 source address (::/0 for IPv6).
string source_cidr = 1;
Copy link
Contributor

@jrfastab jrfastab Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking because this looks like it becomes API. How would we extend this in the future for ARP or non-{TCP/UDP} traffic. I assume zero these fields and set the protocol? Why not default them to zero if not set?

Copy link
Member Author

@gandro gandro Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the current API is intentionally very restrictive in that it requires the user to explicitly pick between IPv4 or IPv6. This is why I don't default to zero, because it would not be clear if the user means IPv4 or IPv6 "zero". If the user wants any IPv4 and any IPv6 traffic captured, they need to submit two filters to make that explicit. We can always relax that in the future and make the API more ergonomic, but I'd prefer for the initial version to be a 1:1 mapping to the datapath filters to make it more transparent how expensive certain filters are.

You do bring up a good point in that I have not considered how this API would look like for non-IP traffic, because it does mean that an empty value here could either mean "match any ip traffic" or "match only non-ip traffic". Maybe we should wrap the L3 and L4 fields separately, such that a filter for a specific layer can explicitly be absent with well-defined semantics. I'll play around to see how complicated that would be.

Copy link
Member

@borkmann borkmann Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro @jrfastab Hmm, right now at the point where we add the cilium_capture_{in,out}() we will never see non-ip traffic. But I guess there could be a l3 filter of some sort, so maybe the api could be extended with an proto_l3 field, not sure yet. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thinking about it some more, I think the current API should be future proof enough. We can relax it similarly to how the normal Hubble FlowFilter works, i.e. where a absent field just means "don't care about this, check if any other field is applicable". This should give us enough flexibility to support adding new fields to e.g. filter ARP traffic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, sounds reasonable to me.

Copy link
Contributor

@jrfastab jrfastab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple small nits but overall LGTM

This commit adds the API implementation and required plumbing for the
Hubble Recorder API. It contains of three main components:

 - `pkg/hubble/recorder/pcap` is a minimalistic library to write pcap
   files into a `io.Writer`.

 - `pkg/hubble/recorder/sink` contains the recorder sinks. Whenever a
   new datapath recorder has been set up, the corresponding captures are
   pushed into the monitor perf event ring buffer. The `sink.Dispatch` type
   attaches to the monitor to receive and decode these kind of events and
   dispatches incoming packets to registered pcap writers.

 - `pkg/hubble/recorder` contains the API implementation. It is
   responsible to start and stop recordings on behalf of the client.
   It is responsible to allocate `ruleIDs` for the datapath filters.
   When a new recording is started, it calls into `pkg/recorder` to install
   the filters and into `pkg/hubble/recorder/sink` to set up the
   corresponding file sinks. Its options are defined in a separate
   `recorderoption` package.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds the required command-line arguments and setup code in
the Cilium Agent to serve the Hubble Recorder API. It is only served on
the Hubble Unix domain socket and needs to be explicitly enabled.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/hubble-recorder-api branch from 512492f to 0989fee Compare April 16, 2021 07:41
@gandro
Copy link
Member Author

gandro commented Apr 16, 2021

Addressed the nit by Timo: https://github.com/cilium/cilium/compare/512492f0c71d6ae64c892022e16a2fb239131eee..0989fee5bc8d168b33877552d1045a93c5095fd3

CI was all green, except https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/208/ (NodePort request failed with Exitcode: 42) which looks very much like an unrelated fluke. f0d79f5d_K8sServicesTest_Checks_service_across_nodes_with_L7_policy_Tests_NodePort_with_L7_Policy.zip

I'm marking this ready to merge.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 16, 2021
Copy link
Member

@borkmann borkmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit, but can be resolved after merge. Great work @gandro !

@@ -85,6 +85,7 @@ cilium-agent [flags]
--enable-host-port Enable k8s hostPort mapping feature (requires enabling enable-node-port) (default true)
--enable-host-reachable-services Enable reachability of services for host applications
--enable-hubble Enable hubble server
--enable-hubble-recorder-api Enable the Hubble recorder API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit, but can be done as follow-up. I would love to avoid adding yet another flag here. Given we already need --enable-recorder=true and if users also have --enable-hubble=true, then this could also automatically imply that the hubble recorder api will be enabled. So I would just remove this additional flag, given it's more hassle for users to configure, wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can default the flag to true if the recorder is enabled, but I'd like to have a flag to turn off the API, since it's quite privileged.

Copy link
Member

@borkmann borkmann Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro Ok, that works as well. In that sense the use case for --enable-recorder=true and --enable-hubble-recorder-api=false would be if someone would hook up their own agent to configure recorder objects for the datapath (e.g. using same API as the CLI) and then process the traffic from the perf RB instead of letting the agent do it. I presume the agent doesn't hook up anything to consume the perf RB w/o active users, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Maybe to extend on that a bit:
I mainly added the enable-hubble-recorder-api because we might also want to add a enable-hubble-observer-api=false flag to disable the "main" Hubble API. Right now, users of the Recorder API also have to enable normal Hubble which potentially has some non-negligible overhead to the agent's memory and cpu consumption.

So just from the Hubble side alone it seems reasonable to me to maybe be able to turn off specific parts of Hubble if a user really cares about performance.

@borkmann borkmann merged commit a3ad0a1 into cilium:master Apr 16, 2021
1.10.0 automation moved this from In progress to Done Apr 16, 2021
@borkmann borkmann mentioned this pull request Apr 16, 2021
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/lb-only Impacts cilium running in lb-only datapath mode ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

8 participants