This repo describes a very rough proof-of-concept for integrating Zed with BPF. The idea is to arrange for BPF to emit events and aggregates directly in the super-structured Zed format and stream updates live into a Zed lake.
Zed's super-structured approach allows data to be completely self describing using its comprehensive type system so that external schemas do not need to be defined and declared for richly typed data.
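For example, a single line of ZSON (the values here are illustrative) carries all of its type information inline: the syntax alone says that `ts` is a time, `addr` is an IP address, and `count` is a `uint64`, with no schema declared anywhere:
```
{ts:2021-11-22T14:45:08Z,addr:10.0.0.1,count:650(uint64)}
```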
Note that we are using the term "BPF" to refer to the modern eBPF Linux subsystem, since the original BPF is now called "classic BPF", as explained by Brendan Gregg in his LISA21 talk.
The examples described here are directly embedded in the Python tooling from the BPF Compiler Collection (BCC), but the approach would also be applicable to bpftrace or to any custom BPF application.
Dealing with data is hard. To keep things simple, people often reduce their rich data to go-to, ubiquitous formats like CSV and JSON. While simple in appearance, these formats can be frustrating in practice.
We were recently interacting with a colleague at a research university who was instrumenting database performance using BPF. The team there wrote a bunch of custom BCC apps, which generated huge numbers of large CSV files. They would then analyze this bulky data with custom Python/Pandas code, along with running SQL analytics on an RDBMS.
Our colleague was pulling his hair out: running out of disk space, dealing with thousands of CSV files in his local file system conforming to many different and changing column layouts, and updating tables in the SQL database to track changes in their collection methodology. It was a nightmare.
Then our friend remembered Zed and asked, "Isn't this where Zed is supposed to help?" Yes, of course!
The obvious approach here would be to simply load all the CSV files into a Zed lake (e.g., running on S3 to overcome storage limits), then run SQL and/or Zed queries on the Zed lake. Yet we wondered if there could be some benefit in a deeper integration with BPF.
One of the difficulties in data management is that the source of data often has rich and detailed information about the types and structure of this origin data, only to throw that information away when serializing into formats like CSV or JSON. Indeed, in BPF, the BCC tooling has direct access to in-kernel C data structures with all of their type information and potentially hierarchical structure. The BCC Python code, for instance, accesses native BPF events via the `ctypes` library.
Yet our colleague's BCC tooling simply threw away this detailed type information and instead wrote its output as CSV records in a fixed set of hard-coded columns.
There must be a better way.
What if you could simply marshal any `ctypes` struct into a Zed record and efficiently serialize this record as ZNG? Then BPF code that captured native C events could simply marshal the data directly and send it along to a Zed lake.
Because Zed doesn't require schema definitions nor does it organize data
into tables, all these native C events could be intermingled and efficiently
transmitted into a Zed lake with minimal effort.
Moreover, if we mapped the type name of the C struct to a Zed type, then we could use Zed type queries to pull out all of the records of a given type. So, it's sort of like having tables without ever having to create them.
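To make this concrete, here is a minimal sketch of what such a marshaler might look like in Python. This is purely illustrative: `ExecEvent` and `marshal_ctypes` are hypothetical names, not part of any existing library, and a real marshaler would emit binary ZNG rather than ZSON text.
```python
import ctypes

# A ctypes struct mirroring an in-kernel event, as BCC generates them.
class ExecEvent(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_int64),
        ("ppid", ctypes.c_int64),
        ("comm", ctypes.c_char * 16),
    ]

def marshal_ctypes(event) -> str:
    """Render a ctypes struct as a ZSON record whose Zed type name
    is taken from the struct's class name."""
    fields = []
    for name, _ctype in event._fields_:
        value = getattr(event, name)
        if isinstance(value, bytes):
            value = '"%s"' % value.decode()
        fields.append("%s:%s" % (name, value))
    return "{%s}(=%s)" % (",".join(fields), type(event).__name__)

print(marshal_ctypes(ExecEvent(pid=204580, ppid=204536, comm=b"find")))
# => {pid:204580,ppid:204536,comm:"find"}(=ExecEvent)
```
Because the type name rides along with each record, a query like `is(type(ExecEvent))` could then pull out exactly these records, table-style, without any table ever having been created.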
To explore this idea of efficient serialization of BPF events as ZNG, we first developed the simple proof-of-concept here. We don't have a `ctypes` marshaler working yet, nor do we have Zed output serialized as ZNG. Rather, we instrumented a couple of BCC tools (namely, `execsnoop` and `stackcount`) with flags to generate the human-readable ZSON format and load this data directly into a Zed lake.
ZSON is much less efficient than ZNG, but we just wanted to try something simple here to see what people think and whether further work is warranted.
To explore this proof of concept, you will need a Linux environment with BPF enabled (the "Linux host") and a client environment (the "desktop host") to query the BPF data. The desktop host will query a Zed lake running on the Linux host.
If you happen to be running a BPF-enabled desktop or laptop, then the Linux host and desktop host could be one and the same.
Provision a Linux host with a recent kernel and a recent version of BPF enabled. See below for some hints on running a recent Linux using Vagrant on a MacBook, but any recent Linux running anywhere should work.
Install Zed (v0.33 or later) on your Linux host, following the instructions in the repository's GitHub Releases.
Install the latest Zed Python module on your Linux host:
pip3 install "git+https://github.com/brimdata/zed#subdirectory=python/zed"
Clone the Zed-modified BCC tooling. We forked the BCC repository and made the modifications on a branch therein called `zed`:
git clone https://github.com/brimdata/bcc.git
cd bcc
git checkout zed
A Zed lake service must be running during the experiments described below. In a terminal on your Linux host, create this service as follows:
mkdir scratch
zed lake serve -R scratch
In another terminal on the Linux host, create a Zed data pool for the BPF data and test that everything is working with `zapi ls`:

zapi create bpf
zapi ls

The `zapi ls` command should display the pool called `bpf`.
Note that the Zed-enhanced BCC tools are configured to write to the `bpf` pool.
Note that in a production environment, the Linux host would post data to a Zed lake running at scale elsewhere. For this experiment, we are simply running the Zed lake directly on the Linux host.
To query the Zed lake running on the Linux host, you should install Zed and/or Brim on your desktop/laptop.
Install these packages following the instructions in Zed releases (as described above for the Linux host) or Brim releases.
You can also build Zed from source.
By default, `zapi` connects to the lake service API at http://localhost:9867, which is also the default port used by `zed lake serve`. Thus, you can run `zapi` commands on the Linux host without any configuration.
On the desktop host, to point `zapi` at the lake running on the Linux host, you can use the `-lake` command-line option or simply set the `ZED_LAKE` environment variable, e.g.,

export ZED_LAKE=http://linux-host:9867

where `linux-host` is the IP or DNS name of the Linux host.
For the Vagrant setup described below, the desktop port 8098 is forwarded to the Linux port 9867, so you should use this for `ZED_LAKE`:

export ZED_LAKE=http://localhost:8098
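With `ZED_LAKE` set, you can verify connectivity from the desktop host the same way as on the Linux host:
```
zapi ls
```
which should again display the `bpf` pool.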
The Brim app is a desktop application based on Electron, similar to the Slack desktop model. Brim is a nice way to look at Zed data.
To open a lake inside Brim, click on the current lake name in the upper left of Brim's window. This will bring up a drop-down menu and you should click on the "Add Lake..." option at the bottom of the menu. A form will appear and you can enter a name (e.g., "Linux BPF Lake") and a URL for the lake. The URL should be one of the two options described above: http://linux-host:9867 or http://localhost:8098.
To run a BPF/Zed capture experiment on the Linux host, `cd` into the top-level directory of the forked BCC repo (remember you need to be on the `zed` git branch). Then, to run an experiment, specify the `-z` flag for Zed, and try either execsnoop:
sudo python3 ./tools/execsnoop.py -z
Or stackcount:
sudo python3 ./tools/stackcount.py -i 1 -P -z ip_output
For stackcount, we have specified `ip_output` as the kernel function to trace, but you can try any BPF-traceable function. We also specified `-i 1` so that data is transmitted to the Zed lake every second, and `-P` so we get per-process stats (in particular, the process name for each stack is present).
In either case, you can hit ctrl-C to terminate.
Note that these two programs have the pool "bpf" hardwired.
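Under the hood, the modification amounts to formatting each event as a line of ZSON and posting it to the hardwired `bpf` pool via the Zed Python module installed earlier. The exact code is in the forked BCC repo; the sketch below is just to convey the idea (the event value shown is illustrative):
```python
import zed

# Connect to the lake service started with "zed lake serve"
# (default address http://localhost:9867).
client = zed.Client("http://localhost:9867")

# One exec event rendered as ZSON; the (=exec) decorator names its Zed type.
event = (
    '{ts:2021-11-22T14:45:08Z,pcomm:"find",pid:204580,ppid:204536,'
    'ret:0,args:["/usr/bin/find","../bcc"]}(=exec)'
)

# Load the event into the hardwired "bpf" pool.
client.load("bpf", event)
```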
Now, you can look at the data in the pool with `zapi query`, run either on the Linux host or the desktop host (configured as described above):
zapi query -use bpf@main -Z "sum(count) by stack,name | sort sum"
Note that `execsnoop` uses the Zed type name "exec" for its records and `stackcount` likewise uses the name "stack". Thus, you can query by type to get the records that came from each different command:
zapi query -use bpf@main -Z "is(type(exec))"
zapi query -use bpf@main -Z "is(type(stack))"
For the examples below, we started both execsnoop and stackcount on the Linux host, then ran the bash script `workload` to generate synthetic activity.
With your BPF trace data in a Zed lake, searching and analyzing your data is a piece of cake. You can also slice and dice the data in the lake and export it in most any format you'd like (JSON, Parquet, CSV, or any Zed format).
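For example, assuming the `bpf` pool from above, you could export the exec events as CSV (cutting to flat columns, since CSV can't represent the `args` array) or the stack records as JSON. This is just a sketch using the `-f` and `-use` flags demonstrated elsewhere in this README:
```
zapi query -use bpf@main -f csv "is(type(exec)) | cut ts,pcomm,pid,ppid,ret" > exec.csv
zapi query -use bpf@main -f json "is(type(stack))" > stack.json
```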
First, make sure `zapi query` uses the `bpf` pool for your queries, as above, by running

zapi use bpf@main
Here is a simple query that counts the number of times the string "__tcp_transmit_skb" appears in a stack trace (from `stackcount`):
zapi query '"__tcp_transmit_skb" in stack | count()'
and you'll get a result that looks like this, since the default format is ZSON:
{count:650(uint64)}
If you want JSON output, just add `-f json`:
zapi query -f json '"__tcp_transmit_skb" in stack | count()'
{"count":650}
Here's another cool one. This counts up records by the `name` field, where the name could be in either a stack record or a process record:
zapi query "count() by name"
And you get output like this:
{name:"kworker/u4:3",count:4(uint64)}
{name:"kworker/u4:1",count:1(uint64)}
{name:"ps",count:2(uint64)}
{name:"python3",count:29(uint64)}
{name:"zq",count:1(uint64)}
{name:"zed",count:79(uint64)}
{name:"ksoftirqd/0",count:1(uint64)}
{name:"ls",count:10(uint64)}
{name:"swapper/1",count:21(uint64)}
{name:"find",count:3(uint64)}
{name:"ksoftirqd/1",count:11(uint64)}
{name:"sshd",count:54(uint64)}
{name:"systemd-resolve",count:6(uint64)}
{name:"grep",count:1(uint64)}
{name:"curl",count:31(uint64)}
{name:"swapper/0",count:2(uint64)}
Or if you want a table, you can specify `-f table`:
zapi query -f table "count() by name"
to get
name count
python3 29
curl 31
zed 79
kworker/u4:1 1
ksoftirqd/1 11
kworker/u4:3 4
ls 10
grep 1
zq 1
ps 2
systemd-resolve 6
swapper/0 2
sshd 54
find 3
ksoftirqd/0 1
swapper/1 21
Here is a more sophisticated query where we sum up the counts from every 1 second sampling interval and we use the "stack" and the process "name" as group-by keys. Note that the Zed query language is perfectly happy using any Zed value (including arrays) as a group-by key.
zapi query -Z "sum(count) by stack,name | sort -r sum | head 2"
This will give the top two stack traces and will look something like this in the pretty-printed (`-Z`) ZSON output:
{
stack: [
"ip_output",
"ip_queue_xmit",
"__tcp_transmit_skb",
"tcp_write_xmit",
"__tcp_push_pending_frames",
"tcp_push",
"tcp_sendmsg_locked",
"tcp_sendmsg",
"inet_sendmsg",
"sock_sendmsg",
"sock_write_iter",
"new_sync_write",
"vfs_write",
"ksys_write",
"__x64_sys_write",
"do_syscall_64",
"entry_SYSCALL_64_after_hwframe"
],
name: "sshd",
sum: 334
}
{
stack: [
"ip_output",
"ip_queue_xmit",
"__tcp_transmit_skb",
"tcp_write_xmit",
"__tcp_push_pending_frames",
"tcp_push",
"tcp_sendmsg_locked",
"tcp_sendmsg",
"inet6_sendmsg",
"sock_sendmsg",
"sock_write_iter",
"new_sync_write",
"vfs_write",
"ksys_write",
"__x64_sys_write",
"do_syscall_64",
"entry_SYSCALL_64_after_hwframe"
],
name: "zed",
sum: 136
}
Here is an even more sophisticated example. This query computes the unique set of stacks grouped by the parent caller of the traced function (in this case `ip_output`):
zapi query -Z "has(stack[1]) | stacks:=union(stack) by callee:=stack[1]"
giving output like this:
{
callee: "tcp_v4_send_synack",
stacks: |[
[
"ip_output",
"tcp_v4_send_synack",
"tcp_conn_request",
"tcp_v4_conn_request",
"tcp_v6_conn_request",
"tcp_rcv_state_process",
"tcp_v4_do_rcv",
"tcp_v4_rcv",
"ip_protocol_deliver_rcu",
"ip_local_deliver_finish",
"ip_local_deliver",
"ip_sublist_rcv_finish",
"ip_list_rcv_finish.constprop.0",
"ip_sublist_rcv",
"ip_list_rcv",
"__netif_receive_skb_list_core",
"__netif_receive_skb_list",
"netif_receive_skb_list_internal",
"napi_complete_done",
"e1000_clean",
"napi_poll",
"net_rx_action",
"__softirqentry_text_start",
"asm_call_sysvec_on_stack",
"do_softirq_own_stack",
"irq_exit_rcu",
"common_interrupt",
"asm_common_interrupt"
],
[
"ip_output",
"tcp_v4_send_synack",
"tcp_conn_request",
"tcp_v4_conn_request",
"tcp_v6_conn_request",
"tcp_rcv_state_process",
"tcp_v4_do_rcv",
"tcp_v4_rcv",
"ip_protocol_deliver_rcu",
"ip_local_deliver_finish",
"ip_local_deliver",
"ip_rcv_finish",
"ip_rcv",
"__netif_receive_skb_one_core",
"__netif_receive_skb",
"process_backlog",
"napi_poll",
"net_rx_action",
"__softirqentry_text_start",
"asm_call_sysvec_on_stack",
"do_softirq_own_stack",
"do_softirq",
"__local_bh_enable_ip",
"ip_finish_output2",
"__ip_finish_output",
"ip_finish_output",
"ip_output",
"__ip_queue_xmit",
"ip_queue_xmit",
"__tcp_transmit_skb",
"tcp_connect",
"tcp_v4_connect",
"__inet_stream_connect",
"inet_stream_connect",
"__sys_connect_file",
"__sys_connect",
"__x64_sys_connect",
"do_syscall_64",
"entry_SYSCALL_64_after_hwframe"
]
]|
}
...
Zed is a bit different and lets you put super-structured data all in one location. It's kind of like rich database tables without having to define tables and schemas ahead of time. While this may sound to you like a NoSQL store, e.g., Mongo or CouchDB, it's quite different because Zed data is super-structured instead of semi-structured: super-structured data has a well-defined type for every value whereas semi-structured data has implied types and the "shape" of a semi-structured data value can only be determined by traversing that value.
The power of super-structured data and Zed is that types are first class. This means you can put a type anywhere a value can go, and in particular, the Zed query language includes a `typeof()` operator that returns the type of a value as a value. So you can say things like
zapi query -Z 'count() by typeof(this)'
to see the "shape" of all the values in a data pool, e.g., giving output that looks like this:
{
typeof: (stack=({ts:time,name:string,ustack:[string],stack:[string],count:int64})),
count: 256 (uint64)
}
{
typeof: (exec=({ts:time,pcomm:string,pid:int64,ppid:int64,ret:int64,args:[string]})),
count: 30 (uint64)
}
Or you could get a sample of each shape by saying this:
zapi query -Z 'val:=any(this) by typeof(this) | cut val'
giving a result like this:
{
val: {
ts: 2021-11-22T14:45:09Z,
name: "ksoftirqd/1",
ustack: [] ([string]),
stack: [
"ip_output",
"ip_queue_xmit",
"__tcp_transmit_skb",
"tcp_write_xmit",
"tcp_tsq_write.part.0",
"tcp_tsq_handler",
"tcp_tasklet_func",
"tasklet_action_common.constprop.0",
"tasklet_action",
"__softirqentry_text_start",
"run_ksoftirqd",
"smpboot_thread_fn",
"kthread",
"ret_from_fork"
],
count: 1
} (=stack)
}
{
val: {
ts: 2021-11-22T14:45:08Z,
pcomm: "find",
pid: 204580,
ppid: 204536,
ret: 0,
args: [
"/usr/bin/find",
"../bcc"
]
} (=exec)
}
We think Zed could be a really great way to store and query telemetry data as it unifies events and metrics. It's not just a single format but a family of formats that all adhere to exactly the same data model. ZNG is the efficient row-based format and ZST is the columnar format.
The ZST implementation is still a bit early...
The Zed lake is designed to ingest row-based ZNG then run indexing, compaction, and columnar-conversion as data objects stabilize (there's a nice API for querying objects and their metadata --- of course using ZNG --- so these indexing workflows can be driven by external agents and different agents can be developed for different use cases).
Unlike search systems, all the queries work whether indexed or not and indexes simply speed things up. And if you change indexing rules, you don't need to reindex everything. Just the stuff you want the new rules to apply to.
I ran the experiments above using a Linux host configured on VirtualBox with Vagrant on a MacBook. I adjusted a few things from the instructions here, mainly to use a newer Ubuntu release, Hirsute (21.04), along with the more up-to-date BPF tooling referenced in BCC issue #2678.
I used the PPA here, but for these experiments I installed only bpfcc:
sudo add-apt-repository ppa:hadret/bpftrace
sudo apt update
sudo apt install bpfcc
Also, I added a port forwarding rule in the Vagrantfile:
Vagrant.configure("2") do |config|
  ...
  config.vm.network "forwarded_port", guest: 9867, host: 8098
end
If we have enough time, this section contains a quick outline of a demo that I will run through during the OpenObservability Podcast at 11am on November 23, 2021.
Motivated by a colleague at CMU.
Exploit super-structured Zed data to make BPF instrumentation easier.
Everything is open source.
- Based on this zbpf repository - can reproduce everything here
- Modified two BCC tools to produce Zed directly
  - execsnoop, stackcount
- The diffs are in a forked repo
- Ran experiments on my MacBook using Vagrant and a recent Linux
- Zed lake runs on the Linux host
- Modified BCC tooling streams super-structured Zed data straight to the lake
- Queries run on the desktop host
  - zapi
  - Brim app
TL;DR if you want to just play around with this BPF data without setting up a Linux environment or even a Zed lake, you can just run Brim using
- the bpf.zson file from our BPF experiment, and
- sample queries for Brim.
You can drag the sample data into the Brim app and drag the queries into the side panel where the query library is located. Brim will automatically launch a local lake that `zapi` will connect to. To have the `zapi` commands connect to the pool created from dragging `bpf.zson` into the app, issue the command

zapi use bpf.zson@main
ZSON is a superset of JSON, and much more powerful:
echo '{"ts":"11/22/2021 8:01am", "addr":"10.0.0.1", "number":"98.6"}' | zq -Z 'this:=cast(this, type({ts:time,addr:ip,number:float64}))' -
We can get the JSON back if you'd like (with a standard time)...
echo '{"ts":"11/22/2021 8:01am", "addr":"10.0.0.1", "number":"98.6"}' | zq -Z 'this:=cast(this, type({ts:time,addr:ip,number:float64}))' - | zq -f json - | jq
We can output Zed in an efficient binary form. Then get back the type information...
echo '{"ts":"11/22/2021 8:01am", "addr":"10.0.0.1", "number":"98.6"}' | zq -f zng 'this:=cast(this, type({ts:time,addr:ip,number:float64}))' - > binary.zng
cat binary.zng
hexdump binary.zng
zq 'cut typeof(this)' binary.zng
There is much more to it, but I will refer you to the Zed docs for more info...
Zed lake service is running on the Linux guest VM.
Earlier, we ran a simple workload while execsnoop and stackcount were running.
We'll tell the `zapi` CLI command (for hitting the Zed lake API) to use the "main" branch of a pool called "bpf":
zapi use bpf@main
It's quite a small data set but Zed lakes are designed to scale...
zapi query "count()"
Have a look at some pretty-printed (-Z) records...
zapi query -Z "* | head 42"
That's interesting... count up all the stacks containing __tcp_transmit_skb:
zapi query '"__tcp_transmit_skb" in stack | count()'
Are they always in the grandparent caller position?
zapi query 'stack[2]=="__tcp_transmit_skb" | count()'
One isn't! Find it and pretty print it...
zapi query -Z 'stack[2]!="__tcp_transmit_skb" and "__tcp_transmit_skb" in stack'
How about listing the events from execsnoop?
zapi query -Z 'is(type(exec))'
What are all the commands that were run?
zapi query -Z 'commands:=union(pcomm)'
And what about their paths? Let's take a union over record expressions:
zapi query -Z 'commands:=union({name:pcomm,path:args[0]})'
Pretty cool!
We'll cut over to the app here and run through the queries in the library...