hubble: Add support for SockLB tracing #21685
/test
🚀 I only looked at the commit "hubble: Add support for SockLB tracing", which pertains to parsing socket-lb trace events. The parsing logic looks good. Let's follow up on pruning events from non-pod entities.
Can we print metadata such as the cgroup id in the verbose JSON output? Will you be adding a JSON summary in a follow-up?
This is already done. The JSON schema is derived from the protobuf schema, which is updated in this PR to contain the cgroup id. Once the schema is synced to the Hubble CLI (which we can only do once this PR is merged),
Awesome work @gandro! I have a few comments, mostly cosmetic. Feel free to ignore nits and comments on code that has only moved (I only noticed after review).
pkg/hubble/parser/sock/parser.go (outdated)
    		return fmt.Errorf("failed to parse sock trace event: %w", err)
    	}

    	isRevNat := decodeRevNat(sock.XlatePoint)
At first I thought `decodeRevNat` returning a `*wrapperspb.BoolValue` was a left-over from a previous version of the patch, because `isRevNat` is never added to the decoded flow and this line is the only call site. We check for `isRevNat.GetValue()` a few lines below, but we only need a `true`/`false` signal at that point.
Looking at the screenshots of the Hubble CLI output, I noticed that we're missing the arrows for `TraceSock` flows (i.e. we see `<>`), so a related question: would setting `IsReply` and `TrafficDirection` based on `isRevNat` make sense for `TraceSock` events?
Yes, your suspicion is correct: an earlier version of the PR used the result of `decodeRevNat` as the reply flag. I've changed it to a `bool` in the most recent version.
Unfortunately, trying to derive the traffic direction or reply state from the observation point does not work: `IsReply` cannot be meaningfully populated for TraceSock events because they occur at socket granularity, not at the packet level.
The screenshot in the PR description only shows the traces from the node where `pod-to-a` is hosted, where rev nat is indeed only used for the reply. But if we also include the events emitted by the kube-dns host, the picture looks slightly different: we also see rev nat being used for the non-reply packet:
The screenshot above is using `xwing`, but it's a single UDP request packet being sent from xwing to kubedns, with a single reply packet being sent back. Please note the timestamps: as you can see, `kind-worker2` (where coredns is hosted) emits a `pre-xlate-rev` for the initial incoming packet (there is no post event, as there is no translation). The reply packet (the last 4 events) happens half a second later and does not emit any SockLB event, probably because it's sent back on the same "connected UDP socket".
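As a rough illustration of the change discussed in this thread, a `decodeRevNat` that returns a plain `bool` could look like the sketch below. The `xlatePoint` type and its constants are hypothetical stand-ins for the translation point carried by the event; the names and values used in the PR may differ.

```go
package main

import "fmt"

// xlatePoint is a hypothetical stand-in for the translation point of a
// TraceSock event. The real constants are defined by Cilium and may be
// named and numbered differently.
type xlatePoint uint8

const (
	xlateUnknown xlatePoint = iota
	xlatePreFwd
	xlatePostFwd
	xlatePreRev
	xlatePostRev
)

// decodeRevNat reports whether the event was emitted at a reverse
// translation point. It returns a plain bool because the caller only
// needs a true/false signal and never stores the value in the flow.
func decodeRevNat(p xlatePoint) bool {
	switch p {
	case xlatePreRev, xlatePostRev:
		return true
	default:
		return false
	}
}

func main() {
	points := []xlatePoint{xlatePreFwd, xlatePostFwd, xlatePreRev, xlatePostRev}
	for _, p := range points {
		fmt.Printf("point=%d revNat=%t\n", p, decodeRevNat(p))
	}
}
```

Note that, per the explanation above, this flag only says which translation was applied; it is not a substitute for `IsReply` or `TrafficDirection`.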
Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that `pre/post-xlate-rev` is only executed on `recvmsg` and `getpeername`; neither system call really implies anything about the overall "traffic direction" (e.g. whether it was an ingress or egress connection in policy lingo).
`pre/post-xlate-fwd`, on the other hand, is executed for `connect` and `sendmsg`. `connect` is typically used to initiate an egress connection, but the same cannot necessarily be said for `sendmsg`.
@gandro ah, that makes sense, thanks for the detailed explanation 👍 One more question about the example:
> The reply packet (the last 4 events) happens half a second later, and does not emit any SockLB event, probably because it's sent back on the same "connected UDP socket".
To make sure I understand correctly: the reply packet does not emit any SockLB event on `kind-worker2` (i.e. the `coredns` host), but it does on `kind-worker` (i.e. the `xwing` host), correct? The screenshot's last two events are a pair of `pre/post-xlate-rev` events, analogous to the request packet's `pre/post-xlate-fwd` events (which makes sense to me), hence the question.
That's correct, yes. My guess here is that `coredns` is using Golang's `net.Conn` to send the reply packet, meaning it's using the `write` system call on a "connected UDP socket". Thus, none of the above cgroup hooks are hit. `dig` (used inside `xwing`), on the other hand, might be using the `recvmsg`/`sendmsg` syscalls, thus causing a Hubble event for each syscall.
Maybe @aditighag has deeper insight here.
On second thought, connected UDP sockets are only useful for clients. Checking a basic UDP echo server, it seems Go is using `recvfrom` and `sendto`, so I'm not sure why the `sendto` does not cause a Hubble event (it should for unconnected UDP sockets, according to https://manpages.debian.org/experimental/bpftool/bpftool-cgroup.8.en.html).
The `recvmsg` hook is executed only for regular (unconnected) UDP. TCP (and connected UDP) sockets are connected, so TCP clients send and receive data using the `send`/`recv` socket calls.
@gandro Your observation is correct. Golang uses a connected UDP client by default. Regarding `sendto` for connected UDP, here is a snippet from the man page:
> If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0)
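To make the connected versus unconnected distinction concrete, here is a small Go sketch with both socket styles: a server on an unconnected socket (`ListenUDP` with `ReadFromUDP`/`WriteToUDP`, which pass an explicit peer address) and a client on a connected socket (`DialUDP` with plain `Write`/`Read`). The helper names are made up for illustration, and which system calls the Go runtime actually issues for each style is an assumption that would need to be confirmed with a tool such as `strace`.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// echoOnce reads one datagram from an unconnected UDP socket and sends
// it back to the sender, passing the peer address explicitly.
func echoOnce(conn *net.UDPConn) error {
	buf := make([]byte, 512)
	n, peer, err := conn.ReadFromUDP(buf)
	if err != nil {
		return err
	}
	_, err = conn.WriteToUDP(buf[:n], peer)
	return err
}

// roundTrip sends msg over a connected UDP socket and returns the reply.
// No destination address is passed on Write; it was fixed by DialUDP.
func roundTrip(server *net.UDPAddr, msg string) (string, error) {
	conn, err := net.DialUDP("udp4", nil, server)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return "", err
	}
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, 512)
	n, err := conn.Read(buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

// run starts a one-shot echo server on a loopback port chosen by the
// kernel and performs a single request/reply exchange against it.
func run(msg string) (string, error) {
	srv, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return "", err
	}
	defer srv.Close()

	errc := make(chan error, 1)
	go func() { errc <- echoOnce(srv) }()

	reply, err := roundTrip(srv.LocalAddr().(*net.UDPAddr), msg)
	if err != nil {
		return "", err
	}
	if err := <-errc; err != nil {
		return "", err
	}
	return reply, nil
}

func main() {
	reply, err := run("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reply:", reply)
}
```

This only shows the two socket styles side by side; it does not by itself demonstrate which cgroup hooks fire.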
> Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername; neither system call really implies anything about the overall "traffic direction" (e.g. whether it was an ingress or egress connection in policy lingo).
When podA on nodeA talks to podB on nodeB, both fwd and rev translation happen on nodeA (for UDP). For TCP and connected UDP, it would just be fwd translation.
The {pre, post}-fwd/rev markers already convey the direction. We shouldn't try to interpret this with the same context as policy terminology (ingress/egress).
fwd: svc vip -> backend ip
rev: backend ip -> svc vip
Hope that makes sense.
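The fwd/rev relationship above can be modelled with a toy translation table. This is only a sketch under simplifying assumptions: the `socketLB` type, its fields and the addresses are invented for illustration and do not reflect how the datapath stores its state.

```go
package main

import "fmt"

// socketLB is a toy model of socket-level load balancing state.
// Addresses are plain "ip:port" strings to keep the example small.
type socketLB struct {
	backends map[string]string // service VIP -> selected backend
	revNAT   map[string]string // backend -> service VIP
}

func newSocketLB(backends map[string]string) *socketLB {
	return &socketLB{backends: backends, revNAT: map[string]string{}}
}

// fwd models the forward translation (svc vip -> backend ip) and
// remembers the reverse mapping. Non-service addresses pass through.
func (s *socketLB) fwd(dst string) string {
	backend, ok := s.backends[dst]
	if !ok {
		return dst
	}
	s.revNAT[backend] = dst
	return backend
}

// rev models the reverse translation (backend ip -> svc vip), so the
// application sees the address it originally sent to.
func (s *socketLB) rev(src string) string {
	if vip, ok := s.revNAT[src]; ok {
		return vip
	}
	return src
}

func main() {
	lb := newSocketLB(map[string]string{"10.96.0.10:53": "10.244.1.5:53"})

	backend := lb.fwd("10.96.0.10:53")
	fmt.Println("fwd:", "10.96.0.10:53", "->", backend)
	fmt.Println("rev:", backend, "->", lb.rev(backend))
}
```

Both translations happen on the node of the sending pod, which is why the markers describe the translation applied rather than an ingress or egress direction.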
Thanks for the feedback @aditighag and @kaworu. I think I addressed or commented on all of it.
/test
Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 2 failures but they might be new flakes since it also hit 1 known flake: #17628 (92.24)
Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed.
Commit "hubble: Add support for SockLB tracing" looks good to me. Nice work!
One minor comment around setting the verdict field.
This adds the relevant fields, including the translation point, socket cookie and cgroup id. The cgroup id allocated by the kernel can never be zero, so using zero as the "absent" value is possible here. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
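A minimal sketch of the "zero means absent" convention from the commit message above, using a hypothetical `sockEvent` struct rather than the generated protobuf type:

```go
package main

import "fmt"

// sockEvent is a hypothetical subset of the fields mentioned in the
// commit message; the real flow message is generated from protobuf.
type sockEvent struct {
	CgroupID   uint64 // 0 means "not available"
	SockCookie uint64
}

// cgroupID returns the cgroup id and whether it is present. Because the
// kernel never allocates cgroup id 0, no separate presence flag is needed.
func (e sockEvent) cgroupID() (uint64, bool) {
	return e.CgroupID, e.CgroupID != 0
}

func main() {
	for _, ev := range []sockEvent{{CgroupID: 4242, SockCookie: 7}, {SockCookie: 8}} {
		if id, ok := ev.cgroupID(); ok {
			fmt.Println("cgroup id:", id)
		} else {
			fmt.Println("cgroup id: absent")
		}
	}
}
```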
This commit moves useful functions out of the threefour parser package into a common package to be shared among parsers. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This allows us to call this helper function from other parsers as well. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds a Hubble parser for `TraceSockNotify` events. Those events are emitted whenever SockLB is performed, see #20492 for details about the datapath tracing.
This parser is similar to the L3/L4 event parser. But because the events are traced on the socket level, we do not know the source endpoint id, source endpoint IP address or source TCP/UDP port of the event. We only know the source cgroup ID. From the cgroup, we can derive which local pod emitted the event (via `GetPodMetadataForContainer`). From the pod IP we can then perform an IP-based lookup in either the local endpoint list or the IPCache to determine the source endpoint metadata (endpoint id, numeric identity, labels, etc). The lookup for the destination IP is almost identical to the threefour parser, with the exception that we do not have any datapath identity available for fallback.
The resulting flow looks similar to a regular `FlowType_L3_L4` flow, with a couple of key differences:
- The cgroup id of the container is available
- Information on whether the event is at a pre- or post-translation point is available
- The socket cookie is available
- The TCP/UDP source port is _not_ known
- TCP flags are _not_ known
- Ethernet headers are _not_ known
Unfortunately, sock events can also be emitted by non-pod entities (e.g. the host cgroup). In that case, we cannot perform a mapping to any local endpoint and therefore the source IP and endpoint data are left empty. The cgroup id may also not be available on older kernels or systems with the systemd cgroup driver (though this will likely be fixed in a follow-up PR). The fact that the source IP may be empty/unknown is also a key difference from the flows which have been produced by Hubble so far.
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
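The source lookup chain described in this commit message (cgroup id, then pod IP, then local endpoint list or IPCache) might be sketched as follows. Everything here is hypothetical: the `resolver` type and its map-backed tables stand in for the real getters, whose interfaces are not shown in this conversation.

```go
package main

import "fmt"

// endpointInfo is a hypothetical, trimmed-down view of source metadata.
type endpointInfo struct {
	ID       uint64 // zero when the IP is not a local endpoint
	Identity uint32 // numeric security identity, zero when unknown
}

// resolver bundles three hypothetical lookup tables standing in for the
// pod metadata getter, the local endpoint list and the IPCache.
type resolver struct {
	podIPByCgroup  map[uint64]string
	localEndpoints map[string]endpointInfo
	ipcache        map[string]uint32
}

// resolveSource maps a cgroup id to a source IP and endpoint metadata.
// ok is false when no pod matches the cgroup id (e.g. the host cgroup),
// in which case the source IP and endpoint data stay empty.
func (r resolver) resolveSource(cgroupID uint64) (string, endpointInfo, bool) {
	ip, found := r.podIPByCgroup[cgroupID]
	if !found {
		return "", endpointInfo{}, false
	}
	// Prefer the local endpoint list, which carries the endpoint id.
	if ep, isLocal := r.localEndpoints[ip]; isLocal {
		return ip, ep, true
	}
	// Otherwise fall back to the IPCache, which only knows the identity.
	return ip, endpointInfo{Identity: r.ipcache[ip]}, true
}

func main() {
	r := resolver{
		podIPByCgroup:  map[uint64]string{100: "10.244.1.5"},
		localEndpoints: map[string]endpointInfo{"10.244.1.5": {ID: 17, Identity: 4321}},
		ipcache:        map[string]uint32{},
	}
	for _, cg := range []uint64{100, 999} {
		ip, ep, ok := r.resolveSource(cg)
		fmt.Printf("cgroup=%d known=%t ip=%q endpoint=%d identity=%d\n", cg, ok, ip, ep.ID, ep.Identity)
	}
}
```

The destination side is not modelled here; per the commit message it follows the threefour parser, minus the datapath identity fallback.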
This commit adds a new agent flag which instructs Hubble to skip TraceSock events with unknown cgroup ids, i.e. cgroup ids for which we are unable to find a matching pod. Those skipped events are not added to the ring buffer and can only be observed via `cilium monitor`, unless the flag is disabled. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
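The skip behaviour could be pictured like this. The `retain` function, the `skipUnknown` parameter and the `knownCgroups` set are illustrative names only; the actual flag name and the place where the check happens are not given in this conversation.

```go
package main

import "fmt"

// sockEvent is a hypothetical, minimal TraceSock event.
type sockEvent struct {
	CgroupID uint64
	Dst      string
}

// retain returns the events that would be added to the ring buffer.
// When skipUnknown is set, events whose cgroup id has no matching pod
// are dropped; otherwise every event is kept.
func retain(events []sockEvent, knownCgroups map[uint64]bool, skipUnknown bool) []sockEvent {
	var kept []sockEvent
	for _, ev := range events {
		if skipUnknown && !knownCgroups[ev.CgroupID] {
			continue
		}
		kept = append(kept, ev)
	}
	return kept
}

func main() {
	events := []sockEvent{
		{CgroupID: 100, Dst: "10.96.0.10:53"}, // emitted by a known pod
		{CgroupID: 999, Dst: "10.96.0.1:443"}, // e.g. the host cgroup
	}
	known := map[uint64]bool{100: true}

	fmt.Println("skip enabled: ", len(retain(events, known, true)), "event(s) kept")
	fmt.Println("skip disabled:", len(retain(events, known, false)), "event(s) kept")
}
```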
To avoid cluttering the ConfigMap with default values, it is only added to the ConfigMap if explicitly set (i.e. is not `nil`). Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
/test
Looks good for my files!
Removed Marking ready to merge. |
This PR adds a Hubble parser for `TraceSockNotify` events. Those
events are emitted whenever SockLB is performed, see #20492
for details about the datapath tracing.
This parser is similar to the L3/L4 event parser. But because the events
are traced on the socket level, we do not know the source endpoint id,
source endpoint IP address or source TCP/UDP port of the event. We only
know the source cgroup ID. From the cgroup, we can derive which local
pod emitted the event (via `GetPodMetadataForContainer`). From the pod
IP we can then perform an IP-based lookup in either the local endpoint
list or the IPCache to determine the source endpoint metadata (endpoint
id, numeric identity, labels, etc). The lookup for the destination IP is
almost identical to the threefour parser, with the exception that we do
not have any datapath identity available for fallback.
The resulting flow looks similar to a regular `FlowType_L3_L4` flow, with a
couple of key differences:
- The cgroup id of the container is available
- Information on whether the event is at a pre- or post-translation point is available
- The socket cookie is available
- The TCP/UDP source port is _not_ known
- TCP flags are _not_ known
- Ethernet headers are _not_ known
Unfortunately, sock events can also be performed by non-pod entities
(e.g. the host cgroup). In that case, we cannot perform a mapping to any
local endpoint and therefore the source IP and endpoint data are left
empty. The cgroup id may also not be available on older kernels or
systems with the systemd cgroup driver (though this will likely be fixed
in a follow-up PR). The fact that the source IP may be empty/unknown is
also a key difference from the flows which have been produced by Hubble so
far.
Here's an example of a pod performing a DNS request and receiving the response. The first three events are the DNS request: we first see the two SockLB events (pre and post translation, meaning before and after service backend selection):
And here are the events for the response: we see the response packet arriving at the endpoint, but before it is delivered over the socket, reverse NAT is applied, i.e. the pod IP is once again translated to the service IP (which is where the userspace sent the packet to):