hubble: Add support for SockLB tracing #21685

gandro · 2022-10-12T12:44:15Z

This PR adds a Hubble parser for TraceSockNotify events. Those
events are emitted whenever SockLB is performed, see #20492
for details about the datapath tracing.

This parser is similar to the L3/L4 event parser. But because the events
are traced on the socket level, we do not know the source endpoint id,
source endpoint IP address or source TCP/UDP port of the event. We only
know the source cgroup ID. From the cgroup, we can derive which local
pod emitted the event (via GetPodMetadataForContainer). From the pod
IP we can then perform an IP-based lookup in either the local endpoint
list or the IPCache to determine the source endpoint metadata (endpoint
id, numeric identity, labels, etc). The lookup for destination IP is
almost identical to the threefour parser, with the exception that we do
not have any datapath identity available for fallback.

The resulting flow looks similar to regular FlowType_L3_L4, with a
couple of key differences:

The cgroup id of the container is available
Information if the event is pre or post translation point is
available
The socket cookie is available
The TCP/UDP source port is not known
TCP flags are not known
Ethernet headers are not known

Unfortunately, sock events can also be performed by non-pod entities
(e.g. the host cgroup). In that case, we cannot perform a mapping to any
local endpoint and therefore the source IP and endpoint data are left
empty. The cgroup id may also not be available on older kernels or
systems with the systemd cgroup driver (though this will likely be fixed
in a follow-up PR). This fact that the source IP may be empty/unknown is
also key difference to the flows which have been produced by Hubble so
far.

Here's an example of a pod performing a DNS request and response. The first three events are the DNS request: We first see two the SockLB events (pre and post translation, meaning before and after service backend selection):

And here are the events for the response: We see the response packet arriving at the endpoint, but before it is delivered over the socket, reverse NAT being applied, i.e. the pod IP is once again translated to the service IP (which is where the userspace sent the packet to):

gandro · 2022-10-13T19:25:37Z

/test

aditighag

🚀 I only looked at the commit "hubble: Add support for SockLB tracing" that pertains to parsing socket-lb trace events. The parsing logic looks good. Let's follow up on pruning events from non-pod entities.

Can we print cgroup id like metadata in the verbose json output? Will you be adding json summary in a follow-up?

pkg/hubble/parser/sock/parser.go

gandro · 2022-10-25T09:46:49Z

Can we print cgroup id like metadata in the verbose json output? Will you be adding json summary in a follow-up?

This already done. The JSON schema is derived from the protobuf schema, which is updated in this PR to contain the cgroup id. Once the schema is synced to the Hubble CLI (which we can only do once this PR is merged), hubble observe -o jsonpb will print it too.

kaworu

Awesome work @gandro! I have a few comments, mostly cosmetic. Feel free to ignore nits and comments on code that has only moved (I only noticed after review).

pkg/hubble/parser/common/labels.go

pkg/hubble/parser/common/endpoint.go

pkg/hubble/parser/common/labels.go

pkg/hubble/parser/parser.go

kaworu · 2022-10-28T14:47:54Z

pkg/hubble/parser/sock/parser.go

+		return fmt.Errorf("failed to parse sock trace event: %w", err)
+	}
+
+	isRevNat := decodeRevNat(sock.XlatePoint)


At first I thought decodeRevNat returning a *wrapperspb.BoolValue was a left-over from a previous version of the patch, because isRevNat is never added to the decoded flow and this line is the only call site. We check for isRevNat.GetValue() a few lines below, but we only need a true/false signal at that point.

Looking at the screenshots of the Hubble CLI output, I noticed that we're missing the arrows for TraceSock flows (i.e. we see <>), so related question: would setting IsReply and TrafficDirection based on isRevNat make sense for TraceSock events?

Yes, your suspicion is correct, an earlier version of the PR used the result of decodeRevNat as the reply flag. I've changed it to a bool in the most recent version.

Unfortunately, trying to derive the traffic direction / reply state from the observation point does not work: IsReply cannot be meaningfully populated for TraceSock events because their occur on the socket granularity level, not a packet level.

The screenshot in the PR description only shows the traces from the node where pod-to-a is hosted, where rev nat is indeed only used for the reply. But if we include also the events emitted by the kube-dns host, then the picture looks slightly different, we also see rev nat being used for the non-reply packet:

The screenshot above is using xwing, but it's a single UDP request packet being sent from xwing to kubedns, with a single reply packet being sent back. Please note the timetsamps - as you can see, kind-worker2 (where coredns is hostd) emits a pre-xlate-rev for the initial incoming packet (there is no post event, as there is no translation). The reply packet (the last 4 events) happens half a second later, and does not emit any SockLB event, probably because it's sent back on the same "connected UDP socket".

Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername, both system calls don't really imply anything of overall the "traffic direction" (e.g. if it was an ingress or egress connection in policy lingo).

pre/post-xlate-fwd on the other hand is executed for connect and sendmsg. connect is typically used to initiate an egress connection, but the same cannot necessary be said for sendmsg.

@gandro ah that make sense, thanks for the detailed explanation 👍 one more question about the example:

The reply packet (the last 4 events) happens half a second later, and does not emit any SockLB event, probably because it's sent back on the same "connected UDP socket".

To make sure I understand correctly: the reply packet does not emit any SockLB event on kind-worker2 (i.e. coredns host), but it does in kind-worker (i.e. xwing host) correct? The screenshot's last two events are a couple of pre/post-xlate-rev analogous to the request packet pre/post-xlate-fwd events (which make sense to me), hence the question.

That's correct, yes. My guess here is that coredns is using Golang's net.Conn to send the reply packet, meaning it's using the write system call on a "connected UDP socket". Thus, none of the above cgroup hooks is hit. dig (used inside xwing) on the other hand might be using readmsg/sendmsg syscalls, thus causing a Hubble event for each syscall.

Maybe @aditighag has deeper insight here.

On second thought, connected UDP sockets are only useful for clients. Checking a basic UDP echo server, it seems Go is using recvfrom and sendto, so not sure why the sendto does not cause a Hubble event (it should for unconnected udp sockets, according to https://manpages.debian.org/experimental/bpftool/bpftool-cgroup.8.en.html)

The recvmsg hook is executed only for regular UDP. TCP (and connected UDP) sockets are connected, so TCP clients send and receive data using send/recv socket calls.
@gandro Your observation is correct. Golang uses a connected UDP client by default. Regarding, sendto for connected UDP, here is a snippet from the man page -

If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0)

Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername, both system calls don't really imply anything of overall the "traffic direction" (e.g. if it was an ingress or egress connection in policy lingo).

When podA on nodeA talks to podB on nodeB, both fwd and rev translation happen on the nodeA (for UDP). For TCP and connected UDP, it would just be fwd translation.

The {pre, post}-fwd/rev markers already convey the direction. We shouldn't try to interpret this with the same context as policy terminology (ingress/egress).

fwd: svc vip -> backend ip
rev: backend ip -> svc vip

Hope that makes sense.

pkg/hubble/parser/sock/parser.go

gandro · 2022-11-01T16:03:58Z

Thanks for the feedback @aditighag and @kaworu - I think I addressed or commented on all of it.

gandro · 2022-11-02T10:57:26Z

/test

Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 2 failures but they might be new flakes since it also hit 1 known flakes: #17628 (92.24)

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Timed out after 30.000s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

gandro · 2022-11-02T17:36:48Z

/test

Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 2 failures but they might be new flakes since it also hit 1 known flakes: #17628 (92.24)

gandro · 2022-11-03T09:04:36Z

k8s-1.24-kernel-4.19 hit known flake CI: K8sDatapathConfig Host firewall With VXLAN on K8s 1.24 / Kernel 4.19: host EP is not ready #21979
CI Multicluster seems to have hit a variant of CI: Multicluster / Cluster mesh (ci-multicluster-1.11, ci-multicluster-1.12, ci-multicluster-1.13): allow-all-except-world connectivity test #21655 (in any case, it's a datapath flake which this PR does not touch).

aditighag

Commit "hubble: Add support for SockLB tracing" looks good to me. Nice work!

One minor comment around setting the verdict field.

aditighag · 2022-11-03T21:19:19Z

pkg/hubble/parser/sock/parser.go

+		return fmt.Errorf("failed to parse sock trace event: %w", err)
+	}
+
+	isRevNat := decodeRevNat(sock.XlatePoint)


The recvmsg hook is executed only for regular UDP. TCP (and connected UDP) sockets are connected, so TCP clients send and receive data using send/recv socket calls.
@gandro Your observation is correct. Golang uses a connected UDP client by default. Regarding, sendto for connected UDP, here is a snippet from the man page -

If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0)

aditighag · 2022-11-03T21:26:59Z

pkg/hubble/parser/sock/parser.go

+		return fmt.Errorf("failed to parse sock trace event: %w", err)
+	}
+
+	isRevNat := decodeRevNat(sock.XlatePoint)


Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername, both system calls don't really imply anything of overall the "traffic direction" (e.g. if it was an ingress or egress connection in policy lingo).

When podA on nodeA talks to podB on nodeB, both fwd and rev translation happen on the nodeA (for UDP). For TCP and connected UDP, it would just be fwd translation.

The {pre, post}-fwd/rev markers already convey the direction. We shouldn't try to interpret this with the same context as policy terminology (ingress/egress).

fwd: svc vip -> backend ip
rev: backend ip -> svc vip

Hope that makes sense.

pkg/hubble/parser/sock/parser.go

This adds the relevant fields, including the translation point, socket cookie and cgroup id. The cgroup id allocated by the kernel can never be zero, so using zero as the "absent" value is possible here. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit moves out useful functions out of the threefour parser package into a common package to be shared among parsers. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This allows us to call this helper function from other parsers as well. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit adds a Hubble parser for `TraceSockNotify` events. Those events are emitted whenever SockLB is performed, see #20492 for details about the datapath tracing. This parser is similar to the L3/L4 event parser. But because the events are traced on the socket level, we do not know the source endpoint id, source endpoint IP address or source TCP/UDP port of the event. We only know the source cgroup ID. From the cgroup, we can derive which local pod emitted the event (via `GetPodMetadataForContainer`). From the pod IP we can then perform an IP-based lookup in either the local endpoint list or the IPCache to determine the source endpoint metadata (endpoint id, numeric identity, labels, etc). The lookup for destination IP is almost identical to the threefour parser, with the exception that we do not have any datapath identity available for fallback. The resulting flow looks similar to regular `FlowType_L3_L4`, with a couple of key differences: - The cgroup id of the container is available - Information if the event is pre or post translation point is available - The socket cookie is available - The TCP/UDP source port is _not_ known - TCP flags are _not_ known - Ethernet headers are _not_ known Unfortunately, sock events can also be performed by non-pod entities (e.g. the host cgroup). In that case, we cannot perform a mapping to any local endpoint and therefore the source IP and endpoint data are left empty. The cgroup id may also not be available on older kernels or systems with the systemd cgroup driver (though this will likely be fixed in a follow-up PR). This fact that the source IP may be empty/unknown is also key difference to the flows which have been produced by Hubble so far. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit adds a new agent flag which instructs Hubble to skip TraceSock events with unknown cgroup ids, i.e. cgroup ids for which we are unable to find a matching pod. Those skipped events are not added to the ring buffer and can only be observed via `cilium monitor`, unless the flag is disabled. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

To avoid cluttering the ConfigMap with default values, it is only added to the ConfigMap if explicitly set (i.e. not is not `nil`). Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro · 2022-11-08T09:18:04Z

/test

squeed

Looks good for my files!

rolinh

gandro · 2022-11-14T14:28:36Z

Removed api code owner, as it has been superseeded by sig-hubble-api and is thus not required anymore (if it was required, I could not have removed the review request).

Marking ready to merge.

gandro added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels Oct 12, 2022

This comment was marked as resolved.

Sign in to view

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 12, 2022

gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch from 296bd2b to f2f10a8 Compare October 13, 2022 15:48

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 13, 2022

gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch 2 times, most recently from 468d720 to 117c11c Compare October 13, 2022 16:11

gandro marked this pull request as ready for review October 13, 2022 16:18

gandro requested review from a team as code owners October 13, 2022 16:18

gandro requested review from michi-covalent, rolinh and aditighag October 13, 2022 16:18

gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch from 117c11c to 9e6191e Compare October 13, 2022 16:42

aditighag requested changes Oct 13, 2022

View reviewed changes

gandro marked this pull request as draft October 24, 2022 13:21

kaworu reviewed Oct 28, 2022

View reviewed changes

gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch 2 times, most recently from 2c480c5 to a9878db Compare November 1, 2022 16:03

gandro marked this pull request as ready for review November 1, 2022 16:03

gandro requested review from a team as code owners November 1, 2022 16:03

gandro requested a review from jibi November 1, 2022 16:03

gandro requested review from aditighag and kaworu November 1, 2022 16:04

aditighag requested changes Nov 3, 2022

View reviewed changes

gandro added 9 commits November 7, 2022 12:54

hubble/parser: Extract functions to be shared among parsers

bd4b5aa

This commit moves out useful functions out of the threefour parser package into a common package to be shared among parsers. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

hubble/parser: Extract resolveEndpoint helper

0c4d9f0

This allows us to call this helper function from other parsers as well. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

hubble: Code cleanup in parser/common/endpoint.go

255b135

Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

hubble: Code cleanup in parser/common/labels.go

cf1762b

Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

hubble: Code simplification in parser/common/labels.go

c0386a9

Suggested-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

helm: Add hubble.skipUnknownCGroupIDs value

c537251

To avoid cluttering the ConfigMap with default values, it is only added to the ConfigMap if explicitly set (i.e. not is not `nil`). Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch from 93edc54 to c537251 Compare November 7, 2022 11:54

gandro requested a review from a team as a code owner November 7, 2022 11:54

gandro requested a review from aditighag November 7, 2022 15:10

qmonnet approved these changes Nov 7, 2022

View reviewed changes

aditighag approved these changes Nov 8, 2022

View reviewed changes

gandro mentioned this pull request Nov 8, 2022

CI: Travis: ClusterMeshServicesTestSuite.TestClusterMeshServicesUpdate: obtained k8s.CacheAction = 0 ("service-updated") #21730

Closed

squeed approved these changes Nov 8, 2022

View reviewed changes

rolinh approved these changes Nov 14, 2022

View reviewed changes

gandro removed the request for review from a team November 14, 2022 14:27

gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 14, 2022

tklauser merged commit fd2909a into master Nov 16, 2022

tklauser deleted the pr/gandro/socket-lb-traces-with-hubble branch November 16, 2022 10:12

This was referenced Nov 16, 2022

Add support for SockLB events cilium/hubble#816

Merged

Hubble unable to observe service IP when HostReachableServices is enabled #12954

Closed

gandro self-assigned this Nov 21, 2022

aditighag mentioned this pull request Jan 5, 2023

Socket-lb tracing follow-ups #21510

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hubble: Add support for SockLB tracing #21685

hubble: Add support for SockLB tracing #21685

gandro commented Oct 12, 2022 •

edited

Loading

This comment was marked as resolved.

gandro commented Oct 13, 2022

aditighag left a comment

gandro commented Oct 25, 2022

kaworu left a comment

kaworu Oct 28, 2022

gandro Nov 1, 2022

gandro Nov 1, 2022

kaworu Nov 2, 2022

gandro Nov 2, 2022 •

edited

Loading

gandro Nov 2, 2022 •

edited

Loading

aditighag Nov 3, 2022

aditighag Nov 3, 2022

gandro commented Nov 1, 2022

gandro commented Nov 2, 2022 •

edited by maintainer-s-little-helper bot

Loading

Test Name

Failure Output

gandro commented Nov 2, 2022 •

edited by maintainer-s-little-helper bot

Loading

gandro commented Nov 3, 2022

aditighag left a comment

aditighag Nov 3, 2022

aditighag Nov 3, 2022

gandro commented Nov 8, 2022

squeed left a comment

rolinh left a comment

gandro commented Nov 14, 2022

hubble: Add support for SockLB tracing #21685

hubble: Add support for SockLB tracing #21685

Conversation

gandro commented Oct 12, 2022 • edited Loading

This comment was marked as resolved.

gandro commented Oct 13, 2022

aditighag left a comment

Choose a reason for hiding this comment

gandro commented Oct 25, 2022

kaworu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gandro Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

gandro Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gandro commented Nov 1, 2022

gandro commented Nov 2, 2022 • edited by maintainer-s-little-helper bot Loading

Test Name

Failure Output

gandro commented Nov 2, 2022 • edited by maintainer-s-little-helper bot Loading

gandro commented Nov 3, 2022

aditighag left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gandro commented Nov 8, 2022

squeed left a comment

Choose a reason for hiding this comment

rolinh left a comment

Choose a reason for hiding this comment

gandro commented Nov 14, 2022

gandro commented Oct 12, 2022 •

edited

Loading

gandro Nov 2, 2022 •

edited

Loading

gandro Nov 2, 2022 •

edited

Loading

gandro commented Nov 2, 2022 •

edited by maintainer-s-little-helper bot

Loading

gandro commented Nov 2, 2022 •

edited by maintainer-s-little-helper bot

Loading