Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble: Add support for SockLB tracing #21685

Merged
merged 9 commits into from
Nov 16, 2022

Conversation

gandro
Copy link
Member

@gandro gandro commented Oct 12, 2022

This PR adds a Hubble parser for TraceSockNotify events. Those
events are emitted whenever SockLB is performed, see #20492
for details about the datapath tracing.

This parser is similar to the L3/L4 event parser. But because the events
are traced on the socket level, we do not know the source endpoint id,
source endpoint IP address or source TCP/UDP port of the event. We only
know the source cgroup ID. From the cgroup, we can derive which local
pod emitted the event (via GetPodMetadataForContainer). From the pod
IP we can then perform an IP-based lookup in either the local endpoint
list or the IPCache to determine the source endpoint metadata (endpoint
id, numeric identity, labels, etc). The lookup for destination IP is
almost identical to the threefour parser, with the exception that we do
not have any datapath identity available for fallback.

The resulting flow looks similar to regular FlowType_L3_L4, with a
couple of key differences:

  • The cgroup id of the container is available
  • Information if the event is pre or post translation point is
    available
  • The socket cookie is available
  • The TCP/UDP source port is not known
  • TCP flags are not known
  • Ethernet headers are not known

Unfortunately, sock events can also be performed by non-pod entities
(e.g. the host cgroup). In that case, we cannot perform a mapping to any
local endpoint and therefore the source IP and endpoint data are left
empty. The cgroup id may also not be available on older kernels or
systems with the systemd cgroup driver (though this will likely be fixed
in a follow-up PR). This fact that the source IP may be empty/unknown is
also key difference to the flows which have been produced by Hubble so
far.


Here's an example of a pod performing a DNS request and response. The first three events are the DNS request: We first see two the SockLB events (pre and post translation, meaning before and after service backend selection):

image

And here are the events for the response: We see the response packet arriving at the endpoint, but before it is delivered over the socket, reverse NAT being applied, i.e. the pod IP is once again translated to the service IP (which is where the userspace sent the packet to):

image

@gandro gandro added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels Oct 12, 2022
@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 12, 2022
@gandro gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch from 296bd2b to f2f10a8 Compare October 13, 2022 15:48
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 13, 2022
@gandro gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch 2 times, most recently from 468d720 to 117c11c Compare October 13, 2022 16:11
@gandro gandro marked this pull request as ready for review October 13, 2022 16:18
@gandro gandro requested review from a team as code owners October 13, 2022 16:18
@gandro gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch from 117c11c to 9e6191e Compare October 13, 2022 16:42
@gandro
Copy link
Member Author

gandro commented Oct 13, 2022

/test

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 I only looked at the commit "hubble: Add support for SockLB tracing" that pertains to parsing socket-lb trace events. The parsing logic looks good. Let's follow up on pruning events from non-pod entities.

Can we print cgroup id like metadata in the verbose json output? Will you be adding json summary in a follow-up?

pkg/hubble/parser/sock/parser.go Outdated Show resolved Hide resolved
pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
@gandro gandro marked this pull request as draft October 24, 2022 13:21
@gandro
Copy link
Member Author

gandro commented Oct 25, 2022

Can we print cgroup id like metadata in the verbose json output? Will you be adding json summary in a follow-up?

This already done. The JSON schema is derived from the protobuf schema, which is updated in this PR to contain the cgroup id. Once the schema is synced to the Hubble CLI (which we can only do once this PR is merged), hubble observe -o jsonpb will print it too.

Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @gandro! I have a few comments, mostly cosmetic. Feel free to ignore nits and comments on code that has only moved (I only noticed after review).

pkg/hubble/parser/common/labels.go Outdated Show resolved Hide resolved
pkg/hubble/parser/common/endpoint.go Outdated Show resolved Hide resolved
pkg/hubble/parser/common/labels.go Outdated Show resolved Hide resolved
pkg/hubble/parser/parser.go Outdated Show resolved Hide resolved
return fmt.Errorf("failed to parse sock trace event: %w", err)
}

isRevNat := decodeRevNat(sock.XlatePoint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I thought decodeRevNat returning a *wrapperspb.BoolValue was a left-over from a previous version of the patch, because isRevNat is never added to the decoded flow and this line is the only call site. We check for isRevNat.GetValue() a few lines below, but we only need a true/false signal at that point.

Looking at the screenshots of the Hubble CLI output, I noticed that we're missing the arrows for TraceSock flows (i.e. we see <>), so related question: would setting IsReply and TrafficDirection based on isRevNat make sense for TraceSock events?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your suspicion is correct, an earlier version of the PR used the result of decodeRevNat as the reply flag. I've changed it to a bool in the most recent version.

Unfortunately, trying to derive the traffic direction / reply state from the observation point does not work: IsReply cannot be meaningfully populated for TraceSock events because their occur on the socket granularity level, not a packet level.

The screenshot in the PR description only shows the traces from the node where pod-to-a is hosted, where rev nat is indeed only used for the reply. But if we include also the events emitted by the kube-dns host, then the picture looks slightly different, we also see rev nat being used for the non-reply packet:

image

The screenshot above is using xwing, but it's a single UDP request packet being sent from xwing to kubedns, with a single reply packet being sent back. Please note the timetsamps - as you can see, kind-worker2 (where coredns is hostd) emits a pre-xlate-rev for the initial incoming packet (there is no post event, as there is no translation). The reply packet (the last 4 events) happens half a second later, and does not emit any SockLB event, probably because it's sent back on the same "connected UDP socket".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername, both system calls don't really imply anything of overall the "traffic direction" (e.g. if it was an ingress or egress connection in policy lingo).

pre/post-xlate-fwd on the other hand is executed for connect and sendmsg. connect is typically used to initiate an egress connection, but the same cannot necessary be said for sendmsg.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandro ah that make sense, thanks for the detailed explanation 👍 one more question about the example:

The reply packet (the last 4 events) happens half a second later, and does not emit any SockLB event, probably because it's sent back on the same "connected UDP socket".

To make sure I understand correctly: the reply packet does not emit any SockLB event on kind-worker2 (i.e. coredns host), but it does in kind-worker (i.e. xwing host) correct? The screenshot's last two events are a couple of pre/post-xlate-rev analogous to the request packet pre/post-xlate-fwd events (which make sense to me), hence the question.

Copy link
Member Author

@gandro gandro Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, yes. My guess here is that coredns is using Golang's net.Conn to send the reply packet, meaning it's using the write system call on a "connected UDP socket". Thus, none of the above cgroup hooks is hit. dig (used inside xwing) on the other hand might be using readmsg/sendmsg syscalls, thus causing a Hubble event for each syscall.

Maybe @aditighag has deeper insight here.

Copy link
Member Author

@gandro gandro Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, connected UDP sockets are only useful for clients. Checking a basic UDP echo server, it seems Go is using recvfrom and sendto, so not sure why the sendto does not cause a Hubble event (it should for unconnected udp sockets, according to https://manpages.debian.org/experimental/bpftool/bpftool-cgroup.8.en.html)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recvmsg hook is executed only for regular UDP. TCP (and connected UDP) sockets are connected, so TCP clients send and receive data using send/recv socket calls.
@gandro Your observation is correct. Golang uses a connected UDP client by default. Regarding, sendto for connected UDP, here is a snippet from the man page -

If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername, both system calls don't really imply anything of overall the "traffic direction" (e.g. if it was an ingress or egress connection in policy lingo).

When podA on nodeA talks to podB on nodeB, both fwd and rev translation happen on the nodeA (for UDP). For TCP and connected UDP, it would just be fwd translation.

The {pre, post}-fwd/rev markers already convey the direction. We shouldn't try to interpret this with the same context as policy terminology (ingress/egress).

fwd: svc vip -> backend ip
rev: backend ip -> svc vip

Hope that makes sense.

pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
@gandro gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch 2 times, most recently from 2c480c5 to a9878db Compare November 1, 2022 16:03
@gandro gandro marked this pull request as ready for review November 1, 2022 16:03
@gandro gandro requested review from a team as code owners November 1, 2022 16:03
@gandro gandro requested a review from jibi November 1, 2022 16:03
@gandro
Copy link
Member Author

gandro commented Nov 1, 2022

Thanks for the feedback @aditighag and @kaworu - I think I addressed or commented on all of it.

@gandro
Copy link
Member Author

gandro commented Nov 2, 2022

/test

Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 2 failures but they might be new flakes since it also hit 1 known flakes: #17628 (92.24)

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Timed out after 30.000s.

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

@gandro
Copy link
Member Author

gandro commented Nov 2, 2022

/test

Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 2 failures but they might be new flakes since it also hit 1 known flakes: #17628 (92.24)

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit "hubble: Add support for SockLB tracing" looks good to me. Nice work!

One minor comment around setting the verdict field.

return fmt.Errorf("failed to parse sock trace event: %w", err)
}

isRevNat := decodeRevNat(sock.XlatePoint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recvmsg hook is executed only for regular UDP. TCP (and connected UDP) sockets are connected, so TCP clients send and receive data using send/recv socket calls.
@gandro Your observation is correct. Golang uses a connected UDP client by default. Regarding, sendto for connected UDP, here is a snippet from the man page -

If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0)

return fmt.Errorf("failed to parse sock trace event: %w", err)
}

isRevNat := decodeRevNat(sock.XlatePoint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to formulate this: I think (but I might be mistaken, cc @aditighag) that pre/post-xlate-rev is only executed on recvmsg and getpeername, both system calls don't really imply anything of overall the "traffic direction" (e.g. if it was an ingress or egress connection in policy lingo).

When podA on nodeA talks to podB on nodeB, both fwd and rev translation happen on the nodeA (for UDP). For TCP and connected UDP, it would just be fwd translation.

The {pre, post}-fwd/rev markers already convey the direction. We shouldn't try to interpret this with the same context as policy terminology (ingress/egress).

fwd: svc vip -> backend ip
rev: backend ip -> svc vip

Hope that makes sense.

pkg/hubble/parser/sock/parser.go Show resolved Hide resolved
pkg/hubble/parser/sock/parser.go Outdated Show resolved Hide resolved
This adds the relevant fields, including the translation point, socket
cookie and cgroup id. The cgroup id allocated by the kernel can never be
zero, so using zero as the "absent" value is possible here.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit moves out useful functions out of the threefour parser
package into a common package to be shared among parsers.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This allows us to call this helper function from other parsers as well.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Suggested-by: Alexandre Perrin <alex@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Suggested-by: Alexandre Perrin <alex@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Suggested-by: Alexandre Perrin <alex@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds a Hubble parser for `TraceSockNotify` events. Those
events are emitted whenever SockLB is performed, see #20492
for details about the datapath tracing.

This parser is similar to the L3/L4 event parser. But because the events
are traced on the socket level, we do not know the source endpoint id,
source endpoint IP address or source TCP/UDP port of the event. We only
know the source cgroup ID. From the cgroup, we can derive which local
pod emitted the event (via `GetPodMetadataForContainer`). From the pod
IP we can then perform an IP-based lookup in either the local endpoint
list or the IPCache to determine the source endpoint metadata (endpoint
id, numeric identity, labels, etc). The lookup for destination IP is
almost identical to the threefour parser, with the exception that we do
not have any datapath identity available for fallback.

The resulting flow looks similar to regular `FlowType_L3_L4`, with a
couple of key differences:

 - The cgroup id of the container is available
 - Information if the event is pre or post translation point is
   available
 - The socket cookie is available
 - The TCP/UDP source port is _not_ known
 - TCP flags are _not_ known
 - Ethernet headers are _not_ known

Unfortunately, sock events can also be performed by non-pod entities
(e.g. the host cgroup). In that case, we cannot perform a mapping to any
local endpoint and therefore the source IP and endpoint data are left
empty. The cgroup id may also not be available on older kernels or
systems with the systemd cgroup driver (though this will likely be fixed
in a follow-up PR). This fact that the source IP may be empty/unknown is
also key difference to the flows which have been produced by Hubble so
far.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit adds a new agent flag which instructs Hubble to skip
TraceSock events with unknown cgroup ids, i.e. cgroup ids for which we
are unable to find a matching pod. Those skipped events are not added to
the ring buffer and can only be observed via `cilium monitor`, unless
the flag is disabled.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
To avoid cluttering the ConfigMap with default values, it is only added
to the ConfigMap if explicitly set (i.e. not is not `nil`).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/socket-lb-traces-with-hubble branch from 93edc54 to c537251 Compare November 7, 2022 11:54
@gandro gandro requested a review from a team as a code owner November 7, 2022 11:54
@gandro gandro requested a review from aditighag November 7, 2022 15:10
@gandro
Copy link
Member Author

gandro commented Nov 8, 2022

/test

Copy link
Contributor

@squeed squeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for my files!

Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@gandro gandro removed the request for review from a team November 14, 2022 14:27
@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 14, 2022
@gandro
Copy link
Member Author

gandro commented Nov 14, 2022

Removed api code owner, as it has been superseeded by sig-hubble-api and is thus not required anymore (if it was required, I could not have removed the review request).

Marking ready to merge.

@tklauser tklauser merged commit fd2909a into master Nov 16, 2022
@tklauser tklauser deleted the pr/gandro/socket-lb-traces-with-hubble branch November 16, 2022 10:12
@gandro gandro self-assigned this Nov 21, 2022
@aditighag aditighag mentioned this pull request Jan 5, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants