
{source,destination}_service is not always set #21251

Open

chancez opened this issue Sep 8, 2022 · 8 comments

Labels
  • kind/bug: This is a bug in the Cilium logic.
  • kind/question: Frequently asked questions & answers. This issue will be linked from the documentation's FAQ.
  • pinned: These issues are not marked stale by our issue bot.
  • sig/agent: Cilium agent related.
  • sig/hubble: Impacts hubble server or relay.

Comments


chancez commented Sep 8, 2022

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

When looking at Hubble flows, I only sometimes see source_service and destination_service set on flows destined to a service.

It seems that the source/destination IP is sometimes the clusterIP and sometimes the podIP, even though all traffic to the pod goes through the clusterIP service. Given that Hubble uses the IP and port to look up the underlying service, and that the source/destination IP is only sometimes the clusterIP, this makes sense.

The behavior is relatively predictable as well. The clusterIP shows up in the flows right when everything is first being started/created, or whenever I restart the backend pod. This leads me to believe that perhaps the clusterIP is used during initial connections, and some of the future flows are using the podIP. Or something of that sort.

Here are two flows from the same source to the same destination to illustrate the problem:

Flow with the clusterIP as the destination IP (10.96.200.135) and destination_service (elasticsearch-master) correctly set:

{"flow":{"time":"2022-09-08T21:51:55.392890096Z","verdict":"FORWARDED","ethernet":{"source":"46:11:eb:90:a2:53","destination":"be:43:3c:88:46:ec"},"IP":{"source":"10.0.0.227","destination":"10.96.200.135","ipVersion":"IPv4"},"l4":{"TCP":{"source_port":35446,"destination_port":9200,"flags":{"SYN":true}}},"source":{"ID":1563,"identity":37427,"namespace":"tenant-jobs","labels":["k8s:app=coreapi","k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=tenant-jobs","k8s:io.cilium.k8s.namespace.labels.name=tenant-jobs","k8s:io.cilium.k8s.policy.cluster=default","k8s:io.cilium.k8s.policy.serviceaccount=default","k8s:io.kubernetes.pod.namespace=tenant-jobs"],"pod_name":"coreapi-546797cd76-jtcc2","workloads":[{"name":"coreapi-546797cd76","kind":"ReplicaSet"}]},"destination":{"identity":2,"labels":["reserved:world"]},"Type":"L3_L4","node_name":"kind-control-plane","event_type":{"type":4,"sub_type":3},"destination_service":{"name":"elasticsearch-master","namespace":"tenant-jobs"},"traffic_direction":"EGRESS","trace_observation_point":"TO_STACK","is_reply":false,"Summary":"TCP Flags: SYN"},"node_name":"kind-control-plane","time":"2022-09-08T21:51:55.392890096Z"}

Flow with the elasticsearch podIP (10.0.0.141) as the destination instead of the clusterIP, and thus a missing destination_service:

{"flow":{"time":"2022-09-08T21:52:45.408625862Z","verdict":"FORWARDED","ethernet":{"source":"f6:a6:f5:c6:e8:78","destination":"26:60:1b:d2:2f:6c"},"IP":{"source":"10.0.0.227","destination":"10.0.0.141","ipVersion":"IPv4"},"l4":{"TCP":{"source_port":55808,"destination_port":9200,"flags":{"PSH":true,"ACK":true}}},"source":{"ID":1563,"identity":37427,"namespace":"tenant-jobs","labels":["k8s:app=coreapi","k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=tenant-jobs","k8s:io.cilium.k8s.namespace.labels.name=tenant-jobs","k8s:io.cilium.k8s.policy.cluster=default","k8s:io.cilium.k8s.policy.serviceaccount=default","k8s:io.kubernetes.pod.namespace=tenant-jobs"],"pod_name":"coreapi-546797cd76-jtcc2","workloads":[{"name":"coreapi-546797cd76","kind":"ReplicaSet"}]},"destination":{"ID":469,"identity":11367,"namespace":"tenant-jobs","labels":["k8s:app=elasticsearch-master","k8s:chart=elasticsearch","k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=tenant-jobs","k8s:io.cilium.k8s.namespace.labels.name=tenant-jobs","k8s:io.cilium.k8s.policy.cluster=default","k8s:io.cilium.k8s.policy.serviceaccount=default","k8s:io.kubernetes.pod.namespace=tenant-jobs","k8s:release=jobs-app","k8s:statefulset.kubernetes.io/pod-name=elasticsearch-master-0"],"pod_name":"elasticsearch-master-0","workloads":[{"name":"elasticsearch-master","kind":"StatefulSet"}]},"Type":"L3_L4","node_name":"kind-control-plane","event_type":{"type":4},"traffic_direction":"INGRESS","trace_observation_point":"TO_ENDPOINT","is_reply":false,"interface":{"index":89,"name":"lxc58b226dba99b"},"Summary":"TCP Flags: ACK, PSH"},"node_name":"kind-control-plane","time":"2022-09-08T21:52:45.408625862Z"}

Cilium Version

Client: 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/arm64
Daemon: 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/arm64

Kernel Version

Linux lima-docker 5.15.0-47-generic #51-Ubuntu SMP Fri Aug 12 08:18:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:26:19Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-19T15:42:59Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/arm64"}

Sysdump

cilium-sysdump-20220908-145348.zip

Relevant log output

No response

Anything else?

This also happens when using kube-proxy replacement. I retested with KPR disabled and it still happened.

Code of Conduct

  • I agree to follow this project's Code of Conduct
chancez added the kind/bug and needs/triage labels on Sep 8, 2022

github-actions bot commented Nov 8, 2022

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

The github-actions bot added the stale label on Nov 8, 2022

gandro commented Nov 8, 2022

Related: cilium/hubble#713

This leads me to believe that perhaps the clusterIP is used during initial connections, and some of the future flows are using the podIP. Or something of that sort.

Yes, Cilium translates the clusterIP to a podIP as early as possible (even on the socket level if SockLB is enabled). Therefore, the actual traffic on the wire will always contain the podIP.

While it's easy to map a clusterIP to a service, it's less obvious for podIPs. One problem is that a pod can have multiple services selecting it. Since we process each flow individually, the node where the second flow arrives might not know which (if any) service clusterIP was used to access the pod.

We could just add all matching services to the second flow, but that might also be confusing to users: even if you connected directly to a podIP, Hubble would tell you that the flow event is associated with a service, even though no service was involved at all. On the other hand, the current behavior is also confusing, as indicated by the number of reports we get.
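
To illustrate that ambiguity, here is a hypothetical sketch (not Cilium code; the service names are made up for illustration): a reverse podIP-to-services index returns a set of candidates, and the decoder cannot tell which of them, if any, was actually used.

// Hypothetical sketch of the podIP -> services ambiguity described above; not Cilium code.
package main

import "fmt"

// A pod can be selected by any number of services, so a reverse index keyed by
// podIP yields a set of candidates rather than a single answer.
var servicesByPodIP = map[string][]string{
	"10.0.0.141": {
		"tenant-jobs/elasticsearch-master",
		"tenant-jobs/elasticsearch-master-headless",
	},
}

func main() {
	// Annotating the flow with every candidate would claim a service was
	// involved even for connections made directly to the podIP.
	fmt.Println(servicesByPodIP["10.0.0.141"])
}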

gandro added the sig/hubble label and removed the needs/triage and stale labels on Nov 8, 2022
aanm added the sig/agent label on Nov 9, 2022

chancez commented Nov 14, 2022

@gandro Right, I expect the mapping to always be clusterIP -> service (rather than podIP -> service), but that would require doing the lookup for every packet, I assume, which is probably why the flows only show the clusterIP early in the connection and the podIP later.

I assume we also have no way to inform Cilium that the traffic's original destination was the clusterIP after it has been translated to the podIP? That feels a bit like connection tracking, which I believe Cilium already does, so I'm curious whether this is a performance trade-off, a complexity trade-off, or just not possible with how we implement it.


gandro commented Nov 15, 2022

@chancez

Right, I expect the mapping to always be clusterIP -> service (rather than podIP -> service), but that would require doing the lookup for every packet, I assume, which is probably why the flows only show the clusterIP early in the connection and the podIP later.

Ignoring Hubble for a moment:

With SockLB, yes, we perform the translation from clusterIP to podIP as early as possible, so we don't have to do it for every packet. But if SockLB is not available (it's an optional feature), Cilium can also do the translation at the packet level, which indeed means it does the translation for every packet. However, we only do this at the bpf_lxc trace point, so once the packet leaves the container, it has already been rewritten to carry the podIP as the destination address.
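
Schematically (an illustration of the behavior described above, not Cilium's datapath code), both modes end up with only the podIP on the wire; SockLB just performs the rewrite once per connection instead of once per packet:

// Schematic illustration only; not Cilium's datapath code.
package main

import "fmt"

const (
	clusterIP = "10.96.200.135" // service frontend seen by the application
	podIP     = "10.0.0.141"    // backend address actually on the wire
)

// translate stands in for the clusterIP -> podIP rewrite.
func translate(dst string) string {
	if dst == clusterIP {
		return podIP
	}
	return dst
}

func main() {
	// With SockLB: translate once when the connection is set up; every later
	// packet already carries the podIP.
	fmt.Println("connect-time destination:", translate(clusterIP))

	// Without SockLB: the same rewrite is applied to every packet as it leaves
	// the container, so the result on the wire is identical.
	for i := 0; i < 3; i++ {
		fmt.Println("packet", i, "destination on the wire:", translate(clusterIP))
	}
}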

I assume we also have no way to inform Cilium that the traffic's original destination was the clusterIP after it has been translated to the podIP? That feels a bit like connection tracking, which I believe Cilium already does, so I'm curious whether this is a performance trade-off, a complexity trade-off, or just not possible with how we implement it.

So yes, to be able to perform reverse NAT for reply packets, we do maintain a NAT table (with SockLB) or a CT table (without SockLB) that tells us whether the connection was NATed. While in theory we could perform a lookup in that table at every trace point, we currently don't, because it's not necessary for the core tasks of the datapath (i.e. policy enforcement, load balancing, encryption), and every additional map lookup incurs a per-packet overhead.

But there is also a more fundamental limitation when it comes to cross-node traffic: the above tables are local to the node where the flow originated. Once a packet is NATed (i.e. the destination clusterIP has been replaced with a destination podIP) and sent to another node, all the remote node sees is the podIP. The remote node does not have access to the NAT tables used to rewrite the packet, and thus cannot check whether that particular packet was ever NATed. The remote node cannot know the original destination IP (unless we introduce some form of packet encapsulation or something similar).
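
A rough sketch of that limitation (hypothetical data structures, with IPs borrowed from the flows above and the cross-node split assumed purely for illustration; these are not Cilium's actual NAT/CT maps): the pre-NAT destination is only recorded on the node that performed the translation, so the remote node has nothing to look up.

// Hypothetical sketch of the cross-node limitation; not Cilium's actual maps.
package main

import "fmt"

type connKey struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
}

// Per-node reverse-NAT state: only the node that performed the
// clusterIP -> podIP translation recorded the original destination.
var node1RevNAT = map[connKey]string{
	{srcIP: "10.0.0.227", dstIP: "10.0.0.141", srcPort: 55808, dstPort: 9200}: "10.96.200.135",
}
var node2RevNAT = map[connKey]string{} // the remote node never saw the translation

func originalDestination(table map[connKey]string, k connKey) (string, bool) {
	orig, ok := table[k]
	return orig, ok
}

func main() {
	k := connKey{srcIP: "10.0.0.227", dstIP: "10.0.0.141", srcPort: 55808, dstPort: 9200}
	fmt.Println(originalDestination(node1RevNAT, k)) // "10.96.200.135" true
	fmt.Println(originalDestination(node2RevNAT, k)) // "" false
}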


chancez commented Nov 15, 2022

@gandro I think when it comes to cross-node traffic, that's primarily going to affect source_service, which conceptually never made sense to me anyway, so I think that's relatively acceptable. Based on what you said, I think destination_service would work as expected without SockLB, and could work with it as well, but currently the "works sometimes" behavior makes it pretty unusable.

Perhaps short-term we should just document this limitation (if it isn't already), and in the future we may revisit this with SockLB so that the destination_service metadata is always set correctly.


gandro commented Nov 16, 2022

@gandro I think when it comes to cross-node traffic, that's primarily going to affect source_service, which conceptually never made sense to me anyway, so I think that's relatively acceptable.

I'm not sure I follow. Imagine the following chain of events, where xwing-pod-1 is running on node1 and deathstar-pod-2 is running on node2.

[k8s-node1] xwing-pod-1 -> deathstar-service (pre-translation)  // destination_service is set
[k8s-node1] xwing-pod-1 -> deathstar-pod-2   (post-translation) // destination_service is empty, but could technically be recovered
[k8s-node1] xwing-pod-1 -> deathstar-pod-2   (from-endpoint)    // destination_service is empty, but could technically be recovered
[k8s-node1] xwing-pod-1 -> deathstar-pod-2   (to-stack)         // routed to node2, otherwise same as above

[k8s-node2] xwing-pod-1 -> deathstar-pod-2   (from-stack)       // arriving at node2, destination_service is empty, and _not_ recoverable
[k8s-node2] xwing-pod-1 -> deathstar-pod-2   (to-endpoint)      // destination_service is empty, and _not_ recoverable

[k8s-node2] deathstar-pod-2 -> xwing-pod-1   (from-endpoint)    // reply packet, now the IPs are swapped and no service IP is involved yet
[k8s-node2] deathstar-pod-2 -> xwing-pod-1   (to-stack)         // reply packet, same as above

[k8s-node1] deathstar-pod-2 -> xwing-pod-1   (from-stack)       // source_service is empty, but could technically be recovered
[k8s-node1] deathstar-pod-2 -> xwing-pod-1   (to-endpoint)      // source_service is empty, but could technically be recovered
[k8s-node1] deathstar-pod-2 -> xwing-pod-1   (pre-translation)  // source_service is empty, but could technically be recovered
[k8s-node1] deathstar-service -> xwing-pod-1 (post-translation) // source_service is set

I think this should demonstrate that k8s-node1 could probably recover the original service field (albeit at a high performance cost), but k8s-node2 cannot, because it never saw the NAT happening and thus can only guess whether deathstar-pod-2 was accessed via the podIP, a clusterIP (it could have multiple), or maybe even a NodePort.

It also demonstrates that source_service is used for regular traffic. It is technically the destination service of the connection, but since Hubble only has a per-packet view for trace events, it becomes the source_service of the event: the clusterIP is the source IP of the reply packet when it is delivered to the xwing application.


chancez commented Nov 16, 2022

Ah right, I was thinking only about egress on the source node, not ingress on the destination node, when it came to destination_service.

@github-actions

This comment was marked as resolved.

The github-actions bot added the stale label on Jan 16, 2023
gandro added the kind/question and pinned labels and removed the stale label on Jan 16, 2023