
Suspecting wrong backend selection while calling k8s services #17859

Closed
sterchelen opened this issue Nov 12, 2021 · 10 comments
Labels: kind/bug, need-more-info, sig/datapath


sterchelen commented Nov 12, 2021

Bug report

General Information

  • Cilium version v1.10.5 via kops addon
  • Kernel version 5.8.0-1042-aws
  • Orchestration system version in use: v1.21.2 via kops
  • AWS cloud used
  • encryption enabled via IPsec
  • IPv4 only
  • Hubble enabled
  • node-local-dns installed via kops addon
  • partial kube-proxy replacement in place
  • CNI chaining: portmap
  • pods communicate with each other through Kubernetes services

Summary

Periodically, a source pod receives a 404 error when calling a Kubernetes service.
Our tracing system shows that a pod from deployment B is called instead of a pod from deployment A, which results in a 404 error returned by the wrong pod because it can't handle the request.

How to reproduce?

It is very hard to reproduce the issue and it takes a lot of luck, because the occurrences are very rare and transient.
We managed to find a pod triggering the issue. By connecting to it and running a curl loop to simulate HTTP calls to the same k8s service,
we were able to reproduce the 404 issue. At that time, we had 50 pods attached to the Kubernetes service.
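
For the record, the reproduction loop looked roughly like this (a sketch; the service host, port and call count are placeholders):

# Run from inside the affected client pod: call the service repeatedly
# and count the HTTP status codes that come back.
for i in $(seq 1 500); do
  curl -s -o /dev/null -w '%{http_code}\n' http://<service-name>.<namespace>.svc.cluster.local/
done | sort | uniq -c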

Findings

  • the error remained transient via the curl calls (roughly 1 out of 50)
  • the IPs from the Cilium endpoints and the Kubernetes endpoints were identical and corresponded to valid pods
  • no error when calling the Cilium endpoint IPs directly (500 HTTP calls were made for each endpoint IP)
  • no error when doing the same curl calls from a pod that didn't experience the 404 issue
  • after activating L7 monitoring via the io.cilium.proxy-visibility annotation, no error was seen from the point of view of hubble monitor (see the command sketch after this list)
    • the number of requests sent matched the number of responses
  • with cilium monitor we could see that all the requests were sent to the right service IP, but without knowing the destination pod IP
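
For reference, the commands used for those last checks looked roughly like this (a sketch; the pod, namespace and service IP are placeholders, and the annotation direction/port may need adjusting):

# Enable L7 (HTTP) visibility for the client pod's egress traffic on port 80
kubectl annotate pod <client-pod> -n <namespace> \
  'io.cilium.proxy-visibility=<Egress/80/TCP/HTTP>'

# Compare the number of requests vs responses for that pod
hubble observe --protocol http --pod <namespace>/<client-pod> -f

# Lower-level view from the cilium agent on the node
cilium monitor | grep <service-cluster-ip>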

See the attached sysdump for a node that had the issue.

And here is the cilium config:

apiVersion: v1
data:
  auto-direct-node-routes: "false"
  bpf-ct-global-any-max: "262144"
  bpf-ct-global-tcp-max: "524288"
  cgroup-root: /sys/fs/cgroup/unified
  cluster-name: default
  cni-chaining-mode: portmap
  container-runtime: none
  debug: "false"
  enable-hubble: "true"
  enable-ipsec: "true"
  enable-ipv4: "true"
  enable-ipv6: "false"
  enable-k8s-event-handover: "true"
  enable-metrics: "true"
  enable-node-port: "false"
  enable-remote-node-identity: "true"
  etcd-config: |-
    ---
    endpoints:
      - https://xxxxxxxx:4003

    trusted-ca-file: '/var/lib/etcd-secrets/etcd-ca.crt'
    key-file: '/var/lib/etcd-secrets/etcd-client-cilium.key'
    cert-file: '/var/lib/etcd-secrets/etcd-client-cilium.crt'
  hubble-disable-tls: "false"
  hubble-listen-address: :4244
  hubble-metrics: dns drop flow http icmp port-distribution
  hubble-metrics-server: :9091
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/tls.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/tls.key
  identity-allocation-mode: kvstore
  install-iptables-rules: "true"
  ipam: kubernetes
  ipsec-key-file: /etc/ipsec/keys
  kube-proxy-replacement: partial
  kvstore: etcd
  kvstore-opt: '{"etcd.config": "/var/lib/etcd-config/etcd.config"}'
  masquerade: "true"
  monitor-aggregation: medium
  operator-prometheus-serve-addr: :6942
  preallocate-bpf-maps: "false"
  prometheus-serve-addr: :9090
  sidecar-istio-proxy-image: cilium/istio_proxy
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-poller: "false"
  tunnel: vxlan
  wait-bpf-mount: "false"
kind: ConfigMap
metadata:
  labels:
    addon.kops.k8s.io/name: networking.cilium.io
    addon.kops.k8s.io/version: 1.10.0-kops.1
    app.kubernetes.io/managed-by: kops
    role.kubernetes.io/networking: "1"
  name: cilium-config
  namespace: kube-system
  resourceVersion: "2015"
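
(For what it's worth, the kube-proxy replacement mode the agent actually runs with can be double-checked like this; a sketch, the exact wording of the status output may differ per version:)

kubectl -n kube-system exec ds/cilium -- cilium status | grep -i kubeproxyreplacement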

Thank you 🙇🏼

ping @gmiam @grams @alex-raoul @valouille @IxDay @choutone

sterchelen added the kind/bug label Nov 12, 2021

gmiam commented Nov 12, 2021

One precision: we switched identity-allocation-mode from crd to kvstore this morning. I don't know whether that might be confusing in the sysdump.


brb commented Nov 12, 2021

Thanks for opening the issue. This needs more info:

  • What service name?
  • What client and backend IP addrs?
  • Accessed from inside or outside cluster?

brb added the need-more-info and sig/datapath labels Nov 12, 2021
sterchelen (Author) commented:

* What service name?

Calls should land on the badoom3 service but sometimes land on something else.
Calls coming to badoom3 are from doorman and pastrami.

* What client and backend IP addrs?

It is really hard to answer this question as our pods come and go :/ Tell me if you want me to pull the IPs from the sysdump.

* Accessed from inside or outside cluster?

Accessed from inside.


One additional piece of information: we don't have any network policies in place.

sterchelen (Author) commented:

We ran into the same situation and made multiple sysdumps to give you additional info:

The pod doorman-7f5f67c946-j47nx should have contacted the pastrami service but we found that the call was directly made to the pod badoom3-598698858d-49zl6.

We made one sysdump for the node hosting doorman-7f5f67c946-j47nx called receiver-1.zip and another one for the node hosting badoom3-598698858d-49zl6 called transmitter-1.zip

Hope that will help.


brb commented Nov 19, 2021

Thanks for the sysdumps!

Some investigation brain dump below.

In-cluster bpf_lxc per-packet LB: the client "doorman-7f5f67c946-j47nx" (100.97.222.42) connects to the "pastrami" service (100.66.190.186, revnat=57) but gets the wrong backend "badoom3-598698858d-49zl6" (100.98.135.52, which belongs to svc=100.69.235.31, revnat=62). Session affinity is disabled.
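
(For the record, those ClusterIPs can be mapped back to service names with something like this:)

# Which services own the two ClusterIPs seen above?
kubectl get svc -A | grep -E '100\.66\.190\.186|100\.69\.235\.31'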

From the CT table on the client node, there are 4 entries showing the wrong selection:

cat cilium-bpf-ct-list-global.md | grep -v ICMP | grep --color=auto 100.97.222.42 | grep --color=auto 100.98.135.52 | grep RevNAT=57

TCP OUT 100.97.222.42:33660 -> 100.98.135.52:80 expires=96650 RxPackets=15 RxBytes=103572 RxFlagsSeen=0x03 LastRxReport=96640 TxPackets=15 TxBytes=3728 TxFlagsSeen=0x03 LastTxReport=96640 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=57 SourceSecurityID=46028 IfIndex=0
TCP OUT 100.97.222.42:53390 -> 100.98.135.52:80 expires=96613 RxPackets=14 RxBytes=100482 RxFlagsSeen=0x03 LastRxReport=96603 TxPackets=15 TxBytes=3101 TxFlagsSeen=0x03 LastTxReport=96603 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=57 SourceSecurityID=46028 IfIndex=0
TCP OUT 100.97.222.42:51678 -> 100.98.135.52:80 expires=96720 RxPackets=16 RxBytes=103603 RxFlagsSeen=0x03 LastRxReport=96710 TxPackets=17 TxBytes=2479 TxFlagsSeen=0x03 LastTxReport=96710 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=57 SourceSecurityID=46028 IfIndex=0
TCP OUT 100.97.222.42:50788 -> 100.98.135.52:80 expires=96603 RxPackets=15 RxBytes=100962 RxFlagsSeen=0x03 LastRxReport=96593 TxPackets=15 TxBytes=3104 TxFlagsSeen=0x03 LastTxReport=96593 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=57 SourceSecurityID=46028 IfIndex=0

Let's take port "100.97.222.42:33660"; there are the following entries in the CT:

> ack '100.97.222.42:33660' .
cilium-bpf-ct-list-global.md
9678:TCP OUT 100.97.222.42:33660 -> 100.98.135.52:80 expires=96650 RxPackets=15 RxBytes=103572 RxFlagsSeen=0x03 LastRxReport=96640 TxPackets=15 TxBytes=3728 TxFlagsSeen=0x03 LastTxReport=96640 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=57 SourceSecurityID=46028 IfIndex=0
33586:TCP OUT 100.69.235.31:80 -> 100.97.222.42:33660 service expires=115387 RxPackets=0 RxBytes=333 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x03 LastTxReport=93787 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=62 SourceSecurityID=0 IfIndex=0
66569:TCP OUT 100.64.68.14:80 -> 100.97.222.42:33660 service expires=105991 RxPackets=0 RxBytes=495 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x03 LastTxReport=84391 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=70 SourceSecurityID=0 IfIndex=0
82781:TCP OUT 100.66.190.186:80 -> 100.97.222.42:33660 service expires=118240 RxPackets=0 RxBytes=328 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x03 LastTxReport=96640 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=57 SourceSecurityID=0 IfIndex=0

From the above we can see that the client, using the same source port, connected first to revnat=70, then to revnat=62 (the service the wrong backend belongs to), and then to the revnat=57 service.
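
(To double-check which frontend each of those RevNAT IDs corresponds to on the node, something like this can be used; a sketch, assuming the --revnat flag of cilium bpf lb list is available in v1.10:)

# RevNAT ID -> service frontend IP:port
kubectl -n kube-system exec <cilium-pod> -- cilium bpf lb list --revnat

# Frontend -> backend slots currently programmed in the datapath
kubectl -n kube-system exec <cilium-pod> -- cilium bpf lb list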

The revnat=57 service entry has selected backend_id=328 (the "RxBytes hack": for service CT entries, the selected backend_id is reported in the RxBytes field):

key: 48 01  value: 64 62 87 34 00 50 00 00

This is the wrong backend 100.98.135.52.
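
For reference, decoding that key/value by hand (a sketch, assuming the usual cilium_lb4_backends layout: a 2-byte little-endian backend ID as the key, and an IPv4 address followed by the port in network byte order in the value):

# key "48 01" is little-endian -> 0x0148 = 328
printf 'backend_id: %d\n' 0x0148

# value "64 62 87 34 00 50 ...": the first four bytes are the IPv4 address,
# the next two the port in network byte order
printf 'ip: %d.%d.%d.%d  port: %d\n' 0x64 0x62 0x87 0x34 0x0050

# the raw entry comes from the pinned map, e.g.:
# bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_backends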

How is it possible that lb4_local() selected the wrong backend?

I expect that it entered this branch https://github.com/cilium/cilium/blob/v1.10.5/bpf/lib/lb.h#L1370, as the client->svc tuple should have been new (i.e., CT_NEW), and thus the correct backend should have been selected.

@sterchelen some questions:

  1. Is it possible that 100.98.135.52 previously belonged to a backend of the pastrami service, and during the testing that backend was removed and the IP was assigned to a new pod which belongs to a different svc?
  2. Would it be possible for you to run a custom image based on v1.10.5? I'd add some debug statements to the datapath.

alex-raoul commented:

Hello,
I'm working with @sterchelen

Is it possible that 100.98.135.52 previously belonged to a backend of the pastrami service, and during the testing that backend was removed and the IP was assigned to a new pod which belongs to a different svc?

It is possible, but we don't have logs of the IP assignments to pods.

We have a pool of, let's say:

  • 20 doorman (emitter)
  • 25 pastrami (supposed receiver)
  • 30 badoom (one of them being the actual unexpected receiver)

Each doorman is making curl-like calls to the pastrami svc, which is spread across the 25 pods.

If we take the pod name you found: across doorman-7f5f67c946-j47nx's lifetime (~7h), it made around 40k calls to the pastrami service.

We estimate (from some kubectl exec curl loops, watching in hubble-ui which backend was actually hit on another occurrence) that around 1/(pastrami-pool-size + 1) of the calls (1/26 in this example) were misrouted (it might have been 1/25 as well; we didn't run thousands of calls to confirm the exact probability).

Each (detected) wrong routing went to badoom3-598698858d-49zl6, and not to any other badoom pod (or any other pod at all).
No other doorman pod had any issue during this period.

As soon as we detect a faulty pair, like here (a doorman pod reaching a badoom pod through the pastrami svc), it stays until either end is deleted, even if the HPA+RS for doorman, pastrami or badoom added/removed pods (as long as it didn't kill either end of the faulty pair).

Some additional notes :

  • We detected the issue on other pod-to-svc interactions (not only doorman pod > pastrami svc).
  • Badoom pods are exposed through another svc as well.
  • We ran a few scripts to compare CiliumEndpoints to svc Endpoints and to the bpf map content (decoding bpftool dumps of /sys/fs/bpf/tc/globals/cilium_lb4_backends & cilium_lb4_services_v2 on a cilium DS pod) and found nothing strange; see the sketch after this list.
  • We had 30+ hour spans with no fault detected, and sometimes 50+ faulty pairs within 6h (we automated pod deletion with a custom script), with no apparent correlation between them (node, traffic, HPA activity, ...).
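
Roughly, the comparison looked like this (a simplified sketch; the namespace, jq paths and cilium pod name are assumptions to adapt):

# Pod IPs in the Kubernetes Endpoints object of the service
kubectl get endpoints pastrami -n <namespace> -o json \
  | jq -r '.subsets[].addresses[].ip' | sort > /tmp/ep.txt

# Pod IPs as Cilium sees them
kubectl get ciliumendpoints -n <namespace> -o json \
  | jq -r '.items[].status.networking.addressing[].ipv4' | sort > /tmp/cep.txt

# Endpoint IPs missing from the CiliumEndpoints list (should be empty)
comm -23 /tmp/ep.txt /tmp/cep.txt

# Raw datapath maps, dumped from a cilium DS pod on the node
kubectl -n kube-system exec <cilium-pod> -- \
  bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_backends
kubectl -n kube-system exec <cilium-pod> -- \
  bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_lb4_services_v2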

Would it be possible for you to run a custom image based on v1.10.5? I'd add some debug statements to the datapath.

Yes, sure, but we have yet to find a reliable way to reproduce the issue on demand.


brb commented Nov 19, 2021

As soon as we detect a faulty pair, like here, a doorman pod calling through pastrami svc a badoom pod, it'll stay until either end is deleted, even if the hpa+rs for doorman, pastrami or badoom added/removed pods (as long as it didn't killed either end of the faulty pair).

Are you saying that once the request is routed to the wrong backend, each subsequent request is routed to the same wrong backend from the same client pod?

Yes, sure, but we have yet to find a reliable way to reproduce the issue on demand.

Could you ping me in the Cilium slack on Monday (I'm martynas there)?


alex-raoul commented Nov 19, 2021

Are you saying that once the request is routed to the wrong backend, each subsequent request is routed to the same wrong backend from the same client pod?

No, only a small proportion, roughly 1 out of the approximate size of the backend pool. And once we detect it, this small error rate stays visible until either pod is killed.
It's as if the wrong backend had joined the pool of legitimate backends, but only for this emitter.

Could you ping me in the Cilium slack on Monday (I'm martynas there)?

Will do 👍

brb self-assigned this Nov 23, 2021

brb commented Nov 23, 2021

Could you try using quay.io/cilium/cilium-ci:d75aa385a9fee027b9d35f96e63e5f78c12c8f50@sha256:cb8b6e70273604b7833547cb6c9bfb4d9cab4f4230f1a4f01f773e01af23dff9 (built by https://github.com/cilium/cilium/runs/4301548530?check_suite_focus=true) to see whether it resolves your issue?
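
(If it helps, swapping the image in place could look roughly like this, assuming the agent container in the kops-managed DaemonSet is named cilium-agent:)

kubectl -n kube-system set image daemonset/cilium \
  cilium-agent=quay.io/cilium/cilium-ci:d75aa385a9fee027b9d35f96e63e5f78c12c8f50@sha256:cb8b6e70273604b7833547cb6c9bfb4d9cab4f4230f1a4f01f773e01af23dff9

# wait for the rollout, then confirm the running version
kubectl -n kube-system rollout status daemonset/cilium
kubectl -n kube-system exec ds/cilium -- cilium version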


brb commented Dec 10, 2021

Closing, as this should be fixed by #18113. Please reopen if the issue persists.

brb closed this as completed Dec 10, 2021