
Complexity Issue with cilium v1.9.5 when enable-endpoint-routes=true #16144

Closed · houminz opened this issue May 14, 2021 · 13 comments
Labels
  kind/bug: This is a bug in the Cilium logic.
  kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  kind/complexity-issue: Relates to BPF complexity or program size issues.
  need-more-info: More information is required to further debug or fix the issue.
  needs/triage: This issue requires triaging to establish severity and next steps.
  sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments


houminz commented May 14, 2021

Bug report

General Information

  • Cilium version (run cilium version)
Cilium 1.9.5 0d18eedf2 2021-04-14T07:25:27+00:00 go version go1.15.11 linux/amd64
  • Kernel version (run uname -a): 5.4.87
  • Orchestration system version in use (e.g. kubectl version, ...): v1.18.4
  • Link to relevant artifacts (policies, deployments scripts, ...)
  • Generate and upload a system zip:
curl -sLO https://git.io/cilium-sysdump-latest.zip && python cilium-sysdump-latest.zip

How to reproduce the issue

I came across the complexity issue with cilium 1.9.5. In my scenario, the CNI failed to create the network for a pod; the pod events show Unable to create endpoint: Cilium API client timeout exceeded:

  Warning  FailedCreatePodSandBox  4s    kubelet, 9.51.26.208  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e84ae603bce1ea60d495cca27731d761f862b78b3de53b6fd39dda826efc571c" network for pod "busybox-546df6b849-22mg8": networkPlugin cni failed to set up pod "busybox-546df6b849-22mg8_default" network: Unable to create endpoint: Cilium API client timeout exceeded
  Normal   SandboxChanged          4s    kubelet, 9.51.26.208  Pod sandbox changed, it will be killed and re-created.

Then I checked the cilium-agent logs, which showed that the endpoint creation failed when the agent executed the tc command to load the BPF program:

level=info msg="Create endpoint request" addressing="&{9.166.64.223 e9568ae0-b315-11eb-a884-0425c502b8a3  }" containerID=dc19a3d486289e8215757fdceaa6cb071dd9ee841a58b7ce09abb8a448a772b3 datapathConfiguration="&{false true false true 0xc002a81a02}" interface=lxcefc1adc1539b k8sPodName=default/busybox-546df6b849-qtvm4 labels="[]" subsys=daemon sync-build=true
level=info msg="New endpoint" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2729 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=info msg="Resolving identity labels (blocking)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2729 identityLabels="k8s:app=busybox,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=default,k8s:io.kubernetes.pod.namespace=default" ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=info msg="Identity of endpoint changed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2729 identity=59504 identityLabels="k8s:app=busybox,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=default,k8s:io.kubernetes.pod.namespace=default" ipv4= ipv6= k8sPodName=/ oldIdentity="no identity" subsys=endpoint
level=info msg="Waiting for endpoint to be generated" containerID= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2729 identity=59504 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=info msg="Compiled new BPF template" BPFCompilationTime=1.87285162s file-path=/var/run/cilium/state/templates/9d559f3a23be1b6b0dd3ef9d479b1fc66265e3ba/bpf_lxc.o subsys=datapath-loader
level=error msg="Command execution failed" cmd="[tc filter replace dev lxcefc1adc1539b ingress prio 1 handle 1 bpf da obj 2729_next/bpf_lxc.o sec from-container]" error="exit status 1" subsys=datapath-loader
level=warning msg="Log buffer too small to dump verifier log 16777215 bytes (10 tries)!" subsys=datapath-loader
level=warning msg="Error fetching program/map!" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2729 error="Failed to load tc filter: exit status 1" file-path=2729_next/bpf_lxc.o identity=59504 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=lxcefc1adc1539b
level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2729 error="Failed to load tc filter: exit status 1" identity=59504 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2729 file-path=2729_next_fail identity=59504 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=1.87285162s bpfLoadProg=20.252887977s bpfWaitForELF=1.873052008s bpfWriteELF="298.402µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2729 error="Failed to load tc filter: exit status 1" identity=59504 ipv4= ipv6= k8sPodName=/ mapSync="39.684µs" policyCalculation="49.736µs" prepareBuild="351.527µs" proxyConfiguration="7.027µs" proxyPolicyCalculation=266ns proxyWaitForAck=0s reason="updated security labels" subsys=endpoint total=22.134114349s waitingForCTClean=6.354728ms waitingForLock=941ns
level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2729 error="Failed to load tc filter: exit status 1" identity=59504 ipv4= ipv6= k8sPodName=/ subsys=endpoint

I kubectl exec'd into the cilium pod and executed the commands below:

# kubectl exec -it cilium-2zxn4 bash -n kube-system
/home/cilium# cd /var/run/cilium/state
/var/run/cilium/state# tc filter replace dev lxc_health ingress prio 1 handle 1 bpf da obj 3191_next/bpf_lxc.o sec from-container^C
/var/run/cilium/state# ls
3207_next       3436       3684_next_fail  95                  bpf_overlay.o  globals              netdev_config.h
3207_next_fail  3684_next  6_next_fail     bpf_alignchecker.o  device.state   health-endpoint.pid  templates
/var/run/cilium/state# tc filter replace dev lxc_health ingress prio 1 handle 1 bpf da obj 4091_next/bpf_lxc.o sec from-container
Log buffer too small to dump verifier log 16777215 bytes (10 tries)!
Error fetching program/map!
Unable to load program
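
The "Log buffer too small" message means the verifier log outgrew the largest buffer iproute2 will allocate (the 16777215 bytes / 10 tries in the message) before the rejection reason could be printed, which is itself a hint that the verifier walked a very large number of states. As a rough gauge, one can count the static instructions in the section tc was loading; a minimal sketch, assuming llvm-objdump is available in the agent image (the object path is the one from the transcript above). Note that the verifier's one-million-instruction budget is counted across all explored branches, so a program well below that static size can still be rejected:

  # Rough static instruction count of the failing section; each disassembly
  # line corresponds to approximately one BPF instruction.
  llvm-objdump -d --section=from-container 4091_next/bpf_lxc.o | wc -l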

After I changed the configuration, i.e. commented out the options below, it worked with version 1.9.5:

  enable-endpoint-routes: "false"
  enable-local-node-route: "true"
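
For completeness, a minimal sketch of applying the same change through the cilium-config ConfigMap (assuming the standard kube-system installation; agents only pick the change up after a restart):

  kubectl -n kube-system patch configmap cilium-config --type merge \
    -p '{"data":{"enable-endpoint-routes":"false","enable-local-node-route":"true"}}'
  kubectl -n kube-system rollout restart daemonset/cilium
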
@houminz houminz added the kind/bug This is a bug in the Cilium logic. label May 14, 2021

houminz commented May 14, 2021

With enable-endpoint-routes=true, we can reproduce the problem on cilium v1.9.5. Here is my cilium configuration in detail:

level=info msg="Skipped reading configuration file" reason="Config File \"ciliumd\" Not Found in \"[/root]\"" subsys=config
level=info msg="Started gops server" address="127.0.0.1:9890" subsys=daemon
level=info msg="Memory available for map entries (0.003% of 134229590016B): 335573975B" subsys=config
level=info msg="option bpf-ct-global-tcp-max set by dynamic sizing to 1177452" subsys=config
level=info msg="option bpf-ct-global-any-max set by dynamic sizing to 588726" subsys=config
level=info msg="option bpf-nat-global-max set by dynamic sizing to 1177452" subsys=config
level=info msg="option bpf-neigh-global-max set by dynamic sizing to 1177452" subsys=config
level=info msg="option bpf-sock-rev-map-max set by dynamic sizing to 588726" subsys=config
level=info msg="  --agent-health-port='9876'" subsys=daemon
level=info msg="  --agent-labels=''" subsys=daemon
level=info msg="  --allow-icmp-frag-needed='true'" subsys=daemon
level=info msg="  --allow-localhost='auto'" subsys=daemon
level=info msg="  --annotate-k8s-node='true'" subsys=daemon
level=info msg="  --api-rate-limit='map[]'" subsys=daemon
level=info msg="  --auto-create-cilium-node-resource='true'" subsys=daemon
level=info msg="  --auto-direct-node-routes='false'" subsys=daemon
level=info msg="  --blacklist-conflicting-routes='false'" subsys=daemon
level=info msg="  --bpf-compile-debug='false'" subsys=daemon
level=info msg="  --bpf-ct-global-any-max='262144'" subsys=daemon
level=info msg="  --bpf-ct-global-tcp-max='524288'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-any='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp='6h0m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp-fin='10s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-regular-tcp-syn='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-any='1m0s'" subsys=daemon
level=info msg="  --bpf-ct-timeout-service-tcp='6h0m0s'" subsys=daemon
level=info msg="  --bpf-fragments-map-max='8192'" subsys=daemon
level=info msg="  --bpf-lb-acceleration='disabled'" subsys=daemon
level=info msg="  --bpf-lb-algorithm='random'" subsys=daemon
level=info msg="  --bpf-lb-maglev-hash-seed='JLfvgnHc2kaSUFaI'" subsys=daemon
level=info msg="  --bpf-lb-maglev-table-size='16381'" subsys=daemon
level=info msg="  --bpf-lb-map-max='65536'" subsys=daemon
level=info msg="  --bpf-lb-mode='snat'" subsys=daemon
level=info msg="  --bpf-map-dynamic-size-ratio='0.0025'" subsys=daemon
level=info msg="  --bpf-nat-global-max='524288'" subsys=daemon
level=info msg="  --bpf-neigh-global-max='524288'" subsys=daemon
level=info msg="  --bpf-policy-map-max='16384'" subsys=daemon
level=info msg="  --bpf-root=''" subsys=daemon
level=info msg="  --bpf-sock-rev-map-max='262144'" subsys=daemon
level=info msg="  --certificates-directory='/var/run/cilium/certs'" subsys=daemon
level=info msg="  --cgroup-root=''" subsys=daemon
level=info msg="  --cluster-id=''" subsys=daemon
level=info msg="  --cluster-name='default'" subsys=daemon
level=info msg="  --clustermesh-config='/var/lib/cilium/clustermesh/'" subsys=daemon
level=info msg="  --cmdref=''" subsys=daemon
level=info msg="  --config=''" subsys=daemon
level=info msg="  --config-dir='/tmp/cilium/config-map'" subsys=daemon
level=info msg="  --conntrack-gc-interval='0s'" subsys=daemon
level=info msg="  --crd-wait-timeout='5m0s'" subsys=daemon
level=info msg="  --datapath-mode='veth'" subsys=daemon
level=info msg="  --debug='false'" subsys=daemon
level=info msg="  --debug-verbose=''" subsys=daemon
level=info msg="  --device=''" subsys=daemon
level=info msg="  --devices=''" subsys=daemon
level=info msg="  --direct-routing-device=''" subsys=daemon
level=info msg="  --disable-cnp-status-updates='true'" subsys=daemon
level=info msg="  --disable-conntrack='false'" subsys=daemon
level=info msg="  --disable-endpoint-crd='false'" subsys=daemon
level=info msg="  --disable-envoy-version-check='false'" subsys=daemon
level=info msg="  --disable-iptables-feeder-rules=''" subsys=daemon
level=info msg="  --dns-max-ips-per-restored-rule='1000'" subsys=daemon
level=info msg="  --egress-masquerade-interfaces=''" subsys=daemon
level=info msg="  --egress-multi-home-ip-rule-compat='false'" subsys=daemon
level=info msg="  --enable-auto-protect-node-port-range='true'" subsys=daemon
level=info msg="  --enable-bandwidth-manager='false'" subsys=daemon
level=info msg="  --enable-bpf-clock-probe='true'" subsys=daemon
level=info msg="  --enable-bpf-masquerade='true'" subsys=daemon
level=info msg="  --enable-bpf-tproxy='false'" subsys=daemon
level=info msg="  --enable-endpoint-health-checking='true'" subsys=daemon
level=info msg="  --enable-endpoint-routes='true'" subsys=daemon
level=info msg="  --enable-external-ips='true'" subsys=daemon
level=info msg="  --enable-health-check-nodeport='true'" subsys=daemon
level=info msg="  --enable-health-checking='true'" subsys=daemon
level=info msg="  --enable-host-firewall='false'" subsys=daemon
level=info msg="  --enable-host-legacy-routing='false'" subsys=daemon
level=info msg="  --enable-host-port='true'" subsys=daemon
level=info msg="  --enable-host-reachable-services='false'" subsys=daemon
level=info msg="  --enable-hubble='true'" subsys=daemon
level=info msg="  --enable-identity-mark='true'" subsys=daemon
level=info msg="  --enable-ip-masq-agent='false'" subsys=daemon
level=info msg="  --enable-ipsec='false'" subsys=daemon
level=info msg="  --enable-ipv4='true'" subsys=daemon
level=info msg="  --enable-ipv4-fragment-tracking='true'" subsys=daemon
level=info msg="  --enable-ipv6='false'" subsys=daemon
level=info msg="  --enable-ipv6-ndp='false'" subsys=daemon
level=info msg="  --enable-k8s-api-discovery='false'" subsys=daemon
level=info msg="  --enable-k8s-endpoint-slice='true'" subsys=daemon
level=info msg="  --enable-k8s-event-handover='false'" subsys=daemon
level=info msg="  --enable-l7-proxy='false'" subsys=daemon
level=info msg="  --enable-local-node-route='false'" subsys=daemon
level=info msg="  --enable-local-redirect-policy='false'" subsys=daemon
level=info msg="  --enable-monitor='true'" subsys=daemon
level=info msg="  --enable-node-port='false'" subsys=daemon
level=info msg="  --enable-policy='default'" subsys=daemon
level=info msg="  --enable-remote-node-identity='true'" subsys=daemon
level=info msg="  --enable-selective-regeneration='true'" subsys=daemon
level=info msg="  --enable-session-affinity='true'" subsys=daemon
level=info msg="  --enable-svc-source-range-check='true'" subsys=daemon
level=info msg="  --enable-tracing='false'" subsys=daemon
level=info msg="  --enable-well-known-identities='false'" subsys=daemon
level=info msg="  --enable-xt-socket-fallback='true'" subsys=daemon
level=info msg="  --encrypt-interface=''" subsys=daemon
level=info msg="  --encrypt-node='false'" subsys=daemon
level=info msg="  --endpoint-interface-name-prefix='lxc+'" subsys=daemon
level=info msg="  --endpoint-queue-size='25'" subsys=daemon
level=info msg="  --endpoint-status=''" subsys=daemon
level=info msg="  --envoy-log=''" subsys=daemon
level=info msg="  --exclude-local-address=''" subsys=daemon
level=info msg="  --fixed-identity-mapping='map[]'" subsys=daemon
level=info msg="  --flannel-master-device=''" subsys=daemon
level=info msg="  --flannel-uninstall-on-exit='false'" subsys=daemon
level=info msg="  --force-local-policy-eval-at-source='true'" subsys=daemon
level=info msg="  --gops-port='9890'" subsys=daemon
level=info msg="  --host-reachable-services-protos='tcp,udp'" subsys=daemon
level=info msg="  --http-403-msg=''" subsys=daemon
level=info msg="  --http-idle-timeout='0'" subsys=daemon
level=info msg="  --http-max-grpc-timeout='0'" subsys=daemon
level=info msg="  --http-request-timeout='3600'" subsys=daemon
level=info msg="  --http-retry-count='3'" subsys=daemon
level=info msg="  --http-retry-timeout='0'" subsys=daemon
level=info msg="  --hubble-disable-tls='false'" subsys=daemon
level=info msg="  --hubble-event-queue-size='0'" subsys=daemon
level=info msg="  --hubble-flow-buffer-size='4095'" subsys=daemon
level=info msg="  --hubble-listen-address=':4244'" subsys=daemon
level=info msg="  --hubble-metrics=''" subsys=daemon
level=info msg="  --hubble-metrics-server=''" subsys=daemon
level=info msg="  --hubble-socket-path='/var/run/cilium/hubble.sock'" subsys=daemon
level=info msg="  --hubble-tls-cert-file='/var/lib/cilium/tls/hubble/server.crt'" subsys=daemon
level=info msg="  --hubble-tls-client-ca-files='/var/lib/cilium/tls/hubble/client-ca.crt'" subsys=daemon
level=info msg="  --hubble-tls-key-file='/var/lib/cilium/tls/hubble/server.key'" subsys=daemon
level=info msg="  --identity-allocation-mode='crd'" subsys=daemon
level=info msg="  --identity-change-grace-period='5s'" subsys=daemon
level=info msg="  --install-iptables-rules='true'" subsys=daemon
level=info msg="  --ip-allocation-timeout='2m0s'" subsys=daemon
level=info msg="  --ip-masq-agent-config-path='/etc/config/ip-masq-agent'" subsys=daemon
level=info msg="  --ipam='crd'" subsys=daemon
level=info msg="  --ipsec-key-file=''" subsys=daemon
level=info msg="  --iptables-lock-timeout='5s'" subsys=daemon
level=info msg="  --iptables-random-fully='false'" subsys=daemon
level=info msg="  --ipv4-node='auto'" subsys=daemon
level=info msg="  --ipv4-pod-subnets=''" subsys=daemon
level=info msg="  --ipv4-range='auto'" subsys=daemon
level=info msg="  --ipv4-service-loopback-address='169.254.42.1'" subsys=daemon
level=info msg="  --ipv4-service-range='auto'" subsys=daemon
level=info msg="  --ipv6-cluster-alloc-cidr='f00d::/64'" subsys=daemon
level=info msg="  --ipv6-mcast-device=''" subsys=daemon
level=info msg="  --ipv6-node='auto'" subsys=daemon
level=info msg="  --ipv6-pod-subnets=''" subsys=daemon
level=info msg="  --ipv6-range='auto'" subsys=daemon
level=info msg="  --ipv6-service-range='auto'" subsys=daemon
level=info msg="  --ipvlan-master-device='undefined'" subsys=daemon
level=info msg="  --join-cluster='false'" subsys=daemon
level=info msg="  --k8s-api-server=''" subsys=daemon
level=info msg="  --k8s-force-json-patch='false'" subsys=daemon
level=info msg="  --k8s-heartbeat-timeout='30s'" subsys=daemon
level=info msg="  --k8s-kubeconfig-path=''" subsys=daemon
level=info msg="  --k8s-namespace='kube-system'" subsys=daemon
level=info msg="  --k8s-require-ipv4-pod-cidr='false'" subsys=daemon
level=info msg="  --k8s-require-ipv6-pod-cidr='false'" subsys=daemon
level=info msg="  --k8s-service-cache-size='128'" subsys=daemon
level=info msg="  --k8s-service-proxy-name=''" subsys=daemon
level=info msg="  --k8s-sync-timeout='3m0s'" subsys=daemon
level=info msg="  --k8s-watcher-endpoint-selector='metadata.name!=kube-scheduler,metadata.name!=kube-controller-manager,metadata.name!=etcd-operator,metadata.name!=gcp-controller-manager'" subsys=daemon
level=info msg="  --k8s-watcher-queue-size='1024'" subsys=daemon
level=info msg="  --keep-config='false'" subsys=daemon
level=info msg="  --kube-proxy-replacement='probe'" subsys=daemon
level=info msg="  --kube-proxy-replacement-healthz-bind-address=''" subsys=daemon
level=info msg="  --kvstore=''" subsys=daemon
level=info msg="  --kvstore-connectivity-timeout='2m0s'" subsys=daemon
level=info msg="  --kvstore-lease-ttl='15m0s'" subsys=daemon
level=info msg="  --kvstore-opt='map[]'" subsys=daemon
level=info msg="  --kvstore-periodic-sync='5m0s'" subsys=daemon
level=info msg="  --label-prefix-file=''" subsys=daemon
level=info msg="  --labels=''" subsys=daemon
level=info msg="  --lib-dir='/var/lib/cilium'" subsys=daemon
level=info msg="  --log-driver=''" subsys=daemon
level=info msg="  --log-opt='map[]'" subsys=daemon
level=info msg="  --log-system-load='false'" subsys=daemon
level=info msg="  --masquerade='true'" subsys=daemon
level=info msg="  --max-controller-interval='0'" subsys=daemon
level=info msg="  --metrics=''" subsys=daemon
level=info msg="  --monitor-aggregation='medium'" subsys=daemon
level=info msg="  --monitor-aggregation-flags='all'" subsys=daemon
level=info msg="  --monitor-aggregation-interval='5s'" subsys=daemon
level=info msg="  --monitor-queue-size='0'" subsys=daemon
level=info msg="  --mtu='0'" subsys=daemon
level=info msg="  --nat46-range='0:0:0:0:0:FFFF::/96'" subsys=daemon
level=info msg="  --native-routing-cidr=''" subsys=daemon
level=info msg="  --node-port-acceleration='disabled'" subsys=daemon
level=info msg="  --node-port-algorithm='random'" subsys=daemon
level=info msg="  --node-port-bind-protection='true'" subsys=daemon
level=info msg="  --node-port-mode='snat'" subsys=daemon
level=info msg="  --node-port-range='30000,32767'" subsys=daemon
level=info msg="  --policy-audit-mode='false'" subsys=daemon
level=info msg="  --policy-queue-size='100'" subsys=daemon
level=info msg="  --policy-trigger-interval='1s'" subsys=daemon
level=info msg="  --pprof='false'" subsys=daemon
level=info msg="  --preallocate-bpf-maps='false'" subsys=daemon
level=info msg="  --prefilter-device='undefined'" subsys=daemon
level=info msg="  --prefilter-mode='native'" subsys=daemon
level=info msg="  --prepend-iptables-chains='true'" subsys=daemon
level=info msg="  --prometheus-serve-addr=''" subsys=daemon
level=info msg="  --proxy-connect-timeout='1'" subsys=daemon
level=info msg="  --proxy-prometheus-port='0'" subsys=daemon
level=info msg="  --read-cni-conf=''" subsys=daemon
level=info msg="  --restore='true'" subsys=daemon
level=info msg="  --sidecar-istio-proxy-image='cilium/istio_proxy'" subsys=daemon
level=info msg="  --single-cluster-route='false'" subsys=daemon
level=info msg="  --skip-crd-creation='false'" subsys=daemon
level=info msg="  --socket-path='/var/run/cilium/cilium.sock'" subsys=daemon
level=info msg="  --sockops-enable='false'" subsys=daemon
level=info msg="  --state-dir='/var/run/cilium'" subsys=daemon
level=info msg="  --tofqdns-dns-reject-response-code='refused'" subsys=daemon
level=info msg="  --tofqdns-enable-dns-compression='true'" subsys=daemon
level=info msg="  --tofqdns-endpoint-max-ip-per-hostname='50'" subsys=daemon
level=info msg="  --tofqdns-idle-connection-grace-period='0s'" subsys=daemon
level=info msg="  --tofqdns-max-deferred-connection-deletes='10000'" subsys=daemon
level=info msg="  --tofqdns-min-ttl='0'" subsys=daemon
level=info msg="  --tofqdns-pre-cache=''" subsys=daemon
level=info msg="  --tofqdns-proxy-port='0'" subsys=daemon
level=info msg="  --tofqdns-proxy-response-max-delay='100ms'" subsys=daemon
level=info msg="  --trace-payloadlen='128'" subsys=daemon
level=info msg="  --tunnel='vxlan'" subsys=daemon
level=info msg="  --version='false'" subsys=daemon
level=info msg="  --write-cni-conf-when-ready=''" subsys=daemon
level=info msg="     _ _ _" subsys=daemon
level=info msg=" ___|_| |_|_ _ _____" subsys=daemon
level=info msg="|  _| | | | | |     |" subsys=daemon
level=info msg="|___|_|_|_|___|_|_|_|" subsys=daemon
level=info msg="Cilium 1.9.5 0d18eedf2 2021-04-14T07:25:27+00:00 go version go1.15.11 linux/amd64" subsys=daemon
level=info msg="cilium-envoy  version: e7430b113e09ee4fe900949af1f8e296e485269e/1.17.1/Distribution/RELEASE/BoringSSL" subsys=daemon
level=info msg="clang (10.0.0) and kernel (5.4.87) versions: OK!" subsys=linux-datapath
level=info msg="linking environment: OK!" subsys=linux-datapath


houminz commented May 14, 2021

As I dove into the cilium code, I found some clues to the problem:

  1. tc filter replace dev lxcefc1adc1539b ingress prio 1 handle 1 bpf da obj bpf_lxc_BAD_CASE.o sec from-container fails because of the complexity issue
  2. with enable-endpoint-routes=false, the command tc filter replace dev lxcefc1adc1539b ingress prio 1 handle 1 bpf da obj bpf_lxc_GOOD_CASE.o sec from-container succeeds
  3. the only differences between bpf_lxc_BAD_CASE.o and bpf_lxc_GOOD_CASE.o are three settings in ep_config.h (see the diff sketch after the table):

  Setting                          BAD_CASE  GOOD_CASE
  USE_BPF_PROG_FOR_INGRESS_POLICY  1         0
  ENABLE_ENDPOINT_ROUTES           1         0
  ENABLE_ROUTING                   0         1
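
To reproduce that comparison, one can diff the generated headers of a failing and a working endpoint in the state directory (a sketch; the directory names are taken from the listing earlier and are otherwise hypothetical):

  # Compare the per-endpoint config headers of a failing and a working endpoint
  diff 3684_next_fail/ep_config.h 3207_next/ep_config.h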

@houminz houminz changed the title Complexity Issue with cilium v1.9.5 when enable-endpoint-routes option is true Complexity Issue with cilium v1.9.5 when enable-endpoint-routes=true May 14, 2021

houminz commented May 14, 2021

@pchaigno any idea for this issue?

@pchaigno pchaigno added kind/community-report This was reported by a user in the Cilium community, eg via Slack. kind/complexity-issue Relates to BPF complexity or program size issues needs/triage This issue requires triaging to establish severity and next steps. labels May 14, 2021
pchaigno (Member) commented

@SimpCosm Did you check if v1.10-rc1 has the same issue? A sysdump would also be useful to get the full datapath config.


houminz commented May 17, 2021

> @SimpCosm Did you check if v1.10-rc1 has the same issue? A sysdump would also be useful to get the full datapath config.

I have checked version v1.10.0-rc1; it doesn't have the same issue.

pchaigno (Member) commented

Ok. Then it might be worth checking with the just-released 1.9.7 as we had a couple changes impacting complexity since 1.9.5.


houminz commented May 20, 2021

> Ok. Then it might be worth checking with the just-released 1.9.7 as we had a couple changes impacting complexity since 1.9.5.

@pchaigno version 1.9.7 does not work; it has the same issue as version 1.9.5.

pchaigno (Member) commented

Could you share a sysdump (or at least a bugtool) for one of the failing nodes? There are definitely complexity issues affecting Linux 5.4 on Cilium v1.9, but I'm unable to reproduce the endpoint-route aspect so far :-(
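
For anyone hitting the same thing, a minimal sketch of collecting a bugtool archive from one failing node (the pod name is the one from the transcript above; cilium-bugtool prints the archive path when it finishes):

  kubectl -n kube-system exec cilium-2zxn4 -- cilium-bugtool
  kubectl -n kube-system cp cilium-2zxn4:/tmp/cilium-bugtool-<timestamp>.tar .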

@aanm aanm added the need-more-info More information is required to further debug or fix the issue. label May 26, 2021
errordeveloper (Contributor) commented

@SimpCosm do you think you can get a sysdump and share it with us please? If there are any concerns, you can share privately on Slack.


stale bot commented Aug 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Aug 28, 2021
@aanm aanm added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jan 6, 2022
@stale stale bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 6, 2022
soulseen (Contributor) commented

Same issue in cilium v1.10.11.

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. and removed stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. labels Jul 25, 2022
joestringer (Member) commented

@soulseen / @carjessu-zerohash if you see an issue like this, can you file a new issue with the exact output from the Cilium logs + configuration parameters + sysdump? Typically this type of issue can come up in an environment with a combination of specific Cilium configuration plus a particular kernel, and we need to be able to take a look at each instance independently to check for the underlying cause. It is not enough to bump this old issue.

Given that this issue was filed against 1.9.5 and we don't support 1.9.x releases in the Cilium community any more, I will close this issue out. However, you are welcome to file new issues for similar issues that you encounter while using newer versions of Cilium. Thank you!

@joestringer joestringer closed this as not planned Sep 23, 2022