
EKS - Random connection reset by peer #21853

Closed
m00lecule opened this issue Oct 22, 2022 · 5 comments

Labels
kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments


m00lecule commented Oct 22, 2022

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Hello!

After running Cilium on EKS for 3 months, we have noticed intermittent issues that correlate with network failures in Kubernetes. We are running Cilium alongside kube-proxy. The issue surfaces as "Connection reset by peer" errors.

I was looking for suggestions in https://docs.cilium.io/en/stable/operations/performance/tuning/, but only kube-proxy-free scenarios are described there.

It might be correlated with OS-level conntrack, which reports failures when inserting entries (the insert_failed counter). The NAT table size seems to be big enough. We tried to debug it on our own, but it seems there are two layers of conntrack: one at the OS level and one in the Cilium CNI. I tried increasing the NAT table size, but there were no signs of nf_conntrack: table full, dropping packets in dmesg.

I would kindly ask for guidance on improving Cilium reliability.

conntrack stats
ip-XXXXX ~ # conntrack -S
cpu=0   	found=916 invalid=13835 ignore=793211 insert=0 insert_failed=25354 drop=0 early_drop=0 error=30 search_restart=2005895 
cpu=1   	found=1841 invalid=16135 ignore=639896 insert=0 insert_failed=25334 drop=0 early_drop=0 error=37 search_restart=1857410 
cpu=2   	found=1680 invalid=12426 ignore=758815 insert=0 insert_failed=26310 drop=1 early_drop=0 error=24 search_restart=1765398 
cpu=3   	found=2642 invalid=12952 ignore=752122 insert=0 insert_failed=25869 drop=0 early_drop=0 error=30 search_restart=1912938 
ip-XXXXX ~ # cat /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal 
0
ip-XXXXX ~ # cat /proc/sys/net/netfilter/nf_conntrack_max 
262144
ip-XXXXX ~ # cat /proc/sys/net/netfilter/nf_conntrack_count 
2303
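
For completeness, the second conntrack layer mentioned above lives in Cilium's BPF maps and can be inspected from inside the agent pod. A minimal sketch, assuming the cilium CLI in the kube-system/cilium DaemonSet (output format varies by version):

# Count entries in Cilium's BPF connection-tracking and NAT tables,
# to compare against the netfilter counters above.
kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global | wc -l
kubectl -n kube-system exec ds/cilium -- cilium bpf nat list | wc -l

# Datapath drop and forward counters exposed by the agent.
kubectl -n kube-system exec ds/cilium -- cilium metrics list | grep -i drop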

Cilium Version

v1.11.9

Kernel Version

uname -a
Linux ip-XXXX.eu-central-1.compute.internal 5.4.204-113.362.amzn2.x86_64 #1 SMP Wed Jul 13 21:34:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

No response

Relevant log output

$ cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.23+ (v1.23.10-eks-15b7512) [linux/amd64]
Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy "]
KubeProxyReplacement:   Disabled   
Host firewall:          Disabled
Cilium:                 Ok   1.11.9 (v1.11.9-4409e95)
NodeMonitor:            Listening for events on 4 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok   
IPAM:                   IPv4: 31/254 allocated from XXXX, 
BandwidthManager:       Disabled
Host Routing:           Legacy
Masquerading:           IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:      151/151 healthy
Proxy Status:           OK, ip XXXXX, 0 redirects active on ports 10000-20000
Hubble:                 Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 181.51   Metrics: Disabled
Encryption:             Disabled
Cluster health:         17/17 reachable   (2022-10-22T10:31:10Z)
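
Since Hubble is enabled in the status output above, flow-level evidence of the resets can also be gathered from the agent. This is a sketch, not from the original report; <namespace>/<pod-name> is a placeholder and flag names may differ between Hubble versions:

# Show recently dropped flows as seen by the datapath.
kubectl -n kube-system exec ds/cilium -- hubble observe --last 200 --verdict DROPPED

# Narrow to TCP traffic involving a suspect pod (placeholder name).
kubectl -n kube-system exec ds/cilium -- hubble observe --last 200 --protocol TCP --pod <namespace>/<pod-name>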

cilium helm chart values
  set {
    name  = "image.repository"
    value = format("%s/quay.io/cilium/cilium", var.ecr_registry)
  }
  set {
    name  = "image.useDigest"
    value = "false"
  }
  set {
    name  = "image.tag"
    value = "v1.11.9"
  }
  set {
    name  = "operator.image.repository"
    value = format("%s/quay.io/cilium/operator", var.ecr_registry)
  }
  set {
    name  = "hubble.relay.image.repository"
    value = format("%s/quay.io/cilium/hubble-relay", var.ecr_registry)
  }
  set {
    name  = "operator.image.tag"
    value = "v1.11.9"
  }
  set {
    name  = "operator.image.useDigest"
    value = "false"
  }
  set {
    name  = "egressMasqueradeInterfaces"
    value = "eth0"
  }
  set {
    name  = "prometheus.enabled"
    value = "true"
  }
  set {
    name  = "operator.prometheus.enabled"
    value = "true"
  }
  set {
    name  = "upgradeCompatibility"
    value = "1.11"
  }
  set {
    name  = "extraArgs[0]"
    value = "--api-rate-limit=endpoint-create=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[1]"
    value = "--api-rate-limit=endpoint-delete=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[2]"
    value = "--api-rate-limit=endpoint-get=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[3]"
    value = "--api-rate-limit=endpoint-patch=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[4]"
    value = "--api-rate-limit=endpoint-list=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "updateStrategy.rollingUpdate.maxUnavailable"
    value = "25%"
    type  = "string"
  }
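
Not part of the original values, but one way to confirm that the rendered configuration actually reached the agent is to read back the ConfigMap and the Helm release values. A sketch, assuming the release is named cilium and installed in kube-system:

# Configuration the agent actually runs with.
kubectl -n kube-system get configmap cilium-config -o yaml

# Values Helm rendered for the release (release name assumed to be "cilium").
helm -n kube-system get values cilium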

Anything else?

(attached screenshot: connection-reset)

Code of Conduct

  • I agree to follow this project's Code of Conduct
m00lecule added the kind/bug and needs/triage labels on Oct 22, 2022

yusufgungor commented Oct 29, 2022

@sergeimonakhov
Contributor

Hi,
We see the same behavior, except we run on bare metal without kube-proxy...

aanm added the sig/datapath label on Nov 8, 2022
@m00lecule
Author

After upgrading Cilium from 1.11.9 to 1.12.3, the errors have disappeared.
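
As a follow-up check (not from the thread), whether the fix held can be judged by sampling the insert_failed counters from the original report and confirming they stop growing; a rough sketch:

# Sample the per-CPU insert_failed counters twice, a minute apart;
# a stable value suggests the resets are no longer being triggered.
conntrack -S | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) print $1, $i}'
sleep 60
conntrack -S | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) print $1, $i}'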

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Jan 17, 2023
@github-actions

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned on Jan 31, 2023