
EKS - Random connection reset by peer #21853

Closed
m00lecule opened this issue Oct 22, 2022 · 5 comments

Labels
kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments


m00lecule commented Oct 22, 2022

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Hello!

After running Cilium on EKS for 3 months, we have noticed intermittent issues that correlate with network failures in Kubernetes. We are running Cilium alongside kube-proxy. The issue surfaces as "Connection reset by peer" errors.

I was looking for suggestions in https://docs.cilium.io/en/stable/operations/performance/tuning/, but only kube-proxy-free scenarios are described there.

It might be correlated with OS-level conntrack, which reports failures when inserting entries (the insert_failed counter). The NAT table size seems to be big enough. We tried to debug it on our own, but it seems there are two layers of conntrack: one at the OS level and one in the Cilium CNI. I tried increasing the NAT table size, but there were no signs of nf_conntrack: table full, dropping packets in dmesg.

I would kindly ask for guidance on improving Cilium reliability.

conntrack stats
ip-XXXXX ~ # conntrack -S
cpu=0   	found=916 invalid=13835 ignore=793211 insert=0 insert_failed=25354 drop=0 early_drop=0 error=30 search_restart=2005895 
cpu=1   	found=1841 invalid=16135 ignore=639896 insert=0 insert_failed=25334 drop=0 early_drop=0 error=37 search_restart=1857410 
cpu=2   	found=1680 invalid=12426 ignore=758815 insert=0 insert_failed=26310 drop=1 early_drop=0 error=24 search_restart=1765398 
cpu=3   	found=2642 invalid=12952 ignore=752122 insert=0 insert_failed=25869 drop=0 early_drop=0 error=30 search_restart=1912938 
ip-XXXXX ~ # cat /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal 
0
ip-XXXXX ~ # cat /proc/sys/net/netfilter/nf_conntrack_max 
262144
ip-XXXXX ~ # cat /proc/sys/net/netfilter/nf_conntrack_count 
2303
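
For completeness, the second conntrack layer mentioned above lives in Cilium's BPF maps and can be inspected from inside the agent pod. A minimal sketch, assuming the cilium CLI in the kube-system/cilium DaemonSet (output format varies by version):

# Count entries in Cilium's BPF connection-tracking and NAT tables,
# to compare against the netfilter counters above.
kubectl -n kube-system exec ds/cilium -- cilium bpf ct list global | wc -l
kubectl -n kube-system exec ds/cilium -- cilium bpf nat list | wc -l

# Datapath drop and forward counters exposed by the agent.
kubectl -n kube-system exec ds/cilium -- cilium metrics list | grep -i drop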

Cilium Version

v1.11.9

Kernel Version

uname -a
Linux ip-XXXX.eu-central-1.compute.internal 5.4.204-113.362.amzn2.x86_64 #1 SMP Wed Jul 13 21:34:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

No response

Relevant log output

$ cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.23+ (v1.23.10-eks-15b7512) [linux/amd64]
Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy "]
KubeProxyReplacement:   Disabled   
Host firewall:          Disabled
Cilium:                 Ok   1.11.9 (v1.11.9-4409e95)
NodeMonitor:            Listening for events on 4 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok   
IPAM:                   IPv4: 31/254 allocated from XXXX, 
BandwidthManager:       Disabled
Host Routing:           Legacy
Masquerading:           IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:      151/151 healthy
Proxy Status:           OK, ip XXXXX, 0 redirects active on ports 10000-20000
Hubble:                 Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 181.51   Metrics: Disabled
Encryption:             Disabled
Cluster health:         17/17 reachable   (2022-10-22T10:31:10Z)
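
Since Hubble is enabled in the status output above, flow-level evidence of the resets can also be gathered from the agent. This is a sketch, not from the original report; <namespace>/<pod-name> is a placeholder and flag names may differ between Hubble versions:

# Show recently dropped flows as seen by the datapath.
kubectl -n kube-system exec ds/cilium -- hubble observe --last 200 --verdict DROPPED

# Narrow to TCP traffic involving a suspect pod (placeholder name).
kubectl -n kube-system exec ds/cilium -- hubble observe --last 200 --protocol TCP --pod <namespace>/<pod-name>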

cilium helm chart values
  set {
    name  = "image.repository"
    value = format("%s/quay.io/cilium/cilium", var.ecr_registry)
  }
  set {
    name  = "image.useDigest"
    value = "false"
  }
  set {
    name  = "image.tag"
    value = "v1.11.9"
  }
  set {
    name  = "operator.image.repository"
    value = format("%s/quay.io/cilium/operator", var.ecr_registry)
  }
  set {
    name  = "hubble.relay.image.repository"
    value = format("%s/quay.io/cilium/hubble-relay", var.ecr_registry)
  }
  set {
    name  = "operator.image.tag"
    value = "v1.11.9"
  }
  set {
    name  = "operator.image.useDigest"
    value = "false"
  }
  set {
    name  = "egressMasqueradeInterfaces"
    value = "eth0"
  }
  set {
    name  = "prometheus.enabled"
    value = "true"
  }
  set {
    name  = "operator.prometheus.enabled"
    value = "true"
  }
  set {
    name  = "upgradeCompatibility"
    value = "1.11"
  }
  set {
    name  = "extraArgs[0]"
    value = "--api-rate-limit=endpoint-create=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[1]"
    value = "--api-rate-limit=endpoint-delete=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[2]"
    value = "--api-rate-limit=endpoint-get=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[3]"
    value = "--api-rate-limit=endpoint-patch=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[4]"
    value = "--api-rate-limit=endpoint-list=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "updateStrategy.rollingUpdate.maxUnavailable"
    value = "25%"
    type  = "string"
  }
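
Not part of the original values, but one way to confirm that the rendered configuration actually reached the agent is to read back the ConfigMap and the Helm release values. A sketch, assuming the release is named cilium and installed in kube-system:

# Configuration the agent actually runs with.
kubectl -n kube-system get configmap cilium-config -o yaml

# Values Helm rendered for the release (release name assumed to be "cilium").
helm -n kube-system get values cilium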

Anything else?

(attached screenshot: connection-reset)

Code of Conduct

  • I agree to follow this project's Code of Conduct
m00lecule added the kind/bug and needs/triage labels on Oct 22, 2022

yusufgungor commented Oct 29, 2022

@sergeimonakhov
Contributor

Hi,
We see the same behavior, except we run on bare metal without kube-proxy...

aanm added the sig/datapath label on Nov 8, 2022
@m00lecule
Author

After upgrading Cilium from 1.11.9 to 1.12.3, the errors have disappeared.
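
As a follow-up check (not from the thread), whether the fix held can be judged by sampling the insert_failed counters from the original report and confirming they stop growing; a rough sketch:

# Sample the per-CPU insert_failed counters twice, a minute apart;
# a stable value suggests the resets are no longer being triggered.
conntrack -S | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) print $1, $i}'
sleep 60
conntrack -S | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) print $1, $i}'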

@github-actions

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Jan 17, 2023
@github-actions

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned on Jan 31, 2023