
Internal kubernetes cluster communication issues using AWS CNI 1.3.0 and m4/m5/c4/c5/r5/i3 instances #318

Closed
recollir opened this issue Feb 8, 2019 · 15 comments

@recollir

recollir commented Feb 8, 2019

Kubernetes cluster

  • Kubernetes 1.10.12 deployed with kops 1.10.1.
  • AWS CNI plugin 1.3.0
  • Kubernetes masters: m5.large
  • Kubernetes nodes: several instance groups with 3 instances of r4/r5/c4/c5/m4/m5
  • Kubernetes node OS images: kops default Debian jessie/stretch AMIs and the latest Amazon Linux 2 (20190115) AMI

What are the issues

r5/m4/m5/c4/c5/i3 instances:

When pods get an IP associated with the secondary, tertiary or quaternary ENI of an instance, cluster-internal communication does not work: e.g. pods cannot reach kube-dns, the kubernetes service in the default namespace, or any other cluster-IP service. Pods on the primary ENI have no problems talking to those internal services. All pods on all ENIs can talk to the "internet". This happens both with the kops default jessie/stretch AMIs and with the latest Amazon Linux 2 AMI.

r4 instances:

All pods on all ENIs can talk to the cluster internal services.
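One way to narrow down symptoms like this on an affected node is to check whether a `from <pod-ip>` policy-routing rule exists for a pod that sits on a secondary ENI, since the aws-vpc-cni installs such rules for those pods. This is a sketch, not from the original report: the `has_rule` helper and the sample `ip rule` output below are fabricated for illustration; on a real node you would pipe `ip rule show` in instead.

```shell
#!/usr/bin/env bash
# has_rule POD_IP: succeeds if the `ip rule` output on stdin contains a
# "from POD_IP" entry, i.e. the CNI set up policy routing for that pod.
# A missing rule for a pod on a secondary ENI would explain broken routing.
has_rule() { grep -qF "from $1"; }

# Fabricated sample output; on a real node run:  ip rule show | has_rule <pod-ip>
sample='512:    from all to 10.105.20.63 lookup main
1536:   from 10.105.20.63 lookup 3'

echo "$sample" | has_rule 10.105.20.63 && echo "policy rule present for 10.105.20.63"
```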

How we reproduce the issue:

  • we launch several instance groups with kops; example config:
```yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: cluster.team.example.com
  name: c5-eu-west-1
spec:
  kubelet:
    maxPods: 29
  additionalSecurityGroups:
    - sg-0332feaa999999999
  image: amazon.com/amzn2-ami-hvm-2.0.20190115-x86_64-gp2
  machineType: c5.large
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: c5-eu-west-1
    team.example.com/ec2-class: c5
    team.example.com/instance-class: c5class
  role: Node
  taints:
    - dedicated=compute:NoSchedule
  subnets:
    - eu-west-1a
    - eu-west-1b
    - eu-west-1c
```
  • we launch enough pods, through a deployment that tolerates the taint and has a nodeSelector for the instance class, into that instance group. The pods run an Alpine image with curl.
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: c5class
  namespace: monitoring
  labels:
    k8s-app: c5class
spec:
  replicas: 100
  selector:
    matchLabels:
      k8s-app: c5class
  template:
    metadata:
      labels:
        k8s-app: c5class
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: c5class
        image: "jcsorvasi/alpine-bash-curl-jq:2"
        command: ["/bin/bash"]
        args: [
          "-c",
          "while true; do curl -s -k https://kubernetes.default | jq -c .; sleep 5; done"
        ]
        env:
        - name: INCREMENT_ME_TO_DEPLOY
          value: "1"
      tolerations:
      - operator: Exists
      nodeSelector:
        team.example.com/instance-class: c5class
```
  • we loop through all launched pods and try to curl the https://kubernetes.default service, which either succeeds or fails.
  • we retrieve information about the pod - IP address, which ENI it is associated with and if the ENI is the primary for the instance or not.
```shell
#!/usr/bin/env bash

csv_header="instance_class,instance_type,instance_id,k8s_nodename,pod_name,pod_ip,eni_id,eni_primary,eni_device_number,curl_exit_code,is_working"
echo "$csv_header"
#classes="c4class c5class i3class m4class m5class r4class r5class"
classes="c5class"

[[ -n $1 ]] && classes=$1

for class in $classes; do
  nodes=$(kubectl get nodes --no-headers -l team.example.com/instance-class="$class" -o custom-columns=NAME:metadata.name | tr "\n" " ")
  for node in $nodes; do
    node_info_tuple=$(kubectl get node "$node" -o json | jq -r ". | \"\(.metadata.labels[\"beta.kubernetes.io/instance-type\"]),\(.spec.externalID)\"")
    pods=$(kubectl get pod --field-selector=spec.nodeName="$node" -o custom-columns=NAME:metadata.name | grep "$class" | tr "\n" " ")
    eni_info=$(dsh "$node" "curl -s localhost:61678/v1/enis")
    for pod in $pods; do

      pod_ip=$(kubectl get pod "$pod" -o jsonpath='{.status.podIP}')
      eni_tuple=$(echo "$eni_info" | jq -r ".ENIIPPools | to_entries[].value | select(.IPv4Addresses | keys[] == \"$pod_ip\") | \"\(.ID),\(.IsPrimary),\(.DeviceNumber)\"")
      curl_exit_code=$(kubectl exec -i "$pod" -- bash -c "curl -o /dev/null -s -k https://kubernetes.default" 2>/dev/null; echo $?)

      is_working="false"
      [[ $curl_exit_code -eq 0 ]] && is_working="true"

      echo "$class,$node_info_tuple,$node,$pod,$pod_ip,$eni_tuple,$curl_exit_code,$is_working"

    done
  done
done
```

Result

The result of the above experiment is a CSV file:

```
instance_class,instance_type,instance_id,k8s_nodename,pod_name,pod_ip,eni_id,eni_primary,eni_device_number,curl_exit_code,is_working
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-2w2fp,10.105.20.63,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-4mffp,10.105.20.127,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-4mmjn,10.105.21.171,eni-05e5adb0626972cdd,true,0,0,true
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-56z4w,10.105.23.207,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-64ptl,10.105.23.66,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-7ds2h,10.105.20.76,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-8hx79,10.105.23.139,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-9449w,10.105.16.138,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-9jgqv,10.105.22.31,eni-05e5adb0626972cdd,true,0,0,true
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-gbl8q,10.105.20.179,eni-05e5adb0626972cdd,true,0,0,true
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-hqkrp,10.105.17.156,eni-05e5adb0626972cdd,true,0,0,true
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-ldm4q,10.105.18.6,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-n8df8,10.105.19.14,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-njnfc,10.105.20.18,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-nr4dz,10.105.22.133,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-p6w4p,10.105.17.44,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-qsfd6,10.105.22.181,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-rphrw,10.105.22.185,eni-05e5adb0626972cdd,true,0,0,true
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-rz2bk,10.105.19.28,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-sbpnx,10.105.20.233,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-sdph8,10.105.16.178,eni-0e06652ea4e14866f,false,3,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-slsdr,10.105.21.126,eni-05e5adb0626972cdd,true,0,0,true
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-xx8l9,10.105.22.3,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0ce2badef326ca01b,ip-10-105-18-196.eu-west-1.compute.internal,c5class-5658859ddc-zfrrf,10.105.20.241,eni-00d8277f0328e853b,false,2,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-26kwg,10.105.55.84,eni-0c6564c7e0caaf635,false,2,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-2chjc,10.105.48.164,eni-0c6564c7e0caaf635,false,2,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-5ntzt,10.105.52.94,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-5qbmc,10.105.53.200,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-6wnrc,10.105.53.96,eni-0c6564c7e0caaf635,false,2,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-8h2hl,10.105.54.187,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-9gg5v,10.105.51.101,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-bjp54,10.105.55.201,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-d9z2p,10.105.49.138,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-dv7t2,10.105.52.192,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-fcbft,10.105.51.208,eni-0c6564c7e0caaf635,false,2,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-hg68b,10.105.52.78,eni-0c6564c7e0caaf635,false,2,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-jjlxh,10.105.54.185,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-k645j,10.105.53.148,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-kks67,10.105.50.231,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-lzl8t,10.105.53.243,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-pglsq,10.105.55.124,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-qwpbn,10.105.55.24,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-r6n24,10.105.51.37,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-sb6q9,10.105.49.120,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-stxnq,10.105.55.2,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-vxsk5,10.105.53.114,eni-0915ab8359b5037ac,true,0,0,true
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-wpvt8,10.105.48.55,eni-07baa21c687bc69de,false,3,6,false
c5class,c5.large,i-0e42a1c4f002a7fb0,ip-10-105-54-169.eu-west-1.compute.internal,c5class-5658859ddc-z96cp,10.105.55.187,eni-0c6564c7e0caaf635,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-677lx,10.105.86.11,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-7hhbh,10.105.85.185,eni-010c2993f7b0b78cf,true,0,0,true
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-7wmjl,10.105.86.215,eni-010c2993f7b0b78cf,true,0,0,true
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-cwcpq,10.105.84.138,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-czjn4,10.105.87.167,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-fxbfc,10.105.87.45,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-h7nhl,10.105.84.84,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-hgf8z,10.105.84.80,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-j29c8,10.105.80.82,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-k5v7s,10.105.85.91,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-knjgm,10.105.87.63,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-krf52,10.105.85.6,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-l2dq4,10.105.86.166,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-mphz6,10.105.84.61,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-nggb7,10.105.87.174,eni-010c2993f7b0b78cf,true,0,0,true
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-nns8v,10.105.86.95,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-plcq2,10.105.82.88,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-qfn5q,10.105.82.197,eni-010c2993f7b0b78cf,true,0,0,true
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-s2w6t,10.105.82.55,eni-010c2993f7b0b78cf,true,0,0,true
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-wd6wv,10.105.83.73,eni-070920a243e88ec75,false,3,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-wrdd2,10.105.81.39,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-xzj88,10.105.82.66,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-zgdrv,10.105.80.229,eni-085ff55486913475e,false,2,6,false
c5class,c5.large,i-0c323533c0368bdf9,ip-10-105-80-231.eu-west-1.compute.internal,c5class-5658859ddc-zpmj2,10.105.86.94,eni-010c2993f7b0b78cf,true,0,0,true
```

The last column (is_working) indicates whether the pod can communicate with a cluster-internal service, and column 8 (eni_primary) indicates whether the pod's ENI is the primary one for the instance.

In this example, all non-working pods are associated with a non-primary ENI, while all working pods are on the primary ENI.
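To see that correlation at a glance, the CSV can be cross-tabulated per ENI type with awk. This is a sketch: the embedded two-row heredoc stands in for the real file, and /tmp/results.csv is an assumed path (in practice, save the scan script's output there instead).

```shell
#!/usr/bin/env bash
# Cross-tabulate eni_primary (column 8) against is_working (column 11).
# The heredoc below is a fabricated two-row sample for illustration only.
cat > /tmp/results.csv <<'EOF'
instance_class,instance_type,instance_id,k8s_nodename,pod_name,pod_ip,eni_id,eni_primary,eni_device_number,curl_exit_code,is_working
c5class,c5.large,i-abc,node-1,pod-a,10.0.0.1,eni-1,true,0,0,true
c5class,c5.large,i-abc,node-1,pod-b,10.0.0.2,eni-2,false,2,6,false
EOF

# Count rows per (eni_primary, is_working) pair, skipping the header.
awk -F, 'NR > 1 { count[$8 "," $11]++ }
         END { for (k in count) print k "=" count[k] }' /tmp/results.csv | sort
```

On the full results above, the primary-ENI bucket would show only `true,true` pairs and the secondary-ENI buckets only `false,false` pairs.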

We can provide results for the other instance classes as well, that show that pods on the non-primary ENIs fail.

This happens both with the kops default Debian-based AMIs and with the Amazon Linux 2 AMI.
@nickdgriffin

This sounds a lot like #263. Is Calico being used for network policies?

One thing to try is disabling the source/destination check, to test whether the issue is related to packets exiting a different adapter than the one they came in on.
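Disabling the check can be done per ENI with the AWS CLI. A sketch, assuming valid AWS credentials; the ENI ID is a placeholder taken from the results table above:

```shell
# Disable the source/destination check on one of the secondary ENIs
# (eni-00d8277f0328e853b is taken from the results above; substitute your own).
aws ec2 modify-network-interface-attribute \
  --network-interface-id eni-00d8277f0328e853b \
  --no-source-dest-check

# Verify the new setting.
aws ec2 describe-network-interface-attribute \
  --network-interface-id eni-00d8277f0328e853b \
  --attribute sourceDestCheck
```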

@recollir
Author

AWS support wanted me to add the following information to the ticket:

  1. Did you face the issue directly as soon as you updated to 1.3.0, or was it uncovered later?

The issue was uncovered on a cluster that is running aws cni 1.3.0, when we wanted to add a new instance type (c5) to the cluster.

  2. Was this an intermittent or a continuous issue?

A continuous issue, reproducible every time on new instances.

  3. Was this also an issue with the 1.2.1 plugin?
  4. Were you able to downgrade the plugin and see if you still faced the issue?

I have now tested version 1.2.1 (I downgraded the cluster to it), and the issue exists there as well.

  5. Can you also confirm whether you are facing this after upgrading to the latest CNI, 1.3.2?

The issue also exists in 1.3.2, though to test it I had to build my own images; I don't see any publicly available images for 1.3.2.

Out of curiosity, I have also tried the current master branch of the aws cni plugin, and there it seems to work: m4/m5/r5/c4/c5/i3 instances don't have the communication problem and behave like the r4 instances.

@recollir
Author

@nickdgriffin No, we don't use Calico.

As for "disable the source/destination check", do you mean https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#change_source_dest_check, or do you refer to the srcdst app (which we have removed)? I can try disabling the check on the ENIs of the c5 instances. Why would this matter here but not on the working r4 instances, where the check is enabled?

As for further tests, I have compiled and tested with the current master branch. A quick test showed that the previously broken instances are working with master; I need to verify this.

If this is the case, it would be nice to understand which commit fixed this and why it is needed for the cX, mX and r5 instances, given that the r4 instances work with previous versions of the plugin. Thanks.

@nickdgriffin
Copy link

I do mean that, yes. The issue I am referring to comes about when a pod is allocated an IP on a secondary adapter, so you can check if that is the case across your various tests - it might be that your r4 test only had pods being allocated IPs from the primary interface, which is why it worked.

If you still have problems with pods that are allocated IPs from the primary adapter then it cannot be the issue I am referring to.

@recollir
Author

For the tests we saturated each instance with as many pods as the total number of IPs the instance supports, to make sure that there were pods on all ENIs.

@recollir
Author

I have now rerun the tests we did to verify cluster-internal communication of the pods assigned to the non-primary ENIs on r5/c4/c5/m4/m5/i3, this time with the aws cni built from master (commit 4a51ed4). No errors now: all the pods on the non-primary ENIs can talk to kubernetes.default (as well as resolve it). This clearly did not work with versions 1.2.1, 1.3.0 and 1.3.2.

I wonder if commits 6be0029 and/or 96a86f5 fix the issue we have seen? I also wonder if we are the only ones seeing this. When can we expect a new release of the aws cni?

@recollir
Author

Any news on this?

@mogren
Contributor

mogren commented Mar 7, 2019

Hi, sorry for the late reply.

We will investigate this issue, and if the current master works, that's great. I've created a 1.4 Milestone and we will start working on that soon.

@recollir
Author

recollir commented Apr 8, 2019

@mogren We saw that there is a 1.4.0-rc1 candidate out. We have tested it, and it seems to solve this issue. We are wondering why this issue is not in the 1.4 milestone and why it was not addressed with any update here.

@mogren
Contributor

mogren commented Apr 8, 2019

Hey @recollir, thanks a lot for testing this. I didn't want to include this issue since I was not sure the changes solved your problems. I'll try to get this release out as soon as possible.

@Jeffwan
Contributor

Jeffwan commented Apr 19, 2019

v1.4.0 has been released! Has this issue been resolved with v1.4.0? @recollir

@recollir
Author

> v1.4.0 has been released! Has this issue been resolved with v1.4.0? @recollir

Currently testing it.

@recollir
Author

> v1.4.0 has been released! Has this issue been resolved with v1.4.0? @recollir

I have now tested the 1.4.0 version with c5, r4, r5, m5 and i3 instances - for all those: large, 2xlarge and 4xlarge.

It seems that the issue is now resolved. We still don't know why, though; it would be nice to get an explanation. @Jeffwan

@mogren
Contributor

mogren commented Apr 24, 2019

Hi @recollir,

The PR you commented on a while ago, #346, is the most probable cause for this that I can think of. #305 might affect the Debian images, but should not cause issues for the Amazon Linux ones.

Aside from that, since you're not using Calico, not that many changes were made to the CNI in regard to setting up routes on secondary ENIs before 4a51ed4.

To figure out what happened to those ENIs I'd have to take another look at the logs for the v1.3.0 that failed.

@mogren
Contributor

mogren commented Apr 24, 2019

Looking at the iptables files you provided, there are a lot of rules set up by some kubernetes firewall script, and a lot of stale rules in the cluster from a few days earlier. Are you sure nothing has changed with that script? Also, did you use any other tool, like ec2config-cli, that can modify the routes or iptables?
