
Elevated rate of Client.PrivateIpAddressLimitExceeded errors after 1.5.x to 1.6.1 upgrade #989

Closed
uthark opened this issue May 21, 2020 · 14 comments



uthark commented May 21, 2020

After we upgraded to version 1.6.1, we see an increased rate of Client.PrivateIpAddressLimitExceeded errors for the AssignPrivateIpAddresses call.

Error message:

"errorCode": "Client.PrivateIpAddressLimitExceeded",
    "errorMessage": "Number of private addresses will exceed limit.",
    "requestParameters": {
        "networkInterfaceId": "eni-XXXX",
        "privateIpAddressesSet": {},
        "secondaryPrivateIpAddressCount": 29
    },

We use m5.12xlarge instances, which allow 30 IP addresses per interface.

[Screenshot: graph of the elevated Client.PrivateIpAddressLimitExceeded error rate]
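
For context, the arithmetic behind the failing call, as a minimal Go illustration (the 30-address figure comes from the EC2 per-ENI limits for this instance type):

package main

import "fmt"

func main() {
	// m5.12xlarge allows 30 IPv4 addresses per ENI: 1 primary + 29 secondary.
	const addrsPerENI = 30
	maxSecondary := addrsPerENI - 1 // 29, matching secondaryPrivateIpAddressCount above

	// If the ENI already holds any secondary IPs, requesting another 29
	// pushes the total past the limit and EC2 rejects the whole call.
	alreadyAssigned := 1
	requested := 29
	fmt.Println(alreadyAssigned+requested > maxSecondary) // true -> PrivateIpAddressLimitExceeded
}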


mogren commented May 21, 2020

@uthark Interesting, thanks for reporting! Are you using WARM_IP_TARGET, or is this when attaching a new ENI with 29 new secondary IPs? A log dump would be very helpful for figuring out what the issue is.


uthark commented May 21, 2020

We don't use WARM_IP_TARGET; we use WARM_ENI_TARGET with a value of 2.
I'm looking at the logs and checking whether those are new instances; I'll post an update later.
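
For reference, WARM_ENI_TARGET=2 asks ipamd to keep two spare ENIs' worth of IPs attached to the node. A minimal sketch of that policy (the helper name and signature are illustrative, not the actual ipamd datastore API):

package main

import "fmt"

// needMoreENIs reports whether fewer than warmENITarget ENIs' worth of IPs
// are sitting unused, in which case ipamd would try to attach another ENI.
func needMoreENIs(totalIPs, assignedIPs, ipsPerENI, warmENITarget int) bool {
	spare := totalIPs - assignedIPs
	return spare < warmENITarget*ipsPerENI
}

func main() {
	// m5.12xlarge: 29 usable secondary IPs per ENI. With WARM_ENI_TARGET=2,
	// ipamd tries to keep at least 58 unassigned IPs on the node.
	fmt.Println(needMoreENIs(116, 70, 29, 2)) // 46 spare < 58 -> attach another ENI
}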


uthark commented May 21, 2020

This is on new instances; please see the screenshot with the CloudTrail events.
[Screenshot: CloudTrail events showing the failing AssignPrivateIpAddresses calls]

ipamd.log:

cat ipamd.log
{"level":"error","ts":"2020-05-21T18:36:13.918Z","caller":"ipamd/ipamd.go:354","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 4: iptables: Resource temporarily unavailable.\n"}
{"level":"error","ts":"2020-05-21T18:36:13.918Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to set up host network: host network setup: failed to add chain: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 4: iptables: Resource temporarily unavailable.\n"}
{"level":"warn","ts":"2020-05-21T18:36:52.541Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"warn","ts":"2020-05-21T18:36:55.541Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"warn","ts":"2020-05-21T18:36:58.542Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"warn","ts":"2020-05-21T18:37:01.542Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"warn","ts":"2020-05-21T18:37:04.542Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"error","ts":"2020-05-21T18:37:13.345Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"error","ts":"2020-05-21T18:37:13.503Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"warn","ts":"2020-05-21T18:37:13.678Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:13.679Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:14.633Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:14.633Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:15.633Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:15.633Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"error","ts":"2020-05-21T18:37:16.312Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"warn","ts":"2020-05-21T18:37:16.444Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:16.444Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:16.628Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:16.628Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"error","ts":"2020-05-21T18:37:17.404Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"warn","ts":"2020-05-21T18:37:17.540Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:17.540Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:17.651Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:17.651Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"error","ts":"2020-05-21T18:37:18.420Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"warn","ts":"2020-05-21T18:37:18.612Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:18.612Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:18.663Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:18.663Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"error","ts":"2020-05-21T18:37:19.523Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"warn","ts":"2020-05-21T18:37:19.656Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:19.657Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:20.688Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:20.688Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"error","ts":"2020-05-21T18:37:21.471Z","caller":"datastore/data_store.go:288","msg":"DataStore has no available IP addresses"}
{"level":"warn","ts":"2020-05-21T18:37:21.602Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:21.602Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:22.679Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:22.679Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:23.710Z","caller":"ipamd/rpc_handler.go:88","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"warn","ts":"2020-05-21T18:37:23.710Z","caller":"ipamd/rpc_handler.go:95","msg":"UnassignPodIPv4Address: Failed to find pod REDACTED
{"level":"error","ts":"2020-05-21T18:37:35.433Z","caller":"ipamd/ipamd.go:762","msg":"Failed to allocate a private IP addresses on ENI eni-02da60df4d99a40db: PrivateIpAddressLimitExceeded: Number of private addresses will exceed limit.\n\tstatus code: 400, request id: 7874f1e2-3565-45e2-bbd2-2871ea0f67ae"}
{"level":"error","ts":"2020-05-21T18:37:48.567Z","caller":"ipamd/ipamd.go:762","msg":"Failed to allocate a private IP addresses on ENI eni-0c247214789d21112: PrivateIpAddressLimitExceeded: Number of private addresses will exceed limit.\n\tstatus code: 400, request id: e2e9c4e8-9d15-445f-ae77-9df1e71fea5b"}


mogren commented May 21, 2020

Hmm... the first line is concerning. Why is iptables not responding?

{"level":"error","ts":"2020-05-21T18:36:13.918Z","caller":"ipamd/ipamd.go:354","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 4: iptables: Resource temporarily unavailable.\n"}


mogren commented May 21, 2020

@uthark I can see 4 ENIs attached to that instance, each with 29 secondary IPs. The CNI reads the instance metadata to fetch the attached IPs, so there must be some mismatch there.


uthark commented May 21, 2020

Hi! The 2 extra ENIs were added later by the CNI plugin.
About the iptables issue: I'll take a look, but I can't quickly identify the cause.

# iptables --version
iptables v1.6.0
# uname -r
4.15.0-1065-aws


uthark commented May 21, 2020

Also, we use kube2iam on the host with --iptables=true, which might be related to the iptables error.

The CNI errors started to appear after rolling out 1.6.x.


kgtw commented Aug 30, 2020

We have also just experienced a similar error with CNI 1.7.1; kube2iam is also running in our clusters.

We're not seeing the same problem as reported by @uthark, but we are seeing issues with iptables being updated.

It might be worth adding a retry on iptables updates, in case some other process is modifying them at the same time. The current implementation causes the pod to crash and restart (see the sketch after the log below).

{"level":"info","ts":"2020-08-30T06:17:37.670Z","caller":"entrypoint.sh","msg":"Copying CNI plugin binaries ... "}
{"level":"info","ts":"2020-08-30T06:17:37.745Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2020-08-30T06:17:37.851Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2020-08-30T06:17:37.852Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2020-08-30T06:17:38.162Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.7.1  ..."}
{"level":"info","ts":"2020-08-30T06:17:38.220Z","caller":"aws-k8s-agent/main.go:41","msg":"Testing communication with server"}
{"level":"info","ts":"2020-08-30T06:17:38.220Z","caller":"aws-k8s-agent/main.go:41","msg":"Successful communication with the Cluster! Cluster Version is: v1.16. git version: v1.16.7. git tree state: clean. commit: be3d344ed06bff7a4fc60656200a93c74f31f9a4. platform: linux/amd64"}
{"level":"debug","ts":"2020-08-30T06:17:38.223Z","caller":"ipamd/ipamd.go:318","msg":"Discovered region: eu-central-1"}
{"level":"debug","ts":"2020-08-30T06:17:38.224Z","caller":"awsutils/awsutils.go:281","msg":"Found availability zone: eu-central-1b "}
{"level":"debug","ts":"2020-08-30T06:17:38.224Z","caller":"awsutils/awsutils.go:281","msg":"Discovered the instance primary ip address: 172.22.86.5"}
{"level":"debug","ts":"2020-08-30T06:17:38.225Z","caller":"awsutils/awsutils.go:281","msg":"Found instance-id: i-07c6605d8fd107fd0 "}
{"level":"debug","ts":"2020-08-30T06:17:38.225Z","caller":"awsutils/awsutils.go:281","msg":"Found instance-type: r5.4xlarge "}
{"level":"debug","ts":"2020-08-30T06:17:38.226Z","caller":"awsutils/awsutils.go:281","msg":"Found primary interface's MAC address: 06:49:f0:f3:ea:d6"}
{"level":"debug","ts":"2020-08-30T06:17:38.226Z","caller":"awsutils/awsutils.go:341","msg":"Discovered 1 interfaces."}
{"level":"debug","ts":"2020-08-30T06:17:38.227Z","caller":"awsutils/awsutils.go:341","msg":"Found device-number: 0 "}
{"level":"debug","ts":"2020-08-30T06:17:38.227Z","caller":"awsutils/awsutils.go:341","msg":"Found eni: eni-0f152de421b2bee93 "}
{"level":"debug","ts":"2020-08-30T06:17:38.227Z","caller":"awsutils/awsutils.go:341","msg":"eni-0f152de421b2bee93 is the primary ENI of this instance"}
{"level":"debug","ts":"2020-08-30T06:17:38.227Z","caller":"awsutils/awsutils.go:281","msg":"Found subnet-id: subnet-02dd602db9eb3d986 "}
{"level":"info","ts":"2020-08-30T06:17:38.228Z","caller":"awsutils/awsutils.go:356","msg":"Found sg-048f335b09a077c07, added to ipamd cache"}
{"level":"info","ts":"2020-08-30T06:17:38.228Z","caller":"awsutils/awsutils.go:362","msg":"Found 172.22.80.0/20, added to ipamd cache"}
{"level":"debug","ts":"2020-08-30T06:17:38.228Z","caller":"ipamd/ipamd.go:327","msg":"Using WARM_IP_TARGET 5"}
{"level":"debug","ts":"2020-08-30T06:17:38.228Z","caller":"ipamd/ipamd.go:337","msg":"Start node init"}
{"level":"info","ts":"2020-08-30T06:17:38.228Z","caller":"ipamd/ipamd.go:364","msg":"Setting up host network... "}
{"level":"debug","ts":"2020-08-30T06:17:38.228Z","caller":"networkutils/network.go:222","msg":"Trying to find primary interface that has mac : 06:49:f0:f3:ea:d6"}
{"level":"info","ts":"2020-08-30T06:17:38.228Z","caller":"wait/wait.go:133","msg":"Will attempt to clean up AWS CNI leaked ENIs after waiting 2m28s."}
{"level":"debug","ts":"2020-08-30T06:17:38.228Z","caller":"networkutils/network.go:222","msg":"Discovered interface: lo, mac: "}
{"level":"debug","ts":"2020-08-30T06:17:38.228Z","caller":"networkutils/network.go:222","msg":"Discovered interface: ens5, mac: 06:49:f0:f3:ea:d6"}
{"level":"info","ts":"2020-08-30T06:17:38.228Z","caller":"networkutils/network.go:222","msg":"Discovered primary interface: ens5"}
{"level":"debug","ts":"2020-08-30T06:17:38.228Z","caller":"ipamd/ipamd.go:364","msg":"Setting RPF for primary interface: net/ipv4/conf/ens5/rp_filter"}
{"level":"debug","ts":"2020-08-30T06:17:38.230Z","caller":"networkutils/network.go:249","msg":"Found the Link that uses mac address 06:49:f0:f3:ea:d6 and its index is 2 (attempt 1/5)"}
{"level":"debug","ts":"2020-08-30T06:17:38.238Z","caller":"networkutils/network.go:319","msg":"Setup Host Network: loading existing iptables nat SNAT exclusion rules"}
{"level":"debug","ts":"2020-08-30T06:17:38.257Z","caller":"ipamd/ipamd.go:364","msg":"Setup Host Network: iptables -N AWS-SNAT-CHAIN-0 -t nat"}
{"level":"error","ts":"2020-08-30T06:17:38.263Z","caller":"ipamd/ipamd.go:364","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 4: iptables: Resource temporarily unavailable.\n"}
{"level":"error","ts":"2020-08-30T06:17:38.263Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd init: failed to set up host network: host network setup: failed to add chain: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 4: iptables: Resource temporarily unavailable.\n"}

Update:

Just read up on the iptables manual; it seems like --wait isn't behaving as one would expect.

-w, --wait [seconds]
Wait for the xtables lock. To prevent multiple instances of the program from running concurrently, an attempt will be made to obtain an exclusive lock at launch. By default, the program will exit if the lock cannot be obtained. This option will make the program wait (indefinitely or for optional seconds) until the exclusive lock can be obtained.

It feels like the lock isn't actually being waited for indefinitely.
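
One way to remove the ambiguity would be to pass an explicit timeout to the wait flag rather than relying on the bare --wait; a hedged sketch of such an invocation (not the CNI's actual code path):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// --wait=5 blocks up to 5 seconds for the xtables lock, sidestepping any
	// version-dependent interpretation of a bare --wait.
	out, err := exec.Command("/usr/sbin/iptables", "-t", "nat",
		"-N", "AWS-SNAT-CHAIN-0", "--wait=5").CombinedOutput()
	if err != nil {
		fmt.Printf("iptables failed: %v: %s\n", err, out)
	}
}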


uthark commented Aug 30, 2020

I tried to run 1.7.0, but the gRPC server doesn't start, so the readiness probe fails.


mogren commented Sep 2, 2020

@uthark do you have any logs showing why it didn't start up? Did you apply the new config, or just update the image tag?

We have merged a number of fixes lately that will be available in v1.7.2 and should resolve the PrivateIpAddressLimitExceeded issue, #1174 specifically.


uthark commented Sep 2, 2020

The new config, with the init container. It looks like it starts, but the gRPC health check fails. The initial delay is configured to 90 seconds.
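
For anyone debugging this, the readiness probe can be reproduced by hand with the standard gRPC health client; a sketch, assuming ipamd serves the grpc.health.v1 service on 127.0.0.1:50051 as in the stock aws-node manifest:

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Dial ipamd's gRPC endpoint; WithBlock makes dial errors surface here.
	conn, err := grpc.DialContext(ctx, "127.0.0.1:50051", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("status:", resp.GetStatus()) // SERVING when ipamd is ready
}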


mogren commented Sep 4, 2020

Ok, thanks @uthark. I wonder if it could be getting stuck on iptables or on creating routes. I'd be happy to take a look at the ipamd log when this happens.

Also, we have a release candidate out with the fixes mentioned above: v1.7.2-rc1.


mogren commented Sep 10, 2020

@uthark While comparing v1.6.3 and v1.7.2-rc1 when scaling from 0 to 5000 pods and back to 0, we saw 206 Client.PrivateIpAddressLimitExceeded errors on v1.6.3 but 0 on v1.7.2-rc1. Closing this issue since #1174 has resolved it.

mogren closed this as completed Sep 10, 2020

uthark commented Sep 10, 2020

Sounds good to me. I'll give 1.7.2-rc a try.
