
aws-k8s: get cluster dns cidr from eks #1547

Merged
merged 1 commit into from May 4, 2021

Conversation

webern
Member

@webern commented Apr 30, 2021

Issue number:

Closes #1197

Description of changes:

We previously assumed that an EKS cluster DNS IP was either 10.100.0.10 or 172.20.0.10. This may be incorrect when the cluster has a custom serviceIPv4CIDR setting.

Now we get the serviceIPv4CIDR from the EKS describe-cluster API so that we can calculate the correct cluster DNS IP even when a custom CIDR is configured.
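The defaults above (10.100.0.10, 172.20.0.10) come from the EKS convention of placing the cluster DNS service at the ".10" address of the service CIDR. A minimal sketch of that calculation, assuming that convention; this is an illustration, not the actual pluto implementation:

```rust
/// Derive the cluster DNS IP from a serviceIPv4CIDR string, assuming
/// the EKS convention of using the ".10" host address of the range.
fn cluster_dns_ip(service_cidr: &str) -> Option<String> {
    // Take the network address portion before the '/'.
    let base = service_cidr.split('/').next()?;
    let mut octets: Vec<u8> = base
        .split('.')
        .map(|o| o.parse().ok())
        .collect::<Option<Vec<u8>>>()?;
    if octets.len() != 4 {
        return None;
    }
    // Place the cluster DNS service at the ".10" address of the range.
    octets[3] = 10;
    Some(format!("{}.{}.{}.{}", octets[0], octets[1], octets[2], octets[3]))
}

fn main() {
    // The custom CIDR used in this PR's testing yields 10.31.32.10.
    assert_eq!(cluster_dns_ip("10.31.32.0/24").as_deref(), Some("10.31.32.10"));
    // The default EKS service range yields the familiar 10.100.0.10.
    assert_eq!(cluster_dns_ip("10.100.0.0/16").as_deref(), Some("10.100.0.10"));
    println!("ok");
}
```

With the custom CIDR from the testing below (10.31.32.0/24), this produces 10.31.32.10, matching the node settings observed after the fix.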

Notes

  • This requires rusoto v0.46 because the serviceIPv4CIDR field is fairly new to the API.
  • rusoto v0.46 relies on tokio v1.
  • We needed to switch from reqwest::blocking to async reqwest for the IMDS calls because rusoto now requires a tokio runtime.

Testing done:

Reproducing the issue before fixing it:

After reading this

  • I created a cluster with a custom CIDR: 10.31.32.0/24.
  • I ran a production 1.18 AMI and verified that it had the wrong Cluster DNS: 10.100.0.10.
  • I ran a pod like this (busybox's nslookup has a bug, so on the healthy pod I used ubuntu and installed dnsutils; on the unhealthy pod I couldn't apt-get install anything, which further proves DNS was broken, so I used busybox there):
    • kubectl run -i -t ubuntu --image=ubuntu --restart=Never --command=true -- bash
  • Inside the pod I ran nslookup kubernetes.default, which returned
/ # nslookup kubernetes.default
;; connection timed out; no servers could be reached

In a healthy, non-custom-CIDR cluster, the above produces:

root@ubuntu:/# nslookup kubernetes.default
Server:   10.100.0.10
Address:  10.100.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.100.0.1

Thus we have proven that Bottlerocket does not work correctly with a custom serviceIPv4CIDR.

Testing this PR

Custom CIDR Cluster

  • I ran a node in a cluster with serviceIPv4CIDR: '10.31.32.0/24'
  • I logged in and checked the Cluster DNS IP
    • apiclient -u /settings?keys=settings.kubernetes.cluster-dns-ip
    • {"kubernetes":{"cluster-dns-ip":"10.31.32.10"}}
  • I ran a pod:
    • kubectl run -i -t ubuntu --image=ubuntu --restart=Never --command=true -- bash
    • nslookup kubernetes.default

The result shows that cluster DNS resolution is working:

Server:   10.31.32.10
Address:  10.31.32.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.31.32.1

Non-Custom CIDR Cluster

  • I ran a node in a cluster with no serviceIPv4CIDR configuration.
  • I logged in and checked the Cluster DNS IP
    • apiclient -u /settings?keys=settings.kubernetes.cluster-dns-ip
    • {"kubernetes":{"cluster-dns-ip":"10.100.0.10"}}
  • I ran a pod:
    • kubectl run -i -t ubuntu --image=ubuntu --restart=Never --command=true -- bash
    • nslookup kubernetes.default

The result shows that cluster DNS resolution is working:

Server:   10.100.0.10
Address:  10.100.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.100.0.1

Without Describe Cluster Permissions

  • I ran a node in a cluster with no serviceIPv4CIDR configuration and with DescribeCluster removed from the Node's instance role.
  • I logged in and checked some tracing in the journal.
    • Saw this: Unable to determine CIDR from EKS, falling back to default cluster DNS IP: Error describing cluster: Request ID: Some("9753370c-9443-455c-91b4-a182491bcf1e") Body: {"message":"User: arn:aws:sts::xxx:assumed-role/ [...] is not authorized to perform: eks:DescribeCluster on resource: [..."}
    • apiclient -u /settings?keys=settings.kubernetes.cluster-dns-ip
    • {"kubernetes":{"cluster-dns-ip":"10.100.0.10"}}
  • I ran a pod:
    • kubectl run -i -t ubuntu --image=ubuntu --restart=Never --command=true -- bash
    • nslookup kubernetes.default

The result shows that cluster DNS resolution is working despite the failed EKS describe-cluster call:

Server:   10.100.0.10
Address:  10.100.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.100.0.1
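The fallback path exercised above can be sketched as follows. Note that the rule for choosing between the two legacy defaults (10.0.0.0/8 VPC → 172.20.0.10, otherwise 10.100.0.10) follows the EKS AL2 AMI's bootstrap convention and is an assumption for illustration, not code lifted from this PR:

```rust
/// Hypothetical sketch: when describe-cluster fails (e.g. the instance
/// role lacks eks:DescribeCluster), fall back to the historical default
/// cluster DNS IPs rather than failing the node.
fn fallback_cluster_dns_ip(vpc_is_ten_range: bool) -> &'static str {
    if vpc_is_ten_range {
        // VPCs inside 10.0.0.0/8 historically used the alternate default.
        "172.20.0.10"
    } else {
        "10.100.0.10"
    }
}

fn main() {
    // Matches the value observed in the journal during this test run.
    assert_eq!(fallback_cluster_dns_ip(false), "10.100.0.10");
    assert_eq!(fallback_cluster_dns_ip(true), "172.20.0.10");
    println!("ok");
}
```

The key behavior verified above is graceful degradation: the EKS call failure is logged and the node still boots with a working (default) cluster DNS IP.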

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from ad82670 to 99317fb

Add Bottlerocket API call to get the AWS region before working on the rebase to pull in #1551.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from 99317fb to b29a1cb

Rebase.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from b29a1cb to d6cfe90

Fix an oops (region URI was wrong)

@webern
Member Author

webern commented May 4, 2021

Retest checks out.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from d6cfe90 to 0745e06

I think this addresses all of @tjkirch's suggestions except for using the Model struct, which I will address separately.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from 0745e06 to 7325837

Use the API via strongly-typed settings instead of poking into a serde_json value. This requires conditional compilation to preserve the usability of the sources workspace.

@webern
Member Author

webern commented May 4, 2021

Retest checks out with 7325837. CI failure is a flake but not re-running because there are some small things to push.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from 7325837 to 38fa191

Fix some nits.

@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from 38fa191 to a821017

Additional cleanups.

@etungsten self-requested a review May 4, 2021 22:16
@webern
Member Author

webern commented May 4, 2021

webern force-pushed the webern:dns branch from a821017 to 55ada33

Spelling fix.

Successfully merging this pull request may close these issues.

DNS Cluster IP determination inconsistent with EKS AL2 AMIs