
pluto doesn't handle proxy credentials in https_proxy #3525

Closed
sonalita opened this issue Oct 13, 2023 · 31 comments · Fixed by #3639
Labels
area/kubernetes (K8s including EKS, EKS-A, and including VMW) · type/bug (Something isn't working)

Comments

@sonalita

sonalita commented Oct 13, 2023

We've been happily running on Kubernetes 1.25 for several months using Bottlerocket images, but we are now in the process of upgrading our clusters. However, after a successful upgrade of the EKS managed control plane to 1.26, new nodes are failing to join the node group.

This seems similar to #3064, on which my colleague Paul asked a question a couple of days ago, but it is not quite the same problem.

Any advice on what the issue may be and how we can fix it please?

Image I'm using:
bottlerocket-aws-k8s-1.26-x86_64-v1.15.1-264e294c

What I expected to happen:
Instances launch and join the EKS nodegroup

What actually happened:
Instances launch but fail to join the nodegroup. The system log shows the following error:

[ OK ] Finished Bottlerocket userdata configuration system.
Starting User-specified setting generators...
Unmounting EFI System Partition Automount...
[ OK ] Unmounted EFI System Partition Automount.
[ 303.392308] sundog[1343]: Setting generator 'pluto private-dns-name' failed with exit code 1 - stderr: Timed out retrieving private DNS name from EC2: deadline has elapsed
[FAILED] Failed to start User-specified setting generators

We also see the following in the console log. We're not sure whether it is relevant to our issue, but it is different from what we see on a 1.25 control plane cluster:

[ OK ] Finished Generate network configuration.
[ 1.708363] migrator[1266]: Data store does not exist at given path, exiting (/var/lib/bottlerocket/datastore/current)
[ OK ] Finished Bottlerocket data store migrator.

How to reproduce the problem:
Our VPC has two CIDRs; the secondary CIDR is used for our pod network.

sonalita added the status/needs-triage (Pending triage or re-evaluation) and type/bug (Something isn't working) labels Oct 13, 2023
@gthao313
Member

@sonalita Thanks for opening the ticket. We will investigate and try to reproduce it.

@patkinson01

patkinson01 commented Oct 14, 2023

Thanks @gthao313!

A few other pieces of information which may be relevant here:

The VPC where the cluster is running has a non-standard DHCP option set - we use our own nameservers and not AmazonProvidedDNS. domain-name is set to: eu-west-1.compute.internal

We reduce the IMDSv2 hop limit to 1 (I've just tried increasing it to 2, but that didn't fix it)

We also set https-proxy and no-proxy in the userdata
no-proxy = [ "192.168.0.0/16", "10.0.0.0/8", "100.64.0.0/10", "localhost", "127.0.0.1", "169.254.169.254", ".compute.internal", ".cluster.local.", ".cluster.local", ".svc", ".eks.amazonaws.com", ".s3.eu-west-1.amazonaws.com", ".s3.dualstack.eu-west-1.amazonaws.com" ]

gthao313 added the area/kubernetes (K8s including EKS, EKS-A, and including VMW) label Oct 17, 2023
@gthao313
Member

@patkinson01 @sonalita Sorry for the late reply. I was trying to test this but was unable to reproduce it. My approach was to create a 1.25 EKS cluster with a nodegroup that has some 1.25 nodes, then update the EKS cluster and nodegroup to version 1.26. Both steps went well in my test. To narrow down the issue, I need your help with a test and some questions.

Can you launch a few new 1.26 nodes in your EKS cluster to validate whether they are able to join the cluster?

Are you aware of any network setup changes made during the upgrade?

Thanks!

@patkinson01

Hi @gthao313 - the first scenario worked for us too; the upgrade from 1.25 to 1.26 worked. The issue comes when we destroy the nodegroup and try to create a new nodegroup with new 1.26 nodes.

Nothing else should have changed in the config - everything is 'as code' and pushed out via Terraform.

@yeazelm
Contributor

yeazelm commented Oct 25, 2023

We also set https-proxy and no-proxy in the userdata
no-proxy = [ "192.168.0.0/16", "10.0.0.0/8", "100.64.0.0/10", "localhost", "127.0.0.1", "169.254.169.254", ".compute.internal", ".cluster.local.", ".cluster.local", ".svc", ".eks.amazonaws.com", ".s3.eu-west-1.amazonaws.com", ".s3.dualstack.eu-west-1.amazonaws.com" ]

I wanted to follow up that I noticed your no-proxy might need an additional value: .ec2.amazonaws.com which is what the node uses to determine its name to connect to the cluster. This change added the need to get this private name via our internal code in pluto as you see in the logs. Your node needs to be able to call EC2 to confirm its private name for joining the cluster. Can you try adding that (and ensuring the IAM policies you are adding to the node have this access too) and see if that resolves this issue?
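For example (an illustrative fragment only; you would keep your existing entries and append the EC2 one), the userdata entry might look like:

no-proxy = [ "169.254.169.254", ".compute.internal", ".ec2.amazonaws.com" ]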

@patkinson01

Hi @yeazelm, thanks for the additional info and pointer! Can I just confirm that your suggestion to add .ec2.amazonaws.com to the no-proxy value is based on the assumption that we have implemented an EC2 interface VPC endpoint within the VPC? Thanks :)

@yeazelm
Contributor

yeazelm commented Oct 26, 2023

Hi @yeazelm, thanks for the additional info and pointer! Can I just confirm that your suggestion to add .ec2.amazonaws.com to the no-proxy value is based on the assumption that we have implemented an EC2 interface VPC endpoint within the VPC? Thanks :)

Correct. You'll need an interface VPC endpoint for EC2.

@yeazelm
Contributor

yeazelm commented Oct 30, 2023

It sounds like this might have resolved things. Can you confirm whether the endpoint and permissions solved your issue?

yeazelm removed the type/bug (Something isn't working) and status/needs-triage (Pending triage or re-evaluation) labels Oct 30, 2023
@sonalita
Author

sonalita commented Oct 31, 2023

@yeazelm No. As Paul said above, our environment is quite tightly locked down, so we are not able to add the endpoint and no-proxy change; we are able to reach out through our normal proxy. FYI, we are also trying to find a solution via AWS support, who had us configure the admin container. We were able to do so successfully on a 1.25 node and use sheltie to get at the logs, but unfortunately, with a 1.26 node, the instances are not even getting as far as deploying the Bottlerocket admin container.

@yeazelm
Contributor

yeazelm commented Oct 31, 2023

For the external AWS cloud provider (which was added in 1.26; the in-tree provider was removed in 1.27), here is a workaround to try which might let the node come up. This could work for 1.26, but it will not work for 1.27 and later (due to the removal of the in-tree provider), so you will need the EC2 endpoint available to your nodes on 1.27 and later. Nonetheless, you might try this as a workaround until you figure out a path to getting the EC2 endpoint sorted out.

set settings.kubernetes.cloud-provider to aws
set settings.kubernetes.hostname-override to an empty string to skip pluto timing out on the EC2 request

This essentially will revert to the old behavior on 1.25.
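In userdata TOML, those two settings (taken straight from the steps above) are:

settings.kubernetes.cloud-provider = 'aws'
settings.kubernetes.hostname-override = ''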

@sonalita
Author

sonalita commented Nov 1, 2023

Hi @yeazelm
After adding those two settings, we now see a different error in the ec2 instance system log:

[  OK  ] Finished wicked managed network interfaces.
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
         Starting Bottlerocket userdata configuration system...
[    3.554275] early-boot-config[1325]: Error PATCHing '/settings?tx=bottlerocket-launch': Status 400 when PATCHing /settings?tx=bottlerocket-launch: Json deserialize error: Unable to deserialize into ValidLinuxHostname: Invalid hostname '': must only be [0-9a-z.-], and 1-253 chars long at line 1 column 1745
[FAILED] Failed to start Bottlerocket userdata configuration system.
See 'systemctl status early-boot-config.service' for details.

Our (redacted) userdata looks like this:

settings.kubernetes.cluster-name = 'xxx'
settings.kubernetes.api-server = 'https://xxx.gr7.eu-west-1.eks.amazonaws.com'
settings.kubernetes.cluster-certificate = 'xxx'
settings.kubernetes.cluster-dns-ip = '192.168.0.10'
settings.kubernetes.max-pods = 110
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup-image' = 'ami-03b30c03b4fd62ad5'
settings.kubernetes.node-labels.'eks.amazonaws.com/capacityType' = 'ON_DEMAND'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateVersion' = '1'
settings.kubernetes.node-labels.'eks.amazonaws.com/nodegroup' = 'managed-ondemand-20231101134231957200000004'
settings.kubernetes.node-labels.'eks.amazonaws.com/sourceLaunchTemplateId' = 'lt-0ae7efeb7273a132c'
settings.kubernetes.node-labels.'bottlerocket.aws/updater-interface-version' = '2.0.0'
settings.kubernetes.cloud-provider = 'aws'
settings.kubernetes.hostname-override = ''
settings.network.no-proxy = ['192.168.0.0/16', '10.0.0.0/8', '100.64.0.0/10', 'localhost', '127.0.0.1', '169.254.169.254', '.compute.internal', '.cluster.local.', '.cluster.local', '.svc', '.eks.amazonaws.com', '.s3.eu-west-1.amazonaws.com', '.s3.dualstack.eu-west-1.amazonaws.com', '.vpce.amazonaws.com']
settings.network.https-proxy = 'xxx'
settings.container-registry.credentials = [{registry = 'xxx', username = 'xxx', password = 'xxxx'}]
settings.host-containers.admin.enabled = true
settings.host-containers.admin.user-data = 'xxx'
settings.kernel.sysctl.'user.max_user_namespaces' = '0'
settings.kernel.sysctl.'vm.max_map_count' = '262144'
settings.kernel.sysctl.'net.ipv4.conf.all.send_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.default.send_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.all.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.default.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv6.conf.all.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv6.conf.default.accept_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.all.secure_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.default.secure_redirects' = '0'
settings.kernel.sysctl.'net.ipv4.conf.all.log_martians' = '1'
settings.kernel.sysctl.'net.ipv4.conf.default.log_martians' = '1'
settings.bootstrap-containers.bottle.source = 'xxx'
settings.bootstrap-containers.bottle.mode = 'once'
settings.bootstrap-containers.bottle.user-data = 'xxxx'
settings.updates.ignore-waves = true
settings.updates.seed = 0

The bootstrap userdata does not contain any hostname information.

@yeazelm
Contributor

yeazelm commented Nov 1, 2023

OK, I had not tried setting the hostname to '' before asking you to try it. Sadly, it makes sense that early-boot-config is not able to handle an empty string by falling back to the default. I'm sorry that didn't work. I'll do a bit more digging to see if I can find a way to get this workaround working.

@etungsten
Contributor

etungsten commented Nov 1, 2023

Hi @sonalita, you can set settings.kubernetes.hostname-override to any arbitrary non-empty hostname string to work around the issue. kubelet will ignore the --hostname-override option if the AWS in-tree cloud provider is responsible for setting the node name. See kubernetes/kubernetes#64659.
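In userdata TOML, any non-empty value works; the value below is an arbitrary placeholder:

settings.kubernetes.hostname-override = 'placeholder'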

@sonalita
Author

sonalita commented Nov 2, 2023

Hi @etungsten That workaround was successful. We now have 1.26 nodes!
Thank you for your help! We do still need a solution that will work for 1.27 and beyond.
I am going to pursue getting the EC2 interface endpoint configured with our cloudops team, but that may take a few days.

@jooh-lee

jooh-lee commented Nov 4, 2023

Hi @etungsten and @yeazelm, this seems to be a problem in 1.27 as well. I have an EKS cluster on 1.27 and, same deal, I had to recreate the nodegroup. On v1.14 I had no issues bringing up a node on 1.27; with v1.16.0 the nodes do not come up at all. The instances do have a connection to *.ec2.amazonaws.com, and I've tried setting up the admin container, but it's not coming up at all. Is there a workaround for 1.27?

This is not a problem with K8s 1.28 and v1.16.0 of the Bottlerocket AMI.

The arch we're using is x86

@sonalita
Author

sonalita commented Nov 7, 2023

Hi @yeazelm Unfortunately, we're still having issues with the workaround. If I kubectl describe a node, I see that the hostname is set to "x" (the value I set in the TOML file), and this seems to be causing problems: some addon pods are not starting properly, and the kubectl logs command on any pod fails with a TLS error. We have confirmed that .ec2.amazonaws.com is reachable via our Squid proxy, so unless the Bottlerocket boot process is not honouring the proxy settings, I'm told we should not need to add the VPC endpoint. I have tried kubectl versions 1.25, 1.26 and 1.27. So unfortunately, although the nodes are joining the cluster, the setup is not stable and is therefore unusable in a production environment.

@etungsten
Contributor

etungsten commented Nov 8, 2023

Hi @sonalita, I'm currently wrapping up #3582 to help with the behavior you're seeing. It'll let you avoid having to specify an arbitrary hostname in hostname-override if you're using the aws cloud provider. (Edited: you would still need to specify the arbitrary hostname, but it won't be passed to kubelet, which avoids the undesired behavior.) The in-tree AWS cloud provider will manage the node name matching during registration. Once it merges, we'll be releasing the change in our next 1.16.1 release, which should be happening early next week.

One more thing,

We have confirmed that .ec2.amazonaws.com is reachable via our squid proxy

The correct endpoint for the EC2 API is of the form ec2.<aws-region>.amazonaws.com. So if you're trying to exempt the EC2 endpoint from the proxy, that is the entry to add.
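For example, for the eu-west-1 region used in this thread, that no-proxy entry would look like (illustrative fragment):

no-proxy = [ "ec2.eu-west-1.amazonaws.com" ]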

@etungsten
Contributor

Hi @jooh-lee,

I believe what you're seeing is a different issue. There should be no difference between v1.14.0 and v1.16.0 when running in a K8s 1.27 cluster. If your admin container does not come up, that points to a network issue. Can you please create a separate issue with details about your cluster environment and any relevant host configuration for us to track?

@bcressey
Contributor

bcressey commented Nov 9, 2023

It'll let you avoid having to specify an arbitrary hostname in hostname-override if you're using the aws cloud provider.

@etungsten - after #3582, there's still a need to specify an arbitrary hostname, right? It just won't actually be rendered in the final config and won't affect kubelet behavior.

@etungsten
Contributor

Ah right, that's correct. You would still need to specify an arbitrary value to skip pluto settings generation. I've edited my original comment.

@sonalita
Author

Hi

We have some good news! The PR merge has resulted in a healthy 1.26 node group, with all pods running and the kubectl logs command working!

Now we need to focus on a less tactical solution for 1.27 onwards.

To answer the question about the proxy config:

In our proxy config we allow *.amazonaws.com through the proxy, with just a few entries in our NO_PROXY where we want to route directly, namely:

• .eks.amazonaws.com (For internal EKS Control Plane API Endpoints)
• .s3.eu-west-1.amazonaws.com & .s3.dualstack.eu-west-1.amazonaws.com (for S3)
• .vpce.amazonaws.com (for PrivateLink endpoints)

Is the code definitely respecting the system proxy and no-proxy settings?

@sonalita
Author

Hi team, sorry for the delay in replying. We were asked by AWS support to sheltie onto a node and run a describe-instances command. The aws command isn't on the path, and find / -name aws listed many results. The one in /var/lib/provisioning/v2/2.11.4/bin/aws seems to work, but when I run /var/lib/provisioning/v2/2.11.4/bin/aws --region eu-west-1 ec2 describe-instances --debug, it just hangs.

I'm attaching the debug output (with tokens redacted) for your perusal.

describe-instances.txt

@sonalita
Author

Hi all

I realised that when I did my tests on Thursday, I may not have set the proxy variables correctly after issuing the sheltie command. I have repeated my tests this morning and can confirm that with the environment variables https_proxy, http_proxy and no_proxy set correctly, we can indeed successfully execute an aws ec2 describe-instances command.

The debug output is attached, with security tokens etc. redacted and most of the returned XML truncated. The command I used was:
aws --region eu-west-1 ec2 describe-instances --filters Name=image-id,Values=ami-004a21828789c1a10 --debug

For information, I've confirmed that the launch template for the instances has these settings, the values of which match what I set for the https_proxy and no_proxy env vars in my test.

settings.network.no-proxy = [ ]
settings.network.https-proxy = '<our proxy url/credentials> '

Are these definitely set in the environment at the time you run the describe-instances command?

Hopefully this information will help you to debug the issue further.

@yeazelm
Contributor

yeazelm commented Nov 27, 2023

Hey @sonalita, thanks for the updates! It sounds like there might be some nuance in how you are configuring your proxy. What was the change you needed to make to get it working? I wonder whether we have additional work to do to handle that change.

Are these definitely set in the environment at the time you run the describe-instances command?

We pass them directly to pluto and have confirmed pluto respects these settings. pluto is what ends up calling the EC2 DescribeInstances API via the Rust SDK. It might be a matter of formatting these variables to ensure they work correctly.

@sonalita
Author

sonalita commented Nov 27, 2023 via email

@sonalita
Author

sonalita commented Dec 4, 2023

Copying the info I added to the AWS support case here for reference (I was asked for the console logs and userdata again)

Hi, as requested, I'm attaching the console log for an instance attempting to join a Kubernetes 1.27 nodegroup, plus the sanitized userdata from its launch template.
As you can see on line 178 of the log, we are still seeing the pluto error. You can see we are setting the proxy in the userdata, and we have previously confirmed that on a running node, we can sheltie into the node via the admin container and successfully run a describe-instances command after setting the https_proxy and no_proxy environment variables in that session to match the userdata configuration.

br-1.16.1-k8s-1-27-userdata.txt
br-1.16.1-k8s-1-27-bootlog.txt

@etungsten
Contributor

Hi @sonalita, thanks for the additional info.

In your previous comment you mentioned your https_proxy configuration contains proxy credentials. I'm assuming it's in the format of
https_proxy="http://username:password@proxy.com:80"

Currently, pluto's proxy handling does not support proxy credentials:

let mut proxy_uri = https_proxy.parse::<Uri>().context(UriParseSnafu {
    input: &https_proxy,
})?;

I think that's the issue you're running into right now. Unfortunately, I don't have a quick workaround for this at the moment. pluto needs to be taught how to extract and use proxy credentials.
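Purely for illustration, here is a minimal sketch of what extracting the credentials might look like. This is hypothetical code, not the fix that eventually landed; it assumes the base64 crate for building the Proxy-Authorization header value.

use base64::Engine as _; // hypothetical sketch; assumes base64 = "0.21"

// Split optional "user:password@" userinfo off a proxy URL, returning the
// credentials (if any) and the URL with the userinfo removed.
fn split_proxy_credentials(proxy: &str) -> (Option<String>, String) {
    let (scheme, rest) = match proxy.split_once("://") {
        Some((s, r)) => (Some(s), r),
        None => (None, proxy),
    };
    match rest.rsplit_once('@') {
        Some((userinfo, host)) => {
            let bare = match scheme {
                Some(s) => format!("{s}://{host}"),
                None => host.to_string(),
            };
            (Some(userinfo.to_string()), bare)
        }
        None => (None, proxy.to_string()),
    }
}

fn main() {
    let https_proxy = "http://username:password@proxy.com:80";
    let (creds, uri) = split_proxy_credentials(https_proxy);
    // The connector would parse `uri` as before; when credentials are
    // present, they'd be sent on the CONNECT request as
    // "Proxy-Authorization: Basic <base64(user:password)>".
    if let Some(userinfo) = creds {
        let header = format!(
            "Basic {}",
            base64::engine::general_purpose::STANDARD.encode(userinfo)
        );
        println!("proxy = {uri}, Proxy-Authorization = {header}");
    }
}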

I'm gonna go ahead and update the title of this issue so we can track this more accurately.

etungsten changed the title from "AWS EKS instances not joining nodegroup after upgrading to K8S 1.26" to "pluto doesn't handle proxy credentials in https_proxy" Dec 4, 2023
@etungsten
Contributor

One workaround that comes to mind is to use bootstrap containers to basically replace what pluto is trying to do.

Firstly, you need to skip pluto execution during boot by setting settings.kubernetes.hostname-override to a random non-empty string value in your userdata.
In the bootstrap container, you can call the aws-cli to fetch the private DNS name for the instance through describe-instances and set settings.kubernetes.hostname-override to that value via apiclient. kubelet should then work as expected.

etungsten added the type/bug (Something isn't working) label Dec 5, 2023
@sonalita
Author

sonalita commented Dec 5, 2023

@etungsten Hi
SUCCESS!!!!!!!!!!! For documentation and to help others facing the same issue, here's what I did.

I added the hostname-override variable to the bottlerocket userdata and then I added the following to our bootstrap container run script:


    # Fetch an IMDSv2 session token.
    TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

    echo "PRI-DNS: Got token $TOKEN"

    # Look up this instance's ID from instance metadata.
    INSTANCE_ID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)

    echo "PRI-DNS: Got instance ID: $INSTANCE_ID"

    # Ask EC2 for this instance's private DNS name. $REGION is set elsewhere in
    # our run script; the proxy env vars must also be set for this call to work.
    PRIVATE_DNS=$(aws ec2 describe-instances --region "$REGION" --instance-ids "$INSTANCE_ID" \
        --query 'Reservations[*].Instances[*].{Instance:PrivateDnsName}' --output text)

    echo "PRI-DNS: Got PRIVATE_DNS: $PRIVATE_DNS"

    # Hand the name to kubelet via the Bottlerocket API.
    apiclient set settings.kubernetes.hostname-override="$PRIVATE_DNS"

Any idea on when the pluto code will be updated and released?

@etungsten
Contributor

Hi @sonalita,

We merged a potential fix for this issue and will be releasing it in the next Bottlerocket release next week.

etungsten reopened this Dec 7, 2023
@atkins4aviva

I am pleased to report that the new Bottlerocket AMI is working as expected without the bootstrap workaround.

Thank you to all at AWS and Bottlerocket who have helped us resolve this issue. It's taken a while, but we finally got there!

Thanks again,

  • Steve
