pluto doesn't handle proxy credentials in https_proxy
#3525
Comments
@sonalita Thanks for opening the ticket. We will investigate and try to reproduce it.
Thanks @gthao313! A few other pieces of information which may be relevant here:
• The VPC where the cluster is running has a non-standard DHCP option set: we use our own nameservers rather than AmazonProvidedDNS, and domain-name is set to eu-west-1.compute.internal.
• We reduce the IMDSv2 hop limit to 1 (I've just tried increasing it to 2, but that didn't fix it).
• We also set https-proxy and no-proxy in the userdata (see the sketch below).
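For readers following along, here is a minimal sketch of what such proxy settings look like in Bottlerocket TOML userdata; the proxy address and no-proxy entries below are placeholders, not the actual values used in this cluster.

# Sketch only (placeholder values): proxy configuration in Bottlerocket userdata
[settings.network]
https-proxy = "http://proxy.internal.example:3128"
no-proxy = ["localhost", "127.0.0.1", ".eu-west-1.compute.internal"]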
@patkinson01 @sonalita Sorry for the late reply. I was trying to test this and was unable to reproduce it. My approach was to create a 1.25 EKS cluster with a nodegroup containing some 1.25 nodes, then update the EKS cluster and the nodegroup to version 1.26. Both steps went well in my test. To narrow down the issue, I need your help with a test and some questions. Can you launch a few new 1.26 nodes into your EKS cluster to validate whether they are able to join the cluster? Were you aware of any network setup changes during the upgrade? Thanks!
Hi @gthao313 - the first scenario worked for us too: the 1.25 to 1.26 upgrade step worked. The issue comes when we destroy the nodegroup and try to create a new nodegroup with new 1.26 nodes. Nothing else should have changed in the config - everything is 'as code' and pushed out via Terraform.
I wanted to follow up that I noticed your no-proxy might need an additional value:
Hi @yeazelm, thanks for the additional info and pointer! Can I just confirm that your suggestion to add
Correct. You'll need an interface VPC endpoint for EC2.
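The specific no-proxy value suggested above was not captured in the thread; one plausible reading, offered here only as an assumption, is that the regional EC2 API endpoint should bypass the proxy so it is reached through the interface VPC endpoint instead, roughly:

# Assumption: route the regional EC2 API endpoint around the proxy (via the VPC endpoint)
[settings.network]
no-proxy = ["localhost", "127.0.0.1", ".eu-west-1.compute.internal", "ec2.eu-west-1.amazonaws.com"]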
It sounds like this might have resolved your issue; can you confirm whether the endpoint and permissions solved it?
@yeazelm No, as Paul said above, our environment is quite tightly locked down, so we are not able to add the endpoint and no-proxy change; we are only able to reach out through our normal proxy. FYI, we are also trying to find a solution via AWS support, who had us configure the admin container. We were able to do so successfully on a 1.25 node and use sheltie to get at logs, but unfortunately, with a 1.26 node, the instances are not even getting as far as deploying the Bottlerocket admin container.
For the external AWS cloud provider (which was added in 1.26; the in-tree provider was removed in 1.27), here is a workaround to try which might let the node come up. This could possibly work for 1.26 but will not work for 1.27 and later (due to the removal of the in-tree provider), and you will need the EC2 endpoint available to your nodes going forward on 1.27 and later. Nonetheless, you might try this as a workaround until you figure out a path to getting the EC2 endpoint sorted out: set
This essentially reverts to the old behavior from 1.25.
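The exact setting referred to above is not preserved in the thread. One possible reading, stated here purely as an assumption, is that it selects the kubelet's cloud provider; if your Bottlerocket version exposes settings.kubernetes.cloud-provider, reverting to the in-tree provider would look roughly like this:

# Assumption: select the in-tree AWS cloud provider (only viable while it still exists, i.e. up to K8s 1.26)
[settings.kubernetes]
cloud-provider = "aws"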
Hi @yeazelm
Our (redacted) userdata looks like this:
The bootstrap userdata does not contain any hostname information.
Ok, I had not tried setting the hostname to
Hi @sonalita, you can set
Hi @etungsten, that workaround was successful. We now have 1.26 nodes!
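Later comments in this thread name the setting involved as the kubelet hostname override, so a sketch of the workaround might look like the following; the hostname value is just an example:

# Sketch: supply a hostname up front so the private DNS name lookup at boot is skipped
[settings.kubernetes]
hostname-override = "ip-10-0-0-10.eu-west-1.compute.internal"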
Hi @etungsten and @yeazelm, this seems to be a problem in 1.27 as well. I have an EKS cluster on 1.27 and, same deal, had to recreate the nodegroup. On v1.14 I had no issues bringing up a node on 1.27; with v1.16.0 the nodes do not come up at all. The instances do have a connection to *.ec2.amazonaws.com, and I've tried setting up the admin container, but it's not coming up at all. Is there a workaround for 1.27? This is not a problem with k8s 1.28 and v1.16.0 of the Bottlerocket AMI. The arch we're using is x86.
Hi @yeazelm, unfortunately we're still having issues with the workaround - if I
Hi @sonalita, I'm currently wrapping up #3582 to help with the behavior you're seeing. One more thing,
The correct endpoint for the EC2 API should be of the form
Hi @jooh-lee, I believe what you're seeing is a different issue. There should be no difference between v1.14.0 and v1.16.0 when running in a K8s 1.27 cluster. If your admin container does not come up, that would point to a network issue. Can you please create a separate issue with details about your cluster environment and any relevant host configuration for us to track?
@etungsten - after #3582, there's still a need to specify an arbitrary hostname, right? It just won't actually be rendered in the final config and won't affect kubelet behavior.
Ah right, that's correct. You would still need to specify an arbitrary value to skip
Hi, we have some good news! The PR merge has resulted in a healthy 1.26 node group with all pods running, and the kubectl logs command is working! Now we need to focus on a less tactical solution for 1.27 onwards. To answer the question about the proxy config: in our proxy config we allow *.amazonaws.com through the proxy, with just a few entries in our NO_PROXY where we want to route directly, namely:
• .eks.amazonaws.com (for internal EKS control plane API endpoints)
Is the code definitely respecting system proxy and no-proxy settings?
Hi team, sorry for the delays in replying. We were asked by AWS support to sheltie onto a node and run a describe-instances command. The aws command isn't on the path and a
I'm attaching the debug output (with tokens redacted) for your perusal.
Hi all, I realised that when I did my tests on Thursday, I may not have set the proxy variables correctly after issuing the sheltie command. I have repeated my tests this morning and can confirm that with the environment variables https_proxy, http_proxy and no_proxy set correctly, we can indeed successfully execute an aws ec2 describe-instances command. The debug output is attached, with security tokens etc. redacted and most of the returned XML truncated. The command I used was
For information, I've confirmed that the launch template for the instances has these settings, the values of which match what I set for the https_proxy and no_proxy env vars in my test:
settings.network.no-proxy = [ ]
Are these definitely set in the environment at the time you run the describe-instances command? Hopefully this information will help you to debug the issue further.
Hey @sonalita, thanks for the updates! It sounds like there might be some nuance in how you are configuring your proxy. What was the change you needed to do to get it working? I wonder if we have additional work to do to handle what that change is?
"Are these definitely set in the environment at the time you run the describe-instances command?"
We pass them directly to pluto and have confirmed pluto respects these settings. pluto is what ends up calling the EC2 DescribeInstances API via the Rust SDK. It might be a matter of formatting these variables to ensure they work correctly.
Hi, when I did the tests on Thursday, I hadn't realized that the sheltie command wasn't setting proxies, and I failed to check; so all I did was set the https_proxy and no_proxy environment variables (export https_proxy=xxx, for example) to what they should be as per the launch template settings, and the aws cli commands started working.
"We pass them directly to pluto and have confirmed pluto respects these settings" - but have they been set at this point in the boot process? i.e. have settings.network.no-proxy and settings.network.https-proxy been actioned, and are https_proxy and no_proxy set in the environment, at that point?
Copying the info I added to the AWS support case here for reference (I was asked for the console logs and userdata again). Hi, as requested, I'm attaching the console log for an instance attempting to join a Kubernetes 1.27 nodegroup and the sanitized userdata from its launch template.
br-1.16.1-k8s-1-27-userdata.txt
Hi @sonalita, thanks for the additional info. In your previous comment you mentioned your https_proxy value. Currently, pluto's proxy handling lives in bottlerocket/sources/api/pluto/src/proxy.rs, lines 66 to 68 (at 40290da).
I think that's the issue you're running into right now. Unfortunately I don't have a quick workaround for this at the moment. I'm gonna go ahead and update the title of this issue so we can track this more accurately.
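For context on the retitled issue, a proxy URL that embeds credentials (the case pluto reportedly fails to handle) looks roughly like this; host, port, and credentials are placeholders:

# Sketch only: an authenticated proxy URL; the user:password portion is what needs to be handled
[settings.network]
https-proxy = "http://proxyuser:proxypass@proxy.internal.example:3128"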
One workaround that comes to mind is to use bootstrap containers to basically replace what pluto does. Firstly, you need to skip
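For reference, registering a bootstrap container in Bottlerocket userdata looks roughly like the following; the container name and image URI are placeholders:

# Sketch: a bootstrap container that runs once at boot (placeholder name and image)
[settings.bootstrap-containers.set-hostname]
source = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/set-hostname:latest"
mode = "once"
essential = true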
@etungsten Hi, I added the hostname-override variable to the Bottlerocket userdata and then added the following to our bootstrap container run script:
Any idea on when the pluto code will be updated and released?
Hi @sonalita, we merged a potential fix for this issue and will include it in next week's Bottlerocket release.
I am pleased to report that the new Bottlerocket AMI is working as expected without the bootstrap workaround. Thank you to all at AWS and Bottlerocket who have helped us resolve this issue. It's taken a while but we finally got there! Thanks again.
We've been happily running on Kubernetes 1.25 for several months using Bottlerocket images, but we are now in the process of upgrading our clusters. However, after a successful upgrade of the EKS managed control plane to 1.26, new nodes are failing to join the nodegroup.
This seems similar to #3064, which my colleague Paul asked a question on a couple of days ago, but it is not quite the same problem.
Any advice on what the issue may be and how we can fix it please?
Image I'm using:
bottlerocket-aws-k8s-1.26-x86_64-v1.15.1-264e294c
What I expected to happen:
Instances launch and join the EKS nodegroup
What actually happened:
Instances launch but fail to join the nodegroup. The system log shows the following error:
[ OK ] Finished Bottlerocket userdata configuration system.
Starting User-specified setting generators...
Unmounting EFI System Partition Automount...
[ OK ] Unmounted EFI System Partition Automount.
[ 303.392308] sundog[1343]: Setting generator 'pluto private-dns-name' failed with exit code 1 - stderr: Timed out retrieving private DNS name from EC2: deadline has elapsed
[FAILED] Failed to start User-specified setting generators
We also see the following in the console log. We are not sure whether it is relevant to our issue, but it is different to what we see on a 1.25 control plane cluster:
[ OK ] Finished Generate network configuration.
[ 1.708363] migrator[1266]: Data store does not exist at given path, exiting (/var/lib/bottlerocket/datastore/current)
[ OK ] Finished Bottlerocket data store migrator.
How to reproduce the problem:
Our VPC has two CIDRs; a secondary CIDR is used for our pod network.