Lightsail credentials fetching from IMDS is slow #2678
2 seconds does seem like a long time to refresh credentials. Is the S3_REGION the same as the region the container is running in? Do you see the same 2-second delay when creating other service clients? We have an in-progress PR (#2642) that does credential refreshes asynchronously; this won't help with the startup time, but it should help with requests after that.
The instance and the S3 bucket are in the same region, as that is a requirement for the Lightsail "Resource access" permission type. I don't use any other service with a similar access type apart from Lightsail object storage (S3), so I can't create other service clients, or at least can't test them in a meaningful way. However, if you want to try to scope down the issue, I am happy to run any code via the Rails console; you may test STS in isolation if you know the code to run. Please note the timezone difference and optimize test cases for minimal round trips. I don't mind if it takes multiple rounds, but faster is better. Not sure if this is related: aws/aws-sdk-go#2972. Glad to hear about the improvement underway.
Perhaps STS credential fetching is being retried many times, resulting in the delay? When you construct your credentials (which type are you using?), you can pass in an STS client with wire tracing enabled to see the requests being made.
There are a couple of optimizations that we had in mind and a few that we've tackled. The async refresh will help with long-tail latency, like @alextwoods mentioned, but for the initial latency cost, we saw that initializing on boot helped as well.
One thing I do notice is that you aren't providing any credentials directly, and you're using the provider chain. Is it possible you're fetching credentials with IMDS, and if so, are requests failing?
OK, here are some updates:

http_wire_trace

I tried it. The actual command:

```ruby
s3_client = ::Aws::S3::Client.new(region: ENV['S3_REGION'], http_wire_trace: true)
test_object = ::Aws::S3::Object.new(ENV['S3_BUCKET'], 'testfile.bin', client: s3_client)
test_object.presigned_url(:get, expires_in: 5.seconds.to_i)
```

But I also identified that for the first presigning, the slowness is in the client creation step. So I can, potentially, warm up the singleton on application start.

IMDS

And when I tried fetching directly from IMDS (from within the container), I got the result instantaneously.
Since the first presigning slowness is in client creation, that likely means the time is being taken by the CredentialProviderChain (which is attempting to figure out which type of credentials are available). Sorry for the back and forth, but can you try:

```ruby
require 'benchmark'

client_explicit = nil
puts ::Benchmark.measure {
  client_explicit = Aws::S3::Client.new(
    region: ENV['S3_REGION'],
    credentials: Aws::InstanceProfileCredentials.new(http_debug_output: $stdout)
  )
}
puts client_explicit.config.credentials.inspect

client_default = nil
puts ::Benchmark.measure {
  client_default = Aws::S3::Client.new(region: ENV['S3_REGION'])
}
puts client_default.config.credentials.inspect
```
Also good to see confirmation that the issue is likely not related to setting the hops correctly (based on the curl).
I was suggesting wire trace on the STS client used for AssumeRoleCredentials. Of course it won't print anything for presigning.
It's not clear what credentials are being used here. The issue post says the slowness is with STS, but I don't see any STS-related logging that indicates it's being used.
Luckily, I am still up today; it is almost 2 AM here.

```ruby
puts ::Benchmark.measure {
  client_explicit = Aws::S3::Client.new(region: ENV['S3_REGION'], credentials: Aws::InstanceProfileCredentials.new(http_debug_output: $stdout))
}
```

prints, then

```ruby
puts ::Benchmark.measure {
  client_default = Aws::S3::Client.new(region: ENV['S3_REGION'])
}
```

prints, then

```ruby
puts client_default.config.credentials.inspect
```

prints.

This is why I like Ruby; it makes debugging way easier than its peers. I don't know exactly if it is STS; I just presumed. I did not use STS myself; I just thought it is called internally by the client. I hope the output from the first one provides you with insights.
The slowness looks to be from a read timeout on InstanceProfileCredentials. Are you using a proxy? Do you need to configure hop limits? https://docs.aws.amazon.com/cli/latest/reference/ec2/modify-instance-metadata-options.html
Just a quick update: I just read about the hop-limiting mechanism today; I used it without knowing there was such a capability. However, the EC2 instance bundled as a Lightsail instance is not owned by my account; it is under Amazon's, so I cannot interact with it using the command you linked. To summarize: I may be able to help with the situation if I understand a bit more.
I hacked it just for myself; basically, an iptables rule needs to be added on the Lightsail host:

```shell
# The metadata server address is always 169.254.169.254, and Lightsail
# instance addresses are always inside 172.26.0.0/16.
iptables -t mangle -A PREROUTING -s 169.254.169.254 -d 172.26.0.0/16 -m ttl --ttl-lt 2 -j TTL --ttl-inc 1
```

Dangerous rules like this are only possible because the company puts so much trust in me. My previous question still stands; I might be able to come up with a solution that works for the general public.
If you're using EC2 (even as part of Lightsail), you should be able to modify instance metadata from anywhere, not just from the instance itself. You have the EC2 instance id, and you have the AWS account that created the instance, right?
Unsure. You are experiencing read timeouts and it eventually succeeds, which makes me less suspicious of hop limits. Either the metadata service is not available (unlikely) or the packets are lost while routing through the NAT? Edit: It looks like it's falling back to IMDSv1, so maybe it is hop limits.
I technically am able to obtain the instance id, but the owner of the EC2 instance is Amazon. To add some context: the Lightsail network can peer with the default VPC, and when peered we can see the account id of the peered network. The account with id 057072166157 is owned by Amazon. Whenever I have to interact with a resource from that account, I have to go through the API provided by Amazon; IAM roles and credentials from my company won't work there except through the Lightsail API. From further reading, it seems that if IMDSv2 fails, the SDK falls back to IMDSv1, which does not implement such security measures. I haven't looked into how to implement it yet or what the interface would look like, but how about adding a user-supplied hint as an option somewhere to make the SDK try IMDSv1 first? From what I have read, IMDSv1 should be supported by Amazon indefinitely.
Ah! You're right that it's falling back to IMDSv1 and succeeding. I think the solution here is that you do need to increase the hop limit on the EC2 instance. The SDKs do not allow turning off IMDSv2, and that was deliberate. Even though the owner of the instance is Amazon, do you really not have access to that instance using an assumed role (provided by Lightsail) with permissions? Otherwise, I think this is an issue for Lightsail and not for the SDK. Could you maybe try AWS support in the console?
I don't know how to try an assumed role yet, but I'll be surprised if that works; the whole point of Lightsail is that we don't have to deal with EC2 hassles. It is highly unlikely that AWS would let us directly mess with the underlying EC2 beneath Lightsail, but I'll try to learn more about it. It is not that I don't want to communicate, but apart from service limit increases, AWS charges customers for any other kind of support ticket, including (obvious) bug reports. (They call it "premium support".) After the company spends a stupid load of money on bandwidth and some other products, they don't feel like spending a percentage on top of it, especially to discuss issues we can solve on our end but that benefit Amazon as a collective. Good work, as always.
I'm poking around in Lightsail to see what we can do. Let me see if I can make some noise internally. |
@midnight-wonderer I have a point of contact on the Lightsail team. They want to know if you're using your own Docker container or the Lightsail containers feature (which leverages Fargate)?
Thank you for the follow-up. |
I've talked to some members of the Lightsail team and also poked around a bit. As background, per my understanding, Lightsail EC2 instances (like all EC2 instances) have a linked IAM role (AmazonLightsailInstanceRole/instance-id).

Option 1) Lightsail will modify the metadata on your behalf if you can provide a customer ID, region, instance ID, and the value of the hop limit (depends on your container routing). The downside of this option is that it's manual, and new instances will not have your hop limit. If you want to do this, I can give you my email or some other way to securely provide me that information.

Option 2, preferred IMO) Migrate your containers to EC2 instances without Lightsail. You can export snapshots, but you will have to wire up storage and other expected features. The Lightsail engineer said that your use case may be more complex than Lightsail wants to offer. If you are running your own containers, you may as well run them on vanilla EC2 instances, where you have full control over them; you can then use the AWS CLI to modify your instance metadata.

Option 3) No architecture changes; try to fail fast and fall back to V1 credential fetching. It does seem like you expect your credentials from IMDS on the EC2 host. You can initialize an instance of `Aws::InstanceProfileCredentials` configured to give up on IMDSv2 quickly.
Thanks for the offer of Lightsail reconfiguration; I really do appreciate it. Still, I prefer regular instances to special treatment, so people who maintain the servers after me will have an easier time figuring things out. Some of our services are running in a regular VPC; we still need IPsec peering and such. However, we are heading in the opposite direction: we think EC2 is overly complicated for our workload and have started moving over to Lightsail. Lightsail is actually doing a better job than our self-managed VPC setup in some respects. Weighing the cost-benefit, I don't think migrating back is the way to go for us.

Option 3 seems to be the way to go for most people; if someone comes to this thread from Google, that is probably what you are looking for. I also want to add that you probably need to configure the timeouts accordingly. Otherwise, TTL mangling is relatively safe too; the reason is that the iptables condition only matches packets from the metadata address whose TTL is about to expire anyway.

I understand every party, including the Lightsail team deciding against more EC2 permissions.
Describe the bug
AWS STS takes seconds to respond to credential requests.
This is only my assumption; check below for the observed behavior.
Gem name
aws-sdk-s3 (1.111.2)
Version of Ruby, OS environment
ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
To Reproduce (observed behavior)
I am running a docker container on Lightsail which makes use of Lightsail object storage.
I granted access to the storage via "Resource access" by attaching Lightsail instances to the bucket.
I reused the S3 client properly:
On the very first S3 request it will take 2 seconds in my environment to presign an S3 URL.
This is also true for subsequent credential updates.
Here I capture the moment when X-Amz-Credential changes: the regular one takes only 8ms to presign and redirect.
The one with the STS credential update takes 2023ms.
Please note that both use the same S3 client instance to presign.
Is this a performance bug? 2 seconds seems too long for me, even if it happens occasionally.
Expected behavior
Faster response from aws-sdk-s3.